Pattern Recognition and Machine Learning

Transcript


Information Science and Statistics
Series Editors: M. Jordan, J. Kleinberg, B. Schölkopf

Information Science and Statistics

Akaike and Kitagawa: The Practice of Time Series Analysis.
Bishop: Pattern Recognition and Machine Learning.
Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and Expert Systems.
Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice.
Fine: Feedforward Neural Network Methodology.
Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality Improvement.
Jensen: Bayesian Networks and Decision Graphs.
Marchette: Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint.
Rubinstein and Kroese: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning.
Studený: Probabilistic Conditional Independence Structures.
Vapnik: The Nature of Statistical Learning Theory, Second Edition.
Wallace: Statistical and Inductive Inference by Minimum Message Length.

Christopher M. Bishop

Pattern Recognition and Machine Learning

Christopher M. Bishop F.R.Eng.
Assistant Director
Microsoft Research Ltd
Cambridge CB3 0FB, U.K.
[email protected]
http://research.microsoft.com/∼cmbishop

Series Editors:
Michael Jordan, Department of Computer Science and Department of Statistics, University of California, Berkeley, Berkeley, CA 94720, USA
Jon Kleinberg, Department of Computer Science, Cornell University, Ithaca, NY 14853, USA
Bernhard Schölkopf, Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Library of Congress Control Number: 2006922522

ISBN-10: 0-387-31073-8
ISBN-13: 978-0387-31073-2

Printed on acid-free paper.

© 2006 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in Singapore. (KYO)

springer.com

This book is dedicated to my family: Jenna, Mark, and Hugh

Total eclipse of the sun, Antalya, Turkey, 29 March 2006.

Preface

Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years. In particular, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic models. Also, the practical applicability of Bayesian methods has been greatly enhanced through the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation. Similarly, new models based on kernels have had significant impact on both algorithms and applications.

This new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first year PhD students, as well as researchers and practitioners, and assumes no previous knowledge of pattern recognition or machine learning concepts. Knowledge of multivariate calculus and basic linear algebra is required, and some familiarity with probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory.

Because this book has broad scope, it is impossible to provide a complete list of references, and in particular no attempt has been made to provide accurate historical attribution of ideas. Instead, the aim has been to give references that offer greater detail than is possible here and that hopefully provide entry points into what, in some cases, is a very extensive literature. For this reason, the references are often to more recent textbooks and review articles rather than to original sources.

The book is supported by a great deal of additional material, including lecture slides as well as the complete set of figures used in the book, and the reader is encouraged to visit the book web site for the latest information:

http://research.microsoft.com/∼cmbishop/PRML

Exercises

The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been carefully chosen to reinforce concepts explained in the text or to develop and generalize them in significant ways, and each is graded according to difficulty ranging from (⋆), which denotes a simple exercise taking a few minutes to complete, through to (⋆ ⋆ ⋆), which denotes a significantly more complex exercise.

It has been difficult to know to what extent these solutions should be made widely available. Those engaged in self study will find worked solutions very beneficial, whereas many course tutors request that solutions be available only via the publisher so that the exercises may be used in class. In order to try to meet these conflicting requirements, those exercises that help amplify key points in the text, or that fill in important details, have solutions that are available as a PDF file from the book web site. Such exercises are denoted by www. Solutions for the remaining exercises are available to course tutors by contacting the publisher (contact details are given on the book web site). Readers are strongly encouraged to work through the exercises unaided, and to turn to the solutions only as required.

Although this book focuses on concepts and principles, in a taught course the students should ideally have the opportunity to experiment with some of the key algorithms using appropriate data sets. A companion volume (Bishop and Nabney, 2008) will deal with practical aspects of pattern recognition and machine learning, and will be accompanied by Matlab software implementing most of the algorithms discussed in this book.

Acknowledgements

First of all I would like to express my sincere thanks to Markus Svensén who has provided immense help with preparation of figures and with the typesetting of the book in LaTeX. His assistance has been invaluable.

I am very grateful to Microsoft Research for providing a highly stimulating research environment and for giving me the freedom to write this book (the views and opinions expressed in this book, however, are my own and are therefore not necessarily the same as those of Microsoft or its affiliates).

Springer has provided excellent support throughout the final stages of preparation of this book, and I would like to thank my commissioning editor John Kimmel for his support and professionalism, as well as Joseph Piliero for his help in designing the cover and the text format and MaryAnn Brickner for her numerous contributions during the production phase. The inspiration for the cover design came from a discussion with Antonio Criminisi.

I also wish to thank Oxford University Press for permission to reproduce excerpts from an earlier textbook, Neural Networks for Pattern Recognition (Bishop, 1995a). The images of the Mark 1 perceptron and of Frank Rosenblatt are reproduced with the permission of Arvin Calspan Advanced Technology Center. I would also like to thank Asela Gunawardana for plotting the spectrogram in Figure 13.1, and Bernhard Schölkopf for permission to use his kernel PCA code to plot Figure 12.17.

Many people have helped by proofreading draft material and providing comments and suggestions, including Shivani Agarwal, Cédric Archambeau, Arik Azran, Andrew Blake, Hakan Cevikalp, Michael Fourman, Brendan Frey, Zoubin Ghahramani, Thore Graepel, Katherine Heller, Ralf Herbrich, Geoffrey Hinton, Adam Johansen, Matthew Johnson, Michael Jordan, Eva Kalyvianaki, Anitha Kannan, Julia Lasserre, David Liu, Tom Minka, Ian Nabney, Tonatiuh Pena, Yuan Qi, Sam Roweis, Balaji Sanjiya, Toby Sharp, Ana Costa e Silva, David Spiegelhalter, Jay Stokes, Tara Symeonides, Martin Szummer, Marshall Tappen, Ilkay Ulusoy, Chris Williams, John Winn, and Andrew Zisserman.

Finally, I would like to thank my wife Jenna who has been hugely supportive throughout the several years it has taken to write this book.

Chris Bishop
Cambridge
February 2006


Mathematical notation

I have tried to keep the mathematical content of the book to the minimum necessary to achieve a proper understanding of the field. However, this minimum level is nonzero, and it should be emphasized that a good grasp of calculus, linear algebra, and probability theory is essential for a clear understanding of modern pattern recognition and machine learning techniques. Nevertheless, the emphasis in this book is on conveying the underlying concepts rather than on mathematical rigour.

I have tried to use a consistent notation throughout the book, although at times this means departing from some of the conventions used in the corresponding research literature. Vectors are denoted by lower case bold Roman letters such as $\mathbf{x}$, and all vectors are assumed to be column vectors. A superscript $\mathrm{T}$ denotes the transpose of a matrix or vector, so that $\mathbf{x}^{\mathrm{T}}$ will be a row vector. Uppercase bold roman letters, such as $\mathbf{M}$, denote matrices. The notation $(w_1, \ldots, w_M)$ denotes a row vector with $M$ elements, while the corresponding column vector is written as $\mathbf{w} = (w_1, \ldots, w_M)^{\mathrm{T}}$.

The notation $[a, b]$ is used to denote the closed interval from $a$ to $b$, that is the interval including the values $a$ and $b$ themselves, while $(a, b)$ denotes the corresponding open interval, that is the interval excluding $a$ and $b$. Similarly, $[a, b)$ denotes an interval that includes $a$ but excludes $b$. For the most part, however, there will be little need to dwell on such refinements as whether the end points of an interval are included or not.

The $M \times M$ identity matrix (also known as the unit matrix) is denoted $\mathbf{I}_M$, which will be abbreviated to $\mathbf{I}$ where there is no ambiguity about its dimensionality. It has elements $I_{ij}$ that equal 1 if $i = j$ and 0 if $i \neq j$.

A functional is denoted $f[y]$ where $y(x)$ is some function. The concept of a functional is discussed in Appendix D.

The notation $g(x) = O(f(x))$ denotes that $|f(x)/g(x)|$ is bounded as $x \to \infty$. For instance, if $g(x) = 3x^2 + 2$, then $g(x) = O(x^2)$.

The expectation of a function $f(x, y)$ with respect to a random variable $x$ is denoted by $\mathbb{E}_x[f(x, y)]$. In situations where there is no ambiguity as to which variable is being averaged over, this will be simplified by omitting the suffix, for instance

$\mathbb{E}[x]$. If the distribution of $x$ is conditioned on another variable $z$, then the corresponding conditional expectation will be written $\mathbb{E}_x[f(x)\,|\,z]$. Similarly, the variance is denoted $\mathrm{var}[f(x)]$, and for vector variables the covariance is written $\mathrm{cov}[\mathbf{x}, \mathbf{y}]$. We shall also use $\mathrm{cov}[\mathbf{x}]$ as a shorthand notation for $\mathrm{cov}[\mathbf{x}, \mathbf{x}]$. The concepts of expectations and covariances are introduced in Section 1.2.2.

If we have $N$ values $\mathbf{x}_1, \ldots, \mathbf{x}_N$ of a $D$-dimensional vector $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$, we can combine the observations into a data matrix $\mathbf{X}$ in which the $n$th row of $\mathbf{X}$ corresponds to the row vector $\mathbf{x}_n^{\mathrm{T}}$. Thus the $n, i$ element of $\mathbf{X}$ corresponds to the $i$th element of the $n$th observation $\mathbf{x}_n$. For the case of one-dimensional variables we shall denote such a matrix by $\mathsf{x}$, which is a column vector whose $n$th element is $x_n$. Note that $\mathsf{x}$ (which has dimensionality $N$) uses a different typeface to distinguish it from $\mathbf{x}$ (which has dimensionality $D$).

Contents

Preface
Mathematical notation
Contents

1 Introduction
  1.1 Example: Polynomial Curve Fitting
  1.2 Probability Theory
    1.2.1 Probability densities
    1.2.2 Expectations and covariances
    1.2.3 Bayesian probabilities
    1.2.4 The Gaussian distribution
    1.2.5 Curve fitting re-visited
    1.2.6 Bayesian curve fitting
  1.3 Model Selection
  1.4 The Curse of Dimensionality
  1.5 Decision Theory
    1.5.1 Minimizing the misclassification rate
    1.5.2 Minimizing the expected loss
    1.5.3 The reject option
    1.5.4 Inference and decision
    1.5.5 Loss functions for regression
  1.6 Information Theory
    1.6.1 Relative entropy and mutual information
  Exercises

2 Probability Distributions
  2.1 Binary Variables
    2.1.1 The beta distribution
  2.2 Multinomial Variables
    2.2.1 The Dirichlet distribution
  2.3 The Gaussian Distribution
    2.3.1 Conditional Gaussian distributions
    2.3.2 Marginal Gaussian distributions
    2.3.3 Bayes' theorem for Gaussian variables
    2.3.4 Maximum likelihood for the Gaussian
    2.3.5 Sequential estimation
    2.3.6 Bayesian inference for the Gaussian
    2.3.7 Student's t-distribution
    2.3.8 Periodic variables
    2.3.9 Mixtures of Gaussians
  2.4 The Exponential Family
    2.4.1 Maximum likelihood and sufficient statistics
    2.4.2 Conjugate priors
    2.4.3 Noninformative priors
  2.5 Nonparametric Methods
    2.5.1 Kernel density estimators
    2.5.2 Nearest-neighbour methods
  Exercises

3 Linear Models for Regression
  3.1 Linear Basis Function Models
    3.1.1 Maximum likelihood and least squares
    3.1.2 Geometry of least squares
    3.1.3 Sequential learning
    3.1.4 Regularized least squares
    3.1.5 Multiple outputs
  3.2 The Bias-Variance Decomposition
  3.3 Bayesian Linear Regression
    3.3.1 Parameter distribution
    3.3.2 Predictive distribution
    3.3.3 Equivalent kernel
  3.4 Bayesian Model Comparison
  3.5 The Evidence Approximation
    3.5.1 Evaluation of the evidence function
    3.5.2 Maximizing the evidence function
    3.5.3 Effective number of parameters
  3.6 Limitations of Fixed Basis Functions
  Exercises

4 Linear Models for Classification
  4.1 Discriminant Functions
    4.1.1 Two classes
    4.1.2 Multiple classes
    4.1.3 Least squares for classification
    4.1.4 Fisher's linear discriminant
    4.1.5 Relation to least squares
    4.1.6 Fisher's discriminant for multiple classes
    4.1.7 The perceptron algorithm
  4.2 Probabilistic Generative Models
    4.2.1 Continuous inputs
    4.2.2 Maximum likelihood solution
    4.2.3 Discrete features
    4.2.4 Exponential family
  4.3 Probabilistic Discriminative Models
    4.3.1 Fixed basis functions
    4.3.2 Logistic regression
    4.3.3 Iterative reweighted least squares
    4.3.4 Multiclass logistic regression
    4.3.5 Probit regression
    4.3.6 Canonical link functions
  4.4 The Laplace Approximation
    4.4.1 Model comparison and BIC
  4.5 Bayesian Logistic Regression
    4.5.1 Laplace approximation
    4.5.2 Predictive distribution
  Exercises

5 Neural Networks
  5.1 Feed-forward Network Functions
    5.1.1 Weight-space symmetries
  5.2 Network Training
    5.2.1 Parameter optimization
    5.2.2 Local quadratic approximation
    5.2.3 Use of gradient information
    5.2.4 Gradient descent optimization
  5.3 Error Backpropagation
    5.3.1 Evaluation of error-function derivatives
    5.3.2 A simple example
    5.3.3 Efficiency of backpropagation
    5.3.4 The Jacobian matrix
  5.4 The Hessian Matrix
    5.4.1 Diagonal approximation
    5.4.2 Outer product approximation
    5.4.3 Inverse Hessian
    5.4.4 Finite differences
    5.4.5 Exact evaluation of the Hessian
    5.4.6 Fast multiplication by the Hessian
  5.5 Regularization in Neural Networks
    5.5.1 Consistent Gaussian priors
    5.5.2 Early stopping
    5.5.3 Invariances
    5.5.4 Tangent propagation
    5.5.5 Training with transformed data
    5.5.6 Convolutional networks
    5.5.7 Soft weight sharing
  5.6 Mixture Density Networks
  5.7 Bayesian Neural Networks
    5.7.1 Posterior parameter distribution
    5.7.2 Hyperparameter optimization
    5.7.3 Bayesian neural networks for classification
  Exercises

6 Kernel Methods
  6.1 Dual Representations
  6.2 Constructing Kernels
  6.3 Radial Basis Function Networks
    6.3.1 Nadaraya-Watson model
  6.4 Gaussian Processes
    6.4.1 Linear regression revisited
    6.4.2 Gaussian processes for regression
    6.4.3 Learning the hyperparameters
    6.4.4 Automatic relevance determination
    6.4.5 Gaussian processes for classification
    6.4.6 Laplace approximation
    6.4.7 Connection to neural networks
  Exercises

7 Sparse Kernel Machines
  7.1 Maximum Margin Classifiers
    7.1.1 Overlapping class distributions
    7.1.2 Relation to logistic regression
    7.1.3 Multiclass SVMs
    7.1.4 SVMs for regression
    7.1.5 Computational learning theory
  7.2 Relevance Vector Machines
    7.2.1 RVM for regression
    7.2.2 Analysis of sparsity
    7.2.3 RVM for classification
  Exercises

8 Graphical Models
  8.1 Bayesian Networks
    8.1.1 Example: Polynomial regression
    8.1.2 Generative models
    8.1.3 Discrete variables
    8.1.4 Linear-Gaussian models
  8.2 Conditional Independence
    8.2.1 Three example graphs
    8.2.2 D-separation
  8.3 Markov Random Fields
    8.3.1 Conditional independence properties
    8.3.2 Factorization properties
    8.3.3 Illustration: Image de-noising
    8.3.4 Relation to directed graphs
  8.4 Inference in Graphical Models
    8.4.1 Inference on a chain
    8.4.2 Trees
    8.4.3 Factor graphs
    8.4.4 The sum-product algorithm
    8.4.5 The max-sum algorithm
    8.4.6 Exact inference in general graphs
    8.4.7 Loopy belief propagation
    8.4.8 Learning the graph structure
  Exercises

9 Mixture Models and EM
  9.1 K-means Clustering
    9.1.1 Image segmentation and compression
  9.2 Mixtures of Gaussians
    9.2.1 Maximum likelihood
    9.2.2 EM for Gaussian mixtures
  9.3 An Alternative View of EM
    9.3.1 Gaussian mixtures revisited
    9.3.2 Relation to K-means
    9.3.3 Mixtures of Bernoulli distributions
    9.3.4 EM for Bayesian linear regression
  9.4 The EM Algorithm in General
  Exercises

10 Approximate Inference
  10.1 Variational Inference
    10.1.1 Factorized distributions
    10.1.2 Properties of factorized approximations
    10.1.3 Example: The univariate Gaussian
    10.1.4 Model comparison
  10.2 Illustration: Variational Mixture of Gaussians
    10.2.1 Variational distribution
    10.2.2 Variational lower bound
    10.2.3 Predictive density
    10.2.4 Determining the number of components
    10.2.5 Induced factorizations
  10.3 Variational Linear Regression
    10.3.1 Variational distribution
    10.3.2 Predictive distribution
    10.3.3 Lower bound
  10.4 Exponential Family Distributions
    10.4.1 Variational message passing
  10.5 Local Variational Methods
  10.6 Variational Logistic Regression
    10.6.1 Variational posterior distribution
    10.6.2 Optimizing the variational parameters
    10.6.3 Inference of hyperparameters
  10.7 Expectation Propagation
    10.7.1 Example: The clutter problem
    10.7.2 Expectation propagation on graphs
  Exercises

11 Sampling Methods
  11.1 Basic Sampling Algorithms
    11.1.1 Standard distributions
    11.1.2 Rejection sampling
    11.1.3 Adaptive rejection sampling
    11.1.4 Importance sampling
    11.1.5 Sampling-importance-resampling
    11.1.6 Sampling and the EM algorithm
  11.2 Markov Chain Monte Carlo
    11.2.1 Markov chains
    11.2.2 The Metropolis-Hastings algorithm
  11.3 Gibbs Sampling
  11.4 Slice Sampling
  11.5 The Hybrid Monte Carlo Algorithm
    11.5.1 Dynamical systems
    11.5.2 Hybrid Monte Carlo
  11.6 Estimating the Partition Function
  Exercises

12 Continuous Latent Variables
  12.1 Principal Component Analysis
    12.1.1 Maximum variance formulation
    12.1.2 Minimum-error formulation
    12.1.3 Applications of PCA
    12.1.4 PCA for high-dimensional data
  12.2 Probabilistic PCA
    12.2.1 Maximum likelihood PCA
    12.2.2 EM algorithm for PCA
    12.2.3 Bayesian PCA
    12.2.4 Factor analysis
  12.3 Kernel PCA
  12.4 Nonlinear Latent Variable Models
    12.4.1 Independent component analysis
    12.4.2 Autoassociative neural networks
    12.4.3 Modelling nonlinear manifolds
  Exercises

13 Sequential Data
  13.1 Markov Models
  13.2 Hidden Markov Models
    13.2.1 Maximum likelihood for the HMM
    13.2.2 The forward-backward algorithm
    13.2.3 The sum-product algorithm for the HMM
    13.2.4 Scaling factors
    13.2.5 The Viterbi algorithm
    13.2.6 Extensions of the hidden Markov model
  13.3 Linear Dynamical Systems
    13.3.1 Inference in LDS
    13.3.2 Learning in LDS
    13.3.3 Extensions of LDS
    13.3.4 Particle filters
  Exercises

14 Combining Models
  14.1 Bayesian Model Averaging
  14.2 Committees
  14.3 Boosting
    14.3.1 Minimizing exponential error
    14.3.2 Error functions for boosting
  14.4 Tree-based Models
  14.5 Conditional Mixture Models
    14.5.1 Mixtures of linear regression models
    14.5.2 Mixtures of logistic models
    14.5.3 Mixtures of experts
  Exercises

Appendix A Data Sets
Appendix B Probability Distributions
Appendix C Properties of Matrices
Appendix D Calculus of Variations
Appendix E Lagrange Multipliers
References
Index

1 Introduction

The problem of searching for patterns in data is a fundamental one and has a long and successful history. For instance, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics. Similarly, the discovery of regularities in atomic spectra played a key role in the development and verification of quantum physics in the early twentieth century. The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.

Consider the example of recognizing handwritten digits, illustrated in Figure 1.1. Each digit corresponds to a 28 × 28 pixel image and so can be represented by a vector $\mathbf{x}$ comprising 784 real numbers. The goal is to build a machine that will take such a vector $\mathbf{x}$ as input and that will produce the identity of the digit $0, \ldots, 9$ as the output. This is a nontrivial problem due to the wide variability of handwriting.

Figure 1.1 Examples of hand-written digits taken from US zip codes.

It could be tackled using handcrafted rules or heuristics for distinguishing the digits based on the shapes of the strokes, but in practice such an approach leads to a proliferation of rules and of exceptions to the rules and so on, and invariably gives poor results.

Far better results can be obtained by adopting a machine learning approach in which a large set of $N$ digits $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ called a training set is used to tune the parameters of an adaptive model. The categories of the digits in the training set are known in advance, typically by inspecting them individually and hand-labelling them. We can express the category of a digit using a target vector $\mathbf{t}$, which represents the identity of the corresponding digit. Suitable techniques for representing categories in terms of vectors will be discussed later. Note that there is one such target vector $\mathbf{t}$ for each digit image $\mathbf{x}$.

The result of running the machine learning algorithm can be expressed as a function $y(\mathbf{x})$ which takes a new digit image $\mathbf{x}$ as input and that generates an output vector $\mathbf{y}$, encoded in the same way as the target vectors. The precise form of the function $y(\mathbf{x})$ is determined during the training phase, also known as the learning phase, on the basis of the training data. Once the model is trained it can then determine the identity of new digit images, which are said to comprise a test set. The ability to categorize correctly new examples that differ from those used for training is known as generalization. In practical applications, the variability of the input vectors will be such that the training data can comprise only a tiny fraction of all possible input vectors, and so generalization is a central goal in pattern recognition.

For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve. For instance, in the digit recognition problem, the images of the digits are typically translated and scaled so that each digit is contained within a box of a fixed size. This greatly reduces the variability within each digit class, because the location and scale of all the digits are now the same, which makes it much easier for a subsequent pattern recognition algorithm to distinguish between the different classes. This pre-processing stage is sometimes also called feature extraction. Note that new test data must be pre-processed using the same steps as the training data.

Pre-processing might also be performed in order to speed up computation. For example, if the goal is real-time face detection in a high-resolution video stream, the computer must handle huge numbers of pixels per second, and presenting these directly to a complex pattern recognition algorithm may be computationally infeasible.

Instead, the aim is to find useful features that are fast to compute, and yet that also preserve useful discriminatory information enabling faces to be distinguished from non-faces. These features are then used as the inputs to the pattern recognition algorithm. For instance, the average value of the image intensity over a rectangular subregion can be evaluated extremely efficiently (Viola and Jones, 2004), and a set of such features can prove very effective in fast face detection. Because the number of such features is smaller than the number of pixels, this kind of pre-processing represents a form of dimensionality reduction. Care must be taken during pre-processing because often information is discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer.

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations of reactants, the temperature, and the pressure.

In other pattern recognition problems, the training data consists of a set of input vectors $\mathbf{x}$ without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

Finally, the technique of reinforcement learning (Sutton and Barto, 1998) is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. Here the learning algorithm is not given examples of optimal outputs, in contrast to supervised learning, but must instead discover them by a process of trial and error. Typically there is a sequence of states and actions in which the learning algorithm is interacting with its environment. In many cases, the current action not only affects the immediate reward but also has an impact on the reward at all subsequent time steps. For example, by using appropriate reinforcement learning techniques a neural network can learn to play the game of backgammon to a high standard (Tesauro, 1994). Here the network must learn to take a board position as input, along with the result of a dice throw, and produce a strong move as the output. This is done by having the network play against a copy of itself for perhaps a million games. A major challenge is that a game of backgammon can involve dozens of moves, and yet it is only at the end of the game that the reward, in the form of victory, is achieved. The reward must then be attributed appropriately to all of the moves that led to it, even though some moves will have been good ones and others less so. This is an example of a credit assignment problem.
A general feature of reinforcement learning is the trade-off between exploration, in which the system tries out new kinds of actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward. Too strong a focus on either exploration or exploitation will yield poor results. Reinforcement learning continues to be an active area of machine learning research. However, a detailed treatment lies beyond the scope of this book.

Figure 1.2 Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable $x$ along with the corresponding target variable $t$. The green curve shows the function $\sin(2\pi x)$ used to generate the data. Our goal is to predict the value of $t$ for some new value of $x$, without knowledge of the green curve.

Although each of these tasks needs its own tools and techniques, many of the key ideas that underpin them are common to all such problems. One of the main goals of this chapter is to introduce, in a relatively informal way, several of the most important of these concepts and to illustrate them using simple examples. Later in the book we shall see these same ideas re-emerge in the context of more sophisticated models that are applicable to real-world pattern recognition applications. This chapter also provides a self-contained introduction to three important tools that will be used throughout the book, namely probability theory, decision theory, and information theory. Although these might sound like daunting topics, they are in fact straightforward, and a clear understanding of them is essential if machine learning techniques are to be used to best effect in practical applications.

1.1. Example: Polynomial Curve Fitting

We begin by introducing a simple regression problem, which we shall use as a running example throughout this chapter to motivate a number of key concepts. Suppose we observe a real-valued input variable $x$ and we wish to use this observation to predict the value of a real-valued target variable $t$. For the present purposes, it is instructive to consider an artificial example using synthetically generated data because we then know the precise process that generated the data for comparison against any learned model. The data for this example is generated from the function $\sin(2\pi x)$ with random noise included in the target values, as described in detail in Appendix A.

Now suppose that we are given a training set comprising $N$ observations of $x$, written $\mathsf{x} \equiv (x_1, \ldots, x_N)^{\mathrm{T}}$, together with corresponding observations of the values of $t$, denoted $\mathsf{t} \equiv (t_1, \ldots, t_N)^{\mathrm{T}}$. Figure 1.2 shows a plot of a training set comprising $N = 10$ data points.

The input data set $\mathsf{x}$ in Figure 1.2 was generated by choosing values of $x_n$, for $n = 1, \ldots, N$, spaced uniformly in range $[0, 1]$, and the target data set $\mathsf{t}$ was obtained by first computing the corresponding values of the function $\sin(2\pi x)$ and then adding a small level of random noise having a Gaussian distribution (the Gaussian distribution is discussed in Section 1.2.4) to each such point in order to obtain the corresponding value $t_n$. By generating data in this way, we are capturing a property of many real data sets, namely that they possess an underlying regularity, which we wish to learn, but that individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved.

Our goal is to exploit this training set in order to make predictions of the value $\hat{t}$ of the target variable for some new value $\hat{x}$ of the input variable. As we shall see later, this involves implicitly trying to discover the underlying function $\sin(2\pi x)$. This is intrinsically a difficult problem as we have to generalize from a finite data set. Furthermore the observed data are corrupted with noise, and so for a given $\hat{x}$ there is uncertainty as to the appropriate value for $\hat{t}$. Probability theory, discussed in Section 1.2, provides a framework for expressing such uncertainty in a precise and quantitative manner, and decision theory, discussed in Section 1.5, allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria.

For the moment, however, we shall proceed rather informally and consider a simple approach based on curve fitting. In particular, we shall fit the data using a polynomial function of the form

$$ y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j \tag{1.1} $$

where $M$ is the order of the polynomial, and $x^j$ denotes $x$ raised to the power of $j$. The polynomial coefficients $w_0, \ldots, w_M$ are collectively denoted by the vector $\mathbf{w}$. Note that, although the polynomial function $y(x, \mathbf{w})$ is a nonlinear function of $x$, it is a linear function of the coefficients $\mathbf{w}$. Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are called linear models and will be discussed extensively in Chapters 3 and 4.

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function $y(x, \mathbf{w})$, for any given value of $\mathbf{w}$, and the training set data points. One simple choice of error function, which is widely used, is given by the sum of the squares of the errors between the predictions $y(x_n, \mathbf{w})$ for each data point $x_n$ and the corresponding target values $t_n$, so that we minimize

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 \tag{1.2} $$

where the factor of $1/2$ is included for later convenience. We shall discuss the motivation for this choice of error function later in this chapter. For the moment we simply note that it is a nonnegative quantity that would be zero if, and only if, the function $y(x, \mathbf{w})$ were to pass exactly through each training data point.
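To make this concrete, here is a minimal Python sketch (using NumPy) that generates a training set in the manner just described and evaluates the polynomial (1.1) and the error function (1.2). The noise level of 0.3 and the random seed are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set in the style of Figure 1.2: N = 10 inputs spaced uniformly
# in [0, 1], with targets given by sin(2*pi*x) plus Gaussian noise.
# The noise standard deviation of 0.3 is an assumed, illustrative value.
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=N)

def y(x, w):
    """The polynomial y(x, w) = sum_j w_j * x**j of equation (1.1)."""
    return np.polynomial.polynomial.polyval(x, w)

def error(w, x, t):
    """The sum-of-squares error E(w) of equation (1.2)."""
    return 0.5 * np.sum((y(x, w) - t) ** 2)
```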

Figure 1.3 The error function (1.2) corresponds to (one half of) the sum of the squares of the displacements (shown by the vertical green bars) of each data point from the function $y(x, \mathbf{w})$.

The geometrical interpretation of the sum-of-squares error function is illustrated in Figure 1.3. We can solve the curve fitting problem by choosing the value of $\mathbf{w}$ for which $E(\mathbf{w})$ is as small as possible. Because the error function is a quadratic function of the coefficients $\mathbf{w}$, its derivatives with respect to the coefficients will be linear in the elements of $\mathbf{w}$, and so the minimization of the error function has a unique solution, denoted by $\mathbf{w}^\star$, which can be found in closed form (Exercise 1.1). The resulting polynomial is given by the function $y(x, \mathbf{w}^\star)$.

There remains the problem of choosing the order $M$ of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection. In Figure 1.4, we show four examples of the results of fitting polynomials having orders $M = 0$, $1$, $3$, and $9$ to the data set shown in Figure 1.2.

We notice that the constant ($M = 0$) and first order ($M = 1$) polynomials give rather poor fits to the data and consequently rather poor representations of the function $\sin(2\pi x)$. The third order ($M = 3$) polynomial seems to give the best fit to the function $\sin(2\pi x)$ of the examples shown in Figure 1.4. When we go to a much higher order polynomial ($M = 9$), we obtain an excellent fit to the training data. In fact, the polynomial passes exactly through each data point and $E(\mathbf{w}^\star) = 0$. However, the fitted curve oscillates wildly and gives a very poor representation of the function $\sin(2\pi x)$. This latter behaviour is known as over-fitting.

As we have noted earlier, the goal is to achieve good generalization by making accurate predictions for new data. We can obtain some quantitative insight into the dependence of the generalization performance on $M$ by considering a separate test set comprising 100 data points generated using exactly the same procedure used to generate the training set points but with new choices for the random noise values included in the target values. For each choice of $M$, we can then evaluate the residual value of $E(\mathbf{w}^\star)$ given by (1.2) for the training data, and we can also evaluate $E(\mathbf{w}^\star)$ for the test data set.
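The closed-form fit can be sketched as follows, reusing the data and helpers from the previous snippet. Here np.linalg.lstsq stands in for solving the normal equations explicitly; that is an implementation choice for numerical stability, not something prescribed by the text.

```python
def fit_polynomial(x, t, M):
    """Minimize E(w) exactly for a degree-M polynomial.

    The design matrix A has elements A[n, j] = x_n ** j, so minimizing
    the quadratic error (1.2) is a linear least-squares problem with a
    unique solution w*.
    """
    A = np.vander(x, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w_star

w_star = fit_polynomial(x, t, M=3)
print(error(w_star, x, t))  # E(w*) on the training data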

Figure 1.4 Plots of polynomials having various orders ($M = 0$, $M = 1$, $M = 3$, and $M = 9$), shown as red curves, fitted to the data set shown in Figure 1.2.

It is sometimes more convenient to use the root-mean-square (RMS) error defined by

$$ E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star)/N} \tag{1.3} $$

in which the division by $N$ allows us to compare different sizes of data sets on an equal footing, and the square root ensures that $E_{\mathrm{RMS}}$ is measured on the same scale (and in the same units) as the target variable $t$. Graphs of the training and test set RMS errors are shown, for various values of $M$, in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of $t$ for new data observations of $x$. We note from Figure 1.5 that small values of $M$ give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inflexible and are incapable of capturing the oscillations in the function $\sin(2\pi x)$. Values of $M$ in the range $3 \leqslant M \leqslant 8$ give small values for the test set error, and these also give reasonable representations of the generating function $\sin(2\pi x)$, as can be seen, for the case of $M = 3$, from Figure 1.4.
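Extending the same sketch, the loop below evaluates training and test RMS errors for $M = 0, \ldots, 9$ in the spirit of Figure 1.5. The test set is generated by the same assumed procedure as before (again with an illustrative noise level of 0.3).

```python
def rms_error(w, x, t):
    """Root-mean-square error E_RMS = sqrt(2 * E(w) / N), equation (1.3)."""
    return np.sqrt(2.0 * error(w, x, t) / len(x))

# An independent test set of 100 points generated by the same procedure
# as the training set, but with fresh noise values.
x_test = rng.uniform(0.0, 1.0, size=100)
t_test = np.sin(2.0 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

for M in range(10):
    w_star = fit_polynomial(x, t, M)
    print(M, rms_error(w_star, x, t), rms_error(w_star, x_test, t_test))
```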

Figure 1.5 Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent test set for various values of $M$.

For $M = 9$, the training set error goes to zero, as we might expect because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients $w_0, \ldots, w_9$, and so can be tuned exactly to the 10 data points in the training set. However, the test set error has become very large and, as we saw in Figure 1.4, the corresponding function $y(x, \mathbf{w}^\star)$ exhibits wild oscillations.

This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases. The $M = 9$ polynomial is therefore capable of generating results at least as good as the $M = 3$ polynomial. Furthermore, we might suppose that the best predictor of new data would be the function $\sin(2\pi x)$ from which the data was generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function $\sin(2\pi x)$ contains terms of all orders, so we might expect that results should improve monotonically as we increase $M$.

We can gain some insight into the problem by examining the values of the coefficients $\mathbf{w}^\star$ obtained from polynomials of various order, as shown in Table 1.1.

Table 1.1 Table of the coefficients $\mathbf{w}^\star$ for polynomials of various order. Observe how the typical magnitude of the coefficients increases dramatically as the order of the polynomial increases.

           M = 0     M = 1      M = 6          M = 9
  w0*       0.19      0.82       0.31           0.35
  w1*                -1.27       7.99         232.37
  w2*                          -25.43       -5321.83
  w3*                           17.37       48568.31
  w4*                                     -231639.30
  w5*                                      640042.26
  w6*                                    -1061800.52
  w7*                                     1042400.18
  w8*                                     -557682.99
  w9*                                      125201.43

We see that, as $M$ increases, the magnitude of the coefficients typically gets larger. In particular for the $M = 9$ polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the corresponding polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscillations observed in Figure 1.4. Intuitively, what is happening is that the more flexible polynomials with larger values of $M$ are becoming increasingly tuned to the random noise on the target values.

Figure 1.6 Plots of the solutions obtained by minimizing the sum-of-squares error function using the $M = 9$ polynomial for $N = 15$ data points (left plot) and $N = 100$ data points (right plot). We see that increasing the size of the data set reduces the over-fitting problem.

It is also interesting to examine the behaviour of a given model as the size of the data set is varied, as shown in Figure 1.6. We see that, for a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases. Another way to say this is that the larger the data set, the more complex (in other words more flexible) the model that we can afford to fit to the data. One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. However, as we shall see in Chapter 3, the number of parameters is not necessarily the most appropriate measure of model complexity.

Also, there is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved. We shall see that the least squares approach to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the over-fitting problem can be avoided (Section 3.4). We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.

For the moment, however, it is instructive to continue with the current approach and to consider how in practice we can apply it to data sets of limited size where we may wish to use relatively complex and flexible models.

Figure 1.7 Plots of $M = 9$ polynomials fitted to the data set shown in Figure 1.2 using the regularized error function (1.4) for two values of the regularization parameter $\lambda$ corresponding to $\ln \lambda = -18$ and $\ln \lambda = 0$. The case of no regularizer, i.e., $\lambda = 0$, corresponding to $\ln \lambda = -\infty$, is shown at the bottom right of Figure 1.4.

One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

$$ \widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2 \tag{1.4} $$

where $\|\mathbf{w}\|^2 \equiv \mathbf{w}^{\mathrm{T}}\mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient $w_0$ is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient (we shall discuss this topic in more detail in Section 5.5.1). Again, the error function in (1.4) can be minimized exactly in closed form (Exercise 1.2). Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970). In the context of neural networks, this approach is known as weight decay.

Figure 1.7 shows the results of fitting the polynomial of order $M = 9$ to the same data set as before but now using the regularized error function given by (1.4). We see that, for a value of $\ln \lambda = -18$, the over-fitting has been suppressed and we now obtain a much closer representation of the underlying function $\sin(2\pi x)$. If, however, we use too large a value for $\lambda$ then we again obtain a poor fit, as shown in Figure 1.7 for $\ln \lambda = 0$. The corresponding coefficients from the fitted polynomials are given in Table 1.2, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
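The closed-form minimizer of (1.4) can be sketched in the same style, reusing the data from the earlier snippets. For simplicity the sketch penalizes every coefficient, including $w_0$, exactly as equation (1.4) is written.

```python
def fit_ridge(x, t, M, lam):
    """Minimize the regularized error (1.4) in closed form.

    Setting the gradient of (1.4) to zero gives the linear system
    (lam * I + A^T A) w = A^T t, where A[n, j] = x_n ** j.
    """
    A = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(lam * np.eye(M + 1) + A.T @ A, A.T @ t)

w_reg = fit_ridge(x, t, M=9, lam=np.exp(-18.0))  # ln(lambda) = -18
```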

Table 1.2 Table of the coefficients $\mathbf{w}^\star$ for $M = 9$ polynomials with various values for the regularization parameter $\lambda$. Note that $\ln \lambda = -\infty$ corresponds to a model with no regularization, i.e., to the graph at the bottom right in Figure 1.4. We see that, as the value of $\lambda$ increases, the typical magnitude of the coefficients gets smaller.

         ln λ = −∞    ln λ = −18    ln λ = 0
  w0*          0.35          0.35        0.13
  w1*        232.37          4.74       -0.05
  w2*      -5321.83         -0.77       -0.06
  w3*      48568.31        -31.97       -0.05
  w4*    -231639.30         -3.89       -0.03
  w5*     640042.26         55.28       -0.02
  w6*   -1061800.52         41.32       -0.01
  w7*    1042400.18        -45.95       -0.00
  w8*    -557682.99        -91.53        0.00
  w9*     125201.43         72.68        0.01

The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error (1.3) for both training and test sets against $\ln \lambda$, as shown in Figure 1.8. We see that in effect $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting.

Figure 1.8 Graph of the root-mean-square error (1.3) versus $\ln \lambda$ for the $M = 9$ polynomial, evaluated on the training set and the test set.

The issue of model complexity is an important one and will be discussed at length in Section 1.3. Here we simply note that, if we were trying to solve a practical application using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity. The results above suggest a simple way of achieving this, namely by taking the available data and partitioning it into a training set, used to determine the coefficients $\mathbf{w}$, and a separate validation set, also called a hold-out set, used to optimize the model complexity (either $M$ or $\lambda$). In many cases, however, this will prove to be too wasteful of valuable training data, and we have to seek more sophisticated approaches (Section 1.3).
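A minimal sketch of this hold-out idea, reusing the helpers above: the validation-set size and the grid of candidate $\lambda$ values are illustrative assumptions, not choices made in the text.

```python
# Hold-out model selection: keep the value of lambda whose fitted model
# gives the lowest RMS error on a separate validation set.
x_val = rng.uniform(0.0, 1.0, size=20)
t_val = np.sin(2.0 * np.pi * x_val) + rng.normal(scale=0.3, size=20)

candidates = [np.exp(v) for v in range(-35, 1, 5)]
best_lam = min(candidates,
               key=lambda lam: rms_error(fit_ridge(x, t, 9, lam), x_val, t_val))
```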

So far our discussion of polynomial curve fitting has appealed largely to intuition. We now seek a more principled approach to solving problems in pattern recognition by turning to a discussion of probability theory. As well as providing the foundation for nearly all of the subsequent developments in this book, it will also give us some important insights into the concepts we have introduced in the context of polynomial curve fitting and will allow us to extend these to more complex situations.

1.2. Probability Theory

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition. When combined with decision theory, discussed in Section 1.5, it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.

We will introduce the basic concepts of probability theory by considering a simple example. Imagine we have two boxes, one red and one blue, and in the red box we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange. This is illustrated in Figure 1.9. Now suppose we randomly pick one of the boxes and from that box we randomly select an item of fruit, and having observed which sort of fruit it is we replace it in the box from which it came. We could imagine repeating this process many times. Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue box 60% of the time, and that when we remove an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

Figure 1.9 We use a simple example of two coloured boxes each containing fruit (apples shown in green and oranges shown in orange) to introduce the basic ideas of probability.

In this example, the identity of the box that will be chosen is a random variable, which we shall denote by $B$. This random variable can take one of two possible values, namely $r$ (corresponding to the red box) or $b$ (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by $F$. It can take either of the values $a$ (for apple) or $o$ (for orange).

To begin with, we shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is $4/10$ and the probability of selecting the blue box is $6/10$.
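This frequency definition can be illustrated by simulating the experiment; in the sketch below the number of trials and the seed are arbitrary choices, and the observed fraction of apples approaches the exact probability as the number of trials grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the box-and-fruit experiment: the red box holds 2 apples and
# 6 oranges, the blue box 3 apples and 1 orange, and the red box is
# picked 40% of the time.
boxes = {"r": ["a", "a", "o", "o", "o", "o", "o", "o"],
         "b": ["a", "a", "a", "o"]}
trials = 100_000
apples = 0
for box in rng.choice(["r", "b"], size=trials, p=[0.4, 0.6]):
    apples += rng.choice(boxes[box]) == "a"
print(apples / trials)  # close to the exact value 11/20 = 0.55
```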

Figure 1.10 We can derive the sum and product rules of probability by considering two random variables, $X$, which takes the values $\{x_i\}$ where $i = 1, \ldots, M$, and $Y$, which takes the values $\{y_j\}$ where $j = 1, \ldots, L$. In this illustration we have $M = 5$ and $L = 3$. If we consider a total number $N$ of instances of these variables, then we denote the number of instances where $X = x_i$ and $Y = y_j$ by $n_{ij}$, which is the number of points in the corresponding cell of the array. The number of points in column $i$, corresponding to $X = x_i$, is denoted by $c_i$, and the number of points in row $j$, corresponding to $Y = y_j$, is denoted by $r_j$.

We write these probabilities as $p(B = r) = 4/10$ and $p(B = b) = 6/10$. Note that, by definition, probabilities must lie in the interval $[0, 1]$. Also, if the events are mutually exclusive and if they include all possible outcomes (for instance, in this example the box must be either red or blue), then we see that the probabilities for those events must sum to one.

We can now ask questions such as: “what is the overall probability that the selection procedure will pick an apple?”, or “given that we have chosen an orange, what is the probability that the box we chose was the blue one?”. We can answer questions such as these, and indeed much more complex questions associated with problems in pattern recognition, once we have equipped ourselves with the two elementary rules of probability, known as the sum rule and the product rule. Having obtained these rules, we shall then return to our boxes of fruit example.

In order to derive the rules of probability, consider the slightly more general example shown in Figure 1.10 involving two random variables $X$ and $Y$ (which could for instance be the Box and Fruit variables considered above). We shall suppose that $X$ can take any of the values $x_i$ where $i = 1, \ldots, M$, and $Y$ can take the values $y_j$ where $j = 1, \ldots, L$. Consider a total of $N$ trials in which we sample both of the variables $X$ and $Y$, and let the number of such trials in which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Also, let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value that $Y$ takes) be denoted by $c_i$, and similarly let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.

The probability that $X$ will take the value $x_i$ and $Y$ will take the value $y_j$ is written $p(X = x_i, Y = y_j)$ and is called the joint probability of $X = x_i$ and $Y = y_j$. It is given by the number of points falling in the cell $i, j$ as a fraction of the total number of points, and hence

$$ p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}. \tag{1.5} $$

Here we are implicitly considering the limit $N \to \infty$. Similarly, the probability that $X$ takes the value $x_i$ irrespective of the value of $Y$ is written as $p(X = x_i)$ and is given by the fraction of the total number of points that fall in column $i$, so that

$$ p(X = x_i) = \frac{c_i}{N}. \tag{1.6} $$

Because the number of instances in column $i$ in Figure 1.10 is just the sum of the number of instances in each cell of that column, we have $c_i = \sum_j n_{ij}$ and therefore, from (1.5) and (1.6), we have

$$ p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) \tag{1.7} $$

which is the sum rule of probability. Note that $p(X = x_i)$ is sometimes called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case $Y$).

If we consider only those instances for which $X = x_i$, then the fraction of such instances for which $Y = y_j$ is written $p(Y = y_j \,|\, X = x_i)$ and is called the conditional probability of $Y = y_j$ given $X = x_i$. It is obtained by finding the fraction of those points in column $i$ that fall in cell $i, j$ and hence is given by

$$ p(Y = y_j \,|\, X = x_i) = \frac{n_{ij}}{c_i}. \tag{1.8} $$

From (1.5), (1.6), and (1.8), we can then derive the following relationship

$$ p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \,|\, X = x_i)\, p(X = x_i) \tag{1.9} $$

which is the product rule of probability.

So far we have been quite careful to make a distinction between a random variable, such as the box $B$ in the fruit example, and the values that the random variable can take, for example $r$ if the box were the red one. Thus the probability that $B$ takes the value $r$ is denoted $p(B = r)$. Although this helps to avoid ambiguity, it leads to a rather cumbersome notation, and in many cases there will be no need for such pedantry. Instead, we may simply write $p(B)$ to denote a distribution over the random variable $B$, or $p(r)$ to denote the distribution evaluated for the particular value $r$, provided that the interpretation is clear from the context.

With this more compact notation, we can write the two fundamental rules of probability theory in the following form.

The Rules of Probability

$$ \text{sum rule} \qquad p(X) = \sum_{Y} p(X, Y) \tag{1.10} $$

$$ \text{product rule} \qquad p(X, Y) = p(Y \,|\, X)\, p(X). \tag{1.11} $$

Here $p(X, Y)$ is a joint probability and is verbalized as “the probability of $X$ and $Y$”. Similarly, the quantity $p(Y \,|\, X)$ is a conditional probability and is verbalized as “the probability of $Y$ given $X$”, whereas the quantity $p(X)$ is a marginal probability and is simply “the probability of $X$”. These two simple rules form the basis for all of the probabilistic machinery that we use throughout this book.
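These identities can be checked numerically on any table of counts $n_{ij}$; a short sketch follows, in which the particular counts are an arbitrary illustrative choice.

```python
import numpy as np

# Check the sum and product rules on a table of counts n_ij, following
# the construction of Figure 1.10. Rows index y_j and columns index x_i.
n = np.array([[3.0, 1.0, 4.0],
              [1.0, 5.0, 9.0]])
N = n.sum()
p_xy = n / N                      # joint p(X = x_i, Y = y_j), eq. (1.5)
p_x = p_xy.sum(axis=0)            # marginal via the sum rule, eq. (1.7)
p_y_given_x = p_xy / p_x          # conditional p(Y = y_j | X = x_i), eq. (1.8)
assert np.allclose(p_y_given_x * p_x, p_xy)        # product rule, eq. (1.9)
assert np.allclose(p_y_given_x.sum(axis=0), 1.0)   # conditionals normalize
```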

These two simple rules form the basis for all of the probabilistic machinery that we use throughout this book.

From the product rule, together with the symmetry property p(X, Y) = p(Y, X), we immediately obtain the following relationship between conditional probabilities

    p(Y | X) = p(X | Y) p(Y) / p(X)    (1.12)

which is called Bayes' theorem and which plays a central role in pattern recognition and machine learning. Using the sum rule, the denominator in Bayes' theorem can be expressed in terms of the quantities appearing in the numerator

    p(X) = \sum_Y p(X | Y) p(Y).    (1.13)

We can view the denominator in Bayes' theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side of (1.12) over all values of Y equals one.

In Figure 1.11, we show a simple example involving a joint distribution over two variables to illustrate the concept of marginal and conditional distributions. Here a finite sample of N = 60 data points has been drawn from the joint distribution and is shown in the top left. In the top right is a histogram of the fractions of data points having each of the two values of Y. From the definition of probability, these fractions would equal the corresponding probabilities p(Y) in the limit N \to \infty. We can view the histogram as a simple way to model a probability distribution given only a finite number of points drawn from that distribution. Modelling distributions from data lies at the heart of statistical pattern recognition and will be explored in great detail in this book. The remaining two plots in Figure 1.11 show the corresponding histogram estimates of p(X) and p(X | Y = 1).

Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by

    p(B = r) = 4/10    (1.14)
    p(B = b) = 6/10    (1.15)

respectively. Note that these satisfy p(B = r) + p(B = b) = 1.

Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is 3/4, and so p(F = a | B = b) = 3/4. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box

    p(F = a | B = r) = 1/4    (1.16)
    p(F = o | B = r) = 3/4    (1.17)
    p(F = a | B = b) = 3/4    (1.18)
    p(F = o | B = b) = 1/4.    (1.19)

Figure 1.11: An illustration of a distribution over two variables, X, which takes 9 possible values, and Y, which takes two possible values. The top left figure shows a sample of 60 points drawn from a joint probability distribution over these variables. The remaining figures show histogram estimates of the marginal distributions p(X) and p(Y), as well as the conditional distribution p(X | Y = 1) corresponding to the bottom row in the top left figure.

Again, note that these probabilities are normalized so that

    p(F = a | B = r) + p(F = o | B = r) = 1    (1.20)

and similarly

    p(F = a | B = b) + p(F = o | B = b) = 1.    (1.21)

We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple

    p(F = a) = p(F = a | B = r) p(B = r) + p(F = a | B = b) p(B = b)
             = (1/4) \times (4/10) + (3/4) \times (6/10) = 11/20    (1.22)

from which it follows, using the sum rule, that p(F = o) = 1 - 11/20 = 9/20.

Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes' theorem to give

    p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o) = (3/4) \times (4/10) \times (20/9) = 2/3.    (1.23)

From the sum rule, it then follows that p(B = b | F = o) = 1 - 2/3 = 1/3.

We can provide an important interpretation of Bayes' theorem as follows. If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability p(B). We call this the prior probability because it is the probability available before we observe the identity of the fruit. Once we are told that the fruit is an orange, we can then use Bayes' theorem to compute the probability p(B | F), which we shall call the posterior probability because it is the probability obtained after we have observed F. Note that in this example, the prior probability of selecting the red box was 4/10, so that we were more likely to select the blue box than the red one. However, once we have observed that the piece of selected fruit is an orange, we find that the posterior probability of the red box is now 2/3, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, as the proportion of oranges is much higher in the red box than it is in the blue box, and so the observation that the fruit was an orange provides significant evidence favouring the red box. In fact, the evidence is sufficiently strong that it outweighs the prior and makes it more likely that the red box was chosen rather than the blue one.

Finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that p(X, Y) = p(X) p(Y), then X and Y are said to be independent. From the product rule, we see that p(Y | X) = p(Y), and so the conditional distribution of Y given X is indeed independent of the value of X. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then p(F | B) = p(F), so that the probability of selecting, say, an apple is independent of which box is chosen.

1.2.1 Probability densities

As well as considering probabilities defined over discrete sets of events, we also wish to consider probabilities with respect to continuous variables. We shall limit ourselves to a relatively informal discussion. If the probability of a real-valued variable x falling in the interval (x, x + \delta x) is given by p(x) \delta x for \delta x \to 0, then p(x) is called the probability density over x. This is illustrated in Figure 1.12. The probability that x will lie in an interval (a, b) is then given by

    p(x \in (a, b)) = \int_a^b p(x) \, dx.    (1.24)
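The fruit-box calculation above is small enough to reproduce exactly. Here is a minimal sketch, not part of the book, that encodes the prior (1.14)–(1.15) and the conditionals (1.16)–(1.19) and applies Bayes' theorem (1.12) with the denominator computed via the sum rule (1.13); exact fractions are used so the result can be compared directly with (1.23).

```python
from fractions import Fraction as F

# Prior p(B) from (1.14)-(1.15) and conditionals p(F | B) from (1.16)-(1.19).
prior = {"r": F(4, 10), "b": F(6, 10)}
likelihood = {("a", "r"): F(1, 4), ("o", "r"): F(3, 4),
              ("a", "b"): F(3, 4), ("o", "b"): F(1, 4)}

def posterior(fruit):
    """Posterior p(B | F = fruit) via Bayes' theorem (1.12)."""
    evidence = sum(likelihood[(fruit, b)] * prior[b] for b in prior)  # eq. (1.13)
    return {b: likelihood[(fruit, b)] * prior[b] / evidence for b in prior}

print(posterior("o"))   # {'r': Fraction(2, 3), 'b': Fraction(1, 3)}, matching (1.23)
```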

Figure 1.12: The concept of probability for discrete variables can be extended to that of a probability density p(x) over a continuous variable x and is such that the probability of x lying in the interval (x, x + \delta x) is given by p(x) \delta x for \delta x \to 0. The probability density can be expressed as the derivative of a cumulative distribution function P(x).

Because probabilities are nonnegative, and because the value of x must lie somewhere on the real axis, the probability density p(x) must satisfy the two conditions

    p(x) \geq 0    (1.25)
    \int_{-\infty}^{\infty} p(x) \, dx = 1.    (1.26)

Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. For instance, if we consider a change of variables x = g(y), then a function f(x) becomes \tilde{f}(y) = f(g(y)). Now consider a probability density p_x(x) that corresponds to a density p_y(y) with respect to the new variable y, where the suffices denote the fact that p_x(x) and p_y(y) are different densities. Observations falling in the range (x, x + \delta x) will, for small values of \delta x, be transformed into the range (y, y + \delta y), where p_x(x) \delta x \simeq p_y(y) \delta y, and hence

    p_y(y) = p_x(x) |dx/dy| = p_x(g(y)) |g'(y)|.    (1.27)

One consequence of this property is that the concept of the maximum of a probability density is dependent on the choice of variable (Exercise 1.4).

The probability that x lies in the interval (-\infty, z) is given by the cumulative distribution function defined by

    P(z) = \int_{-\infty}^{z} p(x) \, dx    (1.28)

which satisfies P'(x) = p(x), as shown in Figure 1.12.
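The transformation rule (1.27) can be checked empirically by transforming samples. The sketch below (an illustration of my own, not from the book) takes g(y) = \ln(y): if p_x(x) = N(x | 0, 1) and y = \exp(x), then (1.27) gives p_y(y) = N(\ln y | 0, 1)/y, the log-normal density, and a histogram of transformed samples should match it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Change of variables x = g(y) = ln(y), so y = exp(x). With p_x(x) = N(x | 0, 1),
# equation (1.27) gives p_y(y) = p_x(g(y)) |g'(y)| = N(ln y | 0, 1) / y.
x = rng.normal(size=500_000)
y = np.exp(x)                       # samples distributed according to p_y

grid = np.linspace(0.01, 20.0, 400)
p_y = np.exp(-0.5 * np.log(grid) ** 2) / (np.sqrt(2 * np.pi) * grid)

hist, edges = np.histogram(y, bins=400, range=(0.01, 20.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print(np.abs(hist - np.interp(centres, grid, p_y)).max())   # close to zero
```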

If we have several continuous variables x_1, ..., x_D, denoted collectively by the vector x, then we can define a joint probability density p(x) = p(x_1, ..., x_D) such that the probability of x falling in an infinitesimal volume \delta x containing the point x is given by p(x) \delta x. This multivariate probability density must satisfy

    p(x) \geq 0    (1.29)
    \int p(x) \, dx = 1    (1.30)

in which the integral is taken over the whole of x space. We can also consider joint probability distributions over a combination of discrete and continuous variables.

Note that if x is a discrete variable, then p(x) is sometimes called a probability mass function because it can be regarded as a set of 'probability masses' concentrated at the allowed values of x.

The sum and product rules of probability, as well as Bayes' theorem, apply equally to the case of probability densities, or to combinations of discrete and continuous variables. For instance, if x and y are two real variables, then the sum and product rules take the form

    p(x) = \int p(x, y) \, dy    (1.31)
    p(x, y) = p(y | x) p(x).    (1.32)

A formal justification of the sum and product rules for continuous variables (Feller, 1966) requires a branch of mathematics called measure theory and lies outside the scope of this book. Its validity can be seen informally, however, by dividing each real variable into intervals of width \Delta and considering the discrete probability distribution over these intervals. Taking the limit \Delta \to 0 then turns sums into integrals and gives the desired result.

1.2.2 Expectations and covariances

One of the most important operations involving probabilities is that of finding weighted averages of functions. The average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x) and will be denoted by E[f]. For a discrete distribution, it is given by

    E[f] = \sum_x p(x) f(x)    (1.33)

so that the average is weighted by the relative probabilities of the different values of x. In the case of continuous variables, expectations are expressed in terms of an integration with respect to the corresponding probability density

    E[f] = \int p(x) f(x) \, dx.    (1.34)

In either case, if we are given a finite number N of points drawn from the probability distribution or probability density, then the expectation can be approximated as a finite sum over these points

    E[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n).    (1.35)

We shall make extensive use of this result when we discuss sampling methods in Chapter 11. The approximation in (1.35) becomes exact in the limit N \to \infty.

Sometimes we will be considering expectations of functions of several variables, in which case we can use a subscript to indicate which variable is being averaged over, so that for instance

    E_x[f(x, y)]    (1.36)

denotes the average of the function f(x, y) with respect to the distribution of x. Note that E_x[f(x, y)] will be a function of y.

We can also consider a conditional expectation with respect to a conditional distribution, so that

    E_x[f | y] = \sum_x p(x | y) f(x)    (1.37)

with an analogous definition for continuous variables.

The variance of f(x) is defined by

    var[f] = E[(f(x) - E[f(x)])^2]    (1.38)

and provides a measure of how much variability there is in f(x) around its mean value E[f(x)]. Expanding out the square, we see that the variance can also be written in terms of the expectations of f(x) and f(x)^2 (Exercise 1.5)

    var[f] = E[f(x)^2] - E[f(x)]^2.    (1.39)

In particular, we can consider the variance of the variable x itself, which is given by

    var[x] = E[x^2] - E[x]^2.    (1.40)

For two random variables x and y, the covariance is defined by

    cov[x, y] = E_{x,y}[\{x - E[x]\}\{y - E[y]\}] = E_{x,y}[xy] - E[x] E[y]    (1.41)

which expresses the extent to which x and y vary together. If x and y are independent, then their covariance vanishes (Exercise 1.6).

In the case of two vectors of random variables x and y, the covariance is a matrix

    cov[x, y] = E_{x,y}[\{x - E[x]\}\{y^T - E[y^T]\}] = E_{x,y}[x y^T] - E[x] E[y^T].    (1.42)

If we consider the covariance of the components of a vector x with each other, then we use a slightly simpler notation cov[x] \equiv cov[x, x].
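The finite-sum approximation (1.35) and the identities (1.39) and (1.41) are easy to verify with samples. The following minimal sketch, not from the book, uses a Gaussian with illustrative parameters; the sample averages should agree with the analytic values up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000

# x ~ N(1, 2^2); approximate E[f] with f(x) = x^2 via the finite sum (1.35).
x = rng.normal(loc=1.0, scale=2.0, size=N)
f = x ** 2
print(f.mean())                                  # ≈ E[x^2] = mu^2 + sigma^2 = 5

# Variance identity (1.39): E[f^2] - E[f]^2 matches the definition (1.38).
print((f ** 2).mean() - f.mean() ** 2, f.var())

# Covariance (1.41) for the correlated pair (x, y = x + noise).
y = x + rng.normal(size=N)
print((x * y).mean() - x.mean() * y.mean(), np.cov(x, y)[0, 1])
```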

1.2.3 Bayesian probabilities

So far in this chapter, we have viewed probabilities in terms of the frequencies of random, repeatable events. We shall refer to this as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.

Consider an uncertain event, for example whether the moon was once in its own orbit around the sun, or whether the Arctic ice cap will have disappeared by the end of the century. These are not events that can be repeated numerous times in order to define a notion of probability as we did earlier in the context of boxes of fruit. Nevertheless, we will generally have some idea, for example, of how quickly we think the polar ice is melting. If we now obtain fresh evidence, for instance from a new Earth observation satellite gathering novel forms of diagnostic information, we may revise our opinion on the rate of ice loss. Our assessment of such matters will affect the actions we take, for instance the extent to which we endeavour to reduce the emission of greenhouse gases. In such circumstances, we would like to be able to quantify our expression of uncertainty and make precise revisions of uncertainty in the light of new evidence, as well as subsequently to be able to take optimal actions or decisions as a consequence. This can all be achieved through the elegant, and very general, Bayesian interpretation of probability.

The use of probability to represent uncertainty, however, is not an ad-hoc choice, but is inevitable if we are to respect common sense while making rational coherent inferences. For instance, Cox (1946) showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. This provided the first rigorous proof that probability theory could be regarded as an extension of Boolean logic to situations involving uncertainty (Jaynes, 2003). Numerous other authors have proposed different sets of properties or axioms that such measures of uncertainty should satisfy (Ramsey, 1931; Good, 1950; Savage, 1961; de Finetti, 1970; Lindley, 1982). In each case, the resulting numerical quantities behave precisely according to the rules of probability. It is therefore natural to refer to these quantities as (Bayesian) probabilities.

In the field of pattern recognition, too, it is helpful to have a more general notion of probability.

Thomas Bayes (1701–1761): Thomas Bayes was born in Tunbridge Wells and was a clergyman as well as an amateur scientist and a mathematician. He studied logic and theology at Edinburgh University and was elected Fellow of the Royal Society in 1742. During the 18th century, issues regarding probability arose in connection with gambling and with the new concept of insurance. One particularly important problem concerned so-called inverse probability. A solution was proposed by Thomas Bayes in his paper 'Essay towards solving a problem in the doctrine of chances', which was published in 1764, some three years after his death, in the Philosophical Transactions of the Royal Society. In fact, Bayes only formulated his theory for the case of a uniform prior, and it was Pierre-Simon Laplace who independently rediscovered the theory in general form and who demonstrated its broad applicability.

Consider the example of polynomial curve fitting discussed in Section 1.1. It seems reasonable to apply the frequentist notion of probability to the random values of the observed variables t_n. However, we would like to address and quantify the uncertainty that surrounds the appropriate choice for the model parameters w. We shall see that, from a Bayesian perspective, we can use the machinery of probability theory to describe the uncertainty in model parameters such as w, or indeed in the choice of model itself.

Bayes' theorem now acquires a new significance. Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes' theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data. As we shall see in detail later, we can adopt a similar approach when making inferences about quantities such as the parameters w in the polynomial curve fitting example. We capture our assumptions about w, before observing the data, in the form of a prior probability distribution p(w). The effect of the observed data D = {t_1, ..., t_N} is expressed through the conditional probability p(D | w), and we shall see later, in Section 1.2.5, how this can be represented explicitly. Bayes' theorem, which takes the form

    p(w | D) = p(D | w) p(w) / p(D)    (1.43)

then allows us to evaluate the uncertainty in w after we have observed D in the form of the posterior probability p(w | D).

The quantity p(D | w) on the right-hand side of Bayes' theorem is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w. Note that the likelihood is not a probability distribution over w, and its integral with respect to w does not (necessarily) equal one.

Given this definition of likelihood, we can state Bayes' theorem in words

    posterior \propto likelihood \times prior    (1.44)

where all of these quantities are viewed as functions of w. The denominator in (1.43) is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one. Indeed, integrating both sides of (1.43) with respect to w, we can express the denominator in Bayes' theorem in terms of the prior distribution and the likelihood function

    p(D) = \int p(D | w) p(w) \, dw.    (1.45)

In both the Bayesian and frequentist paradigms, the likelihood function p(D | w) plays a central role. However, the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, w is considered to be a fixed parameter, whose value is determined by some form of 'estimator', and error bars on this estimate are obtained by considering the distribution of possible data sets D.

By contrast, from the Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D | w). This corresponds to choosing the value of w for which the probability of the observed data set is maximized. In the machine learning literature, the negative log of the likelihood function is called an error function. Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error.

One approach to determining frequentist error bars is the bootstrap (Efron, 1979; Hastie et al., 2001), in which multiple data sets are created as follows. Suppose our original data set consists of N data points X = {x_1, ..., x_N}. We can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B. This process can be repeated L times to generate L data sets, each of size N and each obtained by sampling from the original data set X. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets (a minimal sketch of this procedure appears at the end of this discussion).

One advantage of the Bayesian viewpoint is that the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion (Section 2.1).

There has been much controversy and debate associated with the relative merits of the frequentist and Bayesian paradigms, which have not been helped by the fact that there is no unique frequentist, or even Bayesian, viewpoint. For instance, one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs. Even the subjective nature of the conclusions through their dependence on the choice of prior is seen by some as a source of difficulty. Reducing the dependence on the prior is one motivation for so-called noninformative priors (Section 2.4.3). However, these lead to difficulties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high confidence. Frequentist evaluation methods offer some protection from such problems, and techniques such as cross-validation (Section 1.3) remain useful in areas such as model comparison.

This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.
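As promised above, here is a minimal sketch of the bootstrap. It is not from the book: the data are synthetic, the estimator under study is simply the sample mean, and the number of bootstrap data sets L is an arbitrary illustrative choice. The spread of the estimates across the L resampled data sets serves as a frequentist error bar.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic original data set X of N points (1-D here for simplicity).
X = rng.normal(loc=2.0, scale=1.5, size=100)

L = 1000                       # number of bootstrap data sets
estimates = np.empty(L)
for ell in range(L):
    # Draw N points from X with replacement to form one bootstrap set X_B.
    X_B = rng.choice(X, size=X.size, replace=True)
    estimates[ell] = X_B.mean()            # the estimator under study

# Variability across bootstrap sets gives an error bar on the estimate.
print(X.mean(), estimates.std())           # std ≈ sigma / sqrt(N) here
```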
Although the Bayesian framework has its origins in the 18th century, the practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models.

The development of sampling methods, such as Markov chain Monte Carlo (discussed in Chapter 11), along with dramatic improvements in the speed and memory capacity of computers, opened the door to the practical use of Bayesian techniques in an impressive range of problem domains. Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.

More recently, highly efficient deterministic approximation schemes such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications (Blei et al., 2003).

1.2.4 The Gaussian distribution

We shall devote the whole of Chapter 2 to a study of various probability distributions and their key properties. It is convenient, however, to introduce here one of the most important probability distributions for continuous variables, called the normal or Gaussian distribution. We shall make extensive use of this distribution in the remainder of this chapter and indeed throughout much of the book.

For the case of a single real-valued variable x, the Gaussian distribution is defined by

    N(x | \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}    (1.46)

which is governed by two parameters: \mu, called the mean, and \sigma^2, called the variance. The square root of the variance, given by \sigma, is called the standard deviation, and the reciprocal of the variance, written as \beta = 1/\sigma^2, is called the precision. We shall see the motivation for these terms shortly. Figure 1.13 shows a plot of the Gaussian distribution.

From the form of (1.46) we see that the Gaussian distribution satisfies

    N(x | \mu, \sigma^2) > 0.    (1.47)

Pierre-Simon Laplace (1749–1827): It is said that Laplace was seriously lacking in modesty and at one point declared himself to be the best mathematician in France at the time, a claim that was arguably true. As well as being prolific in mathematics, he also made numerous contributions to astronomy, including the nebular hypothesis by which the earth is thought to have formed from the condensation and cooling of a large rotating disk of gas and dust. In 1812 he published the first edition of Théorie Analytique des Probabilités, in which Laplace states that "probability theory is nothing but common sense reduced to calculation". This work included a discussion of the inverse probability calculation (later termed Bayes' theorem by Poincaré), which he used to solve problems in life expectancy, jurisprudence, planetary masses, triangulation, and error estimation.

Figure 1.13: Plot of the univariate Gaussian N(x | \mu, \sigma^2) showing the mean \mu and the standard deviation \sigma.

Also it is straightforward to show that the Gaussian is normalized, so that (Exercise 1.7)

    \int_{-\infty}^{\infty} N(x | \mu, \sigma^2) \, dx = 1.    (1.48)

Thus (1.46) satisfies the two requirements for a valid probability density.

We can readily find expectations of functions of x under the Gaussian distribution. In particular, the average value of x is given by (Exercise 1.8)

    E[x] = \int_{-\infty}^{\infty} N(x | \mu, \sigma^2) \, x \, dx = \mu.    (1.49)

Because the parameter \mu represents the average value of x under the distribution, it is referred to as the mean. Similarly, for the second order moment

    E[x^2] = \int_{-\infty}^{\infty} N(x | \mu, \sigma^2) \, x^2 \, dx = \mu^2 + \sigma^2.    (1.50)

From (1.49) and (1.50), it follows that the variance of x is given by

    var[x] = E[x^2] - E[x]^2 = \sigma^2    (1.51)

and hence \sigma^2 is referred to as the variance parameter. The maximum of a distribution is known as its mode. For a Gaussian, the mode coincides with the mean (Exercise 1.9).

We are also interested in the Gaussian distribution defined over a D-dimensional vector x of continuous variables, which is given by

    N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}    (1.52)

where the D-dimensional vector \mu is called the mean, the D \times D matrix \Sigma is called the covariance, and |\Sigma| denotes the determinant of \Sigma. We shall make use of the multivariate Gaussian distribution briefly in this chapter, although its properties will be studied in detail in Section 2.3.
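The properties (1.48)–(1.51) can be checked numerically without any calculus. The sketch below, not from the book, evaluates the integrals by simple Riemann sums over a wide grid, with illustrative values of \mu and \sigma.

```python
import numpy as np

mu, sigma = 1.0, 0.5                      # illustrative parameter values
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]
gauss = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

# Riemann-sum checks of (1.48)-(1.51).
print((gauss * dx).sum())                          # ≈ 1, normalization (1.48)
print((gauss * x * dx).sum())                      # ≈ mu, the mean (1.49)
second = (gauss * x ** 2 * dx).sum()               # ≈ mu^2 + sigma^2, (1.50)
print(second - (gauss * x * dx).sum() ** 2)        # ≈ sigma^2, variance (1.51)
```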

Figure 1.14: Illustration of the likelihood function for a Gaussian distribution, shown by the red curve. Here the black points denote a data set of values {x_n}, and the likelihood function given by (1.53) corresponds to the product of the blue values N(x_n | \mu, \sigma^2). Maximizing the likelihood involves adjusting the mean and variance of the Gaussian so as to maximize this product.

Now suppose that we have a data set of observations x = (x_1, ..., x_N)^T, representing N observations of the scalar variable x. Note that we are using the typeface x to distinguish this from a single observation of the vector-valued variable (x_1, ..., x_D)^T, which we denote by x. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean \mu and variance \sigma^2 are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d. We have seen that the joint probability of two independent events is given by the product of the marginal probabilities for each event separately. Because our data set x is i.i.d., we can therefore write the probability of the data set, given \mu and \sigma^2, in the form

    p(x | \mu, \sigma^2) = \prod_{n=1}^{N} N(x_n | \mu, \sigma^2).    (1.53)

When viewed as a function of \mu and \sigma^2, this is the likelihood function for the Gaussian and is interpreted diagrammatically in Figure 1.14.

One common criterion for determining the parameters in a probability distribution using an observed data set is to find the parameter values that maximize the likelihood function. This might seem like a strange criterion because, from our foregoing discussion of probability theory, it would seem more natural to maximize the probability of the parameters given the data, not the probability of the data given the parameters. In fact, these two criteria are related, as we shall discuss in the context of curve fitting (Section 1.2.5).

For the moment, however, we shall determine values for the unknown parameters \mu and \sigma^2 in the Gaussian by maximizing the likelihood function (1.53). In practice, it is more convenient to maximize the log of the likelihood function. Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself. Taking the log not only simplifies the subsequent mathematical analysis, but it also helps numerically because the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is resolved by computing instead the sum of the log probabilities.

From (1.46) and (1.53), the log likelihood function can be written in the form

    \ln p(x | \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi).    (1.54)

Maximizing (1.54) with respect to \mu, we obtain the maximum likelihood solution given by (Exercise 1.11)

    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n    (1.55)

which is the sample mean, i.e., the mean of the observed values {x_n}. Similarly, maximizing (1.54) with respect to \sigma^2, we obtain the maximum likelihood solution for the variance in the form

    \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2    (1.56)

which is the sample variance measured with respect to the sample mean \mu_{ML}. Note that we are performing a joint maximization of (1.54) with respect to \mu and \sigma^2, but in the case of the Gaussian distribution the solution for \mu decouples from that for \sigma^2, so that we can first evaluate (1.55) and then subsequently use this result to evaluate (1.56).

Later in this chapter, and also in subsequent chapters, we shall highlight the significant limitations of the maximum likelihood approach. Here we give an indication of the problem in the context of our solutions for the maximum likelihood parameter settings for the univariate Gaussian distribution. In particular, we shall show that the maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting (Section 1.1).

We first note that the maximum likelihood solutions \mu_{ML} and \sigma^2_{ML} are functions of the data set values x_1, ..., x_N. Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters \mu and \sigma^2. It is straightforward to show that (Exercise 1.12)

    E[\mu_{ML}] = \mu    (1.57)
    E[\sigma^2_{ML}] = \frac{N - 1}{N} \sigma^2    (1.58)

so that on average the maximum likelihood estimate will obtain the correct mean but will underestimate the true variance by a factor (N - 1)/N. The intuition behind this result is given by Figure 1.15.

From (1.58) it follows that the following estimate for the variance parameter is unbiased

    \tilde{\sigma}^2 = \frac{N}{N - 1} \sigma^2_{ML} = \frac{1}{N - 1} \sum_{n=1}^{N} (x_n - \mu_{ML})^2.    (1.59)
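The bias results (1.57)–(1.59) can be demonstrated by simulation. This minimal sketch, not from the book, repeatedly draws very small data sets (N = 2, matching the spirit of Figure 1.15) from a known Gaussian and averages the maximum likelihood estimates over many trials.

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma2, N, trials = 0.0, 1.0, 2, 100_000

# Draw many small data sets and compute the ML estimates (1.55) and (1.56).
data = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = data.mean(axis=1)
var_ml = ((data - mu_ml[:, None]) ** 2).mean(axis=1)

print(mu_ml.mean())                      # ≈ mu, consistent with (1.57)
print(var_ml.mean())                     # ≈ (N-1)/N * sigma2 = 0.5, cf. (1.58)
print((N / (N - 1)) * var_ml.mean())     # ≈ sigma2, the unbiased estimate (1.59)
```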

Figure 1.15: Illustration of how bias arises in using maximum likelihood to determine the variance of a Gaussian. The green curve shows the true Gaussian distribution from which data is generated, and the three red curves show the Gaussian distributions obtained by fitting to three data sets, each consisting of two data points shown in blue, using the maximum likelihood results (1.55) and (1.56). Averaged across the three data sets, the mean is correct, but the variance is systematically under-estimated because it is measured relative to the sample mean and not relative to the true mean.

In Section 10.1.3, we shall see how this result arises automatically when we adopt a Bayesian approach. Note that the bias of the maximum likelihood solution becomes less significant as the number N of data points increases, and in the limit N \to \infty the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data. In practice, for anything other than small N, this bias will not prove to be a serious problem. However, throughout this book we shall be interested in more complex models with many parameters, for which the bias problems associated with maximum likelihood will be much more severe. In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.

1.2.5 Curve fitting re-visited

We have seen how the problem of polynomial curve fitting can be expressed in terms of error minimization (Section 1.1). Here we return to the curve fitting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment.

The goal in the curve fitting problem is to be able to make predictions for the target variable t given some new value of the input variable x on the basis of a set of training data comprising N input values x = (x_1, ..., x_N)^T and their corresponding target values t = (t_1, ..., t_N)^T. We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x, w) of the polynomial curve given by (1.1). Thus we have

    p(t | x, w, \beta) = N(t | y(x, w), \beta^{-1})    (1.60)

where, for consistency with the notation in later chapters, we have defined a precision parameter \beta corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16.

Figure 1.16: Schematic illustration of a Gaussian conditional distribution for t given x, given by (1.60), in which the mean is given by the polynomial function y(x, w), and the precision is given by the parameter \beta, which is related to the variance by \beta^{-1} = \sigma^2.

We now use the training data {x, t} to determine the values of the unknown parameters w and \beta by maximum likelihood. If the data are assumed to be drawn independently from the distribution (1.60), then the likelihood function is given by

    p(t | x, w, \beta) = \prod_{n=1}^{N} N(t_n | y(x_n, w), \beta^{-1}).    (1.61)

As we did in the case of the simple Gaussian distribution earlier, it is convenient to maximize the logarithm of the likelihood function. Substituting for the form of the Gaussian distribution, given by (1.46), we obtain the log likelihood function in the form

    \ln p(t | x, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi).    (1.62)

Consider first the determination of the maximum likelihood solution for the polynomial coefficients, which will be denoted by w_{ML}. These are determined by maximizing (1.62) with respect to w. For this purpose, we can omit the last two terms on the right-hand side of (1.62) because they do not depend on w. Also, we note that scaling the log likelihood by a positive constant coefficient does not alter the location of the maximum with respect to w, and so we can replace the coefficient \beta/2 with 1/2. Finally, instead of maximizing the log likelihood, we can equivalently minimize the negative log likelihood. We therefore see that maximizing likelihood is equivalent, so far as determining w is concerned, to minimizing the sum-of-squares error function defined by (1.2). Thus the sum-of-squares error function has arisen as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution.

We can also use maximum likelihood to determine the precision parameter \beta of the Gaussian conditional distribution. Maximizing (1.62) with respect to \beta gives

    \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, w_{ML}) - t_n\}^2.    (1.63)
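Since maximizing (1.61) over w reduces to least squares, the whole maximum likelihood fit can be written in a few lines. This sketch is my own illustration, not from the book: the sinusoidal data, noise level, and polynomial order are assumed for the purpose of the example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic sinusoidal data in the spirit of the book's running example.
N, M = 10, 3                                   # data points, polynomial order
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

Phi = np.vander(x, M + 1, increasing=True)     # design matrix, phi_i(x) = x^i

# Maximizing the likelihood (1.61) over w is least squares on the error (1.2).
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Precision from (1.63): the inverse of the mean squared residual.
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)
print(w_ml, beta_ml)
```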

As with the simple Gaussian distribution earlier (Section 1.2.4), we can again first determine the parameter vector w_{ML} governing the mean and subsequently use this to find the precision \beta_{ML}.

Having determined the parameters w and \beta, we can now make predictions for new values of x. Because we now have a probabilistic model, these are expressed in terms of the predictive distribution that gives the probability distribution over t, rather than simply a point estimate, and is obtained by substituting the maximum likelihood parameters into (1.60) to give

    p(t | x, w_{ML}, \beta_{ML}) = N(t | y(x, w_{ML}), \beta_{ML}^{-1}).    (1.64)

Now let us take a step towards a more Bayesian approach and introduce a prior distribution over the polynomial coefficients w. For simplicity, let us consider a Gaussian distribution of the form

    p(w | \alpha) = N(w | 0, \alpha^{-1} I) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{ -\frac{\alpha}{2} w^T w \right\}    (1.65)

where \alpha is the precision of the distribution, and M + 1 is the total number of elements in the vector w for an Mth order polynomial. Variables such as \alpha, which control the distribution of model parameters, are called hyperparameters. Using Bayes' theorem, the posterior distribution for w is proportional to the product of the prior distribution and the likelihood function

    p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) p(w | \alpha).    (1.66)

We can now determine w by finding the most probable value of w given the data, in other words by maximizing the posterior distribution. This technique is called maximum posterior, or simply MAP. Taking the negative logarithm of (1.66) and combining with (1.62) and (1.65), we find that the maximum of the posterior is given by the minimum of

    \frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{\alpha}{2} w^T w.    (1.67)

Thus we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered earlier in the form (1.4), with a regularization parameter given by \lambda = \alpha/\beta.
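Because (1.67) is a regularized (ridge) least-squares objective, its minimizer has a closed form: setting the gradient to zero gives (\beta \Phi^T \Phi + \alpha I) w = \beta \Phi^T t. The following sketch, not from the book and using assumed hyperparameter values, computes the MAP solution this way.

```python
import numpy as np

rng = np.random.default_rng(6)

N, M = 10, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
Phi = np.vander(x, M + 1, increasing=True)

alpha, beta = 5e-3, 11.1        # illustrative hyperparameter values

# Minimizing (1.67) in closed form: (beta Phi^T Phi + alpha I) w = beta Phi^T t.
A = beta * Phi.T @ Phi + alpha * np.eye(M + 1)
w_map = np.linalg.solve(A, beta * Phi.T @ t)
print(w_map)
```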

1.2.6 Bayesian curve fitting

Although we have included a prior distribution p(w | \alpha), we are so far still making a point estimate of w and so this does not yet amount to a Bayesian treatment. In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we shall see shortly, that we integrate over all values of w. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.

In the curve fitting problem, we are given the training data x and t, along with a new test point x, and our goal is to predict the value of t. We therefore wish to evaluate the predictive distribution p(t | x, x, t). Here we shall assume that the parameters \alpha and \beta are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).

A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form

    p(t | x, x, t) = \int p(t | x, w) p(w | x, t) \, dw.    (1.68)

Here p(t | x, w) is given by (1.60), and we have omitted the dependence on \alpha and \beta to simplify the notation. Here p(w | x, t) is the posterior distribution over parameters, and can be found by normalizing the right-hand side of (1.66). We shall see in Section 3.3 that, for problems such as the curve-fitting example, this posterior distribution is a Gaussian and can be evaluated analytically. Similarly, the integration in (1.68) can also be performed analytically with the result that the predictive distribution is given by a Gaussian of the form

    p(t | x, x, t) = N(t | m(x), s^2(x))    (1.69)

where the mean and variance are given by

    m(x) = \beta \, \phi(x)^T S \sum_{n=1}^{N} \phi(x_n) t_n    (1.70)
    s^2(x) = \beta^{-1} + \phi(x)^T S \phi(x).    (1.71)

Here the matrix S is given by

    S^{-1} = \alpha I + \beta \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T    (1.72)

where I is the unit matrix, and we have defined the vector \phi(x) with elements \phi_i(x) = x^i for i = 0, ..., M.

We see that the variance, as well as the mean, of the predictive distribution in (1.69) is dependent on x. The first term in (1.71) represents the uncertainty in the predicted value of t due to the noise on the target variables and was expressed already in the maximum likelihood predictive distribution (1.64) through \beta_{ML}^{-1}. However, the second term arises from the uncertainty in the parameters w and is a consequence of the Bayesian treatment. The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17.
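Equations (1.70)–(1.72) translate directly into code. The sketch below, my own illustration with assumed data and hyperparameters, evaluates the predictive mean and variance at a new input; note how the variance (1.71) combines the noise term \beta^{-1} with the parameter-uncertainty term \phi(x)^T S \phi(x).

```python
import numpy as np

rng = np.random.default_rng(7)

N, M = 10, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
alpha, beta = 5e-3, 11.1

def phi(x_val):
    """Polynomial basis vector phi_i(x) = x^i for i = 0, ..., M."""
    return np.power(x_val, np.arange(M + 1))

Phi = np.stack([phi(xn) for xn in x])         # N x (M+1) design matrix

# S^{-1} = alpha I + beta sum_n phi(x_n) phi(x_n)^T, equation (1.72).
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def predict(x_new):
    """Predictive mean (1.70) and variance (1.71) at a new input x_new."""
    p = phi(x_new)
    mean = beta * p @ S @ Phi.T @ t
    var = 1.0 / beta + p @ S @ p
    return mean, var

print(predict(0.5))
```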

Figure 1.17: The predictive distribution resulting from a Bayesian treatment of polynomial curve fitting using an M = 9 polynomial, with the fixed parameters \alpha = 5 \times 10^{-3} and \beta = 11.1 (corresponding to the known noise variance), in which the red curve denotes the mean of the predictive distribution and the red region corresponds to \pm 1 standard deviation around the mean.

1.3. Model Selection

In our example of polynomial curve fitting using least squares, we saw that there was an optimal order of polynomial that gave the best generalization. The order of the polynomial controls the number of free parameters in the model and thereby governs the model complexity. With regularized least squares, the regularization coefficient \lambda also controls the effective complexity of the model, whereas for more complex models, such as mixture distributions or neural networks, there may be multiple parameters governing complexity. In a practical application, we need to determine the values of such parameters, and the principal objective in doing so is usually to achieve the best predictive performance on new data. Furthermore, as well as finding the appropriate values for complexity parameters within a given model, we may wish to consider a range of different types of model in order to find the best one for our particular application.

We have already seen that, in the maximum likelihood approach, the performance on the training set is not a good indicator of predictive performance on unseen data due to the problem of over-fitting. If data is plentiful, then one approach is simply to use some of the available data to train a range of models, or a given model with a range of values for its complexity parameters, and then to compare them on independent data, sometimes called a validation set, and select the one having the best predictive performance. If the model design is iterated many times using a limited size data set, then some over-fitting to the validation data can occur and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated.

In many applications, however, the supply of data for training and testing will be limited, and in order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution to this dilemma is to use cross-validation, which is illustrated in Figure 1.18. This allows a proportion (S - 1)/S of the available data to be used for training while making use of all of the data to assess performance.

Figure 1.18: The technique of S-fold cross-validation, illustrated here for the case of S = 4, involves taking the available data and partitioning it into S groups (in the simplest case these are of equal size). Then S - 1 of the groups are used to train a set of models that are then evaluated on the remaining group. This procedure is then repeated for all S possible choices for the held-out group, indicated here by the red blocks, and the performance scores from the S runs are then averaged.

When data is particularly scarce, it may be appropriate to consider the case S = N, where N is the total number of data points, which gives the leave-one-out technique.

One major drawback of cross-validation is that the number of training runs that must be performed is increased by a factor of S, and this can prove problematic for models in which the training is itself computationally expensive. A further problem with techniques such as cross-validation that use separate data to assess performance is that we might have multiple complexity parameters for a single model (for instance, there might be several regularization parameters). Exploring combinations of settings for such parameters could, in the worst case, require a number of training runs that is exponential in the number of parameters. Clearly, we need a better approach. Ideally, this should rely only on the training data and should allow multiple hyperparameters and model types to be compared in a single training run. We therefore need to find a measure of performance which depends only on the training data and which does not suffer from bias due to over-fitting.

Historically, various 'information criteria' have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models. For example, the Akaike information criterion, or AIC (Akaike, 1974), chooses the model for which the quantity

    \ln p(D | w_{ML}) - M    (1.73)

is largest. Here \ln p(D | w_{ML}) is the best-fit log likelihood, and M is the number of adjustable parameters in the model. A variant of this quantity, called the Bayesian information criterion, or BIC, will be discussed in Section 4.4.1. Such criteria do not take account of the uncertainty in the model parameters, however, and in practice they tend to favour overly simple models. We therefore turn in Section 3.4 to a fully Bayesian approach where we shall see how complexity penalties arise in a natural and principled way.
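To make the S-fold procedure of Figure 1.18 concrete, here is a minimal sketch, not from the book, that uses S = 4 folds to choose the polynomial order M for the synthetic sinusoidal data (data set size, noise level, and candidate orders are all assumed for illustration).

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic data; we select the polynomial order by 4-fold cross-validation.
x = rng.uniform(0, 1, size=40)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

S = 4
folds = np.array_split(rng.permutation(x.size), S)

def cv_error(M):
    errs = []
    for s in range(S):
        val = folds[s]                                   # held-out group
        train = np.concatenate([folds[r] for r in range(S) if r != s])
        Phi_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
        Phi_val = np.vander(x[val], M + 1, increasing=True)
        errs.append(np.mean((Phi_val @ w - t[val]) ** 2))
    return np.mean(errs)          # average score over the S runs

for M in range(10):
    print(M, cv_error(M))         # choose the order with the lowest error
```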

1.4. The Curse of Dimensionality

In the polynomial curve fitting example we had just one input variable x. For practical applications of pattern recognition, however, we will have to deal with spaces of high dimensionality comprising many input variables. As we now discuss, this poses some serious challenges and is an important factor influencing the design of pattern recognition techniques.

Figure 1.19: Scatter plot of the oil flow data for input variables x_6 and x_7, in which red denotes the 'homogenous' class, green denotes the 'annular' class, and blue denotes the 'laminar' class. Our goal is to classify the new test point denoted by '×'.

In order to illustrate the problem we consider a synthetically generated data set representing measurements taken from a pipeline containing a mixture of oil, water, and gas (Bishop and James, 1993). These three materials can be present in one of three different geometrical configurations known as 'homogenous', 'annular', and 'laminar', and the fractions of the three materials can also vary. Each data point comprises a 12-dimensional input vector consisting of measurements taken with gamma ray densitometers that measure the attenuation of gamma rays passing along narrow beams through the pipe. This data set is described in detail in Appendix A. Figure 1.19 shows 100 points from this data set on a plot showing two of the measurements x_6 and x_7 (the remaining ten input values are ignored for the purposes of this illustration). Each data point is labelled according to which of the three geometrical classes it belongs to, and our goal is to use this data as a training set in order to be able to classify a new observation (x_6, x_7), such as the one denoted by the cross in Figure 1.19. We observe that the cross is surrounded by numerous red points, and so we might suppose that it belongs to the red class. However, there are also plenty of green points nearby, so we might think that it could instead belong to the green class. It seems unlikely that it belongs to the blue class. The intuition here is that the identity of the cross should be determined more strongly by nearby points from the training set and less strongly by more distant points. In fact, this intuition turns out to be reasonable and will be discussed more fully in later chapters.

How can we turn this intuition into a learning algorithm? One very simple approach would be to divide the input space into regular cells, as indicated in Figure 1.20. When we are given a test point and we wish to predict its class, we first decide which cell it belongs to, and we then find all of the training data points that fall in the same cell.

Figure 1.20: Illustration of a simple approach to the solution of a classification problem in which the input space is divided into cells and any new test point is assigned to the class that has a majority number of representatives in the same cell as the test point. As we shall see shortly, this simplistic approach has some severe shortcomings.

The identity of the test point is then predicted as being the same as the class having the largest number of training points in the same cell as the test point (with ties being broken at random).

There are numerous problems with this naive approach, but one of the most severe becomes apparent when we consider its extension to problems having larger numbers of input variables, corresponding to input spaces of higher dimensionality. The origin of the problem is illustrated in Figure 1.21, which shows that, if we divide a region of a space into regular cells, then the number of such cells grows exponentially with the dimensionality of the space. The problem with an exponentially large number of cells is that we would need an exponentially large quantity of training data in order to ensure that the cells are not empty. Clearly, we have no hope of applying such a technique in a space of more than a few variables, and so we need to find a more sophisticated approach.

Figure 1.21: Illustration of the curse of dimensionality, showing how the number of regions of a regular grid grows exponentially with the dimensionality D of the space. For clarity, only a subset of the cubical regions are shown for D = 3.

We can gain further insight into the problems of high-dimensional spaces by returning to the example of polynomial curve fitting (Section 1.1) and considering how we would extend this approach to deal with input spaces having several variables.

If we have D input variables, then a general polynomial with coefficients up to order 3 would take the form

    y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{ijk} x_i x_j x_k.    (1.74)

As D increases, so the number of independent coefficients (not all of the coefficients are independent, due to interchange symmetries amongst the x variables) grows proportionally to D^3. In practice, to capture complex dependencies in the data, we may need to use a higher-order polynomial. For a polynomial of order M, the growth in the number of coefficients is like D^M (Exercise 1.16). Although this is now a power law growth, rather than an exponential growth, it still points to the method becoming rapidly unwieldy and of limited practical utility.

Our geometrical intuitions, formed through a life spent in a space of three dimensions, can fail badly when we consider spaces of higher dimensionality. As a simple example, consider a sphere of radius r = 1 in a space of D dimensions, and ask what is the fraction of the volume of the sphere that lies between radius r = 1 - \epsilon and r = 1. We can evaluate this fraction by noting that the volume of a sphere of radius r in D dimensions must scale as r^D, and so we write

    V_D(r) = K_D r^D    (1.75)

where the constant K_D depends only on D (Exercise 1.18). Thus the required fraction is given by

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D    (1.76)

which is plotted as a function of \epsilon for various values of D in Figure 1.22. We see that, for large D, this fraction tends to 1 even for small values of \epsilon. Thus, in spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface!

As a further example, of direct relevance to pattern recognition, consider the behaviour of a Gaussian distribution in a high-dimensional space. If we transform from Cartesian to polar coordinates, and then integrate out the directional variables, we obtain an expression for the density p(r) as a function of radius r from the origin (Exercise 1.20). Thus p(r) \delta r is the probability mass inside a thin shell of thickness \delta r located at radius r. This distribution is plotted, for various values of D, in Figure 1.23, and we see that for large D the probability mass of the Gaussian is concentrated in a thin shell.

The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality (Bellman, 1961). In this book, we shall make extensive use of illustrative examples involving input spaces of one or two dimensions, because this makes it particularly easy to illustrate the techniques graphically. The reader should be warned, however, that not all intuitions developed in spaces of low dimensionality will generalize to spaces of many dimensions.
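Both concentration effects above are easy to confirm numerically. The sketch below, not from the book, first evaluates the shell fraction (1.76) for an assumed \epsilon, and then samples from a D-dimensional standard Gaussian to show that the radii of the samples cluster in a thin shell whose width stays roughly constant while its radius grows like \sqrt{D}.

```python
import numpy as np

rng = np.random.default_rng(9)

# Fraction of the unit sphere's volume in the shell 1 - eps <= r <= 1, eq. (1.76).
eps = 0.01                                  # illustrative shell thickness
for D in (1, 2, 5, 20, 200):
    print(D, 1 - (1 - eps) ** D)            # tends to 1 as D grows

# Radii of samples from a D-dimensional standard Gaussian: the mass concentrates
# in a thin shell, with mean radius ~ sqrt(D) but roughly constant spread.
for D in (1, 2, 20, 200):
    r = np.linalg.norm(rng.normal(size=(100_000, D)), axis=1)
    print(D, r.mean(), r.std())
```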

Figure 1.22: Plot of the fraction of the volume of a sphere lying in the range r = 1 - \epsilon to r = 1 for various values of the dimensionality D.

Figure 1.23: Plot of the probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D. In a high-dimensional space, most of the probability mass of a Gaussian is located within a thin shell at a specific radius.

Although the curse of dimensionality certainly raises important issues for pattern recognition applications, it does not prevent us from finding effective techniques applicable to high-dimensional spaces. The reasons for this are twofold. First, real data will often be confined to a region of the space having lower effective dimensionality, and in particular the directions over which important variations in the target variables occur may be so confined. Second, real data will typically exhibit some smoothness properties (at least locally), so that for the most part small changes in the input variables will produce small changes in the target variables, and so we can exploit local interpolation-like techniques to allow us to make predictions of the target variables for new values of the input variables. Successful pattern recognition techniques exploit one or both of these properties. Consider, for example, an application in manufacturing in which images are captured of identical planar objects on a conveyor belt, in which the goal is to determine their orientation.

Each image is a point in a high-dimensional space whose dimensionality is determined by the number of pixels. Because the objects can occur at different positions within the image and in different orientations, there are three degrees of freedom of variability between images, and a set of images will live on a three-dimensional manifold embedded within the high-dimensional space. Due to the complex relationships between the object position or orientation and the pixel intensities, this manifold will be highly nonlinear. If the goal is to learn a model that can take an input image and output the orientation of the object irrespective of its position, then there is only one degree of freedom of variability within the manifold that is significant.

1.5. Decision Theory

We have seen in Section 1.2 how probability theory provides us with a consistent mathematical framework for quantifying and manipulating uncertainty. Here we turn to a discussion of decision theory that, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty such as those encountered in pattern recognition.

Suppose we have an input vector x together with a corresponding vector t of target variables, and our goal is to predict t given a new value for x. For regression problems, t will comprise continuous variables, whereas for classification problems t will represent class labels. The joint probability distribution p(x, t) provides a complete summary of the uncertainty associated with these variables. Determination of p(x, t) from a set of training data is an example of inference and is typically a very difficult problem whose solution forms the subject of much of this book. In a practical application, however, we must often make a specific prediction for the value of t, or more generally take a specific action based on our understanding of the values t is likely to take, and this aspect is the subject of decision theory.

Consider, for example, a medical diagnosis problem in which we have taken an X-ray image of a patient, and we wish to determine whether the patient has cancer or not. In this case, the input vector x is the set of pixel intensities in the image, and the output variable t will represent the presence of cancer, which we denote by the class C_1, or the absence of cancer, which we denote by the class C_2. We might, for instance, choose t to be a binary variable such that t = 0 corresponds to class C_1 and t = 1 corresponds to class C_2. We shall see later that this choice of label values is particularly convenient for probabilistic models. The general inference problem then involves determining the joint distribution p(x, C_k), or equivalently p(x, t), which gives us the most complete probabilistic description of the situation. Although this can be a very useful and informative quantity, in the end we must decide either to give treatment to the patient or not, and we would like this choice to be optimal in some appropriate sense (Duda and Hart, 1973). This is the decision step, and it is the subject of decision theory to tell us how to make optimal decisions given the appropriate probabilities. We shall see that the decision stage is generally very simple, even trivial, once we have solved the inference problem.

Here we give an introduction to the key ideas of decision theory as required for the rest of the book.

59 Before giving a more detailed analysis, let us first consider informally how we might expect probabilities to play a role in making decisions. When we obtain the X-ray image x for a new patient, our goal is to decide which of the two classes to assign to the image. We are interested in the probabilities of the two classes given the image, which are given by p(C_k|x). Using Bayes' theorem, these probabilities can be expressed in the form

    p(C_k|x) = p(x|C_k) p(C_k) / p(x).    (1.77)

Note that any of the quantities appearing in Bayes' theorem can be obtained from the joint distribution p(x, C_k) by either marginalizing or conditioning with respect to the appropriate variables. We can now interpret p(C_k) as the prior probability for the class C_k, and p(C_k|x) as the corresponding posterior probability. Thus p(C_1) represents the probability that a person has cancer, before we take the X-ray measurement. Similarly, p(C_1|x) is the corresponding probability, revised using Bayes' theorem in light of the information contained in the X-ray. If our aim is to minimize the chance of assigning x to the wrong class, then intuitively we would choose the class having the higher posterior probability. We now show that this intuition is correct, and we also discuss more general criteria for making decisions.

1.5.1 Minimizing the misclassification rate

Suppose that our goal is simply to make as few misclassifications as possible. We need a rule that assigns each value of x to one of the available classes. Such a rule will divide the input space into regions R_k called decision regions, one for each class, such that all points in R_k are assigned to class C_k. The boundaries between decision regions are called decision boundaries or decision surfaces. Note that each decision region need not be contiguous but could comprise some number of disjoint regions. We shall encounter examples of decision boundaries and decision regions in later chapters.

In order to find the optimal decision rule, consider first of all the case of two classes, as in the cancer problem for instance. A mistake occurs when an input vector belonging to class C_1 is assigned to class C_2 or vice versa. The probability of this occurring is given by

    p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
               = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx.    (1.78)

We are free to choose the decision rule that assigns each point x to one of the two classes. Clearly to minimize p(mistake) we should arrange that each x is assigned to whichever class has the smaller value of the integrand in (1.78). Thus, if p(x, C_1) > p(x, C_2) for a given value of x, then we should assign that x to class C_1. From the product rule of probability we have p(x, C_k) = p(C_k|x) p(x). Because the factor p(x) is common to both terms, we can restate this result as saying that the minimum probability of making a mistake is obtained if each value of x is assigned to the class for which the posterior probability p(C_k|x) is largest.
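As an illustration, the Python sketch below (ours, with hypothetical class-conditional densities and priors; the text prescribes no particular model) evaluates the posteriors p(C_k|x) via (1.77) and assigns x to the class with the higher posterior, which is exactly the minimum misclassification rate rule.

```python
import numpy as np
from scipy.stats import norm

# Illustrative one-dimensional class-conditional densities and priors
# (hypothetical choices, not taken from the text).
priors = np.array([0.7, 0.3])              # p(C_1), p(C_2)
cond = [norm(loc=0.0, scale=1.0),          # p(x|C_1)
        norm(loc=2.0, scale=1.0)]          # p(x|C_2)

def posterior(x):
    """Posterior p(C_k|x) from Bayes' theorem (1.77)."""
    joint = np.array([c.pdf(x) * pi for c, pi in zip(cond, priors)])
    return joint / joint.sum()             # dividing by p(x) normalizes

def decide(x):
    """Minimum misclassification rate rule: pick the larger posterior."""
    return int(np.argmax(posterior(x))) + 1  # class label 1 or 2

print(posterior(1.0), decide(1.0))
```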

60 Figure 1.24 Schematic illustration of the joint probabilities p(x, C_k) for each of two classes plotted against x, together with the decision boundary x = x̂. Values of x ≥ x̂ are classified as class C_2 and hence belong to decision region R_2, whereas points x < x̂ are classified as C_1 and belong to R_1. Errors arise from the blue, green, and red regions, so that for x < x̂ the errors are due to points from class C_2 being misclassified as C_1 (represented by the sum of the red and green regions), and conversely for points in the region x ≥ x̂ the errors are due to points from class C_1 being misclassified as C_2 (represented by the blue region). As we vary the location x̂ of the decision boundary, the combined area of the blue and green regions remains constant, whereas the size of the red region varies. The optimal choice for x̂ is where the curves for p(x, C_1) and p(x, C_2) cross, corresponding to x̂ = x_0, because in this case the red region disappears. This is equivalent to the minimum misclassification rate decision rule, which assigns each value of x to the class having the higher posterior probability p(C_k|x).

This result is illustrated for two classes, and a single input variable x, in Figure 1.24.

For the more general case of K classes, it is slightly easier to maximize the probability of being correct, which is given by

    p(correct) = Σ_{k=1}^{K} p(x ∈ R_k, C_k)
               = Σ_{k=1}^{K} ∫_{R_k} p(x, C_k) dx    (1.79)

which is maximized when the regions R_k are chosen such that each x is assigned to the class for which p(x, C_k) is largest. Again, using the product rule p(x, C_k) = p(C_k|x) p(x), and noting that the factor of p(x) is common to all terms, we see that each x should be assigned to the class having the largest posterior probability p(C_k|x).

61 Figure 1.25 An example of a loss matrix with elements L_{kj} for the cancer treatment problem. The rows correspond to the true class, whereas the columns correspond to the assignment of class made by our decision criterion.

              cancer   normal
    cancer      0       1000
    normal      1          0

1.5.2 Minimizing the expected loss

For many applications, our objective will be more complex than simply minimizing the number of misclassifications. Let us consider again the medical diagnosis problem. We note that, if a patient who does not have cancer is incorrectly diagnosed as having cancer, the consequences may be some patient distress plus the need for further investigations. Conversely, if a patient with cancer is diagnosed as healthy, the result may be premature death due to lack of treatment. Thus the consequences of these two types of mistake can be dramatically different. It would clearly be better to make fewer mistakes of the second kind, even if this was at the expense of making more mistakes of the first kind.

We can formalize such issues through the introduction of a loss function, also called a cost function, which is a single, overall measure of loss incurred in taking any of the available decisions or actions. Our goal is then to minimize the total loss incurred. Note that some authors consider instead a utility function, whose value they aim to maximize. These are equivalent concepts if we take the utility to be simply the negative of the loss, and throughout this text we shall use the loss function convention. Suppose that, for a new value of x, the true class is C_k and that we assign x to class C_j (where j may or may not be equal to k). In so doing, we incur some level of loss that we denote by L_{kj}, which we can view as the k, j element of a loss matrix. For instance, in our cancer example, we might have a loss matrix of the form shown in Figure 1.25. This particular loss matrix says that there is no loss incurred if the correct decision is made, there is a loss of 1 if a healthy patient is diagnosed as having cancer, whereas there is a loss of 1000 if a patient having cancer is diagnosed as healthy.

The optimal solution is the one which minimizes the loss function. However, the loss function depends on the true class, which is unknown. For a given input vector x, our uncertainty in the true class is expressed through the joint probability distribution p(x, C_k), and so we seek instead to minimize the average loss, where the average is computed with respect to this distribution, which is given by

    E[L] = Σ_k Σ_j ∫_{R_j} L_{kj} p(x, C_k) dx.    (1.80)

Each x can be assigned independently to one of the decision regions R_j. Our goal is to choose the regions R_j in order to minimize the expected loss (1.80), which implies that for each x we should minimize Σ_k L_{kj} p(x, C_k). As before, we can use the product rule p(x, C_k) = p(C_k|x) p(x) to eliminate the common factor of p(x). Thus the decision rule that minimizes the expected loss is the one that assigns each new x to the class C_j for which the quantity

    Σ_k L_{kj} p(C_k|x)    (1.81)

is a minimum. This is clearly trivial to do, once we know the posterior class probabilities p(C_k|x).
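Minimizing (1.81) takes only a few lines once the posteriors are available. The sketch below is our illustration (the posterior value is made up; the book contains no code), using the loss matrix of Figure 1.25.

```python
import numpy as np

# Loss matrix from Figure 1.25: rows = true class (cancer, normal),
# columns = decision (cancer, normal).
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

def decide(posterior, L):
    """Pick the decision j minimizing sum_k L[k, j] * p(C_k|x), as in (1.81)."""
    expected_loss = posterior @ L      # one value per candidate decision j
    return int(np.argmin(expected_loss))

# Hypothetical posterior for one X-ray: 5% cancer, 95% normal.
post = np.array([0.05, 0.95])
print(decide(post, L))  # -> 0 (treat as cancer), since 0.05 * 1000 > 0.95 * 1
```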

62 Figure 1.26 Illustration of the reject option. Inputs x such that the larger of the two posterior probabilities p(C_k|x) is less than or equal to some threshold θ will be rejected.

1.5.3 The reject option

We have seen that classification errors arise from the regions of input space where the largest of the posterior probabilities p(C_k|x) is significantly less than unity, or equivalently where the joint distributions p(x, C_k) have comparable values. These are the regions where we are relatively uncertain about class membership. In some applications, it will be appropriate to avoid making decisions on the difficult cases in anticipation of a lower error rate on those examples for which a classification decision is made. This is known as the reject option. For example, in our hypothetical medical illustration, it may be appropriate to use an automatic system to classify those X-ray images for which there is little doubt as to the correct class, while leaving a human expert to classify the more ambiguous cases. We can achieve this by introducing a threshold θ and rejecting those inputs x for which the largest of the posterior probabilities p(C_k|x) is less than or equal to θ. This is illustrated for the case of two classes, and a single continuous input variable x, in Figure 1.26. Note that setting θ = 1 will ensure that all examples are rejected, whereas if there are K classes then setting θ < 1/K will ensure that no examples are rejected. Thus the fraction of examples that get rejected is controlled by the value of θ. We can easily extend the reject criterion to minimize the expected loss, when a loss matrix is given, taking account of the loss incurred when a reject decision is made (Exercise 1.24).
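The reject criterion is a one-line modification of the earlier decision rule. This is a minimal sketch of ours, assuming the posteriors have already been computed; the sentinel value and example numbers are illustrative only.

```python
import numpy as np

REJECT = -1  # sentinel meaning "defer to a human expert"

def decide_with_reject(posterior, theta):
    """Reject when the largest posterior p(C_k|x) is <= theta,
    otherwise assign the class with the highest posterior."""
    posterior = np.asarray(posterior)
    if posterior.max() <= theta:
        return REJECT
    return int(np.argmax(posterior))

# theta = 1 rejects everything; theta < 1/K rejects nothing (here K = 2).
print(decide_with_reject([0.55, 0.45], theta=0.8))  # -> REJECT
print(decide_with_reject([0.95, 0.05], theta=0.8))  # -> 0
```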

63 1.5.4 Inference and decision

We have broken the classification problem down into two separate stages, the inference stage in which we use training data to learn a model for p(C_k|x), and the subsequent decision stage in which we use these posterior probabilities to make optimal class assignments. An alternative possibility would be to solve both problems together and simply learn a function that maps inputs x directly into decisions. Such a function is called a discriminant function.

In fact, we can identify three distinct approaches to solving decision problems, all of which have been used in practical applications. These are given, in decreasing order of complexity, by:

(a) First solve the inference problem of determining the class-conditional densities p(x|C_k) for each class C_k individually. Also separately infer the prior class probabilities p(C_k). Then use Bayes' theorem in the form

    p(C_k|x) = p(x|C_k) p(C_k) / p(x)    (1.82)

to find the posterior class probabilities p(C_k|x). As usual, the denominator in Bayes' theorem can be found in terms of the quantities appearing in the numerator, because

    p(x) = Σ_k p(x|C_k) p(C_k).    (1.83)

Equivalently, we can model the joint distribution p(x, C_k) directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input x. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

(b) First solve the inference problem of determining the posterior class probabilities p(C_k|x), and then subsequently use decision theory to assign each new x to one of the classes. Approaches that model the posterior probabilities directly are called discriminative models.

(c) Find a function f(x), called a discriminant function, which maps each input x directly onto a class label. For instance, in the case of two-class problems, f(·) might be binary valued and such that f = 0 represents class C_1 and f = 1 represents class C_2. In this case, probabilities play no role.

Let us consider the relative merits of these three alternatives. Approach (a) is the most demanding because it involves finding the joint distribution over both x and C_k. For many applications, x will have high dimensionality, and consequently we may need a large training set in order to be able to determine the class-conditional densities to reasonable accuracy. Note that the class priors p(C_k) can often be estimated simply from the fractions of the training set data points in each of the classes. One advantage of approach (a), however, is that it also allows the marginal density of data p(x) to be determined from (1.83). This can be useful for detecting new data points that have low probability under the model and for which the predictions may be of low accuracy, which is known as outlier detection or novelty detection (Bishop, 1994; Tarassenko, 1995).
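To make approach (a) concrete, here is a minimal Python sketch of ours, assuming one-dimensional Gaussian class-conditional densities fitted by maximum likelihood (an illustrative modelling choice, not one the text prescribes); the priors come from class fractions and the posteriors from (1.82) with the denominator (1.83).

```python
import numpy as np
from scipy.stats import norm

def fit_generative(x, labels, K):
    """Approach (a): fit p(x|C_k) (here a 1-D Gaussian per class, an
    illustrative assumption) and estimate p(C_k) from class fractions."""
    priors = np.array([(labels == k).mean() for k in range(K)])
    cond = [norm(loc=x[labels == k].mean(), scale=x[labels == k].std())
            for k in range(K)]
    return priors, cond

def posteriors(x_new, priors, cond):
    """Bayes' theorem (1.82); the denominator p(x) comes from (1.83)."""
    joint = np.array([c.pdf(x_new) * pi for c, pi in zip(cond, priors)])
    return joint / joint.sum(axis=0)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 100)])
y = np.concatenate([np.zeros(200, int), np.ones(100, int)])
priors, cond = fit_generative(x, y, K=2)
print(priors, posteriors(1.5, priors, cond))
```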

64 Figure 1.27 Example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C_1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.

However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint distribution p(x, C_k) when in fact we only really need the posterior probabilities p(C_k|x), which can be obtained directly through approach (b). Indeed, the class-conditional densities may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27. There has been much interest in exploring the relative merits of generative and discriminative approaches to machine learning, and in finding ways to combine them (Jebara, 2004; Lasserre et al., 2006).

An even simpler approach is (c) in which we use the training data to find a discriminant function f(x) that maps each x directly onto a class label, thereby combining the inference and decision stages into a single learning problem. In the example of Figure 1.27, this would correspond to finding the value of x shown by the vertical green line, because this is the decision boundary giving the minimum probability of misclassification.

With option (c), however, we no longer have access to the posterior probabilities p(C_k|x). There are many powerful reasons for wanting to compute the posterior probabilities, even if we subsequently use them to make decisions. These include:

Minimizing risk. Consider a problem in which the elements of the loss matrix are subjected to revision from time to time (such as might occur in a financial application).

65 If we know the posterior probabilities, we can trivially revise the minimum risk decision criterion by modifying (1.81) appropriately. If we have only a discriminant function, then any change to the loss matrix would require that we return to the training data and solve the classification problem afresh.

Reject option. Posterior probabilities allow us to determine a rejection criterion that will minimize the misclassification rate, or more generally the expected loss, for a given fraction of rejected data points.

Compensating for class priors. Consider our medical X-ray problem again, and suppose that we have collected a large number of X-ray images from the general population for use as training data in order to build an automated screening system. Because cancer is rare amongst the general population, we might find that, say, only 1 in every 1,000 examples corresponds to the presence of cancer. If we used such a data set to train an adaptive model, we could run into severe difficulties due to the small proportion of the cancer class. For instance, a classifier that assigned every point to the normal class would already achieve 99.9% accuracy and it would be difficult to avoid this trivial solution. Also, even a large data set will contain very few examples of X-ray images corresponding to cancer, and so the learning algorithm will not be exposed to a broad range of examples of such images and hence is not likely to generalize well. A balanced data set in which we have selected equal numbers of examples from each of the classes would allow us to find a more accurate model. However, we then have to compensate for the effects of our modifications to the training data. Suppose we have used such a modified data set and found models for the posterior probabilities. From Bayes' theorem (1.82), we see that the posterior probabilities are proportional to the prior probabilities, which we can interpret as the fractions of points in each class. We can therefore simply take the posterior probabilities obtained from our artificially balanced data set and first divide by the class fractions in that data set and then multiply by the class fractions in the population to which we wish to apply the model. Finally, we need to normalize to ensure that the new posterior probabilities sum to one (a short code sketch of this adjustment follows at the end of this list of examples). Note that this procedure cannot be applied if we have learned a discriminant function directly instead of determining posterior probabilities.

Combining models. For complex applications, we may wish to break the problem into a number of smaller subproblems each of which can be tackled by a separate module. For example, in our hypothetical medical diagnosis problem, we may have information available from, say, blood tests as well as X-ray images. Rather than combine all of this heterogeneous information into one huge input space, it may be more effective to build one system to interpret the X-ray images and a different one to interpret the blood data. As long as each of the two models gives posterior probabilities for the classes, we can combine the outputs systematically using the rules of probability. One simple way to do this is to assume that, for each class separately, the distributions of inputs for the X-ray images, denoted by x_I, and the blood data, denoted by x_B, are independent, so that

    p(x_I, x_B|C_k) = p(x_I|C_k) p(x_B|C_k).    (1.84)

66 This is an example of a conditional independence property, because the independence holds when the distribution is conditioned on the class C_k (Section 8.2). The posterior probability, given both the X-ray and blood data, is then given by

    p(C_k|x_I, x_B) ∝ p(x_I, x_B|C_k) p(C_k)
                   ∝ p(x_I|C_k) p(x_B|C_k) p(C_k)
                   ∝ p(C_k|x_I) p(C_k|x_B) / p(C_k).    (1.85)

Thus we need the class prior probabilities p(C_k), which we can easily estimate from the fractions of data points in each class, and then we need to normalize the resulting posterior probabilities so they sum to one (see the sketch below). The particular conditional independence assumption (1.84) is an example of the naive Bayes model (Section 8.2.2). Note that the joint marginal distribution p(x_I, x_B) will typically not factorize under this model. We shall see in later chapters how to construct models for combining data that do not require the conditional independence assumption (1.84).
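The combination rule (1.85) takes only a few lines to implement once each module reports posterior probabilities. The following Python sketch is our illustration (the posterior and prior values are hypothetical; the book itself contains no code): it multiplies the two modules' posteriors, divides by the prior, and renormalizes.

```python
import numpy as np

def combine_posteriors(post_xray, post_blood, priors):
    """Combine two modules' posteriors using (1.85):
    p(C_k|x_I, x_B) ∝ p(C_k|x_I) p(C_k|x_B) / p(C_k),
    then normalize so the result sums to one."""
    unnorm = (np.asarray(post_xray) * np.asarray(post_blood)
              / np.asarray(priors))
    return unnorm / unnorm.sum()

# Hypothetical outputs from the two modules for one patient.
priors = np.array([0.001, 0.999])   # p(cancer), p(normal)
p_xray = np.array([0.30, 0.70])     # p(C_k | x_I)
p_blood = np.array([0.20, 0.80])    # p(C_k | x_B)
print(combine_posteriors(p_xray, p_blood, priors))
```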

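The class-prior adjustment promised under "Compensating for class priors" is equally brief. Again this is a minimal sketch of ours with made-up class fractions, not an excerpt from the text: divide the balanced-set posteriors by the training-set class fractions, multiply by the population fractions, and renormalize.

```python
import numpy as np

def reweight_posteriors(post_balanced, train_fracs, pop_fracs):
    """Adjust posteriors from a model trained on an artificially balanced
    set: divide by the training-set class fractions, multiply by the
    target-population class fractions, then renormalize."""
    unnorm = (np.asarray(post_balanced) / np.asarray(train_fracs)
              * np.asarray(pop_fracs))
    return unnorm / unnorm.sum()

# Balanced training set (50/50), but cancer is 1 in 1,000 in the population.
post = np.array([0.6, 0.4])         # posterior from the balanced model
print(reweight_posteriors(post, [0.5, 0.5], [0.001, 0.999]))
```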
67 1.5.5 Loss functions for regression

So far, we have discussed decision theory in the context of classification problems. We now turn to the case of regression problems, such as the curve fitting example discussed earlier (Section 1.1). The decision stage consists of choosing a specific estimate y(x) of the value of t for each input x. Suppose that in doing so, we incur a loss L(t, y(x)). The average, or expected, loss is then given by

    E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt.    (1.86)

A common choice of loss function in regression problems is the squared loss given by L(t, y(x)) = {y(x) − t}². In this case, the expected loss can be written

    E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.    (1.87)

Our goal is to choose y(x) so as to minimize E[L]. If we assume a completely flexible function y(x), we can do this formally using the calculus of variations (Appendix D) to give

    δE[L]/δy(x) = 2 ∫ {y(x) − t} p(x, t) dt = 0.    (1.88)

Solving for y(x), and using the sum and product rules of probability, we obtain

    y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t|x) dt = E_t[t|x]    (1.89)

which is the conditional average of t conditioned on x and is known as the regression function. This result is illustrated in Figure 1.28. It can readily be extended to multiple target variables represented by the vector t, in which case the optimal solution is the conditional average y(x) = E_t[t|x] (Exercise 1.25).

Figure 1.28 The regression function y(x), which minimizes the expected squared loss, is given by the mean of the conditional distribution p(t|x).

We can also derive this result in a slightly different way, which will also shed light on the nature of the regression problem. Armed with the knowledge that the optimal solution is the conditional expectation, we can expand the square term as follows

    {y(x) − t}² = {y(x) − E[t|x] + E[t|x] − t}²
                = {y(x) − E[t|x]}² + 2{y(x) − E[t|x]}{E[t|x] − t} + {E[t|x] − t}²

where, to keep the notation uncluttered, we use E[t|x] to denote E_t[t|x]. Substituting into the loss function and performing the integral over t, we see that the cross-term vanishes and we obtain an expression for the loss function in the form

    E[L] = ∫ {y(x) − E[t|x]}² p(x) dx + ∫∫ {E[t|x] − t}² p(x, t) dx dt.    (1.90)

The function y(x) we seek to determine enters only in the first term, which will be minimized when y(x) is equal to E[t|x], in which case this term will vanish. This is simply the result that we derived previously and that shows that the optimal least squares predictor is given by the conditional mean. The second term is the variance of the distribution of t, averaged over x. It represents the intrinsic variability of the target data and can be regarded as noise. Because it is independent of y(x), it represents the irreducible minimum value of the loss function.
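A quick Monte Carlo check of the decomposition (1.90) can be written in a few lines. The sketch below is ours, with an arbitrary synthetic choice for p(t|x) at a fixed x: the average squared loss of any estimate y exceeds that of the conditional mean by exactly {y − E[t|x]}², and the minimum equals the conditional variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic conditional distribution (an assumption for illustration):
# at a fixed x, t ~ Normal(mean=2.0, std=0.5), so E[t|x] = 2.0.
t = rng.normal(loc=2.0, scale=0.5, size=100_000)

def expected_sq_loss(y):
    return np.mean((y - t) ** 2)

# Loss at the conditional mean equals the conditional variance (~0.25);
# any other estimate pays an extra (y - E[t|x])**2, as in (1.90).
print(expected_sq_loss(t.mean()))   # ~0.25
print(expected_sq_loss(2.5))        # ~0.25 + 0.25 = ~0.5
```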

68 As with the classification problem, we can either determine the appropriate probabilities and then use these to make optimal decisions, or we can build models that make decisions directly. Indeed, we can identify three distinct approaches to solving regression problems given, in order of decreasing complexity, by:

(a) First solve the inference problem of determining the joint density p(x, t). Then normalize to find the conditional density p(t|x), and finally marginalize to find the conditional mean given by (1.89).

(b) First solve the inference problem of determining the conditional density p(t|x), and then subsequently marginalize to find the conditional mean given by (1.89).

(c) Find a regression function y(x) directly from the training data.

The relative merits of these three approaches follow the same lines as for classification problems above.

The squared loss is not the only possible choice of loss function for regression. Indeed, there are situations in which squared loss can lead to very poor results and where we need to develop more sophisticated approaches. An important example concerns situations in which the conditional distribution p(t|x) is multimodal, as often arises in the solution of inverse problems (Section 5.6). Here we consider briefly one simple generalization of the squared loss, called the Minkowski loss, whose expectation is given by

    E[L_q] = ∫∫ |y(x) − t|^q p(x, t) dx dt    (1.91)

which reduces to the expected squared loss for q = 2. The function |y − t|^q is plotted against y − t for various values of q in Figure 1.29. The minimum of E[L_q] is given by the conditional mean for q = 2, the conditional median for q = 1, and the conditional mode for q → 0 (Exercise 1.27).

Figure 1.29 Plots of the quantity L_q = |y − t|^q for various values of q.
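The claim about the minimizers of E[L_q] is easy to check empirically. The sketch below is ours, with an arbitrary skewed sample standing in for p(t|x), so that the mean and median differ: a scan over candidate values y places the q = 2 optimum at the sample mean and the q = 1 optimum at the median.

```python
import numpy as np

rng = np.random.default_rng(2)
# A skewed stand-in for p(t|x): mean = 1.0, median = ln 2 ~= 0.693.
t = rng.exponential(scale=1.0, size=20_000)

def minkowski_risk(y, q):
    """Empirical estimate of E[L_q] = E[|y - t|**q], as in (1.91)."""
    return np.mean(np.abs(y - t) ** q)

ys = np.linspace(0.0, 3.0, 301)
for q, target in [(2, t.mean()), (1, np.median(t))]:
    y_star = ys[np.argmin([minkowski_risk(y, q) for y in ys])]
    print(f"q={q}: empirical minimizer {y_star:.2f}, expected {target:.2f}")
```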

69 1.6. Information Theory

In this chapter, we have discussed a variety of concepts from probability theory and decision theory that will form the foundations for much of the subsequent discussion in this book. We close this chapter by introducing some additional concepts from the field of information theory, which will also prove useful in our development of pattern recognition and machine learning techniques. Again, we shall focus only on the key concepts, and we refer the reader elsewhere for more detailed discussions (Viterbi and Omura, 1979; Cover and Thomas, 1991; MacKay, 2003).

We begin by considering a discrete random variable x and we ask how much information is received when we observe a specific value for this variable. The amount of information can be viewed as the 'degree of surprise' on learning the value of x. If we are told that a highly improbable event has just occurred, we will have received more information than if we were told that some very likely event has just occurred, and if we knew that the event was certain to happen we would receive no information. Our measure of information content will therefore depend on the probability distribution p(x), and we therefore look for a quantity h(x) that is a monotonic function of the probability p(x) and that expresses the information content. The form of h(·) can be found by noting that if we have two events x and y that are unrelated, then the information gain from observing both of them should be the sum of the information gained from each of them separately, so that h(x, y) = h(x) + h(y). Two unrelated events will be statistically independent and so p(x, y) = p(x) p(y). From these two relationships, it is easily shown that h(x) must be given by the logarithm of p(x) and so we have (Exercise 1.28)

    h(x) = −log₂ p(x)    (1.92)

where the negative sign ensures that information is positive or zero. Note that low probability events x correspond to high information content. The choice of basis for the logarithm is arbitrary, and for the moment we shall adopt the convention prevalent in information theory of using logarithms to the base of 2. In this case, as we shall see shortly, the units of h(x) are bits ('binary digits').

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution p(x) and is given by

    H[x] = −Σ_x p(x) log₂ p(x).    (1.93)

This important quantity is called the entropy of the random variable x. Note that lim_{p→0} p ln p = 0 and so we shall take p(x) ln p(x) = 0 whenever we encounter a value for x such that p(x) = 0.

70 So far we have given a rather heuristic motivation for the definition of information (1.92) and the corresponding entropy (1.93). We now show that these definitions indeed possess useful properties. Consider a random variable x having 8 possible states, each of which is equally likely. In order to communicate the value of x to a receiver, we would need to transmit a message of length 3 bits. Notice that the entropy of this variable is given by

    H[x] = −8 × (1/8) log₂ (1/8) = 3 bits.

Now consider an example (Cover and Thomas, 1991) of a variable having 8 possible states {a, b, c, d, e, f, g, h} for which the respective probabilities are given by (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy in this case is given by

    H[x] = −(1/2) log₂ (1/2) − (1/4) log₂ (1/4) − (1/8) log₂ (1/8) − (1/16) log₂ (1/16) − (4/64) log₂ (1/64) = 2 bits.

We see that the nonuniform distribution has a smaller entropy than the uniform one, and we shall gain some insight into this shortly when we discuss the interpretation of entropy in terms of disorder. For the moment, let us consider how we would transmit the identity of the variable's state to a receiver. We could do this, as before, using a 3-bit number. However, we can take advantage of the nonuniform distribution by using shorter codes for the more probable events, at the expense of longer codes for the less probable events, in the hope of getting a shorter average code length. This can be done by representing the states {a, b, c, d, e, f, g, h} using, for instance, the following set of code strings: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. The average length of the code that has to be transmitted is then

    average code length = (1/2) × 1 + (1/4) × 2 + (1/8) × 3 + (1/16) × 4 + 4 × (1/64) × 6 = 2 bits

which again is the same as the entropy of the random variable. Note that shorter code strings cannot be used because it must be possible to disambiguate a concatenation of such strings into its component parts. For instance, 11001110 decodes uniquely into the state sequence c, a, d.

This relation between entropy and shortest coding length is a general one. The noiseless coding theorem (Shannon, 1948) states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.

From now on, we shall switch to the use of natural logarithms in defining entropy, as this will provide a more convenient link with ideas elsewhere in this book. In this case, the entropy is measured in units of 'nats' instead of bits, which differ simply by a factor of ln 2.
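Both numbers in this example are quick to verify in code. The small Python check below is ours (not the book's) and confirms that the entropy (1.93) of the nonuniform distribution and the average length of the given prefix code both come to 2 bits.

```python
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])  # 0, 10, 110, 1110, 111100, ...

entropy_bits = -np.sum(p * np.log2(p))   # H[x] from (1.93)
avg_length = np.sum(p * code_lengths)    # expected code length

print(entropy_bits, avg_length)          # both -> 2.0
```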

71 We have introduced the concept of entropy in terms of the average amount of information needed to specify the state of a random variable. In fact, the concept of entropy has much earlier origins in physics where it was introduced in the context of equilibrium thermodynamics and later given a deeper interpretation as a measure of disorder through developments in statistical mechanics. We can understand this alternative view of entropy by considering a set of N identical objects that are to be divided amongst a set of bins, such that there are n_i objects in the i-th bin. Consider the number of different ways of allocating the objects to the bins. There are N ways to choose the first object, (N − 1) ways to choose the second object, and so on, leading to a total of N! ways to allocate all N objects to the bins, where N! (pronounced 'factorial N') denotes the product N × (N − 1) × ··· × 2 × 1. However, we don't wish to distinguish between rearrangements of objects within each bin. In the i-th bin there are n_i! ways of reordering the objects, and so the total number of ways of allocating the N objects to the bins is given by

    W = N! / Π_i n_i!    (1.94)

which is called the multiplicity. The entropy is then defined as the logarithm of the multiplicity scaled by an appropriate constant

    H = (1/N) ln W = (1/N) ln N! − (1/N) Σ_i ln n_i!.    (1.95)

We now consider the limit N → ∞, in which the fractions n_i/N are held fixed, and apply Stirling's approximation

    ln N! ≃ N ln N − N    (1.96)

which gives

    H = −lim_{N→∞} Σ_i (n_i/N) ln (n_i/N) = −Σ_i p_i ln p_i    (1.97)

where we have used Σ_i n_i = N. Here p_i = lim_{N→∞} (n_i/N) is the probability of an object being assigned to the i-th bin. In physics terminology, the specific arrangements of objects in the bins is called a microstate, and the overall distribution of occupation numbers, expressed through the ratios n_i/N, is called a macrostate. The multiplicity W is also known as the weight of the macrostate.

We can interpret the bins as the states x_i of a discrete random variable X, where p(X = x_i) = p_i. The entropy of the random variable X is then

    H[p] = −Σ_i p(x_i) ln p(x_i).    (1.98)

Distributions p(x_i) that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy, as illustrated in Figure 1.30. Because 0 ≤ p_i ≤ 1, the entropy is nonnegative, and it will equal its minimum value of 0 when one of the p_i = 1 and all other p_{j≠i} = 0. The maximum entropy configuration can be found by maximizing H using a Lagrange multiplier to enforce the normalization constraint on the probabilities (Appendix E). Thus we maximize

    H̃ = −Σ_i p(x_i) ln p(x_i) + λ ( Σ_i p(x_i) − 1 )    (1.99)

72 Figure 1.30 Histograms of two probability distributions over 30 bins illustrating the higher value of the entropy H for the broader distribution (H = 1.77 for the sharply peaked distribution and H = 3.09 for the broader one). The largest entropy would arise from a uniform distribution that would give H = −ln(1/30) = 3.40.

from which we find that all of the p(x_i) are equal and are given by p(x_i) = 1/M where M is the total number of states x_i. The corresponding value of the entropy is then H = ln M. This result can also be derived from Jensen's inequality (to be discussed shortly). To verify that the stationary point is indeed a maximum, we can evaluate the second derivative of the entropy (Exercise 1.29), which gives

    ∂²H̃ / (∂p(x_i) ∂p(x_j)) = −I_{ij} / p(x_i)    (1.100)

where I_{ij} are the elements of the identity matrix.

We can extend the definition of entropy to include distributions p(x) over continuous variables x as follows. First divide x into bins of width Δ. Then, assuming p(x) is continuous, the mean value theorem (Weisstein, 1999) tells us that, for each such bin, there must exist a value x_i such that

    ∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i) Δ.    (1.101)

We can now quantize the continuous variable x by assigning any value x to the value x_i whenever x falls in the i-th bin. The probability of observing the value x_i is then p(x_i)Δ. This gives a discrete distribution for which the entropy takes the form

    H_Δ = −Σ_i p(x_i)Δ ln (p(x_i)Δ) = −Σ_i p(x_i)Δ ln p(x_i) − ln Δ    (1.102)

where we have used Σ_i p(x_i)Δ = 1, which follows from (1.101). We now omit the second term −ln Δ on the right-hand side of (1.102) and then consider the limit Δ → 0.

73 The first term on the right-hand side of (1.102) will approach the integral of p(x) ln p(x) in this limit so that

    lim_{Δ→0} { −Σ_i p(x_i)Δ ln p(x_i) } = −∫ p(x) ln p(x) dx    (1.103)

where the quantity on the right-hand side is called the differential entropy. We see that the discrete and continuous forms of the entropy differ by a quantity ln Δ, which diverges in the limit Δ → 0. This reflects the fact that to specify a continuous variable very precisely requires a large number of bits. For a density defined over multiple continuous variables, denoted collectively by the vector x, the differential entropy is given by

    H[x] = −∫ p(x) ln p(x) dx.    (1.104)

Ludwig Boltzmann (1844–1906). Ludwig Eduard Boltzmann was an Austrian physicist who created the field of statistical mechanics. Prior to Boltzmann, the concept of entropy was already known from classical thermodynamics where it quantifies the fact that when we take energy from a system, not all of that energy is typically available to do useful work. Boltzmann showed that the thermodynamic entropy S, a macroscopic quantity, could be related to the statistical properties at the microscopic level. This is expressed through the famous equation S = k ln W in which W represents the number of possible microstates in a macrostate, and k ≃ 1.38 × 10⁻²³ (in units of Joules per Kelvin) is known as Boltzmann's constant. Boltzmann's ideas were disputed by many scientists of the day. One difficulty they saw arose from the second law of thermodynamics, which states that the entropy of a closed system tends to increase with time. By contrast, at the microscopic level the classical Newtonian equations of physics are reversible, and so they found it difficult to see how the latter could explain the former. They didn't fully appreciate Boltzmann's arguments, which were statistical in nature and which concluded not that entropy could never decrease over time but simply that with overwhelming probability it would generally increase. Boltzmann even had a long-running dispute with the editor of the leading German physics journal who refused to let him refer to atoms and molecules as anything other than convenient theoretical constructs. The continued attacks on his work led to bouts of depression, and eventually he committed suicide. Shortly after Boltzmann's death, new experiments by Perrin on colloidal suspensions verified his theories and confirmed the value of the Boltzmann constant. The equation S = k ln W is carved on Boltzmann's tombstone.

In the case of discrete distributions, we saw that the maximum entropy configuration corresponded to an equal distribution of probabilities across the possible states of the variable. Let us now consider the maximum entropy configuration for a continuous variable. In order for this maximum to be well defined, it will be necessary to constrain the first and second moments of p(x) as well as preserving the normalization constraint. We therefore maximize the differential entropy with the

74 three constraints

    ∫_{−∞}^{∞} p(x) dx = 1    (1.105)

    ∫_{−∞}^{∞} x p(x) dx = μ    (1.106)

    ∫_{−∞}^{∞} (x − μ)² p(x) dx = σ².    (1.107)

The constrained maximization can be performed using Lagrange multipliers (Appendix E) so that we maximize the following functional with respect to p(x)

    −∫_{−∞}^{∞} p(x) ln p(x) dx + λ₁ ( ∫_{−∞}^{∞} p(x) dx − 1 )
    + λ₂ ( ∫_{−∞}^{∞} x p(x) dx − μ ) + λ₃ ( ∫_{−∞}^{∞} (x − μ)² p(x) dx − σ² ).

Using the calculus of variations (Appendix D), we set the derivative of this functional to zero giving

    p(x) = exp{ −1 + λ₁ + λ₂ x + λ₃ (x − μ)² }.    (1.108)

The Lagrange multipliers can be found by back substitution of this result into the three constraint equations, leading finally to the result (Exercise 1.34)

    p(x) = (1 / (2πσ²)^{1/2}) exp{ −(x − μ)² / (2σ²) }    (1.109)

and so the distribution that maximizes the differential entropy is the Gaussian. Note that we did not constrain the distribution to be nonnegative when we maximized the entropy. However, because the resulting distribution is indeed nonnegative, we see with hindsight that such a constraint is not necessary.

If we evaluate the differential entropy of the Gaussian, we obtain (Exercise 1.35)

    H[x] = (1/2) { 1 + ln(2πσ²) }.    (1.110)

Thus we see again that the entropy increases as the distribution becomes broader, i.e., as σ² increases. This result also shows that the differential entropy, unlike the discrete entropy, can be negative, because H(x) < 0 in (1.110) for σ² < 1/(2πe).
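The closed form (1.110) is easy to sanity-check numerically. The sketch below is ours: it compares (1.110) against a direct Riemann-sum estimate of −∫ p ln p for a Gaussian, and also shows the entropy going negative once σ² < 1/(2πe).

```python
import numpy as np

def gaussian_entropy_numeric(sigma, mu=0.0):
    """Approximate H[x] = -∫ p(x) ln p(x) dx with a Riemann sum."""
    x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20_001)
    dx = x[1] - x[0]
    p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.sum(p * np.log(p)) * dx

for sigma in (1.0, 0.2):
    closed_form = 0.5 * (1 + np.log(2 * np.pi * sigma ** 2))  # (1.110)
    print(sigma, gaussian_entropy_numeric(sigma), closed_form)
# For sigma = 0.2, sigma**2 < 1/(2*pi*e) ~= 0.0585, so the entropy is negative.
```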

75 Suppose we have a joint distribution p(x, y) from which we draw pairs of values of x and y. If a value of x is already known, then the additional information needed to specify the corresponding value of y is given by −ln p(y|x). Thus the average additional information needed to specify y can be written as

    H[y|x] = −∫∫ p(y, x) ln p(y|x) dy dx    (1.111)

which is called the conditional entropy of y given x. It is easily seen, using the product rule, that the conditional entropy satisfies the relation (Exercise 1.37)

    H[x, y] = H[y|x] + H[x]    (1.112)

where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential entropy of the marginal distribution p(x). Thus the information needed to describe x and y is given by the sum of the information needed to describe x alone plus the additional information required to specify y given x.

1.6.1 Relative entropy and mutual information

So far in this section, we have introduced a number of concepts from information theory, including the key notion of entropy. We now start to relate these ideas to pattern recognition. Consider some unknown distribution p(x), and suppose that we have modelled this using an approximating distribution q(x). If we use q(x) to construct a coding scheme for the purpose of transmitting values of x to a receiver, then the average additional amount of information (in nats) required to specify the value of x (assuming we choose an efficient coding scheme) as a result of using q(x) instead of the true distribution p(x) is given by

    KL(p‖q) = −∫ p(x) ln q(x) dx − ( −∫ p(x) ln p(x) dx )
            = −∫ p(x) ln { q(x) / p(x) } dx.    (1.113)

This is known as the relative entropy or Kullback-Leibler divergence, or KL divergence (Kullback and Leibler, 1951), between the distributions p(x) and q(x). Note that it is not a symmetrical quantity, that is to say KL(p‖q) ≢ KL(q‖p).

We now show that the Kullback-Leibler divergence satisfies KL(p‖q) ≥ 0 with equality if, and only if, p(x) = q(x).

Claude Shannon (1916–2001). After graduating from Michigan and MIT, Shannon joined the AT&T Bell Telephone laboratories in 1941. His paper 'A Mathematical Theory of Communication' published in the Bell System Technical Journal in 1948 laid the foundations for modern information theory. This paper introduced the word 'bit', and his concept that information could be sent as a stream of 1s and 0s paved the way for the communications revolution. It is said that von Neumann recommended to Shannon that he use the term entropy, not only because of its similarity to the quantity used in physics, but also because "nobody knows what entropy really is, so in any discussion you will always have an advantage".
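For discrete distributions the definition (1.113) becomes a simple sum. The sketch below is ours (the example distributions are arbitrary) and exhibits the nonnegativity, the asymmetry, and the equality case.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p‖q) = -Σ p ln(q/p), the discrete analogue of (1.113)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                    # terms with p = 0 contribute nothing
    return -np.sum(p[mask] * np.log(q[mask] / p[mask]))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q), kl_divergence(q, p))  # nonnegative and unequal
print(kl_divergence(p, p))                       # 0 when q = p
```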

76 To do this we first introduce the concept of convex functions. A function f(x) is said to be convex if it has the property that every chord lies on or above the function, as shown in Figure 1.31. Any value of x in the interval from x = a to x = b can be written in the form λa + (1 − λ)b where 0 ≤ λ ≤ 1. The corresponding point on the chord is given by λf(a) + (1 − λ)f(b), and the corresponding value of the function is f(λa + (1 − λ)b). Convexity then implies

    f(λa + (1 − λ)b) ≤ λf(a) + (1 − λ)f(b).    (1.114)

Figure 1.31 A convex function f(x) is one for which every chord (shown in blue) lies on or above the function (shown in red).

This is equivalent to the requirement that the second derivative of the function be everywhere positive (Exercise 1.36). Examples of convex functions are x ln x (for x > 0) and x². A function is called strictly convex if the equality is satisfied only for λ = 0 and λ = 1. If a function has the opposite property, namely that every chord lies on or below the function, it is called concave, with a corresponding definition for strictly concave. If a function f(x) is convex, then −f(x) will be concave.

Using the technique of proof by induction, we can show from (1.114) that a convex function f(x) satisfies (Exercise 1.38)

    f( Σ_{i=1}^{M} λ_i x_i ) ≤ Σ_{i=1}^{M} λ_i f(x_i)    (1.115)

where λ_i ≥ 0 and Σ_i λ_i = 1, for any set of points {x_i}. The result (1.115) is known as Jensen's inequality. If we interpret the λ_i as the probability distribution over a discrete variable x taking the values {x_i}, then (1.115) can be written

    f(E[x]) ≤ E[f(x)]    (1.116)

where E[·] denotes the expectation. For continuous variables, Jensen's inequality takes the form

    f( ∫ x p(x) dx ) ≤ ∫ f(x) p(x) dx.    (1.117)

We can apply Jensen's inequality in the form (1.117) to the Kullback-Leibler divergence (1.113) to give

    KL(p‖q) = −∫ p(x) ln { q(x) / p(x) } dx ≥ −ln ∫ q(x) dx = 0    (1.118)

77 where we have used the fact that −ln x is a convex function, together with the normalization condition ∫ q(x) dx = 1. In fact, −ln x is a strictly convex function, so the equality will hold if, and only if, q(x) = p(x) for all x. Thus we can interpret the Kullback-Leibler divergence as a measure of the dissimilarity of the two distributions p(x) and q(x).

We see that there is an intimate relationship between data compression and density estimation (i.e., the problem of modelling an unknown probability distribution) because the most efficient compression is achieved when we know the true distribution. If we use a distribution that is different from the true one, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions.

Suppose that data is being generated from an unknown distribution p(x) that we wish to model. We can try to approximate this distribution using some parametric distribution q(x|θ), governed by a set of adjustable parameters θ, for example a multivariate Gaussian. One way to determine θ is to minimize the Kullback-Leibler divergence between p(x) and q(x|θ) with respect to θ. We cannot do this directly because we don't know p(x). Suppose, however, that we have observed a finite set of training points x_n, for n = 1, ..., N, drawn from p(x). Then the expectation with respect to p(x) can be approximated by a finite sum over these points, using (1.35), so that

    KL(p‖q) ≃ (1/N) Σ_{n=1}^{N} { −ln q(x_n|θ) + ln p(x_n) }.    (1.119)

The second term on the right-hand side of (1.119) is independent of θ, and the first term is the negative log likelihood function for θ under the distribution q(x|θ) evaluated using the training set. Thus we see that minimizing this Kullback-Leibler divergence is equivalent to maximizing the likelihood function.

Now consider the joint distribution between two sets of variables x and y given by p(x, y). If the sets of variables are independent, then their joint distribution will factorize into the product of their marginals p(x, y) = p(x) p(y). If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of the marginals, given by

    I[x, y] ≡ KL( p(x, y) ‖ p(x) p(y) )
            = −∫∫ p(x, y) ln { p(x) p(y) / p(x, y) } dx dy    (1.120)

which is called the mutual information between the variables x and y. From the properties of the Kullback-Leibler divergence, we see that I[x, y] ≥ 0 with equality if, and only if, x and y are independent. Using the sum and product rules of probability, we see that the mutual information is related to the conditional entropy through (Exercise 1.41)

    I[x, y] = H[x] − H[x|y] = H[y] − H[y|x].    (1.121)
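For a discrete joint distribution the mutual information (1.120) can be computed directly from the joint table. This sketch is ours (the joint tables are arbitrary examples) and confirms that a factorizing joint gives I[x, y] = 0.

```python
import numpy as np

def mutual_information(joint):
    """I[x, y] = Σ p(x,y) ln[ p(x,y) / (p(x) p(y)) ], the discrete
    analogue of (1.120)."""
    joint = np.asarray(joint, float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log((joint / (px * py))[mask]))

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])
print(mutual_information(dependent))    # > 0
print(mutual_information(independent))  # 0 for a factorizing joint
```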

78 Thus we can view the mutual information as the reduction in the uncertainty about x by virtue of being told the value of y (or vice versa). From a Bayesian perspective, we can view p(x) as the prior distribution for x and p(x|y) as the posterior distribution after we have observed new data y. The mutual information therefore represents the reduction in uncertainty about x as a consequence of the new observation y.

Exercises

1.1 (www) Consider the sum-of-squares error function given by (1.2) in which the function y(x, w) is given by the polynomial (1.1). Show that the coefficients w = {w_i} that minimize this error function are given by the solution to the following set of linear equations

    Σ_{j=0}^{M} A_{ij} w_j = T_i    (1.122)

where

    A_{ij} = Σ_{n=1}^{N} (x_n)^{i+j},    T_i = Σ_{n=1}^{N} (x_n)^{i} t_n.    (1.123)

Here a suffix i or j denotes the index of a component, whereas (x)^i denotes x raised to the power of i.

1.2 Write down the set of coupled linear equations, analogous to (1.122), satisfied by the coefficients w_i which minimize the regularized sum-of-squares error function given by (1.4).

1.3 Suppose that we have three coloured boxes r (red), b (blue), and g (green). Box r contains 3 apples, 4 oranges, and 3 limes, box b contains 1 apple, 1 orange, and 0 limes, and box g contains 3 apples, 3 oranges, and 4 limes. If a box is chosen at random with probabilities p(r) = 0.2, p(b) = 0.2, p(g) = 0.6, and a piece of fruit is removed from the box (with equal probability of selecting any of the items in the box), then what is the probability of selecting an apple? If we observe that the selected fruit is in fact an orange, what is the probability that it came from the green box?

1.4 (www) Consider a probability density p(x) defined over a continuous variable x, and suppose that we make a nonlinear change of variable using x = g(y), so that the density transforms according to (1.27). By differentiating (1.27), show that the location ŷ of the maximum of the density in y is not in general related to the location x̂ of the maximum of the density over x by the simple functional relation x̂ = g(ŷ) as a consequence of the Jacobian factor. This shows that the maximum of a probability density (in contrast to a simple function) is dependent on the choice of variable. Verify that, in the case of a linear transformation, the location of the maximum transforms in the same way as the variable itself.

1.5 Using the definition (1.38) show that var[f(x)] satisfies (1.39).

79 1.6 Show that if two variables x and y are independent, then their covariance is zero.

1.7 (www) In this exercise, we prove the normalization condition (1.48) for the univariate Gaussian. To do this consider the integral

    I = ∫_{−∞}^{∞} exp( −x² / (2σ²) ) dx    (1.124)

which we can evaluate by first writing its square in the form

    I² = ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp( −x² / (2σ²) − y² / (2σ²) ) dx dy.    (1.125)

Now make the transformation from Cartesian coordinates (x, y) to polar coordinates (r, θ) and then substitute u = r². Show that, by performing the integrals over θ and u, and then taking the square root of both sides, we obtain

    I = (2πσ²)^{1/2}.    (1.126)

Finally, use this result to show that the Gaussian distribution N(x|μ, σ²) is normalized.

1.8 (www) By using a change of variables, verify that the univariate Gaussian distribution given by (1.46) satisfies (1.49). Next, by differentiating both sides of the normalization condition

    ∫_{−∞}^{∞} N(x|μ, σ²) dx = 1    (1.127)

with respect to σ², verify that the Gaussian satisfies (1.50). Finally, show that (1.51) holds.

1.9 (www) Show that the mode (i.e. the maximum) of the Gaussian distribution (1.46) is given by μ. Similarly, show that the mode of the multivariate Gaussian (1.52) is given by μ.

1.10 (www) Suppose that the two variables x and z are statistically independent. Show that the mean and variance of their sum satisfies

    E[x + z] = E[x] + E[z]    (1.128)
    var[x + z] = var[x] + var[z].    (1.129)

1.11 By setting the derivatives of the log likelihood function (1.54) with respect to μ and σ² equal to zero, verify the results (1.55) and (1.56).

80 1.12 (www) Using the results (1.49) and (1.50), show that

    E[x_n x_m] = μ² + I_{nm} σ²    (1.130)

where x_n and x_m denote data points sampled from a Gaussian distribution with mean μ and variance σ², and I_{nm} satisfies I_{nm} = 1 if n = m and I_{nm} = 0 otherwise. Hence prove the results (1.57) and (1.58).

1.13 Suppose that the variance of a Gaussian is estimated using the result (1.56) but with the maximum likelihood estimate μ_ML replaced with the true value μ of the mean. Show that this estimator has the property that its expectation is given by the true variance σ².

1.14 Show that an arbitrary square matrix with elements w_{ij} can be written in the form w_{ij} = w_{ij}^S + w_{ij}^A where w_{ij}^S and w_{ij}^A are symmetric and anti-symmetric matrices, respectively, satisfying w_{ij}^S = w_{ji}^S and w_{ij}^A = −w_{ji}^A for all i and j. Now consider the second order term in a higher order polynomial in D dimensions, given by

    Σ_{i=1}^{D} Σ_{j=1}^{D} w_{ij} x_i x_j.    (1.131)

Show that

    Σ_{i=1}^{D} Σ_{j=1}^{D} w_{ij} x_i x_j = Σ_{i=1}^{D} Σ_{j=1}^{D} w_{ij}^S x_i x_j    (1.132)

so that the contribution from the anti-symmetric matrix vanishes. We therefore see that, without loss of generality, the matrix of coefficients w_{ij} can be chosen to be symmetric, and so not all of the D² elements of this matrix can be chosen independently. Show that the number of independent parameters in the matrix w_{ij}^S is given by D(D + 1)/2.

1.15 (www) In this exercise and the next, we explore how the number of independent parameters in a polynomial grows with the order M of the polynomial and with the dimensionality D of the input space. We start by writing down the M-th order term for a polynomial in D dimensions in the form

    Σ_{i_1=1}^{D} Σ_{i_2=1}^{D} ··· Σ_{i_M=1}^{D} w_{i_1 i_2 ··· i_M} x_{i_1} x_{i_2} ··· x_{i_M}.    (1.133)

The coefficients w_{i_1 i_2 ··· i_M} comprise D^M elements, but the number of independent parameters is significantly fewer due to the many interchange symmetries of the factor x_{i_1} x_{i_2} ··· x_{i_M}. Begin by showing that the redundancy in the coefficients can be removed by rewriting this M-th order term in the form

    Σ_{i_1=1}^{D} Σ_{i_2=1}^{i_1} ··· Σ_{i_M=1}^{i_{M−1}} w̃_{i_1 i_2 ··· i_M} x_{i_1} x_{i_2} ··· x_{i_M}.    (1.134)

81 Note that the precise relationship between the w coefficients and the w̃ coefficients need not be made explicit. Use this result to show that the number of independent parameters n(D, M), which appear at order M, satisfies the following recursion relation

    n(D, M) = Σ_{i=1}^{D} n(i, M − 1).    (1.135)

Next use proof by induction to show that the following result holds

    Σ_{i=1}^{D} (i + M − 2)! / ( (i − 1)! (M − 1)! ) = (D + M − 1)! / ( (D − 1)! M! )    (1.136)

which can be done by first proving the result for D = 1 and arbitrary M by making use of the result 0! = 1, then assuming it is correct for dimension D and verifying that it is correct for dimension D + 1. Finally, use the two previous results, together with proof by induction, to show

    n(D, M) = (D + M − 1)! / ( (D − 1)! M! ).    (1.137)

To do this, first show that the result is true for M = 2, and any value of D ≥ 1, by comparison with the result of Exercise 1.14. Then make use of (1.135), together with (1.136), to show that, if the result holds at order M − 1, then it will also hold at order M.

1.16 In Exercise 1.15, we proved the result (1.135) for the number of independent parameters in the M-th order term of a D-dimensional polynomial. We now find an expression for the total number N(D, M) of independent parameters in all of the terms up to and including the M-th order. First show that N(D, M) satisfies

    N(D, M) = Σ_{m=0}^{M} n(D, m)    (1.138)

where n(D, m) is the number of independent parameters in the term of order m. Now make use of the result (1.137), together with proof by induction, to show that

    N(D, M) = (D + M)! / (D! M!).    (1.139)

This can be done by first proving that the result holds for M = 0 and arbitrary D ≥ 1, then assuming that it holds at order M, and hence showing that it holds at order M + 1. Finally, make use of Stirling's approximation in the form

    n! ≃ n^n e^{−n}    (1.140)

for large n, to show that, for D ≫ M, the quantity N(D, M) grows like D^M, and for M ≫ D it grows like M^D. Consider a cubic (M = 3) polynomial in D dimensions, and evaluate numerically the total number of independent parameters for (i) D = 10 and (ii) D = 100, which correspond to typical small-scale and medium-scale machine learning applications.
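These counting formulas are easy to check in code. The sketch below is ours: it verifies the recursion (1.135) against the closed form (1.137) and evaluates the totals N(D, 3) asked for in Exercise 1.16.

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def n(D, M):
    """Independent parameters at order M via the recursion (1.135);
    n(D, 0) = 1 for the single constant term."""
    if M == 0:
        return 1
    return sum(n(i, M - 1) for i in range(1, D + 1))

def n_closed(D, M):
    """Closed form (1.137): (D + M - 1)! / ((D - 1)! M!)."""
    return comb(D + M - 1, M)

def N(D, M):
    """Total parameters up to order M, as in (1.138)/(1.139)."""
    return sum(n(D, m) for m in range(M + 1))

assert all(n(D, M) == n_closed(D, M) for D in range(1, 8) for M in range(6))
print(N(10, 3), N(100, 3))  # 286 and 176851, matching (D + M)! / (D! M!)
```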

82 1.17 (www) The gamma function is defined by

    Γ(x) ≡ ∫_0^∞ u^{x−1} e^{−u} du.    (1.141)

Using integration by parts, prove the relation Γ(x + 1) = x Γ(x). Show also that Γ(1) = 1 and hence that Γ(x + 1) = x! when x is an integer.

1.18 (www) We can use the result (1.126) to derive an expression for the surface area S_D, and the volume V_D, of a sphere of unit radius in D dimensions. To do this, consider the following result, which is obtained by transforming from Cartesian to polar coordinates

    Π_{i=1}^{D} ∫_{−∞}^{∞} e^{−x_i²} dx_i = S_D ∫_0^∞ e^{−r²} r^{D−1} dr.    (1.142)

Using the definition (1.141) of the Gamma function, together with (1.126), evaluate both sides of this equation, and hence show that

    S_D = 2π^{D/2} / Γ(D/2).    (1.143)

Next, by integrating with respect to radius from 0 to 1, show that the volume of the unit sphere in D dimensions is given by

    V_D = S_D / D.    (1.144)

Finally, use the results Γ(1) = 1 and Γ(3/2) = √π / 2 to show that (1.143) and (1.144) reduce to the usual expressions for D = 2 and D = 3.

1.19 Consider a sphere of radius a in D dimensions together with the concentric hypercube of side 2a, so that the sphere touches the hypercube at the centres of each of its sides. By using the results of Exercise 1.18, show that the ratio of the volume of the sphere to the volume of the cube is given by

    volume of sphere / volume of cube = π^{D/2} / ( D 2^{D−1} Γ(D/2) ).    (1.145)

Now make use of Stirling's formula in the form

    Γ(x + 1) ≃ (2π)^{1/2} e^{−x} x^{x+1/2}    (1.146)

which is valid for x ≫ 1, to show that, as D → ∞, the ratio (1.145) goes to zero. Show also that the ratio of the distance from the centre of the hypercube to one of the corners, divided by the perpendicular distance to one of the sides, is √D, which therefore goes to ∞ as D → ∞. From these results we see that, in a space of high dimensionality, most of the volume of a cube is concentrated in the large number of corners, which themselves become very long 'spikes'!
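A few lines of Python (our sketch, using scipy's log-gamma function) make the collapse of the ratio (1.145) concrete; the computation is done in log space to avoid overflow at large D.

```python
import numpy as np
from scipy.special import gammaln

def sphere_to_cube_ratio(D):
    """Ratio (1.145): pi**(D/2) / (D * 2**(D-1) * Gamma(D/2))."""
    log_ratio = ((D / 2) * np.log(np.pi) - np.log(D)
                 - (D - 1) * np.log(2.0) - gammaln(D / 2))
    return np.exp(log_ratio)

for D in (1, 2, 3, 10, 20, 100):
    print(D, sphere_to_cube_ratio(D))  # pi/4 at D = 2, pi/6 at D = 3,
                                       # then tends rapidly to zero
```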

83 Exercises 63

1.20 www In this exercise, we explore the behaviour of the Gaussian distribution in high-dimensional spaces. Consider a Gaussian distribution in $D$ dimensions given by
$$p(\mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{\|\mathbf{x}\|^2}{2\sigma^2}\right). \tag{1.147}$$
We wish to find the density with respect to radius in polar coordinates in which the direction variables have been integrated out. To do this, show that the integral of the probability density over a thin shell of radius $r$ and thickness $\epsilon$, where $\epsilon \ll 1$, is given by $p(r)\epsilon$ where
$$p(r) = \frac{S_D\, r^{D-1}}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{r^2}{2\sigma^2}\right) \tag{1.148}$$
where $S_D$ is the surface area of a unit sphere in $D$ dimensions. Show that the function $p(r)$ has a single stationary point located, for large $D$, at $\widehat{r} \simeq \sqrt{D}\,\sigma$. By considering $p(\widehat{r} + \epsilon)$ where $\epsilon \ll \widehat{r}$, show that for large $D$,
$$p(\widehat{r} + \epsilon) = p(\widehat{r}) \exp\left(-\frac{3\epsilon^2}{2\sigma^2}\right) \tag{1.149}$$
which shows that $\widehat{r}$ is a maximum of the radial probability density and also that $p(r)$ decays exponentially away from its maximum at $\widehat{r}$ with length scale $\sigma$. We have already seen that $\sigma \ll \widehat{r}$ for large $D$, and so we see that most of the probability mass is concentrated in a thin shell at large radius. Finally, show that the probability density $p(\mathbf{x})$ is larger at the origin than at the radius $\widehat{r}$ by a factor of $\exp(D/2)$. We therefore see that most of the probability mass in a high-dimensional Gaussian distribution is located at a different radius from the region of high probability density. This property of distributions in spaces of high dimensionality will have important consequences when we consider Bayesian inference of model parameters in later chapters.

1.21 Consider two nonnegative numbers $a$ and $b$, and show that, if $a \leqslant b$, then $a \leqslant (ab)^{1/2}$. Use this result to show that, if the decision regions of a two-class classification problem are chosen to minimize the probability of misclassification, this probability will satisfy
$$p(\text{mistake}) \leqslant \int \left\{p(\mathbf{x}, \mathcal{C}_1)\, p(\mathbf{x}, \mathcal{C}_2)\right\}^{1/2} \mathrm{d}\mathbf{x}. \tag{1.150}$$

1.22 www Given a loss matrix with elements $L_{kj}$, the expected risk is minimized if, for each $\mathbf{x}$, we choose the class that minimizes (1.81). Verify that, when the loss matrix is given by $L_{kj} = 1 - I_{kj}$, where $I_{kj}$ are the elements of the identity matrix, this reduces to the criterion of choosing the class having the largest posterior probability. What is the interpretation of this form of loss matrix?

1.23 Derive the criterion for minimizing the expected loss when there is a general loss matrix and general prior probabilities for the classes.

84 64 1. INTRODUCTION

1.24 www Consider a classification problem in which the loss incurred when an input vector from class $\mathcal{C}_k$ is classified as belonging to class $\mathcal{C}_j$ is given by the loss matrix $L_{kj}$, and for which the loss incurred in selecting the reject option is $\lambda$. Find the decision criterion that will give the minimum expected loss. Verify that this reduces to the reject criterion discussed in Section 1.5.3 when the loss matrix is given by $L_{kj} = 1 - I_{kj}$. What is the relationship between $\lambda$ and the rejection threshold $\theta$?

1.25 www Consider the generalization of the squared loss function (1.87) for a single target variable $t$ to the case of multiple target variables described by the vector $\mathbf{t}$ given by
$$\mathbb{E}[L(\mathbf{t}, \mathbf{y}(\mathbf{x}))] = \iint \|\mathbf{y}(\mathbf{x}) - \mathbf{t}\|^2\, p(\mathbf{x}, \mathbf{t})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{t}. \tag{1.151}$$
Using the calculus of variations, show that the function $\mathbf{y}(\mathbf{x})$ for which this expected loss is minimized is given by $\mathbf{y}(\mathbf{x}) = \mathbb{E}_{\mathbf{t}}[\mathbf{t}|\mathbf{x}]$. Show that this result reduces to (1.89) for the case of a single target variable $t$.

1.26 By expansion of the square in (1.151), derive a result analogous to (1.90) and hence show that the function $\mathbf{y}(\mathbf{x})$ that minimizes the expected squared loss for the case of a vector $\mathbf{t}$ of target variables is again given by the conditional expectation of $\mathbf{t}$.

1.27 www Consider the expected loss for regression problems under the $L_q$ loss function given by (1.91). Write down the condition that $y(\mathbf{x})$ must satisfy in order to minimize $\mathbb{E}[L_q]$. Show that, for $q = 1$, this solution represents the conditional median, i.e., the function $y(\mathbf{x})$ such that the probability mass for $t < y(\mathbf{x})$ is the same as for $t \geqslant y(\mathbf{x})$.

85 Exercises 65

Table 1.3: The joint distribution $p(x, y)$ for two binary variables $x$ and $y$ used in Exercise 1.39.

              y = 0   y = 1
    x = 0      1/3     1/3
    x = 1       0      1/3

1.31 www Consider two variables $x$ and $y$ having joint distribution $p(x, y)$. Show that the differential entropy of this pair of variables satisfies
$$\mathrm{H}[x, y] \leqslant \mathrm{H}[x] + \mathrm{H}[y] \tag{1.152}$$
with equality if, and only if, $x$ and $y$ are statistically independent.

1.32 Consider a vector $\mathbf{x}$ of continuous variables with distribution $p(\mathbf{x})$ and corresponding entropy $\mathrm{H}[\mathbf{x}]$. Suppose that we make a nonsingular linear transformation of $\mathbf{x}$ to obtain a new variable $\mathbf{y} = \mathbf{A}\mathbf{x}$. Show that the corresponding entropy is given by $\mathrm{H}[\mathbf{y}] = \mathrm{H}[\mathbf{x}] + \ln|\mathbf{A}|$ where $|\mathbf{A}|$ denotes the determinant of $\mathbf{A}$.

1.33 Suppose that the conditional entropy $\mathrm{H}[y|x]$ between two discrete random variables $x$ and $y$ is zero. Show that, for all values of $x$ such that $p(x) > 0$, the variable $y$ must be a function of $x$, in other words for each $x$ there is only one value of $y$ such that $p(y|x) \neq 0$.

1.34 www Use the calculus of variations to show that the stationary point of the entropy functional, subject to the constraints (1.105), (1.106), and (1.107), is given by (1.108). Then use these constraints to eliminate the Lagrange multipliers and hence show that the maximum entropy solution is given by the Gaussian (1.109).

1.35 www Use the results (1.106) and (1.107) to show that the entropy of the univariate Gaussian (1.109) is given by (1.110).

1.36 A strictly convex function is defined as one for which every chord lies above the function. Show that this is equivalent to the condition that the second derivative of the function be positive.

1.37 Using the definition (1.111) together with the product rule of probability, prove the result (1.112).

1.38 www Using proof by induction, show that the inequality (1.114) for convex functions implies the result (1.115).

1.39 Consider two binary variables $x$ and $y$ having the joint distribution given in Table 1.3. Evaluate the following quantities: (a) $\mathrm{H}[x]$, (b) $\mathrm{H}[y]$, (c) $\mathrm{H}[y|x]$, (d) $\mathrm{H}[x|y]$, (e) $\mathrm{H}[x, y]$, (f) $\mathrm{I}[x, y]$. Draw a diagram to show the relationship between these various quantities.
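The quantities asked for in Exercise 1.39 can be checked numerically from Table 1.3 using the standard identities H[y|x] = H[x, y] − H[x] and I[x, y] = H[x] + H[y] − H[x, y]. A minimal sketch, assuming Python with NumPy (not part of the book), with entropies in nats:

```python
import numpy as np

# Joint distribution p(x, y) from Table 1.3: rows index x, columns index y
p = np.array([[1/3, 1/3],
              [0.0, 1/3]])

def H(probs):
    """Entropy in nats, using the convention 0 ln 0 = 0."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

px, py = p.sum(axis=1), p.sum(axis=0)   # marginals p(x) and p(y)
Hx, Hy, Hxy = H(px), H(py), H(p.ravel())
print("H[x]   =", Hx)
print("H[y]   =", Hy)
print("H[x,y] =", Hxy)
print("H[y|x] =", Hxy - Hx)
print("H[x|y] =", Hxy - Hy)
print("I[x,y] =", Hx + Hy - Hxy)
```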

86 66 1. INTRODUCTION

1.40 By applying Jensen's inequality (1.115) with $f(x) = \ln x$, show that the arithmetic mean of a set of real numbers is never less than their geometrical mean.

1.41 www Using the sum and product rules of probability, show that the mutual information $\mathrm{I}(x, y)$ satisfies the relation (1.121).

87 2 Probability Distributions

In Chapter 1, we emphasized the central role played by probability theory in the solution of pattern recognition problems. We turn now to an exploration of some particular examples of probability distributions and their properties. As well as being of great interest in their own right, these distributions can form building blocks for more complex models and will be used extensively throughout the book. The distributions introduced in this chapter will also serve another important purpose, namely to provide us with the opportunity to discuss some key statistical concepts, such as Bayesian inference, in the context of simple models before we encounter them in more complex situations in later chapters.

One role for the distributions discussed in this chapter is to model the probability distribution $p(\mathbf{x})$ of a random variable $\mathbf{x}$, given a finite set $\mathbf{x}_1, \ldots, \mathbf{x}_N$ of observations. This problem is known as density estimation. For the purposes of this chapter, we shall assume that the data points are independent and identically distributed. It should be emphasized that the problem of density estimation is fundamentally

88 68 2. PROBABILITY DISTRIBUTIONS

ill-posed, because there are infinitely many probability distributions that could have given rise to the observed finite data set. Indeed, any distribution $p(\mathbf{x})$ that is nonzero at each of the data points $\mathbf{x}_1, \ldots, \mathbf{x}_N$ is a potential candidate. The issue of choosing an appropriate distribution relates to the problem of model selection that has already been encountered in the context of polynomial curve fitting in Chapter 1 and that is a central issue in pattern recognition.

We begin by considering the binomial and multinomial distributions for discrete random variables and the Gaussian distribution for continuous random variables. These are specific examples of parametric distributions, so-called because they are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian for example. To apply such models to the problem of density estimation, we need a procedure for determining suitable values for the parameters, given an observed data set. In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function. By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.

We shall see that an important role is played by conjugate priors, that lead to posterior distributions having the same functional form as the prior, and that therefore lead to a greatly simplified Bayesian analysis. For example, the conjugate prior for the parameters of the multinomial distribution is called the Dirichlet distribution, while the conjugate prior for the mean of a Gaussian is another Gaussian. All of these distributions are examples of the exponential family of distributions, which possess a number of important properties, and which will be discussed in some detail.

One limitation of the parametric approach is that it assumes a specific functional form for the distribution, which may turn out to be inappropriate for a particular application. An alternative approach is given by nonparametric density estimation methods in which the form of the distribution typically depends on the size of the data set. Such models still contain parameters, but these control the model complexity rather than the form of the distribution. We end this chapter by considering three nonparametric methods based respectively on histograms, nearest-neighbours, and kernels.

2.1. Binary Variables

We begin by considering a single binary random variable $x \in \{0, 1\}$. For example, $x$ might describe the outcome of flipping a coin, with $x = 1$ representing 'heads', and $x = 0$ representing 'tails'. We can imagine that this is a damaged coin so that the probability of landing heads is not necessarily the same as that of landing tails. The probability of $x = 1$ will be denoted by the parameter $\mu$ so that
$$p(x=1|\mu) = \mu \tag{2.1}$$

89 2.1. Binary Variables 69

where $0 \leqslant \mu \leqslant 1$, from which it follows that $p(x=0|\mu) = 1 - \mu$. The probability distribution over $x$ can therefore be written in the form
$$\mathrm{Bern}(x|\mu) = \mu^{x}(1-\mu)^{1-x} \tag{2.2}$$
which is known as the Bernoulli distribution. It is easily verified that this distribution is normalized and that it has mean and variance given by (Exercise 2.1)
$$\mathbb{E}[x] = \mu \tag{2.3}$$
$$\mathrm{var}[x] = \mu(1-\mu). \tag{2.4}$$
Now suppose we have a data set $\mathcal{D} = \{x_1, \ldots, x_N\}$ of observed values of $x$. We can construct the likelihood function, which is a function of $\mu$, on the assumption that the observations are drawn independently from $p(x|\mu)$, so that
$$p(\mathcal{D}|\mu) = \prod_{n=1}^{N} p(x_n|\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n}. \tag{2.5}$$
In a frequentist setting, we can estimate a value for $\mu$ by maximizing the likelihood function, or equivalently by maximizing the logarithm of the likelihood. In the case of the Bernoulli distribution, the log likelihood function is given by
$$\ln p(\mathcal{D}|\mu) = \sum_{n=1}^{N} \ln p(x_n|\mu) = \sum_{n=1}^{N} \left\{x_n \ln\mu + (1-x_n)\ln(1-\mu)\right\}. \tag{2.6}$$
At this point, it is worth noting that the log likelihood function depends on the $N$ observations $x_n$ only through their sum $\sum_n x_n$. This sum provides an example of a sufficient statistic for the data under this distribution, and we shall study the important role of sufficient statistics in some detail (Section 2.4). If we set the derivative of $\ln p(\mathcal{D}|\mu)$ with respect to $\mu$ equal to zero, we obtain the maximum likelihood estimator
$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n \tag{2.7}$$
which is also known as the sample mean.

[Jacob Bernoulli, 1654–1705] Jacob Bernoulli, also known as Jacques or James Bernoulli, was a Swiss mathematician and was the first of many in the Bernoulli family to pursue a career in science and mathematics. Although compelled to study philosophy and theology against his will by his parents, he travelled extensively after graduating in order to meet with many of the leading scientists of his time, including Boyle and Hooke in England. When he returned to Switzerland, he taught mechanics and became Professor of Mathematics at Basel in 1687. Unfortunately, rivalry between Jacob and his younger brother Johann turned an initially productive collaboration into a bitter and public dispute. Jacob's most significant contributions to mathematics appeared in The Art of Conjecture published in 1713, eight years after his death, which deals with topics in probability theory including what has become known as the Bernoulli distribution.
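As a concrete illustration of (2.5)–(2.7), the following sketch computes the log likelihood and its maximizer for a small data set; it assumes Python with NumPy, which is not part of the text:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])  # observed coin flips, 1 = heads

def log_likelihood(mu, x):
    """Bernoulli log likelihood (2.6)."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_ml = x.mean()  # maximum likelihood estimator (2.7), the sample mean
print(mu_ml)                      # 0.666...
print(log_likelihood(mu_ml, x))  # larger than at any other mu in (0, 1)
print(log_likelihood(0.5, x))
```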

90 70 2. PROBABILITY DISTRIBUTIONS

Figure 2.1: Histogram plot of the binomial distribution (2.9) as a function of $m$ for $N = 10$ and $\mu = 0.25$.

If we denote the number of observations of $x = 1$ (heads) within this data set by $m$, then we can write (2.7) in the form
$$\mu_{\mathrm{ML}} = \frac{m}{N} \tag{2.8}$$
so that the probability of landing heads is given, in this maximum likelihood framework, by the fraction of observations of heads in the data set.

Now suppose we flip a coin, say, 3 times and happen to observe 3 heads. Then $N = m = 3$ and $\mu_{\mathrm{ML}} = 1$. In this case, the maximum likelihood result would predict that all future observations should give heads. Common sense tells us that this is unreasonable, and in fact this is an extreme example of the over-fitting associated with maximum likelihood. We shall see shortly how to arrive at more sensible conclusions through the introduction of a prior distribution over $\mu$.

We can also work out the distribution of the number $m$ of observations of $x = 1$, given that the data set has size $N$. This is called the binomial distribution, and from (2.5) we see that it is proportional to $\mu^m(1-\mu)^{N-m}$. In order to obtain the normalization coefficient we note that out of $N$ coin flips, we have to add up all of the possible ways of obtaining $m$ heads, so that the binomial distribution can be written
$$\mathrm{Bin}(m|N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m} \tag{2.9}$$
where
$$\binom{N}{m} \equiv \frac{N!}{(N-m)!\,m!} \tag{2.10}$$
is the number of ways of choosing $m$ objects out of a total of $N$ identical objects (Exercise 2.3). Figure 2.1 shows a plot of the binomial distribution for $N = 10$ and $\mu = 0.25$.

The mean and variance of the binomial distribution can be found by using the result of Exercise 1.10, which shows that for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances.
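A sketch of (2.9)–(2.10) that reproduces the histogram values of Figure 2.1, assuming Python with only the standard library (not part of the book):

```python
from math import comb

def binom_pmf(m, N, mu):
    """Binomial distribution (2.9): C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]
print(sum(pmf))                                  # 1.0: the distribution is normalized
print(max(range(N + 1), key=lambda m: pmf[m]))   # the mode, here m = 2
```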

91 2.1. Binary Variables 71

Because $m = x_1 + \cdots + x_N$, and for each observation the mean and variance are given by (2.3) and (2.4), respectively, we have
$$\mathbb{E}[m] \equiv \sum_{m=0}^{N} m\,\mathrm{Bin}(m|N, \mu) = N\mu \tag{2.11}$$
$$\mathrm{var}[m] \equiv \sum_{m=0}^{N} \left(m - \mathbb{E}[m]\right)^2 \mathrm{Bin}(m|N, \mu) = N\mu(1-\mu). \tag{2.12}$$
These results can also be proved directly using calculus (Exercise 2.4).

2.1.1 The beta distribution

We have seen in (2.8) that the maximum likelihood setting for the parameter $\mu$ in the Bernoulli distribution, and hence in the binomial distribution, is given by the fraction of the observations in the data set having $x = 1$. As we have already noted, this can give severely over-fitted results for small data sets. In order to develop a Bayesian treatment for this problem, we need to introduce a prior distribution $p(\mu)$ over the parameter $\mu$. Here we consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties. To motivate this prior, we note that the likelihood function takes the form of the product of factors of the form $\mu^x(1-\mu)^{1-x}$. If we choose a prior to be proportional to powers of $\mu$ and $(1-\mu)$, then the posterior distribution, which is proportional to the product of the prior and the likelihood function, will have the same functional form as the prior. This property is called conjugacy and we will see several examples of it later in this chapter. We therefore choose a prior, called the beta distribution, given by
$$\mathrm{Beta}(\mu|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1} \tag{2.13}$$
where $\Gamma(x)$ is the gamma function defined by (1.141), and the coefficient in (2.13) ensures that the beta distribution is normalized (Exercise 2.5), so that
$$\int_0^1 \mathrm{Beta}(\mu|a, b)\,\mathrm{d}\mu = 1. \tag{2.14}$$
The mean and variance of the beta distribution are given by (Exercise 2.6)
$$\mathbb{E}[\mu] = \frac{a}{a+b} \tag{2.15}$$
$$\mathrm{var}[\mu] = \frac{ab}{(a+b)^2(a+b+1)}. \tag{2.16}$$
The parameters $a$ and $b$ are often called hyperparameters because they control the distribution of the parameter $\mu$. Figure 2.2 shows plots of the beta distribution for various values of the hyperparameters.

The posterior distribution of $\mu$ is now obtained by multiplying the beta prior (2.13) by the binomial likelihood function (2.9) and normalizing. Keeping only the factors that depend on $\mu$, we see that this posterior distribution has the form
$$p(\mu|m, l, a, b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1} \tag{2.17}$$

92 72 2. PROBABILITY DISTRIBUTIONS

Figure 2.2: Plots of the beta distribution $\mathrm{Beta}(\mu|a, b)$ given by (2.13) as a function of $\mu$ for various values of the hyperparameters $a$ and $b$: $(a, b) = (0.1, 0.1)$, $(1, 1)$, $(2, 3)$, and $(8, 4)$.

where $l = N - m$, and therefore corresponds to the number of 'tails' in the coin example. We see that (2.17) has the same functional dependence on $\mu$ as the prior distribution, reflecting the conjugacy properties of the prior with respect to the likelihood function. Indeed, it is simply another beta distribution, and its normalization coefficient can therefore be obtained by comparison with (2.13) to give
$$p(\mu|m, l, a, b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\,\mu^{m+a-1}(1-\mu)^{l+b-1}. \tag{2.18}$$
We see that the effect of observing a data set of $m$ observations of $x = 1$ and $l$ observations of $x = 0$ has been to increase the value of $a$ by $m$, and the value of $b$ by $l$, in going from the prior distribution to the posterior distribution. This allows us to provide a simple interpretation of the hyperparameters $a$ and $b$ in the prior as an effective number of observations of $x = 1$ and $x = 0$, respectively. Note that $a$ and $b$ need not be integers. Furthermore, the posterior distribution can act as the prior if we subsequently observe additional data. To see this, we can imagine taking observations one at a time and after each observation updating the current posterior

93 2.1. Binary Variables 73

Figure 2.3: Illustration of one step of sequential Bayesian inference. The prior is given by a beta distribution with parameters $a = 2$, $b = 2$, and the likelihood function, given by (2.9) with $N = m = 1$, corresponds to a single observation of $x = 1$, so that the posterior is given by a beta distribution with parameters $a = 3$, $b = 2$. The three panels show the prior, the likelihood function, and the posterior.

distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. At each stage, the posterior is a beta distribution with some total number of (prior and actual) observed values for $x = 1$ and $x = 0$ given by the parameters $a$ and $b$. Incorporation of an additional observation of $x = 1$ simply corresponds to incrementing the value of $a$ by 1, whereas for an observation of $x = 0$ we increment $b$ by 1. Figure 2.3 illustrates one step in this process.

We see that this sequential approach to learning arises naturally when we adopt a Bayesian viewpoint. It is independent of the choice of prior and of the likelihood function and depends only on the assumption of i.i.d. data. Sequential methods make use of observations one at a time, or in small batches, and then discard them before the next observations are used. They can be used, for example, in real-time learning scenarios where a steady stream of data is arriving, and predictions must be made before all of the data is seen. Because they do not require the whole data set to be stored or loaded into memory, sequential methods are also useful for large data sets. Maximum likelihood methods can also be cast into a sequential framework (Section 2.3.5).

If our goal is to predict, as best we can, the outcome of the next trial, then we must evaluate the predictive distribution of $x$, given the observed data set $\mathcal{D}$. From the sum and product rules of probability, this takes the form
$$p(x=1|\mathcal{D}) = \int_0^1 p(x=1|\mu)\,p(\mu|\mathcal{D})\,\mathrm{d}\mu = \int_0^1 \mu\, p(\mu|\mathcal{D})\,\mathrm{d}\mu = \mathbb{E}[\mu|\mathcal{D}]. \tag{2.19}$$
Using the result (2.18) for the posterior distribution $p(\mu|\mathcal{D})$, together with the result (2.15) for the mean of the beta distribution, we obtain
$$p(x=1|\mathcal{D}) = \frac{m+a}{m+a+l+b} \tag{2.20}$$
which has a simple interpretation as the total fraction of observations (both real observations and fictitious prior observations) that correspond to $x = 1$. Note that in the limit of an infinitely large data set $m, l \to \infty$ the result (2.20) reduces to the maximum likelihood result (2.8). As we shall see, it is a very general property that the Bayesian and maximum likelihood results will agree in the limit of an infinitely large data set.
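Because the conjugate update is just count incrementing, the sequential scheme of Figure 2.3 together with the predictive probability (2.20) fits in a few lines. A minimal sketch, assuming Python (not part of the book):

```python
def update(a, b, x):
    """One step of sequential Bayesian inference for the beta posterior:
    an observation x = 1 increments a, an observation x = 0 increments b."""
    return (a + 1, b) if x == 1 else (a, b + 1)

a, b = 2, 2                 # beta prior, as in Figure 2.3
for x in [1, 1, 0, 1]:      # observations arriving one at a time
    a, b = update(a, b, x)

print(a, b)                 # posterior parameters (5, 3)
print(a / (a + b))          # predictive p(x=1 | D) from (2.20), here 0.625
```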

94 74 2. PROBABILITY DISTRIBUTIONS

For a finite data set, the posterior mean for $\mu$ always lies between the prior mean and the maximum likelihood estimate for $\mu$ corresponding to the relative frequencies of events given by (2.7) (Exercise 2.7).

From Figure 2.2, we see that as the number of observations increases, so the posterior distribution becomes more sharply peaked. This can also be seen from the result (2.16) for the variance of the beta distribution, in which we see that the variance goes to zero for $a \to \infty$ or $b \to \infty$. In fact, we might wonder whether it is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution will steadily decrease.

To address this, we can take a frequentist view of Bayesian learning and show that, on average, such a property does indeed hold. Consider a general Bayesian inference problem for a parameter $\boldsymbol{\theta}$ for which we have observed a data set $\mathcal{D}$, described by the joint distribution $p(\boldsymbol{\theta}, \mathcal{D})$. The following result (Exercise 2.8)
$$\mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta}] = \mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta}|\mathcal{D}]\right] \tag{2.21}$$
where
$$\mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta}] \equiv \int p(\boldsymbol{\theta})\,\boldsymbol{\theta}\,\mathrm{d}\boldsymbol{\theta} \tag{2.22}$$
$$\mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta}|\mathcal{D}]\right] \equiv \int\left\{\int \boldsymbol{\theta}\,p(\boldsymbol{\theta}|\mathcal{D})\,\mathrm{d}\boldsymbol{\theta}\right\}p(\mathcal{D})\,\mathrm{d}\mathcal{D} \tag{2.23}$$
says that the posterior mean of $\boldsymbol{\theta}$, averaged over the distribution generating the data, is equal to the prior mean of $\boldsymbol{\theta}$. Similarly, we can show that
$$\mathrm{var}_{\boldsymbol{\theta}}[\boldsymbol{\theta}] = \mathbb{E}_{\mathcal{D}}\left[\mathrm{var}_{\boldsymbol{\theta}}[\boldsymbol{\theta}|\mathcal{D}]\right] + \mathrm{var}_{\mathcal{D}}\left[\mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta}|\mathcal{D}]\right]. \tag{2.24}$$
The term on the left-hand side of (2.24) is the prior variance of $\boldsymbol{\theta}$. On the right-hand side, the first term is the average posterior variance of $\boldsymbol{\theta}$, and the second term measures the variance in the posterior mean of $\boldsymbol{\theta}$. Because this variance is a positive quantity, this result shows that, on average, the posterior variance of $\boldsymbol{\theta}$ is smaller than the prior variance. The reduction in variance is greater if the variance in the posterior mean is greater. Note, however, that this result only holds on average, and that for a particular observed data set it is possible for the posterior variance to be larger than the prior variance.
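The decomposition (2.24) can be verified by simulation for the beta–Bernoulli model, drawing parameters from the prior and data sets from the model. A sketch, assuming Python with NumPy (not part of the book):

```python
import numpy as np

rng = np.random.default_rng(0)
a0, b0, N, S = 2.0, 2.0, 10, 200_000   # prior, data-set size, no. of simulations

post_vars, post_means = [], []
for _ in range(S):
    mu = rng.beta(a0, b0)               # draw a parameter from the prior
    m = rng.binomial(N, mu)             # generate a data set of N flips
    a, b = a0 + m, b0 + (N - m)         # posterior parameters (2.18)
    post_means.append(a / (a + b))                        # E[mu | D], from (2.15)
    post_vars.append(a * b / ((a + b)**2 * (a + b + 1)))  # var[mu | D], from (2.16)

prior_var = a0 * b0 / ((a0 + b0)**2 * (a0 + b0 + 1))
print(prior_var)                                # left-hand side of (2.24)
print(np.mean(post_vars) + np.var(post_means))  # right-hand side, approximately equal
```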

95 2.2. Multinomial Variables 75

2.2. Multinomial Variables

Binary variables can be used to describe quantities that can take one of two possible values. Often, however, we encounter discrete variables that can take on one of $K$ possible mutually exclusive states. Although there are various alternative ways to express such variables, we shall see shortly that a particularly convenient representation is the 1-of-$K$ scheme in which the variable is represented by a $K$-dimensional vector $\mathbf{x}$ in which one of the elements $x_k$ equals 1, and all remaining elements equal 0. So, for instance, if we have a variable that can take $K = 6$ states and a particular observation of the variable happens to correspond to the state where $x_3 = 1$, then $\mathbf{x}$ will be represented by
$$\mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathrm{T}}. \tag{2.25}$$
Note that such vectors satisfy $\sum_{k=1}^{K} x_k = 1$. If we denote the probability of $x_k = 1$ by the parameter $\mu_k$, then the distribution of $\mathbf{x}$ is given by
$$p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \tag{2.26}$$
where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}}$, and the parameters $\mu_k$ are constrained to satisfy $\mu_k \geqslant 0$ and $\sum_k \mu_k = 1$, because they represent probabilities. The distribution (2.26) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes. It is easily seen that the distribution is normalized
$$\sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1 \tag{2.27}$$
and that
$$\mathbb{E}[\mathbf{x}|\boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x}|\boldsymbol{\mu})\,\mathbf{x} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}} = \boldsymbol{\mu}. \tag{2.28}$$
Now consider a data set $\mathcal{D}$ of $N$ independent observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The corresponding likelihood function takes the form
$$p(\mathcal{D}|\boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\left(\sum_n x_{nk}\right)} = \prod_{k=1}^{K} \mu_k^{m_k}. \tag{2.29}$$
We see that the likelihood function depends on the $N$ data points only through the $K$ quantities
$$m_k = \sum_n x_{nk} \tag{2.30}$$
which represent the number of observations of $x_k = 1$. These are called the sufficient statistics for this distribution (Section 2.4).

In order to find the maximum likelihood solution for $\boldsymbol{\mu}$, we need to maximize $\ln p(\mathcal{D}|\boldsymbol{\mu})$ with respect to $\mu_k$ taking account of the constraint that the $\mu_k$ must sum to one. This can be achieved using a Lagrange multiplier $\lambda$ and maximizing (Appendix E)
$$\sum_{k=1}^{K} m_k \ln\mu_k + \lambda\left(\sum_{k=1}^{K} \mu_k - 1\right). \tag{2.31}$$
Setting the derivative of (2.31) with respect to $\mu_k$ to zero, we obtain
$$\mu_k = -m_k/\lambda. \tag{2.32}$$

96 76 2. PROBABILITY DISTRIBUTIONS

We can solve for the Lagrange multiplier $\lambda$ by substituting (2.32) into the constraint $\sum_k \mu_k = 1$ to give $\lambda = -N$. Thus we obtain the maximum likelihood solution in the form
$$\mu_k^{\mathrm{ML}} = \frac{m_k}{N} \tag{2.33}$$
which is the fraction of the $N$ observations for which $x_k = 1$.

We can consider the joint distribution of the quantities $m_1, \ldots, m_K$, conditioned on the parameters $\boldsymbol{\mu}$ and on the total number $N$ of observations. From (2.29) this takes the form
$$\mathrm{Mult}(m_1, m_2, \ldots, m_K|\boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \ldots m_K}\prod_{k=1}^{K}\mu_k^{m_k} \tag{2.34}$$
which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning $N$ objects into $K$ groups of size $m_1, \ldots, m_K$ and is given by
$$\binom{N}{m_1\, m_2 \ldots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}. \tag{2.35}$$
Note that the variables $m_k$ are subject to the constraint
$$\sum_{k=1}^{K} m_k = N. \tag{2.36}$$
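A minimal sketch of the maximum likelihood solution (2.33) computed from 1-of-$K$ encoded data, assuming Python with NumPy (not part of the book):

```python
import numpy as np

# N = 5 observations of a K = 3 state variable in 1-of-K encoding
X = np.array([[0, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 0, 1]])

m = X.sum(axis=0)        # sufficient statistics m_k from (2.30)
mu_ml = m / X.shape[0]   # maximum likelihood solution (2.33)
print(m)                 # [1 1 3]
print(mu_ml)             # [0.2 0.2 0.6]
```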

97 2.2. Multinomial Variables 77

2.2.1 The Dirichlet distribution

We now introduce a family of prior distributions for the parameters $\{\mu_k\}$ of the multinomial distribution (2.34). By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by
$$p(\boldsymbol{\mu}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K}\mu_k^{\alpha_k - 1} \tag{2.37}$$
where $0 \leqslant \mu_k \leqslant 1$ and $\sum_k \mu_k = 1$. Here $\alpha_1, \ldots, \alpha_K$ are the parameters of the distribution, and $\boldsymbol{\alpha}$ denotes $(\alpha_1, \ldots, \alpha_K)^{\mathrm{T}}$. Note that, because of the summation constraint, the distribution over the space of the $\{\mu_k\}$ is confined to a simplex of dimensionality $K - 1$, as illustrated for $K = 3$ in Figure 2.4.

Figure 2.4: The Dirichlet distribution over three variables $\mu_1, \mu_2, \mu_3$ is confined to a simplex (a bounded linear manifold) of the form shown, as a consequence of the constraints $0 \leqslant \mu_k \leqslant 1$ and $\sum_k \mu_k = 1$.

The normalized form for this distribution is (Exercise 2.9)
$$\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k - 1} \tag{2.38}$$
which is called the Dirichlet distribution. Here $\Gamma(x)$ is the gamma function defined by (1.141) while
$$\alpha_0 = \sum_{k=1}^{K}\alpha_k. \tag{2.39}$$
Plots of the Dirichlet distribution over the simplex, for various settings of the parameters $\alpha_k$, are shown in Figure 2.5.

Multiplying the prior (2.38) by the likelihood function (2.34), we obtain the posterior distribution for the parameters $\{\mu_k\}$ in the form
$$p(\boldsymbol{\mu}|\mathcal{D}, \boldsymbol{\alpha}) \propto p(\mathcal{D}|\boldsymbol{\mu})\,p(\boldsymbol{\mu}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1}. \tag{2.40}$$
We see that the posterior distribution again takes the form of a Dirichlet distribution, confirming that the Dirichlet is indeed a conjugate prior for the multinomial. This allows us to determine the normalization coefficient by comparison with (2.38) so that
$$p(\boldsymbol{\mu}|\mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha} + \mathbf{m}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1)\cdots\Gamma(\alpha_K + m_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1} \tag{2.41}$$
where we have denoted $\mathbf{m} = (m_1, \ldots, m_K)^{\mathrm{T}}$. As for the case of the binomial distribution with its beta prior, we can interpret the parameters $\alpha_k$ of the Dirichlet prior as an effective number of observations of $x_k = 1$.

[Lejeune Dirichlet, 1805–1859] Johann Peter Gustav Lejeune Dirichlet was a modest and reserved mathematician who made contributions in number theory, mechanics, and astronomy, and who gave the first rigorous analysis of Fourier series. His family originated from Richelet in Belgium, and the name Lejeune Dirichlet comes from 'le jeune de Richelet' (the young person from Richelet). Dirichlet's first paper, which was published in 1825, brought him instant fame. It concerned Fermat's last theorem, which claims that there are no positive integer solutions to $x^n + y^n = z^n$ for $n > 2$. Dirichlet gave a partial proof for the case $n = 5$, which was sent to Legendre for review and who in turn completed the proof. Later, Dirichlet gave a complete proof for $n = 14$, although a full proof of Fermat's last theorem for arbitrary $n$ had to wait until the work of Andrew Wiles in the closing years of the 20th century.
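Returning to (2.41), the conjugate update is simply an addition of counts to the prior parameters. A sketch, assuming Python with NumPy and SciPy (not part of the book):

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0])   # Dirichlet prior parameters
m = np.array([1, 1, 3])             # observed counts m_k from (2.30)

alpha_post = alpha + m              # posterior Dir(mu | alpha + m), from (2.41)
print(alpha_post)                   # [3. 3. 5.]
print(dirichlet.mean(alpha_post))   # posterior mean of mu: alpha_post / sum(alpha_post)
```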

98 78 2. PROBABILITY DISTRIBUTIONS

Figure 2.5: Plots of the Dirichlet distribution over three variables, where the two horizontal axes are coordinates in the plane of the simplex and the vertical axis corresponds to the value of the density. Here $\{\alpha_k\} = 0.1$ in the left plot, $\{\alpha_k\} = 1$ in the centre plot, and $\{\alpha_k\} = 10$ in the right plot.

Note that two-state quantities can either be represented as binary variables and modelled using the binomial distribution (2.9) or as 1-of-2 variables and modelled using the multinomial distribution (2.34) with $K = 2$.

2.3. The Gaussian Distribution

The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable $x$, the Gaussian distribution can be written in the form
$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} \tag{2.42}$$
where $\mu$ is the mean and $\sigma^2$ is the variance. For a $D$-dimensional vector $\mathbf{x}$, the multivariate Gaussian distribution takes the form
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} \tag{2.43}$$
where $\boldsymbol{\mu}$ is a $D$-dimensional mean vector, $\boldsymbol{\Sigma}$ is a $D \times D$ covariance matrix, and $|\boldsymbol{\Sigma}|$ denotes the determinant of $\boldsymbol{\Sigma}$.

The Gaussian distribution arises in many different contexts and can be motivated from a variety of different perspectives. For example, we have already seen that for a single real variable, the distribution that maximizes the entropy is the Gaussian (Section 1.6). This property applies also to the multivariate Gaussian (Exercise 2.14).

Another situation in which the Gaussian distribution arises is when we consider the sum of multiple random variables. The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases (Walker, 1969).

99 2.3. The Gaussian Distribution 79

Figure 2.6: Histogram plots of the mean of $N$ uniformly distributed numbers for various values of $N$ ($N = 1$, $N = 2$, and $N = 10$). We observe that as $N$ increases, the distribution tends towards a Gaussian.

We can illustrate this by considering $N$ variables $x_1, \ldots, x_N$ each of which has a uniform distribution over the interval $[0, 1]$ and then considering the distribution of the mean $(x_1 + \cdots + x_N)/N$. For large $N$, this distribution tends to a Gaussian, as illustrated in Figure 2.6. In practice, the convergence to a Gaussian as $N$ increases can be very rapid. One consequence of this result is that the binomial distribution (2.9), which is a distribution over $m$ defined by the sum of $N$ observations of the random binary variable $x$, will tend to a Gaussian as $N \to \infty$ (see Figure 2.1 for the case of $N = 10$).

The Gaussian distribution has many important analytical properties, and we shall consider several of these in detail. As a result, this section will be rather more technically involved than some of the earlier sections, and will require familiarity with various matrix identities (Appendix C). However, we strongly encourage the reader to become proficient in manipulating Gaussian distributions using the techniques presented here as this will prove invaluable in understanding the more complex models presented in later chapters.

We begin by considering the geometrical form of the Gaussian distribution.

[Carl Friedrich Gauss, 1777–1855] It is said that when Gauss went to elementary school at age 7, his teacher Büttner, trying to keep the class occupied, asked the pupils to sum the integers from 1 to 100. To the teacher's amazement, Gauss arrived at the answer in a matter of moments by noting that the sum can be represented as 50 pairs (1 + 100, 2 + 99, etc.) each of which added to 101, giving the answer 5,050. It is now believed that the problem which was actually set was of the same form but somewhat harder in that the sequence had a larger starting value and a larger increment. Gauss was a German mathematician and scientist with a reputation for being a hard-working perfectionist. One of his many contributions was to show that least squares can be derived under the assumption of normally distributed errors. He also created an early formulation of non-Euclidean geometry (a self-consistent geometrical theory that violates the axioms of Euclid) but was reluctant to discuss it openly for fear that his reputation might suffer if it were seen that he believed in such a geometry. At one point, Gauss was asked to conduct a geodetic survey of the state of Hanover, which led to his formulation of the normal distribution, now also known as the Gaussian. After his death, a study of his diaries revealed that he had discovered several important mathematical results years or even decades before they were published by others.
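The behaviour shown in Figure 2.6 is easy to reproduce numerically. A minimal sketch, assuming Python with NumPy (not part of the book):

```python
import numpy as np

rng = np.random.default_rng(1)
for N in [1, 2, 10]:
    # 100,000 realizations of the mean of N uniform [0, 1] variables
    means = rng.uniform(size=(100_000, N)).mean(axis=1)
    # As N grows, the sample variance approaches 1 / (12 N) and a
    # histogram of `means` tends towards a Gaussian shape.
    print(N, means.var(), 1 / (12 * N))
```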

100 80 2. PROBABILITY DISTRIBUTIONS

The functional dependence of the Gaussian on $\mathbf{x}$ is through the quadratic form
$$\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \tag{2.44}$$
which appears in the exponent. The quantity $\Delta$ is called the Mahalanobis distance from $\boldsymbol{\mu}$ to $\mathbf{x}$ and reduces to the Euclidean distance when $\boldsymbol{\Sigma}$ is the identity matrix. The Gaussian distribution will be constant on surfaces in $\mathbf{x}$-space for which this quadratic form is constant.

First of all, we note that the matrix $\boldsymbol{\Sigma}$ can be taken to be symmetric, without loss of generality, because any antisymmetric component would disappear from the exponent (Exercise 2.17). Now consider the eigenvector equation for the covariance matrix
$$\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i \tag{2.45}$$
where $i = 1, \ldots, D$. Because $\boldsymbol{\Sigma}$ is a real, symmetric matrix its eigenvalues will be real, and its eigenvectors can be chosen to form an orthonormal set (Exercise 2.18), so that
$$\mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j = I_{ij} \tag{2.46}$$
where $I_{ij}$ is the $i, j$ element of the identity matrix and satisfies
$$I_{ij} = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases} \tag{2.47}$$
The covariance matrix $\boldsymbol{\Sigma}$ can be expressed as an expansion in terms of its eigenvectors in the form (Exercise 2.19)
$$\boldsymbol{\Sigma} = \sum_{i=1}^{D}\lambda_i\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}} \tag{2.48}$$
and similarly the inverse covariance matrix $\boldsymbol{\Sigma}^{-1}$ can be expressed as
$$\boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{D}\frac{1}{\lambda_i}\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}}. \tag{2.49}$$
Substituting (2.49) into (2.44), the quadratic form becomes
$$\Delta^2 = \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i} \tag{2.50}$$
where we have defined
$$y_i = \mathbf{u}_i^{\mathrm{T}}(\mathbf{x}-\boldsymbol{\mu}). \tag{2.51}$$
We can interpret $\{y_i\}$ as a new coordinate system defined by the orthonormal vectors $\mathbf{u}_i$ that are shifted and rotated with respect to the original $x_i$ coordinates. Forming the vector $\mathbf{y} = (y_1, \ldots, y_D)^{\mathrm{T}}$, we have
$$\mathbf{y} = \mathbf{U}(\mathbf{x}-\boldsymbol{\mu}) \tag{2.52}$$
where $\mathbf{U}$ is a matrix whose rows are given by $\mathbf{u}_i^{\mathrm{T}}$.
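The equivalence of (2.44) and (2.50) can be checked directly; a minimal sketch, assuming Python with NumPy (not part of the book):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

# Squared Mahalanobis distance, form (2.44)
d = x - mu
delta2 = d @ np.linalg.inv(Sigma) @ d

# The same quantity via the eigendecomposition, forms (2.50)-(2.51)
lam, U = np.linalg.eigh(Sigma)      # columns of U are the eigenvectors u_i
y = U.T @ d                         # rotated coordinates y_i = u_i^T (x - mu)
print(delta2, np.sum(y**2 / lam))   # the two values agree
```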

101 2.3. The Gaussian Distribution 81

Figure 2.7: The red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space $\mathbf{x} = (x_1, x_2)$ on which the density is $\exp(-1/2)$ of its value at $\mathbf{x} = \boldsymbol{\mu}$. The major axes of the ellipse are defined by the eigenvectors $\mathbf{u}_i$ of the covariance matrix, with corresponding eigenvalues $\lambda_i$, and the axis lengths scale as $\lambda_i^{1/2}$.

From (2.46) it follows that $\mathbf{U}$ is an orthogonal matrix (Appendix C), i.e., it satisfies $\mathbf{U}\mathbf{U}^{\mathrm{T}} = \mathbf{I}$, and hence also $\mathbf{U}^{\mathrm{T}}\mathbf{U} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix.

The quadratic form, and hence the Gaussian density, will be constant on surfaces for which (2.51) is constant. If all of the eigenvalues $\lambda_i$ are positive, then these surfaces represent ellipsoids, with their centres at $\boldsymbol{\mu}$ and their axes oriented along $\mathbf{u}_i$, and with scaling factors in the directions of the axes given by $\lambda_i^{1/2}$, as illustrated in Figure 2.7.

For the Gaussian distribution to be well defined, it is necessary for all of the eigenvalues $\lambda_i$ of the covariance matrix to be strictly positive, otherwise the distribution cannot be properly normalized. A matrix whose eigenvalues are strictly positive is said to be positive definite. In Chapter 12, we will encounter Gaussian distributions for which one or more of the eigenvalues are zero, in which case the distribution is singular and is confined to a subspace of lower dimensionality. If all of the eigenvalues are nonnegative, then the covariance matrix is said to be positive semidefinite.

Now consider the form of the Gaussian distribution in the new coordinate system defined by the $y_i$. In going from the $\mathbf{x}$ to the $\mathbf{y}$ coordinate system, we have a Jacobian matrix $\mathbf{J}$ with elements given by
$$J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ji} \tag{2.53}$$
where $U_{ji}$ are the elements of the matrix $\mathbf{U}^{\mathrm{T}}$. Using the orthonormality property of the matrix $\mathbf{U}$, we see that the square of the determinant of the Jacobian matrix is
$$|\mathbf{J}|^2 = \left|\mathbf{U}^{\mathrm{T}}\right|^2 = \left|\mathbf{U}^{\mathrm{T}}\right|\left|\mathbf{U}\right| = \left|\mathbf{U}^{\mathrm{T}}\mathbf{U}\right| = |\mathbf{I}| = 1 \tag{2.54}$$
and hence $|\mathbf{J}| = 1$. Also, the determinant $|\boldsymbol{\Sigma}|$ of the covariance matrix can be written

102 82 2. PROBABILITY DISTRIBUTIONS

as the product of its eigenvalues, and hence
$$|\boldsymbol{\Sigma}|^{1/2} = \prod_{j=1}^{D}\lambda_j^{1/2}. \tag{2.55}$$
Thus in the $y_j$ coordinate system, the Gaussian distribution takes the form
$$p(\mathbf{y}) = p(\mathbf{x})\,|\mathbf{J}| = \prod_{j=1}^{D}\frac{1}{(2\pi\lambda_j)^{1/2}}\exp\left\{-\frac{y_j^2}{2\lambda_j}\right\} \tag{2.56}$$
which is the product of $D$ independent univariate Gaussian distributions. The eigenvectors therefore define a new set of shifted and rotated coordinates with respect to which the joint probability distribution factorizes into a product of independent distributions. The integral of the distribution in the $\mathbf{y}$ coordinate system is then
$$\int p(\mathbf{y})\,\mathrm{d}\mathbf{y} = \prod_{j=1}^{D}\int_{-\infty}^{\infty}\frac{1}{(2\pi\lambda_j)^{1/2}}\exp\left\{-\frac{y_j^2}{2\lambda_j}\right\}\mathrm{d}y_j = 1 \tag{2.57}$$
where we have used the result (1.48) for the normalization of the univariate Gaussian. This confirms that the multivariate Gaussian (2.43) is indeed normalized.

We now look at the moments of the Gaussian distribution and thereby provide an interpretation of the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. The expectation of $\mathbf{x}$ under the Gaussian distribution is given by
$$\mathbb{E}[\mathbf{x}] = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}\mathbf{x}\,\mathrm{d}\mathbf{x} \\
= \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}\mathbf{z}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{z}\right\}(\mathbf{z}+\boldsymbol{\mu})\,\mathrm{d}\mathbf{z} \tag{2.58}$$
where we have changed variables using $\mathbf{z} = \mathbf{x} - \boldsymbol{\mu}$. We now note that the exponent is an even function of the components of $\mathbf{z}$ and, because the integrals over these are taken over the range $(-\infty, \infty)$, the term in $\mathbf{z}$ in the factor $(\mathbf{z}+\boldsymbol{\mu})$ will vanish by symmetry. Thus
$$\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu} \tag{2.59}$$
and so we refer to $\boldsymbol{\mu}$ as the mean of the Gaussian distribution.

We now consider second order moments of the Gaussian. In the univariate case, we considered the second order moment given by $\mathbb{E}[x^2]$. For the multivariate Gaussian, there are $D^2$ second order moments given by $\mathbb{E}[x_i x_j]$, which we can group together to form the matrix $\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}]$. This matrix can be written as
$$\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}] = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}\mathbf{x}\mathbf{x}^{\mathrm{T}}\,\mathrm{d}\mathbf{x} \\
= \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}\mathbf{z}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{z}\right\}(\mathbf{z}+\boldsymbol{\mu})(\mathbf{z}+\boldsymbol{\mu})^{\mathrm{T}}\,\mathrm{d}\mathbf{z}$$

103 2.3. The Gaussian Distribution 83

where again we have changed variables using $\mathbf{z} = \mathbf{x} - \boldsymbol{\mu}$. Note that the cross-terms involving $\boldsymbol{\mu}\mathbf{z}^{\mathrm{T}}$ and $\mathbf{z}\boldsymbol{\mu}^{\mathrm{T}}$ will again vanish by symmetry. The term $\boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}}$ is constant and can be taken outside the integral, which itself is unity because the Gaussian distribution is normalized. Consider the term involving $\mathbf{z}\mathbf{z}^{\mathrm{T}}$. Again, we can make use of the eigenvector expansion of the covariance matrix given by (2.45), together with the completeness of the set of eigenvectors, to write
$$\mathbf{z} = \sum_{j=1}^{D} y_j\mathbf{u}_j \tag{2.60}$$
where $y_j = \mathbf{u}_j^{\mathrm{T}}\mathbf{z}$, which gives
$$\frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}\mathbf{z}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{z}\right\}\mathbf{z}\mathbf{z}^{\mathrm{T}}\,\mathrm{d}\mathbf{z} \\
= \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\sum_{i=1}^{D}\sum_{j=1}^{D}\mathbf{u}_i\mathbf{u}_j^{\mathrm{T}}\int\exp\left\{-\sum_{k=1}^{D}\frac{y_k^2}{2\lambda_k}\right\}y_i\,y_j\,\mathrm{d}\mathbf{y} \\
= \sum_{i=1}^{D}\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}}\lambda_i = \boldsymbol{\Sigma} \tag{2.61}$$
where we have made use of the eigenvector equation (2.45), together with the fact that the integral on the right-hand side of the middle line vanishes by symmetry unless $i = j$, and in the final line we have made use of the results (1.50) and (2.55), together with (2.48). Thus we have
$$\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}] = \boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}} + \boldsymbol{\Sigma}. \tag{2.62}$$
For single random variables, we subtracted the mean before taking second moments in order to define a variance. Similarly, in the multivariate case it is again convenient to subtract off the mean, giving rise to the covariance of a random vector $\mathbf{x}$ defined by
$$\mathrm{cov}[\mathbf{x}] = \mathbb{E}\left[(\mathbf{x}-\mathbb{E}[\mathbf{x}])(\mathbf{x}-\mathbb{E}[\mathbf{x}])^{\mathrm{T}}\right]. \tag{2.63}$$
For the specific case of a Gaussian distribution, we can make use of $\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$, together with the result (2.62), to give
$$\mathrm{cov}[\mathbf{x}] = \boldsymbol{\Sigma}. \tag{2.64}$$
Because the parameter matrix $\boldsymbol{\Sigma}$ governs the covariance of $\mathbf{x}$ under the Gaussian distribution, it is called the covariance matrix.

Although the Gaussian distribution (2.43) is widely used as a density model, it suffers from some significant limitations. Consider the number of free parameters in the distribution. A general symmetric covariance matrix $\boldsymbol{\Sigma}$ will have $D(D+1)/2$ independent parameters, and there are another $D$ independent parameters in $\boldsymbol{\mu}$, giving $D(D+3)/2$ parameters in total (Exercise 2.21). For large $D$, the total number of parameters

104 84 2. PROBABILITY DISTRIBUTIONS

Figure 2.8: Contours of constant probability density for a Gaussian distribution in two dimensions in which the covariance matrix is (a) of general form, (b) diagonal, in which the elliptical contours are aligned with the coordinate axes, and (c) proportional to the identity matrix, in which the contours are concentric circles.

therefore grows quadratically with $D$, and the computational task of manipulating and inverting large matrices can become prohibitive. One way to address this problem is to use restricted forms of the covariance matrix. If we consider covariance matrices that are diagonal, so that $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_i^2)$, we then have a total of $2D$ independent parameters in the density model. The corresponding contours of constant density are given by axis-aligned ellipsoids. We could further restrict the covariance matrix to be proportional to the identity matrix, $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, known as an isotropic covariance, giving $D + 1$ independent parameters in the model and spherical surfaces of constant density. The three possibilities of general, diagonal, and isotropic covariance matrices are illustrated in Figure 2.8. Unfortunately, whereas such approaches limit the number of degrees of freedom in the distribution and make inversion of the covariance matrix a much faster operation, they also greatly restrict the form of the probability density and limit its ability to capture interesting correlations in the data.

A further limitation of the Gaussian distribution is that it is intrinsically unimodal (i.e., has a single maximum) and so is unable to provide a good approximation to multimodal distributions. Thus the Gaussian distribution can be both too flexible, in the sense of having too many parameters, while also being too limited in the range of distributions that it can adequately represent. We will see later that the introduction of latent variables, also called hidden variables or unobserved variables, allows both of these problems to be addressed. In particular, a rich family of multimodal distributions is obtained by introducing discrete latent variables leading to mixtures of Gaussians, as discussed in Section 2.3.9. Similarly, the introduction of continuous latent variables, as described in Chapter 12, leads to models in which the number of free parameters can be controlled independently of the dimensionality $D$ of the data space while still allowing the model to capture the dominant correlations in the data set. Indeed, these two approaches can be combined and further extended to derive a very rich set of hierarchical models that can be adapted to a broad range of practical applications. For instance, the Gaussian version of the Markov random field (Section 8.3), which is widely used as a probabilistic model of images, is a Gaussian distribution over the joint space of pixel intensities but rendered tractable through the imposition of considerable structure reflecting the spatial organization of the pixels. Similarly, the linear dynamical system (Section 13.3), used to model time series data for applications such as tracking, is also a joint Gaussian distribution over a potentially large number of observed and latent variables and again is tractable due to the structure imposed on the distribution. A powerful framework for expressing the form and properties of

105 2.3. The Gaussian Distribution 85

such complex distributions is that of probabilistic graphical models, which will form the subject of Chapter 8.

2.3.1 Conditional Gaussian distributions

An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.

Consider first the case of conditional distributions. Suppose $\mathbf{x}$ is a $D$-dimensional vector with Gaussian distribution $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and that we partition $\mathbf{x}$ into two disjoint subsets $\mathbf{x}_a$ and $\mathbf{x}_b$. Without loss of generality, we can take $\mathbf{x}_a$ to form the first $M$ components of $\mathbf{x}$, with $\mathbf{x}_b$ comprising the remaining $D - M$ components, so that
$$\mathbf{x} = \begin{pmatrix}\mathbf{x}_a \\ \mathbf{x}_b\end{pmatrix}. \tag{2.65}$$
We also define corresponding partitions of the mean vector $\boldsymbol{\mu}$ given by
$$\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b\end{pmatrix} \tag{2.66}$$
and of the covariance matrix $\boldsymbol{\Sigma}$ given by
$$\boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix}. \tag{2.67}$$
Note that the symmetry $\boldsymbol{\Sigma}^{\mathrm{T}} = \boldsymbol{\Sigma}$ of the covariance matrix implies that $\boldsymbol{\Sigma}_{aa}$ and $\boldsymbol{\Sigma}_{bb}$ are symmetric, while $\boldsymbol{\Sigma}_{ba} = \boldsymbol{\Sigma}_{ab}^{\mathrm{T}}$.

In many situations, it will be convenient to work with the inverse of the covariance matrix
$$\boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1} \tag{2.68}$$
which is known as the precision matrix. In fact, we shall see that some properties of Gaussian distributions are most naturally expressed in terms of the covariance, whereas others take a simpler form when viewed in terms of the precision. We therefore also introduce the partitioned form of the precision matrix
$$\boldsymbol{\Lambda} = \begin{pmatrix}\boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab} \\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb}\end{pmatrix} \tag{2.69}$$
corresponding to the partitioning (2.65) of the vector $\mathbf{x}$. Because the inverse of a symmetric matrix is also symmetric (Exercise 2.22), we see that $\boldsymbol{\Lambda}_{aa}$ and $\boldsymbol{\Lambda}_{bb}$ are symmetric, while $\boldsymbol{\Lambda}_{ba} = \boldsymbol{\Lambda}_{ab}^{\mathrm{T}}$. It should be stressed at this point that, for instance, $\boldsymbol{\Lambda}_{aa}$ is not simply given by the inverse of $\boldsymbol{\Sigma}_{aa}$. In fact, we shall shortly examine the relation between the inverse of a partitioned matrix and the inverses of its partitions.

Let us begin by finding an expression for the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$. From the product rule of probability, we see that this conditional distribution can be

106 86 2. PROBABILITY DISTRIBUTIONS

evaluated from the joint distribution $p(\mathbf{x}) = p(\mathbf{x}_a, \mathbf{x}_b)$ simply by fixing $\mathbf{x}_b$ to the observed value and normalizing the resulting expression to obtain a valid probability distribution over $\mathbf{x}_a$. Instead of performing this normalization explicitly, we can obtain the solution more efficiently by considering the quadratic form in the exponent of the Gaussian distribution given by (2.44) and then reinstating the normalization coefficient at the end of the calculation. If we make use of the partitioning (2.65), (2.66), and (2.69), we obtain
$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \\
-\frac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^{\mathrm{T}}\boldsymbol{\Lambda}_{aa}(\mathbf{x}_a-\boldsymbol{\mu}_a) - \frac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^{\mathrm{T}}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b) \\
-\frac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^{\mathrm{T}}\boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a) - \frac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}(\mathbf{x}_b-\boldsymbol{\mu}_b). \tag{2.70}$$
We see that as a function of $\mathbf{x}_a$, this is again a quadratic form, and hence the corresponding conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$ will be Gaussian. Because this distribution is completely characterized by its mean and its covariance, our goal will be to identify expressions for the mean and covariance of $p(\mathbf{x}_a|\mathbf{x}_b)$ by inspection of (2.70).

This is an example of a rather common operation associated with Gaussian distributions, sometimes called 'completing the square', in which we are given a quadratic form defining the exponent terms in a Gaussian distribution, and we need to determine the corresponding mean and covariance. Such problems can be solved straightforwardly by noting that the exponent in a general Gaussian distribution $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ can be written
$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = -\frac{1}{2}\mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \text{const} \tag{2.71}$$
where 'const' denotes terms which are independent of $\mathbf{x}$, and we have made use of the symmetry of $\boldsymbol{\Sigma}$. Thus if we take our general quadratic form and express it in the form given by the right-hand side of (2.71), then we can immediately equate the matrix of coefficients entering the second order term in $\mathbf{x}$ to the inverse covariance matrix $\boldsymbol{\Sigma}^{-1}$ and the coefficient of the linear term in $\mathbf{x}$ to $\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$, from which we can obtain $\boldsymbol{\mu}$.

Now let us apply this procedure to the conditional Gaussian distribution $p(\mathbf{x}_a|\mathbf{x}_b)$ for which the quadratic form in the exponent is given by (2.70). We will denote the mean and covariance of this distribution by $\boldsymbol{\mu}_{a|b}$ and $\boldsymbol{\Sigma}_{a|b}$, respectively. Consider the functional dependence of (2.70) on $\mathbf{x}_a$ in which $\mathbf{x}_b$ is regarded as a constant. If we pick out all terms that are second order in $\mathbf{x}_a$, we have
$$-\frac{1}{2}\mathbf{x}_a^{\mathrm{T}}\boldsymbol{\Lambda}_{aa}\mathbf{x}_a \tag{2.72}$$
from which we can immediately conclude that the covariance (inverse precision) of $p(\mathbf{x}_a|\mathbf{x}_b)$ is given by
$$\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Lambda}_{aa}^{-1}. \tag{2.73}$$

107 2.3. The Gaussian Distribution 87

Now consider all of the terms in (2.70) that are linear in $\mathbf{x}_a$
$$\mathbf{x}_a^{\mathrm{T}}\left\{\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\right\} \tag{2.74}$$
where we have used $\boldsymbol{\Lambda}_{ba}^{\mathrm{T}} = \boldsymbol{\Lambda}_{ab}$. From our discussion of the general form (2.71), the coefficient of $\mathbf{x}_a$ in this expression must equal $\boldsymbol{\Sigma}_{a|b}^{-1}\boldsymbol{\mu}_{a|b}$ and hence
$$\boldsymbol{\mu}_{a|b} = \boldsymbol{\Sigma}_{a|b}\left\{\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\right\} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b) \tag{2.75}$$
where we have made use of (2.73).

The results (2.73) and (2.75) are expressed in terms of the partitioned precision matrix of the original joint distribution $p(\mathbf{x}_a, \mathbf{x}_b)$. We can also express these results in terms of the corresponding partitioned covariance matrix. To do this, we make use of the following identity for the inverse of a partitioned matrix (Exercise 2.24)
$$\begin{pmatrix}\mathbf{A} & \mathbf{B}\\ \mathbf{C} & \mathbf{D}\end{pmatrix}^{-1} = \begin{pmatrix}\mathbf{M} & -\mathbf{M}\mathbf{B}\mathbf{D}^{-1}\\ -\mathbf{D}^{-1}\mathbf{C}\mathbf{M} & \mathbf{D}^{-1} + \mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1}\end{pmatrix} \tag{2.76}$$
where we have defined
$$\mathbf{M} = (\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}. \tag{2.77}$$
The quantity $\mathbf{M}^{-1}$ is known as the Schur complement of the matrix on the left-hand side of (2.76) with respect to the submatrix $\mathbf{D}$. Using the definition
$$\begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab}\\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix}^{-1} = \begin{pmatrix}\boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab}\\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb}\end{pmatrix} \tag{2.78}$$
and making use of (2.76), we have
$$\boldsymbol{\Lambda}_{aa} = \left(\boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}\right)^{-1} \tag{2.79}$$
$$\boldsymbol{\Lambda}_{ab} = -\left(\boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}\right)^{-1}\boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}. \tag{2.80}$$
From these we obtain the following expressions for the mean and covariance of the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$
$$\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b) \tag{2.81}$$
$$\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}. \tag{2.82}$$
Comparing (2.73) and (2.82), we see that the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$ takes a simpler form when expressed in terms of the partitioned precision matrix than when it is expressed in terms of the partitioned covariance matrix. Note that the mean of the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$, given by (2.81), is a linear function of $\mathbf{x}_b$ and that the covariance, given by (2.82), is independent of $\mathbf{x}_b$. This represents an example of a linear-Gaussian model (Section 8.1.4).
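The conditional mean and covariance (2.81)–(2.82) are directly computable from a partitioned covariance matrix. A minimal sketch, assuming Python with NumPy (not part of the book), using values that match the example of Figure 2.9:

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])

# Partition: a = first variable, b = second variable
mu_a, mu_b = mu[:1], mu[1:]
S_aa, S_ab = Sigma[:1, :1], Sigma[:1, 1:]
S_ba, S_bb = Sigma[1:, :1], Sigma[1:, 1:]

x_b = np.array([0.7])  # observed value of x_b

# Conditional mean (2.81) and covariance (2.82)
mu_cond = mu_a + S_ab @ np.linalg.inv(S_bb) @ (x_b - mu_b)
S_cond = S_aa - S_ab @ np.linalg.inv(S_bb) @ S_ba
print(mu_cond)  # [0.42]
print(S_cond)   # [[0.64]]
```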

108 88 2. PROBABILITY DISTRIBUTIONS

2.3.2 Marginal Gaussian distributions

We have seen that if a joint distribution $p(\mathbf{x}_a, \mathbf{x}_b)$ is Gaussian, then the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$ will again be Gaussian. Now we turn to a discussion of the marginal distribution given by
$$p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b)\,\mathrm{d}\mathbf{x}_b \tag{2.83}$$
which, as we shall see, is also Gaussian. Once again, our strategy for evaluating this distribution efficiently will be to focus on the quadratic form in the exponent of the joint distribution and thereby to identify the mean and covariance of the marginal distribution $p(\mathbf{x}_a)$.

The quadratic form for the joint distribution can be expressed, using the partitioned precision matrix, in the form (2.70). Because our goal is to integrate out $\mathbf{x}_b$, this is most easily achieved by first considering the terms involving $\mathbf{x}_b$ and then completing the square in order to facilitate integration. Picking out just those terms that involve $\mathbf{x}_b$, we have
$$-\frac{1}{2}\mathbf{x}_b^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}\mathbf{x}_b + \mathbf{x}_b^{\mathrm{T}}\mathbf{m} = -\frac{1}{2}\left(\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}\right)^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}\left(\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}\right) + \frac{1}{2}\mathbf{m}^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m} \tag{2.84}$$
where we have defined
$$\mathbf{m} = \boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a - \boldsymbol{\mu}_a). \tag{2.85}$$
We see that the dependence on $\mathbf{x}_b$ has been cast into the standard quadratic form of a Gaussian distribution corresponding to the first term on the right-hand side of (2.84), plus a term that does not depend on $\mathbf{x}_b$ (but that does depend on $\mathbf{x}_a$). Thus, when we take the exponential of this quadratic form, we see that the integration over $\mathbf{x}_b$ required by (2.83) will take the form
$$\int \exp\left\{-\frac{1}{2}\left(\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}\right)^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}\left(\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}\right)\right\}\mathrm{d}\mathbf{x}_b. \tag{2.86}$$
This integration is easily performed by noting that it is the integral over an unnormalized Gaussian, and so the result will be the reciprocal of the normalization coefficient. We know from the form of the normalized Gaussian given by (2.43), that this coefficient is independent of the mean and depends only on the determinant of the covariance matrix. Thus, by completing the square with respect to $\mathbf{x}_b$, we can integrate out $\mathbf{x}_b$ and the only term remaining from the contributions on the left-hand side of (2.84) that depends on $\mathbf{x}_a$ is the last term on the right-hand side of (2.84) in which $\mathbf{m}$ is given by (2.85). Combining this term with the remaining terms from

109 2.3. The Gaussian Distribution 89

(2.70) that depend on $\mathbf{x}_a$, we obtain
$$\frac{1}{2}\left[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)\right]^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}^{-1}\left[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)\right] \\
-\frac{1}{2}\mathbf{x}_a^{\mathrm{T}}\boldsymbol{\Lambda}_{aa}\mathbf{x}_a + \mathbf{x}_a^{\mathrm{T}}\left(\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{ab}\boldsymbol{\mu}_b\right) + \text{const} \\
= -\frac{1}{2}\mathbf{x}_a^{\mathrm{T}}\left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\mathbf{x}_a + \mathbf{x}_a^{\mathrm{T}}\left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\boldsymbol{\mu}_a + \text{const} \tag{2.87}$$
where 'const' denotes quantities independent of $\mathbf{x}_a$. Again, by comparison with (2.71), we see that the covariance of the marginal distribution $p(\mathbf{x}_a)$ is given by
$$\boldsymbol{\Sigma}_a = \left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)^{-1}. \tag{2.88}$$
Similarly, the mean is given by
$$\boldsymbol{\Sigma}_a\left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\boldsymbol{\mu}_a = \boldsymbol{\mu}_a \tag{2.89}$$
where we have used (2.88). The covariance in (2.88) is expressed in terms of the partitioned precision matrix given by (2.69). We can rewrite this in terms of the corresponding partitioning of the covariance matrix given by (2.67), as we did for the conditional distribution. These partitioned matrices are related by
$$\begin{pmatrix}\boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab}\\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb}\end{pmatrix}^{-1} = \begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab}\\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix}. \tag{2.90}$$
Making use of (2.76), we then have
$$\left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)^{-1} = \boldsymbol{\Sigma}_{aa}. \tag{2.91}$$
Thus we obtain the intuitively satisfying result that the marginal distribution $p(\mathbf{x}_a)$ has mean and covariance given by
$$\mathbb{E}[\mathbf{x}_a] = \boldsymbol{\mu}_a \tag{2.92}$$
$$\mathrm{cov}[\mathbf{x}_a] = \boldsymbol{\Sigma}_{aa}. \tag{2.93}$$
We see that for a marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix, in contrast to the conditional distribution for which the partitioned precision matrix gives rise to simpler expressions.

Our results for the marginal and conditional distributions of a partitioned Gaussian are summarized below.

Partitioned Gaussians

Given a joint Gaussian distribution $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}$ and
$$\mathbf{x} = \begin{pmatrix}\mathbf{x}_a\\ \mathbf{x}_b\end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_a\\ \boldsymbol{\mu}_b\end{pmatrix} \tag{2.94}$$

110 \[
\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \qquad
\Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}. \tag{2.95}
\]
Conditional distribution:
\[
p(x_a \mid x_b) = \mathcal{N}\left(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1}\right) \tag{2.96}
\]
\[
\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b). \tag{2.97}
\]
Marginal distribution:
\[
p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}). \tag{2.98}
\]

Figure 2.9  The plot on the left shows the contours of a Gaussian distribution $p(x_a, x_b)$ over two variables, and the plot on the right shows the marginal distribution $p(x_a)$ (blue curve) and the conditional distribution $p(x_a \mid x_b)$ for $x_b = 0.7$ (red curve).

We illustrate the idea of conditional and marginal distributions associated with a multivariate Gaussian using an example involving two variables in Figure 2.9.
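These results are straightforward to verify numerically. The following minimal sketch (Python/NumPy; the three-dimensional joint distribution, the 2 + 1 partition, and the observed value $x_b = 0.7$ are illustrative values invented for this example) computes the marginal parameters (2.98) and the conditional parameters (2.96)–(2.97):

```python
import numpy as np

# Hypothetical 3-d joint Gaussian, partitioned as x = (x_a, x_b) with
# dim(x_a) = 2 and dim(x_b) = 1; all values are illustrative.
mu = np.array([0.5, -0.2, 1.0])
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 0.8, 0.1],
                  [0.2, 0.1, 0.5]])
ia, ib = slice(0, 2), slice(2, 3)

Lambda = np.linalg.inv(Sigma)                  # precision matrix, as in (2.95)
Laa, Lab = Lambda[ia, ia], Lambda[ia, ib]

# Marginal p(x_a), from (2.98): just read off the corresponding blocks.
mu_a, Sigma_aa = mu[ia], Sigma[ia, ia]

# Conditional p(x_a | x_b), from (2.96) and (2.97).
x_b = np.array([0.7])
mu_a_given_b = mu_a - np.linalg.solve(Laa, Lab @ (x_b - mu[ib]))
Sigma_a_given_b = np.linalg.inv(Laa)
```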

111 2.3.3 Bayes' theorem for Gaussian variables

In Sections 2.3.1 and 2.3.2, we considered a Gaussian $p(x)$ in which we partitioned the vector $x$ into two subvectors $x = (x_a, x_b)$ and then found expressions for the conditional distribution $p(x_a \mid x_b)$ and the marginal distribution $p(x_a)$. We noted that the mean of the conditional distribution $p(x_a \mid x_b)$ was a linear function of $x_b$. Here we shall suppose that we are given a Gaussian marginal distribution $p(x)$ and a Gaussian conditional distribution $p(y \mid x)$ in which $p(y \mid x)$ has a mean that is a linear function of $x$, and a covariance which is independent of $x$. This is an example of a linear Gaussian model (Roweis and Ghahramani, 1999), which we shall study in greater generality in Section 8.1.4. We wish to find the marginal distribution $p(y)$ and the conditional distribution $p(x \mid y)$. This is a problem that will arise frequently in subsequent chapters, and it will prove convenient to derive the general results here.

We shall take the marginal and conditional distributions to be
\[
p(x) = \mathcal{N}\left(x \mid \mu, \Lambda^{-1}\right) \tag{2.99}
\]
\[
p(y \mid x) = \mathcal{N}\left(y \mid Ax + b, L^{-1}\right) \tag{2.100}
\]
where $\mu$, $A$, and $b$ are parameters governing the means, and $\Lambda$ and $L$ are precision matrices. If $x$ has dimensionality $M$ and $y$ has dimensionality $D$, then the matrix $A$ has size $D \times M$.

First we find an expression for the joint distribution over $x$ and $y$. To do this, we define
\[
z = \begin{pmatrix} x \\ y \end{pmatrix} \tag{2.101}
\]
and then consider the log of the joint distribution
\[
\ln p(z) = \ln p(x) + \ln p(y \mid x)
= -\tfrac{1}{2}(x - \mu)^{\mathrm T}\Lambda(x - \mu)
- \tfrac{1}{2}(y - Ax - b)^{\mathrm T}L(y - Ax - b) + \text{const} \tag{2.102}
\]
where 'const' denotes terms independent of $x$ and $y$. As before, we see that this is a quadratic function of the components of $z$, and hence $p(z)$ is a Gaussian distribution. To find the precision of this Gaussian, we consider the second-order terms in (2.102), which can be written as
\[
-\tfrac{1}{2}x^{\mathrm T}\left(\Lambda + A^{\mathrm T}LA\right)x - \tfrac{1}{2}y^{\mathrm T}Ly
+ \tfrac{1}{2}y^{\mathrm T}LAx + \tfrac{1}{2}x^{\mathrm T}A^{\mathrm T}Ly
= -\tfrac{1}{2}\begin{pmatrix} x \\ y \end{pmatrix}^{\mathrm T}
\begin{pmatrix} \Lambda + A^{\mathrm T}LA & -A^{\mathrm T}L \\ -LA & L \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= -\tfrac{1}{2}z^{\mathrm T}Rz \tag{2.103}
\]
and so the Gaussian distribution over $z$ has precision (inverse covariance) matrix given by
\[
R = \begin{pmatrix} \Lambda + A^{\mathrm T}LA & -A^{\mathrm T}L \\ -LA & L \end{pmatrix}. \tag{2.104}
\]
The covariance matrix is found by taking the inverse of the precision, which can be done using the matrix inversion formula (2.76) to give (Exercise 2.29)
\[
\operatorname{cov}[z] = R^{-1} =
\begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^{\mathrm T} \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^{\mathrm T} \end{pmatrix}. \tag{2.105}
\]

112 Similarly, we can find the mean of the Gaussian distribution over $z$ by identifying the linear terms in (2.102), which are given by
\[
x^{\mathrm T}\Lambda\mu - x^{\mathrm T}A^{\mathrm T}Lb + y^{\mathrm T}Lb
= \begin{pmatrix} x \\ y \end{pmatrix}^{\mathrm T}
\begin{pmatrix} \Lambda\mu - A^{\mathrm T}Lb \\ Lb \end{pmatrix}. \tag{2.106}
\]
Using our earlier result (2.71), obtained by completing the square over the quadratic form of a multivariate Gaussian, we find that the mean of $z$ is given by
\[
\mathbb{E}[z] = R^{-1}\begin{pmatrix} \Lambda\mu - A^{\mathrm T}Lb \\ Lb \end{pmatrix}. \tag{2.107}
\]
Making use of (2.105), we then obtain (Exercise 2.30)
\[
\mathbb{E}[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}. \tag{2.108}
\]
Next we find an expression for the marginal distribution $p(y)$ in which we have marginalized over $x$. Recall that the marginal distribution over a subset of the components of a Gaussian random vector takes a particularly simple form when expressed in terms of the partitioned covariance matrix: specifically, its mean and covariance are given by (2.92) and (2.93), respectively (Section 2.3). Making use of (2.105) and (2.108), we see that the mean and covariance of the marginal distribution $p(y)$ are given by
\[
\mathbb{E}[y] = A\mu + b \tag{2.109}
\]
\[
\operatorname{cov}[y] = L^{-1} + A\Lambda^{-1}A^{\mathrm T}. \tag{2.110}
\]
A special case of this result is when $A = I$, in which case it reduces to the convolution of two Gaussians, for which we see that the mean of the convolution is the sum of the means of the two Gaussians, and the covariance of the convolution is the sum of their covariances.

Finally, we seek an expression for the conditional $p(x \mid y)$. Recall that the results for the conditional distribution are most easily expressed in terms of the partitioned precision matrix, using (2.73) and (2.75) (Section 2.3). Applying these results to (2.105) and (2.108), we see that the conditional distribution $p(x \mid y)$ has mean and covariance given by
\[
\mathbb{E}[x \mid y] = \left(\Lambda + A^{\mathrm T}LA\right)^{-1}\left\{A^{\mathrm T}L(y - b) + \Lambda\mu\right\} \tag{2.111}
\]
\[
\operatorname{cov}[x \mid y] = \left(\Lambda + A^{\mathrm T}LA\right)^{-1}. \tag{2.112}
\]
The evaluation of this conditional can be seen as an example of Bayes' theorem. We can interpret the distribution $p(x)$ as a prior distribution over $x$. If the variable $y$ is observed, then the conditional distribution $p(x \mid y)$ represents the corresponding posterior distribution over $x$. Having found the marginal and conditional distributions, we have effectively expressed the joint distribution $p(z) = p(x)p(y \mid x)$ in the form $p(x \mid y)p(y)$. These results are summarized below.

Marginal and Conditional Gaussians

Given a marginal Gaussian distribution for $x$ and a conditional Gaussian distribution for $y$ given $x$ in the form
\[
p(x) = \mathcal{N}\left(x \mid \mu, \Lambda^{-1}\right) \tag{2.113}
\]
\[
p(y \mid x) = \mathcal{N}\left(y \mid Ax + b, L^{-1}\right) \tag{2.114}
\]
the marginal distribution of $y$ and the conditional distribution of $x$ given $y$ are given by
\[
p(y) = \mathcal{N}\left(y \mid A\mu + b, L^{-1} + A\Lambda^{-1}A^{\mathrm T}\right) \tag{2.115}
\]
\[
p(x \mid y) = \mathcal{N}\left(x \mid \Sigma\left\{A^{\mathrm T}L(y - b) + \Lambda\mu\right\}, \Sigma\right) \tag{2.116}
\]
where
\[
\Sigma = \left(\Lambda + A^{\mathrm T}LA\right)^{-1}. \tag{2.117}
\]
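The boxed results translate directly into a few lines of linear algebra. A minimal sketch of (2.115)–(2.117) (Python/NumPy; the dimensions $M = 2$, $D = 3$ and all parameter values are invented for illustration):

```python
import numpy as np

# Illustrative linear Gaussian model: x in R^2, y in R^3.
mu = np.array([1.0, -1.0])                     # prior mean of x
Lam = np.diag([2.0, 1.0])                      # prior precision of x
A = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]])
b = np.zeros(3)
L = 4.0 * np.eye(3)                            # precision of y given x

# Marginal p(y) = N(y | A mu + b, L^{-1} + A Lam^{-1} A^T), from (2.115).
mean_y = A @ mu + b
cov_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Posterior p(x | y), from (2.116) and (2.117), for an observed y.
y = np.array([0.2, -0.3, 1.5])
Sigma = np.linalg.inv(Lam + A.T @ L @ A)
mean_x_given_y = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)
```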

113 2.3.4 Maximum likelihood for the Gaussian

Given a data set $X = (x_1, \ldots, x_N)^{\mathrm T}$ in which the observations $\{x_n\}$ are assumed to be drawn independently from a multivariate Gaussian distribution, we can estimate the parameters of the distribution by maximum likelihood. The log likelihood function is given by
\[
\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma|
- \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^{\mathrm T}\Sigma^{-1}(x_n - \mu). \tag{2.118}
\]
By simple rearrangement, we see that the likelihood function depends on the data set only through the two quantities
\[
\sum_{n=1}^{N}x_n, \qquad \sum_{n=1}^{N}x_n x_n^{\mathrm T}. \tag{2.119}
\]
These are known as the sufficient statistics for the Gaussian distribution. Using (C.19) from Appendix C, the derivative of the log likelihood with respect to $\mu$ is given by
\[
\frac{\partial}{\partial\mu}\ln p(X \mid \mu, \Sigma) = \sum_{n=1}^{N}\Sigma^{-1}(x_n - \mu) \tag{2.120}
\]
and setting this derivative to zero, we obtain the solution for the maximum likelihood estimate of the mean given by
\[
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}x_n \tag{2.121}
\]
which is the mean of the observed set of data points.

114 The maximization of (2.118) with respect to $\Sigma$ is rather more involved. The simplest approach is to ignore the symmetry constraint and show that the resulting solution is symmetric as required (Exercise 2.34). Alternative derivations of this result, which impose the symmetry and positive definiteness constraints explicitly, can be found in Magnus and Neudecker (1999). The result is as expected and takes the form
\[
\Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathrm T} \tag{2.122}
\]
which involves $\mu_{\mathrm{ML}}$ because this is the result of a joint maximization with respect to $\mu$ and $\Sigma$. Note that the solution (2.121) for $\mu_{\mathrm{ML}}$ does not depend on $\Sigma_{\mathrm{ML}}$, and so we can first evaluate $\mu_{\mathrm{ML}}$ and then use this to evaluate $\Sigma_{\mathrm{ML}}$.

If we evaluate the expectations of the maximum likelihood solutions under the true distribution, we obtain the following results (Exercise 2.35)
\[
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu \tag{2.123}
\]
\[
\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N - 1}{N}\Sigma. \tag{2.124}
\]
We see that the expectation of the maximum likelihood estimate for the mean is equal to the true mean. However, the maximum likelihood estimate for the covariance has an expectation that is less than the true value, and hence it is biased. We can correct this bias by defining a different estimator $\widetilde{\Sigma}$ given by
\[
\widetilde{\Sigma} = \frac{1}{N - 1}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathrm T}. \tag{2.125}
\]
Clearly from (2.122) and (2.124), the expectation of $\widetilde{\Sigma}$ is equal to $\Sigma$.
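The estimators (2.121), (2.122), and (2.125) amount to a handful of array operations. A minimal sketch (Python/NumPy; the synthetic two-dimensional data set is an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # N = 100 observations of a 2-d variable

mu_ml = X.mean(axis=0)                         # sample mean (2.121)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)              # biased ML covariance (2.122)
Sigma_tilde = diff.T @ diff / (len(X) - 1)     # bias-corrected estimator (2.125)
```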

115 2.3.5 Sequential estimation

Our discussion of the maximum likelihood solution for the parameters of a Gaussian distribution provides a convenient opportunity to give a more general discussion of the topic of sequential estimation for maximum likelihood. Sequential methods allow data points to be processed one at a time and then discarded, and are important both for on-line applications and where large data sets are involved, so that batch processing of all data points at once is infeasible.

Consider the result (2.121) for the maximum likelihood estimator of the mean, which we will denote by $\mu_{\mathrm{ML}}^{(N)}$ when it is based on $N$ observations. If we dissect out the contribution from the final data point $x_N$, we obtain
\[
\mu_{\mathrm{ML}}^{(N)} = \frac{1}{N}\sum_{n=1}^{N}x_n
= \frac{1}{N}x_N + \frac{1}{N}\sum_{n=1}^{N-1}x_n
= \frac{1}{N}x_N + \frac{N-1}{N}\mu_{\mathrm{ML}}^{(N-1)}
= \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{\mathrm{ML}}^{(N-1)}\right). \tag{2.126}
\]
This result has a nice interpretation, as follows. After observing $N - 1$ data points we have estimated $\mu$ by $\mu_{\mathrm{ML}}^{(N-1)}$. We now observe data point $x_N$, and we obtain our revised estimate $\mu_{\mathrm{ML}}^{(N)}$ by moving the old estimate a small amount, proportional to $1/N$, in the direction of the 'error signal' $(x_N - \mu_{\mathrm{ML}}^{(N-1)})$. Note that, as $N$ increases, the contribution from successive data points gets smaller.
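Expressed in code, the update (2.126) needs only the current estimate and a counter, so each data point can indeed be discarded after use. A minimal sketch (Python/NumPy; the synthetic data stream is an assumption of this example):

```python
import numpy as np

def sequential_mean(xs):
    """Running mean via the update (2.126); each point is seen once and discarded."""
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu += (x - mu) / n                     # mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N
        yield mu

stream = np.random.default_rng(1).normal(loc=0.8, scale=0.3, size=1000)
print(list(sequential_mean(stream))[-1])       # agrees with the batch mean stream.mean()
```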

116 The result (2.126) will clearly give the same answer as the batch result (2.121) because the two formulae are equivalent. However, we will not always be able to derive a sequential algorithm by this route, and so we seek a more general formulation of sequential learning, which leads us to the Robbins-Monro algorithm. Consider a pair of random variables $\theta$ and $z$ governed by a joint distribution $p(z, \theta)$. The conditional expectation of $z$ given $\theta$ defines a deterministic function $f(\theta)$ given by
\[
f(\theta) \equiv \mathbb{E}[z \mid \theta] = \int z\,p(z \mid \theta)\,\mathrm{d}z \tag{2.127}
\]
and illustrated schematically in Figure 2.10. Functions defined in this way are called regression functions.

Figure 2.10  A schematic illustration of two correlated random variables $z$ and $\theta$, together with the regression function $f(\theta)$ given by the conditional expectation $\mathbb{E}[z \mid \theta]$. The Robbins-Monro algorithm provides a general sequential procedure for finding the root $\theta^\star$ of such functions.

Our goal is to find the root $\theta^\star$ at which $f(\theta^\star) = 0$. If we had a large data set of observations of $z$ and $\theta$, then we could model the regression function directly and then obtain an estimate of its root. Suppose, however, that we observe values of $z$ one at a time and we wish to find a corresponding sequential estimation scheme for $\theta^\star$. The following general procedure for solving such problems was given by Robbins and Monro (1951). We shall assume that the conditional variance of $z$ is finite so that
\[
\mathbb{E}\left[(z - f)^2 \mid \theta\right] < \infty \tag{2.128}
\]
and we shall also, without loss of generality, consider the case where $f(\theta) > 0$ for $\theta > \theta^\star$ and $f(\theta) < 0$ for $\theta < \theta^\star$, as is the case in Figure 2.10. The Robbins-Monro procedure then defines a sequence of successive estimates of the root $\theta^\star$ given by
\[
\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\,z\!\left(\theta^{(N-1)}\right) \tag{2.129}
\]
where $z(\theta^{(N)})$ is an observed value of $z$ when $\theta$ takes the value $\theta^{(N)}$. The coefficients $\{a_N\}$ represent a sequence of positive numbers that satisfy the conditions
\[
\lim_{N \to \infty} a_N = 0 \tag{2.130}
\]
\[
\sum_{N=1}^{\infty} a_N = \infty \tag{2.131}
\]
\[
\sum_{N=1}^{\infty} a_N^2 < \infty. \tag{2.132}
\]
It can then be shown (Robbins and Monro, 1951; Fukunaga, 1990) that the sequence of estimates given by (2.129) does indeed converge to the root with probability one. Note that the first condition (2.130) ensures that the successive corrections decrease in magnitude so that the process can converge to a limiting value. The second condition (2.131) is required to ensure that the algorithm does not converge short of the root, and the third condition (2.132) is needed to ensure that the accumulated noise has finite variance and hence does not spoil convergence.

Now let us consider how a general maximum likelihood problem can be solved sequentially using the Robbins-Monro algorithm. By definition, the maximum likelihood solution $\theta_{\mathrm{ML}}$ is a stationary point of the log likelihood function and hence satisfies
\[
\left.\frac{\partial}{\partial\theta}\left\{\frac{1}{N}\sum_{n=1}^{N}\ln p(x_n \mid \theta)\right\}\right|_{\theta_{\mathrm{ML}}} = 0. \tag{2.133}
\]
Exchanging the derivative and the summation, and taking the limit $N \to \infty$, we have
\[
\lim_{N \to \infty}\frac{1}{N}\sum_{n=1}^{N}\frac{\partial}{\partial\theta}\ln p(x_n \mid \theta)
= \mathbb{E}_x\!\left[\frac{\partial}{\partial\theta}\ln p(x \mid \theta)\right] \tag{2.134}
\]
and so we see that finding the maximum likelihood solution corresponds to finding the root of a regression function. We can therefore apply the Robbins-Monro procedure, which now takes the form
\[
\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\,\frac{\partial}{\partial\theta^{(N-1)}}\ln p\!\left(x_N \mid \theta^{(N-1)}\right). \tag{2.135}
\]

117 Figure 2.11  In the case of a Gaussian distribution, with $\theta$ corresponding to the mean $\mu$, the regression function illustrated in Figure 2.10 takes the form of a straight line, as shown in red. In this case, the random variable $z$ corresponds to the derivative of the log likelihood function and is given by $(x - \mu_{\mathrm{ML}})/\sigma^2$, and its expectation, which defines the regression function, is a straight line given by $(\mu - \mu_{\mathrm{ML}})/\sigma^2$. The root of the regression function corresponds to the maximum likelihood estimator $\mu_{\mathrm{ML}}$.

As a specific example, we consider once again the sequential estimation of the mean of a Gaussian distribution, in which case the parameter $\theta^{(N)}$ is the estimate $\mu_{\mathrm{ML}}^{(N)}$ of the mean of the Gaussian, and the random variable $z$ is given by
\[
z = \frac{\partial}{\partial\mu_{\mathrm{ML}}}\ln p\!\left(x \mid \mu_{\mathrm{ML}}, \sigma^2\right) = \frac{1}{\sigma^2}\left(x - \mu_{\mathrm{ML}}\right). \tag{2.136}
\]
Thus the distribution of $z$ is Gaussian with mean $\mu - \mu_{\mathrm{ML}}$, as illustrated in Figure 2.11. Substituting (2.136) into (2.135), we obtain the univariate form of (2.126), provided we choose the coefficients $a_N$ to have the form $a_N = \sigma^2/N$. Note that although we have focussed on the case of a single variable, the same technique, together with the same restrictions (2.130)–(2.132) on the coefficients $a_N$, applies equally to the multivariate case (Blum, 1965).
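For this Gaussian example, the update (2.135) with $z$ from (2.136) and $a_N = \sigma^2/N$ is easy to simulate and, as noted above, reduces exactly to (2.126). A minimal sketch (Python/NumPy; the true mean, the variance, and the initial estimate are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma2 = 0.8, 0.25
theta = 0.0                                    # initial estimate of the mean

for N, x in enumerate(rng.normal(mu_true, np.sqrt(sigma2), size=5000), start=1):
    a = sigma2 / N                             # coefficients a_N = sigma^2 / N
    theta += a * (x - theta) / sigma2          # update (2.135) with z from (2.136)

print(theta)                                   # converges towards mu_true
```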

118 2.3.6 Bayesian inference for the Gaussian

The maximum likelihood framework gave point estimates for the parameters $\mu$ and $\Sigma$. Now we develop a Bayesian treatment by introducing prior distributions over these parameters. Let us begin with a simple example in which we consider a single Gaussian random variable $x$. We shall suppose that the variance $\sigma^2$ is known, and we consider the task of inferring the mean $\mu$ given a set of $N$ observations $X = \{x_1, \ldots, x_N\}$. The likelihood function, that is, the probability of the observed data given $\mu$, viewed as a function of $\mu$, is given by
\[
p(X \mid \mu) = \prod_{n=1}^{N}p(x_n \mid \mu)
= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right\}. \tag{2.137}
\]
Again we emphasize that the likelihood function $p(X \mid \mu)$ is not a probability distribution over $\mu$ and is not normalized.

We see that the likelihood function takes the form of the exponential of a quadratic form in $\mu$. Thus if we choose a prior $p(\mu)$ given by a Gaussian, it will be a conjugate distribution for this likelihood function, because the corresponding posterior will be a product of two exponentials of quadratic functions of $\mu$ and hence will also be Gaussian. We therefore take our prior distribution to be
\[
p(\mu) = \mathcal{N}\left(\mu \mid \mu_0, \sigma_0^2\right) \tag{2.138}
\]
and the posterior distribution is given by
\[
p(\mu \mid X) \propto p(X \mid \mu)\,p(\mu). \tag{2.139}
\]
Simple manipulation involving completing the square in the exponent shows that the posterior distribution is given by (Exercise 2.38)
\[
p(\mu \mid X) = \mathcal{N}\left(\mu \mid \mu_N, \sigma_N^2\right) \tag{2.140}
\]
where
\[
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}} \tag{2.141}
\]
\[
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \tag{2.142}
\]
in which $\mu_{\mathrm{ML}}$ is the maximum likelihood solution for $\mu$ given by the sample mean
\[
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}x_n. \tag{2.143}
\]
It is worth spending a moment studying the form of the posterior mean and variance. First of all, we note that the mean of the posterior distribution given by (2.141) is a compromise between the prior mean $\mu_0$ and the maximum likelihood solution $\mu_{\mathrm{ML}}$. If the number of observed data points $N = 0$, then (2.141) reduces to the prior mean as expected. For $N \to \infty$, the posterior mean is given by the maximum likelihood solution. Similarly, consider the result (2.142) for the variance of the posterior distribution. We see that this is most naturally expressed in terms of the inverse variance, which is called the precision. Furthermore, the precisions are additive, so that the precision of the posterior is given by the precision of the prior plus one contribution of the data precision from each of the observed data points. As we increase the number of observed data points, the precision steadily increases, corresponding to a posterior distribution with steadily decreasing variance. With no observed data points, we have the prior variance, whereas if the number of data points $N \to \infty$, the variance $\sigma_N^2$ goes to zero and the posterior distribution becomes infinitely peaked around the maximum likelihood solution. We therefore see that the maximum likelihood result of a point estimate for $\mu$ given by (2.143) is recovered precisely from the Bayesian formalism in the limit of an infinite number of observations. Note also that for finite $N$, if we take the limit $\sigma_0^2 \to \infty$, in which the prior has infinite variance, then the posterior mean (2.141) reduces to the maximum likelihood result, while from (2.142) the posterior variance is given by $\sigma_N^2 = \sigma^2/N$.
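The posterior update (2.140)–(2.142) is a two-line computation. A minimal sketch (Python/NumPy; the prior and data-generating values loosely mimic the setup of Figure 2.12 below and are otherwise assumptions of this example):

```python
import numpy as np

def posterior_mean_variance(x, mu0, sigma02, sigma2):
    """Posterior N(mu | mu_N, sigma_N^2) for a Gaussian mean with known
    variance sigma2, using (2.141) and (2.142)."""
    N, mu_ml = len(x), np.mean(x)
    mu_N = (sigma2 * mu0 + N * sigma02 * mu_ml) / (N * sigma02 + sigma2)
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)
    return mu_N, sigma_N2

x = np.random.default_rng(3).normal(0.8, np.sqrt(0.1), size=10)
print(posterior_mean_variance(x, mu0=0.0, sigma02=0.1, sigma2=0.1))
```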

119 Figure 2.12  Illustration of Bayesian inference for the mean $\mu$ of a Gaussian distribution, in which the variance is assumed to be known. The curves show the prior distribution over $\mu$ (the curve labelled $N = 0$), which in this case is itself Gaussian, along with the posterior distribution given by (2.140) for increasing numbers of data points $N = 1$, $N = 2$, and $N = 10$. The data points are generated from a Gaussian of mean $0.8$ and variance $0.1$, and the prior is chosen to have mean $0$. In both the prior and the likelihood function, the variance is set to the true value.

We illustrate our analysis of Bayesian inference for the mean of a Gaussian distribution in Figure 2.12. The generalization of this result to the case of a $D$-dimensional Gaussian random variable $x$ with known covariance and unknown mean is straightforward (Exercise 2.40).

We have already seen (Section 2.3.5) how the maximum likelihood expression for the mean of a Gaussian can be re-cast as a sequential update formula in which the mean after observing $N$ data points was expressed in terms of the mean after observing $N - 1$ data points together with the contribution from data point $x_N$. In fact, the Bayesian paradigm leads very naturally to a sequential view of the inference problem. To see this in the context of the inference of the mean of a Gaussian, we write the posterior distribution with the contribution from the final data point $x_N$ separated out, so that
\[
p(\mu \mid D) \propto \left[p(\mu)\prod_{n=1}^{N-1}p(x_n \mid \mu)\right]p(x_N \mid \mu). \tag{2.144}
\]
The term in square brackets is (up to a normalization coefficient) just the posterior distribution after observing $N - 1$ data points. We see that this can be viewed as a prior distribution, which is combined using Bayes' theorem with the likelihood function associated with data point $x_N$ to arrive at the posterior distribution after observing $N$ data points. This sequential view of Bayesian inference is very general and applies to any problem in which the observed data are assumed to be independent and identically distributed.

So far, we have assumed that the variance of the Gaussian distribution over the data is known and our goal is to infer the mean. Now let us suppose that the mean is known and we wish to infer the variance. Again, our calculations will be greatly simplified if we choose a conjugate form for the prior distribution. It turns out to be most convenient to work with the precision $\lambda \equiv 1/\sigma^2$. The likelihood function for $\lambda$ takes the form
\[
p(X \mid \lambda) = \prod_{n=1}^{N}\mathcal{N}\left(x_n \mid \mu, \lambda^{-1}\right)
\propto \lambda^{N/2}\exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\}. \tag{2.145}
\]

120 Figure 2.13  Plot of the gamma distribution $\operatorname{Gam}(\lambda \mid a, b)$ defined by (2.146) for various values of the parameters $a$ and $b$ (the three panels use $a = 0.1$, $b = 0.1$; $a = 1$, $b = 1$; and $a = 4$, $b = 6$).

The corresponding conjugate prior should therefore be proportional to the product of a power of $\lambda$ and the exponential of a linear function of $\lambda$. This corresponds to the gamma distribution, which is defined by
\[
\operatorname{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\,b^a\lambda^{a-1}\exp(-b\lambda). \tag{2.146}
\]
Here $\Gamma(a)$ is the gamma function defined by (1.141), which ensures that (2.146) is correctly normalized. The gamma distribution has a finite integral if $a > 0$, and the distribution itself is finite if $a \geq 1$ (Exercise 2.41). It is plotted, for various values of $a$ and $b$, in Figure 2.13. The mean and variance of the gamma distribution are given by (Exercise 2.42)
\[
\mathbb{E}[\lambda] = \frac{a}{b} \tag{2.147}
\]
\[
\operatorname{var}[\lambda] = \frac{a}{b^2}. \tag{2.148}
\]
Consider a prior distribution $\operatorname{Gam}(\lambda \mid a_0, b_0)$. If we multiply by the likelihood function (2.145), then we obtain a posterior distribution
\[
p(\lambda \mid X) \propto \lambda^{a_0 - 1}\lambda^{N/2}\exp\left\{-b_0\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} \tag{2.149}
\]
which we recognize as a gamma distribution of the form $\operatorname{Gam}(\lambda \mid a_N, b_N)$ where
\[
a_N = a_0 + \frac{N}{2} \tag{2.150}
\]
\[
b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\sigma_{\mathrm{ML}}^2 \tag{2.151}
\]
where $\sigma_{\mathrm{ML}}^2$ is the maximum likelihood estimator of the variance. Note that in (2.149) there is no need to keep track of the normalization constants in the prior and the likelihood function because, if required, the correct coefficient can be found at the end using the normalized form (2.146) for the gamma distribution.
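A minimal sketch of the posterior update (2.150)–(2.151) (Python/NumPy; the prior values $a_0 = b_0 = 1$ and the synthetic data are assumptions of this example):

```python
import numpy as np

def posterior_precision(x, mu, a0, b0):
    """Gamma posterior Gam(lambda | a_N, b_N) for the precision of a Gaussian
    with known mean mu, using (2.150) and (2.151)."""
    a_N = a0 + len(x) / 2.0
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_N, b_N

x = np.random.default_rng(4).normal(0.0, 0.5, size=50)   # true precision is 4
a_N, b_N = posterior_precision(x, mu=0.0, a0=1.0, b0=1.0)
print(a_N / b_N)                               # posterior mean of lambda, cf. (2.147)
```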

121 From (2.150), we see that the effect of observing $N$ data points is to increase the value of the coefficient $a$ by $N/2$. Thus we can interpret the parameter $a_0$ in the prior in terms of $2a_0$ 'effective' prior observations. Similarly, from (2.151) we see that the $N$ data points contribute $N\sigma_{\mathrm{ML}}^2/2$ to the parameter $b$, where $\sigma_{\mathrm{ML}}^2$ is the variance, and so we can interpret the parameter $b_0$ in the prior as arising from the $2a_0$ 'effective' prior observations having variance $2b_0/(2a_0) = b_0/a_0$. Recall that we made an analogous interpretation for the Dirichlet prior (Section 2.2). These distributions are examples of the exponential family, and we shall see that the interpretation of a conjugate prior in terms of effective fictitious data points is a general one for the exponential family of distributions.

Instead of working with the precision, we can consider the variance itself. The conjugate prior in this case is called the inverse gamma distribution, although we shall not discuss it further because we will find it more convenient to work with the precision.

Now suppose that both the mean and the precision are unknown. To find a conjugate prior, we consider the dependence of the likelihood function on $\mu$ and $\lambda$
\[
p(X \mid \mu, \lambda) = \prod_{n=1}^{N}\left(\frac{\lambda}{2\pi}\right)^{1/2}\exp\left\{-\frac{\lambda}{2}(x_n - \mu)^2\right\}
\propto \left[\lambda^{1/2}\exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^{N}
\exp\left\{\lambda\mu\sum_{n=1}^{N}x_n - \frac{\lambda}{2}\sum_{n=1}^{N}x_n^2\right\}. \tag{2.152}
\]
We now wish to identify a prior distribution $p(\mu, \lambda)$ that has the same functional dependence on $\mu$ and $\lambda$ as the likelihood function and that should therefore take the form
\[
p(\mu, \lambda) \propto \left[\lambda^{1/2}\exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^{\beta}\exp\{c\lambda\mu - d\lambda\}
= \exp\left\{-\frac{\beta\lambda}{2}\left(\mu - \frac{c}{\beta}\right)^2\right\}
\lambda^{\beta/2}\exp\left\{-\left(d - \frac{c^2}{2\beta}\right)\lambda\right\} \tag{2.153}
\]
where $c$, $d$, and $\beta$ are constants. Since we can always write $p(\mu, \lambda) = p(\mu \mid \lambda)p(\lambda)$, we can find $p(\mu \mid \lambda)$ and $p(\lambda)$ by inspection. In particular, we see that $p(\mu \mid \lambda)$ is a Gaussian whose precision is a linear function of $\lambda$ and that $p(\lambda)$ is a gamma distribution, so that the normalized prior takes the form
\[
p(\mu, \lambda) = \mathcal{N}\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right)\operatorname{Gam}(\lambda \mid a, b) \tag{2.154}
\]
where we have defined new constants given by $\mu_0 = c/\beta$, $a = (1 + \beta)/2$, and $b = d - c^2/(2\beta)$. The distribution (2.154) is called the normal-gamma or Gaussian-gamma distribution and is plotted in Figure 2.14. Note that this is not simply the product of an independent Gaussian prior over $\mu$ and a gamma prior over $\lambda$, because the precision of $\mu$ is a linear function of $\lambda$. Even if we chose a prior in which $\mu$ and $\lambda$ were independent, the posterior distribution would exhibit a coupling between the precision of $\mu$ and the value of $\lambda$.

122 Figure 2.14  Contour plot of the normal-gamma distribution (2.154) for parameter values $\mu_0 = 0$, $\beta = 2$, $a = 5$ and $b = 6$.

In the case of the multivariate Gaussian distribution $\mathcal{N}(x \mid \mu, \Lambda^{-1})$ for a $D$-dimensional variable $x$, the conjugate prior distribution for the mean $\mu$, assuming the precision is known, is again a Gaussian. For known mean and unknown precision matrix $\Lambda$, the conjugate prior is the Wishart distribution given by (Exercise 2.45)
\[
\mathcal{W}(\Lambda \mid W, \nu) = B\,|\Lambda|^{(\nu - D - 1)/2}\exp\left(-\frac{1}{2}\operatorname{Tr}\left(W^{-1}\Lambda\right)\right) \tag{2.155}
\]
where $\nu$ is called the number of degrees of freedom of the distribution, $W$ is a $D \times D$ scale matrix, and $\operatorname{Tr}(\cdot)$ denotes the trace. The normalization constant $B$ is given by
\[
B(W, \nu) = |W|^{-\nu/2}\left(2^{\nu D/2}\,\pi^{D(D-1)/4}\prod_{i=1}^{D}\Gamma\!\left(\frac{\nu + 1 - i}{2}\right)\right)^{-1}. \tag{2.156}
\]
Again, it is also possible to define a conjugate prior over the covariance matrix itself, rather than over the precision matrix, which leads to the inverse Wishart distribution, although we shall not discuss this further. If both the mean and the precision are unknown, then, following a similar line of reasoning to the univariate case, the conjugate prior is given by
\[
p(\mu, \Lambda \mid \mu_0, \beta, W, \nu) = \mathcal{N}\left(\mu \mid \mu_0, (\beta\Lambda)^{-1}\right)\mathcal{W}(\Lambda \mid W, \nu) \tag{2.157}
\]
which is known as the normal-Wishart or Gaussian-Wishart distribution.

2.3.7 Student's t-distribution

We have seen (Section 2.3.6) that the conjugate prior for the precision of a Gaussian is given by a gamma distribution. If we have a univariate Gaussian $\mathcal{N}(x \mid \mu, \tau^{-1})$ together with a gamma prior $\operatorname{Gam}(\tau \mid a, b)$ and we integrate out the precision, we obtain the marginal distribution of $x$ in the form (Exercise 2.46)

123 \[
p(x \mid \mu, a, b) = \int_0^{\infty}\mathcal{N}\left(x \mid \mu, \tau^{-1}\right)\operatorname{Gam}(\tau \mid a, b)\,\mathrm{d}\tau \tag{2.158}
\]
\[
= \int_0^{\infty}\frac{b^a e^{-b\tau}\tau^{a-1}}{\Gamma(a)}\left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left\{-\frac{\tau}{2}(x - \mu)^2\right\}\mathrm{d}\tau
= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x - \mu)^2}{2}\right]^{-a - 1/2}\Gamma(a + 1/2)
\]
where we have made the change of variable $z = \tau\left[b + (x - \mu)^2/2\right]$. By convention we define new parameters given by $\nu = 2a$ and $\lambda = a/b$, in terms of which the distribution $p(x \mid \mu, a, b)$ takes the form
\[
\operatorname{St}(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x - \mu)^2}{\nu}\right]^{-\nu/2 - 1/2} \tag{2.159}
\]
which is known as Student's t-distribution. The parameter $\lambda$ is sometimes called the precision of the t-distribution, even though it is not in general equal to the inverse of the variance. The parameter $\nu$ is called the degrees of freedom, and its effect is illustrated in Figure 2.15. For the particular case of $\nu = 1$, the t-distribution reduces to the Cauchy distribution, while in the limit $\nu \to \infty$ the t-distribution $\operatorname{St}(x \mid \mu, \lambda, \nu)$ becomes a Gaussian $\mathcal{N}(x \mid \mu, \lambda^{-1})$ with mean $\mu$ and precision $\lambda$ (Exercise 2.47).

Figure 2.15  Plot of Student's t-distribution (2.159) for $\mu = 0$ and $\lambda = 1$, for the values $\nu = 0.1$, $\nu = 1.0$, and $\nu \to \infty$. The limit $\nu \to \infty$ corresponds to a Gaussian distribution with mean $\mu$ and precision $\lambda$.

From (2.158), we see that Student's t-distribution is obtained by adding up an infinite number of Gaussian distributions having the same mean but different precisions. This can be interpreted as an infinite mixture of Gaussians (Gaussian mixtures will be discussed in detail in Section 2.3.9). The result is a distribution that in general has longer 'tails' than a Gaussian, as was seen in Figure 2.15. This gives the t-distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers. The robustness of the t-distribution is illustrated in Figure 2.16, which compares the maximum likelihood solutions for a Gaussian and a t-distribution. Note that the maximum likelihood solution for the t-distribution can be found using the expectation-maximization (EM) algorithm (Exercise 12.24). Here we see that the effect of a small number of

124 outliers is much less significant for the t-distribution than for the Gaussian. Outliers can arise in practical applications either because the process that generates the data corresponds to a distribution having a heavy tail or simply through mislabelled data. Robustness is also an important property for regression problems. Unsurprisingly, the least squares approach to regression does not exhibit robustness, because it corresponds to maximum likelihood under a (conditional) Gaussian distribution. By basing a regression model on a heavy-tailed distribution such as a t-distribution, we obtain a more robust model.

Figure 2.16  Illustration of the robustness of Student's t-distribution compared to a Gaussian. (a) Histogram distribution of 30 data points drawn from a Gaussian distribution, together with the maximum likelihood fit obtained from a t-distribution (red curve) and a Gaussian (green curve, largely hidden by the red curve). Because the t-distribution contains the Gaussian as a special case it gives almost the same solution as the Gaussian. (b) The same data set but with three additional outlying data points, showing how the Gaussian (green curve) is strongly distorted by the outliers, whereas the t-distribution (red curve) is relatively unaffected.

If we go back to (2.158) and substitute the alternative parameters $\nu = 2a$, $\lambda = a/b$, and $\eta = \tau b/a$, we see that the t-distribution can be written in the form
\[
\operatorname{St}(x \mid \mu, \lambda, \nu) = \int_0^{\infty}\mathcal{N}\left(x \mid \mu, (\eta\lambda)^{-1}\right)\operatorname{Gam}(\eta \mid \nu/2, \nu/2)\,\mathrm{d}\eta. \tag{2.160}
\]
We can then generalize this to a multivariate Gaussian $\mathcal{N}(x \mid \mu, \Lambda)$ to obtain the corresponding multivariate Student's t-distribution in the form
\[
\operatorname{St}(x \mid \mu, \Lambda, \nu) = \int_0^{\infty}\mathcal{N}\left(x \mid \mu, (\eta\Lambda)^{-1}\right)\operatorname{Gam}(\eta \mid \nu/2, \nu/2)\,\mathrm{d}\eta. \tag{2.161}
\]
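Equation (2.160) also provides a convenient way of drawing samples from the t-distribution: draw a precision-scaling variable $\eta$ from the gamma distribution and then draw $x$ from the corresponding Gaussian. A minimal sketch (Python/NumPy; the parameter values are illustrative):

```python
import numpy as np

def sample_student_t(mu, lam, nu, size, rng):
    """Draw from St(x | mu, lam, nu) via the scale-mixture form (2.160):
    eta ~ Gam(nu/2, nu/2), then x ~ N(mu, (eta * lam)^{-1})."""
    eta = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=size)  # rate nu/2 -> scale 2/nu
    return rng.normal(mu, 1.0 / np.sqrt(eta * lam))

rng = np.random.default_rng(5)
samples = sample_student_t(mu=0.0, lam=1.0, nu=3.0, size=10000, rng=rng)
print(samples.std())    # spread exceeds that of N(0, 1), reflecting the heavier tails
```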

125 Using the same technique as for the univariate case, we can evaluate this integral to give (Exercise 2.48)
\[
\operatorname{St}(x \mid \mu, \Lambda, \nu) = \frac{\Gamma(D/2 + \nu/2)}{\Gamma(\nu/2)}\,\frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}}\left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2 - \nu/2} \tag{2.162}
\]
where $D$ is the dimensionality of $x$, and $\Delta^2$ is the squared Mahalanobis distance defined by
\[
\Delta^2 = (x - \mu)^{\mathrm T}\Lambda(x - \mu). \tag{2.163}
\]
This is the multivariate form of Student's t-distribution and satisfies the following properties (Exercise 2.49)
\[
\mathbb{E}[x] = \mu, \quad \text{if } \nu > 1 \tag{2.164}
\]
\[
\operatorname{cov}[x] = \frac{\nu}{\nu - 2}\,\Lambda^{-1}, \quad \text{if } \nu > 2 \tag{2.165}
\]
\[
\operatorname{mode}[x] = \mu \tag{2.166}
\]
with corresponding results for the univariate case.

2.3.8 Periodic variables

Although Gaussian distributions are of great practical significance, both in their own right and as building blocks for more complex probabilistic models, there are situations in which they are inappropriate as density models for continuous variables. One important case, which arises in practical applications, is that of periodic variables.

An example of a periodic variable would be the wind direction at a particular geographical location. We might, for instance, measure values of wind direction on a number of days and wish to summarize these using a parametric distribution. Another example is calendar time, where we may be interested in modelling quantities that are believed to be periodic over 24 hours or over an annual cycle. Such quantities can conveniently be represented using an angular (polar) coordinate $0 \leq \theta < 2\pi$.

We might be tempted to treat periodic variables by choosing some direction as the origin and then applying a conventional distribution such as the Gaussian. Such an approach, however, would give results that were strongly dependent on the arbitrary choice of origin. Suppose, for instance, that we have two observations at $\theta_1 = 1^\circ$ and $\theta_2 = 359^\circ$, and we model them using a standard univariate Gaussian distribution. If we choose the origin at $0^\circ$, then the sample mean of this data set will be $180^\circ$, with standard deviation $179^\circ$, whereas if we choose the origin at $180^\circ$, then the mean will be $0^\circ$ and the standard deviation will be $1^\circ$. We clearly need to develop a special approach for the treatment of periodic variables.

Let us consider the problem of evaluating the mean of a set of observations $D = \{\theta_1, \ldots, \theta_N\}$ of a periodic variable. From now on, we shall assume that $\theta$ is measured in radians. We have already seen that the simple average $(\theta_1 + \cdots + \theta_N)/N$ will be strongly coordinate dependent. To find an invariant measure of the mean, we note that the observations can be viewed as points on the unit circle and can therefore be described instead by two-dimensional unit vectors $x_1, \ldots, x_N$, where $\|x_n\| = 1$ for $n = 1, \ldots, N$, as illustrated in Figure 2.17.

126 Figure 2.17  Illustration of the representation of values $\theta_n$ of a periodic variable as two-dimensional vectors $x_n$ living on the unit circle. Also shown is the average $\bar{x}$ of those vectors, with length $\bar{r}$ and angle $\bar\theta$.

We can average the vectors $\{x_n\}$ to give
\[
\bar{x} = \frac{1}{N}\sum_{n=1}^{N}x_n \tag{2.167}
\]
and then find the corresponding angle $\bar\theta$ of this average. Clearly, this definition will ensure that the location of the mean is independent of the origin of the angular coordinate. Note that $\bar{x}$ will typically lie inside the unit circle. The Cartesian coordinates of the observations are given by $x_n = (\cos\theta_n, \sin\theta_n)$, and we can write the Cartesian coordinates of the sample mean in the form $\bar{x} = (\bar{r}\cos\bar\theta, \bar{r}\sin\bar\theta)$. Substituting into (2.167) and equating the $x_1$ and $x_2$ components then gives
\[
\bar{r}\cos\bar\theta = \frac{1}{N}\sum_{n=1}^{N}\cos\theta_n, \qquad
\bar{r}\sin\bar\theta = \frac{1}{N}\sum_{n=1}^{N}\sin\theta_n. \tag{2.168}
\]
Taking the ratio, and using the identity $\tan\theta = \sin\theta/\cos\theta$, we can solve for $\bar\theta$ to give
\[
\bar\theta = \tan^{-1}\left\{\frac{\sum_n\sin\theta_n}{\sum_n\cos\theta_n}\right\}. \tag{2.169}
\]
Shortly, we shall see how this result arises naturally as the maximum likelihood estimator for an appropriately defined distribution over a periodic variable.

We now consider a periodic generalization of the Gaussian called the von Mises distribution. Here we shall limit our attention to univariate distributions, although periodic distributions can also be found over hyperspheres of arbitrary dimension. For an extensive discussion of periodic distributions, see Mardia and Jupp (2000).
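Equation (2.169) is implemented most safely with the two-argument arctangent, which keeps track of the signs of the numerator and denominator and hence of the quadrant. A minimal sketch (Python/NumPy), reusing the $1^\circ$/$359^\circ$ example from the text:

```python
import numpy as np

def circular_mean(theta):
    """Mean direction (2.169) of angles in radians; arctan2 resolves the quadrant."""
    return np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())

theta = np.deg2rad([1.0, 359.0])               # the two observations from the text
print(np.rad2deg(circular_mean(theta)))        # approximately 0, not 180
```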

127 By convention, we will consider distributions $p(\theta)$ that have period $2\pi$. Any probability density $p(\theta)$ defined over $\theta$ must not only be nonnegative and integrate to one, but it must also be periodic. Thus $p(\theta)$ must satisfy the three conditions
\[
p(\theta) \geq 0 \tag{2.170}
\]
\[
\int_0^{2\pi}p(\theta)\,\mathrm{d}\theta = 1 \tag{2.171}
\]
\[
p(\theta + 2\pi) = p(\theta). \tag{2.172}
\]
From (2.172), it follows that $p(\theta + M2\pi) = p(\theta)$ for any integer $M$.

Figure 2.18  The von Mises distribution can be derived by considering a two-dimensional Gaussian of the form (2.173), whose density contours are shown in blue, and conditioning on the unit circle, shown in red.

We can easily obtain a Gaussian-like distribution that satisfies these three properties as follows. Consider a Gaussian distribution over two variables $x = (x_1, x_2)$ having mean $\mu = (\mu_1, \mu_2)$ and covariance matrix $\Sigma = \sigma^2 I$, where $I$ is the $2 \times 2$ identity matrix, so that
\[
p(x_1, x_2) = \frac{1}{2\pi\sigma^2}\exp\left\{-\frac{(x_1 - \mu_1)^2 + (x_2 - \mu_2)^2}{2\sigma^2}\right\}. \tag{2.173}
\]
The contours of constant $p(x)$ are circles, as illustrated in Figure 2.18. Now suppose we consider the value of this distribution along a circle of fixed radius. Then by construction this distribution will be periodic, although it will not be normalized. We can determine the form of this distribution by transforming from Cartesian coordinates $(x_1, x_2)$ to polar coordinates $(r, \theta)$ so that
\[
x_1 = r\cos\theta, \qquad x_2 = r\sin\theta. \tag{2.174}
\]
We also map the mean $\mu$ into polar coordinates by writing
\[
\mu_1 = r_0\cos\theta_0, \qquad \mu_2 = r_0\sin\theta_0. \tag{2.175}
\]
Next we substitute these transformations into the two-dimensional Gaussian distribution (2.173), and then condition on the unit circle $r = 1$, noting that we are interested only in the dependence on $\theta$. Focussing on the exponent in the Gaussian distribution, we have
\[
-\frac{1}{2\sigma^2}\left\{(r\cos\theta - r_0\cos\theta_0)^2 + (r\sin\theta - r_0\sin\theta_0)^2\right\}
= -\frac{1}{2\sigma^2}\left\{1 + r_0^2 - 2r_0\cos\theta\cos\theta_0 - 2r_0\sin\theta\sin\theta_0\right\}
= \frac{r_0}{\sigma^2}\cos(\theta - \theta_0) + \text{const} \tag{2.176}
\]

128 where 'const' denotes terms independent of $\theta$, and we have made use of the following trigonometrical identities (Exercise 2.51)
\[
\cos^2 A + \sin^2 A = 1 \tag{2.177}
\]
\[
\cos A\cos B + \sin A\sin B = \cos(A - B). \tag{2.178}
\]
If we now define $m = r_0/\sigma^2$, we obtain our final expression for the distribution of $p(\theta)$ along the unit circle $r = 1$ in the form
\[
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)}\exp\left\{m\cos(\theta - \theta_0)\right\} \tag{2.179}
\]
which is called the von Mises distribution, or the circular normal. Here the parameter $\theta_0$ corresponds to the mean of the distribution, while $m$, which is known as the concentration parameter, is analogous to the inverse variance (precision) of the Gaussian. The normalization coefficient in (2.179) is expressed in terms of $I_0(m)$, which is the zeroth-order Bessel function of the first kind (Abramowitz and Stegun, 1965) and is defined by
\[
I_0(m) = \frac{1}{2\pi}\int_0^{2\pi}\exp\{m\cos\theta\}\,\mathrm{d}\theta. \tag{2.180}
\]
For large $m$, the distribution becomes approximately Gaussian (Exercise 2.52). The von Mises distribution is plotted in Figure 2.19, and the function $I_0(m)$ is plotted in Figure 2.20.

Figure 2.19  The von Mises distribution plotted for two different parameter values ($m = 5$, $\theta_0 = \pi/4$ and $m = 1$, $\theta_0 = 3\pi/4$), shown as a Cartesian plot on the left and as the corresponding polar plot on the right.

Now consider the maximum likelihood estimators for the parameters $\theta_0$ and $m$ for the von Mises distribution. The log likelihood function is given by
\[
\ln p(D \mid \theta_0, m) = -N\ln(2\pi) - N\ln I_0(m) + m\sum_{n=1}^{N}\cos(\theta_n - \theta_0). \tag{2.181}
\]

129 Figure 2.20  Plot of the Bessel function $I_0(m)$ defined by (2.180), together with the function $A(m)$ defined by (2.186).

Setting the derivative with respect to $\theta_0$ equal to zero gives
\[
\sum_{n=1}^{N}\sin(\theta_n - \theta_0) = 0. \tag{2.182}
\]
To solve for $\theta_0$, we make use of the trigonometric identity (Exercise 2.53)
\[
\sin(A - B) = \cos B\sin A - \cos A\sin B \tag{2.183}
\]
from which we obtain
\[
\theta_0^{\mathrm{ML}} = \tan^{-1}\left\{\frac{\sum_n\sin\theta_n}{\sum_n\cos\theta_n}\right\} \tag{2.184}
\]
which we recognize as the result (2.169) obtained earlier for the mean of the observations viewed in a two-dimensional Cartesian space.

Similarly, maximizing (2.181) with respect to $m$, and making use of $I_0'(m) = I_1(m)$ (Abramowitz and Stegun, 1965), we have
\[
A\left(m_{\mathrm{ML}}\right) = \frac{1}{N}\sum_{n=1}^{N}\cos\left(\theta_n - \theta_0^{\mathrm{ML}}\right) \tag{2.185}
\]
where we have substituted the maximum likelihood solution $\theta_0^{\mathrm{ML}}$ (recalling that we are performing a joint optimization over $\theta_0$ and $m$), and we have defined
\[
A(m) = \frac{I_1(m)}{I_0(m)}. \tag{2.186}
\]
The function $A(m)$ is plotted in Figure 2.20. Making use of the trigonometric identity (2.178), we can write (2.185) in the form
\[
A\left(m_{\mathrm{ML}}\right) = \left(\frac{1}{N}\sum_{n=1}^{N}\cos\theta_n\right)\cos\theta_0^{\mathrm{ML}}
+ \left(\frac{1}{N}\sum_{n=1}^{N}\sin\theta_n\right)\sin\theta_0^{\mathrm{ML}}. \tag{2.187}
\]
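A minimal sketch of this fitting procedure (Python with SciPy; the bracketing interval for the root search and the synthetic data are assumptions of this example): $\theta_0^{\mathrm{ML}}$ comes from (2.184), and $m_{\mathrm{ML}}$ from inverting $A(m)$ numerically.

```python
import numpy as np
from scipy.special import i0, i1
from scipy.optimize import brentq

def fit_von_mises(theta):
    """ML fit of the von Mises distribution: theta_0 from (2.184), and m by
    numerically inverting A(m) = I_1(m)/I_0(m) from (2.185)-(2.187)."""
    s, c = np.sin(theta).mean(), np.cos(theta).mean()
    theta0 = np.arctan2(s, c)
    r = np.hypot(s, c)                         # right-hand side of (2.187)
    m = brentq(lambda m: i1(m) / i0(m) - r, 1e-6, 1e3)
    return theta0, m

rng = np.random.default_rng(6)
theta = rng.vonmises(mu=np.pi / 4, kappa=5.0, size=2000)
print(fit_von_mises(theta))                    # close to (pi/4, 5)
```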

130 The right-hand side of (2.187) is easily evaluated, and the function $A(m)$ can be inverted numerically.

Figure 2.21  Plots of the 'Old Faithful' data in which the blue curves show contours of constant probability density. On the left is a single Gaussian distribution which has been fitted to the data using maximum likelihood. Note that this distribution fails to capture the two clumps in the data and indeed places much of its probability mass in the central region between the clumps where the data are relatively sparse. On the right the distribution is given by a linear combination of two Gaussians, which has been fitted to the data by maximum likelihood using techniques discussed in Chapter 9, and which gives a better representation of the data.

For completeness, we mention briefly some alternative techniques for the construction of periodic distributions. The simplest approach is to use a histogram of observations in which the angular coordinate is divided into fixed bins. This has the virtue of simplicity and flexibility but also suffers from significant limitations, as we shall see when we discuss histogram methods in more detail in Section 2.5. Another approach starts, like the von Mises distribution, from a Gaussian distribution over a Euclidean space but now marginalizes onto the unit circle rather than conditioning (Mardia and Jupp, 2000). However, this leads to more complex forms of distribution and will not be discussed further. Finally, any valid distribution over the real axis (such as a Gaussian) can be turned into a periodic distribution by mapping successive intervals of width $2\pi$ onto the periodic variable $(0, 2\pi)$, which corresponds to 'wrapping' the real axis around the unit circle. Again, the resulting distribution is more complex to handle than the von Mises distribution.

One limitation of the von Mises distribution is that it is unimodal. By forming mixtures of von Mises distributions, we obtain a flexible framework for modelling periodic variables that can handle multimodality. For an example of a machine learning application that makes use of von Mises distributions, see Lawrence et al. (2002), and for extensions to modelling conditional densities for regression problems, see Bishop and Nabney (1996).

2.3.9 Mixtures of Gaussians

While the Gaussian distribution has some important analytical properties, it suffers from significant limitations when it comes to modelling real data sets. Consider the example shown in Figure 2.21. This is known as the 'Old Faithful' data set (Appendix A), and comprises 272 measurements of the eruption of the Old Faithful geyser at Yellowstone National Park in the USA. Each measurement comprises the duration of

131 the eruption in minutes (horizontal axis) and the time in minutes to the next eruption (vertical axis). We see that the data set forms two dominant clumps, and that a simple Gaussian distribution is unable to capture this structure, whereas a linear superposition of two Gaussians gives a better characterization of the data set.

Such superpositions, formed by taking linear combinations of more basic distributions such as Gaussians, can be formulated as probabilistic models known as mixture distributions (McLachlan and Basford, 1988; McLachlan and Peel, 2000). In Figure 2.22 we see that a linear combination of Gaussians can give rise to very complex densities. By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy.

Figure 2.22  Example of a Gaussian mixture distribution $p(x)$ in one dimension showing three Gaussians (each scaled by a coefficient) in blue and their sum in red.

We therefore consider a superposition of $K$ Gaussian densities of the form
\[
p(x) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{2.188}
\]
which is called a mixture of Gaussians. Each Gaussian density $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is called a component of the mixture and has its own mean $\mu_k$ and covariance $\Sigma_k$. Contour and surface plots for a Gaussian mixture having 3 components are shown in Figure 2.23.

In this section we shall consider Gaussian components to illustrate the framework of mixture models. More generally, mixture models can comprise linear combinations of other distributions. For instance, in Section 9.3.3 we shall consider mixtures of Bernoulli distributions as an example of a mixture model for discrete variables.

The parameters $\pi_k$ in (2.188) are called mixing coefficients. If we integrate both sides of (2.188) with respect to $x$, and note that both $p(x)$ and the individual Gaussian components are normalized, we obtain
\[
\sum_{k=1}^{K}\pi_k = 1. \tag{2.189}
\]
Also, the requirement that $p(x) \geq 0$, together with $\mathcal{N}(x \mid \mu_k, \Sigma_k) \geq 0$, implies $\pi_k \geq 0$ for all $k$. Combining this with the condition (2.189), we obtain
\[
0 \leq \pi_k \leq 1. \tag{2.190}
\]

132 Figure 2.23  Illustration of a mixture of 3 Gaussians in a two-dimensional space. (a) Contours of constant density for each of the mixture components, in which the 3 components are denoted red, blue and green, and the values of the mixing coefficients are shown below each component. (b) Contours of the marginal probability density $p(x)$ of the mixture distribution. (c) A surface plot of the distribution $p(x)$.

We therefore see that the mixing coefficients satisfy the requirements to be probabilities.

From the sum and product rules, the marginal density is given by
\[
p(x) = \sum_{k=1}^{K}p(k)\,p(x \mid k) \tag{2.191}
\]
which is equivalent to (2.188), in which we can view $\pi_k = p(k)$ as the prior probability of picking the $k$th component, and the density $\mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k)$ as the probability of $x$ conditioned on $k$. As we shall see in later chapters, an important role is played by the posterior probabilities $p(k \mid x)$, which are also known as responsibilities. From Bayes' theorem these are given by
\[
\gamma_k(x) \equiv p(k \mid x)
= \frac{p(k)\,p(x \mid k)}{\sum_l p(l)\,p(x \mid l)}
= \frac{\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_l \pi_l\,\mathcal{N}(x \mid \mu_l, \Sigma_l)}. \tag{2.192}
\]
We shall discuss the probabilistic interpretation of the mixture distribution in greater detail in Chapter 9.

The form of the Gaussian mixture distribution is governed by the parameters $\pi$, $\mu$ and $\Sigma$, where we have used the notation $\pi \equiv \{\pi_1, \ldots, \pi_K\}$, $\mu \equiv \{\mu_1, \ldots, \mu_K\}$ and $\Sigma \equiv \{\Sigma_1, \ldots, \Sigma_K\}$. One way to set the values of these parameters is to use maximum likelihood. From (2.188) the log of the likelihood function is given by
\[
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\} \tag{2.193}
\]
where $X = \{x_1, \ldots, x_N\}$.
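The responsibilities (2.192) are simple to evaluate for given parameters. A minimal sketch (Python with SciPy; the two-component mixture parameters are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Posterior probabilities gamma_k(x_n) from (2.192), one row per data point."""
    dens = np.column_stack([pi * multivariate_normal.pdf(X, m, S)
                            for pi, m, S in zip(pis, mus, Sigmas)])
    return dens / dens.sum(axis=1, keepdims=True)

pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
X = np.random.default_rng(7).normal(size=(5, 2))
print(responsibilities(X, pis, mus, Sigmas))   # each row sums to one
```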

133 We immediately see that the situation is now much more complex than with a single Gaussian, due to the presence of the summation over $k$ inside the logarithm. As a result, the maximum likelihood solution for the parameters no longer has a closed-form analytical solution. One approach to maximizing the likelihood function is to use iterative numerical optimization techniques (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008). Alternatively we can employ a powerful framework called expectation maximization, which will be discussed at length in Chapter 9.

2.4. The Exponential Family

The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family (Duda and Hart, 1973; Bernardo and Smith, 1994). Members of the exponential family have many important properties in common, and it is illuminating to discuss these properties in some generality.

The exponential family of distributions over $x$, given parameters $\eta$, is defined to be the set of distributions of the form
\[
p(x \mid \eta) = h(x)\,g(\eta)\exp\left\{\eta^{\mathrm T}u(x)\right\} \tag{2.194}
\]
where $x$ may be scalar or vector, and may be discrete or continuous. Here $\eta$ are called the natural parameters of the distribution, and $u(x)$ is some function of $x$. The function $g(\eta)$ can be interpreted as the coefficient that ensures that the distribution is normalized and therefore satisfies
\[
g(\eta)\int h(x)\exp\left\{\eta^{\mathrm T}u(x)\right\}\mathrm{d}x = 1 \tag{2.195}
\]
where the integration is replaced by summation if $x$ is a discrete variable.

We begin by taking some examples of the distributions introduced earlier in the chapter and showing that they are indeed members of the exponential family. Consider first the Bernoulli distribution
\[
p(x \mid \mu) = \operatorname{Bern}(x \mid \mu) = \mu^x(1 - \mu)^{1 - x}. \tag{2.196}
\]
Expressing the right-hand side as the exponential of the logarithm, we have
\[
p(x \mid \mu) = \exp\left\{x\ln\mu + (1 - x)\ln(1 - \mu)\right\}
= (1 - \mu)\exp\left\{\ln\left(\frac{\mu}{1 - \mu}\right)x\right\}. \tag{2.197}
\]
Comparison with (2.194) allows us to identify
\[
\eta = \ln\left(\frac{\mu}{1 - \mu}\right) \tag{2.198}
\]

134 which we can solve for $\mu$ to give $\mu = \sigma(\eta)$, where
\[
\sigma(\eta) = \frac{1}{1 + \exp(-\eta)} \tag{2.199}
\]
is called the logistic sigmoid function. Thus we can write the Bernoulli distribution using the standard representation (2.194) in the form
\[
p(x \mid \eta) = \sigma(-\eta)\exp(\eta x) \tag{2.200}
\]
where we have used $1 - \sigma(\eta) = \sigma(-\eta)$, which is easily proved from (2.199). Comparison with (2.194) shows that
\[
u(x) = x \tag{2.201}
\]
\[
h(x) = 1 \tag{2.202}
\]
\[
g(\eta) = \sigma(-\eta). \tag{2.203}
\]
Next consider the multinomial distribution which, for a single observation $x$, takes the form
\[
p(x \mid \mu) = \prod_{k=1}^{M}\mu_k^{x_k} = \exp\left\{\sum_{k=1}^{M}x_k\ln\mu_k\right\} \tag{2.204}
\]
where $x = (x_1, \ldots, x_M)^{\mathrm T}$. Again, we can write this in the standard representation (2.194) so that
\[
p(x \mid \eta) = \exp\left(\eta^{\mathrm T}x\right) \tag{2.205}
\]
where $\eta_k = \ln\mu_k$, and we have defined $\eta = (\eta_1, \ldots, \eta_M)^{\mathrm T}$. Again, comparing with (2.194) we have
\[
u(x) = x \tag{2.206}
\]
\[
h(x) = 1 \tag{2.207}
\]
\[
g(\eta) = 1. \tag{2.208}
\]
Note that the parameters $\eta_k$ are not independent because the parameters $\mu_k$ are subject to the constraint
\[
\sum_{k=1}^{M}\mu_k = 1 \tag{2.209}
\]
so that, given any $M - 1$ of the parameters $\mu_k$, the value of the remaining parameter is fixed. In some circumstances, it will be convenient to remove this constraint by expressing the distribution in terms of only $M - 1$ parameters. This can be achieved by using the relationship (2.209) to eliminate $\mu_M$, expressing it in terms of the remaining $M - 1$ parameters $\{\mu_k\}$ where $k = 1, \ldots, M - 1$. Note that these remaining parameters are still subject to the constraints
\[
0 \leq \mu_k \leq 1, \qquad \sum_{k=1}^{M-1}\mu_k \leq 1. \tag{2.210}
\]

135 Making use of the constraint (2.209), the multinomial distribution in this representation then becomes
\[
\exp\left\{\sum_{k=1}^{M}x_k\ln\mu_k\right\}
= \exp\left\{\sum_{k=1}^{M-1}x_k\ln\mu_k + \left(1 - \sum_{k=1}^{M-1}x_k\right)\ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\}
= \exp\left\{\sum_{k=1}^{M-1}x_k\ln\left(\frac{\mu_k}{1 - \sum_j\mu_j}\right) + \ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\}. \tag{2.211}
\]
We now identify
\[
\eta_k = \ln\left(\frac{\mu_k}{1 - \sum_j\mu_j}\right) \tag{2.212}
\]
which we can solve for $\mu_k$ by first summing both sides over $k$ and then rearranging and back-substituting to give
\[
\mu_k = \frac{\exp(\eta_k)}{1 + \sum_j\exp(\eta_j)}. \tag{2.213}
\]
This is called the softmax function, or the normalized exponential. In this representation, the multinomial distribution therefore takes the form
\[
p(x \mid \eta) = \left(1 + \sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1}\exp\left(\eta^{\mathrm T}x\right). \tag{2.214}
\]
This is the standard form of the exponential family, with parameter vector $\eta = (\eta_1, \ldots, \eta_{M-1})^{\mathrm T}$, in which
\[
u(x) = x \tag{2.215}
\]
\[
h(x) = 1 \tag{2.216}
\]
\[
g(\eta) = \left(1 + \sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1}. \tag{2.217}
\]
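Concretely, (2.213) can be evaluated by appending a fixed $\eta_M = 0$, which turns the '1 +' in the denominator into an ordinary term of the sum and allows the usual max-subtraction guard against overflow. A minimal sketch (Python/NumPy; the parameter values are illustrative):

```python
import numpy as np

def softmax_mu(eta):
    """Recover (mu_1, ..., mu_M) from the M-1 unconstrained parameters via (2.213);
    appending eta_M = 0 makes the denominator an ordinary sum, and subtracting
    the maximum is a standard numerical-overflow guard."""
    full = np.append(eta, 0.0)
    e = np.exp(full - full.max())
    return e / e.sum()

eta = np.array([1.0, -0.5, 2.0])               # M - 1 = 3 unconstrained parameters
mu = softmax_mu(eta)                           # sums to one; the last entry is mu_M
```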

136 Finally, let us consider the Gaussian distribution. For the univariate Gaussian, we have
\[
p(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} \tag{2.218}
\]
\[
= \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right\} \tag{2.219}
\]
which, after some simple rearrangement, can be cast in the standard exponential family form (2.194) with (Exercise 2.57)
\[
\eta = \begin{pmatrix} \mu/\sigma^2 \\ -1/2\sigma^2 \end{pmatrix} \tag{2.220}
\]
\[
u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix} \tag{2.221}
\]
\[
h(x) = (2\pi)^{-1/2} \tag{2.222}
\]
\[
g(\eta) = (-2\eta_2)^{1/2}\exp\left(\frac{\eta_1^2}{4\eta_2}\right). \tag{2.223}
\]

2.4.1 Maximum likelihood and sufficient statistics

Let us now consider the problem of estimating the parameter vector $\eta$ in the general exponential family distribution (2.194) using the technique of maximum likelihood. Taking the gradient of both sides of (2.195) with respect to $\eta$, we have
\[
\nabla g(\eta)\int h(x)\exp\left\{\eta^{\mathrm T}u(x)\right\}\mathrm{d}x
+ g(\eta)\int h(x)\exp\left\{\eta^{\mathrm T}u(x)\right\}u(x)\,\mathrm{d}x = 0. \tag{2.224}
\]
Rearranging, and making use again of (2.195), then gives
\[
-\frac{1}{g(\eta)}\nabla g(\eta) = g(\eta)\int h(x)\exp\left\{\eta^{\mathrm T}u(x)\right\}u(x)\,\mathrm{d}x = \mathbb{E}[u(x)] \tag{2.225}
\]
where we have used (2.194). We therefore obtain the result
\[
-\nabla\ln g(\eta) = \mathbb{E}[u(x)]. \tag{2.226}
\]
Note that the covariance of $u(x)$ can be expressed in terms of the second derivatives of $\ln g(\eta)$, and similarly for higher-order moments (Exercise 2.58). Thus, provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation.

Now consider a set of independent identically distributed data denoted by $X = \{x_1, \ldots, x_N\}$, for which the likelihood function is given by
\[
p(X \mid \eta) = \left(\prod_{n=1}^{N}h(x_n)\right)g(\eta)^{N}\exp\left\{\eta^{\mathrm T}\sum_{n=1}^{N}u(x_n)\right\}. \tag{2.227}
\]
Setting the gradient of $\ln p(X \mid \eta)$ with respect to $\eta$ to zero, we get the following condition to be satisfied by the maximum likelihood estimator $\eta_{\mathrm{ML}}$
\[
-\nabla\ln g\left(\eta_{\mathrm{ML}}\right) = \frac{1}{N}\sum_{n=1}^{N}u(x_n) \tag{2.228}
\]

137 which can in principle be solved to obtain $\eta_{\mathrm{ML}}$. We see that the solution for the maximum likelihood estimator depends on the data only through $\sum_n u(x_n)$, which is therefore called the sufficient statistic of the distribution (2.194). We do not need to store the entire data set itself but only the value of the sufficient statistic. For the Bernoulli distribution, for example, the function $u(x)$ is given just by $x$, and so we need only keep the sum of the data points $\{x_n\}$, whereas for the Gaussian $u(x) = (x, x^2)^{\mathrm T}$, and so we should keep both the sum of $\{x_n\}$ and the sum of $\{x_n^2\}$.

If we consider the limit $N \to \infty$, then the right-hand side of (2.228) becomes $\mathbb{E}[u(x)]$, and so by comparing with (2.226) we see that in this limit $\eta_{\mathrm{ML}}$ will equal the true value $\eta$.

In fact, this sufficiency property holds also for Bayesian inference, although we shall defer discussion of this until Chapter 8, when we have equipped ourselves with the tools of graphical models and can thereby gain a deeper insight into these important concepts.

2.4.2 Conjugate priors

We have already encountered the concept of a conjugate prior several times, for example in the context of the Bernoulli distribution (for which the conjugate prior is the beta distribution) or the Gaussian (where the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the Wishart distribution). In general, for a given probability distribution $p(x \mid \eta)$, we can seek a prior $p(\eta)$ that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior. For any member of the exponential family (2.194), there exists a conjugate prior that can be written in the form
\[
p(\eta \mid \chi, \nu) = f(\chi, \nu)\,g(\eta)^{\nu}\exp\left\{\nu\,\eta^{\mathrm T}\chi\right\} \tag{2.229}
\]
where $f(\chi, \nu)$ is a normalization coefficient, and $g(\eta)$ is the same function as appears in (2.194). To see that this is indeed conjugate, let us multiply the prior (2.229) by the likelihood function (2.227) to obtain the posterior distribution, up to a normalization coefficient, in the form
\[
p(\eta \mid X, \chi, \nu) \propto g(\eta)^{\nu + N}\exp\left\{\eta^{\mathrm T}\left(\sum_{n=1}^{N}u(x_n) + \nu\chi\right)\right\}. \tag{2.230}
\]
This again takes the same functional form as the prior (2.229), confirming conjugacy. Furthermore, we see that the parameter $\nu$ can be interpreted as an effective number of pseudo-observations in the prior, each of which has a value for the sufficient statistic $u(x)$ given by $\chi$.

2.4.3 Noninformative priors

In some applications of probabilistic inference, we may have prior knowledge that can be conveniently expressed through the prior distribution. For example, if the prior assigns zero probability to some value of a variable, then the posterior distribution will necessarily also assign zero probability to that value, irrespective of

138 any subsequent observations of data. In many cases, however, we may have little idea of what form the distribution should take. We may then seek a form of prior distribution, called a noninformative prior, which is intended to have as little influence on the posterior distribution as possible (Jeffreys, 1946; Box and Tiao, 1973; Bernardo and Smith, 1994). This is sometimes referred to as 'letting the data speak for themselves'.

If we have a distribution $p(x \mid \lambda)$ governed by a parameter $\lambda$, we might be tempted to propose a prior distribution $p(\lambda) = \text{const}$ as a suitable prior. If $\lambda$ is a discrete variable with $K$ states, this simply amounts to setting the prior probability of each state to $1/K$. In the case of continuous parameters, however, there are two potential difficulties with this approach. The first is that, if the domain of $\lambda$ is unbounded, this prior distribution cannot be correctly normalized because the integral over $\lambda$ diverges. Such priors are called improper. In practice, improper priors can often be used provided the corresponding posterior distribution is proper, i.e., that it can be correctly normalized. For instance, if we put a uniform prior distribution over the mean of a Gaussian, then the posterior distribution for the mean, once we have observed at least one data point, will be proper.

A second difficulty arises from the transformation behaviour of a probability density under a nonlinear change of variables, given by (1.27). If a function $h(\lambda)$ is constant, and we change variables to $\lambda = \eta^2$, then $\hat{h}(\eta) = h(\eta^2)$ will also be constant. However, if we choose the density $p_\lambda(\lambda)$ to be constant, then the density of $\eta$ will be given, from (1.27), by
\[
p_\eta(\eta) = p_\lambda(\lambda)\left|\frac{\mathrm{d}\lambda}{\mathrm{d}\eta}\right| = p_\lambda(\eta^2)\,2\eta \propto \eta \tag{2.231}
\]
and so the density over $\eta$ will not be constant. This issue does not arise when we use maximum likelihood, because the likelihood function $p(x \mid \lambda)$ is a simple function of $\lambda$, and so we are free to use any convenient parameterization. If, however, we are to choose a prior distribution that is constant, we must take care to use an appropriate representation for the parameters.

Here we consider two simple examples of noninformative priors (Berger, 1985). First of all, if a density takes the form
\[
p(x \mid \mu) = f(x - \mu) \tag{2.232}
\]
then the parameter $\mu$ is known as a location parameter. This family of densities exhibits translation invariance, because if we shift $x$ by a constant to give $\hat{x} = x + c$, then
\[
p(\hat{x} \mid \hat{\mu}) = f(\hat{x} - \hat{\mu}) \tag{2.233}
\]
where we have defined $\hat{\mu} = \mu + c$. Thus the density takes the same form in the new variable as in the original one, and so the density is independent of the choice of origin. We would like to choose a prior distribution that reflects this translation invariance property, and so we choose a prior that assigns equal probability mass to

139 an interval $A \leqslant \mu \leqslant B$ as to the shifted interval $A - c \leqslant \mu \leqslant B - c$. This implies

$$\int_A^B p(\mu)\,d\mu = \int_{A-c}^{B-c} p(\mu)\,d\mu = \int_A^B p(\mu - c)\,d\mu$$  (2.234)

and because this must hold for all choices of $A$ and $B$, we have

$$p(\mu - c) = p(\mu)$$  (2.235)

which implies that $p(\mu)$ is constant. An example of a location parameter would be the mean $\mu$ of a Gaussian distribution. As we have seen, the conjugate prior distribution for $\mu$ in this case is a Gaussian $p(\mu) = \mathcal{N}(\mu|\mu_0, \sigma_0^2)$, and we obtain a noninformative prior by taking the limit $\sigma_0^2 \to \infty$. Indeed, from (2.141) and (2.142) we see that this gives a posterior distribution over $\mu$ in which the contributions from the prior vanish.

As a second example, consider a density of the form

$$p(x|\sigma) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right)$$  (2.236)

where $\sigma > 0$. Note that this will be a normalized density provided $f(x)$ is correctly normalized (Exercise 2.59). The parameter $\sigma$ is known as a *scale parameter*, and the density exhibits *scale invariance* because if we scale $x$ by a constant to give $\hat{x} = cx$, then

$$p(\hat{x}|\hat{\sigma}) = \frac{1}{\hat{\sigma}} f\left(\frac{\hat{x}}{\hat{\sigma}}\right)$$  (2.237)

where we have defined $\hat{\sigma} = c\sigma$. This transformation corresponds to a change of scale, for example from meters to kilometers if $x$ is a length, and we would like to choose a prior distribution that reflects this scale invariance. If we consider an interval $A \leqslant \sigma \leqslant B$, and a scaled interval $A/c \leqslant \sigma \leqslant B/c$, then the prior should assign equal probability mass to these two intervals. Thus we have

$$\int_A^B p(\sigma)\,d\sigma = \int_{A/c}^{B/c} p(\sigma)\,d\sigma = \int_A^B p\left(\frac{\sigma}{c}\right)\frac{1}{c}\,d\sigma$$  (2.238)

and because this must hold for all choices of $A$ and $B$, we have

$$p(\sigma) = p\left(\frac{\sigma}{c}\right)\frac{1}{c}$$  (2.239)

and hence $p(\sigma) \propto 1/\sigma$. Note that again this is an improper prior because the integral of the distribution over $0 \leqslant \sigma \leqslant \infty$ is divergent. It is sometimes also convenient to think of the prior distribution for a scale parameter in terms of the density of the log of the parameter. Using the transformation rule (1.27) for densities we see that $p(\ln\sigma) = \mathrm{const}$. Thus, for this prior there is the same probability mass in the range $1 \leqslant \sigma \leqslant 10$ as in the range $10 \leqslant \sigma \leqslant 100$ and in $100 \leqslant \sigma \leqslant 1000$.
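A quick numerical check (not from the book) of this last property: under $p(\sigma) \propto 1/\sigma$, every decade carries the same mass, since the integral of $1/\sigma$ over $[a, b]$ is $\ln(b/a)$.

import numpy as np
from scipy.integrate import quad

for a, b in [(1, 10), (10, 100), (100, 1000)]:
    mass, _ = quad(lambda s: 1.0 / s, a, b)
    print(f"integral of 1/sigma over [{a}, {b}] = {mass:.6f}")  # ln(10) each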

140 An example of a scale parameter would be the standard deviation $\sigma$ of a Gaussian distribution, after we have taken account of the location parameter $\mu$, because

$$\mathcal{N}(x|\mu, \sigma^2) \propto \sigma^{-1} \exp\left\{-\tilde{x}^2/(2\sigma^2)\right\}$$  (2.240)

where $\tilde{x} = x - \mu$. As discussed earlier, it is often more convenient to work in terms of the precision $\lambda = 1/\sigma^2$ rather than $\sigma$ itself. Using the transformation rule for densities, we see that a distribution $p(\sigma) \propto 1/\sigma$ corresponds to a distribution over $\lambda$ of the form $p(\lambda) \propto 1/\lambda$. We have seen (Section 2.3) that the conjugate prior for $\lambda$ was the gamma distribution $\mathrm{Gam}(\lambda|a_0, b_0)$ given by (2.146). The noninformative prior is obtained as the special case $a_0 = b_0 = 0$. Again, if we examine the results (2.150) and (2.151) for the posterior distribution of $\lambda$, we see that for $a_0 = b_0 = 0$, the posterior depends only on terms arising from the data and not from the prior.

2.5. Nonparametric Methods

Throughout this chapter, we have focussed on the use of probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set. This is called the *parametric* approach to density modelling. An important limitation of this approach is that the chosen density might be a poor model of the distribution that generates the data, which can result in poor predictive performance. For instance, if the process that generates the data is multimodal, then this aspect of the distribution can never be captured by a Gaussian, which is necessarily unimodal.

In this final section, we consider some *nonparametric* approaches to density estimation that make few assumptions about the form of the distribution. Here we shall focus mainly on simple frequentist methods. The reader should be aware, however, that nonparametric Bayesian methods are attracting increasing interest (Walker et al., 1999; Neal, 2000; Müller and Quintana, 2004; Teh et al., 2006).

Let us start with a discussion of histogram methods for density estimation, which we have already encountered in the context of marginal and conditional distributions in Figure 1.11 and in the context of the central limit theorem in Figure 2.6. Here we explore the properties of histogram density models in more detail, focussing on the case of a single continuous variable $x$. Standard histograms simply partition $x$ into distinct bins of width $\Delta_i$ and then count the number $n_i$ of observations of $x$ falling in bin $i$. In order to turn this count into a normalized probability density, we simply divide by the total number $N$ of observations and by the width $\Delta_i$ of the bins to obtain probability values for each bin given by

$$p_i = \frac{n_i}{N\Delta_i}$$  (2.241)

for which it is easily seen that $\int p(x)\,dx = 1$. This gives a model for the density $p(x)$ that is constant over the width of each bin, and often the bins are chosen to have the same width $\Delta_i = \Delta$.
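The following is a minimal sketch (not from the book) of the histogram density estimator (2.241) for a single continuous variable, using equal-width bins; the function name and bin layout are illustrative choices.

import numpy as np

def histogram_density(x, x_query, n_bins=20):
    """Histogram density estimate p_i = n_i / (N * Delta) with equal-width
    bins spanning the range of the data, evaluated at the query points."""
    lo, hi = x.min(), x.max()
    delta = (hi - lo) / n_bins
    counts, edges = np.histogram(x, bins=n_bins, range=(lo, hi))
    p = counts / (len(x) * delta)          # density value in each bin
    idx = np.clip(((x_query - lo) / delta).astype(int), 0, n_bins - 1)
    return p[idx]

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.3, 0.05, 30), rng.normal(0.7, 0.1, 20)])
print(histogram_density(x, np.array([0.3, 0.5, 0.7])))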

141 Figure 2.24: An illustration of the histogram approach to density estimation, in which a data set of 50 data points is generated from the distribution shown by the green curve. Histogram density estimates, based on (2.241), with a common bin width $\Delta$ are shown for various values of $\Delta$ (panels: $\Delta = 0.04$, $\Delta = 0.08$, $\Delta = 0.25$).

In Figure 2.24, we show an example of histogram density estimation. Here the data is drawn from the distribution, corresponding to the green curve, which is formed from a mixture of two Gaussians. Also shown are three examples of histogram density estimates corresponding to three different choices for the bin width $\Delta$. We see that when $\Delta$ is very small (top figure), the resulting density model is very spiky, with a lot of structure that is not present in the underlying distribution that generated the data set. Conversely, if $\Delta$ is too large (bottom figure) then the result is a model that is too smooth and that consequently fails to capture the bimodal property of the green curve. The best results are obtained for some intermediate value of $\Delta$ (middle figure). In principle, a histogram density model is also dependent on the choice of edge location for the bins, though this is typically much less significant than the value of $\Delta$.

Note that the histogram method has the property (unlike the methods to be discussed shortly) that, once the histogram has been computed, the data set itself can be discarded, which can be advantageous if the data set is large. Also, the histogram approach is easily applied if the data points are arriving sequentially.

In practice, the histogram technique can be useful for obtaining a quick visualization of data in one or two dimensions but is unsuited to most density estimation applications. One obvious problem is that the estimated density has discontinuities that are due to the bin edges rather than any property of the underlying distribution that generated the data. Another major limitation of the histogram approach is its scaling with dimensionality. If we divide each variable in a $D$-dimensional space into $M$ bins, then the total number of bins will be $M^D$. This exponential scaling with $D$ is an example of the curse of dimensionality (Section 1.4). In a space of high dimensionality, the quantity of data needed to provide meaningful estimates of local probability density would be prohibitive.

The histogram approach to density estimation does, however, teach us two important lessons. First, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Note that the concept of locality requires that we assume some form of distance measure, and here we have been assuming Euclidean distance. For histograms,

142 this neighbourhood property was defined by the bins, and there is a natural 'smoothing' parameter describing the spatial extent of the local region, in this case the bin width. Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results. This is reminiscent of the choice of model complexity in polynomial curve fitting discussed in Chapter 1, where the degree $M$ of the polynomial, or alternatively the value $\alpha$ of the regularization parameter, was optimal for some intermediate value, neither too large nor too small. Armed with these insights, we turn now to a discussion of two widely used nonparametric techniques for density estimation, kernel estimators and nearest neighbours, which have better scaling with dimensionality than the simple histogram model.

2.5.1 Kernel density estimators

Let us suppose that observations are being drawn from some unknown probability density $p(\mathbf{x})$ in some $D$-dimensional space, which we shall take to be Euclidean, and we wish to estimate the value of $p(\mathbf{x})$. From our earlier discussion of locality, let us consider some small region $\mathcal{R}$ containing $\mathbf{x}$. The probability mass associated with this region is given by

$$P = \int_{\mathcal{R}} p(\mathbf{x})\,d\mathbf{x}.$$  (2.242)

Now suppose that we have collected a data set comprising $N$ observations drawn from $p(\mathbf{x})$. Because each data point has a probability $P$ of falling within $\mathcal{R}$, the total number $K$ of points that lie inside $\mathcal{R}$ will be distributed according to the binomial distribution (Section 2.1)

$$\mathrm{Bin}(K|N, P) = \frac{N!}{K!(N-K)!} P^K (1-P)^{N-K}.$$  (2.243)

Using (2.11), we see that the mean fraction of points falling inside the region is $E[K/N] = P$, and similarly using (2.12) we see that the variance around this mean is $\mathrm{var}[K/N] = P(1-P)/N$. For large $N$, this distribution will be sharply peaked around the mean and so

$$K \simeq NP.$$  (2.244)

If, however, we also assume that the region $\mathcal{R}$ is sufficiently small that the probability density $p(\mathbf{x})$ is roughly constant over the region, then we have

$$P \simeq p(\mathbf{x})V$$  (2.245)

where $V$ is the volume of $\mathcal{R}$. Combining (2.244) and (2.245), we obtain our density estimate in the form

$$p(\mathbf{x}) = \frac{K}{NV}.$$  (2.246)

Note that the validity of (2.246) depends on two contradictory assumptions, namely that the region $\mathcal{R}$ be sufficiently small that the density is approximately constant over the region and yet sufficiently large (in relation to the value of that density) that the number $K$ of points falling inside the region is sufficient for the binomial distribution to be sharply peaked.

143 We can exploit the result (2.246) in two different ways. Either we can fix $K$ and determine the value of $V$ from the data, which gives rise to the $K$-nearest-neighbour technique discussed shortly, or we can fix $V$ and determine $K$ from the data, giving rise to the kernel approach. It can be shown that both the $K$-nearest-neighbour density estimator and the kernel density estimator converge to the true probability density in the limit $N \to \infty$ provided $V$ shrinks suitably with $N$, and $K$ grows with $N$ (Duda and Hart, 1973).

We begin by discussing the kernel method in detail, and to start with we take the region $\mathcal{R}$ to be a small hypercube centred on the point $\mathbf{x}$ at which we wish to determine the probability density. In order to count the number $K$ of points falling within this region, it is convenient to define the following function

$$k(\mathbf{u}) = \begin{cases} 1, & |u_i| \leqslant 1/2, \quad i = 1, \ldots, D, \\ 0, & \text{otherwise} \end{cases}$$  (2.247)

which represents a unit cube centred on the origin. The function $k(\mathbf{u})$ is an example of a *kernel function*, and in this context is also called a *Parzen window*. From (2.247), the quantity $k((\mathbf{x} - \mathbf{x}_n)/h)$ will be one if the data point $\mathbf{x}_n$ lies inside a cube of side $h$ centred on $\mathbf{x}$, and zero otherwise. The total number of data points lying inside this cube will therefore be

$$K = \sum_{n=1}^N k\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right).$$  (2.248)

Substituting this expression into (2.246) then gives the following result for the estimated density at $\mathbf{x}$

$$p(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^N \frac{1}{h^D} k\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$$  (2.249)

where we have used $V = h^D$ for the volume of a hypercube of side $h$ in $D$ dimensions. Using the symmetry of the function $k(\mathbf{u})$, we can now re-interpret this equation, not as a single cube centred on $\mathbf{x}$ but as the sum over $N$ cubes centred on the $N$ data points $\mathbf{x}_n$.

As it stands, the kernel density estimator (2.249) will suffer from one of the same problems that the histogram method suffered from, namely the presence of artificial discontinuities, in this case at the boundaries of the cubes. We can obtain a smoother density model if we choose a smoother kernel function, and a common choice is the Gaussian, which gives rise to the following kernel density model

$$p(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^N \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{-\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2h^2}\right\}$$  (2.250)

where $h$ represents the standard deviation of the Gaussian components. Thus our density model is obtained by placing a Gaussian over each data point and then adding up the contributions over the whole data set, and then dividing by $N$ so that the density is correctly normalized.
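As a minimal sketch (not from the book), the hypercube Parzen estimator of (2.247)–(2.249) can be written directly; the function name is illustrative.

import numpy as np

def parzen_hypercube(X, x_query, h):
    """Kernel density estimate (2.249) with the unit-cube kernel (2.247):
    count the points within a cube of side h centred on each query point."""
    N, D = X.shape
    # u has shape (n_query, N, D); the kernel is 1 when all |u_i| <= 1/2
    u = (x_query[:, None, :] - X[None, :, :]) / h
    inside = np.all(np.abs(u) <= 0.5, axis=2)
    return inside.sum(axis=1) / (N * h**D)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))                   # 500 points in D = 2
print(parzen_hypercube(X, np.zeros((1, 2)), h=0.5))

Replacing the cube kernel with the Gaussian of (2.250) amounts to swapping the indicator for the Gaussian factor and summing the contributions in the same way.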

144 Figure 2.25: Illustration of the kernel density model (2.250) applied to the same data set used to demonstrate the histogram approach in Figure 2.24. We see that $h$ acts as a smoothing parameter and that if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), then the bimodal nature of the underlying distribution from which the data is generated (shown by the green curve) is washed out. The best density model is obtained for some intermediate value of $h$ (middle panel). (Panels: $h = 0.005$, $h = 0.07$, $h = 0.2$.)

In Figure 2.25, we apply the model (2.250) to the data set used earlier to demonstrate the histogram technique. We see that, as expected, the parameter $h$ plays the role of a smoothing parameter, and there is a trade-off between sensitivity to noise at small $h$ and over-smoothing at large $h$. Again, the optimization of $h$ is a problem in model complexity, analogous to the choice of bin width in histogram density estimation, or the degree of the polynomial used in curve fitting.

We can choose any other kernel function $k(\mathbf{u})$ in (2.249) subject to the conditions

$$k(\mathbf{u}) \geqslant 0,$$  (2.251)

$$\int k(\mathbf{u})\,d\mathbf{u} = 1$$  (2.252)

which ensure that the resulting probability distribution is nonnegative everywhere and integrates to one. The class of density model given by (2.249) is called a *kernel density estimator*, or *Parzen estimator*. It has a great merit that there is no computation involved in the 'training' phase because this simply requires storage of the training set. However, this is also one of its great weaknesses because the computational cost of evaluating the density grows linearly with the size of the data set.

2.5.2 Nearest-neighbour methods

One of the difficulties with the kernel approach to density estimation is that the parameter $h$ governing the kernel width is fixed for all kernels. In regions of high data density, a large value of $h$ may lead to over-smoothing and a washing out of structure that might otherwise be extracted from the data. However, reducing $h$ may lead to noisy estimates elsewhere in data space where the density is smaller. Thus the optimal choice for $h$ may be dependent on location within the data space. This issue is addressed by nearest-neighbour methods for density estimation.

We therefore return to our general result (2.246) for local density estimation, and instead of fixing $V$ and determining the value of $K$ from the data, we consider a fixed value of $K$ and use the data to find an appropriate value for $V$. To do this, we consider a small sphere centred on the point $\mathbf{x}$ at which we wish to estimate the density $p(\mathbf{x})$.

145 Figure 2.26: Illustration of $K$-nearest-neighbour density estimation using the same data set as in Figures 2.24 and 2.25. We see that the parameter $K$ governs the degree of smoothing, so that a small value of $K$ leads to a very noisy density model (top panel), whereas a large value (bottom panel) smoothes out the bimodal nature of the true distribution (shown by the green curve) from which the data set was generated. (Panels: $K = 1$, $K = 5$, $K = 30$.)

We allow the radius of the sphere to grow until it contains precisely $K$ data points. The estimate of the density $p(\mathbf{x})$ is then given by (2.246) with $V$ set to the volume of the resulting sphere. This technique is known as $K$ *nearest neighbours* and is illustrated in Figure 2.26, for various choices of the parameter $K$, using the same data set as used in Figure 2.24 and Figure 2.25. We see that the value of $K$ now governs the degree of smoothing and that again there is an optimum choice for $K$ that is neither too large nor too small. Note that the model produced by $K$ nearest neighbours is not a true density model because the integral over all space diverges (Exercise 2.61).

We close this chapter by showing how the $K$-nearest-neighbour technique for density estimation can be extended to the problem of classification. To do this, we apply the $K$-nearest-neighbour density estimation technique to each class separately and then make use of Bayes' theorem. Let us suppose that we have a data set comprising $N_k$ points in class $\mathcal{C}_k$ with $N$ points in total, so that $\sum_k N_k = N$. If we wish to classify a new point $\mathbf{x}$, we draw a sphere centred on $\mathbf{x}$ containing precisely $K$ points irrespective of their class. Suppose this sphere has volume $V$ and contains $K_k$ points from class $\mathcal{C}_k$. Then (2.246) provides an estimate of the density associated with each class

$$p(\mathbf{x}|\mathcal{C}_k) = \frac{K_k}{N_k V}.$$  (2.253)

Similarly, the unconditional density is given by

$$p(\mathbf{x}) = \frac{K}{NV}$$  (2.254)

while the class priors are given by

$$p(\mathcal{C}_k) = \frac{N_k}{N}.$$  (2.255)

We can now combine (2.253), (2.254), and (2.255) using Bayes' theorem to obtain the posterior probability of class membership

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}.$$  (2.256)
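Below is a minimal sketch (not from the book) of the $K$-nearest-neighbour density estimate, using (2.246) with $V$ set to the volume of the sphere reaching the $K$-th neighbour; the function name is illustrative.

import numpy as np
from math import gamma, pi

def knn_density(X, x_query, K):
    """K-NN density estimate p(x) = K / (N * V), where V is the volume of
    the smallest sphere centred on x that contains K data points."""
    N, D = X.shape
    dists = np.sort(np.linalg.norm(X - x_query, axis=1))
    r = dists[K - 1]                                  # radius to K-th neighbour
    V = (pi ** (D / 2) / gamma(D / 2 + 1)) * r ** D   # volume of a D-sphere
    return K / (N * V)

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 1))
print(knn_density(X, np.array([0.0]), K=30))   # near the N(0,1) density at 0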

146 Figure 2.27: (a) In the $K$-nearest-neighbour classifier, a new point, shown by the black diamond, is classified according to the majority class membership of the $K$ closest training data points, in this case $K = 3$. (b) In the nearest-neighbour ($K = 1$) approach to classification, the resulting decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of points from different classes.

If we wish to minimize the probability of misclassification, this is done by assigning the test point $\mathbf{x}$ to the class having the largest posterior probability, corresponding to the largest value of $K_k/K$. Thus to classify a new point, we identify the $K$ nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set. Ties can be broken at random. The particular case of $K = 1$ is called the nearest-neighbour rule, because a test point is simply assigned to the same class as the nearest point from the training set. These concepts are illustrated in Figure 2.27.

In Figure 2.28, we show the results of applying the $K$-nearest-neighbour algorithm to the oil flow data, introduced in Chapter 1, for various values of $K$. As expected, we see that $K$ controls the degree of smoothing, so that small $K$ produces many small regions of each class, whereas large $K$ leads to fewer larger regions.

Figure 2.28: Plot of 200 data points from the oil data set showing values of $x_6$ plotted against $x_7$, where the red, green, and blue points correspond to the 'laminar', 'annular', and 'homogeneous' classes, respectively. Also shown are the classifications of the input space given by the $K$-nearest-neighbour algorithm for various values of $K$ (panels: $K = 1$, $K = 3$, $K = 31$).
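As a minimal sketch (not from the book), the classification rule implied by (2.256) reduces to a majority vote among the $K$ closest training points; the function name is illustrative.

import numpy as np

def knn_classify(X_train, y_train, x_query, K=3):
    """Assign x_query to the class with the most representatives among its
    K nearest training points, i.e. the largest K_k/K in (2.256)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:K]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    # np.argmax breaks ties by order; the text suggests breaking them at random
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.8, 0.9]), K=3))  # -> 1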

147 An interesting property of the nearest-neighbour ($K = 1$) classifier is that, in the limit $N \to \infty$, the error rate is never more than twice the minimum achievable error rate of an optimal classifier, i.e., one that uses the true class distributions (Cover and Hart, 1967).

As discussed so far, both the $K$-nearest-neighbour method, and the kernel density estimator, require the entire training data set to be stored, leading to expensive computation if the data set is large. This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set. Nevertheless, these nonparametric methods are still severely limited. On the other hand, we have seen that simple parametric models are very restricted in terms of the forms of distribution that they can represent. We therefore need to find density models that are very flexible and yet for which the complexity of the models can be controlled independently of the size of the training set, and we shall see in subsequent chapters how to achieve this.

Exercises

2.1 (www) Verify that the Bernoulli distribution (2.2) satisfies the following properties

$$\sum_{x=0}^1 p(x|\mu) = 1$$  (2.257)

$$E[x] = \mu$$  (2.258)

$$\mathrm{var}[x] = \mu(1 - \mu).$$  (2.259)

Show that the entropy $H[x]$ of a Bernoulli distributed random binary variable $x$ is given by

$$H[x] = -\mu\ln\mu - (1 - \mu)\ln(1 - \mu).$$  (2.260)

2.2 The form of the Bernoulli distribution given by (2.2) is not symmetric between the two values of $x$. In some situations, it will be more convenient to use an equivalent formulation for which $x \in \{-1, 1\}$, in which case the distribution can be written

$$p(x|\mu) = \left(\frac{1-\mu}{2}\right)^{(1-x)/2}\left(\frac{1+\mu}{2}\right)^{(1+x)/2}$$  (2.261)

where $\mu \in [-1, 1]$. Show that the distribution (2.261) is normalized, and evaluate its mean, variance, and entropy.

2.3 (www) In this exercise, we prove that the binomial distribution (2.9) is normalized. First use the definition (2.10) of the number of combinations of $m$ identical objects chosen from a total of $N$ to show that

$$\binom{N}{m} + \binom{N}{m-1} = \binom{N+1}{m}.$$  (2.262)

148 Use this result to prove by induction the following result

$$(1 + x)^N = \sum_{m=0}^N \binom{N}{m} x^m$$  (2.263)

which is known as the *binomial theorem*, and which is valid for all real values of $x$. Finally, show that the binomial distribution is normalized, so that

$$\sum_{m=0}^N \binom{N}{m} \mu^m (1-\mu)^{N-m} = 1$$  (2.264)

which can be done by first pulling out a factor $(1-\mu)^N$ out of the summation and then making use of the binomial theorem.

2.4 Show that the mean of the binomial distribution is given by (2.11). To do this, differentiate both sides of the normalization condition (2.264) with respect to $\mu$ and then rearrange to obtain an expression for the mean of $n$. Similarly, by differentiating (2.264) twice with respect to $\mu$ and making use of the result (2.11) for the mean of the binomial distribution prove the result (2.12) for the variance of the binomial.

2.5 (www) In this exercise, we prove that the beta distribution, given by (2.13), is correctly normalized, so that (2.14) holds. This is equivalent to showing that

$$\int_0^1 \mu^{a-1}(1-\mu)^{b-1}\,d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$  (2.265)

From the definition (1.141) of the gamma function, we have

$$\Gamma(a)\Gamma(b) = \int_0^\infty \exp(-x)\,x^{a-1}\,dx \int_0^\infty \exp(-y)\,y^{b-1}\,dy.$$  (2.266)

Use this expression to prove (2.265) as follows. First bring the integral over $y$ inside the integrand of the integral over $x$, next make the change of variable $t = y + x$ where $x$ is fixed, then interchange the order of the $x$ and $t$ integrations, and finally make the change of variable $x = t\mu$ where $t$ is fixed.

2.6 Make use of the result (2.265) to show that the mean, variance, and mode of the beta distribution (2.13) are given respectively by

$$E[\mu] = \frac{a}{a+b}$$  (2.267)

$$\mathrm{var}[\mu] = \frac{ab}{(a+b)^2(a+b+1)}$$  (2.268)

$$\mathrm{mode}[\mu] = \frac{a-1}{a+b-2}.$$  (2.269)

149 2.7 Consider a binomial random variable $x$ given by (2.9), with prior distribution for $\mu$ given by the beta distribution (2.13), and suppose we have observed $m$ occurrences of $x = 1$ and $l$ occurrences of $x = 0$. Show that the posterior mean value of $\mu$ lies between the prior mean and the maximum likelihood estimate for $\mu$. To do this, show that the posterior mean can be written as $\lambda$ times the prior mean plus $(1 - \lambda)$ times the maximum likelihood estimate, where $0 \leqslant \lambda \leqslant 1$. This illustrates the concept of the posterior distribution being a compromise between the prior distribution and the maximum likelihood solution.

2.8 Consider two variables $x$ and $y$ with joint distribution $p(x, y)$. Prove the following two results

$$E[x] = E_y[E_x[x|y]]$$  (2.270)

$$\mathrm{var}[x] = E_y[\mathrm{var}_x[x|y]] + \mathrm{var}_y[E_x[x|y]].$$  (2.271)

Here $E_x[x|y]$ denotes the expectation of $x$ under the conditional distribution $p(x|y)$, with a similar notation for the conditional variance.

2.9 (www) In this exercise, we prove the normalization of the Dirichlet distribution (2.38) using induction. We have already shown in Exercise 2.5 that the beta distribution, which is a special case of the Dirichlet for $M = 2$, is normalized. We now assume that the Dirichlet distribution is normalized for $M - 1$ variables and prove that it is normalized for $M$ variables. To do this, consider the Dirichlet distribution over $M$ variables, and take account of the constraint $\sum_{k=1}^M \mu_k = 1$ by eliminating $\mu_M$, so that the Dirichlet is written

$$p(\mu_1, \ldots, \mu_{M-1}) = C_M \prod_{k=1}^{M-1} \mu_k^{\alpha_k - 1}\left(1 - \sum_{j=1}^{M-1}\mu_j\right)^{\alpha_M - 1}$$  (2.272)

and our goal is to find an expression for $C_M$. To do this, integrate over $\mu_{M-1}$, taking care over the limits of integration, and then make a change of variable so that this integral has limits 0 and 1. By assuming the correct result for $C_{M-1}$ and making use of (2.265), derive the expression for $C_M$.

2.10 Using the property $\Gamma(x+1) = x\Gamma(x)$ of the gamma function, derive the following results for the mean, variance, and covariance of the Dirichlet distribution given by (2.38)

$$E[\mu_j] = \frac{\alpha_j}{\alpha_0}$$  (2.273)

$$\mathrm{var}[\mu_j] = \frac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0 + 1)}$$  (2.274)

$$\mathrm{cov}[\mu_j\mu_l] = -\frac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0 + 1)}, \qquad j \neq l$$  (2.275)

where $\alpha_0$ is defined by (2.39).

150 2.11 (www) By expressing the expectation of $\ln\mu_j$ under the Dirichlet distribution (2.38) as a derivative with respect to $\alpha_j$, show that

$$E[\ln\mu_j] = \psi(\alpha_j) - \psi(\alpha_0)$$  (2.276)

where $\alpha_0$ is given by (2.39) and

$$\psi(a) \equiv \frac{d}{da}\ln\Gamma(a)$$  (2.277)

is the digamma function.

2.12 The uniform distribution for a continuous variable $x$ is defined by

$$U(x|a, b) = \frac{1}{b-a}, \qquad a \leqslant x \leqslant b.$$  (2.278)

Verify that this distribution is normalized, and find expressions for its mean and variance.

2.13 Evaluate the Kullback-Leibler divergence (1.113) between two Gaussians $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $q(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\mathbf{m}, \mathbf{L})$.

2.14 (www) This exercise demonstrates that the multivariate distribution with maximum entropy, for a given covariance, is a Gaussian. The entropy of a distribution $p(\mathbf{x})$ is given by

$$H[\mathbf{x}] = -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x}.$$  (2.279)

We wish to maximize $H[\mathbf{x}]$ over all distributions $p(\mathbf{x})$ subject to the constraints that $p(\mathbf{x})$ be normalized and that it have a specific mean and covariance, so that

$$\int p(\mathbf{x})\,d\mathbf{x} = 1$$  (2.280)

$$\int p(\mathbf{x})\,\mathbf{x}\,d\mathbf{x} = \boldsymbol{\mu}$$  (2.281)

$$\int p(\mathbf{x})(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\,d\mathbf{x} = \boldsymbol{\Sigma}.$$  (2.282)

By performing a variational maximization of (2.279) and using Lagrange multipliers to enforce the constraints (2.280), (2.281), and (2.282), show that the maximum entropy distribution is given by the Gaussian (2.43).

2.15 Show that the entropy of the multivariate Gaussian $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is given by

$$H[\mathbf{x}] = \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{D}{2}(1 + \ln(2\pi))$$  (2.283)

where $D$ is the dimensionality of $\mathbf{x}$.

151 2.16 (www) Consider two random variables $x_1$ and $x_2$ having Gaussian distributions with means $\mu_1, \mu_2$ and precisions $\tau_1, \tau_2$ respectively. Derive an expression for the differential entropy of the variable $x = x_1 + x_2$. To do this, first find the distribution of $x$ by using the relation

$$p(x) = \int_{-\infty}^{\infty} p(x|x_2)\,p(x_2)\,dx_2$$  (2.284)

and completing the square in the exponent. Then observe that this represents the convolution of two Gaussian distributions, which itself will be Gaussian, and finally make use of the result (1.110) for the entropy of the univariate Gaussian.

2.17 (www) Consider the multivariate Gaussian distribution given by (2.43). By writing the precision matrix (inverse covariance matrix) $\boldsymbol{\Sigma}^{-1}$ as the sum of a symmetric and an anti-symmetric matrix, show that the anti-symmetric term does not appear in the exponent of the Gaussian, and hence that the precision matrix may be taken to be symmetric without loss of generality. Because the inverse of a symmetric matrix is also symmetric (see Exercise 2.22), it follows that the covariance matrix may also be chosen to be symmetric without loss of generality.

2.18 Consider a real, symmetric matrix $\boldsymbol{\Sigma}$ whose eigenvalue equation is given by (2.45). By taking the complex conjugate of this equation and subtracting the original equation, and then forming the inner product with eigenvector $\mathbf{u}_i$, show that the eigenvalues $\lambda_i$ are real. Similarly, use the symmetry property of $\boldsymbol{\Sigma}$ to show that two eigenvectors $\mathbf{u}_i$ and $\mathbf{u}_j$ will be orthogonal provided $\lambda_i \neq \lambda_j$. Finally, show that without loss of generality, the set of eigenvectors can be chosen to be orthonormal, so that they satisfy (2.46), even if some of the eigenvalues are zero.

2.19 Show that a real, symmetric matrix $\boldsymbol{\Sigma}$ having the eigenvector equation (2.45) can be expressed as an expansion in the eigenvectors, with coefficients given by the eigenvalues, of the form (2.48). Similarly, show that the inverse matrix $\boldsymbol{\Sigma}^{-1}$ has a representation of the form (2.49).

2.20 (www) A positive definite matrix $\boldsymbol{\Sigma}$ can be defined as one for which the quadratic form

$$\mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a}$$  (2.285)

is positive for any real value of the vector $\mathbf{a}$. Show that a necessary and sufficient condition for $\boldsymbol{\Sigma}$ to be positive definite is that all of the eigenvalues $\lambda_i$ of $\boldsymbol{\Sigma}$, defined by (2.45), are positive.

2.21 Show that a real, symmetric matrix of size $D \times D$ has $D(D+1)/2$ independent parameters.

2.22 (www) Show that the inverse of a symmetric matrix is itself symmetric.

2.23 By diagonalizing the coordinate system using the eigenvector expansion (2.45), show that the volume contained within the hyperellipsoid corresponding to a constant

152 Mahalanobis distance $\Delta$ is given by

$$V_D|\boldsymbol{\Sigma}|^{1/2}\Delta^D$$  (2.286)

where $V_D$ is the volume of the unit sphere in $D$ dimensions, and the Mahalanobis distance is defined by (2.44).

2.24 (www) Prove the identity (2.76) by multiplying both sides by the matrix

$$\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix}$$  (2.287)

and making use of the definition (2.77).

2.25 In Sections 2.3.1 and 2.3.2, we considered the conditional and marginal distributions for a multivariate Gaussian. More generally, we can consider a partitioning of the components of $\mathbf{x}$ into three groups $\mathbf{x}_a$, $\mathbf{x}_b$, and $\mathbf{x}_c$, with a corresponding partitioning of the mean vector $\boldsymbol{\mu}$ and of the covariance matrix $\boldsymbol{\Sigma}$ in the form

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \\ \boldsymbol{\mu}_c \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} & \boldsymbol{\Sigma}_{ac} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} & \boldsymbol{\Sigma}_{bc} \\ \boldsymbol{\Sigma}_{ca} & \boldsymbol{\Sigma}_{cb} & \boldsymbol{\Sigma}_{cc} \end{pmatrix}.$$  (2.288)

By making use of the results of Section 2.3, find an expression for the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$ in which $\mathbf{x}_c$ has been marginalized out.

2.26 A very useful result from linear algebra is the *Woodbury* matrix inversion formula given by

$$(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1}.$$  (2.289)

By multiplying both sides by $(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})$ prove the correctness of this result.

2.27 Let $\mathbf{x}$ and $\mathbf{z}$ be two independent random vectors, so that $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x})p(\mathbf{z})$. Show that the mean of their sum $\mathbf{y} = \mathbf{x} + \mathbf{z}$ is given by the sum of the means of each of the variables separately. Similarly, show that the covariance matrix of $\mathbf{y}$ is given by the sum of the covariance matrices of $\mathbf{x}$ and $\mathbf{z}$. Confirm that this result agrees with that of Exercise 1.10.

2.28 (www) Consider a joint distribution over the variable

$$\mathbf{z} = \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}$$  (2.290)

whose mean and covariance are given by (2.108) and (2.105) respectively. By making use of the results (2.92) and (2.93) show that the marginal distribution $p(\mathbf{x})$ is given by (2.99). Similarly, by making use of the results (2.81) and (2.82) show that the conditional distribution $p(\mathbf{y}|\mathbf{x})$ is given by (2.100).

153 2.29 Using the partitioned matrix inversion formula (2.76), show that the inverse of the precision matrix (2.104) is given by the covariance matrix (2.105).

2.30 By starting from (2.107) and making use of the result (2.105), verify the result (2.108).

2.31 Consider two multidimensional random vectors $\mathbf{x}$ and $\mathbf{z}$ having Gaussian distributions $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x)$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z)$ respectively, together with their sum $\mathbf{y} = \mathbf{x} + \mathbf{z}$. Use the results (2.109) and (2.110) to find an expression for the marginal distribution $p(\mathbf{y})$ by considering the linear-Gaussian model comprising the product of the marginal distribution $p(\mathbf{x})$ and the conditional distribution $p(\mathbf{y}|\mathbf{x})$.

2.32 (www) This exercise and the next provide practice at manipulating the quadratic forms that arise in linear-Gaussian models, as well as giving an independent check of results derived in the main text. Consider a joint distribution $p(\mathbf{x}, \mathbf{y})$ defined by the marginal and conditional distributions given by (2.99) and (2.100). By examining the quadratic form in the exponent of the joint distribution, and using the technique of 'completing the square' discussed in Section 2.3, find expressions for the mean and covariance of the marginal distribution $p(\mathbf{y})$ in which the variable $\mathbf{x}$ has been integrated out. To do this, make use of the Woodbury matrix inversion formula (2.289). Verify that these results agree with (2.109) and (2.110) obtained using the results of Chapter 2.

2.33 Consider the same joint distribution as in Exercise 2.32, but now use the technique of completing the square to find expressions for the mean and covariance of the conditional distribution $p(\mathbf{x}|\mathbf{y})$. Again, verify that these agree with the corresponding expressions (2.111) and (2.112).

2.34 (www) To find the maximum likelihood solution for the covariance matrix of a multivariate Gaussian, we need to maximize the log likelihood function (2.118) with respect to $\boldsymbol{\Sigma}$, noting that the covariance matrix must be symmetric and positive definite. Here we proceed by ignoring these constraints and doing a straightforward maximization. Using the results (C.21), (C.26), and (C.28) from Appendix C, show that the covariance matrix $\boldsymbol{\Sigma}$ that maximizes the log likelihood function (2.118) is given by the sample covariance (2.122). We note that the final result is necessarily symmetric and positive definite (provided the sample covariance is nonsingular).

2.35 Use the result (2.59) to prove (2.62). Now, using the results (2.59) and (2.62), show that

$$E[\mathbf{x}_n\mathbf{x}_m^T] = \boldsymbol{\mu}\boldsymbol{\mu}^T + I_{nm}\boldsymbol{\Sigma}$$  (2.291)

where $\mathbf{x}_n$ denotes a data point sampled from a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, and $I_{nm}$ denotes the $(n, m)$ element of the identity matrix. Hence prove the result (2.124).

2.36 (www) Using an analogous procedure to that used to obtain (2.126), derive an expression for the sequential estimation of the variance of a univariate Gaussian

154 distribution, by starting with the maximum likelihood expression

$$\sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^N (x_n - \mu)^2.$$  (2.292)

Verify that substituting the expression for a Gaussian distribution into the Robbins-Monro sequential estimation formula (2.135) gives a result of the same form, and hence obtain an expression for the corresponding coefficients $a_N$.

2.37 Using an analogous procedure to that used to obtain (2.126), derive an expression for the sequential estimation of the covariance of a multivariate Gaussian distribution, by starting with the maximum likelihood expression (2.122). Verify that substituting the expression for a Gaussian distribution into the Robbins-Monro sequential estimation formula (2.135) gives a result of the same form, and hence obtain an expression for the corresponding coefficients $a_N$.

2.38 Use the technique of completing the square for the quadratic form in the exponent to derive the results (2.141) and (2.142).

2.39 Starting from the results (2.141) and (2.142) for the posterior distribution of the mean of a Gaussian random variable, dissect out the contributions from the first $N - 1$ data points and hence obtain expressions for the sequential update of $\mu_N$ and $\sigma_N^2$. Now derive the same results starting from the posterior distribution $p(\mu|x_1, \ldots, x_{N-1}) = \mathcal{N}(\mu|\mu_{N-1}, \sigma_{N-1}^2)$ and multiplying by the likelihood function $p(x_N|\mu) = \mathcal{N}(x_N|\mu, \sigma^2)$ and then completing the square and normalizing to obtain the posterior distribution after $N$ observations.

2.40 (www) Consider a $D$-dimensional Gaussian random variable $\mathbf{x}$ with distribution $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ in which the covariance $\boldsymbol{\Sigma}$ is known and for which we wish to infer the mean $\boldsymbol{\mu}$ from a set of observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. Given a prior distribution $p(\boldsymbol{\mu}) = \mathcal{N}(\boldsymbol{\mu}|\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$, find the corresponding posterior distribution $p(\boldsymbol{\mu}|\mathbf{X})$.

2.41 Use the definition of the gamma function (1.141) to show that the gamma distribution (2.146) is normalized.

2.42 Evaluate the mean, variance, and mode of the gamma distribution (2.146).

2.43 The following distribution

$$p(x|\sigma^2, q) = \frac{q}{2(2\sigma^2)^{1/q}\,\Gamma(1/q)}\exp\left(-\frac{|x|^q}{2\sigma^2}\right)$$  (2.293)

is a generalization of the univariate Gaussian distribution. Show that this distribution is normalized so that

$$\int_{-\infty}^{\infty} p(x|\sigma^2, q)\,dx = 1$$  (2.294)

and that it reduces to the Gaussian when $q = 2$. Consider a regression model in which the target variable is given by $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$ and $\epsilon$ is a random noise

155 variable drawn from the distribution (2.293). Show that the log likelihood function over $\mathbf{w}$ and $\sigma^2$, for an observed data set of input vectors $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and corresponding target variables $\mathbf{t} = (t_1, \ldots, t_N)^T$, is given by

$$\ln p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^N |y(\mathbf{x}_n, \mathbf{w}) - t_n|^q - \frac{N}{q}\ln(2\sigma^2) + \mathrm{const}$$  (2.295)

where 'const' denotes terms independent of both $\mathbf{w}$ and $\sigma^2$. Note that, as a function of $\mathbf{w}$, this is the $L_q$ error function considered in Section 1.5.5.

2.44 Consider a univariate Gaussian distribution $\mathcal{N}(x|\mu, \tau^{-1})$ having conjugate Gaussian-gamma prior given by (2.154), and a data set $\mathbf{x} = \{x_1, \ldots, x_N\}$ of i.i.d. observations. Show that the posterior distribution is also a Gaussian-gamma distribution of the same functional form as the prior, and write down expressions for the parameters of this posterior distribution.

2.45 Verify that the Wishart distribution defined by (2.155) is indeed a conjugate prior for the precision matrix of a multivariate Gaussian.

2.46 (www) Verify that evaluating the integral in (2.158) leads to the result (2.159).

2.47 (www) Show that in the limit $\nu \to \infty$, the t-distribution (2.159) becomes a Gaussian. Hint: ignore the normalization coefficient, and simply look at the dependence on $x$.

2.48 By following analogous steps to those used to derive the univariate Student's t-distribution (2.159), verify the result (2.162) for the multivariate form of the Student's t-distribution, by marginalizing over the variable $\eta$ in (2.161). Using the definition (2.161), show by exchanging integration variables that the multivariate t-distribution is correctly normalized.

2.49 By using the definition (2.161) of the multivariate Student's t-distribution as a convolution of a Gaussian with a gamma distribution, verify the properties (2.164), (2.165), and (2.166) for the multivariate t-distribution defined by (2.162).

2.50 Show that in the limit $\nu \to \infty$, the multivariate Student's t-distribution (2.162) reduces to a Gaussian with mean $\boldsymbol{\mu}$ and precision $\boldsymbol{\Lambda}$.

2.51 (www) The various trigonometric identities used in the discussion of periodic variables in this chapter can be proven easily from the relation

$$\exp(iA) = \cos A + i\sin A$$  (2.296)

in which $i$ is the square root of minus one. By considering the identity

$$\exp(iA)\exp(-iA) = 1$$  (2.297)

prove the result (2.177). Similarly, using the identity

$$\cos(A - B) = \Re\,\exp\{i(A - B)\}$$  (2.298)

156 where $\Re$ denotes the real part, prove (2.178). Finally, by using $\sin(A - B) = \Im\,\exp\{i(A - B)\}$, where $\Im$ denotes the imaginary part, prove the result (2.183).

2.52 For large $m$, the von Mises distribution (2.179) becomes sharply peaked around the mode $\theta_0$. By defining $\xi = m^{1/2}(\theta - \theta_0)$ and making the Taylor expansion of the cosine function given by

$$\cos\alpha = 1 - \frac{\alpha^2}{2} + O(\alpha^4)$$  (2.299)

show that as $m \to \infty$, the von Mises distribution tends to a Gaussian.

2.53 Using the trigonometric identity (2.183), show that solution of (2.182) for $\theta_0$ is given by (2.184).

2.54 By computing first and second derivatives of the von Mises distribution (2.179), and using $I_0(m) > 0$ for $m > 0$, show that the maximum of the distribution occurs when $\theta = \theta_0$ and that the minimum occurs when $\theta = \theta_0 + \pi$ (mod $2\pi$).

2.55 By making use of the result (2.168), together with (2.184) and the trigonometric identity (2.178), show that the maximum likelihood solution $m_{ML}$ for the concentration of the von Mises distribution satisfies $A(m_{ML}) = \bar{r}$ where $\bar{r}$ is the radius of the mean of the observations viewed as unit vectors in the two-dimensional Euclidean plane, as illustrated in Figure 2.17.

2.56 (www) Express the beta distribution (2.13), the gamma distribution (2.146), and the von Mises distribution (2.179) as members of the exponential family (2.194) and thereby identify their natural parameters.

2.57 Verify that the multivariate Gaussian distribution can be cast in exponential family form (2.194) and derive expressions for $\boldsymbol{\eta}$, $\mathbf{u}(\mathbf{x})$, $h(\mathbf{x})$ and $g(\boldsymbol{\eta})$ analogous to (2.220)–(2.223).

2.58 The result (2.226) showed that the negative gradient of $\ln g(\boldsymbol{\eta})$ for the exponential family is given by the expectation of $\mathbf{u}(\mathbf{x})$. By taking the second derivatives of (2.195), show that

$$-\nabla\nabla\ln g(\boldsymbol{\eta}) = E[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^T] - E[\mathbf{u}(\mathbf{x})]E[\mathbf{u}(\mathbf{x})^T] = \mathrm{cov}[\mathbf{u}(\mathbf{x})].$$  (2.300)

2.59 By changing variables using $y = x/\sigma$, show that the density (2.236) will be correctly normalized, provided $f(x)$ is correctly normalized.

2.60 (www) Consider a histogram-like density model in which the space $\mathbf{x}$ is divided into fixed regions for which the density $p(\mathbf{x})$ takes the constant value $h_i$ over the $i$th region, and that the volume of region $i$ is denoted $\Delta_i$. Suppose we have a set of $N$ observations of $\mathbf{x}$ such that $n_i$ of these observations fall in region $i$. Using a Lagrange multiplier to enforce the normalization constraint on the density, derive an expression for the maximum likelihood estimator for the $\{h_i\}$.

2.61 Show that the $K$-nearest-neighbour density model defines an improper distribution whose integral over all space is divergent.

157 3 Linear Models for Regression

The focus so far in this book has been on unsupervised learning, including topics such as density estimation and data clustering. We turn now to a discussion of supervised learning, starting with regression. The goal of regression is to predict the value of one or more continuous *target* variables $t$ given the value of a $D$-dimensional vector $\mathbf{x}$ of input variables. We have already encountered an example of a regression problem when we considered polynomial curve fitting in Chapter 1. The polynomial is a specific example of a broad class of functions called linear regression models, which share the property of being linear functions of the adjustable parameters, and which will form the focus of this chapter. The simplest form of linear regression models are also linear functions of the input variables. However, we can obtain a much more useful class of functions by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as *basis functions*. Such models are linear functions of the parameters, which gives them simple analytical properties, and yet can be nonlinear with respect to the input variables.

158 Given a training data set comprising $N$ observations $\{\mathbf{x}_n\}$, where $n = 1, \ldots, N$, together with corresponding target values $\{t_n\}$, the goal is to predict the value of $t$ for a new value of $\mathbf{x}$. In the simplest approach, this can be done by directly constructing an appropriate function $y(\mathbf{x})$ whose values for new inputs $\mathbf{x}$ constitute the predictions for the corresponding values of $t$. More generally, from a probabilistic perspective, we aim to model the predictive distribution $p(t|\mathbf{x})$ because this expresses our uncertainty about the value of $t$ for each value of $\mathbf{x}$. From this conditional distribution we can make predictions of $t$, for any new value of $\mathbf{x}$, in such a way as to minimize the expected value of a suitably chosen loss function. As discussed in Section 1.5.5, a common choice of loss function for real-valued variables is the squared loss, for which the optimal solution is given by the conditional expectation of $t$.

Although linear models have significant limitations as practical techniques for pattern recognition, particularly for problems involving input spaces of high dimensionality, they have nice analytical properties and form the foundation for more sophisticated models to be discussed in later chapters.

3.1. Linear Basis Function Models

The simplest linear model for regression is one that involves a linear combination of the input variables

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D$$  (3.1)

where $\mathbf{x} = (x_1, \ldots, x_D)^T$. This is often simply known as *linear regression*. The key property of this model is that it is a linear function of the parameters $w_0, \ldots, w_D$. It is also, however, a linear function of the input variables $x_i$, and this imposes significant limitations on the model. We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x})$$  (3.2)

where $\phi_j(\mathbf{x})$ are known as *basis functions*. By denoting the maximum value of the index $j$ by $M - 1$, the total number of parameters in this model will be $M$.

The parameter $w_0$ allows for any fixed offset in the data and is sometimes called a *bias* parameter (not to be confused with 'bias' in a statistical sense). It is often convenient to define an additional dummy 'basis function' $\phi_0(\mathbf{x}) = 1$ so that

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j\phi_j(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$$  (3.3)

where $\mathbf{w} = (w_0, \ldots, w_{M-1})^T$ and $\boldsymbol{\phi} = (\phi_0, \ldots, \phi_{M-1})^T$. In many practical applications of pattern recognition, we will apply some form of fixed pre-processing,

159 or feature extraction, to the original data variables. If the original variables comprise the vector $\mathbf{x}$, then the features can be expressed in terms of the basis functions $\{\phi_j(\mathbf{x})\}$.

By using nonlinear basis functions, we allow the function $y(\mathbf{x}, \mathbf{w})$ to be a nonlinear function of the input vector $\mathbf{x}$. Functions of the form (3.2) are called linear models, however, because this function is linear in $\mathbf{w}$. It is this linearity in the parameters that will greatly simplify the analysis of this class of models. However, it also leads to some significant limitations, as we discuss in Section 3.6.

The example of polynomial regression considered in Chapter 1 is a particular example of this model in which there is a single input variable $x$, and the basis functions take the form of powers of $x$ so that $\phi_j(x) = x^j$. One limitation of polynomial basis functions is that they are global functions of the input variable, so that changes in one region of input space affect all other regions. This can be resolved by dividing the input space up into regions and fitting a different polynomial in each region, leading to *spline functions* (Hastie et al., 2001).

There are many other possible choices for the basis functions, for example

$$\phi_j(x) = \exp\left\{-\frac{(x - \mu_j)^2}{2s^2}\right\}$$  (3.4)

where the $\mu_j$ govern the locations of the basis functions in input space, and the parameter $s$ governs their spatial scale. These are usually referred to as 'Gaussian' basis functions, although it should be noted that they are not required to have a probabilistic interpretation, and in particular the normalization coefficient is unimportant because these basis functions will be multiplied by adaptive parameters $w_j$.

Another possibility is the sigmoidal basis function of the form

$$\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$$  (3.5)

where $\sigma(a)$ is the logistic sigmoid function defined by

$$\sigma(a) = \frac{1}{1 + \exp(-a)}.$$  (3.6)

Equivalently, we can use the 'tanh' function because this is related to the logistic sigmoid by $\tanh(a) = 2\sigma(a) - 1$, and so a general linear combination of logistic sigmoid functions is equivalent to a general linear combination of 'tanh' functions. These various choices of basis function are illustrated in Figure 3.1 (a code sketch follows below).

Yet another possible choice of basis function is the Fourier basis, which leads to an expansion in sinusoidal functions. Each basis function represents a specific frequency and has infinite spatial extent. By contrast, basis functions that are localized to finite regions of input space necessarily comprise a spectrum of different spatial frequencies. In many signal processing applications, it is of interest to consider basis functions that are localized in both space and frequency, leading to a class of functions known as *wavelets*. These are also defined to be mutually orthogonal, to simplify their application. Wavelets are most applicable when the input values live on a regular lattice, such as the successive time points in a temporal sequence, or the pixels in an image.
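Returning to the basis functions (3.4)–(3.6) above, the following is a minimal sketch (not from the book); the function names are illustrative, and each returns the $N \times M$ matrix of basis-function values that will reappear as the design matrix (3.16).

import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0, ..., M-1 (the dummy phi_0 = 1 included)."""
    return np.power(x[:, None], np.arange(M))

def gaussian_basis(x, mu, s):
    """Gaussian basis functions (3.4) with centres mu and common scale s."""
    return np.exp(-(x[:, None] - mu[None, :])**2 / (2 * s**2))

def sigmoid_basis(x, mu, s):
    """Sigmoidal basis functions (3.5) using the logistic sigmoid (3.6)."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

x = np.linspace(-1, 1, 5)
mu = np.linspace(-1, 1, 3)
print(polynomial_basis(x, 3).shape,     # (5, 3) design-matrix-style outputs
      gaussian_basis(x, mu, 0.2).shape,
      sigmoid_basis(x, mu, 0.1).shape)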

160 Figure 3.1: Examples of basis functions, showing polynomials on the left, Gaussians of the form (3.4) in the centre, and sigmoidal of the form (3.5) on the right.

Useful texts on wavelets include Ogden (1997), Mallat (1999), and Vidakovic (1999).

Most of the discussion in this chapter, however, is independent of the particular choice of basis function set, and so for most of our discussion we shall not specify the particular form of the basis functions, except for the purposes of numerical illustration. Indeed, much of our discussion will be equally applicable to the situation in which the vector $\boldsymbol{\phi}(\mathbf{x})$ of basis functions is simply the identity $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{x}$. Furthermore, in order to keep the notation simple, we shall focus on the case of a single target variable $t$. However, in Section 3.1.5, we consider briefly the modifications needed to deal with multiple target variables.

3.1.1 Maximum likelihood and least squares

In Chapter 1, we fitted polynomial functions to data sets by minimizing a sum-of-squares error function. We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model. Let us return to this discussion and consider the least squares approach, and its relation to maximum likelihood, in more detail.

As before, we assume that the target variable $t$ is given by a deterministic function $y(\mathbf{x}, \mathbf{w})$ with additive Gaussian noise so that

$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon$$  (3.7)

where $\epsilon$ is a zero mean Gaussian random variable with precision (inverse variance) $\beta$. Thus we can write

$$p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t|y(\mathbf{x}, \mathbf{w}), \beta^{-1}).$$  (3.8)

Recall that, if we assume a squared loss function, then the optimal prediction, for a new value of $\mathbf{x}$, will be given by the conditional mean of the target variable (Section 1.5.5). In the case of a Gaussian conditional distribution of the form (3.8), the conditional mean

161 will be simply

$$E[t|\mathbf{x}] = \int t\,p(t|\mathbf{x})\,dt = y(\mathbf{x}, \mathbf{w}).$$  (3.9)

Note that the Gaussian noise assumption implies that the conditional distribution of $t$ given $\mathbf{x}$ is unimodal, which may be inappropriate for some applications. An extension to mixtures of conditional Gaussian distributions, which permit multimodal conditional distributions, will be discussed in Section 14.5.1.

Now consider a data set of inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ with corresponding target values $t_1, \ldots, t_N$. We group the target variables $\{t_n\}$ into a column vector that we denote by $\mathsf{t}$ where the typeface is chosen to distinguish it from a single observation of a multivariate target, which would be denoted $\mathbf{t}$. Making the assumption that these data points are drawn independently from the distribution (3.8), we obtain the following expression for the likelihood function, which is a function of the adjustable parameters $\mathbf{w}$ and $\beta$, in the form

$$p(\mathsf{t}|\mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^N \mathcal{N}(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$  (3.10)

where we have used (3.3). Note that in supervised learning problems such as regression (and classification), we are not seeking to model the distribution of the input variables. Thus $\mathbf{x}$ will always appear in the set of conditioning variables, and so from now on we will drop the explicit $\mathbf{x}$ from expressions such as $p(\mathsf{t}|\mathbf{x}, \mathbf{w}, \beta)$ in order to keep the notation uncluttered. Taking the logarithm of the likelihood function, and making use of the standard form (1.46) for the univariate Gaussian, we have

$$\ln p(\mathsf{t}|\mathbf{w}, \beta) = \sum_{n=1}^N \ln\mathcal{N}(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$$  (3.11)

where the sum-of-squares error function is defined by

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N \{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2.$$  (3.12)

Having written down the likelihood function, we can use maximum likelihood to determine $\mathbf{w}$ and $\beta$. Consider first the maximization with respect to $\mathbf{w}$. As observed already in Section 1.2.5, we see that maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing a sum-of-squares error function given by $E_D(\mathbf{w})$. The gradient of the log likelihood function (3.11) takes the form

$$\nabla\ln p(\mathsf{t}|\mathbf{w}, \beta) = \sum_{n=1}^N \{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\}\boldsymbol{\phi}(\mathbf{x}_n)^T.$$  (3.13)

162 Setting this gradient to zero gives

$$0 = \sum_{n=1}^N t_n\boldsymbol{\phi}(\mathbf{x}_n)^T - \mathbf{w}^T\left(\sum_{n=1}^N \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\right).$$  (3.14)

Solving for $\mathbf{w}$ we obtain

$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathsf{t}$$  (3.15)

which are known as the *normal equations* for the least squares problem. Here $\boldsymbol{\Phi}$ is an $N \times M$ matrix, called the *design matrix*, whose elements are given by $\Phi_{nj} = \phi_j(\mathbf{x}_n)$, so that

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}.$$  (3.16)

The quantity

$$\boldsymbol{\Phi}^\dagger \equiv (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T$$  (3.17)

is known as the *Moore-Penrose pseudo-inverse* of the matrix $\boldsymbol{\Phi}$ (Rao and Mitra, 1971; Golub and Van Loan, 1996). It can be regarded as a generalization of the notion of matrix inverse to nonsquare matrices. Indeed, if $\boldsymbol{\Phi}$ is square and invertible, then using the property $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ we see that $\boldsymbol{\Phi}^\dagger \equiv \boldsymbol{\Phi}^{-1}$.

At this point, we can gain some insight into the role of the bias parameter $w_0$. If we make the bias parameter explicit, then the error function (3.12) becomes

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N \left\{t_n - w_0 - \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x}_n)\right\}^2.$$  (3.18)

Setting the derivative with respect to $w_0$ equal to zero, and solving for $w_0$, we obtain

$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j\bar{\phi}_j$$  (3.19)

where we have defined

$$\bar{t} = \frac{1}{N}\sum_{n=1}^N t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^N \phi_j(\mathbf{x}_n).$$  (3.20)

Thus the bias $w_0$ compensates for the difference between the averages (over the training set) of the target values and the weighted sum of the averages of the basis function values.

We can also maximize the log likelihood function (3.11) with respect to the noise precision parameter $\beta$, giving

$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^N \{t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2$$  (3.21)

and so we see that the inverse of the noise precision is given by the residual variance of the target values around the regression function.
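As a minimal sketch (not from the book), the maximum likelihood solution (3.15) and noise precision (3.21) can be computed directly from the design matrix; using numpy's least-squares solver rather than forming the inverse of $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ explicitly is an illustrative choice made for numerical stability.

import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood fit: w_ML from the normal equations (3.15),
    solved stably via lstsq, and beta_ML from the residual variance (3.21)."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = t - Phi @ w_ml
    beta_ml = len(t) / np.sum(residuals**2)
    return w_ml, beta_ml

# Fit a noisy line with phi(x) = (1, x)^T.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
Phi = np.column_stack([np.ones_like(x), x])
print(fit_ml(Phi, t))   # w_ML near (1, 2), beta_ML near 1/0.1^2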

163 Figure 3.2: Geometrical interpretation of the least-squares solution, in an $N$-dimensional space whose axes are the values of $t_1, \ldots, t_N$. The least-squares regression function is obtained by finding the orthogonal projection of the data vector $\mathsf{t}$ onto the subspace spanned by the basis functions $\phi_j(\mathbf{x})$ in which each basis function is viewed as a vector $\boldsymbol{\varphi}_j$ of length $N$ with elements $\phi_j(\mathbf{x}_n)$.

3.1.2 Geometry of least squares

At this point, it is instructive to consider the geometrical interpretation of the least-squares solution. To do this we consider an $N$-dimensional space whose axes are given by the $t_n$, so that $\mathsf{t} = (t_1, \ldots, t_N)^T$ is a vector in this space. Each basis function $\phi_j(\mathbf{x}_n)$, evaluated at the $N$ data points, can also be represented as a vector in the same space, denoted by $\boldsymbol{\varphi}_j$, as illustrated in Figure 3.2. Note that $\boldsymbol{\varphi}_j$ corresponds to the $j$th column of $\boldsymbol{\Phi}$, whereas $\boldsymbol{\phi}(\mathbf{x}_n)$ corresponds to the $n$th row of $\boldsymbol{\Phi}$. If the number $M$ of basis functions is smaller than the number $N$ of data points, then the $M$ vectors $\boldsymbol{\varphi}_j$ will span a linear subspace $\mathcal{S}$ of dimensionality $M$. We define $\mathbf{y}$ to be an $N$-dimensional vector whose $n$th element is given by $y(\mathbf{x}_n, \mathbf{w})$, where $n = 1, \ldots, N$. Because $\mathbf{y}$ is an arbitrary linear combination of the vectors $\boldsymbol{\varphi}_j$, it can live anywhere in the $M$-dimensional subspace. The sum-of-squares error (3.12) is then equal (up to a factor of $1/2$) to the squared Euclidean distance between $\mathbf{y}$ and $\mathsf{t}$. Thus the least-squares solution for $\mathbf{w}$ corresponds to that choice of $\mathbf{y}$ that lies in subspace $\mathcal{S}$ and that is closest to $\mathsf{t}$. Intuitively, from Figure 3.2, we anticipate that this solution corresponds to the orthogonal projection of $\mathsf{t}$ onto the subspace $\mathcal{S}$. This is indeed the case, as can easily be verified by noting that the solution for $\mathbf{y}$ is given by $\boldsymbol{\Phi}\mathbf{w}_{ML}$, and then confirming that this takes the form of an orthogonal projection (Exercise 3.2).

In practice, a direct solution of the normal equations can lead to numerical difficulties when $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is close to singular. In particular, when two or more of the basis vectors $\boldsymbol{\varphi}_j$ are co-linear, or nearly so, the resulting parameter values can have large magnitudes. Such near degeneracies will not be uncommon when dealing with real data sets. The resulting numerical difficulties can be addressed using the technique of *singular value decomposition*, or *SVD* (Press et al., 1992; Bishop and Nabney, 2008). Note that the addition of a regularization term ensures that the matrix is nonsingular, even in the presence of degeneracies.

3.1.3 Sequential learning

Batch techniques, such as the maximum likelihood solution (3.15), which involve processing the entire training set in one go, can be computationally costly for large data sets. As we have discussed in Chapter 1, if the data set is sufficiently large, it may be worthwhile to use sequential algorithms, also known as on-line algorithms, in which the data points are considered one at a time, and the model parameters updated after each such presentation.
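A quick numerical check (not from the book) of the projection interpretation above: the residual $\mathsf{t} - \boldsymbol{\Phi}\mathbf{w}_{ML}$ should be orthogonal to every column of $\boldsymbol{\Phi}$.

import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(50, 3))          # design matrix, N = 50, M = 3
t = rng.normal(size=50)                 # arbitrary targets
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
residual = t - Phi @ w_ml
print(Phi.T @ residual)                 # ~ zero vector: residual is
                                        # orthogonal to the span of Phi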

164 Sequential learning is also appropriate for real-time applications in which the data observations are arriving in a continuous stream, and predictions must be made before all of the data points are seen.

We can obtain a sequential learning algorithm by applying the technique of sequential gradient descent, also known as stochastic gradient descent, as follows. If the error function comprises a sum over data points E = \sum_n E_n, then after presentation of pattern n, the stochastic gradient descent algorithm updates the parameter vector w using

w^{(τ+1)} = w^{(τ)} − η ∇E_n    (3.22)

where τ denotes the iteration number, and η is a learning rate parameter. We shall discuss the choice of value for η shortly. The value of w is initialized to some starting vector w^{(0)}. For the case of the sum-of-squares error function (3.12), this gives

w^{(τ+1)} = w^{(τ)} + η ( t_n − w^{(τ)T} φ_n ) φ_n    (3.23)

where φ_n = φ(x_n). This is known as least-mean-squares or the LMS algorithm. The value of η needs to be chosen with care to ensure that the algorithm converges (Bishop and Nabney, 2008).
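A minimal sketch of the LMS update (3.23), again with illustrative names; the learning rate and the number of passes over the data are assumptions that would need tuning in practice.

import numpy as np

def lms_fit(Phi, t, eta=0.05, n_passes=10):
    # Stochastic gradient descent (3.22) on the sum-of-squares error:
    # each step applies w <- w + eta * (t_n - w^T phi_n) * phi_n, as in (3.23).
    w = np.zeros(Phi.shape[1])
    for _ in range(n_passes):
        for phi_n, t_n in zip(Phi, t):
            w = w + eta * (t_n - w @ phi_n) * phi_n
    return w

If η is too large the updates diverge; if too small, convergence is needlessly slow. This is the care in choosing η referred to above.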

165 3.1.4 Regularized least squares

In Section 1.1, we introduced the idea of adding a regularization term to an error function in order to control over-fitting, so that the total error function to be minimized takes the form

E_D(w) + λ E_W(w)    (3.24)

where λ is the regularization coefficient that controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w). One of the simplest forms of regularizer is given by the sum-of-squares of the weight vector elements

E_W(w) = \frac{1}{2} w^T w.    (3.25)

If we also consider the sum-of-squares error function given by

E_D(w) = \frac{1}{2} \sum_{n=1}^N \{ t_n − w^T φ(x_n) \}^2    (3.26)

then the total error function becomes

\frac{1}{2} \sum_{n=1}^N \{ t_n − w^T φ(x_n) \}^2 + \frac{λ}{2} w^T w.    (3.27)

This particular choice of regularizer is known in the machine learning literature as weight decay because in sequential learning algorithms, it encourages weight values to decay towards zero, unless supported by the data. In statistics, it provides an example of a parameter shrinkage method because it shrinks parameter values towards zero. It has the advantage that the error function remains a quadratic function of w, and so its exact minimizer can be found in closed form. Specifically, setting the gradient of (3.27) with respect to w to zero, and solving for w as before, we obtain

w = (λI + Φ^T Φ)^{−1} Φ^T t.    (3.28)

This represents a simple extension of the least-squares solution (3.15).

A more general regularizer is sometimes used, for which the regularized error takes the form

\frac{1}{2} \sum_{n=1}^N \{ t_n − w^T φ(x_n) \}^2 + \frac{λ}{2} \sum_{j=1}^M |w_j|^q    (3.29)

where q = 2 corresponds to the quadratic regularizer (3.27). Figure 3.3 shows contours of the regularization function for different values of q.

Figure 3.3: Contours of the regularization term in (3.29) for various values of the parameter q: q = 0.5, q = 1, q = 2, and q = 4.

The case of q = 1 is known as the lasso in the statistics literature (Tibshirani, 1996). It has the property that if λ is sufficiently large, some of the coefficients w_j are driven to zero, leading to a sparse model in which the corresponding basis functions play no role. To see this, we first note that minimizing (3.29) is equivalent (Exercise 3.5) to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint

\sum_{j=1}^M |w_j|^q ≤ η    (3.30)

for an appropriate value of the parameter η, where the two approaches can be related using Lagrange multipliers (Appendix E). The origin of the sparsity can be seen from Figure 3.4, which shows the minimum of the error function subject to the constraint (3.30). As λ is increased, so an increasing number of parameters are driven to zero.

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity. However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient λ. We shall return to the issue of model complexity later in this chapter.
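In code, the regularized solution (3.28) is a one-line change from the unregularized case; a sketch with illustrative names:

import numpy as np

def fit_ridge(Phi, t, lam):
    # Regularized least squares (3.28): w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
    # The lambda*I term also keeps the system nonsingular when basis
    # vectors are nearly co-linear, as noted in Section 3.1.2.
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)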

166 Figure 3.4: Plot of the contours of the unregularized error function (blue) along with the constraint region (3.30) for the quadratic regularizer q = 2 on the left and the lasso regularizer q = 1 on the right, in which the optimum value for the parameter vector w is denoted by w*. The lasso gives a sparse solution in which w*_1 = 0.

For the remainder of this chapter we shall focus on the quadratic regularizer (3.27) both for its practical importance and its analytical tractability.

3.1.5 Multiple outputs

So far, we have considered the case of a single target variable t. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector t. This could be done by introducing a different set of basis functions for each component of t, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that

y(x, W) = W^T φ(x)    (3.31)

where y is a K-dimensional column vector, W is an M × K matrix of parameters, and φ(x) is an M-dimensional column vector with elements φ_j(x), with φ_0(x) = 1 as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form

p(t | x, W, β) = N(t | W^T φ(x), β^{−1} I).    (3.32)

If we have a set of observations t_1, ..., t_N, we can combine these into a matrix T of size N × K such that the n-th row is given by t_n^T. Similarly, we can combine the input vectors x_1, ..., x_N into a matrix X. The log likelihood function is then given by

\ln p(T | X, W, β) = \sum_{n=1}^N \ln N(t_n | W^T φ(x_n), β^{−1} I)
= \frac{NK}{2} \ln \left( \frac{β}{2π} \right) − \frac{β}{2} \sum_{n=1}^N \| t_n − W^T φ(x_n) \|^2.    (3.33)

As before, we can maximize this function with respect to W, giving

W_{ML} = (Φ^T Φ)^{−1} Φ^T T.    (3.34)

If we examine this result for each target variable t_k, we have

w_k = (Φ^T Φ)^{−1} Φ^T t_k = Φ^† t_k    (3.35)

where t_k is an N-dimensional column vector with components t_{nk} for n = 1, ..., N. Thus the solution to the regression problem decouples between the different target variables, and we need only compute a single pseudo-inverse matrix Φ^†, which is shared by all of the vectors w_k.
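Since the same pseudo-inverse serves every target column, the multi-output fit costs essentially no more than a single-output fit; a sketch (illustrative names, ours):

import numpy as np

def fit_multi_output(Phi, T):
    # W_ML = (Phi^T Phi)^{-1} Phi^T T, from (3.34). lstsq solves all K
    # columns of T at once, reflecting the decoupling in (3.35).
    W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)    # T has shape (N, K)
    return W_ml                                       # shape (M, K)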

167 The extension to general Gaussian noise distributions having arbitrary covariance matrices is straightforward. Again, this leads to a decoupling into K independent regression problems (Exercise 3.6). This result is unsurprising because the parameters W define only the mean of the Gaussian noise distribution, and we know from Section 2.3.4 that the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance. From now on, we shall therefore consider a single target variable t for simplicity.

3.2. The Bias-Variance Decomposition

So far in our discussion of linear models for regression, we have assumed that the form and number of basis functions are both fixed. As we have seen in Chapter 1, the use of maximum likelihood, or equivalently least squares, can lead to severe over-fitting if complex models are trained using data sets of limited size. However, limiting the number of basis functions in order to avoid over-fitting has the side effect of limiting the flexibility of the model to capture interesting and important trends in the data. Although the introduction of regularization terms can control over-fitting for models with many parameters, this raises the question of how to determine a suitable value for the regularization coefficient λ. Seeking the solution that minimizes the regularized error function with respect to both the weight vector w and the regularization coefficient λ is clearly not the right approach, since this leads to the unregularized solution with λ = 0.

As we have seen in earlier chapters, the phenomenon of over-fitting is really an unfortunate property of maximum likelihood and does not arise when we marginalize over parameters in a Bayesian setting. In this chapter, we shall consider the Bayesian view of model complexity in some depth. Before doing so, however, it is instructive to consider a frequentist viewpoint of the model complexity issue, known as the bias-variance trade-off. Although we shall introduce this concept in the context of linear basis function models, where it is easy to illustrate the ideas using simple examples, the discussion has more general applicability.

In Section 1.5.5, when we discussed decision theory for regression problems, we considered various loss functions each of which leads to a corresponding optimal prediction once we are given the conditional distribution p(t | x). A popular choice is

168 the squared loss function, for which the optimal prediction is given by the conditional expectation, which we denote by h(x) and which is given by

h(x) = E[t | x] = \int t \, p(t | x) \, dt.    (3.36)

At this point, it is worth distinguishing between the squared loss function arising from decision theory and the sum-of-squares error function that arose in the maximum likelihood estimation of model parameters. We might use more sophisticated techniques than least squares, for example regularization or a fully Bayesian approach, to determine the conditional distribution p(t | x). These can all be combined with the squared loss function for the purpose of making predictions.

We showed in Section 1.5.5 that the expected squared loss can be written in the form

E[L] = \int \{ y(x) − h(x) \}^2 p(x) \, dx + \int\!\!\int \{ h(x) − t \}^2 p(x, t) \, dx \, dt.    (3.37)

Recall that the second term, which is independent of y(x), arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss. The first term depends on our choice for the function y(x), and we will seek a solution for y(x) which makes this term a minimum. Because it is nonnegative, the smallest that we can hope to make this term is zero. If we had an unlimited supply of data (and unlimited computational resources), we could in principle find the regression function h(x) to any desired degree of accuracy, and this would represent the optimal choice for y(x). However, in practice we have a data set D containing only a finite number N of data points, and consequently we do not know the regression function h(x) exactly.

If we model h(x) using a parametric function y(x, w) governed by a parameter vector w, then from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over w. A frequentist treatment, however, involves making a point estimate of w based on the data set D, and tries instead to interpret the uncertainty of this estimate through the following thought experiment. Suppose we had a large number of data sets each of size N and each drawn independently from the distribution p(t, x). For any given data set D, we can run our learning algorithm and obtain a prediction function y(x; D). Different data sets from the ensemble will give different functions and consequently different values of the squared loss. The performance of a particular learning algorithm is then assessed by taking the average over this ensemble of data sets.

Consider the integrand of the first term in (3.37), which for a particular data set D takes the form

\{ y(x; D) − h(x) \}^2.    (3.38)

Because this quantity will be dependent on the particular data set D, we take its average over the ensemble of data sets. If we add and subtract the quantity E_D[y(x; D)]

169 inside the braces, and then expand, we obtain

\{ y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) \}^2
= \{ y(x; D) − E_D[y(x; D)] \}^2 + \{ E_D[y(x; D)] − h(x) \}^2
+ 2 \{ y(x; D) − E_D[y(x; D)] \} \{ E_D[y(x; D)] − h(x) \}.    (3.39)

We now take the expectation of this expression with respect to D and note that the final term will vanish, giving

E_D[ \{ y(x; D) − h(x) \}^2 ]
= \{ E_D[y(x; D)] − h(x) \}^2 + E_D[ \{ y(x; D) − E_D[y(x; D)] \}^2 ]    (3.40)

where the first term is the squared bias and the second is the variance. We see that the expected squared difference between y(x; D) and the regression function h(x) can be expressed as the sum of two terms. The first term, called the squared bias, represents the extent to which the average prediction over all data sets differs from the desired regression function. The second term, called the variance, measures the extent to which the solutions for individual data sets vary around their average, and hence this measures the extent to which the function y(x; D) is sensitive to the particular choice of data set. We shall provide some intuition to support these definitions shortly when we consider a simple example.

So far, we have considered a single input value x. If we substitute this expansion back into (3.37), we obtain the following decomposition of the expected squared loss

expected loss = (bias)^2 + variance + noise    (3.41)

where

(bias)^2 = \int \{ E_D[y(x; D)] − h(x) \}^2 p(x) \, dx    (3.42)

variance = \int E_D[ \{ y(x; D) − E_D[y(x; D)] \}^2 ] p(x) \, dx    (3.43)

noise = \int\!\!\int \{ h(x) − t \}^2 p(x, t) \, dx \, dt    (3.44)

and the bias and variance terms now refer to integrated quantities.

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term. As we shall see, there is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance. The model with the optimal predictive capability is the one that leads to the best balance between bias and variance. This is illustrated by considering the sinusoidal data set from Chapter 1 (Appendix A). Here we generate 100 data sets, each containing N = 25 data points, independently from the sinusoidal curve h(x) = sin(2πx). The data sets are indexed by l = 1, ..., L, where L = 100, and for each data set D^{(l)} we fit a model with 24 Gaussian basis functions by minimizing the regularized error function (3.27) to give a prediction function y^{(l)}(x), as shown in Figure 3.5.

170 Figure 3.5: Illustration of the dependence of bias and variance on model complexity, governed by a regularization parameter λ, using the sinusoidal data set from Chapter 1. There are L = 100 data sets, each having N = 25 data points, and there are 24 Gaussian basis functions in the model so that the total number of parameters is M = 25 including the bias parameter. The rows correspond, from top to bottom, to ln λ = 2.6, ln λ = −0.31, and ln λ = −2.4. The left column shows the result of fitting the model to the data sets for various values of ln λ (for clarity, only 20 of the 100 fits are shown). The right column shows the corresponding average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green).

171 Figure 3.6: Plot of squared bias and variance, together with their sum, corresponding to the results shown in Figure 3.5. Also shown is the average test set error for a test data set of size 1000 points. The minimum value of (bias)^2 + variance occurs around ln λ = −0.31, which is close to the value that gives the minimum error on the test data.

The top row of Figure 3.5 corresponds to a large value of the regularization coefficient λ that gives low variance (because the red curves in the left plot look similar) but high bias (because the two curves in the right plot are very different). Conversely, on the bottom row, for which λ is small, there is large variance (shown by the high variability between the red curves in the left plot) but low bias (shown by the good fit between the average model fit and the original sinusoidal function). Note that the result of averaging many solutions for the complex model with M = 25 is a very good fit to the regression function, which suggests that averaging may be a beneficial procedure. Indeed, a weighted averaging of multiple solutions lies at the heart of a Bayesian approach, although the averaging is with respect to the posterior distribution of parameters, not with respect to multiple data sets.

We can also examine the bias-variance trade-off quantitatively for this example (a small simulation sketch follows below). The average prediction is estimated from

\bar{y}(x) = \frac{1}{L} \sum_{l=1}^L y^{(l)}(x)    (3.45)

and the integrated squared bias and integrated variance are then given by

(bias)^2 = \frac{1}{N} \sum_{n=1}^N \{ \bar{y}(x_n) − h(x_n) \}^2    (3.46)

variance = \frac{1}{N} \sum_{n=1}^N \frac{1}{L} \sum_{l=1}^L \{ y^{(l)}(x_n) − \bar{y}(x_n) \}^2    (3.47)

where the integral over x weighted by the distribution p(x) is approximated by a finite sum over data points drawn from that distribution. These quantities, along with their sum, are plotted as a function of ln λ in Figure 3.6. We see that small values of λ allow the model to become finely tuned to the noise on each individual data set, leading to large variance. Conversely, a large value of λ pulls the weight parameters towards zero, leading to large bias.
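The experiment behind (3.45)-(3.47) can be reproduced on a small scale with the following sketch (ours, not the book's code); the Gaussian basis width and the noise standard deviation are assumptions chosen to echo the described setup.

import numpy as np

rng = np.random.default_rng(0)
L, N, M, lam = 100, 25, 25, np.exp(-0.31)

def gauss_basis(x, centres, s=0.1):
    # 24 Gaussian basis functions plus a constant bias column (M = 25).
    return np.hstack([np.ones((len(x), 1)),
                      np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))])

centres = np.linspace(0, 1, M - 1)
x = np.linspace(0, 1, N)
h = np.sin(2 * np.pi * x)                  # true regression function h(x)
Phi = gauss_basis(x, centres)

Y = np.empty((L, N))
for l in range(L):
    t = h + rng.normal(scale=0.3, size=N)  # assumed noise level
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)  # (3.28)
    Y[l] = Phi @ w                         # prediction y^(l)(x)

y_bar = Y.mean(axis=0)                     # average prediction (3.45)
bias2 = np.mean((y_bar - h)**2)            # integrated squared bias (3.46)
variance = np.mean((Y - y_bar)**2)         # integrated variance (3.47)

Sweeping lam over a grid and plotting bias2, variance, and their sum reproduces the qualitative behaviour of Figure 3.6.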

172 Although the bias-variance decomposition may provide some interesting insights into the model complexity issue from a frequentist perspective, it is of limited practical value, because the bias-variance decomposition is based on averages with respect to ensembles of data sets, whereas in practice we have only the single observed data set. If we had a large number of independent training sets of a given size, we would be better off combining them into a single large training set, which of course would reduce the level of over-fitting for a given model complexity.

Given these limitations, we turn in the next section to a Bayesian treatment of linear basis function models, which not only provides powerful insights into the issues of over-fitting but which also leads to practical techniques for addressing the question of model complexity.

3.3. Bayesian Linear Regression

In our discussion of maximum likelihood for setting the parameters of a linear regression model, we have seen that the effective model complexity, governed by the number of basis functions, needs to be controlled according to the size of the data set. Adding a regularization term to the log likelihood function means the effective model complexity can then be controlled by the value of the regularization coefficient, although the choice of the number and form of the basis functions is of course still important in determining the overall behaviour of the model.

This leaves the issue of deciding the appropriate model complexity for the particular problem, which cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting. Independent hold-out data can be used to determine model complexity, as discussed in Section 1.3, but this can be both computationally expensive and wasteful of valuable data. We therefore turn to a Bayesian treatment of linear regression, which will avoid the over-fitting problem of maximum likelihood, and which will also lead to automatic methods of determining model complexity using the training data alone. Again, for simplicity we will focus on the case of a single target variable t. Extension to multiple target variables is straightforward and follows the discussion of Section 3.1.5.

3.3.1 Parameter distribution

We begin our discussion of the Bayesian treatment of linear regression by introducing a prior probability distribution over the model parameters w. For the moment, we shall treat the noise precision parameter β as a known constant. First note that the likelihood function p(t | w) defined by (3.10) is the exponential of a quadratic function of w. The corresponding conjugate prior is therefore given by a Gaussian distribution of the form

p(w) = N(w | m_0, S_0)    (3.48)

having mean m_0 and covariance S_0.

173 Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. We can evaluate this distribution by the usual procedure of completing the square in the exponential, and then finding the normalization coefficient using the standard result for a normalized Gaussian. However, we have already done the necessary work in deriving the general result (2.116), which allows us to write down the posterior distribution directly in the form (Exercise 3.7)

p(w | t) = N(w | m_N, S_N)    (3.49)

where

m_N = S_N (S_0^{−1} m_0 + β Φ^T t)    (3.50)
S_N^{−1} = S_0^{−1} + β Φ^T Φ.    (3.51)

Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by w_{MAP} = m_N. If we consider an infinitely broad prior S_0 = α^{−1} I with α → 0, the mean m_N of the posterior distribution reduces to the maximum likelihood value w_{ML} given by (3.15). Similarly, if N = 0, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, such that the new posterior distribution is again given by (3.49) (Exercise 3.8).

For the remainder of this chapter, we shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter α so that

p(w | α) = N(w | 0, α^{−1} I)    (3.52)

and the corresponding posterior distribution over w is then given by (3.49) with

m_N = β S_N Φ^T t    (3.53)
S_N^{−1} = α I + β Φ^T Φ.    (3.54)

The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of w, takes the form

\ln p(w | t) = −\frac{β}{2} \sum_{n=1}^N \{ t_n − w^T φ(x_n) \}^2 − \frac{α}{2} w^T w + \text{const}.    (3.55)

Maximization of this posterior distribution with respect to w is therefore equivalent to the minimization of the sum-of-squares error function with the addition of a quadratic regularization term, corresponding to (3.27) with λ = α/β.
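A sketch of the posterior update (3.53)-(3.54) for the isotropic prior (illustrative names, not from the text):

import numpy as np

def posterior(Phi, t, alpha, beta):
    # Zero-mean isotropic prior (3.52):
    #   S_N^{-1} = alpha*I + beta * Phi^T Phi     (3.54)
    #   m_N      = beta * S_N Phi^T t             (3.53)
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

Because the posterior is again Gaussian, the general update (3.50)-(3.51) can be applied repeatedly as data arrive, with each posterior acting as the prior for the next observation.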

174 We can illustrate Bayesian learning in a linear basis function model, as well as the sequential update of a posterior distribution, using a simple example involving straight-line fitting. Consider a single input variable x, a single target variable t, and a linear model of the form y(x, w) = w_0 + w_1 x. Because this has just two adaptive parameters, we can plot the prior and posterior distributions directly in parameter space. We generate synthetic data from the function f(x, a) = a_0 + a_1 x with parameter values a_0 = −0.3 and a_1 = 0.5 by first choosing values of x_n from the uniform distribution U(x | −1, 1), then evaluating f(x_n, a), and finally adding Gaussian noise with standard deviation of 0.2 to obtain the target values t_n. Our goal is to recover the values of a_0 and a_1 from such data, and we will explore the dependence on the size of the data set. We assume here that the noise variance is known and hence we set the precision parameter to its true value β = (1/0.2)^2 = 25. Similarly, we fix the parameter α to 2.0. We shall shortly discuss strategies for determining α and β from the training data.

Figure 3.7 shows the results of Bayesian learning in this model as the size of the data set is increased and demonstrates the sequential nature of Bayesian learning in which the current posterior distribution forms the prior when a new data point is observed. It is worth taking time to study this figure in detail as it illustrates several important aspects of Bayesian inference. The first row of this figure corresponds to the situation before any data points are observed and shows a plot of the prior distribution in w space together with six samples of the function y(x, w) in which the values of w are drawn from the prior. In the second row, we see the situation after observing a single data point. The location (x, t) of the data point is shown by a blue circle in the right-hand column. In the left-hand column is a plot of the likelihood function p(t | x, w) for this data point as a function of w. Note that the likelihood function provides a soft constraint that the line must pass close to the data point, where close is determined by the noise precision β. For comparison, the true parameter values a_0 = −0.3 and a_1 = 0.5 used to generate the data set are shown by a white cross in the plots in the left column of Figure 3.7. When we multiply this likelihood function by the prior from the top row, and normalize, we obtain the posterior distribution shown in the middle plot on the second row. Samples of the regression function y(x, w) obtained by drawing samples of w from this posterior distribution are shown in the right-hand plot. Note that these sample lines all pass close to the data point. The third row of this figure shows the effect of observing a second data point, again shown by a blue circle in the plot in the right-hand column. The corresponding likelihood function for this second data point alone is shown in the left plot. When we multiply this likelihood function by the posterior distribution from the second row, we obtain the posterior distribution shown in the middle plot of the third row. Note that this is exactly the same posterior distribution as would be obtained by combining the original prior with the likelihood function for the two data points. This posterior has now been influenced by two data points, and because two points are sufficient to define a line this already gives a relatively compact posterior distribution. Samples from this posterior distribution give rise to the functions shown in red in the third column, and we see that these functions pass close to both of the data points. The fourth row shows the effect of observing a total of 20 data points. The left-hand plot shows the likelihood function for the 20th data point alone, and the middle plot shows the resulting posterior distribution that has now absorbed information from all 20 observations. Note how the posterior is much sharper than in the third row. In the limit of an infinite number of data points, the posterior distribution would become a delta function centred on the true parameter values, shown by the white cross.

175 Figure 3.7: Illustration of sequential Bayesian learning for a simple linear model of the form y(x, w) = w_0 + w_1 x. A detailed description of this figure is given in the text.

176 Other forms of prior over the parameters can be considered. For instance, we can generalize the Gaussian prior to give

p(w | α) = \left[ \frac{q}{2} \left( \frac{α}{2} \right)^{1/q} \frac{1}{Γ(1/q)} \right]^M \exp \left( −\frac{α}{2} \sum_{j=1}^M |w_j|^q \right)    (3.56)

in which q = 2 corresponds to the Gaussian distribution, and only in this case is the prior conjugate to the likelihood function (3.10). Finding the maximum of the posterior distribution over w corresponds to minimization of the regularized error function (3.29). In the case of the Gaussian prior, the mode of the posterior distribution was equal to the mean, although this will no longer hold if q ≠ 2.

3.3.2 Predictive distribution

In practice, we are not usually interested in the value of w itself but rather in making predictions of t for new values of x. This requires that we evaluate the predictive distribution defined by

p(t | t, α, β) = \int p(t | w, β) p(w | t, α, β) \, dw    (3.57)

in which t is the vector of target values from the training set, and we have omitted the corresponding input vectors from the right-hand side of the conditioning statements to simplify the notation. The conditional distribution p(t | x, w, β) of the target variable is given by (3.8), and the posterior weight distribution is given by (3.49). We see that (3.57) involves the convolution of two Gaussian distributions, and so making use of the result (2.115) from Section 8.1.4, we see that the predictive distribution takes the form (Exercise 3.10)

p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))    (3.58)

where the variance σ_N^2(x) of the predictive distribution is given by

σ_N^2(x) = \frac{1}{β} + φ(x)^T S_N φ(x).    (3.59)

The first term in (3.59) represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown (Qazaz et al., 1997) that σ_{N+1}^2(x) ≤ σ_N^2(x). In the limit N → ∞, the second term in (3.59) goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter β (Exercise 3.11).
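Evaluating (3.58)-(3.59) requires only the posterior quantities m_N and S_N computed earlier; a sketch (illustrative names, ours):

import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    # Predictive mean and variance from (3.58)-(3.59):
    #   mean = m_N^T phi(x),  var = 1/beta + phi(x)^T S_N phi(x)
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var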

177 Figure 3.8: Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions of the form (3.4) using the synthetic sinusoidal data set of Section 1.1. See the text for a detailed discussion.

As an illustration of the predictive distribution for Bayesian linear regression models, let us return to the synthetic sinusoidal data set of Section 1.1. In Figure 3.8, we fit a model comprising a linear combination of Gaussian basis functions to data sets of various sizes and then look at the corresponding posterior distributions. Here the green curves correspond to the function sin(2πx) from which the data points were generated (with the addition of Gaussian noise). Data sets of size N = 1, N = 2, N = 4, and N = 25 are shown in the four plots by the blue circles. For each plot, the red curve shows the mean of the corresponding Gaussian predictive distribution, and the red shaded region spans one standard deviation either side of the mean. Note that the predictive uncertainty depends on x and is smallest in the neighbourhood of the data points. Also note that the level of uncertainty decreases as more data points are observed.

The plots in Figure 3.8 only show the point-wise predictive variance as a function of x. In order to gain insight into the covariance between the predictions at different values of x, we can draw samples from the posterior distribution over w, and then plot the corresponding functions y(x, w), as shown in Figure 3.9.

178 Figure 3.9: Plots of the function y(x, w) using samples from the posterior distributions over w corresponding to the plots in Figure 3.8.

If we used localized basis functions such as Gaussians, then in regions away from the basis function centres, the contribution from the second term in the predictive variance (3.59) will go to zero, leaving only the noise contribution β^{−1}. Thus, the model becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions, which is generally an undesirable behaviour. This problem can be avoided by adopting an alternative Bayesian approach to regression known as a Gaussian process (Section 6.4).

Note that, if both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β) that, from the discussion in Section 2.3.6, will be given by a Gaussian-gamma distribution (Denison et al., 2002) (Exercise 3.12). In this case, the predictive distribution is a Student's t-distribution (Exercise 3.13).

179 Figure 3.10: The equivalent kernel k(x, x') for the Gaussian basis functions in Figure 3.1, shown as a plot of x versus x', together with three slices through this matrix corresponding to three different values of x. The data set used to generate this kernel comprised 200 values of x equally spaced over the interval (−1, 1).

3.3.3 Equivalent kernel

The posterior mean solution (3.53) for the linear basis function model has an interesting interpretation that will set the stage for kernel methods, including Gaussian processes (Chapter 6). If we substitute (3.53) into the expression (3.3), we see that the predictive mean can be written in the form

y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = \sum_{n=1}^N β φ(x)^T S_N φ(x_n) t_n    (3.60)

where S_N is defined by (3.51). Thus the mean of the predictive distribution at a point x is given by a linear combination of the training set target variables t_n, so that we can write

y(x, m_N) = \sum_{n=1}^N k(x, x_n) t_n    (3.61)

where the function

k(x, x') = β φ(x)^T S_N φ(x')    (3.62)

is known as the smoother matrix or the equivalent kernel. Regression functions, such as this, which make predictions by taking linear combinations of the training set target values are known as linear smoothers. Note that the equivalent kernel depends on the input values x_n from the data set because these appear in the definition of S_N. The equivalent kernel is illustrated for the case of Gaussian basis functions in Figure 3.10, in which the kernel functions k(x, x') have been plotted as a function of x' for three different values of x. We see that they are localized around x, and so the mean of the predictive distribution at x, given by y(x, m_N), is obtained by forming a weighted combination of the target values in which data points close to x are given higher weight than points further removed from x. Intuitively, it seems reasonable that we should weight local evidence more strongly than distant evidence. Note that this localization property holds not only for the localized Gaussian basis functions but also for the nonlocal polynomial and sigmoidal basis functions, as illustrated in Figure 3.11.
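Computing the equivalent kernel (3.62) on a grid of inputs is a single matrix product; a sketch (illustrative names, ours):

import numpy as np

def equivalent_kernel(Phi_grid, Phi_train, S_N, beta):
    # k(x, x') = beta * phi(x)^T S_N phi(x'), from (3.62). Rows index the
    # grid points x, columns index the training inputs x_n, so each row
    # contains the weights applied to the targets t_n in (3.61).
    return beta * Phi_grid @ S_N @ Phi_train.T

Summing each row of the resulting matrix provides a numerical check of the summation property (3.64) discussed overleaf.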

180 Figure 3.11: Examples of equivalent kernels k(x, x') for x = 0 plotted as a function of x', corresponding (left) to the polynomial basis functions and (right) to the sigmoidal basis functions shown in Figure 3.1. Note that these are localized functions of x' even though the corresponding basis functions are nonlocal.

Further insight into the role of the equivalent kernel can be obtained by considering the covariance between y(x) and y(x'), which is given by

cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')]
= φ(x)^T S_N φ(x') = β^{−1} k(x, x')    (3.63)

where we have made use of (3.49) and (3.62). From the form of the equivalent kernel, we see that the predictive mean at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller.

The predictive distribution shown in Figure 3.8 allows us to visualize the point-wise uncertainty in the predictions, governed by (3.59). However, by drawing samples from the posterior distribution over w, and plotting the corresponding model functions y(x, w) as in Figure 3.9, we are visualizing the joint uncertainty in the posterior distribution between the y values at two (or more) x values, as governed by the equivalent kernel.

The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression as follows. Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can instead define a localized kernel directly and use this to make predictions for new input vectors x, given the observed training set. This leads to a practical framework for regression (and classification) called Gaussian processes, which will be discussed in detail in Section 6.4.

We have seen that the effective kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of x, and it can be shown that these weights sum to one, in other words

\sum_{n=1}^N k(x, x_n) = 1    (3.64)

for all values of x. This intuitively pleasing result can easily be proven informally (Exercise 3.14) by noting that the summation is equivalent to considering the predictive mean \hat{y}(x) for a set of target data in which t_n = 1 for all n. Provided the basis functions are linearly independent, that there are more data points than basis functions, and that one of the basis functions is constant (corresponding to the bias parameter), then it is clear that we can fit the training data exactly and hence that the predictive mean will be simply \hat{y}(x) = 1, from which we obtain (3.64).

181 Note that the kernel function can be negative as well as positive, so although it satisfies a summation constraint, the corresponding predictions are not necessarily convex combinations of the training set target variables.

Finally, we note that the equivalent kernel (3.62) satisfies an important property shared by kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions (Chapter 6), so that

k(x, z) = ψ(x)^T ψ(z)    (3.65)

where ψ(x) = β^{1/2} S_N^{1/2} φ(x).

3.4. Bayesian Model Comparison

In Chapter 1, we highlighted the problem of over-fitting as well as the use of cross-validation as a technique for setting the values of regularization parameters or for choosing between alternative models. Here we consider the problem of model selection from a Bayesian perspective. In this section, our discussion will be very general, and then in Section 3.5 we shall see how these ideas can be applied to the determination of regularization parameters in linear regression.

As we shall see, the over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data, without the need for a validation set. This allows all available data to be used for training and avoids the multiple training runs for each model associated with cross-validation. It also allows multiple complexity parameters to be determined simultaneously as part of the training process. For example, in Chapter 7 we shall introduce the relevance vector machine, which is a Bayesian model having one complexity parameter for every training data point.

The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model, along with a consistent application of the sum and product rules of probability. Suppose we wish to compare a set of L models {M_i} where i = 1, ..., L. Here a model refers to a probability distribution over the observed data D. In the case of the polynomial curve-fitting problem, the distribution is defined over the set of target values t, while the set of input values X is assumed to be known. Other types of model define a joint distribution over X and t (Section 1.5.4). We shall suppose that the data is generated from one of these models but we are uncertain which one. Our uncertainty is expressed through a prior probability distribution p(M_i). Given a training set D, we then wish to evaluate the posterior distribution

p(M_i | D) ∝ p(M_i) p(D | M_i).    (3.66)

The prior allows us to express a preference for different models. Let us simply assume that all models are given equal prior probability. The interesting term is the model evidence p(D | M_i) which expresses the preference shown by the data for different models, and we shall examine this term in more detail shortly.

182 The model evidence is sometimes also called the marginal likelihood because it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out. The ratio of model evidences p(D | M_i) / p(D | M_j) for two models is known as a Bayes factor (Kass and Raftery, 1995).

Once we know the posterior distribution over models, the predictive distribution is given, from the sum and product rules, by

p(t | x, D) = \sum_{i=1}^L p(t | x, M_i, D) p(M_i | D).    (3.67)

This is an example of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions p(t | x, M_i, D) of individual models, weighted by the posterior probabilities p(M_i | D) of those models. For instance, if we have two models that are a-posteriori equally likely and one predicts a narrow distribution around t = a while the other predicts a narrow distribution around t = b, the overall predictive distribution will be a bimodal distribution with modes at t = a and t = b, not a single mode at t = (a + b)/2.

A simple approximation to model averaging is to use the single most probable model alone to make predictions. This is known as model selection.

For a model governed by a set of parameters w, the model evidence is given, from the sum and product rules of probability, by

p(D | M_i) = \int p(D | w, M_i) p(w | M_i) \, dw.    (3.68)

From a sampling perspective, the marginal likelihood can be viewed as the probability of generating the data set D from a model whose parameters are sampled at random from the prior (Chapter 11). It is also interesting to note that the evidence is precisely the normalizing term that appears in the denominator in Bayes' theorem when evaluating the posterior distribution over parameters because

p(w | D, M_i) = \frac{p(D | w, M_i) p(w | M_i)}{p(D | M_i)}.    (3.69)

We can obtain some insight into the model evidence by making a simple approximation to the integral over parameters. Consider first the case of a model having a single parameter w. The posterior distribution over parameters is proportional to p(D | w) p(w), where we omit the dependence on the model M_i to keep the notation uncluttered. If we assume that the posterior distribution is sharply peaked around the most probable value w_{MAP}, with width Δw_posterior, then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width Δw_prior so that p(w) = 1/Δw_prior, then we have

p(D) = \int p(D | w) p(w) \, dw ≃ p(D | w_{MAP}) \frac{Δw_{posterior}}{Δw_{prior}}    (3.70)

183 and so taking logs we obtain

\ln p(D) ≃ \ln p(D | w_{MAP}) + \ln \left( \frac{Δw_{posterior}}{Δw_{prior}} \right).    (3.71)

This approximation is illustrated in Figure 3.12.

Figure 3.12: We can obtain a rough approximation to the model evidence if we assume that the posterior distribution over parameters is sharply peaked around its mode w_{MAP}, with the widths Δw_posterior and Δw_prior indicated.

The first term represents the fit to the data given by the most probable parameter values, and for a flat prior this would correspond to the log likelihood. The second term penalizes the model according to its complexity. Because Δw_posterior < Δw_prior, this term is negative, and it increases in magnitude as the ratio Δw_posterior / Δw_prior gets smaller. Thus, if parameters are finely tuned to the data in the posterior distribution, then the penalty term is large.

For a model having a set of M parameters, we can make a similar approximation for each parameter in turn. Assuming that all parameters have the same ratio of Δw_posterior / Δw_prior, we obtain

\ln p(D) ≃ \ln p(D | w_{MAP}) + M \ln \left( \frac{Δw_{posterior}}{Δw_{prior}} \right).    (3.72)

Thus, in this very simple approximation, the size of the complexity penalty increases linearly with the number M of adaptive parameters in the model. As we increase the complexity of the model, the first term will typically decrease, because a more complex model is better able to fit the data, whereas the second term will increase due to the dependence on M. The optimal model complexity, as determined by the maximum evidence, will be given by a trade-off between these two competing terms. We shall later develop a more refined version of this approximation, based on a Gaussian approximation to the posterior distribution (Section 4.4.1).

We can gain further insight into Bayesian model comparison and understand how the marginal likelihood can favour models of intermediate complexity by considering Figure 3.13. Here the horizontal axis is a one-dimensional representation of the space of possible data sets, so that each point on this axis corresponds to a specific data set. We now consider three models M_1, M_2 and M_3 of successively increasing complexity. Imagine running these models generatively to produce example data sets, and then looking at the distribution of data sets that result.

184 Figure 3.13: Schematic illustration of the distribution p(D) of data sets for three models of different complexity, in which M_1 is the simplest and M_3 is the most complex. Note that the distributions are normalized. In this example, for the particular observed data set D_0, the model M_2 with intermediate complexity has the largest evidence.

Any given model can generate a variety of different data sets since the parameters are governed by a prior probability distribution, and for any choice of the parameters there may be random noise on the target variables. To generate a particular data set from a specific model, we first choose the values of the parameters from their prior distribution p(w), and then for these parameter values we sample the data from p(D | w). A simple model (for example, based on a first order polynomial) has little variability and so will generate data sets that are fairly similar to each other. Its distribution p(D) is therefore confined to a relatively small region of the horizontal axis. By contrast, a complex model (such as a ninth order polynomial) can generate a great variety of different data sets, and so its distribution p(D) is spread over a large region of the space of data sets. Because the distributions p(D | M_i) are normalized, we see that the particular data set D_0 can have the highest value of the evidence for the model of intermediate complexity. Essentially, the simpler model cannot fit the data well, whereas the more complex model spreads its predictive probability over too broad a range of data sets and so assigns relatively small probability to any one of them.

Implicit in the Bayesian model comparison framework is the assumption that the true distribution from which the data are generated is contained within the set of models under consideration. Provided this is so, we can show that Bayesian model comparison will on average favour the correct model. To see this, consider two models M_1 and M_2 in which the truth corresponds to M_1. For a given finite data set, it is possible for the Bayes factor to be larger for the incorrect model. However, if we average the Bayes factor over the distribution of data sets, we obtain the expected Bayes factor in the form

\int p(D | M_1) \ln \frac{p(D | M_1)}{p(D | M_2)} \, dD    (3.73)

where the average has been taken with respect to the true distribution of the data. This quantity is an example of the Kullback-Leibler divergence (Section 1.6.1) and satisfies the property of always being positive unless the two distributions are equal, in which case it is zero. Thus on average the Bayes factor will always favour the correct model.

We have seen that the Bayesian framework avoids the problem of over-fitting and allows models to be compared on the basis of the training data alone.

185 However, a Bayesian approach, like any approach to pattern recognition, needs to make assumptions about the form of the model, and if these are invalid then the results can be misleading. In particular, we see from Figure 3.12 that the model evidence can be sensitive to many aspects of the prior, such as the behaviour in the tails. Indeed, the evidence is not defined if the prior is improper, as can be seen by noting that an improper prior has an arbitrary scaling factor (in other words, the normalization coefficient is not defined because the distribution cannot be normalized). If we consider a proper prior and then take a suitable limit in order to obtain an improper prior (for example, a Gaussian prior in which we take the limit of infinite variance) then the evidence will go to zero, as can be seen from (3.70) and Figure 3.12. It may, however, be possible to consider the evidence ratio between two models first and then take a limit to obtain a meaningful answer.

In a practical application, therefore, it will be wise to keep aside an independent test set of data on which to evaluate the overall performance of the final system.

3.5. The Evidence Approximation

In a fully Bayesian treatment of the linear basis function model, we would introduce prior distributions over the hyperparameters α and β and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters w. However, although we can integrate analytically over either w or over the hyperparameters, the complete marginalization over all of these variables is analytically intractable. Here we discuss an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters w. This framework is known in the statistics literature as empirical Bayes (Bernardo and Smith, 1994; Gelman et al., 2004), or type 2 maximum likelihood (Berger, 1985), or generalized maximum likelihood (Wahba, 1975), and in the machine learning literature is also called the evidence approximation (Gull, 1989; MacKay, 1992a).

If we introduce hyperpriors over α and β, the predictive distribution is obtained by marginalizing over w, α and β so that

p(t | t) = \int\!\!\int\!\!\int p(t | w, β) p(w | t, α, β) p(α, β | t) \, dw \, dα \, dβ    (3.74)

where p(t | w, β) is given by (3.8) and p(w | t, α, β) is given by (3.49) with m_N and S_N defined by (3.53) and (3.54) respectively. Here we have omitted the dependence on the input variable x to keep the notation uncluttered. If the posterior distribution p(α, β | t) is sharply peaked around values \hat{α} and \hat{β}, then the predictive distribution is obtained simply by marginalizing over w in which α and β are fixed to the values \hat{α} and \hat{β}, so that

p(t | t) ≃ p(t | t, \hat{α}, \hat{β}) = \int p(t | w, \hat{β}) p(w | t, \hat{α}, \hat{β}) \, dw.    (3.75)

186 From Bayes' theorem, the posterior distribution for α and β is given by

p(α, β | t) ∝ p(t | α, β) p(α, β).    (3.76)

If the prior is relatively flat, then in the evidence framework the values of \hat{α} and \hat{β} are obtained by maximizing the marginal likelihood function p(t | α, β). We shall proceed by evaluating the marginal likelihood for the linear basis function model and then finding its maxima. This will allow us to determine values for these hyperparameters from the training data alone, without recourse to cross-validation. Recall that the ratio α/β is analogous to a regularization parameter.

As an aside it is worth noting that, if we define conjugate (Gamma) prior distributions over α and β, then the marginalization over these hyperparameters in (3.74) can be performed analytically to give a Student's t-distribution over w (see Section 2.3.7). Although the resulting integral over w is no longer analytically tractable, it might be thought that approximating this integral, for example using the Laplace approximation (Section 4.4), which is based on a local Gaussian approximation centred on the mode of the posterior distribution, might provide a practical alternative to the evidence framework (Buntine and Weigend, 1991). However, the integrand as a function of w typically has a strongly skewed mode so that the Laplace approximation fails to capture the bulk of the probability mass, leading to poorer results than those obtained by maximizing the evidence (MacKay, 1999).

Returning to the evidence framework, we note that there are two approaches that we can take to the maximization of the log evidence. We can evaluate the evidence function analytically and then set its derivative equal to zero to obtain re-estimation equations for α and β, which we shall do in Section 3.5.2. Alternatively we can use a technique called the expectation maximization (EM) algorithm, which will be discussed in Section 9.3.4, where we shall also show that these two approaches converge to the same solution.

3.5.1 Evaluation of the evidence function

The marginal likelihood function p(t | α, β) is obtained by integrating over the weight parameters w, so that

p(t | α, β) = \int p(t | w, β) p(w | α) \, dw.    (3.77)

One way to evaluate this integral is to make use once again of the result (2.115) for the conditional distribution in a linear-Gaussian model. Here we shall evaluate the integral instead by completing the square in the exponent and making use of the standard form for the normalization coefficient of a Gaussian (Exercise 3.16). From (3.11), (3.12), and (3.52), we can write the evidence function in the form (Exercise 3.17)

p(t | α, β) = \left( \frac{β}{2π} \right)^{N/2} \left( \frac{α}{2π} \right)^{M/2} \int \exp\{ −E(w) \} \, dw    (3.78)

187 where M is the dimensionality of w, and we have defined

E(w) = β E_D(w) + α E_W(w)
= \frac{β}{2} \| t − Φw \|^2 + \frac{α}{2} w^T w.    (3.79)

We recognize (3.79) as being equal, up to a constant of proportionality, to the regularized sum-of-squares error function (3.27). We now complete the square over w (Exercise 3.18), giving

E(w) = E(m_N) + \frac{1}{2} (w − m_N)^T A (w − m_N)    (3.80)

where we have introduced

A = αI + β Φ^T Φ    (3.81)

together with

E(m_N) = \frac{β}{2} \| t − Φ m_N \|^2 + \frac{α}{2} m_N^T m_N.    (3.82)

Note that A corresponds to the matrix of second derivatives of the error function

A = ∇∇E(w)    (3.83)

and is known as the Hessian matrix. Here we have also defined m_N given by

m_N = β A^{−1} Φ^T t.    (3.84)

Using (3.54), we see that A = S_N^{−1}, and hence (3.84) is equivalent to the previous definition (3.53), and therefore represents the mean of the posterior distribution.

The integral over w can now be evaluated simply by appealing to the standard result for the normalization coefficient of a multivariate Gaussian, giving (Exercise 3.19)

\int \exp\{ −E(w) \} \, dw
= \exp\{ −E(m_N) \} \int \exp\left\{ −\frac{1}{2} (w − m_N)^T A (w − m_N) \right\} dw
= \exp\{ −E(m_N) \} (2π)^{M/2} |A|^{−1/2}.    (3.85)

Using (3.78) we can then write the log of the marginal likelihood in the form

\ln p(t | α, β) = \frac{M}{2} \ln α + \frac{N}{2} \ln β − E(m_N) − \frac{1}{2} \ln |A| − \frac{N}{2} \ln(2π)    (3.86)

which is the required expression for the evidence function.
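Equation (3.86) translates directly into a few lines of code; a sketch (illustrative names, ours) that evaluates the log evidence for given α and β:

import numpy as np

def log_evidence(Phi, t, alpha, beta):
    # Log marginal likelihood (3.86):
    #   M/2 ln(alpha) + N/2 ln(beta) - E(m_N) - 1/2 ln|A| - N/2 ln(2 pi)
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi          # Hessian (3.81)
    m_N = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean (3.84)
    E_mN = (beta / 2) * np.sum((t - Phi @ m_N) ** 2) \
           + (alpha / 2) * m_N @ m_N                    # (3.82)
    _, logdetA = np.linalg.slogdet(A)                   # numerically stable ln|A|
    return (M / 2) * np.log(alpha) + (N / 2) * np.log(beta) \
           - E_mN - logdetA / 2 - (N / 2) * np.log(2 * np.pi)

Evaluating this function for design matrices of increasing polynomial order reproduces the kind of comparison shown in Figure 3.14 below.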

188 Figure 3.14: Plot of the model evidence versus the order M, for the polynomial regression model, showing that the evidence favours the model with M = 3.

Returning to the polynomial regression problem, we can plot the model evidence against the order of the polynomial, as shown in Figure 3.14. Here we have assumed a prior of the form (1.65) with the parameter α fixed at α = 5 × 10^{−3}. The form of this plot is very instructive. Referring back to Figure 1.4, we see that the M = 0 polynomial has very poor fit to the data and consequently gives a relatively low value for the evidence. Going to the M = 1 polynomial greatly improves the data fit, and hence the evidence is significantly higher. However, in going to M = 2, the data fit is improved only very marginally, due to the fact that the underlying sinusoidal function from which the data is generated is an odd function and so has no even terms in a polynomial expansion. Indeed, Figure 1.5 shows that the residual data error is reduced only slightly in going from M = 1 to M = 2. Because this richer model suffers a greater complexity penalty, the evidence actually falls in going from M = 1 to M = 2. When we go to M = 3 we obtain a significant further improvement in data fit, as seen in Figure 1.4, and so the evidence is increased again, giving the highest overall evidence for any of the polynomials. Further increases in the value of M produce only small improvements in the fit to the data but suffer increasing complexity penalty, leading overall to a decrease in the evidence values. Looking again at Figure 1.5, we see that the generalization error is roughly constant between M = 3 and M = 8, and it would be difficult to choose between these models on the basis of this plot alone. The evidence values, however, show a clear preference for M = 3, since this is the simplest model which gives a good explanation for the observed data.

189 3.5. The Evidence Approximation

Multiplying through by 2α and rearranging, we obtain

    \alpha\, m_N^T m_N = M - \alpha \sum_i \frac{1}{\lambda_i + \alpha} = \gamma.    (3.90)

Since there are M terms in the sum over i, the quantity γ can be written

    \gamma = \sum_i \frac{\lambda_i}{\lambda_i + \alpha}.    (3.91)

The interpretation of the quantity γ will be discussed shortly. From (3.90) we see that the value of α that maximizes the marginal likelihood satisfies (Exercise 3.20)

    \alpha = \frac{\gamma}{m_N^T m_N}.    (3.92)

Note that this is an implicit solution for α not only because γ depends on α, but also because the mode m_N of the posterior distribution itself depends on the choice of α. We therefore adopt an iterative procedure in which we make an initial choice for α and use this to find m_N, which is given by (3.53), and also to evaluate γ, which is given by (3.91). These values are then used to re-estimate α using (3.92), and the process repeated until convergence. Note that because the matrix Φ^T Φ is fixed, we can compute its eigenvalues once at the start and then simply multiply these by β to obtain the λ_i.

It should be emphasized that the value of α has been determined purely by looking at the training data. In contrast to maximum likelihood methods, no independent data set is required in order to optimize the model complexity.

We can similarly maximize the log marginal likelihood (3.86) with respect to β. To do this, we note that the eigenvalues λ_i defined by (3.87) are proportional to β, and hence dλ_i/dβ = λ_i/β, giving

    \frac{d}{d\beta} \ln|A| = \frac{d}{d\beta} \sum_i \ln(\lambda_i + \alpha) = \frac{1}{\beta} \sum_i \frac{\lambda_i}{\lambda_i + \alpha} = \frac{\gamma}{\beta}.    (3.93)

The stationary point of the marginal likelihood therefore satisfies

    0 = \frac{N}{2\beta} - \frac{1}{2} \sum_{n=1}^{N} \{t_n - m_N^T \phi(x_n)\}^2 - \frac{\gamma}{2\beta}    (3.94)

and rearranging we obtain (Exercise 3.22)

    \frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{t_n - m_N^T \phi(x_n)\}^2.    (3.95)

Again, this is an implicit solution for β and can be solved by choosing an initial value for β and then using this to calculate m_N and γ, and then re-estimating β using (3.95), repeating until convergence. If both α and β are to be determined from the data, then their values can be re-estimated together after each update of γ.
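The iterative procedure just described is only a few lines in code. Below is a minimal sketch (Python/NumPy; the names are ours) that re-estimates α and β jointly from (3.91), (3.92), and (3.95), computing the eigenvalues of Φ^T Φ once at the start as suggested above.

    import numpy as np

    def evidence_hyperparams(Phi, t, alpha=1.0, beta=1.0, n_iter=100, tol=1e-6):
        """Re-estimate alpha and beta by iterating (3.92) and (3.95)."""
        N, M = Phi.shape
        # Eigenvalues of Phi^T Phi are computed once; multiplying by beta
        # gives the lambda_i of (3.87) at each iteration.
        eig0 = np.linalg.eigvalsh(Phi.T @ Phi)
        for _ in range(n_iter):
            lam = beta * eig0
            A = alpha * np.eye(M) + beta * Phi.T @ Phi
            m_N = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean, (3.84)
            gamma = np.sum(lam / (lam + alpha))                 # effective parameters, (3.91)
            alpha_new = gamma / (m_N @ m_N)                     # (3.92)
            beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)  # (3.95)
            converged = (abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol)
            alpha, beta = alpha_new, beta_new
            if converged:
                break
        return alpha, beta, m_N, gamma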

190 3. LINEAR MODELS FOR REGRESSION

Figure 3.15  Contours of the likelihood function (red) and the prior (green) in which the axes in parameter space have been rotated to align with the eigenvectors u_i of the Hessian. For α = 0, the mode of the posterior is given by the maximum likelihood solution w_ML, whereas for nonzero α the mode is at w_MAP = m_N. In the direction w_1 the eigenvalue λ_1, defined by (3.87), is small compared with α and so the quantity λ_1/(λ_1 + α) is close to zero, and the corresponding MAP value of w_1 is also close to zero. By contrast, in the direction w_2 the eigenvalue λ_2 is large compared with α and so the quantity λ_2/(λ_2 + α) is close to unity, and the MAP value of w_2 is close to its maximum likelihood value.

3.5.3 Effective number of parameters

The result (3.92) has an elegant interpretation (MacKay, 1992a), which provides insight into the Bayesian solution for α. To see this, consider the contours of the likelihood function and the prior as illustrated in Figure 3.15. Here we have implicitly transformed to a rotated set of axes in parameter space aligned with the eigenvectors u_i defined in (3.87). Contours of the likelihood function are then axis-aligned ellipses. The eigenvalues λ_i measure the curvature of the likelihood function, and so in Figure 3.15 the eigenvalue λ_1 is small compared with λ_2 (because a smaller curvature corresponds to a greater elongation of the contours of the likelihood function). Because β Φ^T Φ is a positive definite matrix, it will have positive eigenvalues, and so the ratio λ_i/(λ_i + α) will lie between 0 and 1. Consequently, the quantity γ defined by (3.91) will lie in the range 0 ≤ γ ≤ M. For directions in which λ_i ≫ α, the corresponding parameter w_i will be close to its maximum likelihood value, and the ratio λ_i/(λ_i + α) will be close to 1. Such parameters are called well determined because their values are tightly constrained by the data. Conversely, for directions in which λ_i ≪ α, the corresponding parameters w_i will be close to zero, as will the ratios λ_i/(λ_i + α). These are directions in which the likelihood function is relatively insensitive to the parameter value and so the parameter has been set to a small value by the prior. The quantity γ defined by (3.91) therefore measures the effective total number of well determined parameters.

We can obtain some insight into the result (3.95) for re-estimating β by comparing it with the corresponding maximum likelihood result given by (3.21). Both of these formulae express the variance (the inverse precision) as an average of the squared differences between the targets and the model predictions. However, they differ in that the number of data points N in the denominator of the maximum likelihood result is replaced by N − γ in the Bayesian result. We recall from (1.56) that the maximum likelihood estimate of the variance for a Gaussian distribution over a

191 3.5. The Evidence Approximation

single variable x is given by

    \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2    (3.96)

and that this estimate is biased because the maximum likelihood solution μ_ML for the mean has fitted some of the noise on the data. In effect, this has used up one degree of freedom in the model. The corresponding unbiased estimate is given by (1.59) and takes the form

    \sigma^2_{MAP} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{ML})^2.    (3.97)

We shall see in Section 10.1.3 that this result can be obtained from a Bayesian treatment in which we marginalize over the unknown mean. The factor of N − 1 in the denominator of the Bayesian result takes account of the fact that one degree of freedom has been used in fitting the mean and removes the bias of maximum likelihood. Now consider the corresponding results for the linear regression model. The mean of the target distribution is now given by the function w^T φ(x), which contains M parameters. However, not all of these parameters are tuned to the data. The effective number of parameters that are determined by the data is γ, with the remaining M − γ parameters set to small values by the prior. This is reflected in the Bayesian result for the variance that has a factor N − γ in the denominator, thereby correcting for the bias of the maximum likelihood result.

We can illustrate the evidence framework for setting hyperparameters using the sinusoidal synthetic data set from Section 1.1, together with the Gaussian basis function model comprising 9 basis functions, so that the total number of parameters in the model is given by M = 10 including the bias. Here, for simplicity of illustration, we have set β to its true value of 11.1 and then used the evidence framework to determine α, as shown in Figure 3.16.

We can also see how the parameter α controls the magnitude of the parameters {w_i}, by plotting the individual parameters versus the effective number γ of parameters, as shown in Figure 3.17.

If we consider the limit N ≫ M in which the number of data points is large in relation to the number of parameters, then from (3.87) all of the parameters will be well determined by the data because Φ^T Φ involves an implicit sum over data points, and so the eigenvalues λ_i increase with the size of the data set. In this case, γ = M, and the re-estimation equations for α and β become

    \alpha = \frac{M}{2 E_W(m_N)}    (3.98)

    \beta = \frac{N}{2 E_D(m_N)}    (3.99)

where E_W and E_D are defined by (3.25) and (3.26), respectively. These results can be used as an easy-to-compute approximation to the full evidence re-estimation formulae, because they do not require evaluation of the eigenvalue spectrum of the Hessian.
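For concreteness, here is a sketch of this quick approximation (Python/NumPy; names ours). It assumes the large-N regime in which γ ≈ M, so no eigendecomposition is needed at all.

    import numpy as np

    def quick_hyperparams(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
        """Approximate re-estimation via (3.98) and (3.99); assumes N >> M."""
        N, M = Phi.shape
        for _ in range(n_iter):
            A = alpha * np.eye(M) + beta * Phi.T @ Phi
            m_N = beta * np.linalg.solve(A, Phi.T @ t)
            E_W = 0.5 * m_N @ m_N                        # (3.25)
            E_D = 0.5 * np.sum((t - Phi @ m_N) ** 2)     # (3.26)
            alpha = M / (2 * E_W)                        # (3.98)
            beta = N / (2 * E_D)                         # (3.99)
        return alpha, beta, m_N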

192 3. LINEAR MODELS FOR REGRESSION

Figure 3.16  The left plot shows γ (red curve) and 2αE_W(m_N) (blue curve) versus ln α for the sinusoidal synthetic data set. It is the intersection of these two curves that defines the optimum value for α given by the evidence procedure. The right plot shows the corresponding graph of log evidence ln p(t|α, β) versus ln α (red curve) showing that the peak coincides with the crossing point of the curves in the left plot. Also shown is the test set error (blue curve) showing that the evidence maximum occurs close to the point of best generalization.

Figure 3.17  Plot of the 10 parameters w_i from the Gaussian basis function model versus the effective number of parameters γ, in which the hyperparameter α is varied in the range 0 ≤ α ≤ ∞, causing γ to vary in the range 0 ≤ γ ≤ M.

3.6. Limitations of Fixed Basis Functions

Throughout this chapter, we have focussed on models comprising a linear combination of fixed, nonlinear basis functions. We have seen that the assumption of linearity in the parameters led to a range of useful properties including closed-form solutions to the least-squares problem, as well as a tractable Bayesian treatment. Furthermore, for a suitable choice of basis functions, we can model arbitrary nonlinearities in the

193 Exercises

mapping from input variables to targets. In the next chapter, we shall study an analogous class of models for classification. It might appear, therefore, that such linear models constitute a general purpose framework for solving problems in pattern recognition. Unfortunately, there are some significant shortcomings with linear models, which will cause us to turn in later chapters to more complex models such as support vector machines and neural networks.

The difficulty stems from the assumption that the basis functions φ_j(x) are fixed before the training data set is observed and is a manifestation of the curse of dimensionality discussed in Section 1.4. As a consequence, the number of basis functions needs to grow rapidly, often exponentially, with the dimensionality D of the input space.

Fortunately, there are two properties of real data sets that we can exploit to help alleviate this problem. First of all, the data vectors {x_n} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space as a result of strong correlations between the input variables. We will see an example of this when we consider images of handwritten digits in Chapter 12. If we are using localized basis functions, we can arrange that they are scattered in input space only in regions containing data. This approach is used in radial basis function networks and also in support vector and relevance vector machines. Neural network models, which use adaptive basis functions having sigmoidal nonlinearities, can adapt the parameters so that the regions of input space over which the basis functions vary corresponds to the data manifold. The second property is that target variables may have significant dependence on only a small number of possible directions within the data manifold. Neural networks can exploit this property by choosing the directions in input space to which the basis functions respond.

Exercises

3.1 ( ) www  Show that the 'tanh' function and the logistic sigmoid function (3.6) are related by

    \tanh(a) = 2\sigma(2a) - 1.    (3.100)

Hence show that a general linear combination of logistic sigmoid functions of the form

    y(x, w) = w_0 + \sum_{j=1}^{M} w_j\, \sigma\!\left(\frac{x - \mu_j}{s}\right)    (3.101)

is equivalent to a linear combination of 'tanh' functions of the form

    y(x, u) = u_0 + \sum_{j=1}^{M} u_j \tanh\!\left(\frac{x - \mu_j}{2s}\right)    (3.102)

and find expressions to relate the new parameters {u_1, ..., u_M} to the original parameters {w_1, ..., w_M}.
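(As a quick numerical sanity check on the identity (3.100), separate from the exercise itself, the following short Python/NumPy snippet verifies it on a grid of points.)

    import numpy as np

    a = np.linspace(-5, 5, 101)
    sigma_2a = 1 / (1 + np.exp(-2 * a))              # logistic sigmoid evaluated at 2a
    assert np.allclose(np.tanh(a), 2 * sigma_2a - 1)  # tanh(a) = 2*sigma(2a) - 1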

194 3. LINEAR MODELS FOR REGRESSION

3.2 ( )  Show that the matrix

    \Phi (\Phi^T \Phi)^{-1} \Phi^T    (3.103)

takes any vector v and projects it onto the space spanned by the columns of Φ. Use this result to show that the least-squares solution (3.15) corresponds to an orthogonal projection of the vector t onto the manifold S, as shown in Figure 3.2.

3.3 ( )  Consider a data set in which each data point t_n is associated with a weighting factor r_n > 0, so that the sum-of-squares error function becomes

    E_D(w) = \frac{1}{2} \sum_{n=1}^{N} r_n \{t_n - w^T \phi(x_n)\}^2.    (3.104)

Find an expression for the solution w* that minimizes this error function. Give two alternative interpretations of the weighted sum-of-squares error function in terms of (i) data dependent noise variance and (ii) replicated data points.

3.4 ( ) www  Consider a linear model of the form

    y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i    (3.105)

together with a sum-of-squares error function of the form

    E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2.    (3.106)

Now suppose that Gaussian noise ε_i with zero mean and variance σ² is added independently to each of the input variables x_i. By making use of E[ε_i] = 0 and E[ε_i ε_j] = δ_{ij} σ², show that minimizing E_D averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for noise-free input variables with the addition of a weight-decay regularization term, in which the bias parameter w_0 is omitted from the regularizer.

3.5 ( ) www  Using the technique of Lagrange multipliers, discussed in Appendix E, show that minimization of the regularized error function (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint (3.30). Discuss the relationship between the parameters η and λ.

3.6 ( ) www  Consider a linear basis function regression model for a multivariate target variable t having a Gaussian distribution of the form

    p(t | W, \Sigma) = N(t | y(x, W), \Sigma)    (3.107)

where

    y(x, W) = W^T \phi(x)    (3.108)

195 Exercises

together with a training data set comprising input basis vectors φ(x_n) and corresponding target vectors t_n, with n = 1, ..., N. Show that the maximum likelihood solution W_ML for the parameter matrix W has the property that each column is given by an expression of the form (3.15), which was the solution for an isotropic noise distribution. Note that this is independent of the covariance matrix Σ. Show that the maximum likelihood solution for Σ is given by

    \Sigma = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - W_{ML}^T \phi(x_n) \right)\left( t_n - W_{ML}^T \phi(x_n) \right)^T.    (3.109)

3.7 ( )  By using the technique of completing the square, verify the result (3.49) for the posterior distribution of the parameters w in the linear basis function model in which m_N and S_N are defined by (3.50) and (3.51) respectively.

3.8 ( ) www  Consider the linear basis function model in Section 3.1, and suppose that we have already observed N data points, so that the posterior distribution over w is given by (3.49). This posterior can be regarded as the prior for the next observation. By considering an additional data point (x_{N+1}, t_{N+1}), and by completing the square in the exponential, show that the resulting posterior distribution is again given by (3.49) but with S_N replaced by S_{N+1} and m_N replaced by m_{N+1}.

3.9 ( )  Repeat the previous exercise but instead of completing the square by hand, make use of the general result for linear-Gaussian models given by (2.116).

3.10 ( ) www  By making use of the result (2.115) to evaluate the integral in (3.57), verify that the predictive distribution for the Bayesian linear regression model is given by (3.58) in which the input-dependent variance is given by (3.59).

3.11 ( )  We have seen that, as the size of a data set increases, the uncertainty associated with the posterior distribution over model parameters decreases. Make use of the matrix identity (Appendix C)

    (M + v v^T)^{-1} = M^{-1} - \frac{(M^{-1} v)(v^T M^{-1})}{1 + v^T M^{-1} v}    (3.110)

to show that the uncertainty σ_N²(x) associated with the linear regression function given by (3.59) satisfies

    \sigma_{N+1}^2(x) \le \sigma_N^2(x).    (3.111)

3.12 ( )  We saw in Section 2.3.6 that the conjugate prior for a Gaussian distribution with unknown mean and unknown precision (inverse variance) is a normal-gamma distribution. This property also holds for the case of the conditional Gaussian distribution p(t|x, w, β) of the linear regression model. If we consider the likelihood function (3.10), then the conjugate prior for w and β is given by

    p(w, \beta) = N(w | m_0, \beta^{-1} S_0)\, \mathrm{Gam}(\beta | a_0, b_0).    (3.112)

196 3. LINEAR MODELS FOR REGRESSION

Show that the corresponding posterior distribution takes the same functional form, so that

    p(w, \beta | t) = N(w | m_N, \beta^{-1} S_N)\, \mathrm{Gam}(\beta | a_N, b_N)    (3.113)

and find expressions for the posterior parameters m_N, S_N, a_N, and b_N.

3.13 ( )  Show that the predictive distribution p(t | x, t) for the model discussed in Exercise 3.12 is given by a Student's t-distribution of the form

    p(t | x, t) = \mathrm{St}(t | \mu, \lambda, \nu)    (3.114)

and obtain expressions for μ, λ and ν.

3.14 ( )  In this exercise, we explore in more detail the properties of the equivalent kernel defined by (3.62), where S_N is defined by (3.54). Suppose that the basis functions φ_j(x) are linearly independent and that the number N of data points is greater than the number M of basis functions. Furthermore, let one of the basis functions be constant, say φ_0(x) = 1. By taking suitable linear combinations of these basis functions, we can construct a new basis set ψ_j(x) spanning the same space but that are orthonormal, so that

    \sum_{n=1}^{N} \psi_j(x_n)\, \psi_k(x_n) = I_{jk}    (3.115)

where I_{jk} is defined to be 1 if j = k and 0 otherwise, and we take ψ_0(x) = 1. Show that for α = 0, the equivalent kernel can be written as k(x, x') = ψ(x)^T ψ(x'), where ψ = (ψ_1, ..., ψ_M)^T. Use this result to show that the kernel satisfies the summation constraint

    \sum_{n=1}^{N} k(x, x_n) = 1.    (3.116)

3.15 ( ) www  Consider a linear basis function model for regression in which the parameters α and β are set using the evidence framework. Show that the function E(m_N) defined by (3.82) satisfies the relation 2E(m_N) = N.

3.16 ( )  Derive the result (3.86) for the log evidence function p(t|α, β) of the linear regression model by making use of (2.115) to evaluate the integral (3.77) directly.

3.17 ( )  Show that the evidence function for the Bayesian linear regression model can be written in the form (3.78) in which E(w) is defined by (3.79).

3.18 ( ) www  By completing the square over w, show that the error function (3.79) in Bayesian linear regression can be written in the form (3.80).

3.19 ( )  Show that the integration over w in the Bayesian linear regression model gives the result (3.85). Hence show that the log marginal likelihood is given by (3.86).

197 Exercises

3.20 ( ) www  Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to α leads to the re-estimation equation (3.92).

3.21 ( )  An alternative way to derive the result (3.92) for the optimal value of α in the evidence framework is to make use of the identity

    \frac{d}{d\alpha} \ln|A| = \mathrm{Tr}\!\left( A^{-1} \frac{d}{d\alpha} A \right).    (3.117)

Prove this identity by considering the eigenvalue expansion of a real, symmetric matrix A, and making use of the standard results for the determinant and trace of A expressed in terms of its eigenvalues (Appendix C). Then make use of (3.117) to derive (3.92) starting from (3.86).

3.22 ( )  Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to β leads to the re-estimation equation (3.95).

3.23 ( ) www  Show that the marginal probability of the data, in other words the model evidence, for the model described in Exercise 3.12 is given by

    p(t) = \frac{1}{(2\pi)^{N/2}} \, \frac{b_0^{a_0}}{b_N^{a_N}} \, \frac{\Gamma(a_N)}{\Gamma(a_0)} \, \frac{|S_N|^{1/2}}{|S_0|^{1/2}}    (3.118)

by first marginalizing with respect to w and then with respect to β.

3.24 ( )  Repeat the previous exercise but now use Bayes' theorem in the form

    p(t) = \frac{p(t | w, \beta)\, p(w, \beta)}{p(w, \beta | t)}    (3.119)

and then substitute for the prior and posterior distributions and the likelihood function in order to derive the result (3.118).


199 4 Linear Models for Classification

In the previous chapter, we explored a class of regression models having particularly simple analytical and computational properties. We now discuss an analogous class of models for solving classification problems. The goal in classification is to take an input vector x and to assign it to one of K discrete classes C_k where k = 1, ..., K. In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class. The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. In this chapter, we consider linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector x and hence are defined by (D − 1)-dimensional hyperplanes within the D-dimensional input space. Data sets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable.

For regression problems, the target variable t was simply the vector of real numbers whose values we wish to predict. In the case of classification, there are various

200 4. LINEAR MODELS FOR CLASSIFICATION

ways of using target values to represent class labels. For probabilistic models, the most convenient, in the case of two-class problems, is the binary representation in which there is a single target variable t ∈ {0, 1} such that t = 1 represents class C_1 and t = 0 represents class C_2. We can interpret the value of t as the probability that the class is C_1, with the values of probability taking only the extreme values of 0 and 1. For K > 2 classes, it is convenient to use a 1-of-K coding scheme in which t is a vector of length K such that if the class is C_j, then all elements t_k of t are zero except element t_j, which takes the value 1. For instance, if we have K = 5 classes, then a pattern from class 2 would be given the target vector

    t = (0, 1, 0, 0, 0)^T.    (4.1)

Again, we can interpret the value of t_k as the probability that the class is C_k. For nonprobabilistic models, alternative choices of target variable representation will sometimes prove convenient.

In Chapter 1, we identified three distinct approaches to the classification problem. The simplest involves constructing a discriminant function that directly assigns each vector x to a specific class. A more powerful approach, however, models the conditional probability distribution p(C_k | x) in an inference stage, and then subsequently uses this distribution to make optimal decisions. By separating inference and decision, we gain numerous benefits, as discussed in Section 1.5.4. There are two different approaches to determining the conditional probabilities p(C_k | x). One technique is to model them directly, for example by representing them as parametric models and then optimizing the parameters using a training set. Alternatively, we can adopt a generative approach in which we model the class-conditional densities given by p(x | C_k), together with the prior probabilities p(C_k) for the classes, and then we compute the required posterior probabilities using Bayes' theorem

    p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{p(x)}.    (4.2)

We shall discuss examples of all three approaches in this chapter.

In the linear regression models considered in Chapter 3, the model prediction y(x, w) was given by a linear function of the parameters w. In the simplest case, the model is also linear in the input variables and therefore takes the form y(x) = w^T x + w_0, so that y is a real number. For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range (0, 1). To achieve this, we consider a generalization of this model in which we transform the linear function of w using a nonlinear function f(·) so that

    y(x) = f(w^T x + w_0).    (4.3)

In the machine learning literature f(·) is known as an activation function, whereas its inverse is called a link function in the statistics literature. The decision surfaces correspond to y(x) = constant, so that w^T x + w_0 = constant, and hence the decision surfaces are linear functions of x, even if the function f(·) is nonlinear. For this reason, the class of models described by (4.3) are called generalized linear models

201 4.1. Discriminant Functions

(McCullagh and Nelder, 1989). Note, however, that in contrast to the models used for regression, they are no longer linear in the parameters due to the presence of the nonlinear function f(·). This will lead to more complex analytical and computational properties than for linear regression models. Nevertheless, these models are still relatively simple compared to the more general nonlinear models that will be studied in subsequent chapters.

The algorithms discussed in this chapter will be equally applicable if we first make a fixed nonlinear transformation of the input variables using a vector of basis functions φ(x), as we did for regression models in Chapter 3. We begin by considering classification directly in the original input space x, while in Section 4.3 we shall find it convenient to switch to a notation involving basis functions for consistency with later chapters.

4.1. Discriminant Functions

A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k. In this chapter, we shall restrict attention to linear discriminants, namely those for which the decision surfaces are hyperplanes. To simplify the discussion, we consider first the case of two classes and then investigate the extension to K > 2 classes.

4.1.1 Two classes

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that

    y(x) = w^T x + w_0    (4.4)

where w is called a weight vector, and w_0 is a bias (not to be confused with bias in the statistical sense). The negative of the bias is sometimes called a threshold. An input vector x is assigned to class C_1 if y(x) ≥ 0 and to class C_2 otherwise. The corresponding decision boundary is therefore defined by the relation y(x) = 0, which corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space. Consider two points x_A and x_B both of which lie on the decision surface. Because y(x_A) = y(x_B) = 0, we have w^T(x_A − x_B) = 0, and hence the vector w is orthogonal to every vector lying within the decision surface, and so w determines the orientation of the decision surface. Similarly, if x is a point on the decision surface, then y(x) = 0, and so the normal distance from the origin to the decision surface is given by

    \frac{w^T x}{\|w\|} = -\frac{w_0}{\|w\|}.    (4.5)

We therefore see that the bias parameter w_0 determines the location of the decision surface. These properties are illustrated for the case of D = 2 in Figure 4.1.

Furthermore, we note that the value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface. To see this, consider

202 4. LINEAR MODELS FOR CLASSIFICATION

Figure 4.1  Illustration of the geometry of a linear discriminant function in two dimensions. The decision surface, shown in red, is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w_0. Also, the signed orthogonal distance of a general point x from the decision surface is given by y(x)/‖w‖.

an arbitrary point x and let x_⊥ be its orthogonal projection onto the decision surface, so that

    x = x_\perp + r \frac{w}{\|w\|}.    (4.6)

Multiplying both sides of this result by w^T and adding w_0, and making use of y(x) = w^T x + w_0 and y(x_⊥) = w^T x_⊥ + w_0 = 0, we have

    r = \frac{y(x)}{\|w\|}.    (4.7)

This result is illustrated in Figure 4.1.

As with the linear regression models in Chapter 3, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy 'input' value x_0 = 1 and then define w̃ = (w_0, w) and x̃ = (x_0, x), so that

    y(x) = \tilde{w}^T \tilde{x}.    (4.8)

In this case, the decision surfaces are D-dimensional hyperplanes passing through the origin of the (D + 1)-dimensional expanded input space.

4.1.2 Multiple classes

Now consider the extension of linear discriminants to K > 2 classes. We might be tempted to build a K-class discriminant by combining a number of two-class discriminant functions. However, this leads to some serious difficulties (Duda and Hart, 1973), as we now show.

Consider the use of K − 1 classifiers each of which solves a two-class problem of separating points in a particular class C_k from points not in that class. This is known as a one-versus-the-rest classifier. The left-hand example in Figure 4.2 shows an

203 4.1. Discriminant Functions

Figure 4.2  Attempting to construct a K-class discriminant from a set of two-class discriminants leads to ambiguous regions, shown in green. On the left is an example involving the use of two discriminants designed to distinguish points in class C_k from points not in class C_k. On the right is an example involving three discriminant functions each of which is used to separate a pair of classes C_j and C_k.

example involving three classes where this approach leads to regions of input space that are ambiguously classified.

An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.

We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form

    y_k(x) = w_k^T x + w_{k0}    (4.9)

and then assigning a point x to class C_k if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between class C_k and class C_j is therefore given by y_k(x) = y_j(x) and hence corresponds to a (D − 1)-dimensional hyperplane defined by

    (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0.    (4.10)

This has the same form as the decision boundary for the two-class case discussed in Section 4.1.1, and so analogous geometrical properties apply.

The decision regions of such a discriminant are always singly connected and convex. To see this, consider two points x_A and x_B both of which lie inside decision region R_k, as illustrated in Figure 4.3. Any point x̂ that lies on the line connecting x_A and x_B can be expressed in the form

    \hat{x} = \lambda x_A + (1 - \lambda) x_B    (4.11)

204 4. LINEAR MODELS FOR CLASSIFICATION

Figure 4.3  Illustration of the decision regions for a multiclass linear discriminant, with the decision boundaries shown in red. If two points x_A and x_B both lie inside the same decision region R_k, then any point x̂ that lies on the line connecting these two points must also lie in R_k, and hence the decision region must be singly connected and convex.

where 0 ≤ λ ≤ 1. From the linearity of the discriminant functions, it follows that

    y_k(\hat{x}) = \lambda y_k(x_A) + (1 - \lambda) y_k(x_B).    (4.12)

Because both x_A and x_B lie inside R_k, it follows that y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B), for all j ≠ k, and hence y_k(x̂) > y_j(x̂), and so x̂ also lies inside R_k. Thus R_k is singly connected and convex.

Note that for two classes, we can either employ the formalism discussed here, based on two discriminant functions y_1(x) and y_2(x), or else use the simpler but equivalent formulation described in Section 4.1.1 based on a single discriminant function y(x).

We now explore three approaches to learning the parameters of linear discriminant functions, based on least squares, Fisher's linear discriminant, and the perceptron algorithm.

4.1.3 Least squares for classification

In Chapter 3, we considered models that were linear functions of the parameters, and we saw that the minimization of a sum-of-squares error function led to a simple closed-form solution for the parameter values. It is therefore tempting to see if we can apply the same formalism to classification problems. Consider a general classification problem with K classes, with a 1-of-K binary coding scheme for the target vector t. One justification for using least squares in such a context is that it approximates the conditional expectation E[t|x] of the target values given the input vector. For the binary coding scheme, this conditional expectation is given by the vector of posterior class probabilities. Unfortunately, however, these probabilities are typically approximated rather poorly, indeed the approximations can have values outside the range (0, 1), due to the limited flexibility of a linear model, as we shall see shortly.

Each class C_k is described by its own linear model so that

    y_k(x) = w_k^T x + w_{k0}    (4.13)

where k = 1, ..., K. We can conveniently group these together using vector notation so that

    y(x) = \widetilde{W}^T \tilde{x}    (4.14)

205 4.1. Discriminant Functions

where W̃ is a matrix whose kth column comprises the (D + 1)-dimensional vector w̃_k = (w_{k0}, w_k^T)^T and x̃ is the corresponding augmented input vector (1, x^T)^T with a dummy input x_0 = 1. This representation was discussed in detail in Section 3.1. A new input x is then assigned to the class for which the output y_k = w̃_k^T x̃ is largest.

We now determine the parameter matrix W̃ by minimizing a sum-of-squares error function, as we did for regression in Chapter 3. Consider a training data set {x_n, t_n} where n = 1, ..., N, and define a matrix T whose nth row is the vector t_n^T, together with a matrix X̃ whose nth row is x̃_n^T. The sum-of-squares error function can then be written as

    E_D(\widetilde{W}) = \frac{1}{2} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\}.    (4.15)

Setting the derivative with respect to W̃ to zero, and rearranging, we then obtain the solution for W̃ in the form

    \widetilde{W} = (\widetilde{X}^T \widetilde{X})^{-1} \widetilde{X}^T T = \widetilde{X}^{\dagger} T    (4.16)

where X̃† is the pseudo-inverse of the matrix X̃, as discussed in Section 3.1.1. We then obtain the discriminant function in the form

    y(x) = \widetilde{W}^T \tilde{x} = T^T (\widetilde{X}^{\dagger})^T \tilde{x}.    (4.17)

An interesting property of least-squares solutions with multiple target variables is that if every target vector in the training set satisfies some linear constraint

    a^T t_n + b = 0    (4.18)

for some constants a and b, then the model prediction for any value of x will satisfy the same constraint (Exercise 4.2), so that

    a^T y(x) + b = 0.    (4.19)

Thus if we use a 1-of-K coding scheme for K classes, then the predictions made by the model will have the property that the elements of y(x) will sum to 1 for any value of x. However, this summation constraint alone is not sufficient to allow the model outputs to be interpreted as probabilities because they are not constrained to lie within the interval (0, 1).

The least-squares approach gives an exact closed-form solution for the discriminant function parameters. However, even as a discriminant function (where we use it to make decisions directly and dispense with any probabilistic interpretation) it suffers from some severe problems. We have already seen (Section 2.3.7) that least-squares solutions lack robustness to outliers, and this applies equally to the classification application, as illustrated in Figure 4.4. Here we see that the additional data points in the right-hand figure produce a significant change in the location of the decision boundary, even though these points would be correctly classified by the original decision boundary in the left-hand figure. The sum-of-squares error function penalizes predictions that are 'too correct' in that they lie a long way on the correct side of the decision boundary.
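The closed-form solution (4.16)–(4.17) translates directly into a few lines of code. Below is a minimal sketch (Python/NumPy; the function names are ours) that fits the least-squares discriminant with 1-of-K targets and classifies by taking the largest output.

    import numpy as np

    def fit_least_squares_classifier(X, labels, K):
        """Least-squares discriminant: W_tilde = pinv(X_tilde) @ T, eq. (4.16).

        X      : (N, D) inputs;  labels : (N,) integer classes in {0, ..., K-1}
        """
        N = X.shape[0]
        X_tilde = np.hstack([np.ones((N, 1)), X])    # prepend dummy input x0 = 1
        T = np.eye(K)[labels]                        # 1-of-K target matrix
        return np.linalg.pinv(X_tilde) @ T

    def predict(W_tilde, X):
        X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
        return np.argmax(X_tilde @ W_tilde, axis=1)  # largest y_k wins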

206 4. LINEAR MODELS FOR CLASSIFICATION

Figure 4.4  The left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and also by the logistic regression model (green curve), which is discussed later in Section 4.3.2. The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.

In Section 7.1.2, we shall consider several alternative error functions for classification and we shall see that they do not suffer from this difficulty.

However, problems with least squares can be more severe than simply lack of robustness, as illustrated in Figure 4.5. This shows a synthetic data set drawn from three classes in a two-dimensional input space (x_1, x_2), having the property that linear decision boundaries can give excellent separation between the classes. Indeed, the technique of logistic regression, described later in this chapter, gives a satisfactory solution as seen in the right-hand plot. However, the least-squares solution gives poor results, with only a small region of the input space assigned to the green class.

The failure of least squares should not surprise us when we recall that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian. By adopting more appropriate probabilistic models, we shall obtain classification techniques with much better properties than least squares. For the moment, however, we continue to explore alternative nonprobabilistic methods for setting the parameters in the linear classification models.

4.1.4 Fisher's linear discriminant

One way to view a linear classification model is in terms of dimensionality reduction. Consider first the case of two classes, and suppose we take the D-dimensional input vector x

207 4.1. Discriminant Functions

Figure 4.5  Example of a synthetic data set comprising three classes, with training data points denoted in red (×), green (+), and blue (◦). Lines denote the decision boundaries, and the background colours denote the respective classes of the decision regions. On the left is the result of using a least-squares discriminant. We see that the region of input space assigned to the green class is too small and so most of the points from this class are misclassified. On the right is the result of using logistic regression as described in Section 4.3.2, showing correct classification of the training data.

and project it down to one dimension using

    y = w^T x.    (4.20)

If we place a threshold on y and classify y ≥ −w_0 as class C_1, and otherwise class C_2, then we obtain our standard linear classifier discussed in the previous section. In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension. However, by adjusting the components of the weight vector w, we can select a projection that maximizes the class separation. To begin with, consider a two-class problem in which there are N_1 points of class C_1 and N_2 points of class C_2, so that the mean vectors of the two classes are given by

    m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n, \qquad m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n.    (4.21)

The simplest measure of the separation of the classes, when projected onto w, is the separation of the projected class means. This suggests that we might choose w so as to maximize

    m_2 - m_1 = w^T (m_2 - m_1)    (4.22)

where

    m_k = w^T m_k    (4.23)

208 4. LINEAR MODELS FOR CLASSIFICATION

Figure 4.6  The left plot shows samples from two classes (depicted in red and blue) along with the histograms resulting from projection onto the line joining the class means. Note that there is considerable class overlap in the projected space. The right plot shows the corresponding projection based on the Fisher linear discriminant, showing the greatly improved class separation.

is the mean of the projected data from class C_k. However, this expression can be made arbitrarily large simply by increasing the magnitude of w. To solve this problem, we could constrain w to have unit length, so that Σ_i w_i² = 1. Using a Lagrange multiplier to perform the constrained maximization (Appendix E), we then find that w ∝ (m_2 − m_1) (Exercise 4.4). There is still a problem with this approach, however, as illustrated in Figure 4.6. This shows two classes that are well separated in the original two-dimensional space (x_1, x_2) but that have considerable overlap when projected onto the line joining their means. This difficulty arises from the strongly nondiagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

The projection formula (4.20) transforms the set of labelled data points in x into a labelled set in the one-dimensional space y. The within-class variance of the transformed data from class C_k is therefore given by

    s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2    (4.24)

where y_n = w^T x_n. We can define the total within-class variance for the whole data set to be simply s_1² + s_2². The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by

    J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}.    (4.25)

We can make the dependence on w explicit by using (4.20), (4.23), and (4.24) to rewrite the Fisher criterion in the form (Exercise 4.5)

209 4.1. Discriminant Functions

    J(w) = \frac{w^T S_B w}{w^T S_W w}    (4.26)

where S_B is the between-class covariance matrix and is given by

    S_B = (m_2 - m_1)(m_2 - m_1)^T    (4.27)

and S_W is the total within-class covariance matrix, given by

    S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T.    (4.28)

Differentiating (4.26) with respect to w, we find that J(w) is maximized when

    (w^T S_B w)\, S_W w = (w^T S_W w)\, S_B w.    (4.29)

From (4.27), we see that S_B w is always in the direction of (m_2 − m_1). Furthermore, we do not care about the magnitude of w, only its direction, and so we can drop the scalar factors (w^T S_B w) and (w^T S_W w). Multiplying both sides of (4.29) by S_W^{-1} we then obtain

    w \propto S_W^{-1}(m_2 - m_1).    (4.30)

Note that if the within-class covariance is isotropic, so that S_W is proportional to the unit matrix, we find that w is proportional to the difference of the class means, as discussed above.

The result (4.30) is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projection of the data down to one dimension. However, the projected data can subsequently be used to construct a discriminant, by choosing a threshold y_0 so that we classify a new point as belonging to C_1 if y(x) ≥ y_0 and classify it as belonging to C_2 otherwise. For example, we can model the class-conditional densities p(y|C_k) using Gaussian distributions and then use the techniques of Section 1.2.4 to find the parameters of the Gaussian distributions by maximum likelihood. Having found Gaussian approximations to the projected classes, the formalism of Section 1.5.1 then gives an expression for the optimal threshold. Some justification for the Gaussian assumption comes from the central limit theorem by noting that y = w^T x is the sum of a set of random variables.

4.1.5 Relation to least squares

The least-squares approach to the determination of a linear discriminant was based on the goal of making the model predictions as close as possible to a set of target values. By contrast, the Fisher criterion was derived by requiring maximum class separation in the output space. It is interesting to see the relationship between these two approaches. In particular, we shall show that, for the two-class problem, the Fisher criterion can be obtained as a special case of least squares.

So far we have considered 1-of-K coding for the target values. If, however, we adopt a slightly different target coding scheme, then the least-squares solution for

210 4. LINEAR MODELS FOR CLASSIFICATION

the weights becomes equivalent to the Fisher solution (Duda and Hart, 1973). In particular, we shall take the targets for class C_1 to be N/N_1, where N_1 is the number of patterns in class C_1, and N is the total number of patterns. This target value approximates the reciprocal of the prior probability for class C_1. For class C_2, we shall take the targets to be −N/N_2, where N_2 is the number of patterns in class C_2.

The sum-of-squares error function can be written

    E = \frac{1}{2} \sum_{n=1}^{N} \left( w^T x_n + w_0 - t_n \right)^2.    (4.31)

Setting the derivatives of E with respect to w_0 and w to zero, we obtain respectively

    \sum_{n=1}^{N} \left( w^T x_n + w_0 - t_n \right) = 0    (4.32)

    \sum_{n=1}^{N} \left( w^T x_n + w_0 - t_n \right) x_n = 0.    (4.33)

From (4.32), and making use of our choice of target coding scheme for the t_n, we obtain an expression for the bias in the form

    w_0 = -w^T m    (4.34)

where we have used

    \sum_{n=1}^{N} t_n = N_1 \frac{N}{N_1} - N_2 \frac{N}{N_2} = 0    (4.35)

and where m is the mean of the total data set and is given by

    m = \frac{1}{N} \sum_{n=1}^{N} x_n = \frac{1}{N}(N_1 m_1 + N_2 m_2).    (4.36)

After some straightforward algebra, and again making use of the choice of t_n, the second equation (4.33) becomes (Exercise 4.6)

    \left( S_W + \frac{N_1 N_2}{N} S_B \right) w = N (m_1 - m_2)    (4.37)

where S_W is defined by (4.28), S_B is defined by (4.27), and we have substituted for the bias using (4.34). Using (4.27), we note that S_B w is always in the direction of (m_2 − m_1). Thus we can write

    w \propto S_W^{-1}(m_2 - m_1)    (4.38)

where we have ignored irrelevant scale factors. Thus the weight vector coincides with that found from the Fisher criterion. In addition, we have also found an expression for the bias value w_0 given by (4.34). This tells us that a new vector x should be classified as belonging to class C_1 if y(x) = w^T(x − m) > 0 and class C_2 otherwise.
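The two-class Fisher direction of (4.30) can be computed in a few lines. The following is a minimal sketch (Python/NumPy; the function name is ours) that builds S_W from (4.28) and solves for w; we then place a threshold on the projection, here chosen (as one simple option, not the optimal threshold discussed above) at the midpoint of the projected class means.

    import numpy as np

    def fisher_direction(X1, X2):
        """Fisher's linear discriminant direction, eq. (4.30): w ∝ S_W^{-1}(m2 - m1).

        X1, X2 : arrays of shape (N1, D) and (N2, D), one per class.
        """
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # Total within-class covariance, eq. (4.28)
        S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
        return np.linalg.solve(S_W, m2 - m1)

    # Example threshold: classify x by comparing w @ x against the midpoint
    # 0.5 * (w @ m1 + w @ m2) of the two projected class means.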

211 4.1. Discriminant Functions

4.1.6 Fisher's discriminant for multiple classes

We now consider the generalization of the Fisher discriminant to K > 2 classes, and we shall assume that the dimensionality D of the input space is greater than the number K of classes. Next, we introduce D′ > 1 linear 'features' y_k = w_k^T x, where k = 1, ..., D′. These feature values can conveniently be grouped together to form a vector y. Similarly, the weight vectors {w_k} can be considered to be the columns of a matrix W, so that

    y = W^T x.    (4.39)

Note that again we are not including any bias parameters in the definition of y. The generalization of the within-class covariance matrix to the case of K classes follows from (4.28) to give

    S_W = \sum_{k=1}^{K} S_k    (4.40)

where

    S_k = \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^T    (4.41)

    m_k = \frac{1}{N_k} \sum_{n \in C_k} x_n    (4.42)

and N_k is the number of patterns in class C_k. In order to find a generalization of the between-class covariance matrix, we follow Duda and Hart (1973) and consider first the total covariance matrix

    S_T = \sum_{n=1}^{N} (x_n - m)(x_n - m)^T    (4.43)

where m is the mean of the total data set

    m = \frac{1}{N} \sum_{n=1}^{N} x_n = \frac{1}{N} \sum_{k=1}^{K} N_k m_k    (4.44)

and N = Σ_k N_k is the total number of data points. The total covariance matrix can be decomposed into the sum of the within-class covariance matrix, given by (4.40) and (4.41), plus an additional matrix S_B, which we identify as a measure of the between-class covariance

    S_T = S_W + S_B    (4.45)

where

    S_B = \sum_{k=1}^{K} N_k (m_k - m)(m_k - m)^T.    (4.46)
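The matrices (4.40)–(4.46) are easy to assemble directly. Here is a minimal sketch (Python/NumPy; names ours) computing S_W and S_B from labelled data; the criterion discussed next combines these two matrices, and its maximizing directions turn out to be eigenvectors of S_W^{-1} S_B.

    import numpy as np

    def scatter_matrices(X, labels, K):
        """Within- and between-class covariance matrices, eqs. (4.40)-(4.46).

        X      : (N, D) data matrix
        labels : (N,) integer class labels in {0, ..., K-1}
        """
        D = X.shape[1]
        m = X.mean(axis=0)                        # total mean, (4.44)
        S_W = np.zeros((D, D))
        S_B = np.zeros((D, D))
        for k in range(K):
            X_k = X[labels == k]
            m_k = X_k.mean(axis=0)                # class mean, (4.42)
            S_W += (X_k - m_k).T @ (X_k - m_k)    # (4.40), (4.41)
            d = (m_k - m)[:, None]
            S_B += len(X_k) * (d @ d.T)           # (4.46)
        return S_W, S_B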

212 4. LINEAR MODELS FOR CLASSIFICATION

These covariance matrices have been defined in the original x-space. We can now define similar matrices in the projected D′-dimensional y-space

    s_W = \sum_{k=1}^{K} \sum_{n \in C_k} (y_n - \mu_k)(y_n - \mu_k)^T    (4.47)

and

    s_B = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^T    (4.48)

where

    \mu_k = \frac{1}{N_k} \sum_{n \in C_k} y_n, \qquad \mu = \frac{1}{N} \sum_{k=1}^{K} N_k \mu_k.    (4.49)

Again we wish to construct a scalar that is large when the between-class covariance is large and when the within-class covariance is small. There are now many possible choices of criterion (Fukunaga, 1990). One example is given by

    J(W) = \mathrm{Tr}\left\{ s_W^{-1} s_B \right\}.    (4.50)

This criterion can then be rewritten as an explicit function of the projection matrix W in the form

    J(W) = \mathrm{Tr}\left\{ (W^T S_W W)^{-1} (W^T S_B W) \right\}.    (4.51)

Maximization of such criteria is straightforward, though somewhat involved, and is discussed at length in Fukunaga (1990). The weight values are determined by those eigenvectors of S_W^{-1} S_B that correspond to the D′ largest eigenvalues.

There is one important result that is common to all such criteria, which is worth emphasizing. We first note from (4.46) that S_B is composed of the sum of K matrices, each of which is an outer product of two vectors and therefore of rank 1. In addition, only (K − 1) of these matrices are independent as a result of the constraint (4.44). Thus, S_B has rank at most equal to (K − 1) and so there are at most (K − 1) nonzero eigenvalues. This shows that the projection onto the (K − 1)-dimensional subspace spanned by the eigenvectors of S_B does not alter the value of J(W), and so we are therefore unable to find more than (K − 1) linear 'features' by this means (Fukunaga, 1990).

4.1.7 The perceptron algorithm

Another example of a linear discriminant model is the perceptron of Rosenblatt (1962), which occupies an important place in the history of pattern recognition algorithms. It corresponds to a two-class model in which the input vector x is first transformed using a fixed nonlinear transformation to give a feature vector φ(x), and this is then used to construct a generalized linear model of the form

    y(x) = f\left( w^T \phi(x) \right)    (4.52)

213 4.1. Discriminant Functions

where the nonlinear activation function f(·) is given by a step function of the form

    f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0. \end{cases}    (4.53)

The vector φ(x) will typically include a bias component φ_0(x) = 1. In earlier discussions of two-class classification problems, we have focussed on a target coding scheme in which t ∈ {0, 1}, which is appropriate in the context of probabilistic models. For the perceptron, however, it is more convenient to use target values t = +1 for class C_1 and t = −1 for class C_2, which matches the choice of activation function.

The algorithm used to determine the parameters w of the perceptron can most easily be motivated by error function minimization. A natural choice of error function would be the total number of misclassified patterns. However, this does not lead to a simple learning algorithm because the error is a piecewise constant function of w, with discontinuities wherever a change in w causes the decision boundary to move across one of the data points. Methods based on changing w using the gradient of the error function cannot then be applied, because the gradient is zero almost everywhere.

We therefore consider an alternative error function known as the perceptron criterion. To derive this, we note that we are seeking a weight vector w such that patterns x_n in class C_1 will have w^T φ(x_n) > 0, whereas patterns x_n in class C_2 have w^T φ(x_n) < 0. Using the t ∈ {−1, +1} target coding scheme it follows that we would like all patterns to satisfy w^T φ(x_n) t_n > 0. The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern x_n it tries to minimize the quantity −w^T φ(x_n) t_n. The perceptron criterion is therefore given by

    E_P(w) = -\sum_{n \in \mathcal{M}} w^T \phi_n t_n    (4.54)

Frank Rosenblatt (1928–1969)

Rosenblatt's perceptron played an important role in the history of machine learning. Initially, Rosenblatt simulated the perceptron on an IBM 704 computer at Cornell in 1957, but by the early 1960s he had built special-purpose hardware that provided a direct, parallel implementation of perceptron learning. Many of his ideas were encapsulated in "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" published in 1962. Rosenblatt's work was criticized by Marvin Minsky, whose objections were published in the book "Perceptrons", co-authored with Seymour Papert. This book was widely misinterpreted at the time as showing that neural networks were fatally flawed and could only learn solutions for linearly separable problems. In fact, it only proved such limitations in the case of single-layer networks such as the perceptron and merely conjectured (incorrectly) that they applied to more general network models. Unfortunately, however, this book contributed to the substantial decline in research funding for neural computing, a situation that was not reversed until the mid-1980s. Today, there are many hundreds, if not thousands, of applications of neural networks in widespread use, with examples in areas such as handwriting recognition and information retrieval being used routinely by millions of people.

214 4. LINEAR MODELS FOR CLASSIFICATION

where M denotes the set of all misclassified patterns. The contribution to the error associated with a particular misclassified pattern is a linear function of w in regions of w space where the pattern is misclassified, and zero in regions where it is correctly classified. The total error function is therefore piecewise linear.

We now apply the stochastic gradient descent algorithm (Section 3.1.3) to this error function. The change in the weight vector w is then given by

    w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta\, \phi_n t_n    (4.55)

where η is the learning rate parameter and τ is an integer that indexes the steps of the algorithm. Because the perceptron function y(x, w) is unchanged if we multiply w by a constant, we can set the learning rate parameter η equal to 1 without loss of generality. Note that, as the weight vector evolves during training, the set of patterns that are misclassified will change.

The perceptron learning algorithm has a simple interpretation, as follows. We cycle through the training patterns in turn, and for each pattern x_n we evaluate the perceptron function (4.52). If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class C_1 we add the vector φ(x_n) onto the current estimate of the weight vector w, while for class C_2 we subtract the vector φ(x_n) from w. The perceptron learning algorithm is illustrated in Figure 4.7.

If we consider the effect of a single update in the perceptron learning algorithm, we see that the contribution to the error from a misclassified pattern will be reduced because from (4.55) we have

    -w^{(\tau+1)T} \phi_n t_n = -w^{(\tau)T} \phi_n t_n - (\phi_n t_n)^T \phi_n t_n < -w^{(\tau)T} \phi_n t_n    (4.56)

where we have set η = 1, and made use of ‖φ_n t_n‖² > 0. Of course, this does not imply that the contribution to the error function from the other misclassified patterns will have been reduced. Furthermore, the change in weight vector may have caused some previously correctly classified patterns to become misclassified. Thus the perceptron learning rule is not guaranteed to reduce the total error function at each stage.

However, the perceptron convergence theorem states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved, we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge.

Even when the data set is linearly separable, there may be many solutions, and which one is found will depend on the initialization of the parameters and on the order of presentation of the data points. Furthermore, for data sets that are not linearly separable, the perceptron learning algorithm will never converge.
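The cyclic update rule (4.55) is short enough to state in full. Below is a minimal sketch (Python/NumPy; names ours) that assumes precomputed features φ(x_n), for instance the inputs with a leading bias column, and targets t ∈ {−1, +1}. The epoch cap reflects the point made above: on nonseparable data the algorithm would otherwise cycle forever.

    import numpy as np

    def perceptron(Phi, t, max_epochs=100):
        """Perceptron learning, eq. (4.55) with eta = 1.

        Phi : (N, M) feature matrix;  t : (N,) targets in {-1, +1}
        """
        N, M = Phi.shape
        w = np.zeros(M)
        for _ in range(max_epochs):
            updated = False
            for n in range(N):                  # cycle through patterns in turn
                if t[n] * (w @ Phi[n]) <= 0:    # misclassified pattern
                    w = w + Phi[n] * t[n]       # add or subtract phi_n, eq. (4.55)
                    updated = True
            if not updated:                     # all patterns correctly classified
                break
        return w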

215 4.1. Discriminant Functions

Figure 4.7  Illustration of the convergence of the perceptron learning algorithm, showing data points from two classes (red and blue) in a two-dimensional feature space (φ_1, φ_2). The top left plot shows the initial parameter vector w shown as a black arrow together with the corresponding decision boundary (black line), in which the arrow points towards the decision region which is classified as belonging to the red class. The data point circled in green is misclassified and so its feature vector is added to the current weight vector, giving the new decision boundary shown in the top right plot. The bottom left plot shows the next misclassified point to be considered, indicated by the green circle, and its feature vector is again added to the weight vector, giving the decision boundary shown in the bottom right plot, for which all data points are correctly classified.

216 Figure 4.8 Illustration of the Mark 1 perceptron hardware. The photograph on the left shows how the inputs were obtained using a simple camera system in which an input scene, in this case a printed character, was illuminated by powerful lights, and an image focussed onto a $20 \times 20$ array of cadmium sulphide photocells, giving a primitive 400 pixel image. The perceptron also had a patch board, shown in the middle photograph, which allowed different configurations of input features to be tried. Often these were wired up at random to demonstrate the ability of the perceptron to learn without the need for precise wiring, in contrast to a modern digital computer. The photograph on the right shows one of the racks of adaptive weights. Each weight was implemented using a rotary variable resistor, also called a potentiometer, driven by an electric motor, thereby allowing the value of the weight to be adjusted automatically by the learning algorithm.

Aside from difficulties with the learning algorithm, the perceptron does not provide probabilistic outputs, nor does it generalize readily to $K > 2$ classes. The most important limitation, however, arises from the fact that (in common with all of the models discussed in this chapter and the previous one) it is based on linear combinations of fixed basis functions. More detailed discussions of the limitations of perceptrons can be found in Minsky and Papert (1969) and Bishop (1995a).

Analogue hardware implementations of the perceptron were built by Rosenblatt, based on motor-driven variable resistors to implement the adaptive parameters $w_j$. These are illustrated in Figure 4.8. The inputs were obtained from a simple camera system based on an array of photo-sensors, while the basis functions $\boldsymbol{\phi}$ could be chosen in a variety of ways, for example based on simple fixed functions of randomly chosen subsets of pixels from the input image. Typical applications involved learning to discriminate simple shapes or characters.

At the same time that the perceptron was being developed, a closely related system called the adaline, which is short for 'adaptive linear element', was being explored by Widrow and co-workers. The functional form of the model was the same as for the perceptron, but a different approach to training was adopted (Widrow and Hoff, 1960; Widrow and Lehr, 1990).

4.2. Probabilistic Generative Models

We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data. In Section 1.5.4, we discussed the distinction between the discriminative and the generative approaches to classification. Here we shall adopt a generative approach in which we model the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$, as well as the class priors $p(\mathcal{C}_k)$, and then use these to compute posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ through Bayes' theorem.

217 Figure 4.9 Plot of the logistic sigmoid function $\sigma(a)$ defined by (4.59), shown in red, together with the scaled probit function $\Phi(\lambda a)$, for $\lambda^2 = \pi/8$, shown in dashed blue, where $\Phi(a)$ is defined by (4.114). The scaling factor $\pi/8$ is chosen so that the derivatives of the two curves are equal for $a = 0$.

Consider first of all the case of two classes. The posterior probability for class $\mathcal{C}_1$ can be written as

$$p(\mathcal{C}_1|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_1) p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_1) p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2) p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a) \qquad (4.57)$$

where we have defined

$$a = \ln \frac{p(\mathbf{x}|\mathcal{C}_1) p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2) p(\mathcal{C}_2)} \qquad (4.58)$$

and $\sigma(a)$ is the logistic sigmoid function defined by

$$\sigma(a) = \frac{1}{1 + \exp(-a)} \qquad (4.59)$$

which is plotted in Figure 4.9. The term 'sigmoid' means S-shaped. This type of function is sometimes also called a 'squashing function' because it maps the whole real axis into a finite interval. The logistic sigmoid has been encountered already in earlier chapters and plays an important role in many classification algorithms. It satisfies the following symmetry property

$$\sigma(-a) = 1 - \sigma(a) \qquad (4.60)$$

as is easily verified. The inverse of the logistic sigmoid is given by

$$a = \ln \left( \frac{\sigma}{1 - \sigma} \right) \qquad (4.61)$$

and is known as the logit function. It represents the log of the ratio of probabilities $\ln \left[ p(\mathcal{C}_1|\mathbf{x}) / p(\mathcal{C}_2|\mathbf{x}) \right]$ for the two classes, also known as the log odds.
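The sigmoid, its symmetry property (4.60), and the logit inverse (4.61) are easy to check numerically; the following small sketch does so (the function names are illustrative, not standard library API).

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (4.59)."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(sigma):
    """Inverse sigmoid (4.61), i.e. the log odds."""
    return np.log(sigma / (1.0 - sigma))

a = np.linspace(-5.0, 5.0, 11)
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))   # symmetry (4.60)
assert np.allclose(logit(sigmoid(a)), a)            # (4.61) inverts (4.59)
```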

218 Note that in (4.57) we have simply rewritten the posterior probabilities in an equivalent form, and so the appearance of the logistic sigmoid may seem rather vacuous. However, it will have significance provided $a(\mathbf{x})$ takes a simple functional form. We shall shortly consider situations in which $a(\mathbf{x})$ is a linear function of $\mathbf{x}$, in which case the posterior probability is governed by a generalized linear model.

For the case of $K > 2$ classes, we have

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k) p(\mathcal{C}_k)}{\sum_j p(\mathbf{x}|\mathcal{C}_j) p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} \qquad (4.62)$$

which is known as the normalized exponential and can be regarded as a multiclass generalization of the logistic sigmoid. Here the quantities $a_k$ are defined by

$$a_k = \ln \left[ p(\mathbf{x}|\mathcal{C}_k) p(\mathcal{C}_k) \right]. \qquad (4.63)$$

The normalized exponential is also known as the softmax function, as it represents a smoothed version of the 'max' function because, if $a_k \gg a_j$ for all $j \neq k$, then $p(\mathcal{C}_k|\mathbf{x}) \simeq 1$ and $p(\mathcal{C}_j|\mathbf{x}) \simeq 0$.

We now investigate the consequences of choosing specific forms for the class-conditional densities, looking first at continuous input variables $\mathbf{x}$ and then discussing briefly the case of discrete inputs.

4.2.1 Continuous inputs

Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. Thus the density for class $\mathcal{C}_k$ is given by

$$p(\mathbf{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp \left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}. \qquad (4.64)$$

Consider first the case of two classes. From (4.57) and (4.58), we have

$$p(\mathcal{C}_1|\mathbf{x}) = \sigma(\mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0) \qquad (4.65)$$

where we have defined

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \qquad (4.66)$$

$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}. \qquad (4.67)$$

We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled (due to the assumption of common covariance matrices), leading to a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid. This result is illustrated for the case of a two-dimensional input space $\mathbf{x}$ in Figure 4.10.
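Because the normalized exponential (4.62) is unchanged if a common constant is subtracted from all of the activations $a_k$, a practical implementation usually shifts by $\max_k a_k$ to avoid overflow in the exponentials. A minimal sketch (the shift is an implementation detail, not part of the definition):

```python
import numpy as np

def softmax(a):
    """Normalized exponential (4.62). Subtracting max(a) leaves the result
    unchanged but prevents overflow in the exponentials."""
    e = np.exp(a - np.max(a))
    return e / e.sum()
```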

219 Figure 4.10 The left-hand plot shows the class-conditional densities for two classes, denoted red and blue. On the right is the corresponding posterior probability $p(\mathcal{C}_1|\mathbf{x})$, which is given by a logistic sigmoid of a linear function of $\mathbf{x}$. The surface in the right-hand plot is coloured using a proportion of red ink given by $p(\mathcal{C}_1|\mathbf{x})$ and a proportion of blue ink given by $p(\mathcal{C}_2|\mathbf{x}) = 1 - p(\mathcal{C}_1|\mathbf{x})$.

The resulting decision boundaries correspond to surfaces along which the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ are constant and so will be given by linear functions of $\mathbf{x}$; therefore the decision boundaries are linear in input space. The prior probabilities $p(\mathcal{C}_k)$ enter only through the bias parameter $w_0$, so that changes in the priors have the effect of making parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability.

For the general case of $K$ classes we have, from (4.62) and (4.63),

$$a_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}} \mathbf{x} + w_{k0} \qquad (4.68)$$

where we have defined

$$\mathbf{w}_k = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k \qquad (4.69)$$

$$w_{k0} = -\frac{1}{2} \boldsymbol{\mu}_k^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln p(\mathcal{C}_k). \qquad (4.70)$$

We see that the $a_k(\mathbf{x})$ are again linear functions of $\mathbf{x}$ as a consequence of the cancellation of the quadratic terms due to the shared covariances. The resulting decision boundaries, corresponding to the minimum misclassification rate, will occur when two of the posterior probabilities (the two largest) are equal, and so will be defined by linear functions of $\mathbf{x}$, and so again we have a generalized linear model.

If we relax the assumption of a shared covariance matrix and allow each class-conditional density $p(\mathbf{x}|\mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_k$, then the earlier cancellations will no longer occur, and we will obtain quadratic functions of $\mathbf{x}$, giving rise to a quadratic discriminant. The linear and quadratic decision boundaries are illustrated in Figure 4.11.
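The parameters (4.69) and (4.70) can be assembled directly from the class means, the shared covariance, and the priors. The sketch below (function and variable names are illustrative) returns them in matrix form; the posterior probabilities then follow from (4.62) as the softmax of $\mathbf{W}\mathbf{x} + \mathbf{w}_0$.

```python
import numpy as np

def shared_covariance_params(mu, Sigma, priors):
    """Parameters (4.69)-(4.70) of the linear activations a_k(x) = w_k^T x + w_k0
    for Gaussian class-conditionals sharing the covariance matrix Sigma.

    mu : (K, D) class means; priors : (K,) prior probabilities p(C_k).
    """
    Sigma_inv = np.linalg.inv(Sigma)
    W = mu @ Sigma_inv                                        # rows are w_k, (4.69)
    w0 = -0.5 * np.einsum('kd,de,ke->k', mu, Sigma_inv, mu) \
         + np.log(priors)                                     # (4.70)
    return W, w0
```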

220 Figure 4.11 The left-hand plot shows the class-conditional densities for three classes, each having a Gaussian distribution, coloured red, green, and blue, in which the red and green classes have the same covariance matrix. The right-hand plot shows the corresponding posterior probabilities, in which the RGB colour vector represents the posterior probabilities for the respective three classes. The decision boundaries are also shown. Notice that the boundary between the red and green classes, which have the same covariance matrix, is linear, whereas those between the other pairs of classes are quadratic.

4.2.2 Maximum likelihood solution

Once we have specified a parametric functional form for the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$, we can then determine the values of the parameters, together with the prior class probabilities $p(\mathcal{C}_k)$, using maximum likelihood. This requires a data set comprising observations of $\mathbf{x}$ along with their corresponding class labels.

Consider first the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a data set $\{\mathbf{x}_n, t_n\}$ where $n = 1, \ldots, N$. Here $t_n = 1$ denotes class $\mathcal{C}_1$ and $t_n = 0$ denotes class $\mathcal{C}_2$. We denote the prior class probability $p(\mathcal{C}_1) = \pi$, so that $p(\mathcal{C}_2) = 1 - \pi$. For a data point $\mathbf{x}_n$ from class $\mathcal{C}_1$, we have $t_n = 1$ and hence

$$p(\mathbf{x}_n, \mathcal{C}_1) = p(\mathcal{C}_1) p(\mathbf{x}_n|\mathcal{C}_1) = \pi \, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma}).$$

Similarly for class $\mathcal{C}_2$, we have $t_n = 0$ and hence

$$p(\mathbf{x}_n, \mathcal{C}_2) = p(\mathcal{C}_2) p(\mathbf{x}_n|\mathcal{C}_2) = (1 - \pi) \, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma}).$$

Thus the likelihood function is given by

$$p(\mathbf{t}|\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \left[ \pi \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma}) \right]^{t_n} \left[ (1 - \pi) \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma}) \right]^{1 - t_n} \qquad (4.71)$$

where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. As usual, it is convenient to maximize the log of the likelihood function. Consider first the maximization with respect to $\pi$.

221 The terms in the log likelihood function that depend on $\pi$ are

$$\sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\}. \qquad (4.72)$$

Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain

$$\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} \qquad (4.73)$$

where $N_1$ denotes the total number of data points in class $\mathcal{C}_1$, and $N_2$ denotes the total number of data points in class $\mathcal{C}_2$. Thus the maximum likelihood estimate for $\pi$ is simply the fraction of points in class $\mathcal{C}_1$, as expected. This result is easily generalized to the multiclass case, where again the maximum likelihood estimate of the prior probability associated with class $\mathcal{C}_k$ is given by the fraction of the training set points assigned to that class (Exercise 4.9).

Now consider the maximization with respect to $\boldsymbol{\mu}_1$. Again we can pick out of the log likelihood function those terms that depend on $\boldsymbol{\mu}_1$, giving

$$\sum_{n=1}^{N} t_n \ln \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = -\frac{1}{2} \sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) + \text{const}. \qquad (4.74)$$

Setting the derivative with respect to $\boldsymbol{\mu}_1$ to zero and rearranging, we obtain

$$\boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n \mathbf{x}_n \qquad (4.75)$$

which is simply the mean of all the input vectors $\mathbf{x}_n$ assigned to class $\mathcal{C}_1$. By a similar argument, the corresponding result for $\boldsymbol{\mu}_2$ is given by

$$\boldsymbol{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) \mathbf{x}_n \qquad (4.76)$$

which again is the mean of all the input vectors $\mathbf{x}_n$ assigned to class $\mathcal{C}_2$.

Finally, consider the maximum likelihood solution for the shared covariance matrix $\boldsymbol{\Sigma}$. Picking out the terms in the log likelihood function that depend on $\boldsymbol{\Sigma}$, we have

$$-\frac{1}{2} \sum_{n=1}^{N} t_n \ln |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1)$$
$$-\frac{1}{2} \sum_{n=1}^{N} (1 - t_n) \ln |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) (\mathbf{x}_n - \boldsymbol{\mu}_2)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_2)$$
$$= -\frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{N}{2} \operatorname{Tr} \left\{ \boldsymbol{\Sigma}^{-1} \mathbf{S} \right\} \qquad (4.77)$$

where we have defined

$$\mathbf{S} = \frac{N_1}{N} \mathbf{S}_1 + \frac{N_2}{N} \mathbf{S}_2 \qquad (4.78)$$

$$\mathbf{S}_1 = \frac{1}{N_1} \sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \boldsymbol{\mu}_1)(\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \qquad (4.79)$$

$$\mathbf{S}_2 = \frac{1}{N_2} \sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \boldsymbol{\mu}_2)(\mathbf{x}_n - \boldsymbol{\mu}_2)^{\mathrm{T}}. \qquad (4.80)$$

Using the standard result for the maximum likelihood solution for a Gaussian distribution, we see that $\boldsymbol{\Sigma} = \mathbf{S}$, which represents a weighted average of the covariance matrices associated with each of the two classes separately.
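All of these maximum likelihood estimates are closed-form averages, so the whole fit is a few lines of NumPy. The following sketch assumes targets coded as $t_n \in \{0, 1\}$ as above (the function name is illustrative):

```python
import numpy as np

def fit_two_class_gaussian(X, t):
    """Maximum likelihood fit of the two-class shared-covariance model:
    pi from (4.73), the means from (4.75)-(4.76), and Sigma = S of (4.78)-(4.80).

    X : (N, D) inputs; t : (N,) targets with t_n = 1 for C1 and t_n = 0 for C2.
    """
    N = len(t)
    N1 = t.sum()
    pi = N1 / N                              # (4.73)
    mu1 = X[t == 1].mean(axis=0)             # (4.75)
    mu2 = X[t == 0].mean(axis=0)             # (4.76)
    D1 = X[t == 1] - mu1
    D2 = X[t == 0] - mu2
    Sigma = (D1.T @ D1 + D2.T @ D2) / N      # (N1*S1 + N2*S2)/N, (4.78)
    return pi, mu1, mu2, Sigma
```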

222 This result is easily extended to the $K$ class problem to obtain the corresponding maximum likelihood solutions for the parameters in which each class-conditional density is Gaussian with a shared covariance matrix (Exercise 4.10). Note that the approach of fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust (Section 2.3.7).

4.2.3 Discrete features

Let us now consider the case of discrete feature values $x_i$. For simplicity, we begin by looking at binary feature values $x_i \in \{0, 1\}$ and discuss the extension to more general discrete features shortly. If there are $D$ inputs, then a general distribution would correspond to a table of $2^D$ numbers for each class, containing $2^D - 1$ independent variables (due to the summation constraint). Because this grows exponentially with the number of features, we might seek a more restricted representation. Here we will make the naive Bayes assumption (Section 8.2.2) in which the feature values are treated as independent, conditioned on the class $\mathcal{C}_k$. Thus we have class-conditional distributions of the form

$$p(\mathbf{x}|\mathcal{C}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} \qquad (4.81)$$

which contain $D$ independent parameters for each class. Substituting into (4.63) then gives

$$a_k(\mathbf{x}) = \sum_{i=1}^{D} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(\mathcal{C}_k) \qquad (4.82)$$

which again are linear functions of the input values $x_i$. For the case of $K = 2$ classes, we can alternatively consider the logistic sigmoid formulation given by (4.57). Analogous results are obtained for discrete variables each of which can take $M > 2$ states (Exercise 4.11).
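The linearity of (4.82) in the binary inputs is immediate in code. A minimal sketch, assuming all parameters satisfy $0 < \mu_{ki} < 1$ so the logarithms are finite (the function name is illustrative):

```python
import numpy as np

def naive_bayes_activations(x, mu, priors):
    """Linear activations a_k(x) of (4.82) for binary features under the
    naive Bayes class-conditionals (4.81).

    x : (D,) binary feature vector; mu : (K, D) with mu[k, i] = mu_ki;
    priors : (K,) prior probabilities p(C_k). The posterior p(C_k|x) is
    the softmax (4.62) of the returned activations.
    """
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
```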

223 4.2.4 Exponential family

As we have seen, for both Gaussian distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid ($K = 2$ classes) or softmax ($K \geq 2$ classes) activation functions. These are particular cases of a more general result obtained by assuming that the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$ are members of the exponential family of distributions.

Using the form (2.194) for members of the exponential family, we see that the distribution of $\mathbf{x}$ can be written in the form

$$p(\mathbf{x}|\boldsymbol{\lambda}_k) = h(\mathbf{x}) g(\boldsymbol{\lambda}_k) \exp \left\{ \boldsymbol{\lambda}_k^{\mathrm{T}} \mathbf{u}(\mathbf{x}) \right\}. \qquad (4.83)$$

We now restrict attention to the subclass of such distributions for which $\mathbf{u}(\mathbf{x}) = \mathbf{x}$. Then we make use of (2.236) to introduce a scaling parameter $s$, so that we obtain the restricted set of exponential family class-conditional densities of the form

$$p(\mathbf{x}|\boldsymbol{\lambda}_k, s) = \frac{1}{s} h\left(\frac{1}{s} \mathbf{x}\right) g(\boldsymbol{\lambda}_k) \exp \left\{ \frac{1}{s} \boldsymbol{\lambda}_k^{\mathrm{T}} \mathbf{x} \right\}. \qquad (4.84)$$

Note that we are allowing each class to have its own parameter vector $\boldsymbol{\lambda}_k$, but we are assuming that the classes share the same scale parameter $s$.

For the two-class problem, we substitute this expression for the class-conditional densities into (4.58), and we see that the posterior class probability is again given by a logistic sigmoid acting on a linear function $a(\mathbf{x})$, which is given by

$$a(\mathbf{x}) = (\boldsymbol{\lambda}_1 - \boldsymbol{\lambda}_2)^{\mathrm{T}} \mathbf{x} + \ln g(\boldsymbol{\lambda}_1) - \ln g(\boldsymbol{\lambda}_2) + \ln p(\mathcal{C}_1) - \ln p(\mathcal{C}_2). \qquad (4.85)$$

Similarly, for the $K$-class problem, we substitute the class-conditional density expression into (4.63) to give

$$a_k(\mathbf{x}) = \boldsymbol{\lambda}_k^{\mathrm{T}} \mathbf{x} + \ln g(\boldsymbol{\lambda}_k) + \ln p(\mathcal{C}_k) \qquad (4.86)$$

and so again is a linear function of $\mathbf{x}$.

4.3. Probabilistic Discriminative Models

For the two-class classification problem, we have seen that the posterior probability of class $\mathcal{C}_1$ can be written as a logistic sigmoid acting on a linear function of $\mathbf{x}$, for a wide choice of class-conditional distributions $p(\mathbf{x}|\mathcal{C}_k)$. Similarly, for the multiclass case, the posterior probability of class $\mathcal{C}_k$ is given by a softmax transformation of a linear function of $\mathbf{x}$. For specific choices of the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$, we have used maximum likelihood to determine the parameters of the densities as well as the class priors $p(\mathcal{C}_k)$, and then used Bayes' theorem to find the posterior class probabilities. However, an alternative approach is to use the functional form of the generalized linear model explicitly and to determine its parameters directly by using maximum likelihood. We shall see that there is an efficient algorithm for finding such solutions known as iterative reweighted least squares, or IRLS.

224 Figure 4.12 Illustration of the role of nonlinear basis functions in linear classification models. The left plot shows the original input space $(x_1, x_2)$ together with data points from two classes labelled red and blue. Two 'Gaussian' basis functions $\phi_1(\mathbf{x})$ and $\phi_2(\mathbf{x})$ are defined in this space with centres shown by the green crosses and with contours shown by the green circles. The right-hand plot shows the corresponding feature space $(\phi_1, \phi_2)$ together with the linear decision boundary obtained by a logistic regression model of the form discussed in Section 4.3.2. This corresponds to a nonlinear decision boundary in the original input space, shown by the black curve in the left-hand plot.

The indirect approach to finding the parameters of a generalized linear model, by fitting class-conditional densities and class priors separately and then applying Bayes' theorem, represents an example of generative modelling, because we could take such a model and generate synthetic data by drawing values of $\mathbf{x}$ from the marginal distribution $p(\mathbf{x})$. In the direct approach, we are maximizing a likelihood function defined through the conditional distribution $p(\mathcal{C}_k|\mathbf{x})$, which represents a form of discriminative training. One advantage of the discriminative approach is that there will typically be fewer adaptive parameters to be determined, as we shall see shortly. It may also lead to improved predictive performance, particularly when the class-conditional density assumptions give a poor approximation to the true distributions.

4.3.1 Fixed basis functions

So far in this chapter, we have considered classification models that work directly with the original input vector $\mathbf{x}$. However, all of the algorithms are equally applicable if we first make a fixed nonlinear transformation of the inputs using a vector of basis functions $\boldsymbol{\phi}(\mathbf{x})$. The resulting decision boundaries will be linear in the feature space $\boldsymbol{\phi}$, and these correspond to nonlinear decision boundaries in the original $\mathbf{x}$ space, as illustrated in Figure 4.12. Classes that are linearly separable in the feature space $\boldsymbol{\phi}(\mathbf{x})$ need not be linearly separable in the original observation space $\mathbf{x}$.

225 Note that, as in our discussion of linear models for regression, one of the basis functions is typically set to a constant, say $\phi_0(\mathbf{x}) = 1$, so that the corresponding parameter $w_0$ plays the role of a bias. For the remainder of this chapter, we shall include a fixed basis function transformation $\boldsymbol{\phi}(\mathbf{x})$, as this will highlight some useful similarities to the regression models discussed in Chapter 3.

For many problems of practical interest, there is significant overlap between the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$. This corresponds to posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ which, for at least some values of $\mathbf{x}$, are not 0 or 1. In such cases, the optimal solution is obtained by modelling the posterior probabilities accurately and then applying standard decision theory, as discussed in Chapter 1. Note that nonlinear transformations $\boldsymbol{\phi}(\mathbf{x})$ cannot remove such class overlap. Indeed, they can increase the level of overlap, or create overlap where none existed in the original observation space. However, suitable choices of nonlinearity can make the process of modelling the posterior probabilities easier.

Such fixed basis function models have important limitations, and these will be resolved in later chapters (Section 3.6) by allowing the basis functions themselves to adapt to the data. Notwithstanding these limitations, models with fixed nonlinear basis functions play an important role in applications, and a discussion of such models will introduce many of the key concepts needed for an understanding of their more complex counterparts.

4.3.2 Logistic regression

We begin our treatment of generalized linear models by considering the problem of two-class classification. In our discussion of generative approaches in Section 4.2, we saw that under rather general assumptions, the posterior probability of class $\mathcal{C}_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector $\boldsymbol{\phi}$, so that

$$p(\mathcal{C}_1|\boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) \qquad (4.87)$$

with $p(\mathcal{C}_2|\boldsymbol{\phi}) = 1 - p(\mathcal{C}_1|\boldsymbol{\phi})$. Here $\sigma(\cdot)$ is the logistic sigmoid function defined by (4.59). In the terminology of statistics, this model is known as logistic regression, although it should be emphasized that this is a model for classification rather than regression.

For an $M$-dimensional feature space $\boldsymbol{\phi}$, this model has $M$ adjustable parameters. By contrast, if we had fitted Gaussian class-conditional densities using maximum likelihood, we would have used $2M$ parameters for the means and $M(M+1)/2$ parameters for the (shared) covariance matrix. Together with the class prior $p(\mathcal{C}_1)$, this gives a total of $M(M+5)/2 + 1$ parameters, which grows quadratically with $M$, in contrast to the linear dependence on $M$ of the number of parameters in logistic regression. For large values of $M$, there is a clear advantage in working with the logistic regression model directly.

We now use maximum likelihood to determine the parameters of the logistic regression model. To do this, we shall make use of the derivative of the logistic sigmoid function, which can conveniently be expressed in terms of the sigmoid function itself (Exercise 4.12):

$$\frac{d\sigma}{da} = \sigma(1 - \sigma). \qquad (4.88)$$

226 For a data set $\{\boldsymbol{\phi}_n, t_n\}$, where $t_n \in \{0, 1\}$ and $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$, with $n = 1, \ldots, N$, the likelihood function can be written

$$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n} \qquad (4.89)$$

where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$ and $y_n = p(\mathcal{C}_1|\boldsymbol{\phi}_n)$. As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form

$$E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} \qquad (4.90)$$

where $y_n = \sigma(a_n)$ and $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$. Taking the gradient of the error function with respect to $\mathbf{w}$, we obtain (Exercise 4.13)

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \boldsymbol{\phi}_n \qquad (4.91)$$

where we have made use of (4.88). We see that the factor involving the derivative of the logistic sigmoid has cancelled, leading to a simplified form for the gradient of the log likelihood. In particular, the contribution to the gradient from data point $n$ is given by the 'error' $y_n - t_n$ between the target value and the prediction of the model, times the basis function vector $\boldsymbol{\phi}_n$. Furthermore, comparison with (3.13) shows that this takes precisely the same form as the gradient of the sum-of-squares error function for the linear regression model (Section 3.1.1).

If desired, we could make use of the result (4.91) to give a sequential algorithm in which patterns are presented one at a time, and the weight vector is updated using (3.22), in which $\nabla E_n$ is the $n$th term in (4.91).

It is worth noting that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to $\sigma = 0.5$, equivalent to $\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi} = 0$, separates the two classes and the magnitude of $\mathbf{w}$ goes to infinity. In this case, the logistic sigmoid function becomes infinitely steep in feature space, corresponding to a Heaviside step function, so that every training point from each class $k$ is assigned a posterior probability $p(\mathcal{C}_k|\mathbf{x}) = 1$ (Exercise 4.14). Furthermore, there is typically a continuum of such solutions because any separating hyperplane will give rise to the same posterior probabilities at the training data points, as will be seen later in Figure 10.13. Maximum likelihood provides no way to favour one such solution over another, and which solution is found in practice will depend on the choice of optimization algorithm and on the parameter initialization. Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training data set is linearly separable. The singularity can be avoided by inclusion of a prior and finding a MAP solution for $\mathbf{w}$, or equivalently by adding a regularization term to the error function.
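The sequential algorithm just mentioned is a few lines of code. The following is a minimal sketch using the per-pattern gradient term $(y_n - t_n)\boldsymbol{\phi}_n$ of (4.91); the learning rate, epoch count, and shuffling are illustrative choices of ours:

```python
import numpy as np

def fit_logistic_sequential(Phi, t, eta=0.1, n_epochs=100, seed=0):
    """Sequential maximum likelihood for logistic regression: each pattern
    contributes the gradient term (y_n - t_n) * phi_n of (4.91)."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in rng.permutation(N):
            y_n = 1.0 / (1.0 + np.exp(-(w @ Phi[n])))
            w -= eta * (y_n - t[n]) * Phi[n]
    return w
```

Note that, consistent with the over-fitting discussion above, on linearly separable data the norm of $\mathbf{w}$ grows without bound as training proceeds unless a regularization term is added.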

227 4.3.3 Iterative reweighted least squares

In the case of the linear regression models discussed in Chapter 3, the maximum likelihood solution, on the assumption of a Gaussian noise model, leads to a closed-form solution. This was a consequence of the quadratic dependence of the log likelihood function on the parameter vector $\mathbf{w}$. For logistic regression, there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. However, the departure from a quadratic form is not substantial. To be precise, the error function is convex, as we shall see shortly, and hence has a unique minimum. Furthermore, the error function can be minimized by an efficient iterative technique based on the Newton-Raphson iterative optimization scheme, which uses a local quadratic approximation to the log likelihood function. The Newton-Raphson update, for minimizing a function $E(\mathbf{w})$, takes the form (Fletcher, 1987; Bishop and Nabney, 2008)

$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1} \nabla E(\mathbf{w}) \qquad (4.92)$$

where $\mathbf{H}$ is the Hessian matrix whose elements comprise the second derivatives of $E(\mathbf{w})$ with respect to the components of $\mathbf{w}$.

Let us first of all apply the Newton-Raphson method to the linear regression model (3.3) with the sum-of-squares error function (3.12). The gradient and Hessian of this error function are given by

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n - t_n) \boldsymbol{\phi}_n = \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \mathbf{w} - \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \qquad (4.93)$$

$$\mathbf{H} = \nabla \nabla E(\mathbf{w}) = \sum_{n=1}^{N} \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathrm{T}} = \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \qquad (4.94)$$

where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix, whose $n$th row is given by $\boldsymbol{\phi}_n^{\mathrm{T}}$ (Section 3.1.1). The Newton-Raphson update then takes the form

$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi})^{-1} \left\{ \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \right\} = (\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \qquad (4.95)$$

which we recognize as the standard least-squares solution. Note that the error function in this case is quadratic and hence the Newton-Raphson formula gives the exact solution in one step.

Now let us apply the Newton-Raphson update to the cross-entropy error function (4.90) for the logistic regression model. From (4.91) we see that the gradient and Hessian of this error function are given by

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \boldsymbol{\phi}_n = \boldsymbol{\Phi}^{\mathrm{T}} (\mathbf{y} - \mathbf{t}) \qquad (4.96)$$

$$\mathbf{H} = \nabla \nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n (1 - y_n) \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathrm{T}} = \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi} \qquad (4.97)$$

where we have made use of (4.88). Also, we have introduced the $N \times N$ diagonal matrix $\mathbf{R}$ with elements

$$R_{nn} = y_n (1 - y_n). \qquad (4.98)$$

228 We see that the Hessian is no longer constant but depends on $\mathbf{w}$ through the weighting matrix $\mathbf{R}$, corresponding to the fact that the error function is no longer quadratic. Using the property $0 < y_n < 1$, which follows from the form of the logistic sigmoid function, we see that $\mathbf{u}^{\mathrm{T}} \mathbf{H} \mathbf{u} > 0$ for an arbitrary vector $\mathbf{u}$, and so the Hessian matrix $\mathbf{H}$ is positive definite. It follows that the error function is a convex function of $\mathbf{w}$ and hence has a unique minimum (Exercise 4.15).

The Newton-Raphson update formula for the logistic regression model then becomes

$$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\mathrm{T}} (\mathbf{y} - \mathbf{t})$$
$$= (\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi})^{-1} \left\{ \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi} \mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^{\mathrm{T}} (\mathbf{y} - \mathbf{t}) \right\}$$
$$= (\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \mathbf{z} \qquad (4.99)$$

where $\mathbf{z}$ is an $N$-dimensional vector with elements

$$\mathbf{z} = \boldsymbol{\Phi} \mathbf{w}^{(\text{old})} - \mathbf{R}^{-1} (\mathbf{y} - \mathbf{t}). \qquad (4.100)$$

We see that the update formula (4.99) takes the form of a set of normal equations for a weighted least-squares problem. Because the weighting matrix $\mathbf{R}$ is not constant but depends on the parameter vector $\mathbf{w}$, we must apply the normal equations iteratively, each time using the new weight vector $\mathbf{w}$ to compute a revised weighting matrix $\mathbf{R}$. For this reason, the algorithm is known as iterative reweighted least squares, or IRLS (Rubin, 1983). As in the weighted least-squares problem, the elements of the diagonal weighting matrix $\mathbf{R}$ can be interpreted as variances because the mean and variance of $t$ in the logistic regression model are given by

$$\mathbb{E}[t] = \sigma(\mathbf{x}) = y \qquad (4.101)$$

$$\operatorname{var}[t] = \mathbb{E}[t^2] - \mathbb{E}[t]^2 = \sigma(\mathbf{x}) - \sigma(\mathbf{x})^2 = y(1 - y) \qquad (4.102)$$

where we have used the property $t^2 = t$ for $t \in \{0, 1\}$. In fact, we can interpret IRLS as the solution to a linearized problem in the space of the variable $a = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}$. The quantity $z_n$, which corresponds to the $n$th element of $\mathbf{z}$, can then be given a simple interpretation as an effective target value in this space, obtained by making a local linear approximation to the logistic sigmoid function around the current operating point $\mathbf{w}^{(\text{old})}$:

$$a_n(\mathbf{w}) \simeq a_n(\mathbf{w}^{(\text{old})}) + \left. \frac{da_n}{dy_n} \right|_{\mathbf{w}^{(\text{old})}} (t_n - y_n) = \boldsymbol{\phi}_n^{\mathrm{T}} \mathbf{w}^{(\text{old})} - \frac{y_n - t_n}{y_n(1 - y_n)} = z_n. \qquad (4.103)$$
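The IRLS iteration (4.99) maps directly onto a weighted least-squares solve. A minimal sketch follows; as the preceding subsection noted, for linearly separable data the maximum likelihood weights diverge (and the entries of $\mathbf{R}$ can underflow to zero), so in practice one would add a regularizer or cap the iterations, as done here with a fixed iteration count:

```python
import numpy as np

def irls_logistic(Phi, t, n_iter=10):
    """Iterative reweighted least squares (4.99) for logistic regression.

    Phi : (N, M) design matrix; t : (N,) binary targets.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        r = y * (1.0 - y)                    # diagonal of R, (4.98)
        z = Phi @ w - (y - t) / r            # effective targets, (4.100)
        # weighted normal equations (4.99): w = (Phi^T R Phi)^{-1} Phi^T R z
        w = np.linalg.solve(Phi.T @ (r[:, None] * Phi), Phi.T @ (r * z))
    return w
```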

229 4.3.4 Multiclass logistic regression

In our discussion of generative models for multiclass classification (Section 4.2), we have seen that for a large class of distributions, the posterior probabilities are given by a softmax transformation of linear functions of the feature variables, so that

$$p(\mathcal{C}_k|\boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)} \qquad (4.104)$$

where the 'activations' $a_k$ are given by

$$a_k = \mathbf{w}_k^{\mathrm{T}} \boldsymbol{\phi}. \qquad (4.105)$$

There we used maximum likelihood to determine separately the class-conditional densities and the class priors and then found the corresponding posterior probabilities using Bayes' theorem, thereby implicitly determining the parameters $\{\mathbf{w}_k\}$. Here we consider the use of maximum likelihood to determine the parameters $\{\mathbf{w}_k\}$ of this model directly. To do this, we will require the derivatives of $y_k$ with respect to all of the activations $a_j$. These are given by (Exercise 4.17)

$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j) \qquad (4.106)$$

where $I_{kj}$ are the elements of the identity matrix.

Next we write down the likelihood function. This is most easily done using the 1-of-$K$ coding scheme in which the target vector $\mathbf{t}_n$ for a feature vector $\boldsymbol{\phi}_n$ belonging to class $\mathcal{C}_k$ is a binary vector with all elements zero except for element $k$, which equals one. The likelihood function is then given by

$$p(\mathbf{T}|\mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(\mathcal{C}_k|\boldsymbol{\phi}_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}} \qquad (4.107)$$

where $y_{nk} = y_k(\boldsymbol{\phi}_n)$, and $\mathbf{T}$ is an $N \times K$ matrix of target variables with elements $t_{nk}$. Taking the negative logarithm then gives

$$E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\ln p(\mathbf{T}|\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk} \qquad (4.108)$$

which is known as the cross-entropy error function for the multiclass classification problem.

We now take the gradient of the error function with respect to one of the parameter vectors $\mathbf{w}_j$. Making use of the result (4.106) for the derivatives of the softmax function, we obtain (Exercise 4.18)

$$\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \boldsymbol{\phi}_n \qquad (4.109)$$

where we have made use of $\sum_k t_{nk} = 1$.
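In matrix form, the gradient (4.109) for all $K$ parameter vectors at once is $\boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{Y} - \mathbf{T})$. A minimal sketch (the row-wise max shift in the softmax is a standard numerical safeguard, not part of the definition):

```python
import numpy as np

def softmax_rows(A):
    """Row-wise normalized exponential (4.104), shifted for stability."""
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def multiclass_gradient(W, Phi, T):
    """Gradient (4.109) of the cross-entropy error (4.108).

    W : (M, K) matrix whose columns are the w_k; Phi : (N, M) design matrix;
    T : (N, K) 1-of-K target matrix. Column j of the result is grad_{w_j} E.
    """
    Y = softmax_rows(Phi @ W)     # y_nk from (4.104)-(4.105)
    return Phi.T @ (Y - T)
```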

230 Once again, we see the same form arising for the gradient as was found for the sum-of-squares error function with the linear model and the cross-entropy error for the logistic regression model, namely the product of the error $(y_{nj} - t_{nj})$ times the basis function $\boldsymbol{\phi}_n$. Again, we could use this to formulate a sequential algorithm in which patterns are presented one at a time, and each of the weight vectors is updated using (3.22).

We have seen that the derivative of the log likelihood function for a linear regression model with respect to the parameter vector $\mathbf{w}$ for a data point $n$ took the form of the 'error' $y_n - t_n$ times the feature vector $\boldsymbol{\phi}_n$. Similarly, for the combination of logistic sigmoid activation function and cross-entropy error function (4.90), and for the softmax activation function with the multiclass cross-entropy error function (4.108), we again obtain this same simple form. This is an example of a more general result, as we shall see in Section 4.3.6.

To find a batch algorithm, we again appeal to the Newton-Raphson update to obtain the corresponding IRLS algorithm for the multiclass problem. This requires evaluation of the Hessian matrix that comprises blocks of size $M \times M$, in which block $j, k$ is given by (Exercise 4.20)

$$\nabla_{\mathbf{w}_k} \nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathrm{T}}. \qquad (4.110)$$

As with the two-class problem, the Hessian matrix for the multiclass logistic regression model is positive definite and so the error function again has a unique minimum. Practical details of IRLS for the multiclass case can be found in Bishop and Nabney (2008).

4.3.5 Probit regression

We have seen that, for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities (for instance, if the class-conditional densities are modelled using Gaussian mixtures). This suggests that it might be worth exploring other types of discriminative probabilistic model. For the purposes of this chapter, however, we shall return to the two-class case, and again remain within the framework of generalized linear models, so that

$$p(t = 1|a) = f(a) \qquad (4.111)$$

where $a = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}$, and $f(\cdot)$ is the activation function.

One way to motivate an alternative choice for the link function is to consider a noisy threshold model, as follows. For each input $\boldsymbol{\phi}_n$, we evaluate $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$ and then we set the target value according to

$$\begin{cases} t_n = 1 & \text{if } a_n \geq \theta \\ t_n = 0 & \text{otherwise.} \end{cases} \qquad (4.112)$$

231 Figure 4.13 Schematic example of a probability density $p(\theta)$ shown by the blue curve, given in this example by a mixture of two Gaussians, along with its cumulative distribution function $f(a)$, shown by the red curve. Note that the value of the blue curve at any point, such as that indicated by the vertical green line, corresponds to the slope of the red curve at the same point. Conversely, the value of the red curve at this point corresponds to the area under the blue curve indicated by the shaded green region. In the stochastic threshold model, the class label takes the value $t = 1$ if the value of $a = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}$ exceeds a threshold, otherwise it takes the value $t = 0$. This is equivalent to an activation function given by the cumulative distribution function $f(a)$.

If the value of $\theta$ is drawn from a probability density $p(\theta)$, then the corresponding activation function will be given by the cumulative distribution function

$$f(a) = \int_{-\infty}^{a} p(\theta) \, d\theta \qquad (4.113)$$

as illustrated in Figure 4.13.

As a specific example, suppose that the density $p(\theta)$ is given by a zero mean, unit variance Gaussian. The corresponding cumulative distribution function is given by

$$\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta|0, 1) \, d\theta \qquad (4.114)$$

which is known as the probit function. It has a sigmoidal shape and is compared with the logistic sigmoid function in Figure 4.9. Note that the use of a more general Gaussian distribution does not change the model, because this is equivalent to a re-scaling of the linear coefficients $\mathbf{w}$. Many numerical packages provide for the evaluation of a closely related function defined by

$$\operatorname{erf}(a) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} \exp(-\theta^2/2) \, d\theta \qquad (4.115)$$

and known as the error function or erf function (not to be confused with the error function of a machine learning model). It is related to the probit function by (Exercise 4.21)

$$\Phi(a) = \frac{1}{2} \left\{ 1 + \frac{1}{\sqrt{2}} \operatorname{erf}(a) \right\}. \qquad (4.116)$$

The generalized linear model based on a probit activation function is known as probit regression.

We can determine the parameters of this model using maximum likelihood, by a straightforward extension of the ideas discussed earlier. In practice, the results found using probit regression tend to be similar to those of logistic regression.
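One point worth flagging when implementing this: the erf provided by standard libraries follows the conventional definition without the factor of $1/2$ in the exponent, which differs from (4.115) by the substitution $u = \theta/\sqrt{2}$. A minimal sketch of the probit function using Python's standard library:

```python
import math

def probit(a):
    """Probit function (4.114). math.erf is the conventional error function
    erf(x) = (2/sqrt(pi)) * integral_0^x exp(-u^2) du, which differs from the
    scaled definition (4.115) by u = theta/sqrt(2); with it, relation (4.116)
    becomes Phi(a) = 0.5 * (1 + erf(a / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
```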

232 We shall, however, find another use for the probit model when we discuss Bayesian treatments of logistic regression in Section 4.5.

One issue that can occur in practical applications is that of outliers, which can arise for instance through errors in measuring the input vector $\mathbf{x}$ or through mislabelling of the target value $t$. Because such points can lie a long way to the wrong side of the ideal decision boundary, they can seriously distort the classifier. Note that the logistic and probit regression models behave differently in this respect, because the tails of the logistic sigmoid decay asymptotically like $\exp(-x)$ for $x \to \infty$, whereas for the probit activation function they decay like $\exp(-x^2)$, and so the probit model can be significantly more sensitive to outliers.

However, both the logistic and the probit models assume the data is correctly labelled. The effect of mislabelling is easily incorporated into a probabilistic model by introducing a probability $\epsilon$ that the target value $t$ has been flipped to the wrong value (Opper and Winther, 2000a), leading to a target value distribution for data point $\mathbf{x}$ of the form

$$p(t|\mathbf{x}) = (1 - \epsilon) \sigma(\mathbf{x}) + \epsilon (1 - \sigma(\mathbf{x})) = \epsilon + (1 - 2\epsilon) \sigma(\mathbf{x}) \qquad (4.117)$$

where $\sigma(\mathbf{x})$ is the activation function with input vector $\mathbf{x}$. Here $\epsilon$ may be set in advance, or it may be treated as a hyperparameter whose value is inferred from the data.

4.3.6 Canonical link functions

For the linear regression model with a Gaussian noise distribution, the error function, corresponding to the negative log likelihood, is given by (3.12). If we take the derivative with respect to the parameter vector $\mathbf{w}$ of the contribution to the error function from a data point $n$, this takes the form of the 'error' $y_n - t_n$ times the feature vector $\boldsymbol{\phi}_n$, where $y_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$. Similarly, for the combination of the logistic sigmoid activation function and the cross-entropy error function (4.90), and for the softmax activation function with the multiclass cross-entropy error function (4.108), we again obtain this same simple form. We now show that this is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function.

We again make use of the restricted form (4.84) of exponential family distributions. Note that here we are applying the assumption of exponential family distribution to the target variable $t$, in contrast to Section 4.2.4 where we applied it to the input vector $\mathbf{x}$. We therefore consider conditional distributions of the target variable of the form

$$p(t|\eta, s) = \frac{1}{s} h\left(\frac{t}{s}\right) g(\eta) \exp \left\{ \frac{\eta t}{s} \right\}. \qquad (4.118)$$

Using the same line of argument as led to the derivation of the result (2.226), we see that the conditional mean of $t$, which we denote by $y$, is given by

$$y \equiv \mathbb{E}[t|\eta] = -s \frac{d}{d\eta} \ln g(\eta). \qquad (4.119)$$

233 Thus $y$ and $\eta$ must be related, and we denote this relation through $\eta = \psi(y)$.

Following Nelder and Wedderburn (1972), we define a generalized linear model to be one for which $y$ is a nonlinear function of a linear combination of the input (or feature) variables, so that

$$y = f(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) \qquad (4.120)$$

where $f(\cdot)$ is known as the activation function in the machine learning literature, and $f^{-1}(\cdot)$ is known as the link function in statistics.

Now consider the log likelihood function for this model, which, as a function of $\eta$, is given by

$$\ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \ln p(t_n|\eta, s) = \sum_{n=1}^{N} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} + \text{const} \qquad (4.121)$$

where we are assuming that all observations share a common scale parameter (which corresponds to the noise variance for a Gaussian distribution, for instance) and so $s$ is independent of $n$. The derivative of the log likelihood with respect to the model parameters $\mathbf{w}$ is then given by

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \frac{d}{d\eta_n} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} \frac{d\eta_n}{dy_n} \frac{dy_n}{da_n} \nabla a_n = \sum_{n=1}^{N} \frac{1}{s} \left\{ t_n - y_n \right\} \psi'(y_n) f'(a_n) \boldsymbol{\phi}_n \qquad (4.122)$$

where $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$, and we have used $y_n = f(a_n)$ together with the result (4.119) for $\mathbb{E}[t|\eta]$. We now see that there is a considerable simplification if we choose a particular form for the link function $f^{-1}(y)$, given by

$$f^{-1}(y) = \psi(y) \qquad (4.123)$$

which gives $f(\psi(y)) = y$ and hence $f'(\psi) \psi'(y) = 1$. Also, because $a = f^{-1}(y)$, we have $a = \psi$ and hence $f'(a) \psi'(y) = 1$. In this case, the gradient of the error function reduces to

$$\nabla E(\mathbf{w}) = \frac{1}{s} \sum_{n=1}^{N} \{ y_n - t_n \} \boldsymbol{\phi}_n. \qquad (4.124)$$

For the Gaussian $s = \beta^{-1}$, whereas for the logistic model $s = 1$.

4.4. The Laplace Approximation

In Section 4.5 we shall discuss the Bayesian treatment of logistic regression. As we shall see, this is more complex than the Bayesian treatment of linear regression models, discussed in Sections 3.3 and 3.5.

234 In particular, we cannot integrate exactly over the parameter vector $\mathbf{w}$, since the posterior distribution is no longer Gaussian. It is therefore necessary to introduce some form of approximation. Later in the book we shall consider a range of techniques based on analytical approximations (Chapter 10) and numerical sampling (Chapter 11).

Here we introduce a simple but widely used framework called the Laplace approximation, which aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Consider first the case of a single continuous variable $z$, and suppose the distribution $p(z)$ is defined by

$$p(z) = \frac{1}{Z} f(z) \qquad (4.125)$$

where $Z = \int f(z) \, dz$ is the normalization coefficient. We shall suppose that the value of $Z$ is unknown. In the Laplace method, the goal is to find a Gaussian approximation $q(z)$ which is centred on a mode of the distribution $p(z)$. The first step is to find a mode of $p(z)$, in other words a point $z_0$ such that $p'(z_0) = 0$, or equivalently

$$\left. \frac{df(z)}{dz} \right|_{z = z_0} = 0. \qquad (4.126)$$

A Gaussian distribution has the property that its logarithm is a quadratic function of the variables. We therefore consider a Taylor expansion of $\ln f(z)$ centred on the mode $z_0$, so that

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 \qquad (4.127)$$

where

$$A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}. \qquad (4.128)$$

Note that the first-order term in the Taylor expansion does not appear, since $z_0$ is a local maximum of the distribution. Taking the exponential, we obtain

$$f(z) \simeq f(z_0) \exp \left\{ -\frac{A}{2} (z - z_0)^2 \right\}. \qquad (4.129)$$

We can then obtain a normalized distribution $q(z)$ by making use of the standard result for the normalization of a Gaussian, so that

$$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp \left\{ -\frac{A}{2} (z - z_0)^2 \right\}. \qquad (4.130)$$

The Laplace approximation is illustrated in Figure 4.14. Note that the Gaussian approximation will only be well defined if its precision $A > 0$; in other words, the stationary point $z_0$ must be a local maximum, so that the second derivative of $f(z)$ at the point $z_0$ is negative.
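The two steps of the method, finding the mode and evaluating the curvature there, can be sketched in a few lines. The following assumes SciPy is available and estimates $A$ by a finite difference; the function name and the step size are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_1d(ln_f, eps=1e-4):
    """Laplace approximation (4.130) to p(z) proportional to f(z), scalar z.
    Returns the mode z0 and precision A, so that q(z) = N(z | z0, 1/A)."""
    z0 = minimize_scalar(lambda z: -ln_f(z)).x        # find a mode of f
    # A of (4.128) via a central finite difference for -d^2 ln f / dz^2
    A = -(ln_f(z0 + eps) - 2.0 * ln_f(z0) + ln_f(z0 - eps)) / eps**2
    return z0, A

# the density of Figure 4.14: p(z) proportional to exp(-z^2/2) * sigma(20z + 4),
# using ln sigma(x) = -log(1 + exp(-x))
z0, A = laplace_1d(lambda z: -0.5 * z**2 - np.log1p(np.exp(-(20.0 * z + 4.0))))
```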

235 Figure 4.14 Illustration of the Laplace approximation applied to the distribution $p(z) \propto \exp(-z^2/2)\, \sigma(20z + 4)$, where $\sigma(z)$ is the logistic sigmoid function defined by $\sigma(z) = (1 + e^{-z})^{-1}$. The left plot shows the normalized distribution $p(z)$ in yellow, together with the Laplace approximation centred on the mode $z_0$ of $p(z)$ in red. The right plot shows the negative logarithms of the corresponding curves.

We can extend the Laplace method to approximate a distribution $p(\mathbf{z}) = f(\mathbf{z})/Z$ defined over an $M$-dimensional space $\mathbf{z}$. At a stationary point $\mathbf{z}_0$ the gradient $\nabla f(\mathbf{z})$ will vanish. Expanding around this stationary point, we have

$$\ln f(\mathbf{z}) \simeq \ln f(\mathbf{z}_0) - \frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \qquad (4.131)$$

where the $M \times M$ Hessian matrix $\mathbf{A}$ is defined by

$$\mathbf{A} = -\nabla \nabla \ln f(\mathbf{z}) \big|_{\mathbf{z} = \mathbf{z}_0} \qquad (4.132)$$

and $\nabla$ is the gradient operator. Taking the exponential of both sides, we obtain

$$f(\mathbf{z}) \simeq f(\mathbf{z}_0) \exp \left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\}. \qquad (4.133)$$

The distribution $q(\mathbf{z})$ is proportional to $f(\mathbf{z})$, and the appropriate normalization coefficient can be found by inspection, using the standard result (2.43) for a normalized multivariate Gaussian, giving

$$q(\mathbf{z}) = \frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}} \exp \left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\} = \mathcal{N}(\mathbf{z}|\mathbf{z}_0, \mathbf{A}^{-1}) \qquad (4.134)$$

where $|\mathbf{A}|$ denotes the determinant of $\mathbf{A}$. This Gaussian distribution will be well defined provided its precision matrix, given by $\mathbf{A}$, is positive definite, which implies that the stationary point $\mathbf{z}_0$ must be a local maximum, not a minimum or a saddle point.

In order to apply the Laplace approximation, we first need to find the mode $\mathbf{z}_0$ and then evaluate the Hessian matrix at that mode. In practice a mode will typically be found by running some form of numerical optimization algorithm (Bishop and Nabney, 2008).

236 Many of the distributions encountered in practice will be multimodal, and so there will be different Laplace approximations according to which mode is being considered. Note that the normalization constant $Z$ of the true distribution does not need to be known in order to apply the Laplace method. As a result of the central limit theorem, the posterior distribution for a model is expected to become increasingly well approximated by a Gaussian as the number of observed data points is increased, and so we would expect the Laplace approximation to be most useful in situations where the number of data points is relatively large.

One major weakness of the Laplace approximation is that, since it is based on a Gaussian distribution, it is only directly applicable to real variables. In other cases it may be possible to apply the Laplace approximation to a transformation of the variable. For instance, if $0 \leq \tau < \infty$ then we can consider a Laplace approximation of $\ln \tau$. The most serious limitation of the Laplace framework, however, is that it is based purely on the aspects of the true distribution at a specific value of the variable, and so can fail to capture important global properties. In Chapter 10 we shall consider alternative approaches which adopt a more global perspective.

4.4.1 Model comparison and BIC

As well as approximating the distribution $p(\mathbf{z})$, we can also obtain an approximation to the normalization constant $Z$. Using the approximation (4.133) we have

$$Z = \int f(\mathbf{z}) \, d\mathbf{z} \simeq f(\mathbf{z}_0) \int \exp \left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\} d\mathbf{z} = f(\mathbf{z}_0) \frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}} \qquad (4.135)$$

where we have noted that the integrand is Gaussian and made use of the standard result (2.43) for a normalized Gaussian distribution. We can use the result (4.135) to obtain an approximation to the model evidence which, as discussed in Section 3.4, plays a central role in Bayesian model comparison.

Consider a data set $\mathcal{D}$ and a set of models $\{\mathcal{M}_i\}$ having parameters $\{\boldsymbol{\theta}_i\}$. For each model we define a likelihood function $p(\mathcal{D}|\boldsymbol{\theta}_i, \mathcal{M}_i)$. If we introduce a prior $p(\boldsymbol{\theta}_i|\mathcal{M}_i)$ over the parameters, then we are interested in computing the model evidence $p(\mathcal{D}|\mathcal{M}_i)$ for the various models. From now on we omit the conditioning on $\mathcal{M}_i$ to keep the notation uncluttered. From Bayes' theorem the model evidence is given by

$$p(\mathcal{D}) = \int p(\mathcal{D}|\boldsymbol{\theta}) p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}. \qquad (4.136)$$

Identifying $f(\boldsymbol{\theta}) = p(\mathcal{D}|\boldsymbol{\theta}) p(\boldsymbol{\theta})$ and $Z = p(\mathcal{D})$, and applying the result (4.135), we obtain (Exercise 4.22)

$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) + \underbrace{\ln p(\boldsymbol{\theta}_{\mathrm{MAP}}) + \frac{M}{2} \ln(2\pi) - \frac{1}{2} \ln |\mathbf{A}|}_{\text{Occam factor}} \qquad (4.137)$$

237 where $\boldsymbol{\theta}_{\mathrm{MAP}}$ is the value of $\boldsymbol{\theta}$ at the mode of the posterior distribution, and $\mathbf{A}$ is the Hessian matrix of second derivatives of the negative log posterior

$$\mathbf{A} = -\nabla \nabla \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) p(\boldsymbol{\theta}_{\mathrm{MAP}}) = -\nabla \nabla \ln p(\boldsymbol{\theta}_{\mathrm{MAP}}|\mathcal{D}). \qquad (4.138)$$

The first term on the right hand side of (4.137) represents the log likelihood evaluated using the optimized parameters, while the remaining three terms comprise the 'Occam factor' which penalizes model complexity.

If we assume that the Gaussian prior distribution over parameters is broad, and that the Hessian has full rank, then we can approximate (4.137) very roughly using (Exercise 4.23)

$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2} M \ln N \qquad (4.139)$$

where $N$ is the number of data points, $M$ is the number of parameters in $\boldsymbol{\theta}$, and we have omitted additive constants. This is known as the Bayesian Information Criterion (BIC) or the Schwarz criterion (Schwarz, 1978). Note that, compared to AIC given by (1.73), this penalizes model complexity more heavily.

Complexity measures such as AIC and BIC have the virtue of being easy to evaluate, but can also give misleading results. In particular, the assumption that the Hessian matrix has full rank is often not valid, since many of the parameters are not 'well-determined' (Section 3.5.3). We can use the result (4.137) to obtain a more accurate estimate of the model evidence starting from the Laplace approximation, as we illustrate in the context of neural networks in Section 5.7.
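In use, (4.139) is evaluated once per candidate model and the model with the largest value is preferred. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def bic(ln_lik_map, M, N):
    """BIC approximation (4.139) to the log model evidence: ln_lik_map is
    the maximized log likelihood, M the parameter count, N the data count.
    Additive constants are omitted, as in (4.139)."""
    return ln_lik_map - 0.5 * M * np.log(N)
```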

238 4.5. Bayesian Logistic Regression

We now turn to a Bayesian treatment of logistic regression. Exact Bayesian inference for logistic regression is intractable. In particular, evaluation of the posterior distribution would require normalization of the product of a prior distribution and a likelihood function that itself comprises a product of logistic sigmoid functions, one for every data point. Evaluation of the predictive distribution is similarly intractable. Here we consider the application of the Laplace approximation to the problem of Bayesian logistic regression (Spiegelhalter and Lauritzen, 1990; MacKay, 1992b).

4.5.1 Laplace approximation

Recall from Section 4.4 that the Laplace approximation is obtained by finding the mode of the posterior distribution and then fitting a Gaussian centred at that mode. This requires evaluation of the second derivatives of the log posterior, which is equivalent to finding the Hessian matrix.

Because we seek a Gaussian representation for the posterior distribution, it is natural to begin with a Gaussian prior, which we write in the general form

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0, \mathbf{S}_0) \qquad (4.140)$$

where $\mathbf{m}_0$ and $\mathbf{S}_0$ are fixed hyperparameters. The posterior distribution over $\mathbf{w}$ is given by

$$p(\mathbf{w}|\mathbf{t}) \propto p(\mathbf{w}) p(\mathbf{t}|\mathbf{w}) \qquad (4.141)$$

where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. Taking the log of both sides, and substituting for the prior distribution using (4.140) and for the likelihood function using (4.89), we obtain

$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^{\mathrm{T}} \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) + \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} + \text{const} \qquad (4.142)$$

where $y_n = \sigma(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n)$. To obtain a Gaussian approximation to the posterior distribution, we first maximize the posterior distribution to give the MAP (maximum posterior) solution $\mathbf{w}_{\mathrm{MAP}}$, which defines the mean of the Gaussian. The covariance is then given by the inverse of the matrix of second derivatives of the negative log likelihood, which takes the form

$$\mathbf{S}_N^{-1} = -\nabla \nabla \ln p(\mathbf{w}|\mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n) \boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\mathrm{T}}. \qquad (4.143)$$

The Gaussian approximation to the posterior distribution therefore takes the form

$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_N). \qquad (4.144)$$

Having obtained a Gaussian approximation to the posterior distribution, there remains the task of marginalizing with respect to this distribution in order to make predictions.
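Both ingredients of (4.144) can be obtained with a short Newton-Raphson loop on the log posterior (4.142), reusing the Hessian (4.143) from the final iteration. A minimal sketch, with a fixed iteration count as an illustrative stopping rule:

```python
import numpy as np

def laplace_posterior(Phi, t, m0, S0, n_iter=20):
    """Gaussian approximation (4.144): w_MAP by Newton-Raphson on the log
    posterior (4.142), and S_N from (4.143).

    Phi : (N, M) design matrix; t : (N,) binary targets;
    m0, S0 : prior mean and covariance from (4.140).
    """
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)             # -grad log posterior
        H = S0_inv + Phi.T @ ((y * (1.0 - y))[:, None] * Phi)  # (4.143)
        w = w - np.linalg.solve(H, grad)
    return w, np.linalg.inv(H)     # w_MAP and S_N
```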

239 4.5.2 Predictive distribution

The predictive distribution for class $\mathcal{C}_1$, given a new feature vector $\boldsymbol{\phi}(\mathbf{x})$, is obtained by marginalizing with respect to the posterior distribution $p(\mathbf{w}|\mathbf{t})$, which is itself approximated by a Gaussian distribution $q(\mathbf{w})$, so that

$$p(\mathcal{C}_1|\boldsymbol{\phi}, \mathbf{t}) = \int p(\mathcal{C}_1|\boldsymbol{\phi}, \mathbf{w}) p(\mathbf{w}|\mathbf{t}) \, d\mathbf{w} \simeq \int \sigma(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) q(\mathbf{w}) \, d\mathbf{w} \qquad (4.145)$$

with the corresponding probability for class $\mathcal{C}_2$ given by $p(\mathcal{C}_2|\boldsymbol{\phi}, \mathbf{t}) = 1 - p(\mathcal{C}_1|\boldsymbol{\phi}, \mathbf{t})$. To evaluate the predictive distribution, we first note that the function $\sigma(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi})$ depends on $\mathbf{w}$ only through its projection onto $\boldsymbol{\phi}$. Denoting $a = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}$, we have

$$\sigma(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) = \int \delta(a - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) \sigma(a) \, da \qquad (4.146)$$

where $\delta(\cdot)$ is the Dirac delta function. From this we obtain

$$\int \sigma(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) q(\mathbf{w}) \, d\mathbf{w} = \int \sigma(a) p(a) \, da \qquad (4.147)$$

where

$$p(a) = \int \delta(a - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) q(\mathbf{w}) \, d\mathbf{w}. \qquad (4.148)$$

We can evaluate $p(a)$ by noting that the delta function imposes a linear constraint on $\mathbf{w}$ and so forms a marginal distribution from the joint distribution $q(\mathbf{w})$ by integrating out all directions orthogonal to $\boldsymbol{\phi}$. Because $q(\mathbf{w})$ is Gaussian, we know from Section 2.3.2 that the marginal distribution will also be Gaussian. We can evaluate the mean and covariance of this distribution by taking moments and interchanging the order of integration over $a$ and $\mathbf{w}$, so that

$$\mu_a = \mathbb{E}[a] = \int p(a) \, a \, da = \int q(\mathbf{w}) \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi} \, d\mathbf{w} = \mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \boldsymbol{\phi} \qquad (4.149)$$

where we have used the result (4.144) for the approximate posterior distribution $q(\mathbf{w})$. Similarly

$$\sigma_a^2 = \operatorname{var}[a] = \int p(a) \left\{ a^2 - \mathbb{E}[a]^2 \right\} da = \int q(\mathbf{w}) \left\{ (\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi})^2 - (\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \boldsymbol{\phi})^2 \right\} d\mathbf{w} = \boldsymbol{\phi}^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}. \qquad (4.150)$$

Note that the distribution of $a$ takes the same form as the predictive distribution (3.58) for the linear regression model, with the noise variance set to zero. Thus our approximation to the predictive distribution becomes

$$p(\mathcal{C}_1|\mathbf{t}) = \int \sigma(a) p(a) \, da = \int \sigma(a) \mathcal{N}(a|\mu_a, \sigma_a^2) \, da. \qquad (4.151)$$

This result can also be derived directly by making use of the results for the marginal of a Gaussian distribution given in Section 2.3.2 (Exercise 4.24).

The integral over $a$ represents the convolution of a Gaussian with a logistic sigmoid and cannot be evaluated analytically. We can, however, obtain a good approximation (Spiegelhalter and Lauritzen, 1990; MacKay, 1992b; Barber and Bishop, 1998a) by making use of the close similarity between the logistic sigmoid function $\sigma(a)$ defined by (4.59) and the probit function $\Phi(a)$ defined by (4.114). In order to obtain the best approximation to the logistic function, we need to re-scale the horizontal axis, so that we approximate $\sigma(a)$ by $\Phi(\lambda a)$. We can find a suitable value of $\lambda$ by requiring that the two functions have the same slope at the origin, which gives $\lambda^2 = \pi/8$ (Exercise 4.25). The similarity of the logistic sigmoid and the probit function, for this choice of $\lambda$, is illustrated in Figure 4.9.

The advantage of using a probit function is that its convolution with a Gaussian can be expressed analytically in terms of another probit function. Specifically, we can show that (Exercise 4.26)

$$\int \Phi(\lambda a) \mathcal{N}(a|\mu, \sigma^2) \, da = \Phi \left( \frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}} \right). \qquad (4.152)$$

240 We now apply the approximation $\sigma(a) \simeq \Phi(\lambda a)$ to the probit functions appearing on both sides of this equation, leading to the following approximation for the convolution of a logistic sigmoid with a Gaussian

$$\int \sigma(a) \mathcal{N}(a|\mu, \sigma^2) \, da \simeq \sigma(\kappa(\sigma^2) \mu) \qquad (4.153)$$

where we have defined

$$\kappa(\sigma^2) = (1 + \pi \sigma^2 / 8)^{-1/2}. \qquad (4.154)$$

Applying this result to (4.151), we obtain the approximate predictive distribution in the form

$$p(\mathcal{C}_1|\boldsymbol{\phi}, \mathbf{t}) = \sigma(\kappa(\sigma_a^2) \mu_a) \qquad (4.155)$$

where $\mu_a$ and $\sigma_a^2$ are defined by (4.149) and (4.150), respectively, and $\kappa(\sigma_a^2)$ is defined by (4.154).

Note that the decision boundary corresponding to $p(\mathcal{C}_1|\boldsymbol{\phi}, \mathbf{t}) = 0.5$ is given by $\mu_a = 0$, which is the same as the decision boundary obtained by using the MAP value for $\mathbf{w}$. Thus if the decision criterion is based on minimizing misclassification rate, with equal prior probabilities, then the marginalization over $\mathbf{w}$ has no effect. However, for more complex decision criteria it will play an important role. Marginalization of the logistic sigmoid model under a Gaussian approximation to the posterior distribution will be illustrated in the context of variational inference in Figure 10.13.
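Putting (4.149), (4.150), (4.154), and (4.155) together, the approximate predictive probability is a one-liner given the Laplace-approximation quantities $\mathbf{w}_{\mathrm{MAP}}$ and $\mathbf{S}_N$. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def predictive_probability(phi, w_map, S_N):
    """Approximate predictive distribution (4.155) for a new feature vector."""
    mu_a = w_map @ phi                                   # mean (4.149)
    var_a = phi @ S_N @ phi                              # variance (4.150)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)     # (4.154)
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))           # sigma(kappa * mu_a)
```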

To do so, assume that one of the basis functions \phi_0(\mathbf{x}) = 1 so that the corresponding parameter w_0 plays the role of a bias.

4.3 Extend the result of Exercise 4.2 to show that if multiple linear constraints are satisfied simultaneously by the target vectors, then the same constraints will also be satisfied by the least-squares prediction of a linear model.

4.4 (www) Show that maximization of the class separation criterion given by (4.23) with respect to \mathbf{w}, using a Lagrange multiplier to enforce the constraint \mathbf{w}^T \mathbf{w} = 1, leads to the result that \mathbf{w} \propto (\mathbf{m}_2 - \mathbf{m}_1).

4.5 By making use of (4.20), (4.23), and (4.24), show that the Fisher criterion (4.25) can be written in the form (4.26).

4.6 Using the definitions of the between-class and within-class covariance matrices given by (4.27) and (4.28), respectively, together with (4.34) and (4.36) and the choice of target values described in Section 4.1.5, show that the expression (4.33) that minimizes the sum-of-squares error function can be written in the form (4.37).

4.7 (www) Show that the logistic sigmoid function (4.59) satisfies the property \sigma(-a) = 1 - \sigma(a) and that its inverse is given by \sigma^{-1}(y) = \ln\{y/(1-y)\}.

4.8 Using (4.57) and (4.58), derive the result (4.65) for the posterior class probability in the two-class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for the parameters \mathbf{w} and w_0.

4.9 (www) Consider a generative classification model for K classes defined by prior class probabilities p(\mathcal{C}_k) = \pi_k and general class-conditional densities p(\boldsymbol{\phi}|\mathcal{C}_k) where \boldsymbol{\phi} is the input feature vector. Suppose we are given a training data set \{\boldsymbol{\phi}_n, \mathbf{t}_n\} where n = 1, ..., N, and \mathbf{t}_n is a binary target vector of length K that uses the 1-of-K coding scheme, so that it has components t_{nj} = I_{jk} if pattern n is from class \mathcal{C}_k. Assuming that the data points are drawn independently from this model, show that the maximum-likelihood solution for the prior probabilities is given by

\pi_k = \frac{N_k}{N}    (4.159)

where N_k is the number of data points assigned to class \mathcal{C}_k.

4.10 Consider the classification model of Exercise 4.9 and now suppose that the class-conditional densities are given by Gaussian distributions with a shared covariance matrix, so that

p(\boldsymbol{\phi}|\mathcal{C}_k) = \mathcal{N}(\boldsymbol{\phi}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}).    (4.160)

Show that the maximum likelihood solution for the mean of the Gaussian distribution for class \mathcal{C}_k is given by

\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} \boldsymbol{\phi}_n    (4.161)

which represents the mean of those feature vectors assigned to class \mathcal{C}_k.

Similarly, show that the maximum likelihood solution for the shared covariance matrix is given by

\boldsymbol{\Sigma} = \sum_{k=1}^{K} \frac{N_k}{N} \mathbf{S}_k    (4.162)

where

\mathbf{S}_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} (\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)^T.    (4.163)

Thus \boldsymbol{\Sigma} is given by a weighted average of the covariances of the data associated with each class, in which the weighting coefficients are given by the prior probabilities of the classes. (These estimators are sketched in code following Exercise 4.26.)

4.11 Consider a classification problem with K classes for which the feature vector \boldsymbol{\phi} has M components each of which can take L discrete states. Let the values of the components be represented by a 1-of-L binary coding scheme. Further suppose that, conditioned on the class \mathcal{C}_k, the M components of \boldsymbol{\phi} are independent, so that the class-conditional density factorizes with respect to the feature vector components. Show that the quantities a_k given by (4.63), which appear in the argument to the softmax function describing the posterior class probabilities, are linear functions of the components of \boldsymbol{\phi}. Note that this represents an example of the naive Bayes model which is discussed in Section 8.2.2.

4.12 (www) Verify the relation (4.88) for the derivative of the logistic sigmoid function defined by (4.59).

4.13 (www) By making use of the result (4.88) for the derivative of the logistic sigmoid, show that the derivative of the error function (4.90) for the logistic regression model is given by (4.91).

4.14 Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector \mathbf{w} whose decision boundary \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = 0 separates the classes and then taking the magnitude of \mathbf{w} to infinity.

4.15 Show that the Hessian matrix \mathbf{H} for the logistic regression model, given by (4.97), is positive definite. Here \mathbf{R} is a diagonal matrix with elements y_n(1 - y_n), and y_n is the output of the logistic regression model for input vector \mathbf{x}_n. Hence show that the error function is a convex function of \mathbf{w} and that it has a unique minimum.

4.16 Consider a binary classification problem in which each observation \mathbf{x}_n is known to belong to one of two classes, corresponding to t = 0 and t = 1, and suppose that the procedure for collecting training data is imperfect, so that training points are sometimes mislabelled. For every data point \mathbf{x}_n, instead of having a value t for the class label, we have instead a value \pi_n representing the probability that t_n = 1. Given a probabilistic model p(t = 1 | \boldsymbol{\phi}), write down the log likelihood function appropriate to such a data set.

4.17 (www) Show that the derivatives of the softmax activation function (4.104), where the a_k are defined by (4.105), are given by (4.106).

4.18 Using the result (4.106) for the derivatives of the softmax activation function, show that the gradients of the cross-entropy error (4.108) are given by (4.109).

4.19 (www) Write down expressions for the gradient of the log likelihood, as well as the corresponding Hessian matrix, for the probit regression model defined in Section 4.3.5. These are the quantities that would be required to train such a model using IRLS.

4.20 Show that the Hessian matrix for the multiclass logistic regression problem, defined by (4.110), is positive semidefinite. Note that the full Hessian matrix for this problem is of size MK \times MK, where M is the number of parameters and K is the number of classes. To prove the positive semidefinite property, consider the product \mathbf{u}^T \mathbf{H} \mathbf{u} where \mathbf{u} is an arbitrary vector of length MK, and then apply Jensen's inequality.

4.21 Show that the probit function (4.114) and the erf function (4.115) are related by (4.116).

4.22 Using the result (4.135), derive the expression (4.137) for the log model evidence under the Laplace approximation.

4.23 (www) In this exercise, we derive the BIC result (4.139) starting from the Laplace approximation to the model evidence given by (4.137). Show that if the prior over parameters is Gaussian of the form p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} | \mathbf{m}, \mathbf{V}_0), the log model evidence under the Laplace approximation takes the form

\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} | \boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2} (\boldsymbol{\theta}_{\mathrm{MAP}} - \mathbf{m})^T \mathbf{V}_0^{-1} (\boldsymbol{\theta}_{\mathrm{MAP}} - \mathbf{m}) - \frac{1}{2} \ln |\mathbf{H}| + \mathrm{const}

where \mathbf{H} is the matrix of second derivatives of the log likelihood \ln p(\mathcal{D} | \boldsymbol{\theta}) evaluated at \boldsymbol{\theta}_{\mathrm{MAP}}. Now assume that the prior is broad so that \mathbf{V}_0^{-1} is small and the second term on the right-hand side above can be neglected. Furthermore, consider the case of independent, identically distributed data so that \mathbf{H} is the sum of terms, one for each data point. Show that the log model evidence can then be written approximately in the form of the BIC expression (4.139).

4.24 Use the results from Section 2.3.2 to derive the result (4.151) for the marginalization of the logistic regression model with respect to a Gaussian posterior distribution over the parameters \mathbf{w}.

4.25 Suppose we wish to approximate the logistic sigmoid \sigma(a) defined by (4.59) by a scaled probit function \Phi(\lambda a), where \Phi(a) is defined by (4.114). Show that if \lambda is chosen so that the derivatives of the two functions are equal at a = 0, then \lambda^2 = \pi/8.

4.26 In this exercise, we prove the relation (4.152) for the convolution of a probit function with a Gaussian distribution. To do this, show that the derivative of the left-hand side with respect to \mu is equal to the derivative of the right-hand side, then integrate both sides with respect to \mu, and then show that the constant of integration vanishes. Note that before differentiating the left-hand side, it is convenient first to introduce a change of variable given by a = \mu + \sigma z so that the integral over a is replaced by an integral over z. When we differentiate the left-hand side of the relation (4.152), we will then obtain a Gaussian integral over z that can be evaluated analytically.
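As promised in Exercise 4.10, here is a code sketch (mine, not the book's) of the maximum-likelihood estimators (4.159) and (4.161)–(4.163). It assumes T is an N × K matrix of 1-of-K target vectors and Phi an N × M matrix of feature vectors; all names are illustrative.

```python
import numpy as np

def shared_cov_gaussian_ml(Phi, T):
    """ML estimates (4.159), (4.161)-(4.163) for Gaussian class-conditionals
    with a shared covariance, given features Phi (N x M) and one-hot targets T (N x K)."""
    N, K = T.shape
    Nk = T.sum(axis=0)                      # class counts N_k
    pi = Nk / N                             # prior probabilities (4.159)
    mu = (T.T @ Phi) / Nk[:, None]          # class means (4.161)
    Sigma = np.zeros((Phi.shape[1], Phi.shape[1]))
    for k in range(K):
        diff = Phi - mu[k]                  # deviations from the class-k mean
        Sk = (diff.T * T[:, k]) @ diff / Nk[k]   # per-class covariance (4.163)
        Sigma += (Nk[k] / N) * Sk           # weighted average (4.162)
    return pi, mu, Sigma
```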

5 Neural Networks

In Chapters 3 and 4 we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw that such models have useful analytical and computational properties but that their practical applicability was limited by the curse of dimensionality. In order to apply such models to large-scale problems, it is necessary to adapt the basis functions to the data.

Support vector machines (SVMs), discussed in Chapter 7, address this by first defining basis functions that are centred on the training data points and then selecting a subset of these during training. One advantage of SVMs is that, although the training involves nonlinear optimization, the objective function is convex, and so the solution of the optimization problem is relatively straightforward. The number of basis functions in the resulting models is generally much smaller than the number of training points, although it is often still relatively large and typically increases with the size of the training set. The relevance vector machine, discussed in Section 7.2, also chooses a subset from a fixed set of basis functions and typically results in much sparser models.

Unlike the SVM, it also produces probabilistic outputs, although this is at the expense of a nonconvex optimization during training.

An alternative approach is to fix the number of basis functions in advance but allow them to be adaptive, in other words to use parametric forms for the basis functions in which the parameter values are adapted during training. The most successful model of this type in the context of pattern recognition is the feed-forward neural network, also known as the multilayer perceptron, discussed in this chapter. In fact, 'multilayer perceptron' is really a misnomer, because the model comprises multiple layers of logistic regression models (with continuous nonlinearities) rather than multiple perceptrons (with discontinuous nonlinearities). For many applications, the resulting model can be significantly more compact, and hence faster to evaluate, than a support vector machine having the same generalization performance. The price to be paid for this compactness, as with the relevance vector machine, is that the likelihood function, which forms the basis for network training, is no longer a convex function of the model parameters. In practice, however, it is often worth investing substantial computational resources during the training phase in order to obtain a compact model that is fast at processing new data.

The term 'neural network' has its origins in attempts to find mathematical representations of information processing in biological systems (McCulloch and Pitts, 1943; Widrow and Hoff, 1960; Rosenblatt, 1962; Rumelhart et al., 1986). Indeed, it has been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims regarding their biological plausibility. From the perspective of practical applications of pattern recognition, however, biological realism would impose entirely unnecessary constraints. Our focus in this chapter is therefore on neural networks as efficient models for statistical pattern recognition. In particular, we shall restrict our attention to the specific class of neural networks that have proven to be of greatest practical value, namely the multilayer perceptron.

We begin by considering the functional form of the network model, including the specific parameterization of the basis functions, and we then discuss the problem of determining the network parameters within a maximum likelihood framework, which involves the solution of a nonlinear optimization problem. This requires the evaluation of derivatives of the log likelihood function with respect to the network parameters, and we shall see how these can be obtained efficiently using the technique of error backpropagation. We shall also show how the backpropagation framework can be extended to allow other derivatives to be evaluated, such as the Jacobian and Hessian matrices. Next we discuss various approaches to regularization of neural network training and the relationships between them. We also consider some extensions to the neural network model, and in particular we describe a general framework for modelling conditional probability distributions known as mixture density networks. Finally, we discuss the use of Bayesian treatments of neural networks. Additional background on neural network models can be found in Bishop (1995a).

5.1. Feed-forward Network Functions

The linear models for regression and classification discussed in Chapters 3 and 4, respectively, are based on linear combinations of fixed nonlinear basis functions \phi_j(\mathbf{x}) and take the form

y(\mathbf{x}, \mathbf{w}) = f\!\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)    (5.1)

where f(\cdot) is a nonlinear activation function in the case of classification and is the identity in the case of regression. Our goal is to extend this model by making the basis functions \phi_j(\mathbf{x}) depend on parameters and then to allow these parameters to be adjusted, along with the coefficients \{w_j\}, during training. There are, of course, many ways to construct parametric nonlinear basis functions. Neural networks use basis functions that follow the same form as (5.1), so that each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients in the linear combination are adaptive parameters.

This leads to the basic neural network model, which can be described as a series of functional transformations. First we construct M linear combinations of the input variables x_1, ..., x_D in the form

a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}    (5.2)

where j = 1, ..., M, and the superscript (1) indicates that the corresponding parameters are in the first 'layer' of the network. We shall refer to the parameters w_{ji}^{(1)} as weights and the parameters w_{j0}^{(1)} as biases, following the nomenclature of Chapter 3. The quantities a_j are known as activations. Each of them is then transformed using a differentiable, nonlinear activation function h(\cdot) to give

z_j = h(a_j).    (5.3)

These quantities correspond to the outputs of the basis functions in (5.1) that, in the context of neural networks, are called hidden units. The nonlinear functions h(\cdot) are generally chosen to be sigmoidal functions such as the logistic sigmoid or the 'tanh' function (Exercise 5.1). Following (5.1), these values are again linearly combined to give output unit activations

a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}    (5.4)

where k = 1, ..., K, and K is the total number of outputs. This transformation corresponds to the second layer of the network, and again the w_{k0}^{(2)} are bias parameters. Finally, the output unit activations are transformed using an appropriate activation function to give a set of network outputs y_k. The choice of activation function is determined by the nature of the data and the assumed distribution of target variables and follows the same considerations as for linear models discussed in Chapters 3 and 4.

Figure 5.1 Network diagram for the two-layer neural network corresponding to (5.7). The input, hidden, and output variables are represented by nodes, and the weight parameters are represented by links between the nodes, in which the bias parameters are denoted by links coming from additional input and hidden variables x_0 and z_0. Arrows denote the direction of information flow through the network during forward propagation.

Thus for standard regression problems, the activation function is the identity so that y_k = a_k. Similarly, for multiple binary classification problems, each output unit activation is transformed using a logistic sigmoid function so that

y_k = \sigma(a_k)    (5.5)

where

\sigma(a) = \frac{1}{1 + \exp(-a)}.    (5.6)

Finally, for multiclass problems, a softmax activation function of the form (4.62) is used. The choice of output unit activation function is discussed in detail in Section 5.2.

We can combine these various stages to give the overall network function that, for sigmoidal output unit activation functions, takes the form

y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=1}^{M} w_{kj}^{(2)} h\!\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)    (5.7)

where the set of all weight and bias parameters have been grouped together into a vector \mathbf{w}. Thus the neural network model is simply a nonlinear function from a set of input variables \{x_i\} to a set of output variables \{y_k\} controlled by a vector \mathbf{w} of adjustable parameters.

This function can be represented in the form of a network diagram as shown in Figure 5.1. The process of evaluating (5.7) can then be interpreted as a forward propagation of information through the network. It should be emphasized that these diagrams do not represent probabilistic graphical models of the kind to be considered in Chapter 8 because the internal nodes represent deterministic variables rather than stochastic ones. For this reason, we have adopted a slightly different graphical notation for the two kinds of model.

We shall see later how to give a probabilistic interpretation to a neural network.

As discussed in Section 3.1, the bias parameters in (5.2) can be absorbed into the set of weight parameters by defining an additional input variable x_0 whose value is clamped at x_0 = 1, so that (5.2) takes the form

a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i.    (5.8)

We can similarly absorb the second-layer biases into the second-layer weights, so that the overall network function becomes

y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=0}^{M} w_{kj}^{(2)} h\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right).    (5.9)

As can be seen from Figure 5.1, the neural network model comprises two stages of processing, each of which resembles the perceptron model of Section 4.1.7, and for this reason the neural network is also known as the multilayer perceptron, or MLP. A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities. This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in network training.

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs because information is lost in the dimensionality reduction at the hidden units. In Section 12.4.2, we show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.

The network architecture shown in Figure 5.1 is the most commonly used one in practice. However, it is easily generalized, for instance by considering additional layers of processing each consisting of a weighted linear combination of the form (5.4) followed by an element-wise transformation using a nonlinear activation function. Note that there is some confusion in the literature regarding the terminology for counting the number of layers in such networks. Thus the network in Figure 5.1 may be described as a 3-layer network (which counts the number of layers of units, and treats the inputs as units) or sometimes as a single-hidden-layer network (which counts the number of layers of hidden units). We recommend a terminology in which Figure 5.1 is called a two-layer network, because it is the number of layers of adaptive weights that is important for determining the network properties.
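To make the forward computation concrete, here is a minimal sketch (not from the book) of the network function (5.9) in NumPy, assuming the biases have been absorbed as in (5.8), taking h to be 'tanh' and the outputs to be logistic sigmoids; the weight-matrix shapes are illustrative.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward propagation (5.9) for a two-layer network with absorbed biases.
    W1 has shape M x (D+1), W2 has shape K x (M+1); x is a length-D input vector."""
    x_tilde = np.concatenate(([1.0], x))      # prepend x0 = 1 to absorb first-layer biases
    a = W1 @ x_tilde                          # first-layer activations (5.8)
    z = np.tanh(a)                            # hidden-unit outputs (5.3)
    z_tilde = np.concatenate(([1.0], z))      # prepend z0 = 1 for the second-layer biases
    return 1.0 / (1.0 + np.exp(-(W2 @ z_tilde)))   # sigmoidal output activations (5.5)
```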

Another generalization of the network architecture is to include skip-layer connections, each of which is associated with a corresponding adaptive parameter. For instance, in a two-layer network these would go directly from inputs to outputs. In principle, a network with sigmoidal hidden units can always mimic skip-layer connections (for bounded input values) by using a sufficiently small first-layer weight that, over its operating range, the hidden unit is effectively linear, and then compensating with a large weight value from the hidden unit to the output. In practice, however, it may be advantageous to include skip-layer connections explicitly.

Furthermore, the network can be sparse, with not all possible connections within a layer being present. We shall see an example of a sparse network architecture when we consider convolutional neural networks in Section 5.5.6.

Because there is a direct correspondence between a network diagram and its mathematical function, we can develop more general network mappings by considering more complex network diagrams. However, these must be restricted to a feed-forward architecture, in other words to one having no closed directed cycles, to ensure that the outputs are deterministic functions of the inputs. This is illustrated with a simple example in Figure 5.2.

Figure 5.2 Example of a neural network having a general feed-forward topology. Note that each hidden and output unit has an associated bias parameter (omitted for clarity).

Each (hidden or output) unit in such a network computes a function given by

z_k = h\!\left( \sum_{j} w_{kj} z_j \right)    (5.10)

where the sum runs over all units that send connections to unit k (and a bias parameter is included in the summation). For a given set of values applied to the inputs of the network, successive application of (5.10) allows the activations of all units in the network to be evaluated, including those of the output units.
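Equation (5.10) suggests a simple evaluation procedure for such general topologies: visit the units in an order compatible with the feed-forward (acyclic) structure and apply (5.10) at each one. The sketch below is an illustrative assumption about how one might organize this, not an implementation from the book; unit names and the uniform use of h at every unit (including the output, which in practice may use a different activation) are purely for demonstration.

```python
import numpy as np

def forward_dag(x, units, h=np.tanh):
    """Evaluate a general feed-forward network via (5.10).
    `units` is a list, in topological order, of (unit_id, [(source_id, weight), ...]);
    input values x are bound to names "x0", "x1", ..., and "bias" is fixed at 1."""
    values = {f"x{i}": xi for i, xi in enumerate(x)}
    values["bias"] = 1.0
    for unit_id, incoming in units:
        a = sum(w * values[src] for src, w in incoming)   # weighted sum over incoming links
        values[unit_id] = h(a)                            # nonlinear activation (5.10)
    return values

# A tiny network in the spirit of Figure 5.2: two inputs, two hidden units
# (z2 also receives a connection from z1), and one output unit.
units = [
    ("z1", [("bias", 0.1), ("x0", 0.5), ("x1", -0.3)]),
    ("z2", [("bias", -0.2), ("x0", 0.8), ("z1", 1.2)]),
    ("y",  [("bias", 0.0), ("z1", 0.7), ("z2", -0.5)]),
]
print(forward_dag([1.0, 2.0], units)["y"])
```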

The approximation properties of feed-forward networks have been widely studied (Funahashi, 1989; Cybenko, 1989; Hornik et al., 1989; Stinchcombe and White, 1989; Cotter, 1990; Ito, 1991; Hornik, 1991; Kreinovich, 1991; Ripley, 1996) and found to be very general. Neural networks are therefore said to be universal approximators. For example, a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy provided the network has a sufficiently large number of hidden units. This result holds for a wide range of hidden unit activation functions, but excluding polynomials. Although such theorems are reassuring, the key problem is how to find suitable parameter values given a set of training data, and in later sections of this chapter we will show that there exist effective solutions to this problem based on both maximum likelihood and Bayesian approaches.

The capability of a two-layer network to model a broad range of functions is illustrated in Figure 5.3. This figure also shows how individual hidden units work collaboratively to approximate the final function. The role of hidden units in a simple classification problem is illustrated in Figure 5.4 using the synthetic classification data set described in Appendix A.

Figure 5.3 Illustration of the capability of a multilayer perceptron to approximate four different functions comprising (a) f(x) = x^2, (b) f(x) = sin(x), (c) f(x) = |x|, and (d) f(x) = H(x) where H(x) is the Heaviside step function. In each case, N = 50 data points, shown as blue dots, have been sampled uniformly in x over the interval (-1, 1) and the corresponding values of f(x) evaluated. These data points are then used to train a two-layer network having 3 hidden units with 'tanh' activation functions and linear output units. The resulting network functions are shown by the red curves, and the outputs of the three hidden units are shown by the three dashed curves.

5.1.1 Weight-space symmetries

One property of feed-forward networks, which will play a role when we consider Bayesian model comparison, is that multiple distinct choices for the weight vector \mathbf{w} can all give rise to the same mapping function from inputs to outputs (Chen et al., 1993). Consider a two-layer network of the form shown in Figure 5.1 with M hidden units having 'tanh' activation functions and full connectivity in both layers. If we change the sign of all of the weights and the bias feeding into a particular hidden unit, then, for a given input pattern, the sign of the activation of the hidden unit will be reversed, because 'tanh' is an odd function, so that tanh(-a) = -tanh(a). This transformation can be exactly compensated by changing the sign of all of the weights leading out of that hidden unit. Thus, by changing the signs of a particular group of weights (and a bias), the input-output mapping function represented by the network is unchanged, and so we have found two different weight vectors that give rise to the same mapping function. For M hidden units, there will be M such 'sign-flip' symmetries, and thus any given weight vector will be one of a set of 2^M equivalent weight vectors.

Figure 5.4 Example of the solution of a simple two-class classification problem involving synthetic data using a neural network having two inputs, two hidden units with 'tanh' activation functions, and a single output having a logistic sigmoid activation function. The dashed blue lines show the z = 0.5 contours for each of the hidden units, and the red line shows the y = 0.5 decision surface for the network. For comparison, the green line denotes the optimal decision boundary computed from the distributions used to generate the data.

Similarly, imagine that we interchange the values of all of the weights (and the bias) leading both into and out of a particular hidden unit with the corresponding values of the weights (and bias) associated with a different hidden unit. Again, this clearly leaves the network input-output mapping function unchanged, but it corresponds to a different choice of weight vector. For M hidden units, any given weight vector will belong to a set of M! equivalent weight vectors associated with this interchange symmetry, corresponding to the M! different orderings of the hidden units. The network will therefore have an overall weight-space symmetry factor of M! 2^M. For networks with more than two layers of weights, the total level of symmetry will be given by the product of such factors, one for each layer of hidden units.

It turns out that these factors account for all of the symmetries in weight space (except for possible accidental symmetries due to specific choices for the weight values). Furthermore, the existence of these symmetries is not a particular property of the 'tanh' function but applies to a wide range of activation functions (Kůrková and Kainen, 1994). In many cases, these symmetries in weight space are of little practical consequence, although in Section 5.7 we shall encounter a situation in which we need to take them into account.
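The 'sign-flip' symmetry is easy to demonstrate numerically. The sketch below (not from the book; biases are omitted for brevity, which does not affect the argument, since a bias would simply flip sign along with the other weights into the unit) flips the weights into and out of one 'tanh' hidden unit and confirms that the network outputs are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights into 4 hidden units (3 inputs, no biases)
W2 = rng.normal(size=(2, 4))   # weights from hidden units to 2 outputs

def net(x, W1, W2):
    return W2 @ np.tanh(W1 @ x)

# Flip the sign of all weights into hidden unit 2 and of all weights out of it.
W1_flip, W2_flip = W1.copy(), W2.copy()
W1_flip[2, :] *= -1.0
W2_flip[:, 2] *= -1.0

x = rng.normal(size=3)
print(np.allclose(net(x, W1, W2), net(x, W1_flip, W2_flip)))   # True: tanh is odd
```

The interchange symmetry can be demonstrated in the same way by swapping two rows of W1 together with the corresponding two columns of W2.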

5.2. Network Training

So far, we have viewed neural networks as a general class of parametric nonlinear functions from a vector \mathbf{x} of input variables to a vector \mathbf{y} of output variables. A simple approach to the problem of determining the network parameters is to make an analogy with the discussion of polynomial curve fitting in Section 1.1, and therefore to minimize a sum-of-squares error function. Given a training set comprising a set of input vectors \{\mathbf{x}_n\}, where n = 1, ..., N, together with a corresponding set of target vectors \{\mathbf{t}_n\}, we minimize the error function

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \|^2.    (5.11)

However, we can provide a much more general view of network training by first giving a probabilistic interpretation to the network outputs. We have already seen many advantages of using probabilistic predictions in Section 1.5.4. Here it will also provide us with a clearer motivation both for the choice of output unit nonlinearity and the choice of error function.

We start by discussing regression problems, and for the moment we consider a single target variable t that can take any real value. Following the discussions in Sections 1.2.5 and 3.1, we assume that t has a Gaussian distribution with an \mathbf{x}-dependent mean, which is given by the output of the neural network, so that

p(t | \mathbf{x}, \mathbf{w}) = \mathcal{N}(t | y(\mathbf{x}, \mathbf{w}), \beta^{-1})    (5.12)

where \beta is the precision (inverse variance) of the Gaussian noise. Of course this is a somewhat restrictive assumption, and in Section 5.6 we shall see how to extend this approach to allow for more general conditional distributions. For the conditional distribution given by (5.12), it is sufficient to take the output unit activation function to be the identity, because such a network can approximate any continuous function from \mathbf{x} to y. Given a data set of N independent, identically distributed observations \mathbf{X} = \{\mathbf{x}_1, ..., \mathbf{x}_N\}, along with corresponding target values \mathbf{t} = \{t_1, ..., t_N\}, we can construct the corresponding likelihood function

p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n | \mathbf{x}_n, \mathbf{w}, \beta).

Taking the negative logarithm, we obtain the error function

\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)    (5.13)

which can be used to learn the parameters \mathbf{w} and \beta. In Section 5.7, we shall discuss the Bayesian treatment of neural networks, while here we consider a maximum likelihood approach. Note that in the neural networks literature, it is usual to consider the minimization of an error function rather than the maximization of the (log) likelihood, and so here we shall follow this convention. Consider first the determination of \mathbf{w}. Maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function given by

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2    (5.14)

where we have discarded additive and multiplicative constants.
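As a small illustration (mine, not the book's), both the full negative log likelihood (5.13) and its w-dependent part (5.14) are one-liners once the network outputs for the training set have been collected into an array:

```python
import numpy as np

def sum_of_squares_error(y, t):
    """Sum-of-squares error (5.14); y and t are length-N arrays of network
    outputs y(x_n, w) and targets t_n."""
    return 0.5 * np.sum((y - t) ** 2)

def negative_log_likelihood(y, t, beta):
    """Gaussian negative log likelihood (5.13), of which (5.14) keeps only
    the part that depends on the weights."""
    N = len(t)
    return 0.5 * beta * np.sum((y - t) ** 2) - 0.5 * N * np.log(beta) + 0.5 * N * np.log(2 * np.pi)
```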

The value of \mathbf{w} found by minimizing E(\mathbf{w}) will be denoted \mathbf{w}_{\mathrm{ML}} because it corresponds to the maximum likelihood solution. In practice, the nonlinearity of the network function y(\mathbf{x}_n, \mathbf{w}) causes the error E(\mathbf{w}) to be nonconvex, and so in practice local maxima of the likelihood may be found, corresponding to local minima of the error function, as discussed in Section 5.2.1.

Having found \mathbf{w}_{\mathrm{ML}}, the value of \beta can be found by minimizing the negative log likelihood to give

\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - t_n \}^2.    (5.15)

Note that this can be evaluated once the iterative optimization required to find \mathbf{w}_{\mathrm{ML}} is completed. If we have multiple target variables, and we assume that they are independent conditional on \mathbf{x} and \mathbf{w} with shared noise precision \beta, then the conditional distribution of the target values is given by

p(\mathbf{t} | \mathbf{x}, \mathbf{w}) = \mathcal{N}(\mathbf{t} | \mathbf{y}(\mathbf{x}, \mathbf{w}), \beta^{-1} \mathbf{I}).    (5.16)

Following the same argument as for a single target variable, we see that the maximum likelihood weights are determined by minimizing the sum-of-squares error function (5.11) (Exercise 5.2). The noise precision is then given by

\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{NK} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - \mathbf{t}_n \|^2    (5.17)

where K is the number of target variables. The assumption of independence can be dropped at the expense of a slightly more complex optimization problem (Exercise 5.3).

Recall from Section 4.3.6 that there is a natural pairing of the error function (given by the negative log likelihood) and the output unit activation function. In the regression case, we can view the network as having an output activation function that is the identity, so that y_k = a_k. The corresponding sum-of-squares error function has the property

\frac{\partial E}{\partial a_k} = y_k - t_k    (5.18)

which we shall make use of when discussing error backpropagation in Section 5.3.

Now consider the case of binary classification in which we have a single target variable t such that t = 1 denotes class \mathcal{C}_1 and t = 0 denotes class \mathcal{C}_2. Following the discussion of canonical link functions in Section 4.3.6, we consider a network having a single output whose activation function is a logistic sigmoid

y = \sigma(a) \equiv \frac{1}{1 + \exp(-a)}    (5.19)

so that 0 \leq y(\mathbf{x}, \mathbf{w}) \leq 1. We can interpret y(\mathbf{x}, \mathbf{w}) as the conditional probability p(\mathcal{C}_1 | \mathbf{x}), with p(\mathcal{C}_2 | \mathbf{x}) given by 1 - y(\mathbf{x}, \mathbf{w}). The conditional distribution of targets given inputs is then a Bernoulli distribution of the form

p(t | \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \{ 1 - y(\mathbf{x}, \mathbf{w}) \}^{1-t}.    (5.20)

If we consider a training set of independent observations, then the error function, which is given by the negative log likelihood, is then a cross-entropy error function of the form

E(\mathbf{w}) = - \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}    (5.21)

where y_n denotes y(\mathbf{x}_n, \mathbf{w}). Note that there is no analogue of the noise precision \beta because the target values are assumed to be correctly labelled. However, the model is easily extended to allow for labelling errors (Exercise 5.4). Simard et al. (2003) found that using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization.

If we have K separate binary classifications to perform, then we can use a network having K outputs each of which has a logistic sigmoid activation function. Associated with each output is a binary class label t_k \in \{0, 1\}, where k = 1, ..., K. If we assume that the class labels are independent, given the input vector, then the conditional distribution of the targets is

p(\mathbf{t} | \mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k} [1 - y_k(\mathbf{x}, \mathbf{w})]^{1 - t_k}.    (5.22)

Taking the negative logarithm of the corresponding likelihood function then gives the following error function (Exercise 5.5)

E(\mathbf{w}) = - \sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}    (5.23)

where y_{nk} denotes y_k(\mathbf{x}_n, \mathbf{w}). Again, the derivative of the error function with respect to the activation for a particular output unit takes the form (5.18) just as in the regression case (Exercise 5.6).

It is interesting to contrast the neural network solution to this problem with the corresponding approach based on a linear classification model of the kind discussed in Chapter 4. Suppose that we are using a standard two-layer network of the kind shown in Figure 5.1. We see that the weight parameters in the first layer of the network are shared between the various outputs, whereas in the linear model each classification problem is solved independently. The first layer of the network can be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.

Finally, we consider the standard multiclass classification problem in which each input is assigned to one of K mutually exclusive classes. The binary target variables t_k \in \{0, 1\} have a 1-of-K coding scheme indicating the class, and the network outputs are interpreted as y_k(\mathbf{x}, \mathbf{w}) = p(t_k = 1 | \mathbf{x}), leading to the following error function

E(\mathbf{w}) = - \sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w}).    (5.24)
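The three error functions (5.21), (5.23), and (5.24) are straightforward to state in code. The sketch below is illustrative rather than canonical; the clipping constant is a numerical safeguard I have added, not something from the text.

```python
import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    """Cross-entropy error (5.21)/(5.23) for sigmoid outputs y and binary targets t,
    given as arrays of matching shape (N, or N x K for multiple binary outputs)."""
    y = np.clip(y, eps, 1.0 - eps)          # guard the logarithms numerically
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def multiclass_cross_entropy(Y, T, eps=1e-12):
    """Multiclass cross-entropy error (5.24) for softmax outputs Y and 1-of-K targets T,
    both of shape N x K."""
    return -np.sum(T * np.log(np.clip(Y, eps, None)))
```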

Following the discussion of Section 4.3.4, we see that the output unit activation function, which corresponds to the canonical link, is given by the softmax function

y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_j \exp(a_j(\mathbf{x}, \mathbf{w}))}    (5.25)

which satisfies 0 \leq y_k \leq 1 and \sum_k y_k = 1. Note that the y_k(\mathbf{x}, \mathbf{w}) are unchanged if a constant is added to all of the a_k(\mathbf{x}, \mathbf{w}), causing the error function to be constant for some directions in weight space. This degeneracy is removed if an appropriate regularization term (Section 5.5) is added to the error function. Once again, the derivative of the error function with respect to the activation for a particular output unit takes the familiar form (5.18) (Exercise 5.7).

In summary, there is a natural choice of both output unit activation function and matching error function, according to the type of problem being solved. For regression we use linear outputs and a sum-of-squares error, for (multiple independent) binary classifications we use logistic sigmoid outputs and a cross-entropy error function, and for multiclass classification we use softmax outputs with the corresponding multiclass cross-entropy error function. For classification problems involving two classes, we can use a single logistic sigmoid output, or alternatively we can use a network with two outputs having a softmax output activation function.
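The softmax outputs (5.25), and the shift invariance noted above, can be checked directly. A small sketch (mine, not the book's; subtracting the maximum activation is a standard numerical device that exploits exactly this invariance to avoid overflow):

```python
import numpy as np

def softmax(a):
    """Softmax outputs (5.25); subtracting max(a) leaves the result unchanged
    because of the shift invariance, while preventing overflow in exp."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
print(softmax(a), softmax(a + 100.0))   # identical: adding a constant to all a_k leaves y_k unchanged
```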

5.2.1 Parameter optimization

We turn next to the task of finding a weight vector \mathbf{w} which minimizes the chosen function E(\mathbf{w}). At this point, it is useful to have a geometrical picture of the error function, which we can view as a surface sitting over weight space as shown in Figure 5.5.

Figure 5.5 Geometrical view of the error function E(\mathbf{w}) as a surface sitting over weight space. Point \mathbf{w}_A is a local minimum and \mathbf{w}_B is the global minimum. At any point \mathbf{w}_C, the local gradient of the error surface is given by the vector \nabla E.

First note that if we make a small step in weight space from \mathbf{w} to \mathbf{w} + \delta\mathbf{w}, then the change in the error function is \delta E \simeq \delta\mathbf{w}^T \nabla E(\mathbf{w}), where the vector \nabla E(\mathbf{w}) points in the direction of greatest rate of increase of the error function. Because the error E(\mathbf{w}) is a smooth continuous function of \mathbf{w}, its smallest value will occur at a point in weight space such that the gradient of the error function vanishes, so that

\nabla E(\mathbf{w}) = 0    (5.26)

as otherwise we could make a small step in the direction of -\nabla E(\mathbf{w}) and thereby further reduce the error. Points at which the gradient vanishes are called stationary points, and may be further classified into minima, maxima, and saddle points.

Our goal is to find a vector \mathbf{w} such that E(\mathbf{w}) takes its smallest value. However, the error function typically has a highly nonlinear dependence on the weights and bias parameters, and so there will be many points in weight space at which the gradient vanishes (or is numerically very small). Indeed, from the discussion in Section 5.1.1 we see that for any point \mathbf{w} that is a local minimum, there will be other points in weight space that are equivalent minima. For instance, in a two-layer network of the kind shown in Figure 5.1, with M hidden units, each point in weight space is a member of a family of M! 2^M equivalent points.

Furthermore, there will typically be multiple inequivalent stationary points and in particular multiple inequivalent minima. A minimum that corresponds to the smallest value of the error function for any weight vector is said to be a global minimum. Any other minima corresponding to higher values of the error function are said to be local minima. For a successful application of neural networks, it may not be necessary to find the global minimum (and in general it will not be known whether the global minimum has been found), but it may be necessary to compare several local minima in order to find a sufficiently good solution.

Because there is clearly no hope of finding an analytical solution to the equation \nabla E(\mathbf{w}) = 0, we resort to iterative numerical procedures. The optimization of continuous nonlinear functions is a widely studied problem and there exists an extensive literature on how to solve it efficiently. Most techniques involve choosing some initial value \mathbf{w}^{(0)} for the weight vector and then moving through weight space in a succession of steps of the form

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)}    (5.27)

where \tau labels the iteration step. Different algorithms involve different choices for the weight vector update \Delta\mathbf{w}^{(\tau)}. Many algorithms make use of gradient information and therefore require that, after each update, the value of \nabla E(\mathbf{w}) is evaluated at the new weight vector \mathbf{w}^{(\tau+1)}. In order to understand the importance of gradient information, it is useful to consider a local approximation to the error function based on a Taylor expansion.

5.2.2 Local quadratic approximation

Insight into the optimization problem, and into the various techniques for solving it, can be obtained by considering a local quadratic approximation to the error function.

Consider the Taylor expansion of E(\mathbf{w}) around some point \widehat{\mathbf{w}} in weight space

E(\mathbf{w}) \simeq E(\widehat{\mathbf{w}}) + (\mathbf{w} - \widehat{\mathbf{w}})^T \mathbf{b} + \frac{1}{2} (\mathbf{w} - \widehat{\mathbf{w}})^T \mathbf{H} (\mathbf{w} - \widehat{\mathbf{w}})    (5.28)

where cubic and higher terms have been omitted.

Here \mathbf{b} is defined to be the gradient of E evaluated at \widehat{\mathbf{w}}

\mathbf{b} \equiv \nabla E \big|_{\mathbf{w} = \widehat{\mathbf{w}}}    (5.29)

and the Hessian matrix \mathbf{H} = \nabla\nabla E has elements

(\mathbf{H})_{ij} \equiv \frac{\partial^2 E}{\partial w_i \, \partial w_j} \bigg|_{\mathbf{w} = \widehat{\mathbf{w}}}.    (5.30)

From (5.28), the corresponding local approximation to the gradient is given by

\nabla E \simeq \mathbf{b} + \mathbf{H} (\mathbf{w} - \widehat{\mathbf{w}}).    (5.31)

For points \mathbf{w} that are sufficiently close to \widehat{\mathbf{w}}, these expressions will give reasonable approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point \mathbf{w}^{\star} that is a minimum of the error function. In this case there is no linear term, because \nabla E = 0 at \mathbf{w}^{\star}, and (5.28) becomes

E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^T \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star})    (5.32)

where the Hessian \mathbf{H} is evaluated at \mathbf{w}^{\star}. In order to interpret this geometrically, consider the eigenvalue equation for the Hessian matrix

\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i    (5.33)

where the eigenvectors \mathbf{u}_i form a complete orthonormal set (Appendix C) so that

\mathbf{u}_i^T \mathbf{u}_j = \delta_{ij}.    (5.34)

We now expand (\mathbf{w} - \mathbf{w}^{\star}) as a linear combination of the eigenvectors in the form

\mathbf{w} - \mathbf{w}^{\star} = \sum_i \alpha_i \mathbf{u}_i.    (5.35)

This can be regarded as a transformation of the coordinate system in which the origin is translated to the point \mathbf{w}^{\star}, and the axes are rotated to align with the eigenvectors (through the orthogonal matrix whose columns are the \mathbf{u}_i), and is discussed in more detail in Appendix C. Substituting (5.35) into (5.32), and using (5.33) and (5.34), allows the error function to be written in the form

E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2.    (5.36)

A matrix \mathbf{H} is said to be positive definite if, and only if,

\mathbf{v}^T \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v}.    (5.37)

Because the eigenvectors \{\mathbf{u}_i\} form a complete set, an arbitrary vector \mathbf{v} can be written in the form

\mathbf{v} = \sum_i c_i \mathbf{u}_i.    (5.38)

From (5.33) and (5.34), we then have

\mathbf{v}^T \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i    (5.39)

and so \mathbf{H} will be positive definite if, and only if, all of its eigenvalues are positive (Exercise 5.10). In the new coordinate system, whose basis vectors are given by the eigenvectors \{\mathbf{u}_i\}, the contours of constant E are ellipses centred on the origin, as illustrated in Figure 5.6 (Exercise 5.11).

Figure 5.6 In the neighbourhood of a minimum \mathbf{w}^{\star}, the error function can be approximated by a quadratic. Contours of constant error are then ellipses whose axes are aligned with the eigenvectors \mathbf{u}_i of the Hessian matrix, with lengths that are inversely proportional to the square roots of the corresponding eigenvalues \lambda_i.

For a one-dimensional weight space, a stationary point w^{\star} will be a minimum if

\frac{\partial^2 E}{\partial w^2} \bigg|_{w^{\star}} > 0.    (5.40)

The corresponding result in D dimensions is that the Hessian matrix, evaluated at \mathbf{w}^{\star}, should be positive definite (Exercise 5.12).
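The eigenvalue criterion implied by (5.38) and (5.39) — H is positive definite exactly when every \lambda_i > 0 — translates directly into a numerical test. A minimal sketch (not from the book; the tolerance is an arbitrary numerical choice):

```python
import numpy as np

def is_positive_definite(H, tol=1e-10):
    """Test positive definiteness of a symmetric Hessian via its eigenvalues,
    using (5.38)-(5.39): v^T H v = sum_i c_i^2 lambda_i > 0 for all v
    exactly when every eigenvalue lambda_i is positive."""
    eigvals = np.linalg.eigvalsh(H)     # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals > tol))

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(is_positive_definite(H), np.linalg.eigvalsh(H))
```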

5.2.3 Use of gradient information

As we shall see in Section 5.3, it is possible to evaluate the gradient of an error function efficiently by means of the backpropagation procedure. The use of this gradient information can lead to significant improvements in the speed with which the minima of the error function can be located. We can see why this is so, as follows.

In the quadratic approximation to the error function, given in (5.28), the error surface is specified by the quantities \mathbf{b} and \mathbf{H}, which contain a total of W(W + 3)/2 independent elements (because the matrix \mathbf{H} is symmetric), where W is the dimensionality of \mathbf{w} (i.e., the total number of adaptive parameters in the network) (Exercise 5.13). The location of the minimum of this quadratic approximation therefore depends on O(W^2) parameters, and we should not expect to be able to locate the minimum until we have gathered O(W^2) independent pieces of information. If we do not make use of gradient information, we would expect to have to perform O(W^2) function evaluations, each of which would require O(W) steps. Thus, the computational effort needed to find the minimum using such an approach would be O(W^3).

Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of \nabla E brings W items of information, we might hope to find the minimum of the function in O(W) gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only O(W) steps and so the minimum can now be found in O(W^2) steps. For this reason, the use of gradient information forms the basis of practical algorithms for training neural networks.

5.2.4 Gradient descent optimization

The simplest approach to using gradient information is to choose the weight update in (5.27) to comprise a small step in the direction of the negative gradient, so that

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)})    (5.41)

where the parameter \eta > 0 is known as the learning rate. After each such update, the gradient is re-evaluated for the new weight vector and the process repeated. Note that the error function is defined with respect to a training set, and so each step requires that the entire training set be processed in order to evaluate \nabla E. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent. Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm, for reasons discussed in Bishop and Nabney (2008).

For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent (Gill et al., 1981; Fletcher, 1987; Nocedal and Wright, 1999). Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum.

In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set.

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets (Le Cun et al., 1989). Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point

E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}).    (5.42)

On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time, so that

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n(\mathbf{w}^{(\tau)}).    (5.43)

This update is repeated by cycling through the data either in sequence or by selecting points at random with replacement. There are of course intermediate scenarios in which the updates are based on batches of data points.
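The updates (5.41) and (5.43) are simple enough to state as code. The following sketch is illustrative (the callback names grad_E and grad_En are my own); the stochastic version visits the data points in a fresh random order each pass, one of the scheduling choices mentioned above.

```python
import numpy as np

def gradient_descent(w, grad_E, eta, steps):
    """Batch gradient descent (5.41); grad_E(w) returns the gradient over the whole training set."""
    for _ in range(steps):
        w = w - eta * grad_E(w)
    return w

def stochastic_gradient_descent(w, grad_En, N, eta, epochs, seed=0):
    """On-line (stochastic) gradient descent (5.43); grad_En(w, n) returns the
    gradient of the single-data-point error E_n for point n."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(N):
            w = w - eta * grad_En(w, n)
    return w
```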

One advantage of on-line methods compared to batch methods is that the former handle redundancy in the data much more efficiently. To see this, consider an extreme example in which we take a data set and double its size by duplicating every data point. Note that this simply multiplies the error function by a factor of 2 and so is equivalent to using the original error function. Batch methods will require double the computational effort to evaluate the batch error function gradient, whereas on-line methods will be unaffected. Another property of on-line gradient descent is the possibility of escaping from local minima, since a stationary point with respect to the error function for the whole data set will generally not be a stationary point for each data point individually.

Nonlinear optimization algorithms, and their practical application to neural network training, are discussed in detail in Bishop and Nabney (2008).

5.3. Error Backpropagation

Our goal in this section is to find an efficient technique for evaluating the gradient of an error function E(\mathbf{w}) for a feed-forward neural network. We shall see that this can be achieved using a local message passing scheme in which information is sent alternately forwards and backwards through the network and is known as error backpropagation, or sometimes simply as backprop.

It should be noted that the term backpropagation is used in the neural computing literature to mean a variety of different things. For instance, the multilayer perceptron architecture is sometimes called a backpropagation network. The term backpropagation is also used to describe the training of a multilayer perceptron using gradient descent applied to a sum-of-squares error function. In order to clarify the terminology, it is useful to consider the nature of the training process more carefully. Most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps. At each such step, we can distinguish between two distinct stages. In the first stage, the derivatives of the error function with respect to the weights must be evaluated. As we shall see, the important contribution of the backpropagation technique is in providing a computationally efficient method for evaluating such derivatives. Because it is at this stage that errors are propagated backwards through the network, we shall use the term backpropagation specifically to describe the evaluation of derivatives. In the second stage, the derivatives are then used to compute the adjustments to be made to the weights. The simplest such technique, and the one originally considered by Rumelhart et al. (1986), involves gradient descent. It is important to recognize that the two stages are distinct. Thus, the first stage, namely the propagation of errors backwards through the network in order to evaluate derivatives, can be applied to many other kinds of network and not just the multilayer perceptron. It can also be applied to error functions other than just the simple sum-of-squares, and to the evaluation of other derivatives such as the Jacobian and Hessian matrices, as we shall see later in this chapter.

Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes, many of which are substantially more powerful than simple gradient descent.

5.3.1 Evaluation of error-function derivatives

We now derive the backpropagation algorithm for a general network having arbitrary feed-forward topology, arbitrary differentiable nonlinear activation functions, and a broad class of error function. The resulting formulae will then be illustrated using a simple layered network structure having a single layer of sigmoidal hidden units together with a sum-of-squares error.

Many error functions of practical interest, for instance those defined by maximum likelihood for a set of i.i.d. data, comprise a sum of terms, one for each data point in the training set, so that

E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}).    (5.44)

Here we shall consider the problem of evaluating \nabla E_n(\mathbf{w}) for one such term in the error function. This may be used directly for sequential optimization, or the results can be accumulated over the training set in the case of batch methods.

Consider first a simple linear model in which the outputs y_k are linear combinations of the input variables x_i so that

y_k = \sum_i w_{ki} x_i    (5.45)

together with an error function that, for a particular input pattern n, takes the form

E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2    (5.46)

where y_{nk} = y_k(\mathbf{x}_n, \mathbf{w}). The gradient of this error function with respect to a weight w_{ji} is given by

\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni}    (5.47)

which can be interpreted as a 'local' computation involving the product of an 'error signal' y_{nj} - t_{nj} associated with the output end of the link w_{ji} and the variable x_{ni} associated with the input end of the link. In Section 4.3.2, we saw how a similar formula arises with the logistic sigmoid activation function together with the cross-entropy error function, and similarly for the softmax activation function together with its matching cross-entropy error function. We shall now see how this simple result extends to the more complex setting of multilayer feed-forward networks.

In a general feed-forward network, each unit computes a weighted sum of its inputs of the form

a_j = \sum_i w_{ji} z_i    (5.48)

where z_i is the activation of a unit, or input, that sends a connection to unit j, and w_{ji} is the weight associated with that connection.

In Section 5.1, we saw that biases can be included in this sum by introducing an extra unit, or input, with activation fixed at +1. We therefore do not need to deal with biases explicitly. The sum in (5.48) is transformed by a nonlinear activation function h(\cdot) to give the activation z_j of unit j in the form

z_j = h(a_j).    (5.49)

Note that one or more of the variables z_i in the sum in (5.48) could be an input, and similarly, the unit j in (5.49) could be an output.

For each pattern in the training set, we shall suppose that we have supplied the corresponding input vector to the network and calculated the activations of all of the hidden and output units in the network by successive application of (5.48) and (5.49). This process is often called forward propagation because it can be regarded as a forward flow of information through the network.

Now consider the evaluation of the derivative of E_n with respect to a weight w_{ji}. The outputs of the various units will depend on the particular input pattern n. However, in order to keep the notation uncluttered, we shall omit the subscript n from the network variables. First we note that E_n depends on the weight w_{ji} only via the summed input a_j to unit j. We can therefore apply the chain rule for partial derivatives to give

\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}.    (5.50)

We now introduce a useful notation

\delta_j \equiv \frac{\partial E_n}{\partial a_j}    (5.51)

where the \delta's are often referred to as errors for reasons we shall see shortly. Using (5.48), we can write

\frac{\partial a_j}{\partial w_{ji}} = z_i.    (5.52)

Substituting (5.51) and (5.52) into (5.50), we then obtain

\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.    (5.53)

Equation (5.53) tells us that the required derivative is obtained simply by multiplying the value of \delta for the unit at the output end of the weight by the value of z for the unit at the input end of the weight (where z = 1 in the case of a bias). Note that this takes the same form as for the simple linear model considered at the start of this section. Thus, in order to evaluate the derivatives, we need only to calculate the value of \delta_j for each hidden and output unit in the network, and then apply (5.53). As we have seen already, for the output units, we have

\delta_k = y_k - t_k    (5.54)

provided we are using the canonical link as the output-unit activation function.

Figure 5.7 Illustration of the calculation of \delta_j for hidden unit j by backpropagation of the \delta's from those units k to which unit j sends connections. The blue arrow denotes the direction of information flow during forward propagation, and the red arrows indicate the backward propagation of error information.

To evaluate the \delta's for hidden units, we again make use of the chain rule for partial derivatives,

\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}    (5.55)

where the sum runs over all units k to which unit j sends connections. The arrangement of units and weights is illustrated in Figure 5.7. Note that the units labelled k could include other hidden units and/or output units. In writing down (5.55), we are making use of the fact that variations in a_j give rise to variations in the error function only through variations in the variables a_k. If we now substitute the definition of \delta given by (5.51) into (5.55), and make use of (5.48) and (5.49), we obtain the following backpropagation formula

\delta_j = h'(a_j) \sum_k w_{kj} \delta_k    (5.56)

which tells us that the value of \delta for a particular hidden unit can be obtained by propagating the \delta's backwards from units higher up in the network, as illustrated in Figure 5.7. Note that the summation in (5.56) is taken over the first index on w_{kj} (corresponding to backward propagation of information through the network), whereas in the forward propagation equation (5.10) it is taken over the second index. Because we already know the values of the \delta's for the output units, it follows that by recursively applying (5.56) we can evaluate the \delta's for all of the hidden units in a feed-forward network, regardless of its topology.

The backpropagation procedure can therefore be summarized as follows.

Error Backpropagation

1. Apply an input vector \mathbf{x}_n to the network and forward propagate through the network using (5.48) and (5.49) to find the activations of all the hidden and output units.
2. Evaluate the \delta_k for all the output units using (5.54).
3. Backpropagate the \delta's using (5.56) to obtain \delta_j for each hidden unit in the network.
4. Use (5.53) to evaluate the required derivatives.

265 For batch methods, the derivative of the total error $E$ can then be obtained by repeating the above steps for each pattern in the training set and then summing over all patterns:
$$ \frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}. \tag{5.57} $$
In the above derivation we have implicitly assumed that each hidden or output unit in the network has the same activation function $h(\cdot)$. The derivation is easily generalized, however, to allow different units to have individual activation functions, simply by keeping track of which form of $h(\cdot)$ goes with which unit.

5.3.2 A simple example
The above derivation of the backpropagation procedure allowed for general forms for the error function, the activation functions, and the network topology. In order to illustrate the application of this algorithm, we shall consider a particular example. This is chosen both for its simplicity and for its practical importance, because many applications of neural networks reported in the literature make use of this type of network. Specifically, we shall consider a two-layer network of the form illustrated in Figure 5.1, together with a sum-of-squares error, in which the output units have linear activation functions, so that $y_k = a_k$, while the hidden units have logistic sigmoid activation functions given by
$$ h(a) \equiv \tanh(a) \tag{5.58} $$
where
$$ \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}. \tag{5.59} $$
A useful feature of this function is that its derivative can be expressed in a particularly simple form:
$$ h'(a) = 1 - h(a)^2. \tag{5.60} $$
We also consider a standard sum-of-squares error function, so that for pattern $n$ the error is given by
$$ E_n = \frac{1}{2}\sum_{k=1}^{K}(y_k - t_k)^2 \tag{5.61} $$
where $y_k$ is the activation of output unit $k$, and $t_k$ is the corresponding target, for a particular input pattern $\mathbf{x}_n$.

For each pattern in the training set in turn, we first perform a forward propagation using
$$ a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i \tag{5.62} $$
$$ z_j = \tanh(a_j) \tag{5.63} $$
$$ y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j. \tag{5.64} $$

266 Next we compute the $\delta$'s for each output unit using
$$ \delta_k = y_k - t_k. \tag{5.65} $$
Then we backpropagate these to obtain $\delta$'s for the hidden units using
$$ \delta_j = (1 - z_j^2)\sum_{k=1}^{K} w_{kj}\,\delta_k. \tag{5.66} $$
Finally, the derivatives with respect to the first-layer and second-layer weights are given by
$$ \frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j x_i, \qquad \frac{\partial E_n}{\partial w_{kj}^{(2)}} = \delta_k z_j. \tag{5.67} $$

5.3.3 Efficiency of backpropagation
One of the most important aspects of backpropagation is its computational efficiency. To understand this, let us examine how the number of computer operations required to evaluate the derivatives of the error function scales with the total number $W$ of weights and biases in the network. A single evaluation of the error function (for a given input pattern) would require $O(W)$ operations, for sufficiently large $W$. This follows from the fact that, except for a network with very sparse connections, the number of weights is typically much greater than the number of units, and so the bulk of the computational effort in forward propagation is concerned with evaluating the sums in (5.48), with the evaluation of the activation functions representing a small overhead. Each term in the sum in (5.48) requires one multiplication and one addition, leading to an overall computational cost that is $O(W)$.

An alternative approach to backpropagation for computing the derivatives of the error function is to use finite differences. This can be done by perturbing each weight in turn, and approximating the derivatives by the expression
$$ \frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon) \tag{5.68} $$
where $\epsilon \ll 1$. In a software simulation, the accuracy of the approximation to the derivatives can be improved by making $\epsilon$ smaller, until numerical roundoff problems arise. The accuracy of the finite differences method can be improved significantly by using symmetrical central differences of the form
$$ \frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} + O(\epsilon^2). \tag{5.69} $$
In this case, the $O(\epsilon)$ corrections cancel, as can be verified by Taylor expansion of the right-hand side of (5.69), and so the residual corrections are $O(\epsilon^2)$ (Exercise 5.14). The number of computational steps is, however, roughly doubled compared with (5.68).

The main problem with numerical differentiation is that the highly desirable $O(W)$ scaling has been lost. Each forward propagation requires $O(W)$ steps, and there are $W$ weights in the network, each of which must be perturbed individually, so that the overall scaling is $O(W^2)$.
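As a concrete illustration of equations (5.62)–(5.67), and of the gradient check based on (5.69), the following NumPy sketch implements forward propagation, backpropagation and a central-difference comparison for a toy two-layer network. The dimensions, random data and helper names are illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: D inputs, M tanh hidden units, K linear outputs.
# Biases are handled by appending a fixed input/hidden activation of +1.
D, M, K = 3, 4, 2
W1 = rng.normal(scale=0.5, size=(M, D + 1))   # first-layer weights w_ji^(1)
W2 = rng.normal(scale=0.5, size=(K, M + 1))   # second-layer weights w_kj^(2)
x = rng.normal(size=D)
t = rng.normal(size=K)

def forward(W1, W2, x):
    x1 = np.append(x, 1.0)             # append the bias unit
    a = W1 @ x1                        # eq. (5.62)
    z = np.tanh(a)                     # eq. (5.63)
    z1 = np.append(z, 1.0)
    y = W2 @ z1                        # eq. (5.64), linear outputs
    return x1, z1, y

def error(W1, W2, x, t):
    _, _, y = forward(W1, W2, x)
    return 0.5 * np.sum((y - t) ** 2)  # eq. (5.61)

def backprop(W1, W2, x, t):
    x1, z1, y = forward(W1, W2, x)
    delta_k = y - t                               # eq. (5.65)
    # eq. (5.66); the fixed bias 'unit' of the hidden layer receives no delta
    delta_j = (1.0 - z1[:M] ** 2) * (W2[:, :M].T @ delta_k)
    dW2 = np.outer(delta_k, z1)                   # eq. (5.67)
    dW1 = np.outer(delta_j, x1)                   # eq. (5.67)
    return dW1, dW2

# Check one first-layer derivative against central differences, eq. (5.69)
dW1, dW2 = backprop(W1, W2, x, t)
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (error(W1p, W2, x, t) - error(W1m, W2, x, t)) / (2 * eps)
print(dW1[0, 0], numeric)   # the two values should agree to ~1e-9
```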

267 However, numerical differentiation plays an important role in practice, because a comparison of the derivatives calculated by backpropagation with those obtained using central differences provides a powerful check on the correctness of any software implementation of the backpropagation algorithm. When training networks in practice, derivatives should be evaluated using backpropagation, because this gives the greatest accuracy and numerical efficiency. However, the results should be compared with numerical differentiation using (5.69) for some test cases in order to check the correctness of the implementation.

5.3.4 The Jacobian matrix
We have seen how the derivatives of an error function with respect to the weights can be obtained by the propagation of errors backwards through the network. The technique of backpropagation can also be applied to the calculation of other derivatives. Here we consider the evaluation of the Jacobian matrix, whose elements are given by the derivatives of the network outputs with respect to the inputs
$$ J_{ki} \equiv \frac{\partial y_k}{\partial x_i} \tag{5.70} $$
where each such derivative is evaluated with all other inputs held fixed. Jacobian matrices play a useful role in systems built from a number of distinct modules, as illustrated in Figure 5.8. Each module can comprise a fixed or adaptive function, which can be linear or nonlinear, so long as it is differentiable. Suppose we wish to minimize an error function $E$ with respect to the parameter $w$ in Figure 5.8. The derivative of the error function is given by
$$ \frac{\partial E}{\partial w} = \sum_{k,j} \frac{\partial E}{\partial y_k}\frac{\partial y_k}{\partial z_j}\frac{\partial z_j}{\partial w} \tag{5.71} $$
in which the Jacobian matrix for the red module in Figure 5.8 appears in the middle term.

Figure 5.8: Illustration of a modular pattern recognition system in which the Jacobian matrix can be used to backpropagate error signals from the outputs through to earlier modules in the system.

268 Because the Jacobian matrix provides a measure of the local sensitivity of the outputs to changes in each of the input variables, it also allows any known errors $\Delta x_i$ associated with the inputs to be propagated through the trained network in order to estimate their contribution $\Delta y_k$ to the errors at the outputs, through the relation
$$ \Delta y_k \simeq \sum_i \frac{\partial y_k}{\partial x_i}\,\Delta x_i \tag{5.72} $$
which is valid provided the $|\Delta x_i|$ are small. In general, the network mapping represented by a trained neural network will be nonlinear, and so the elements of the Jacobian matrix will not be constants but will depend on the particular input vector used. Thus (5.72) is valid only for small perturbations of the inputs, and the Jacobian itself must be re-evaluated for each new input vector.

The Jacobian matrix can be evaluated using a backpropagation procedure that is similar to the one derived earlier for evaluating the derivatives of an error function with respect to the weights. We start by writing the element $J_{ki}$ in the form
$$ J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j \frac{\partial y_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} = \sum_j w_{ji}\,\frac{\partial y_k}{\partial a_j} \tag{5.73} $$
where we have made use of (5.48). The sum in (5.73) runs over all units $j$ to which the input unit $i$ sends connections (for example, over all units in the first hidden layer in the layered topology considered earlier). We now write down a recursive backpropagation formula to determine the derivatives $\partial y_k/\partial a_j$
$$ \frac{\partial y_k}{\partial a_j} = \sum_l \frac{\partial y_k}{\partial a_l}\frac{\partial a_l}{\partial a_j} = h'(a_j)\sum_l w_{lj}\,\frac{\partial y_k}{\partial a_l} \tag{5.74} $$
where the sum runs over all units $l$ to which unit $j$ sends connections (corresponding to the first index of $w_{lj}$). Again, we have made use of (5.48) and (5.49). This backpropagation starts at the output units, for which the required derivatives can be found directly from the functional form of the output-unit activation function. For instance, if we have individual sigmoidal activation functions at each output unit, then
$$ \frac{\partial y_k}{\partial a_j} = \delta_{kj}\,\sigma'(a_j) \tag{5.75} $$
whereas for softmax outputs we have
$$ \frac{\partial y_k}{\partial a_j} = \delta_{kj}\,y_k - y_k y_j. \tag{5.76} $$

We can summarize the procedure for evaluating the Jacobian matrix as follows. Apply the input vector corresponding to the point in input space at which the Jacobian matrix is to be found, and forward propagate in the usual way to obtain the activations of all of the hidden and output units in the network. Next, for each row $k$ of the Jacobian matrix, corresponding to the output unit $k$, backpropagate using the recursive relation (5.74), starting with (5.75) or (5.76), for all of the hidden units in the network. Finally, use (5.73) to do the backpropagation to the inputs. The Jacobian can also be evaluated using an alternative forward propagation formalism, which can be derived in an analogous way to the backpropagation approach given here (Exercise 5.15).

Again, the implementation of such algorithms can be checked by using numerical differentiation in the form
$$ \frac{\partial y_k}{\partial x_i} = \frac{y_k(x_i + \epsilon) - y_k(x_i - \epsilon)}{2\epsilon} + O(\epsilon^2) \tag{5.77} $$
which involves $2D$ forward propagations for a network having $D$ inputs.
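The following sketch illustrates (5.73)–(5.77) for a toy two-layer network with linear outputs, for which the backward recursion (5.74) needs only a single step; the setup is an illustrative assumption rather than a general implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 3, 4, 2
W1 = rng.normal(size=(M, D))   # biases omitted for brevity
W2 = rng.normal(size=(K, M))

def forward(x):
    a = W1 @ x
    z = np.tanh(a)
    y = W2 @ z                  # linear outputs
    return a, z, y

def jacobian(x):
    a, z, y = forward(x)
    # For linear outputs, dy_k/da at the output layer is the identity,
    # so dy_k/da_j for hidden unit j follows from one step of (5.74):
    dy_da = W2 * (1.0 - z ** 2)   # element [k, j] = h'(a_j) w_kj
    return dy_da @ W1             # eq. (5.73): J_ki = sum_j w_ji dy_k/da_j

# Numerical check using central differences, eq. (5.77)
x = rng.normal(size=D)
J = jacobian(x)
eps = 1e-6
J_num = np.zeros_like(J)
for i in range(D):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    J_num[:, i] = (forward(xp)[2] - forward(xm)[2]) / (2 * eps)
print(np.max(np.abs(J - J_num)))   # should be ~1e-9 or smaller
```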

269 5.4. The Hessian Matrix

We have shown how the technique of backpropagation can be used to obtain the first derivatives of an error function with respect to the weights in the network. Backpropagation can also be used to evaluate the second derivatives of the error, given by
$$ \frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}}. \tag{5.78} $$
Note that it is sometimes convenient to consider all of the weight and bias parameters as elements $w_i$ of a single vector, denoted $\mathbf{w}$, in which case the second derivatives form the elements $H_{ij}$ of the Hessian matrix $\mathbf{H}$, where $i, j \in \{1, \dots, W\}$ and $W$ is the total number of weights and biases. The Hessian plays an important role in many aspects of neural computing, including the following:

1. Several nonlinear optimization algorithms used for training neural networks are based on considerations of the second-order properties of the error surface, which are controlled by the Hessian matrix (Bishop and Nabney, 2008).
2. The Hessian forms the basis of a fast procedure for re-training a feed-forward network following a small change in the training data (Bishop, 1991).
3. The inverse of the Hessian has been used to identify the least significant weights in a network as part of network 'pruning' algorithms (Le Cun et al., 1990).
4. The Hessian plays a central role in the Laplace approximation for a Bayesian neural network (see Section 5.7). Its inverse is used to determine the predictive distribution for a trained network, its eigenvalues determine the values of hyperparameters, and its determinant is used to evaluate the model evidence.

Various approximation schemes have been used to evaluate the Hessian matrix for a neural network. However, the Hessian can also be calculated exactly using an extension of the backpropagation technique.

270 An important consideration for many applications of the Hessian is the efficiency with which it can be evaluated. If there are $W$ parameters (weights and biases) in the network, then the Hessian matrix has dimensions $W \times W$, and so the computational effort needed to evaluate the Hessian will scale like $O(W^2)$ for each pattern in the data set. As we shall see, there are efficient methods for evaluating the Hessian whose scaling is indeed $O(W^2)$.

5.4.1 Diagonal approximation
Some of the applications for the Hessian matrix discussed above require the inverse of the Hessian, rather than the Hessian itself. For this reason, there has been some interest in using a diagonal approximation to the Hessian, in other words one that simply replaces the off-diagonal elements with zeros, because its inverse is trivial to evaluate. Again, we shall consider an error function that consists of a sum of terms, one for each pattern in the data set, so that $E = \sum_n E_n$. The Hessian can then be obtained by considering one pattern at a time, and then summing the results over all patterns. From (5.48), the diagonal elements of the Hessian, for pattern $n$, can be written
$$ \frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2}\, z_i^2. \tag{5.79} $$
Using (5.48) and (5.49), the second derivatives on the right-hand side of (5.79) can be found recursively using the chain rule of differential calculus to give a backpropagation equation of the form
$$ \frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j}\, \frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}} + h''(a_j) \sum_k w_{kj}\, \frac{\partial E_n}{\partial a_k}. \tag{5.80} $$
If we now neglect off-diagonal elements in the second-derivative terms, we obtain (Becker and Le Cun, 1989; Le Cun et al., 1990)
$$ \frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k w_{kj}^2\, \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj}\, \frac{\partial E_n}{\partial a_k}. \tag{5.81} $$
Note that the number of computational steps required to evaluate this approximation is $O(W)$, where $W$ is the total number of weight and bias parameters in the network, compared with $O(W^2)$ for the full Hessian.

Ricotti et al. (1988) also used the diagonal approximation to the Hessian, but they retained all terms in the evaluation of $\partial^2 E_n/\partial a_j^2$ and so obtained exact expressions for the diagonal terms. Note that this no longer has $O(W)$ scaling. The major problem with diagonal approximations, however, is that in practice the Hessian is typically found to be strongly nondiagonal, and so these approximations, which are driven mainly by computational convenience, must be treated with care.
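A minimal sketch of (5.79)–(5.81) for the toy two-layer network with linear outputs and a sum-of-squares error, for which $\partial E_n/\partial a_k = y_k - t_k$ and $\partial^2 E_n/\partial a_k^2 = 1$ at the outputs; the dimensions and data are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, K = 3, 4, 2
W1 = rng.normal(size=(M, D))
W2 = rng.normal(size=(K, M))
x, t = rng.normal(size=D), rng.normal(size=K)

a = W1 @ x
z = np.tanh(a)
y = W2 @ z                     # linear outputs, sum-of-squares error

dE_da = y - t                  # dE/da_k at the output units
d2E_da2_out = np.ones(K)       # d2E/da_k^2 = 1 for this error function

h1 = 1.0 - z ** 2              # h'(a_j) for tanh
h2 = -2.0 * z * h1             # h''(a_j) for tanh

# Diagonal backpropagation, eq. (5.81), neglecting off-diagonal terms
d2E_da2_hid = h1 ** 2 * ((W2 ** 2).T @ d2E_da2_out) + h2 * (W2.T @ dE_da)

# Diagonal Hessian elements via eq. (5.79)
diag_W2 = np.outer(d2E_da2_out, z ** 2)   # d2E/d(w_kj^(2))^2
diag_W1 = np.outer(d2E_da2_hid, x ** 2)   # d2E/d(w_ji^(1))^2
```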

271 5.4.2 Outer product approximation
When neural networks are applied to regression problems, it is common to use a sum-of-squares error function of the form
$$ E = \frac{1}{2}\sum_{n=1}^{N}(y_n - t_n)^2 \tag{5.82} $$
where we have considered the case of a single output in order to keep the notation simple (the extension to several outputs is straightforward). We can then write the Hessian matrix in the form (Exercise 5.16)
$$ \mathbf{H} = \nabla\nabla E = \sum_{n=1}^{N} \nabla y_n (\nabla y_n)^{\mathrm T} + \sum_{n=1}^{N} (y_n - t_n)\,\nabla\nabla y_n. \tag{5.83} $$
If the network has been trained on the data set, and its outputs $y_n$ happen to be very close to the target values $t_n$, then the second term in (5.83) will be small and can be neglected. More generally, however, it may be appropriate to neglect this term by the following argument. Recall from Section 1.5.5 that the optimal function that minimizes a sum-of-squares loss is the conditional average of the target data. The quantity $(y_n - t_n)$ is then a random variable with zero mean. If we assume that its value is uncorrelated with the value of the second derivative term on the right-hand side of (5.83), then the whole term will average to zero in the summation over $n$ (Exercise 5.17).

By neglecting the second term in (5.83), we arrive at the Levenberg–Marquardt approximation or outer product approximation (because the Hessian matrix is built up from a sum of outer products of vectors), given by
$$ \mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm T} \tag{5.84} $$
where $\mathbf{b}_n = \nabla y_n = \nabla a_n$ because the activation function for the output units is simply the identity. Evaluation of the outer product approximation for the Hessian is straightforward as it only involves first derivatives of the error function, which can be evaluated efficiently in $O(W)$ steps using standard backpropagation. The elements of the matrix can then be found in $O(W^2)$ steps by simple multiplication. It is important to emphasize that this approximation is only likely to be valid for a network that has been trained appropriately, and that for a general network mapping the second derivative terms on the right-hand side of (5.83) will typically not be negligible.

In the case of the cross-entropy error function for a network with logistic sigmoid output-unit activation functions, the corresponding approximation is given by (Exercise 5.19)
$$ \mathbf{H} \simeq \sum_{n=1}^{N} y_n (1 - y_n)\, \mathbf{b}_n \mathbf{b}_n^{\mathrm T}. \tag{5.85} $$
An analogous result can be obtained for multiclass networks having softmax output-unit activation functions (Exercise 5.20).
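A sketch of the outer-product construction (5.84) for a single-output version of the toy network; the gradient $\mathbf{b}_n$ of the output with respect to the weights is written out in closed form here, though in general it would come from backpropagation. All dimensions and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, N = 3, 4, 50
W1 = rng.normal(size=(M, D))
w2 = rng.normal(size=M)              # single linear output
X = rng.normal(size=(N, D))

def output_grad(x):
    """b_n = gradient of the network output w.r.t. all weights (flattened)."""
    a = W1 @ x
    z = np.tanh(a)
    dy_dW2 = z                                   # dy/dw_j^(2) = z_j
    dy_dW1 = np.outer(w2 * (1 - z ** 2), x)      # dy/dw_ji^(1) = w_j^(2) h'(a_j) x_i
    return np.concatenate([dy_dW1.ravel(), dy_dW2])

W = M * D + M                                    # total number of weights
H = np.zeros((W, W))
for x in X:                                      # eq. (5.84): H ≈ sum_n b_n b_n^T
    b = output_grad(x)
    H += np.outer(b, b)
```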

272 5.4.3 Inverse Hessian
We can use the outer-product approximation to develop a computationally efficient procedure for approximating the inverse of the Hessian (Hassibi and Stork, 1993). First we write the outer-product approximation in matrix notation as
$$ \mathbf{H}_N = \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm T} \tag{5.86} $$
where $\mathbf{b}_n \equiv \nabla_{\mathbf{w}} a_n$ is the contribution to the gradient of the output-unit activation arising from data point $n$. We now derive a sequential procedure for building up the Hessian by including data points one at a time. Suppose we have already obtained the inverse Hessian using the first $L$ data points. By separating off the contribution from data point $L+1$, we obtain
$$ \mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1}\mathbf{b}_{L+1}^{\mathrm T}. \tag{5.87} $$
In order to evaluate the inverse of the Hessian, we now consider the matrix identity
$$ \left(\mathbf{M} + \mathbf{v}\mathbf{v}^{\mathrm T}\right)^{-1} = \mathbf{M}^{-1} - \frac{\left(\mathbf{M}^{-1}\mathbf{v}\right)\left(\mathbf{v}^{\mathrm T}\mathbf{M}^{-1}\right)}{1 + \mathbf{v}^{\mathrm T}\mathbf{M}^{-1}\mathbf{v}} \tag{5.88} $$
which is simply a special case of the Woodbury identity (C.7). If we now identify $\mathbf{H}_L$ with $\mathbf{M}$ and $\mathbf{b}_{L+1}$ with $\mathbf{v}$, we obtain
$$ \mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1}\mathbf{b}_{L+1}\mathbf{b}_{L+1}^{\mathrm T}\mathbf{H}_L^{-1}}{1 + \mathbf{b}_{L+1}^{\mathrm T}\mathbf{H}_L^{-1}\mathbf{b}_{L+1}}. \tag{5.89} $$
In this way, data points are sequentially absorbed until $L + 1 = N$ and the whole data set has been processed. This result therefore represents a procedure for evaluating the inverse of the Hessian using a single pass through the data set. The initial matrix $\mathbf{H}_0$ is chosen to be $\alpha\mathbf{I}$, where $\mathbf{I}$ is the unit matrix and $\alpha$ is a small quantity, so that the algorithm actually finds the inverse of $\mathbf{H} + \alpha\mathbf{I}$. The results are not particularly sensitive to the precise value of $\alpha$. Extension of this algorithm to networks having more than one output is straightforward (Exercise 5.21).

We note here that the Hessian matrix can sometimes be calculated indirectly as part of the network training algorithm. In particular, quasi-Newton nonlinear optimization algorithms gradually build up an approximation to the inverse of the Hessian during training. Such algorithms are discussed in detail in Bishop and Nabney (2008).
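The single-pass update (5.89) translates directly into a short loop; a sketch follows, with a consistency check against direct inversion on synthetic gradient vectors (all data illustrative).

```python
import numpy as np

def sequential_inverse_hessian(B, alpha=1e-3):
    """Single-pass inverse of the outer-product Hessian via eq. (5.89).

    B     : array of shape (N, W), row n holding b_n
    alpha : small ridge term; the result is the inverse of H + alpha*I
    """
    N, W = B.shape
    Hinv = np.eye(W) / alpha          # inverse of H_0 = alpha * I
    for b in B:                       # absorb data points one at a time
        Hb = Hinv @ b
        Hinv -= np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return Hinv

# Consistency check against direct inversion
rng = np.random.default_rng(4)
B = rng.normal(size=(50, 10))
H = B.T @ B + 1e-3 * np.eye(10)
print(np.max(np.abs(sequential_inverse_hessian(B) - np.linalg.inv(H))))
```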

273 5.4.4 Finite differences
As in the case of the first derivatives of the error function, we can find the second derivatives by using finite differences, with accuracy limited by numerical precision. If we perturb each possible pair of weights in turn, we obtain
$$ \frac{\partial^2 E_n}{\partial w_{ji}\,\partial w_{lk}} = \frac{1}{4\epsilon^2}\bigl\{ E_n(w_{ji}+\epsilon,\, w_{lk}+\epsilon) - E_n(w_{ji}+\epsilon,\, w_{lk}-\epsilon) - E_n(w_{ji}-\epsilon,\, w_{lk}+\epsilon) + E_n(w_{ji}-\epsilon,\, w_{lk}-\epsilon) \bigr\} + O(\epsilon^2). \tag{5.90} $$
Again, by using a symmetrical central differences formulation, we ensure that the residual errors are $O(\epsilon^2)$ rather than $O(\epsilon)$. Because there are $W^2$ elements in the Hessian matrix, and because the evaluation of each element requires four forward propagations each needing $O(W)$ operations (per pattern), we see that this approach will require $O(W^3)$ operations to evaluate the complete Hessian. It therefore has poor scaling properties, although in practice it is very useful as a check on the software implementation of backpropagation methods.

A more efficient version of numerical differentiation can be found by applying central differences to the first derivatives of the error function, which are themselves calculated using backpropagation. This gives
$$ \frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}} = \frac{1}{2\epsilon}\left\{ \frac{\partial E}{\partial w_{ji}}(w_{lk}+\epsilon) - \frac{\partial E}{\partial w_{ji}}(w_{lk}-\epsilon) \right\} + O(\epsilon^2). \tag{5.91} $$
Because there are now only $W$ weights to be perturbed, and because the gradients can be evaluated in $O(W)$ steps, we see that this method gives the Hessian in $O(W^2)$ operations.

5.4.5 Exact evaluation of the Hessian
So far, we have considered various approximation schemes for evaluating the Hessian matrix or its inverse. The Hessian can also be evaluated exactly, for a network of arbitrary feed-forward topology, using an extension of the technique of backpropagation used to evaluate first derivatives, which shares many of its desirable features including computational efficiency (Bishop, 1991; Bishop, 1992). It can be applied to any differentiable error function that can be expressed as a function of the network outputs and to networks having arbitrary differentiable activation functions. The number of computational steps needed to evaluate the Hessian scales like $O(W^2)$. Similar algorithms have also been considered by Buntine and Weigend (1993).

Here we consider the specific case of a network having two layers of weights, for which the required equations are easily derived (Exercise 5.22). We shall use indices $i$ and $i'$ to denote inputs, indices $j$ and $j'$ to denote hidden units, and indices $k$ and $k'$ to denote outputs. We first define
$$ \delta_k \equiv \frac{\partial E_n}{\partial a_k}, \qquad M_{kk'} \equiv \frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}} \tag{5.92} $$
where $E_n$ is the contribution to the error from data point $n$. The Hessian matrix for this network can then be considered in three separate blocks as follows.

1. Both weights in the second layer:
$$ \frac{\partial^2 E_n}{\partial w_{kj}^{(2)}\,\partial w_{k'j'}^{(2)}} = z_j z_{j'} M_{kk'}. \tag{5.93} $$

274 2. Both weights in the first layer:
$$ \frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{j'i'}^{(1)}} = x_i x_{i'}\, h''(a_{j'})\, I_{jj'} \sum_k w_{kj'}^{(2)}\delta_k + x_i x_{i'}\, h'(a_{j'})\, h'(a_j) \sum_k \sum_{k'} w_{k'j'}^{(2)} w_{kj}^{(2)} M_{kk'}. \tag{5.94} $$
3. One weight in each layer:
$$ \frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{kj'}^{(2)}} = x_i\, h'(a_{j'}) \left\{ \delta_k I_{jj'} + z_j \sum_{k'} w_{k'j'}^{(2)} M_{kk'} \right\}. \tag{5.95} $$
Here $I_{jj'}$ is the $j, j'$ element of the identity matrix. If one or both of the weights is a bias term, then the corresponding expressions are obtained simply by setting the appropriate activation(s) to 1. Inclusion of skip-layer connections is straightforward (Exercise 5.23).
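Because (5.91) needs only a gradient routine, it makes a convenient correctness check for exact-Hessian code such as (5.93)–(5.95). The sketch below is generic over any differentiable error; it is demonstrated on a small analytic error function, chosen only so that the result can be verified in closed form.

```python
import numpy as np

def numerical_hessian(grad, w, eps=1e-5):
    """Hessian via central differences of the gradient, eq. (5.91).

    grad : function returning dE/dw as a flat array
    w    : flat parameter vector
    """
    W = w.size
    H = np.zeros((W, W))
    for l in range(W):
        wp, wm = w.copy(), w.copy()
        wp[l] += eps
        wm[l] -= eps
        H[:, l] = (grad(wp) - grad(wm)) / (2 * eps)
    return 0.5 * (H + H.T)    # symmetrize away residual asymmetry

# Example: E(w) = w^T A w / 2 + (w^T w)^2, whose Hessian is known exactly
rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4)); A = A @ A.T
grad = lambda w: A @ w + 4.0 * (w @ w) * w
w0 = rng.normal(size=4)
H = numerical_hessian(grad, w0)
H_exact = A + 8.0 * np.outer(w0, w0) + 4.0 * (w0 @ w0) * np.eye(4)
print(np.max(np.abs(H - H_exact)))    # should be small, ~1e-7
```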

275 5.4.6 Fast multiplication by the Hessian
For many applications of the Hessian, the quantity of interest is not the Hessian matrix $\mathbf{H}$ itself but the product of $\mathbf{H}$ with some vector $\mathbf{v}$. We have seen that the evaluation of the Hessian takes $O(W^2)$ operations, and it also requires storage that is $O(W^2)$. The vector $\mathbf{v}^{\mathrm T}\mathbf{H}$ that we wish to calculate, however, has only $W$ elements, so instead of computing the Hessian as an intermediate step, we can instead try to find an efficient approach to evaluating $\mathbf{v}^{\mathrm T}\mathbf{H}$ directly in a way that requires only $O(W)$ operations.

To do this, we first note that
$$ \mathbf{v}^{\mathrm T}\mathbf{H} = \mathbf{v}^{\mathrm T}\nabla(\nabla E) \tag{5.96} $$
where $\nabla$ denotes the gradient operator in weight space. We can then write down the standard forward-propagation and backpropagation equations for the evaluation of $\nabla E$ and apply (5.96) to these equations to give a set of forward-propagation and backpropagation equations for the evaluation of $\mathbf{v}^{\mathrm T}\mathbf{H}$ (Møller, 1993; Pearlmutter, 1994). This corresponds to acting on the original forward-propagation and backpropagation equations with a differential operator $\mathbf{v}^{\mathrm T}\nabla$. Pearlmutter (1994) used the notation $\mathcal{R}\{\cdot\}$ to denote the operator $\mathbf{v}^{\mathrm T}\nabla$, and we shall follow this convention. The analysis is straightforward and makes use of the usual rules of differential calculus, together with the result
$$ \mathcal{R}\{\mathbf{w}\} = \mathbf{v}. \tag{5.97} $$

The technique is best illustrated with a simple example, and again we choose a two-layer network of the form shown in Figure 5.1, with linear output units and a sum-of-squares error function. As before, we consider the contribution to the error function from one pattern in the data set. The required vector is then obtained as usual by summing over the contributions from each of the patterns separately. For the two-layer network, the forward-propagation equations are given by
$$ a_j = \sum_i w_{ji} x_i \tag{5.98} $$
$$ z_j = h(a_j) \tag{5.99} $$
$$ y_k = \sum_j w_{kj} z_j. \tag{5.100} $$
We now act on these equations using the $\mathcal{R}\{\cdot\}$ operator to obtain a set of forward propagation equations in the form
$$ \mathcal{R}\{a_j\} = \sum_i v_{ji} x_i \tag{5.101} $$
$$ \mathcal{R}\{z_j\} = h'(a_j)\,\mathcal{R}\{a_j\} \tag{5.102} $$
$$ \mathcal{R}\{y_k\} = \sum_j w_{kj}\,\mathcal{R}\{z_j\} + \sum_j v_{kj} z_j \tag{5.103} $$
where $v_{ji}$ is the element of the vector $\mathbf{v}$ that corresponds to the weight $w_{ji}$. Quantities of the form $\mathcal{R}\{a_j\}$, $\mathcal{R}\{z_j\}$ and $\mathcal{R}\{y_k\}$ are to be regarded as new variables whose values are found using the above equations.

Because we are considering a sum-of-squares error function, we have the following standard backpropagation expressions:
$$ \delta_k = y_k - t_k \tag{5.104} $$
$$ \delta_j = h'(a_j)\sum_k w_{kj}\,\delta_k. \tag{5.105} $$
Again, we act on these equations with the $\mathcal{R}\{\cdot\}$ operator to obtain a set of backpropagation equations in the form
$$ \mathcal{R}\{\delta_k\} = \mathcal{R}\{y_k\} \tag{5.106} $$
$$ \mathcal{R}\{\delta_j\} = h''(a_j)\,\mathcal{R}\{a_j\}\sum_k w_{kj}\,\delta_k + h'(a_j)\sum_k v_{kj}\,\delta_k + h'(a_j)\sum_k w_{kj}\,\mathcal{R}\{\delta_k\}. \tag{5.107} $$
Finally, we have the usual equations for the first derivatives of the error
$$ \frac{\partial E}{\partial w_{kj}} = \delta_k z_j \tag{5.108} $$
$$ \frac{\partial E}{\partial w_{ji}} = \delta_j x_i \tag{5.109} $$

276 and acting on these with the $\mathcal{R}\{\cdot\}$ operator, we obtain expressions for the elements of the vector $\mathbf{v}^{\mathrm T}\mathbf{H}$
$$ \mathcal{R}\!\left\{\frac{\partial E}{\partial w_{kj}}\right\} = \mathcal{R}\{\delta_k\}\, z_j + \delta_k\,\mathcal{R}\{z_j\} \tag{5.110} $$
$$ \mathcal{R}\!\left\{\frac{\partial E}{\partial w_{ji}}\right\} = x_i\,\mathcal{R}\{\delta_j\}. \tag{5.111} $$
The implementation of this algorithm involves the introduction of additional variables $\mathcal{R}\{a_j\}$, $\mathcal{R}\{z_j\}$ and $\mathcal{R}\{\delta_j\}$ for the hidden units and $\mathcal{R}\{\delta_k\}$ and $\mathcal{R}\{y_k\}$ for the output units. For each input pattern, the values of these quantities can be found using the above results, and the elements of $\mathbf{v}^{\mathrm T}\mathbf{H}$ are then given by (5.110) and (5.111). An elegant aspect of this technique is that the equations for evaluating $\mathbf{v}^{\mathrm T}\mathbf{H}$ mirror closely those for standard forward and backward propagation, and so the extension of existing software to compute this product is typically straightforward.

If desired, the technique can be used to evaluate the full Hessian matrix by choosing the vector $\mathbf{v}$ to be given successively by a series of unit vectors of the form $(0, 0, \dots, 1, \dots, 0)$ each of which picks out one column of the Hessian. This leads to a formalism that is analytically equivalent to the backpropagation procedure of Bishop (1992), as described in Section 5.4.5, though with some loss of efficiency due to redundant calculations.
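Here is a direct transcription of (5.98)–(5.111) for the toy two-layer tanh network, validated against the fact that $\mathbf{v}^{\mathrm T}\mathbf{H}$ is the directional derivative of the gradient along $\mathbf{v}$, which can be approximated by central differences. The single-pattern setup and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
D, M, K = 3, 4, 2
W1 = rng.normal(size=(M, D))          # biases omitted to keep the sketch short
W2 = rng.normal(size=(K, M))
V1 = rng.normal(size=(M, D))          # the vector v, shaped like the weights
V2 = rng.normal(size=(K, M))
x, t = rng.normal(size=D), rng.normal(size=K)

# Forward and R{.}-forward passes, eqs. (5.98)-(5.103)
a = W1 @ x
z = np.tanh(a)
y = W2 @ z
Ra = V1 @ x                            # (5.101)
h1 = 1 - z ** 2                        # h'(a)
h2 = -2 * z * h1                       # h''(a) for tanh
Rz = h1 * Ra                           # (5.102)
Ry = W2 @ Rz + V2 @ z                  # (5.103)

# Backward and R{.}-backward passes, eqs. (5.104)-(5.107)
dk = y - t                             # (5.104)
dj = h1 * (W2.T @ dk)                  # (5.105)
Rdk = Ry                               # (5.106), linear outputs
Rdj = h2 * Ra * (W2.T @ dk) + h1 * (V2.T @ dk) + h1 * (W2.T @ Rdk)  # (5.107)

# Elements of v^T H, eqs. (5.110)-(5.111)
HV2 = np.outer(Rdk, z) + np.outer(dk, Rz)
HV1 = np.outer(Rdj, x)

# Check against central differences of the backprop gradient
def grads(W1, W2):
    a = W1 @ x; z = np.tanh(a); y = W2 @ z
    dk = y - t; dj = (1 - z ** 2) * (W2.T @ dk)
    return np.outer(dj, x), np.outer(dk, z)

eps = 1e-6
g1p, g2p = grads(W1 + eps * V1, W2 + eps * V2)
g1m, g2m = grads(W1 - eps * V1, W2 - eps * V2)
print(np.max(np.abs(HV1 - (g1p - g1m) / (2 * eps))),
      np.max(np.abs(HV2 - (g2p - g2m) / (2 * eps))))
```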

277 5.5. Regularization in Neural Networks

The number of input and output units in a neural network is generally determined by the dimensionality of the data set, whereas the number $M$ of hidden units is a free parameter that can be adjusted to give the best predictive performance. Note that $M$ controls the number of parameters (weights and biases) in the network, and so we might expect that in a maximum likelihood setting there will be an optimum value of $M$ that gives the best generalization performance, corresponding to the optimum balance between under-fitting and over-fitting. Figure 5.9 shows an example of the effect of different values of $M$ for the sinusoidal regression problem.

Figure 5.9: Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The graphs show the result of fitting networks having $M = 1$, $3$ and $10$ hidden units, respectively, by minimizing a sum-of-squares error function using a scaled conjugate-gradient algorithm.

The generalization error, however, is not a simple function of $M$ due to the presence of local minima in the error function, as illustrated in Figure 5.10. Here we see the effect of choosing multiple random initializations for the weight vector for a range of values of $M$. The overall best validation set performance in this case occurred for a particular solution having $M = 8$. In practice, one approach to choosing $M$ is in fact to plot a graph of the kind shown in Figure 5.10 and then to choose the specific solution having the smallest validation set error.

Figure 5.10: Plot of the sum-of-squares test-set error for the polynomial data set versus the number of hidden units in the network, with 30 random starts for each network size, showing the effect of local minima. For each new start, the weight vector was initialized by sampling from an isotropic Gaussian distribution having a mean of zero and a variance of 10.

There are, however, other ways to control the complexity of a neural network model in order to avoid over-fitting. From our discussion of polynomial curve fitting in Chapter 1, we see that an alternative approach is to choose a relatively large value for $M$ and then to control complexity by the addition of a regularization term to the error function. The simplest regularizer is the quadratic, giving a regularized error of the form
$$ \widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^{\mathrm T}\mathbf{w}. \tag{5.112} $$
This regularizer is also known as weight decay and has been discussed at length in Chapter 3. The effective model complexity is then determined by the choice of the regularization coefficient $\lambda$. As we have seen previously, this regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over the weight vector $\mathbf{w}$.

5.5.1 Consistent Gaussian priors
One of the limitations of simple weight decay in the form (5.112) is that it is inconsistent with certain scaling properties of network mappings. To illustrate this, consider a multilayer perceptron network having two layers of weights and linear output units, which performs a mapping from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$.

278 The activations of the hidden units in the first hidden layer take the form
$$ z_j = h\!\left(\sum_i w_{ji} x_i + w_{j0}\right) \tag{5.113} $$
while the activations of the output units are given by
$$ y_k = \sum_j w_{kj} z_j + w_{k0}. \tag{5.114} $$
Suppose we perform a linear transformation of the input data of the form
$$ x_i \rightarrow \widetilde{x}_i = a x_i + b. \tag{5.115} $$
Then we can arrange for the mapping performed by the network to be unchanged by making a corresponding linear transformation of the weights and biases from the inputs to the units in the hidden layer of the form (Exercise 5.24)
$$ w_{ji} \rightarrow \widetilde{w}_{ji} = \frac{1}{a}\, w_{ji} \tag{5.116} $$
$$ w_{j0} \rightarrow \widetilde{w}_{j0} = w_{j0} - \frac{b}{a}\sum_i w_{ji}. \tag{5.117} $$
Similarly, a linear transformation of the output variables of the network of the form
$$ y_k \rightarrow \widetilde{y}_k = c y_k + d \tag{5.118} $$
can be achieved by making a transformation of the second-layer weights and biases using
$$ w_{kj} \rightarrow \widetilde{w}_{kj} = c\, w_{kj} \tag{5.119} $$
$$ w_{k0} \rightarrow \widetilde{w}_{k0} = c\, w_{k0} + d. \tag{5.120} $$
If we train one network using the original data and one network using data for which the input and/or target variables are transformed by one of the above linear transformations, then consistency requires that we should obtain equivalent networks that differ only by the linear transformation of the weights as given. Any regularizer should be consistent with this property, otherwise it arbitrarily favours one solution over another, equivalent one. Clearly, simple weight decay (5.112), which treats all weights and biases on an equal footing, does not satisfy this property.

We therefore look for a regularizer which is invariant under the linear transformations (5.116), (5.117), (5.119) and (5.120). These require that the regularizer should be invariant to re-scaling of the weights and to shifts of the biases. Such a regularizer is given by
$$ \frac{\lambda_1}{2}\sum_{w \in \mathcal{W}_1} w^2 + \frac{\lambda_2}{2}\sum_{w \in \mathcal{W}_2} w^2 \tag{5.121} $$
where $\mathcal{W}_1$ denotes the set of weights in the first layer, $\mathcal{W}_2$ denotes the set of weights in the second layer, and biases are excluded from the summations. This regularizer will remain unchanged under the weight transformations provided the regularization parameters are re-scaled using $\lambda_1 \rightarrow a^2\lambda_1$ and $\lambda_2 \rightarrow c^{-2}\lambda_2$.
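The compensation (5.116)–(5.117) is easy to verify numerically; the sketch below transforms the inputs and first-layer parameters of a toy network and confirms the outputs are unchanged, while the first-layer sum of squared weights (and hence simple weight decay) is not. All data and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Linearly transform the inputs, eq. (5.115), and compensate with the
# first-layer transformation of eqs. (5.116)-(5.117)
a, b = 2.5, -1.3
x = rng.normal(size=D)
x_tilde = a * x + b
W1_tilde = W1 / a
b1_tilde = b1 - (b / a) * W1.sum(axis=1)

y1 = net(x, W1, b1, W2, b2)
y2 = net(x_tilde, W1_tilde, b1_tilde, W2, b2)
print(np.max(np.abs(y1 - y2)))    # ~1e-15: the mapping is unchanged

# Simple weight decay changes under this re-parameterization, whereas the
# group-wise regularizer (5.121) is kept invariant by rescaling lambda_1 by a^2
print((W1 ** 2).sum(), (W1_tilde ** 2).sum())   # differ by a factor a^2
```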

279 The regularizer (5.121) corresponds to a prior of the form
$$ p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2}\sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2}\sum_{w \in \mathcal{W}_2} w^2 \right). \tag{5.122} $$
Note that priors of this form are improper (they cannot be normalized) because the bias parameters are unconstrained. The use of improper priors can lead to difficulties in selecting regularization coefficients and in model comparison within the Bayesian framework, because the corresponding evidence is zero. It is therefore common to include separate priors for the biases (which then break shift invariance) having their own hyperparameters. We can illustrate the effect of the resulting four hyperparameters by drawing samples from the prior and plotting the corresponding network functions, as shown in Figure 5.11.

More generally, we can consider priors in which the weights are divided into any number of groups $\mathcal{W}_k$ so that
$$ p(\mathbf{w}) \propto \exp\left( -\frac{1}{2}\sum_k \alpha_k \|\mathbf{w}\|_k^2 \right) \tag{5.123} $$
where
$$ \|\mathbf{w}\|_k^2 = \sum_{j \in \mathcal{W}_k} w_j^2. \tag{5.124} $$
As a special case of this prior, if we choose the groups to correspond to the sets of weights associated with each of the input units, and we optimize the marginal likelihood with respect to the corresponding parameters $\alpha_k$, we obtain automatic relevance determination, as discussed in Section 7.2.2.
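Samples of the kind shown in Figure 5.11 can be generated with a few lines of code; the following sketch assumes a single-input, single-output network with 12 tanh hidden units, and the four precision hyperparameters are passed in explicitly (plotting is elided).

```python
import numpy as np

rng = np.random.default_rng(8)
M = 12                     # hidden units; single input, single linear output

def sample_function(alpha_w1, alpha_b1, alpha_w2, alpha_b2, x):
    """Draw one network function from the Gaussian prior; the alphas are
    the precisions of the first/second-layer weight and bias priors."""
    w1 = rng.normal(scale=alpha_w1 ** -0.5, size=M)
    b1 = rng.normal(scale=alpha_b1 ** -0.5, size=M)
    w2 = rng.normal(scale=alpha_w2 ** -0.5, size=M)
    b2 = rng.normal(scale=alpha_b2 ** -0.5)
    return np.tanh(np.outer(x, w1) + b1) @ w2 + b2

x = np.linspace(-1, 1, 200)
samples = [sample_function(1.0, 1.0, 1.0, 1.0, x) for _ in range(5)]
# plotting each sample against x visualizes the prior, cf. Figure 5.11
```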

280 Figure 5.11: Illustration of the effect of the hyperparameters governing the prior distribution over weights and biases in a two-layer network having a single input, a single linear output, and 12 hidden units having 'tanh' activation functions. The priors are governed by four hyperparameters $\alpha_1^{\mathrm b}$, $\alpha_1^{\mathrm w}$, $\alpha_2^{\mathrm b}$, and $\alpha_2^{\mathrm w}$, which represent the precisions of the Gaussian distributions of the first-layer biases, first-layer weights, second-layer biases, and second-layer weights, respectively. We see that the parameter $\alpha_2^{\mathrm w}$ governs the vertical scale of functions (note the different vertical axis ranges on the top two diagrams), $\alpha_1^{\mathrm w}$ governs the horizontal scale of variations in the function values, and $\alpha_1^{\mathrm b}$ governs the horizontal range over which variations occur. The parameter $\alpha_2^{\mathrm b}$, whose effect is not illustrated here, governs the range of vertical offsets of the functions.

5.5.2 Early stopping
An alternative to regularization as a way of controlling the effective complexity of a network is the procedure of early stopping. The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data. For many of the optimization algorithms used for network training, such as conjugate gradients, the error is a nonincreasing function of the iteration index. However, the error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set, as indicated in Figure 5.12, in order to obtain a network having good generalization performance.

The behaviour of the network in this case is sometimes explained qualitatively in terms of the effective number of degrees of freedom in the network, in which this number starts out small and then grows during the training process, corresponding to a steady increase in the effective complexity of the model. Halting training before a minimum of the training error has been reached then represents a way of limiting the effective network complexity.

In the case of a quadratic error function, we can verify this insight, and show that early stopping should exhibit similar behaviour to regularization using a simple weight-decay term. This can be understood from Figure 5.13, in which the axes in weight space have been rotated to be parallel to the eigenvectors of the Hessian matrix. If, in the absence of weight decay, the weight vector starts at the origin and proceeds during training along a path that follows the local negative gradient vector, then the weight vector will move initially parallel to the $w_2$ axis through a point corresponding roughly to $\widetilde{\mathbf{w}}$ and then move towards the minimum of the error function $\mathbf{w}_{\mathrm{ML}}$. This follows from the shape of the error surface and the widely differing eigenvalues of the Hessian. Stopping at a point near $\widetilde{\mathbf{w}}$ is therefore similar to weight decay. The relationship between early stopping and weight decay can be made quantitative, thereby showing that the quantity $\tau\eta$ (where $\tau$ is the iteration index, and $\eta$ is the learning rate parameter) plays the role of the reciprocal of the regularization parameter $\lambda$ (Exercise 5.25). The effective number of parameters in the network therefore grows during the course of training.

281 Figure 5.12: An illustration of the behaviour of training set error (left) and validation set error (right) during a typical training session, as a function of the iteration step, for the sinusoidal data set. The goal of achieving the best generalization performance suggests that training should be stopped at the point shown by the vertical dashed lines, corresponding to the minimum of the validation set error.

5.5.3 Invariances
In many applications of pattern recognition, it is known that predictions should be unchanged, or invariant, under one or more transformations of the input variables. For example, in the classification of objects in two-dimensional images, such as handwritten digits, a particular object should be assigned the same classification irrespective of its position within the image (translation invariance) or of its size (scale invariance). Such transformations produce significant changes in the raw data, expressed in terms of the intensities at each of the pixels in the image, and yet should give rise to the same output from the classification system. Similarly, in speech recognition, small levels of nonlinear warping along the time axis, which preserve temporal ordering, should not change the interpretation of the signal.

If sufficiently large numbers of training patterns are available, then an adaptive model such as a neural network can learn the invariance, at least approximately. This involves including within the training set a sufficiently large number of examples of the effects of the various transformations. Thus, for translation invariance in an image, the training set should include examples of objects at many different positions. This approach may be impractical, however, if the number of training examples is limited, or if there are several invariants (because the number of combinations of transformations grows exponentially with the number of such transformations). We therefore seek alternative approaches for encouraging an adaptive model to exhibit the required invariances. These can broadly be divided into four categories:

1. The training set is augmented using replicas of the training patterns, transformed according to the desired invariances. For instance, in our digit recognition example, we could make multiple copies of each example in which the digit is shifted to a different position in each image.

282 Figure 5.13: A schematic illustration of why early stopping can give similar results to weight decay in the case of a quadratic error function. The ellipse shows a contour of constant error, and $\mathbf{w}_{\mathrm{ML}}$ denotes the minimum of the error function. If the weight vector starts at the origin and moves according to the local negative gradient direction, then it will follow the path shown by the curve. By stopping training early, a weight vector $\widetilde{\mathbf{w}}$ is found that is qualitatively similar to that obtained with a simple weight-decay regularizer and training to the minimum of the regularized error, as can be seen by comparing with Figure 3.15.

2. A regularization term is added to the error function that penalizes changes in the model output when the input is transformed. This leads to the technique of tangent propagation, discussed in Section 5.5.4.
3. Invariance is built into the pre-processing by extracting features that are invariant under the required transformations. Any subsequent regression or classification system that uses such features as inputs will necessarily also respect these invariances.
4. The final option is to build the invariance properties into the structure of a neural network (or into the definition of a kernel function in the case of techniques such as the relevance vector machine). One way to achieve this is through the use of local receptive fields and shared weights, as discussed in the context of convolutional neural networks in Section 5.5.6.

Approach 1 is often relatively easy to implement and can be used to encourage complex invariances such as those illustrated in Figure 5.14; a simple code sketch is given below. For sequential training algorithms, this can be done by transforming each input pattern before it is presented to the model so that, if the patterns are being recycled, a different transformation (drawn from an appropriate distribution) is added each time. For batch methods, a similar effect can be achieved by replicating each data point a number of times and transforming each copy independently. The use of such augmented data can lead to significant improvements in generalization (Simard et al., 2003), although it can also be computationally costly.

Approach 2 leaves the data set unchanged but modifies the error function through the addition of a regularizer. In Section 5.5.5, we shall show that this approach is closely related to approach 1.
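The sketch below illustrates approach 1 for translation invariance by replicating each image several times with random shifts; it assumes batches of 2-D image arrays and uses scipy.ndimage.shift for the translation, with all parameter values illustrative.

```python
import numpy as np
from scipy.ndimage import shift

rng = np.random.default_rng(9)

def augment(images, copies=5, max_shift=2):
    """Replicate each image several times with random integer translations
    (approach 1 applied to translation invariance; illustrative only)."""
    out = []
    for img in images:
        out.append(img)
        for _ in range(copies):
            dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
            out.append(shift(img, (dy, dx), order=0, cval=0.0))  # pad with zeros
    return np.stack(out)

# e.g. for a batch of 28x28 digit images:
images = rng.random((3, 28, 28))
augmented = augment(images)     # shape (18, 28, 28)
```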

283 Figure 5.14: Illustration of the synthetic warping of a handwritten digit. The original image is shown on the left. On the right, the top row shows three examples of warped digits, with the corresponding displacement fields shown on the bottom row. These displacement fields are generated by sampling random displacements $\Delta x, \Delta y \in (0, 1)$ at each pixel and then smoothing by convolution with Gaussians of width $0.01$, $30$ and $60$ respectively.

One advantage of approach 3 is that it can correctly extrapolate well beyond the range of transformations included in the training set. However, it can be difficult to find hand-crafted features with the required invariances that do not also discard information that can be useful for discrimination.

5.5.4 Tangent propagation
We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation (Simard et al., 1992). Consider the effect of a transformation on a particular input vector $\mathbf{x}_n$. Provided the transformation is continuous (such as translation or rotation, but not mirror reflection for instance), then the transformed pattern will sweep out a manifold $\mathcal{M}$ within the $D$-dimensional input space. This is illustrated in Figure 5.15, for the case of $D = 2$ for simplicity.

Figure 5.15: Illustration of a two-dimensional input space showing the effect of a continuous transformation on a particular input vector $\mathbf{x}_n$. A one-dimensional transformation, parameterized by the continuous variable $\xi$, applied to $\mathbf{x}_n$ causes it to sweep out a one-dimensional manifold $\mathcal{M}$. Locally, the effect of the transformation can be approximated by the tangent vector $\boldsymbol{\tau}_n$.

Suppose the transformation is governed by a single parameter $\xi$ (which might be rotation angle for instance). Then the subspace $\mathcal{M}$ swept out by $\mathbf{x}_n$

284 will be one-dimensional, and will be parameterized by $\xi$. Let the vector that results from acting on $\mathbf{x}_n$ by this transformation be denoted by $\mathbf{s}(\mathbf{x}_n, \xi)$, which is defined so that $\mathbf{s}(\mathbf{x}, 0) = \mathbf{x}$. Then the tangent to the curve $\mathcal{M}$ is given by the directional derivative $\partial\mathbf{s}/\partial\xi$, and the tangent vector at the point $\mathbf{x}_n$ is given by
$$ \boldsymbol{\tau}_n = \left.\frac{\partial \mathbf{s}(\mathbf{x}_n, \xi)}{\partial \xi}\right|_{\xi=0}. \tag{5.125} $$
Under a transformation of the input vector, the network output vector will, in general, change. The derivative of output $y_k$ with respect to $\xi$ is given by
$$ \left.\frac{\partial y_k}{\partial \xi}\right|_{\xi=0} = \sum_{i=1}^{D} \left.\frac{\partial y_k}{\partial x_i}\frac{\partial x_i}{\partial \xi}\right|_{\xi=0} = \sum_{i=1}^{D} J_{ki}\,\tau_i \tag{5.126} $$
where $J_{ki}$ is the $(k, i)$ element of the Jacobian matrix $\mathbf{J}$, as discussed in Section 5.3.4. The result (5.126) can be used to modify the standard error function, so as to encourage local invariance in the neighbourhood of the data points, by the addition to the original error function $E$ of a regularization function $\Omega$ to give a total error function of the form
$$ \widetilde{E} = E + \lambda\Omega \tag{5.127} $$
where $\lambda$ is a regularization coefficient and
$$ \Omega = \frac{1}{2}\sum_n\sum_k \left( \left.\frac{\partial y_{nk}}{\partial \xi}\right|_{\xi=0} \right)^2 = \frac{1}{2}\sum_n\sum_k \left( \sum_{i=1}^{D} J_{nki}\,\tau_{ni} \right)^2. \tag{5.128} $$
The regularization function will be zero when the network mapping function is invariant under the transformation in the neighbourhood of each pattern vector, and the value of the parameter $\lambda$ determines the balance between fitting the training data and learning the invariance property.

In a practical implementation, the tangent vector $\boldsymbol{\tau}_n$ can be approximated using finite differences, by subtracting the original vector $\mathbf{x}_n$ from the corresponding vector after transformation using a small value of $\xi$, and then dividing by $\xi$. This is illustrated in Figure 5.16.

The regularization function depends on the network weights through the Jacobian $\mathbf{J}$. A backpropagation formalism for computing the derivatives of the regularizer with respect to the network weights is easily obtained by extension of the techniques introduced in Section 5.3 (Exercise 5.26).

If the transformation is governed by $L$ parameters (e.g., $L = 3$ for the case of translations combined with in-plane rotations in a two-dimensional image), then the manifold $\mathcal{M}$ will have dimensionality $L$, and the corresponding regularizer is given by the sum of terms of the form (5.128), one for each transformation. If several transformations are considered at the same time, and the network mapping is made invariant to each separately, then it will be (locally) invariant to combinations of the transformations (Simard et al., 1992).
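The two ingredients of (5.128) — a finite-difference tangent vector and the product of the Jacobian with that tangent — can be sketched as follows for an in-plane rotation; the Jacobian itself is assumed available, for example from the sketch in Section 5.3.4, and scipy.ndimage.rotate supplies the transformation.

```python
import numpy as np
from scipy.ndimage import rotate

def tangent_vector(image, xi=0.1):
    """Finite-difference tangent vector for rotation by a small angle xi
    (degrees), following the approximation described above."""
    rotated = rotate(image, xi, reshape=False, order=1)
    return (rotated - image).ravel() / xi

def tangent_penalty(jacobian, tau):
    """Per-pattern contribution to Omega in eq. (5.128): 0.5 * sum_k (J tau)^2.

    jacobian : (K, D) array of dy_k/dx_i at this input
    tau      : (D,) tangent vector
    """
    Jtau = jacobian @ tau
    return 0.5 * np.sum(Jtau ** 2)
```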

285 Figure 5.16: Illustration showing (a) the original image $\mathbf{x}$ of a handwritten digit, (b) the tangent vector $\boldsymbol{\tau}$ corresponding to an infinitesimal clockwise rotation, (c) the result of adding a small contribution from the tangent vector to the original image, giving $\mathbf{x} + \epsilon\boldsymbol{\tau}$ with $\epsilon = 15$ degrees, and (d) the true image rotated for comparison.

A related technique, called tangent distance, can be used to build invariance properties into distance-based methods such as nearest-neighbour classifiers (Simard et al., 1993).

5.5.5 Training with transformed data
We have seen that one way to encourage invariance of a model to a set of transformations is to expand the training set using transformed versions of the original input patterns. Here we show that this approach is closely related to the technique of tangent propagation (Bishop, 1995b; Leen, 1995). As in Section 5.5.4, we shall consider a transformation governed by a single parameter $\xi$ and described by the function $\mathbf{s}(\mathbf{x}, \xi)$, with $\mathbf{s}(\mathbf{x}, 0) = \mathbf{x}$. We shall also consider a sum-of-squares error function. The error function for untransformed inputs can be written (in the infinite data set limit) in the form
$$ E = \frac{1}{2}\iint \{y(\mathbf{x}) - t\}^2\, p(t \mid \mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}\, \mathrm{d}t \tag{5.129} $$
as discussed in Section 1.5.5. Here we have considered a network having a single output, in order to keep the notation uncluttered. If we now consider an infinite number of copies of each data point, each of which is perturbed by the transformation

286 in which the parameter $\xi$ is drawn from a distribution $p(\xi)$, then the error function defined over this expanded data set can be written as
$$ \widetilde{E} = \frac{1}{2}\iiint \{y(\mathbf{s}(\mathbf{x}, \xi)) - t\}^2\, p(t \mid \mathbf{x})\, p(\mathbf{x})\, p(\xi)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t\, \mathrm{d}\xi. \tag{5.130} $$
We now assume that the distribution $p(\xi)$ has zero mean with small variance, so that we are only considering small transformations of the original input vectors. We can then expand the transformation function as a Taylor series in powers of $\xi$ to give
$$ \mathbf{s}(\mathbf{x}, \xi) = \mathbf{s}(\mathbf{x}, 0) + \xi\left.\frac{\partial}{\partial\xi}\mathbf{s}(\mathbf{x}, \xi)\right|_{\xi=0} + \frac{\xi^2}{2}\left.\frac{\partial^2}{\partial\xi^2}\mathbf{s}(\mathbf{x}, \xi)\right|_{\xi=0} + O(\xi^3) = \mathbf{x} + \xi\boldsymbol{\tau} + \frac{1}{2}\xi^2\boldsymbol{\tau}' + O(\xi^3) $$
where $\boldsymbol{\tau}'$ denotes the second derivative of $\mathbf{s}(\mathbf{x}, \xi)$ with respect to $\xi$ evaluated at $\xi = 0$. This allows us to expand the model function to give
$$ y(\mathbf{s}(\mathbf{x}, \xi)) = y(\mathbf{x}) + \xi\,\boldsymbol{\tau}^{\mathrm T}\nabla y(\mathbf{x}) + \frac{\xi^2}{2}\left[ (\boldsymbol{\tau}')^{\mathrm T}\nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm T}\nabla\nabla y(\mathbf{x})\,\boldsymbol{\tau} \right] + O(\xi^3). $$
Substituting into the mean error function (5.130) and expanding, we then have
$$ \widetilde{E} = \frac{1}{2}\iint \{y(\mathbf{x}) - t\}^2\, p(t\mid\mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}t + \mathbb{E}[\xi]\iint \{y(\mathbf{x}) - t\}\,\boldsymbol{\tau}^{\mathrm T}\nabla y(\mathbf{x})\, p(t\mid\mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}t $$
$$ \qquad + \mathbb{E}[\xi^2]\iint \left[ \{y(\mathbf{x}) - t\}\,\frac{1}{2}\left\{ (\boldsymbol{\tau}')^{\mathrm T}\nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm T}\nabla\nabla y(\mathbf{x})\,\boldsymbol{\tau} \right\} + \frac{1}{2}\left( \boldsymbol{\tau}^{\mathrm T}\nabla y(\mathbf{x}) \right)^2 \right] p(t\mid\mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}t + O(\xi^3). $$
Because the distribution of transformations has zero mean we have $\mathbb{E}[\xi] = 0$. Also, we shall denote $\mathbb{E}[\xi^2]$ by $\lambda$. Omitting terms of $O(\xi^3)$, the average error function then becomes
$$ \widetilde{E} = E + \lambda\Omega \tag{5.131} $$
where $E$ is the original sum-of-squares error, and the regularization term $\Omega$ takes the form
$$ \Omega = \int \left[ \{y(\mathbf{x}) - \mathbb{E}[t\mid\mathbf{x}]\}\,\frac{1}{2}\left\{ (\boldsymbol{\tau}')^{\mathrm T}\nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm T}\nabla\nabla y(\mathbf{x})\,\boldsymbol{\tau} \right\} + \frac{1}{2}\left( \boldsymbol{\tau}^{\mathrm T}\nabla y(\mathbf{x}) \right)^2 \right] p(\mathbf{x})\,\mathrm{d}\mathbf{x} \tag{5.132} $$
in which we have performed the integration over $t$.

287 We can further simplify this regularization term as follows. In Section 1.5.5 we saw that the function that minimizes the sum-of-squares error is given by the conditional average $\mathbb{E}[t\mid\mathbf{x}]$ of the target values $t$. From (5.131) we see that the regularized error will equal the unregularized sum-of-squares plus terms which are $O(\xi)$, and so the network function that minimizes the total error will have the form
$$ y(\mathbf{x}) = \mathbb{E}[t\mid\mathbf{x}] + O(\xi). \tag{5.133} $$
Thus, to leading order in $\xi$, the first term in the regularizer vanishes and we are left with
$$ \Omega = \frac{1}{2}\int \left( \boldsymbol{\tau}^{\mathrm T}\nabla y(\mathbf{x}) \right)^2 p(\mathbf{x})\,\mathrm{d}\mathbf{x} \tag{5.134} $$
which is equivalent to the tangent propagation regularizer (5.128).

If we consider the special case in which the transformation of the inputs simply consists of the addition of random noise, so that $\mathbf{x} \rightarrow \mathbf{x} + \boldsymbol{\xi}$, then the regularizer takes the form (Exercise 5.27)
$$ \Omega = \frac{1}{2}\int \|\nabla y(\mathbf{x})\|^2\, p(\mathbf{x})\,\mathrm{d}\mathbf{x} \tag{5.135} $$
which is known as Tikhonov regularization (Tikhonov and Arsenin, 1977; Bishop, 1995b). Derivatives of this regularizer with respect to the network weights can be found using an extended backpropagation algorithm (Bishop, 1993). We see that, for small noise amplitudes, Tikhonov regularization is related to the addition of random noise to the inputs, which has been shown to improve generalization in appropriate circumstances (Sietsma and Dow, 1991).
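The Tikhonov penalty (5.135) can be estimated by Monte Carlo over the inputs, with the input gradient obtained by central differences; the sketch below is generic over any scalar function and is demonstrated on a quadratic whose penalty is known in closed form.

```python
import numpy as np

def tikhonov_penalty(f, X, eps=1e-5):
    """Monte Carlo estimate of Omega in eq. (5.135) for a scalar function f.

    f : callable mapping a 1-D input array to a scalar output y(x)
    X : (N, D) array of inputs sampled from p(x)
    """
    total = 0.0
    for x in X:
        grad = np.zeros_like(x)
        for i in range(x.size):
            xp, xm = x.copy(), x.copy()
            xp[i] += eps
            xm[i] -= eps
            grad[i] = (f(xp) - f(xm)) / (2 * eps)   # central differences
        total += 0.5 * np.sum(grad ** 2)
    return total / len(X)

# Example: y(x) = x^T x has gradient 2x, so Omega = E[2 ||x||^2] = 2D here
X = np.random.default_rng(10).normal(size=(100, 3))
print(tikhonov_penalty(lambda x: x @ x, X))   # ≈ 6 for standard normal inputs
```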

288 5.5.6 Convolutional networks
Another approach to creating models that are invariant to certain transformations of the inputs is to build the invariance properties into the structure of a neural network. This is the basis for the convolutional neural network (Le Cun et al., 1989; LeCun et al., 1998), which has been widely applied to image data.

Consider the specific task of recognizing handwritten digits. Each input image comprises a set of pixel intensity values, and the desired output is a posterior probability distribution over the ten digit classes. We know that the identity of the digit is invariant under translations and scaling as well as (small) rotations. Furthermore, the network must also exhibit invariance to more subtle transformations such as elastic deformations of the kind illustrated in Figure 5.14. One simple approach would be to treat the image as the input to a fully connected network, such as the kind shown in Figure 5.1. Given a sufficiently large training set, such a network could in principle yield a good solution to this problem and would learn the appropriate invariances by example.

However, this approach ignores a key property of images, which is that nearby pixels are more strongly correlated than more distant pixels. Many of the modern approaches to computer vision exploit this property by extracting local features that depend only on small subregions of the image. Information from such features can then be merged in later stages of processing in order to detect higher-order features and ultimately to yield information about the image as a whole. Also, local features that are useful in one region of the image are likely to be useful in other regions of the image, for instance if the object of interest is translated.

These notions are incorporated into convolutional neural networks through three mechanisms: (i) local receptive fields, (ii) weight sharing, and (iii) subsampling. The structure of a convolutional network is illustrated in Figure 5.17.

Figure 5.17: Diagram illustrating part of a convolutional neural network, showing a layer of convolutional units followed by a layer of subsampling units. Several successive pairs of such layers may be used.

In the convolutional layer the units are organized into planes, each of which is called a feature map. Units in a feature map each take inputs only from a small subregion of the image, and all of the units in a feature map are constrained to share the same weight values. For instance, a feature map might consist of 100 units arranged in a $10 \times 10$ grid, with each unit taking inputs from a $5 \times 5$ pixel patch of the image. The whole feature map therefore has 25 adjustable weight parameters plus one adjustable bias parameter. Input values from a patch are linearly combined using the weights and the bias, and the result transformed by a sigmoidal nonlinearity using (5.1). If we think of the units as feature detectors, then all of the units in a feature map detect the same pattern but at different locations in the input image. Due to the weight sharing, the evaluation of the activations of these units is equivalent to a convolution of the image pixel intensities with a 'kernel' comprising the weight parameters. If the input image is shifted, the activations of the feature map will be shifted by the same amount but will otherwise be unchanged.
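A literal (unoptimized) sketch of one feature map makes both the weight sharing and the shift property explicit; the $14 \times 14$ image and $5 \times 5$ kernel sizes are chosen to match the $10 \times 10$ feature map of the example above, and all data are illustrative.

```python
import numpy as np

def feature_map(image, kernel, bias):
    """Activations of one convolutional feature map: every unit applies the
    same weights and bias to its own patch of the image (valid region)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.tanh(np.sum(patch * kernel) + bias)
    return out

rng = np.random.default_rng(11)
image = rng.random((14, 14))
kernel = rng.normal(scale=0.1, size=(5, 5))   # 25 shared weights
fm = feature_map(image, kernel, bias=0.0)     # 10 x 10 feature map

# Shifting the image shifts the feature map by the same amount
shifted = np.roll(image, 1, axis=1)
fm_shifted = feature_map(shifted, kernel, 0.0)
print(np.max(np.abs(fm_shifted[:, 1:] - fm[:, :-1])))   # exactly 0
```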

289 This provides the basis for the (approximate) invariance of the network outputs to translations and distortions of the input image. Because we will typically need to detect multiple features in order to build an effective model, there will generally be multiple feature maps in the convolutional layer, each having its own set of weight and bias parameters.

The outputs of the convolutional units form the inputs to the subsampling layer of the network. For each feature map in the convolutional layer, there is a plane of units in the subsampling layer and each unit takes inputs from a small receptive field in the corresponding feature map of the convolutional layer. These units perform subsampling. For instance, each subsampling unit might take inputs from a $2 \times 2$ unit region in the corresponding feature map and would compute the average of those inputs, multiplied by an adaptive weight with the addition of an adaptive bias parameter, and then transformed using a sigmoidal nonlinear activation function. The receptive fields are chosen to be contiguous and nonoverlapping so that there are half the number of rows and columns in the subsampling layer compared with the convolutional layer. In this way, the response of a unit in the subsampling layer will be relatively insensitive to small shifts of the image in the corresponding regions of the input space.

In a practical architecture, there may be several pairs of convolutional and subsampling layers. At each stage there is a larger degree of invariance to input transformations compared to the previous layer. There may be several feature maps in a given convolutional layer for each plane of units in the previous subsampling layer, so that the gradual reduction in spatial resolution is then compensated by an increasing number of features. The final layer of the network would typically be a fully connected, fully adaptive layer, with a softmax output nonlinearity in the case of multiclass classification.

The whole network can be trained by error minimization using backpropagation to evaluate the gradient of the error function. This involves a slight modification of the usual backpropagation algorithm to ensure that the shared-weight constraints are satisfied (Exercise 5.28). Due to the use of local receptive fields, the number of weights in the network is smaller than if the network were fully connected. Furthermore, the number of independent parameters to be learned from the data is much smaller still, due to the substantial numbers of constraints on the weights.
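The subsampling unit described above — average a $2 \times 2$ region, scale by an adaptive weight, add an adaptive bias, and squash — can be sketched in a few lines; the weight and bias values shown are placeholders.

```python
import numpy as np

def subsample(feature_map, weight, bias):
    """One subsampling unit per non-overlapping 2x2 region: average the four
    inputs, scale by an adaptive weight, add a bias, squash with tanh."""
    H, W = feature_map.shape
    blocks = feature_map[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    means = blocks.mean(axis=(1, 3))            # 2x2 averages
    return np.tanh(weight * means + bias)

fm = np.random.default_rng(12).random((10, 10))
sub = subsample(fm, weight=1.0, bias=0.0)       # shape (5, 5): half the rows/cols
```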

290 Recall that the simple weight decay regularizer, given in (5.112), can be viewed as the negative log of a Gaussian prior distribution over the weights. We can encourage the weight values to form several groups, rather than just one group, by considering instead a probability distribution that is a mixture of Gaussians (Section 2.3.9). The centres and variances of the Gaussian components, as well as the mixing coefficients, will be considered as adjustable parameters to be determined as part of the learning process. Thus, we have a probability density of the form

p(\mathbf{w}) = \prod_i p(w_i)     (5.136)

where

p(w_i) = \sum_{j=1}^{M} \pi_j \mathcal{N}(w_i \,|\, \mu_j, \sigma_j^2)     (5.137)

and the π_j are the mixing coefficients. Taking the negative logarithm then leads to a regularization function of the form

\Omega(\mathbf{w}) = -\sum_i \ln\left( \sum_{j=1}^{M} \pi_j \mathcal{N}(w_i \,|\, \mu_j, \sigma_j^2) \right).     (5.138)

The total error function is then given by

\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \Omega(\mathbf{w})     (5.139)

where λ is the regularization coefficient. This error is minimized both with respect to the weights w_i and with respect to the parameters {π_j, μ_j, σ_j} of the mixture model. If the weights were constant, then the parameters of the mixture model could be determined by using the EM algorithm discussed in Chapter 9. However, the distribution of weights is itself evolving during the learning process, and so to avoid numerical instability, a joint optimization is performed simultaneously over the weights and the mixture-model parameters. This can be done using a standard optimization algorithm such as conjugate gradients or quasi-Newton methods.

In order to minimize the total error function, it is necessary to be able to evaluate its derivatives with respect to the various adjustable parameters. To do this it is convenient to regard the {π_j} as prior probabilities and to introduce the corresponding posterior probabilities which, following (2.192), are given by Bayes' theorem in the form

\gamma_j(w) = \frac{\pi_j \mathcal{N}(w \,|\, \mu_j, \sigma_j^2)}{\sum_k \pi_k \mathcal{N}(w \,|\, \mu_k, \sigma_k^2)}.     (5.140)

The derivatives of the total error function with respect to the weights are then given by (Exercise 5.29)

\frac{\partial \widetilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda \sum_j \gamma_j(w_i) \frac{(w_i - \mu_j)}{\sigma_j^2}.     (5.141)

The effect of the regularization term is therefore to pull each weight towards the centre of the j-th Gaussian, with a force proportional to the posterior probability of that Gaussian for the given weight. This is precisely the kind of effect that we are seeking.
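For concreteness, here is a minimal Python sketch (not from the original text; variable names are arbitrary) of the regularizer (5.138) together with the responsibilities (5.140) and the regularization contribution to the weight gradient in (5.141).

    import numpy as np
    from scipy.stats import norm

    def soft_sharing(w, pi, mu, sigma):
        # dens[i, j] = N(w_i | mu_j, sigma_j^2)
        dens = norm.pdf(w[:, None], loc=mu[None, :], scale=sigma[None, :])
        mix = dens @ pi                              # sum_j pi_j N(w_i | ...)
        omega = -np.sum(np.log(mix))                 # regularizer, eq. (5.138)
        gamma = (pi * dens) / mix[:, None]           # posteriors, eq. (5.140)
        # regularization term of dE~/dw_i, eq. (5.141) with dE/dw_i omitted
        grad = np.sum(gamma * (w[:, None] - mu) / sigma**2, axis=1)
        return omega, gamma, grad

    w = np.array([-0.9, -1.1, 0.05, 1.0, 0.95])
    pi = np.full(3, 1.0 / 3.0)
    mu = np.array([-1.0, 0.0, 1.0])
    sigma = np.full(3, 0.2)
    omega, gamma, grad = soft_sharing(w, pi, mu, sigma)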

291 Derivatives of the error with respect to the centres of the Gaussians are also easily computed (Exercise 5.30) to give

\frac{\partial \widetilde{E}}{\partial \mu_j} = \lambda \sum_i \gamma_j(w_i) \frac{(\mu_j - w_i)}{\sigma_j^2}     (5.142)

which has a simple intuitive interpretation, because it pushes μ_j towards an average of the weight values, weighted by the posterior probabilities that the respective weight parameters were generated by component j. Similarly, the derivatives with respect to the variances (Exercise 5.31) are given by

\frac{\partial \widetilde{E}}{\partial \sigma_j} = \lambda \sum_i \gamma_j(w_i) \left( \frac{1}{\sigma_j} - \frac{(w_i - \mu_j)^2}{\sigma_j^3} \right)     (5.143)

which drives σ_j towards the weighted average of the squared deviations of the weights around the corresponding centre μ_j, where the weighting coefficients are again given by the posterior probability that each weight is generated by component j. Note that in a practical implementation, new variables η_j defined by

\sigma_j^2 = \exp(\eta_j)     (5.144)

are introduced, and the minimization is performed with respect to the η_j. This ensures that the parameters σ_j remain positive. It also has the effect of discouraging pathological solutions in which one or more of the σ_j goes to zero, corresponding to a Gaussian component collapsing onto one of the weight parameter values. Such solutions are discussed in more detail in the context of Gaussian mixture models in Section 9.2.1.

For the derivatives with respect to the mixing coefficients π_j, we need to take account of the constraints

\sum_j \pi_j = 1, \qquad 0 \leqslant \pi_j \leqslant 1     (5.145)

which follow from the interpretation of the π_j as prior probabilities. This can be done by expressing the mixing coefficients in terms of a set of auxiliary variables {η_j} using the softmax function given by

\pi_j = \frac{\exp(\eta_j)}{\sum_{k=1}^{M} \exp(\eta_k)}.     (5.146)

The derivatives of the regularized error function with respect to the {η_j} then take the form (Exercise 5.32)

292 Figure 5.18 The left figure shows a two-link robot arm, in which the Cartesian coordinates (x_1, x_2) of the end effector are determined uniquely by the two joint angles θ_1 and θ_2 and the (fixed) lengths L_1 and L_2 of the arms. This is known as the forward kinematics of the arm. In practice, we have to find the joint angles that will give rise to a desired end effector position and, as shown in the right figure, this inverse kinematics has two solutions corresponding to 'elbow up' and 'elbow down'.

\frac{\partial \widetilde{E}}{\partial \eta_j} = \sum_i \left\{ \pi_j - \gamma_j(w_i) \right\}.     (5.147)

We see that π_j is therefore driven towards the average posterior probability for component j.

5.6. Mixture Density Networks

The goal of supervised learning is to model a conditional distribution p(t|x), which for many simple regression problems is chosen to be Gaussian. However, practical machine learning problems can often have significantly non-Gaussian distributions. These can arise, for example, with inverse problems in which the distribution can be multimodal, in which case the Gaussian assumption can lead to very poor predictions.

As a simple example of an inverse problem, consider the kinematics of a robot arm, as illustrated in Figure 5.18. The forward problem involves finding the end effector position given the joint angles and has a unique solution (Exercise 5.33). However, in practice we wish to move the end effector of the robot to a specific position, and to do this we must set appropriate joint angles. We therefore need to solve the inverse problem, which has two solutions as seen in Figure 5.18.

Forward problems often correspond to causality in a physical system and generally have a unique solution. For instance, a specific pattern of symptoms in the human body may be caused by the presence of a particular disease. In pattern recognition, however, we typically have to solve an inverse problem, such as trying to predict the presence of a disease given a set of symptoms. If the forward problem involves a many-to-one mapping, then the inverse problem will have multiple solutions. For instance, several different diseases may result in the same symptoms.

In the robotics example, the kinematics is defined by geometrical equations, and the multimodality is readily apparent. However, in many machine learning problems the presence of multimodality, particularly in problems involving spaces of high dimensionality, can be less obvious. For tutorial purposes, however, we shall consider a simple toy problem for which we can easily visualize the multimodality. Data for this problem is generated by sampling a variable x uniformly over the interval (0, 1), to give a set of values {x_n}, and the corresponding target values t_n are obtained

293 Figure 5.19 On the left is the data set for a simple 'forward problem' in which the red curve shows the result of fitting a two-layer neural network by minimizing the sum-of-squares error function. The corresponding inverse problem, shown on the right, is obtained by exchanging the roles of x and t. Here the same network trained again by minimizing the sum-of-squares error function gives a very poor fit to the data due to the multimodality of the data set.

by computing the function x_n + 0.3 sin(2πx_n) and then adding uniform noise over the interval (−0.1, 0.1). The inverse problem is then obtained by keeping the same data points but exchanging the roles of x and t. Figure 5.19 shows the data sets for the forward and inverse problems, along with the results of fitting two-layer neural networks having 6 hidden units and a single linear output unit by minimizing a sum-of-squares error function. Least squares corresponds to maximum likelihood under a Gaussian assumption. We see that this leads to a very poor model for the highly non-Gaussian inverse problem.

We therefore seek a general framework for modelling conditional probability distributions. This can be achieved by using a mixture model for p(t|x) in which both the mixing coefficients as well as the component densities are flexible functions of the input vector x, giving rise to the mixture density network. For any given value of x, the mixture model provides a general formalism for modelling an arbitrary conditional density function p(t|x). Provided we consider a sufficiently flexible network, we then have a framework for approximating arbitrary conditional distributions. Here we shall develop the model explicitly for Gaussian components, so that

p(\mathbf{t}|\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x}) \, \mathcal{N}\left(\mathbf{t} \,|\, \boldsymbol{\mu}_k(\mathbf{x}), \sigma_k^2(\mathbf{x})\right).     (5.148)

This is an example of a heteroscedastic model since the noise variance on the data is a function of the input vector x. Instead of Gaussians, we can use other distributions for the components, such as Bernoulli distributions if the target variables are binary rather than continuous. We have also specialized to the case of isotropic covariances for the components, although the mixture density network can readily be extended to allow for general covariance matrices by representing the covariances using a Cholesky factorization (Williams, 1996). Even with isotropic components, the conditional distribution p(t|x) does not assume factorization with respect to the components of t (in contrast to the standard sum-of-squares regression model) as a consequence of the mixture distribution.
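The toy data set described above is straightforward to reproduce; the following sketch follows the stated recipe (the sample size and random seed are arbitrary choices, not taken from the text).

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200
    x = rng.uniform(0.0, 1.0, N)              # x_n ~ U(0, 1)
    t = (x + 0.3 * np.sin(2 * np.pi * x)
         + rng.uniform(-0.1, 0.1, N))         # forward-problem targets
    x_inv, t_inv = t, x                       # inverse problem: swap x and t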

294 Figure 5.20 The mixture density network can represent general conditional probability densities p(t|x) by considering a parametric mixture model for the distribution of t whose parameters are determined by the outputs of a neural network that takes x as its input vector.

We now take the various parameters of the mixture model, namely the mixing coefficients π_k(x), the means μ_k(x), and the variances σ_k^2(x), to be governed by the outputs of a conventional neural network that takes x as its input. The structure of this mixture density network is illustrated in Figure 5.20. The mixture density network is closely related to the mixture of experts discussed in Section 14.5.3. The principal difference is that in the mixture density network the same function is used to predict the parameters of all of the component densities as well as the mixing coefficients, and so the nonlinear hidden units are shared amongst the input-dependent functions.

The neural network in Figure 5.20 can, for example, be a two-layer network having sigmoidal ('tanh') hidden units. If there are K components in the mixture model (5.148), and if t has L components, then the network will have K output-unit activations denoted by a_k^π that determine the mixing coefficients π_k(x), K outputs denoted by a_k^σ that determine the kernel widths σ_k(x), and L × K outputs denoted by a_kj^μ that determine the components μ_kj(x) of the kernel centres μ_k(x). The total number of network outputs is given by (L + 2)K, as compared with the usual L outputs for a network, which simply predicts the conditional means of the target variables.

The mixing coefficients must satisfy the constraints

\sum_{k=1}^{K} \pi_k(\mathbf{x}) = 1, \qquad 0 \leqslant \pi_k(\mathbf{x}) \leqslant 1     (5.149)

which can be achieved using a set of softmax outputs

\pi_k(\mathbf{x}) = \frac{\exp(a_k^{\pi})}{\sum_{l=1}^{K} \exp(a_l^{\pi})}.     (5.150)

Similarly, the variances must satisfy σ_k^2(x) ≥ 0 and so can be represented in terms of the exponentials of the corresponding network activations using

\sigma_k(\mathbf{x}) = \exp(a_k^{\sigma}).     (5.151)

Finally, because the means μ_k(x) have real components, they can be represented directly by the network output activations

\mu_{kj}(\mathbf{x}) = a_{kj}^{\mu}.     (5.152)
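A minimal Python sketch (illustrative only; the layout of the activation vector is an assumption) of the mapping (5.150)-(5.152) from the (L + 2)K raw network activations to valid mixture parameters:

    import numpy as np

    def mdn_parameters(a, K, L):
        # Split the (L + 2) * K raw activations into the three groups and
        # apply (5.150)-(5.152): softmax for pi, exp for sigma, identity for mu.
        a_pi, a_sigma, a_mu = a[:K], a[K:2 * K], a[2 * K:].reshape(K, L)
        e = np.exp(a_pi - a_pi.max())
        pi = e / e.sum()                 # mixing coefficients, eq. (5.150)
        sigma = np.exp(a_sigma)          # positive widths, eq. (5.151)
        mu = a_mu                        # unconstrained means, eq. (5.152)
        return pi, sigma, mu

    raw = np.random.default_rng(1).standard_normal(9)   # K = 3, L = 1 gives 9
    pi, sigma, mu = mdn_parameters(raw, K=3, L=1)

With K = 3 and L = 1 this reproduces the nine-output configuration used for the toy example in Figure 5.21.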

295 The adaptive parameters of the mixture density network comprise the vector w of weights and biases in the neural network, that can be set by maximum likelihood, or equivalently by minimizing an error function defined to be the negative logarithm of the likelihood. For independent data, this error function takes the form

E(\mathbf{w}) = -\sum_{n=1}^{N} \ln\left\{ \sum_{k=1}^{K} \pi_k(\mathbf{x}_n, \mathbf{w}) \, \mathcal{N}\left(\mathbf{t}_n \,|\, \boldsymbol{\mu}_k(\mathbf{x}_n, \mathbf{w}), \sigma_k^2(\mathbf{x}_n, \mathbf{w})\right) \right\}     (5.153)

where we have made the dependencies on w explicit.

In order to minimize the error function, we need to calculate the derivatives of the error E(w) with respect to the components of w. These can be evaluated by using the standard backpropagation procedure, provided we obtain suitable expressions for the derivatives of the error with respect to the output-unit activations. These represent error signals δ for each pattern and for each output unit, and can be backpropagated to the hidden units and the error function derivatives evaluated in the usual way. Because the error function (5.153) is composed of a sum of terms, one for each training data point, we can consider the derivatives for a particular pattern n and then find the derivatives of E by summing over all patterns.

Because we are dealing with mixture distributions, it is convenient to view the mixing coefficients π_k(x) as x-dependent prior probabilities and to introduce the corresponding posterior probabilities given by

\gamma_k(\mathbf{t}|\mathbf{x}) = \frac{\pi_k N_{nk}}{\sum_{l=1}^{K} \pi_l N_{nl}}     (5.154)

where N_{nk} denotes \mathcal{N}(\mathbf{t}_n \,|\, \boldsymbol{\mu}_k(\mathbf{x}_n), \sigma_k^2(\mathbf{x}_n)).

The derivatives with respect to the network output activations governing the mixing coefficients are given by (Exercise 5.34)

\frac{\partial E_n}{\partial a_k^{\pi}} = \pi_k - \gamma_k.     (5.155)

Similarly, the derivatives with respect to the output activations controlling the component means are given by (Exercise 5.35)

\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \gamma_k \left\{ \frac{\mu_{kl} - t_l}{\sigma_k^2} \right\}.     (5.156)

Finally, the derivatives with respect to the output activations controlling the component variances are given by (Exercise 5.36)

\frac{\partial E_n}{\partial a_k^{\sigma}} = -\gamma_k \left\{ \frac{\|\mathbf{t} - \boldsymbol{\mu}_k\|^2}{\sigma_k^3} - \frac{1}{\sigma_k} \right\}.     (5.157)
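A numerically stable Python sketch (not from the text) of the per-pattern error in (5.153), evaluated via the log-sum-exp trick, together with the posteriors (5.154):

    import numpy as np

    def mdn_nll(t, pi, sigma, mu):
        # -ln sum_k pi_k N(t | mu_k, sigma_k^2 I) for one L-dimensional target
        # t, with isotropic Gaussian components; cf. eq. (5.153).
        L = t.shape[0]
        sq = np.sum((t - mu) ** 2, axis=1)           # ||t - mu_k||^2
        log_comp = (np.log(pi) - L * np.log(sigma)
                    - 0.5 * L * np.log(2 * np.pi) - 0.5 * sq / sigma**2)
        m = log_comp.max()
        nll = -(m + np.log(np.sum(np.exp(log_comp - m))))
        gamma = np.exp(log_comp - m)
        gamma /= gamma.sum()                          # posteriors, eq. (5.154)
        return nll, gamma

Working in the log domain avoids underflow when a component density is very small, which is common once the widths σ_k shrink during training.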

296 Figure 5.21 (a) Plot of the mixing coefficients π_k(x) as a function of x for the three kernel functions in a mixture density network trained on the data shown in Figure 5.19. The model has three Gaussian components, and uses a two-layer multilayer perceptron with five 'tanh' sigmoidal units in the hidden layer, and nine outputs (corresponding to the 3 means and 3 variances of the Gaussian components and the 3 mixing coefficients). At both small and large values of x, where the conditional probability density of the target data is unimodal, only one of the kernels has a high value for its prior probability, while at intermediate values of x, where the conditional density is trimodal, the three mixing coefficients have comparable values. (b) Plots of the means μ_k(x) using the same colour coding as for the mixing coefficients. (c) Plot of the contours of the corresponding conditional probability density of the target data for the same mixture density network. (d) Plot of the approximate conditional mode, shown by the red points, of the conditional density.

We illustrate the use of a mixture density network by returning to the toy example of an inverse problem shown in Figure 5.19. Plots of the mixing coefficients π_k(x), the means μ_k(x), and the conditional density contours corresponding to p(t|x), are shown in Figure 5.21. The outputs of the neural network, and hence the parameters in the mixture model, are necessarily continuous single-valued functions of the input variables. However, we see from Figure 5.21(c) that the model is able to produce a conditional density that is unimodal for some values of x and trimodal for other values by modulating the amplitudes of the mixing components π_k(x).

Once a mixture density network has been trained, it can predict the conditional density function of the target data for any given value of the input vector. This conditional density represents a complete description of the generator of the data, so far as the problem of predicting the value of the output vector is concerned. From this density function we can calculate more specific quantities that may be of interest in different applications. One of the simplest of these is the mean, corresponding to the conditional average of the target data, and is given by

\mathbb{E}[\mathbf{t}|\mathbf{x}] = \int \mathbf{t} \, p(\mathbf{t}|\mathbf{x}) \, \mathrm{d}\mathbf{t} = \sum_{k=1}^{K} \pi_k(\mathbf{x}) \, \boldsymbol{\mu}_k(\mathbf{x})     (5.158)

297 where we have used (5.148). Because a standard network trained by least squares is approximating the conditional mean, we see that a mixture density network can reproduce the conventional least-squares result as a special case. Of course, as we have already noted, for a multimodal distribution the conditional mean is of limited value.

We can similarly evaluate the variance of the density function about the conditional average (Exercise 5.37), to give

s^2(\mathbf{x}) = \mathbb{E}\left[ \|\mathbf{t} - \mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 \,\big|\, \mathbf{x} \right]     (5.159)

= \sum_{k=1}^{K} \pi_k(\mathbf{x}) \left\{ \sigma_k^2(\mathbf{x}) + \left\| \boldsymbol{\mu}_k(\mathbf{x}) - \sum_{l=1}^{K} \pi_l(\mathbf{x}) \boldsymbol{\mu}_l(\mathbf{x}) \right\|^2 \right\}     (5.160)

where we have used (5.148) and (5.158). This is more general than the corresponding least-squares result because the variance is a function of x.

We have seen that for multimodal distributions, the conditional mean can give a poor representation of the data. For instance, in controlling the simple robot arm shown in Figure 5.18, we need to pick one of the two possible joint angle settings in order to achieve the desired end-effector location, whereas the average of the two solutions is not itself a solution. In such cases, the conditional mode may be of more value. Because the conditional mode for the mixture density network does not have a simple analytical solution, this would require numerical iteration. A simple alternative is to take the mean of the most probable component (i.e., the one with the largest mixing coefficient) at each value of x. This is shown for the toy data set in Figure 5.21(d).
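The following Python sketch (illustrative, for the scalar-target case) collects these summaries: the conditional mean (5.158), the conditional variance (5.160), and the mean of the most probable component as an approximation to the mode.

    import numpy as np

    def mdn_statistics(pi, sigma, mu):
        # Scalar-target case: pi, sigma, mu are length-K arrays.
        mean = pi @ mu                                   # eq. (5.158)
        var = pi @ (sigma**2 + (mu - mean) ** 2)         # eq. (5.160)
        mode_approx = mu[np.argmax(pi)]  # mean of the most probable component
        return mean, var, mode_approx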

298 5.7. Bayesian Neural Networks

So far, our discussion of neural networks has focussed on the use of maximum likelihood to determine the network parameters (weights and biases). Regularized maximum likelihood can be interpreted as a MAP (maximum a posteriori) approach in which the regularizer can be viewed as the logarithm of a prior parameter distribution. However, in a Bayesian treatment we need to marginalize over the distribution of parameters in order to make predictions.

In Section 3.3, we developed a Bayesian solution for a simple linear regression model under the assumption of Gaussian noise. We saw that the posterior distribution, which is Gaussian, could be evaluated exactly and that the predictive distribution could also be found in closed form. In the case of a multilayered network, the highly nonlinear dependence of the network function on the parameter values means that an exact Bayesian treatment can no longer be found. In fact, the log of the posterior distribution will be nonconvex, corresponding to the multiple local minima in the error function.

The technique of variational inference, to be discussed in Chapter 10, has been applied to Bayesian neural networks using a factorized Gaussian approximation to the posterior distribution (Hinton and van Camp, 1993) and also using a full-covariance Gaussian (Barber and Bishop, 1998a; Barber and Bishop, 1998b). The most complete treatment, however, has been based on the Laplace approximation (MacKay, 1992c; MacKay, 1992b) and forms the basis for the discussion given here. We will approximate the posterior distribution by a Gaussian, centred at a mode of the true posterior. Furthermore, we shall assume that the covariance of this Gaussian is small so that the network function is approximately linear with respect to the parameters over the region of parameter space for which the posterior probability is significantly nonzero. With these two approximations, we will obtain models that are analogous to the linear regression and classification models discussed in earlier chapters and so we can exploit the results obtained there. We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units). To start with, we shall discuss the regression case and then later consider the modifications needed for solving classification tasks.

5.7.1 Posterior parameter distribution

Consider the problem of predicting a single continuous target variable t from a vector x of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution p(t|x) is Gaussian, with an x-dependent mean given by the output of a neural network model y(x, w), and with precision (inverse variance) β

p(t \,|\, \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\left(t \,|\, y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right).     (5.161)

Similarly, we shall choose a prior distribution over the weights w that is Gaussian of the form

p(\mathbf{w} \,|\, \alpha) = \mathcal{N}\left(\mathbf{w} \,|\, \mathbf{0}, \alpha^{-1}\mathbf{I}\right).     (5.162)

For an i.i.d. data set of N observations x_1, ..., x_N, with a corresponding set of target values D = {t_1, ..., t_N}, the likelihood function is given by

p(\mathcal{D} \,|\, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \,|\, y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}\right)     (5.163)

and so the resulting posterior distribution is then

p(\mathbf{w} \,|\, \mathcal{D}, \alpha, \beta) \propto p(\mathbf{w} \,|\, \alpha) \, p(\mathcal{D} \,|\, \mathbf{w}, \beta).     (5.164)

This posterior, as a consequence of the nonlinear dependence of y(x, w) on w, will be non-Gaussian.

We can find a Gaussian approximation to the posterior distribution by using the Laplace approximation. To do this, we must first find a (local) maximum of the posterior, and this must be done using iterative numerical optimization.

299 As usual, it is convenient to maximize the logarithm of the posterior, which can be written in the form

\ln p(\mathbf{w} \,|\, \mathcal{D}) = -\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}}\mathbf{w} - \frac{\beta}{2} \sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \right\}^2 + \text{const}     (5.165)

which corresponds to a regularized sum-of-squares error function. Assuming for the moment that α and β are fixed, we can find a maximum of the posterior, which we denote w_MAP, by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.

Having found a mode w_MAP, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution. From (5.165), this is given by

\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w} \,|\, \mathcal{D}, \alpha, \beta) = \alpha \mathbf{I} + \beta \mathbf{H}     (5.166)

where H is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of w. Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by

q(\mathbf{w} \,|\, \mathcal{D}) = \mathcal{N}\left(\mathbf{w} \,|\, \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}\right).     (5.167)

Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution

p(t \,|\, \mathbf{x}, \mathcal{D}) = \int p(t \,|\, \mathbf{x}, \mathbf{w}) \, q(\mathbf{w} \,|\, \mathcal{D}) \, \mathrm{d}\mathbf{w}.     (5.168)

However, even with the Gaussian approximation to the posterior, this integration is still analytically intractable due to the nonlinearity of the network function y(x, w) as a function of w. To make progress, we now assume that the posterior distribution has small variance compared with the characteristic scales of w over which y(x, w) is varying. This allows us to make a Taylor series expansion of the network function around w_MAP and retain only the linear terms

y(\mathbf{x}, \mathbf{w}) \simeq y(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}})     (5.169)

where we have defined

\mathbf{g} = \nabla_{\mathbf{w}} y(\mathbf{x}, \mathbf{w}) \big|_{\mathbf{w} = \mathbf{w}_{\mathrm{MAP}}}.     (5.170)

With this approximation, we now have a linear-Gaussian model with a Gaussian distribution for p(w) and a Gaussian for p(t|w) whose mean is a linear function of w of the form

p(t \,|\, \mathbf{x}, \mathbf{w}, \beta) \simeq \mathcal{N}\left(t \,|\, y(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}}), \beta^{-1}\right).     (5.171)

We can therefore make use of the general result (2.115) for the marginal p(t) to give (Exercise 5.38)

p(t \,|\, \mathbf{x}, \mathcal{D}, \alpha, \beta) = \mathcal{N}\left(t \,|\, y(\mathbf{x}, \mathbf{w}_{\mathrm{MAP}}), \sigma^2(\mathbf{x})\right)     (5.172)

where the input-dependent variance is given by

\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^{\mathrm{T}} \mathbf{A}^{-1} \mathbf{g}.     (5.173)

We see that the predictive distribution p(t|x, D) is a Gaussian whose mean is given by the network function y(x, w_MAP) with the parameters set to their MAP value. The variance has two terms, the first of which arises from the intrinsic noise on the target variable, whereas the second is an x-dependent term that expresses the uncertainty in the interpolant due to the uncertainty in the model parameters w. This should be compared with the corresponding predictive distribution for the linear regression model, given by (3.58) and (3.59).
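The whole procedure can be made concrete in a few lines. The following Python sketch (purely illustrative: the toy two-parameter 'network', the data, and the finite-difference Hessian are all assumptions made for the example, not the book's method) finds w_MAP by minimizing the negative of (5.165), forms the precision (5.166), and evaluates the predictive mean and variance (5.172)-(5.173).

    import numpy as np
    from scipy.optimize import minimize

    def net(x, w):                        # toy stand-in: y = w0 * tanh(w1 * x)
        return w[0] * np.tanh(w[1] * x)

    def net_grad(x, w):                   # g = grad_w y(x, w), eq. (5.170)
        return np.array([np.tanh(w[1] * x),
                         w[0] * x / np.cosh(w[1] * x) ** 2])

    def neg_log_post(w, X, T, alpha, beta):   # negative of (5.165), up to const
        return (0.5 * beta * np.sum((net(X, w) - T) ** 2)
                + 0.5 * alpha * w @ w)

    def hessian_fd(f, w, eps=1e-4):       # finite-difference Hessian of f at w
        n = len(w)
        H = np.empty((n, n))
        I = np.eye(n) * eps
        for i in range(n):
            for j in range(n):
                H[i, j] = (f(w + I[i] + I[j]) - f(w + I[i] - I[j])
                           - f(w - I[i] + I[j]) + f(w - I[i] - I[j])) / (4 * eps**2)
        return H

    rng = np.random.default_rng(1)
    X = np.linspace(-1.0, 1.0, 50)
    T = 0.8 * np.tanh(2.0 * X) + 0.05 * rng.standard_normal(50)
    alpha, beta = 0.1, 100.0

    w_map = minimize(neg_log_post, np.ones(2), args=(X, T, alpha, beta)).x
    H = hessian_fd(lambda w: 0.5 * np.sum((net(X, w) - T) ** 2), w_map)
    A = alpha * np.eye(2) + beta * H                    # precision, eq. (5.166)

    x_new = 0.3
    g = net_grad(x_new, w_map)
    mean = net(x_new, w_map)                            # predictive mean, (5.172)
    var = 1.0 / beta + g @ np.linalg.solve(A, g)        # predictive variance, (5.173)

In practice one would evaluate H exactly or via the outer-product approximation of Section 5.4 rather than by finite differences, which are used here only to keep the sketch short.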

300 5.7.2 Hyperparameter optimization

So far, we have assumed that the hyperparameters α and β are fixed and known. We can make use of the evidence framework, discussed in Section 3.5, together with the Gaussian approximation to the posterior obtained using the Laplace approximation, to obtain a practical procedure for choosing the values of such hyperparameters.

The marginal likelihood, or evidence, for the hyperparameters is obtained by integrating over the network weights

p(\mathcal{D} \,|\, \alpha, \beta) = \int p(\mathcal{D} \,|\, \mathbf{w}, \beta) \, p(\mathbf{w} \,|\, \alpha) \, \mathrm{d}\mathbf{w}.     (5.174)

This is easily evaluated by making use of the Laplace approximation result (4.135). Taking logarithms then gives (Exercise 5.39)

\ln p(\mathcal{D} \,|\, \alpha, \beta) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2} \ln |\mathbf{A}| + \frac{W}{2} \ln \alpha + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)     (5.175)

where W is the total number of parameters in w, and the regularized error function is defined by

E(\mathbf{w}_{\mathrm{MAP}}) = \frac{\beta}{2} \sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}_{\mathrm{MAP}}) - t_n \right\}^2 + \frac{\alpha}{2} \mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}}.     (5.176)

We see that this takes the same form as the corresponding result (3.86) for the linear regression model.

In the evidence framework, we make point estimates for α and β by maximizing ln p(D|α, β). Consider first the maximization with respect to α, which can be done by analogy with the linear regression case discussed in Section 3.5.2. We first define the eigenvalue equation

\beta \mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i     (5.177)

where H is the Hessian matrix comprising the second derivatives of the sum-of-squares error function, evaluated at w = w_MAP. By analogy with (3.92), we obtain

\alpha = \frac{\gamma}{\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}}}     (5.178)

301 where γ represents the effective number of parameters (Section 3.5.3) and is defined by

\gamma = \sum_{i=1}^{W} \frac{\lambda_i}{\alpha + \lambda_i}.     (5.179)

Note that this result was exact for the linear regression case. For the nonlinear neural network, however, it ignores the fact that changes in α will cause changes in the Hessian H, which in turn will change the eigenvalues. We have therefore implicitly ignored terms involving the derivatives of λ_i with respect to α.

Similarly, from (3.95) we see that maximizing the evidence with respect to β gives the re-estimation formula

\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}_{\mathrm{MAP}}) - t_n \right\}^2.     (5.180)

As with the linear model, we need to alternate between re-estimation of the hyperparameters α and β and updating of the posterior distribution. The situation with a neural network model is more complex, however, due to the multimodality of the posterior distribution. As a consequence, the solution for w_MAP found by maximizing the log posterior will depend on the initialization of w. Solutions that differ only as a consequence of the interchange and sign reversal symmetries in the hidden units (Section 5.1.1) are identical so far as predictions are concerned, and it is irrelevant which of the equivalent solutions is found. However, there may be inequivalent solutions as well, and these will generally yield different values for the optimized hyperparameters.

In order to compare different models, for example neural networks having different numbers of hidden units, we need to evaluate the model evidence p(D). This can be approximated by taking (5.175) and substituting the values of α and β obtained from the iterative optimization of these hyperparameters. A more careful evaluation is obtained by marginalizing over α and β, again by making a Gaussian approximation (MacKay, 1992c; Bishop, 1995a). In either case, it is necessary to evaluate the determinant |A| of the Hessian matrix. This can be problematic in practice because the determinant, unlike the trace, is sensitive to the small eigenvalues that are often difficult to determine accurately.

The Laplace approximation is based on a local quadratic expansion around a mode of the posterior distribution over weights. We have seen in Section 5.1.1 that any given mode in a two-layer network is a member of a set of M!2^M equivalent modes that differ by interchange and sign-change symmetries, where M is the number of hidden units. When comparing networks having different numbers of hidden units, this can be taken into account by multiplying the evidence by a factor of M!2^M.
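A compact Python sketch (illustrative only; it assumes the Hessian H of the sum-of-squares error at w_MAP and the corresponding predictions are already available) of one pass of the re-estimation updates (5.178)-(5.180) described above:

    import numpy as np

    def update_hyperparameters(w_map, H, y_pred, T, alpha, beta):
        # One pass of the evidence updates.
        lam = beta * np.linalg.eigvalsh(H)       # eigenvalues of beta*H, (5.177)
        gamma = np.sum(lam / (alpha + lam))      # effective parameters, (5.179)
        alpha_new = gamma / (w_map @ w_map)      # eq. (5.178)
        beta_new = (len(T) - gamma) / np.sum((y_pred - T) ** 2)   # eq. (5.180)
        return alpha_new, beta_new, gamma

In a full implementation this update would alternate with re-optimization of w_MAP, as the text describes.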

302 5.7.3 Bayesian neural networks for classification

So far, we have used the Laplace approximation to develop a Bayesian treatment of neural network regression models. We now discuss the modifications to this framework that arise when it is applied to classification. Here we shall consider a network having a single logistic sigmoid output corresponding to a two-class classification problem. The extension to networks with multiclass softmax outputs is straightforward (Exercise 5.40). We shall build extensively on the analogous results for linear classification models discussed in Section 4.5, and so we encourage the reader to familiarize themselves with that material before studying this section.

The log likelihood function for this model is given by

\ln p(\mathcal{D} \,|\, \mathbf{w}) = \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}     (5.181)

where t_n ∈ {0, 1} are the target values, and y_n ≡ y(x_n, w). Note that there is no hyperparameter β, because the data points are assumed to be correctly labelled. As before, the prior is taken to be an isotropic Gaussian of the form (5.162).

The first stage in applying the Laplace framework to this model is to initialize the hyperparameter α, and then to determine the parameter vector w by maximizing the log posterior distribution. This is equivalent to minimizing the regularized error function

E(\mathbf{w}) = -\ln p(\mathcal{D} \,|\, \mathbf{w}) + \frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}     (5.182)

and can be achieved using error backpropagation combined with standard optimization algorithms, as discussed in Section 5.3.

Having found a solution w_MAP for the weight vector, the next step is to evaluate the Hessian matrix H comprising the second derivatives of the negative log likelihood function. This can be done, for instance, using the exact method of Section 5.4.5, or using the outer product approximation given by (5.85). The second derivatives of the negative log posterior can again be written in the form (5.166), and the Gaussian approximation to the posterior is then given by (5.167).

To optimize the hyperparameter α, we again maximize the marginal likelihood, which is easily shown (Exercise 5.41) to take the form

\ln p(\mathcal{D} \,|\, \alpha) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2} \ln |\mathbf{A}| + \frac{W}{2} \ln \alpha + \text{const}     (5.183)

where the regularized error function is defined by

E(\mathbf{w}_{\mathrm{MAP}}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} + \frac{\alpha}{2} \mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}} \mathbf{w}_{\mathrm{MAP}}     (5.184)

in which y_n ≡ y(x_n, w_MAP). Maximizing this evidence function with respect to α again leads to the re-estimation equation given by (5.178).

The use of the evidence procedure to determine α is illustrated in Figure 5.22 for the synthetic two-dimensional data discussed in Appendix A.
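Before turning to the predictive distribution, the regularized objective of this classification case can be written down directly. The following sketch (not the book's code; `y` stands for whatever network outputs in (0, 1) are produced by the current weights) evaluates (5.181)-(5.182):

    import numpy as np

    def regularized_error(w, y, t, alpha):
        # E(w) = -ln p(D|w) + (alpha/2) w^T w, cf. eqs. (5.181)-(5.182).
        ce = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
        return ce + 0.5 * alpha * (w @ w)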

303 Figure 5.22 Illustration of the evidence framework applied to a synthetic two-class data set. The green curve shows the optimal decision boundary, the black curve shows the result of fitting a two-layer network with 8 hidden units by maximum likelihood, and the red curve shows the result of including a regularizer in which α is optimized using the evidence procedure, starting from the initial value α = 0. Note that the evidence procedure greatly reduces the over-fitting of the network.

Finally, we need the predictive distribution, which is defined by (5.168). Again, this integration is intractable due to the nonlinearity of the network function. The simplest approximation is to assume that the posterior distribution is very narrow and hence make the approximation

p(t \,|\, \mathbf{x}, \mathcal{D}) \simeq p(t \,|\, \mathbf{x}, \mathbf{w}_{\mathrm{MAP}}).     (5.185)

We can improve on this, however, by taking account of the variance of the posterior distribution. In this case, a linear approximation for the network outputs, as was used in the case of regression, would be inappropriate due to the logistic sigmoid output-unit activation function that constrains the output to lie in the range (0, 1). Instead, we make a linear approximation for the output-unit activation in the form

a(\mathbf{x}, \mathbf{w}) \simeq a_{\mathrm{MAP}}(\mathbf{x}) + \mathbf{b}^{\mathrm{T}}(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}})     (5.186)

where a_MAP(x) = a(x, w_MAP), and the vector b ≡ ∇a(x, w_MAP) can be found by backpropagation.

Because we now have a Gaussian approximation for the posterior distribution over w, and a model for a that is a linear function of w, we can now appeal to the results of Section 4.5.2. The distribution of output-unit activation values, induced by the distribution over network weights, is given by

p(a \,|\, \mathbf{x}, \mathcal{D}) = \int \delta\left( a - a_{\mathrm{MAP}}(\mathbf{x}) - \mathbf{b}^{\mathrm{T}}(\mathbf{x})(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}}) \right) q(\mathbf{w} \,|\, \mathcal{D}) \, \mathrm{d}\mathbf{w}     (5.187)

where q(w|D) is the Gaussian approximation to the posterior distribution given by (5.167). From Section 4.5.2, we see that this distribution is Gaussian with mean a_MAP ≡ a(x, w_MAP), and variance

\sigma_a^2(\mathbf{x}) = \mathbf{b}^{\mathrm{T}}(\mathbf{x}) \mathbf{A}^{-1} \mathbf{b}(\mathbf{x}).     (5.188)

Finally, to obtain the predictive distribution, we must marginalize over a using

p(t = 1 \,|\, \mathbf{x}, \mathcal{D}) = \int \sigma(a) \, p(a \,|\, \mathbf{x}, \mathcal{D}) \, \mathrm{d}a.     (5.189)

The convolution of a Gaussian with a logistic sigmoid is intractable. We therefore apply the approximation (4.153) to (5.189), which replaces the Gaussian mean a_MAP by a scaled version of itself, giving

p(t = 1 \,|\, \mathbf{x}, \mathcal{D}) = \sigma\left( \kappa(\sigma_a^2) \, a_{\mathrm{MAP}} \right)     (5.190)

where κ(·) is defined by (4.154). Recall that both σ_a^2 and a_MAP are functions of x.
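A short Python sketch (illustrative; it assumes a_MAP and its weight-gradient b have already been obtained, e.g. by backpropagation, along with the posterior precision A, and it uses the standard form κ(σ²) = (1 + πσ²/8)^{-1/2} from (4.154)):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def predictive_class_prob(a_map, b, A):
        # a_map: output activation at w_MAP; b: its gradient w.r.t. the
        # weights; A: the posterior precision matrix of (5.166).
        var_a = b @ np.linalg.solve(A, b)               # sigma_a^2, eq. (5.188)
        kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8)  # kappa of (4.154)
        return sigmoid(kappa * a_map)                   # eq. (5.190)

Because κ < 1 whenever the activation variance is nonzero, the predicted probabilities are pulled towards 0.5, which is exactly the broadening of the contours visible in Figure 5.23.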

304 Figure 5.23 An illustration of the Laplace approximation for a Bayesian neural network having 8 hidden units with 'tanh' activation functions and a single logistic-sigmoid output unit. The weight parameters were found using scaled conjugate gradients, and the hyperparameter α was optimized using the evidence framework. On the left is the result of using the simple approximation (5.185) based on a point estimate w_MAP of the parameters, in which the green curve shows the y = 0.5 decision boundary, and the other contours correspond to output probabilities of y = 0.1, 0.3, 0.7, and 0.9. On the right is the corresponding result obtained using (5.190). Note that the effect of marginalization is to spread out the contours and to make the predictions less confident, so that at each input point x, the posterior probabilities are shifted towards 0.5, while the y = 0.5 contour itself is unaffected.

Figure 5.23 shows an example of this framework applied to the synthetic classification data set described in Appendix A.

Exercises

5.1 ( ) Consider a two-layer network function of the form (5.7) in which the hidden-unit nonlinear activation functions g(·) are given by logistic sigmoid functions of the form

\sigma(a) = \left\{ 1 + \exp(-a) \right\}^{-1}.     (5.191)

Show that there exists an equivalent network, which computes exactly the same function, but with hidden unit activation functions given by tanh(a), where the tanh function is defined by (5.59). Hint: first find the relation between σ(a) and tanh(a), and then show that the parameters of the two networks differ by linear transformations.

5.2 ( ) www Show that maximizing the likelihood function under the conditional distribution (5.16) for a multioutput neural network is equivalent to minimizing the sum-of-squares error function (5.11).

305 5.3 ( ) Consider a regression problem involving multiple target variables in which it is assumed that the distribution of the targets, conditioned on the input vector x, is a Gaussian of the form

p(\mathbf{t} \,|\, \mathbf{x}, \mathbf{w}) = \mathcal{N}\left(\mathbf{t} \,|\, \mathbf{y}(\mathbf{x}, \mathbf{w}), \boldsymbol{\Sigma}\right)     (5.192)

where y(x, w) is the output of a neural network with input vector x and weight vector w, and Σ is the covariance of the assumed Gaussian noise on the targets. Given a set of independent observations of x and t, write down the error function that must be minimized in order to find the maximum likelihood solution for w, if we assume that Σ is fixed and known. Now assume that Σ is also to be determined from the data, and write down an expression for the maximum likelihood solution for Σ. Note that the optimizations of w and Σ are now coupled, in contrast to the case of independent target variables discussed in Section 5.2.

5.4 ( ) Consider a binary classification problem in which the target values are t ∈ {0, 1}, with a network output y(x, w) that represents p(t = 1|x), and suppose that there is a probability ε that the class label on a training data point has been incorrectly set. Assuming independent and identically distributed data, write down the error function corresponding to the negative log likelihood. Verify that the error function (5.21) is obtained when ε = 0. Note that this error function makes the model robust to incorrectly labelled data, in contrast to the usual error function.

5.5 ( ) www Show that maximizing likelihood for a multiclass neural network model in which the network outputs have the interpretation y_k(x, w) = p(t_k = 1|x) is equivalent to the minimization of the cross-entropy error function (5.24).

5.6 ( ) www Show that the derivative of the error function (5.21) with respect to the activation a_k for an output unit having a logistic sigmoid activation function satisfies (5.18).

5.7 ( ) Show that the derivative of the error function (5.24) with respect to the activation a_k for output units having a softmax activation function satisfies (5.18).

5.8 ( ) We saw in (4.88) that the derivative of the logistic sigmoid activation function can be expressed in terms of the function value itself. Derive the corresponding result for the 'tanh' activation function defined by (5.59).

5.9 ( ) www The error function (5.21) for binary classification problems was derived for a network having a logistic-sigmoid output activation function, so that 0 ≤ y(x, w) ≤ 1, and data having target values t ∈ {0, 1}. Derive the corresponding error function if we consider a network having an output −1 ≤ y(x, w) ≤ 1 and target values t = 1 for class C_1 and t = −1 for class C_2. What would be the appropriate choice of output unit activation function?

5.10 ( ) www Consider a Hessian matrix H with eigenvector equation (5.33). By setting the vector v in (5.39) equal to each of the eigenvectors u_i in turn, show that H is positive definite if, and only if, all of its eigenvalues are positive.

306 5.11 ( ) www Consider a quadratic error function defined by (5.32), in which the Hessian matrix H has an eigenvalue equation given by (5.33). Show that the contours of constant error are ellipses whose axes are aligned with the eigenvectors u_i, with lengths that are inversely proportional to the square root of the corresponding eigenvalues λ_i.

5.12 ( ) www By considering the local Taylor expansion (5.32) of an error function about a stationary point w*, show that the necessary and sufficient condition for the stationary point to be a local minimum of the error function is that the Hessian matrix H, defined by (5.30) with ŵ = w*, be positive definite.

5.13 ( ) Show that as a consequence of the symmetry of the Hessian matrix H, the number of independent elements in the quadratic error function (5.28) is given by W(W + 3)/2.

5.14 ( ) By making a Taylor expansion, verify that the terms that are O(ε) cancel on the right-hand side of (5.69).

5.15 ( ) In Section 5.3.4, we derived a procedure for evaluating the Jacobian matrix of a neural network using a backpropagation procedure. Derive an alternative formalism for finding the Jacobian based on forward propagation equations.

5.16 ( ) The outer product approximation to the Hessian matrix for a neural network using a sum-of-squares error function is given by (5.84). Extend this result to the case of multiple outputs.

5.17 ( ) Consider a squared loss function of the form

E = \frac{1}{2} \iint \left\{ y(\mathbf{x}, \mathbf{w}) - t \right\}^2 p(\mathbf{x}, t) \, \mathrm{d}\mathbf{x} \, \mathrm{d}t     (5.193)

where y(x, w) is a parametric function such as a neural network. The result (1.89) shows that the function y(x, w) that minimizes this error is given by the conditional expectation of t given x. Use this result to show that the second derivative of E with respect to two elements w_r and w_s of the vector w, is given by

\frac{\partial^2 E}{\partial w_r \partial w_s} = \int \frac{\partial y}{\partial w_r} \frac{\partial y}{\partial w_s} \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x}.     (5.194)

Note that, for a finite sample from p(x), we obtain (5.84).

5.18 ( ) Consider a two-layer network of the form shown in Figure 5.1 with the addition of extra parameters corresponding to skip-layer connections that go directly from the inputs to the outputs. By extending the discussion of Section 5.3.2, write down the equations for the derivatives of the error function with respect to these additional parameters.

5.19 ( ) www Derive the expression (5.85) for the outer product approximation to the Hessian matrix for a network having a single output with a logistic sigmoid output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.

307 5.20 ( ) Derive an expression for the outer product approximation to the Hessian matrix for a network having K outputs with a softmax output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.

5.21 ( ) Extend the expression (5.86) for the outer product approximation of the Hessian matrix to the case of K > 1 output units. Hence, derive a recursive expression analogous to (5.87) for incrementing the number N of patterns and a similar expression for incrementing the number K of outputs. Use these results, together with the identity (5.88), to find sequential update expressions analogous to (5.89) for finding the inverse of the Hessian by incrementally including both extra patterns and extra outputs.

5.22 ( ) Derive the results (5.93), (5.94), and (5.95) for the elements of the Hessian matrix of a two-layer feed-forward network by application of the chain rule of calculus.

5.23 ( ) Extend the results of Section 5.4.5 for the exact Hessian of a two-layer network to include skip-layer connections that go directly from inputs to outputs.

5.24 ( ) Verify that the network function defined by (5.113) and (5.114) is invariant under the transformation (5.115) applied to the inputs, provided the weights and biases are simultaneously transformed using (5.116) and (5.117). Similarly, show that the network outputs can be transformed according to (5.118) by applying the transformation (5.119) and (5.120) to the second-layer weights and biases.

5.25 ( ) www Consider a quadratic error function of the form

E = E_0 + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star})     (5.195)

where w* represents the minimum, and the Hessian matrix H is positive definite and constant. Suppose the initial weight vector w^(0) is chosen to be at the origin and is updated using simple gradient descent

\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \rho \nabla E     (5.196)

where τ denotes the step number, and ρ is the learning rate (which is assumed to be small). Show that, after τ steps, the components of the weight vector parallel to the eigenvectors of H can be written

w_j^{(\tau)} = \left\{ 1 - (1 - \rho\eta_j)^{\tau} \right\} w_j^{\star}     (5.197)

where w_j = \mathbf{w}^{\mathrm{T}} \mathbf{u}_j, and u_j and η_j are the eigenvectors and eigenvalues, respectively, of H so that

\mathbf{H} \mathbf{u}_j = \eta_j \mathbf{u}_j.     (5.198)

Show that as τ → ∞, this gives w^(τ) → w* as expected, provided |1 − ρη_j| < 1. Now suppose that training is halted after a finite number τ of steps. Show that the

308 components of the weight vector parallel to the eigenvectors of the Hessian satisfy

w_j^{(\tau)} \simeq w_j^{\star} \quad \text{when} \quad \eta_j \gg (\rho\tau)^{-1}     (5.199)

|w_j^{(\tau)}| \ll |w_j^{\star}| \quad \text{when} \quad \eta_j \ll (\rho\tau)^{-1}.     (5.200)

Compare this result with the discussion in Section 3.5.3 of regularization with simple weight decay, and hence show that (ρτ)^{-1} is analogous to the regularization parameter λ. The above results also show that the effective number of parameters in the network, as defined by (3.91), grows as the training progresses.

5.26 ( ) Consider a multilayer perceptron with arbitrary feed-forward topology, which is to be trained by minimizing the tangent propagation error function (5.127) in which the regularizing function is given by (5.128). Show that the regularization term Ω can be written as a sum over patterns of terms of the form

\Omega_n = \frac{1}{2} \sum_k (\mathcal{G} y_k)^2     (5.201)

where \mathcal{G} is a differential operator defined by

\mathcal{G} \equiv \sum_i \tau_i \frac{\partial}{\partial x_i}.     (5.202)

By acting on the forward propagation equations

z_j = h(a_j), \qquad a_j = \sum_i w_{ji} z_i     (5.203)

with the operator \mathcal{G}, show that Ω_n can be evaluated by forward propagation using the following equations:

\alpha_j = h'(a_j) \beta_j, \qquad \beta_j = \sum_i w_{ji} \alpha_i     (5.204)

where we have defined the new variables

\alpha_j \equiv \mathcal{G} z_j, \qquad \beta_j \equiv \mathcal{G} a_j.     (5.205)

Now show that the derivatives of Ω_n with respect to a weight w_{rs} in the network can be written in the form

\frac{\partial \Omega_n}{\partial w_{rs}} = \sum_k \alpha_k \left\{ \phi_{kr} z_s + \delta_{kr} \alpha_s \right\}     (5.206)

where we have defined

\delta_{kr} \equiv \frac{\partial y_k}{\partial a_r}, \qquad \phi_{kr} \equiv \mathcal{G} \delta_{kr}.     (5.207)

Write down the backpropagation equations for δ_{kr}, and hence derive a set of backpropagation equations for the evaluation of the φ_{kr}.

309 5.27 ( ) www Consider the framework for training with transformed data in the special case in which the transformation consists simply of the addition of random noise x → x + ξ where ξ has a Gaussian distribution with zero mean and unit covariance. By following an argument analogous to that of Section 5.5.5, show that the resulting regularizer reduces to the Tikhonov form (5.135).

5.28 ( ) www Consider a neural network, such as the convolutional network discussed in Section 5.5.6, in which multiple weights are constrained to have the same value. Discuss how the standard backpropagation algorithm must be modified in order to ensure that such constraints are satisfied when evaluating the derivatives of an error function with respect to the adjustable parameters in the network.

5.29 ( ) www Verify the result (5.141).

5.30 ( ) Verify the result (5.142).

5.31 ( ) Verify the result (5.143).

5.32 ( ) Show that the derivatives of the mixing coefficients {π_k}, defined by (5.146), with respect to the auxiliary parameters {η_j} are given by

\frac{\partial \pi_k}{\partial \eta_j} = \delta_{jk} \pi_j - \pi_j \pi_k.     (5.208)

Hence, by making use of the constraint \sum_k \pi_k = 1, derive the result (5.147).

5.33 ( ) Write down a pair of equations that express the Cartesian coordinates (x_1, x_2) for the robot arm shown in Figure 5.18 in terms of the joint angles θ_1 and θ_2 and the lengths L_1 and L_2 of the links. Assume the origin of the coordinate system is given by the attachment point of the lower arm. These equations define the 'forward kinematics' of the robot arm.

5.34 ( ) www Derive the result (5.155) for the derivative of the error function with respect to the network output activations controlling the mixing coefficients in the mixture density network.

5.35 ( ) Derive the result (5.156) for the derivative of the error function with respect to the network output activations controlling the component means in the mixture density network.

5.36 ( ) Derive the result (5.157) for the derivative of the error function with respect to the network output activations controlling the component variances in the mixture density network.

5.37 ( ) Verify the results (5.158) and (5.160) for the conditional mean and variance of the mixture density network model.

5.38 ( ) Using the general result (2.115), derive the predictive distribution (5.172) for the Laplace approximation to the Bayesian neural network model.

310 5.39 ( ) www Make use of the Laplace approximation result (4.135) to show that the evidence function for the hyperparameters α and β in the Bayesian neural network model can be approximated by (5.175).

5.40 ( ) www Outline the modifications needed to the framework for Bayesian neural networks, discussed in Section 5.7.3, to handle multiclass problems using networks having softmax output-unit activation functions.

5.41 ( ) By following analogous steps to those given in Section 5.7.1 for regression networks, derive the result (5.183) for the marginal likelihood in the case of a network having a cross-entropy error function and logistic-sigmoid output-unit activation function.

311 6 Kernel Methods

In Chapters 3 and 4, we considered linear parametric models for regression and classification in which the form of the mapping y(x, w) from input x to output y is governed by a vector w of adaptive parameters. During the learning phase, a set of training data is used either to obtain a point estimate of the parameter vector or to determine a posterior distribution over this vector. The training data is then discarded, and predictions for new inputs are based purely on the learned parameter vector w. This approach is also used in nonlinear parametric models such as neural networks (Chapter 5).

However, there is a class of pattern recognition techniques, in which the training data points, or a subset of them, are kept and used also during the prediction phase. For instance, the Parzen probability density model (Section 2.5.1) comprised a linear combination of 'kernel' functions each one centred on one of the training data points. Similarly, in Section 2.5.2 we introduced a simple technique for classification called nearest neighbours, which involved assigning to each new test vector the same label as the

312 closest example from the training set. These are examples of memory-based methods that involve storing the entire training set in order to make predictions for future data points. They typically require a metric to be defined that measures the similarity of any two vectors in input space, and are generally fast to 'train' but slow at making predictions for test data points.

Many linear parametric models can be re-cast into an equivalent 'dual representation' in which the predictions are also based on linear combinations of a kernel function evaluated at the training data points. As we shall see, for models which are based on a fixed nonlinear feature space mapping φ(x), the kernel function is given by the relation

k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}').     (6.1)

From this definition, we see that the kernel is a symmetric function of its arguments so that k(x, x') = k(x', x). The kernel concept was introduced into the field of pattern recognition by Aizerman et al. (1964) in the context of the method of potential functions, so-called because of an analogy with electrostatics. Although neglected for many years, it was re-introduced into machine learning in the context of large-margin classifiers by Boser et al. (1992) giving rise to the technique of support vector machines (Chapter 7). Since then, there has been considerable interest in this topic, both in terms of theory and applications. One of the most significant developments has been the extension of kernels to handle symbolic objects, thereby greatly expanding the range of problems that can be addressed.

The simplest example of a kernel function is obtained by considering the identity mapping for the feature space in (6.1) so that φ(x) = x, in which case k(x, x') = xᵀx'. We shall refer to this as the linear kernel.

The concept of a kernel formulated as an inner product in a feature space allows us to build interesting extensions of many well-known algorithms by making use of the kernel trick, also known as kernel substitution. The general idea is that, if we have an algorithm formulated in such a way that the input vector x enters only in the form of scalar products, then we can replace that scalar product with some other choice of kernel. For instance, the technique of kernel substitution can be applied to principal component analysis in order to develop a nonlinear variant of PCA (Schölkopf et al., 1998; Section 12.3). Other examples of kernel substitution include nearest-neighbour classifiers and the kernel Fisher discriminant (Mika et al., 1999; Roth and Steinhage, 2000; Baudat and Anouar, 2000).

There are numerous forms of kernel functions in common use, and we shall encounter several examples in this chapter. Many have the property of being a function only of the difference between the arguments, so that k(x, x') = k(x − x'), which are known as stationary kernels because they are invariant to translations in input space. A further specialization involves homogeneous kernels, also known as radial basis functions (Section 6.3), which depend only on the magnitude of the (typically Euclidean) distance between the arguments so that k(x, x') = k(‖x − x'‖).

For recent textbooks on kernel methods, see Schölkopf and Smola (2002), Herbrich (2002), and Shawe-Taylor and Cristianini (2004).

313 6.1. Dual Representations

Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally. This concept will play an important role when we consider support vector machines in the next chapter. Here we consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error function given by

J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) - t_n \right\}^2 + \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}     (6.2)

where λ ≥ 0. If we set the gradient of J(w) with respect to w equal to zero, we see that the solution for w takes the form of a linear combination of the vectors φ(x_n), with coefficients that are functions of w, of the form

\mathbf{w} = -\frac{1}{\lambda} \sum_{n=1}^{N} \left\{ \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) - t_n \right\} \boldsymbol{\phi}(\mathbf{x}_n) = \sum_{n=1}^{N} a_n \boldsymbol{\phi}(\mathbf{x}_n) = \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{a}     (6.3)

where Φ is the design matrix, whose n-th row is given by φ(x_n)ᵀ. Here the vector a = (a_1, ..., a_N)ᵀ, and we have defined

a_n = -\frac{1}{\lambda} \left\{ \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) - t_n \right\}.     (6.4)

Instead of working with the parameter vector w, we can now reformulate the least-squares algorithm in terms of the parameter vector a, giving rise to a dual representation. If we substitute w = Φᵀa into J(w), we obtain

J(\mathbf{a}) = \frac{1}{2} \mathbf{a}^{\mathrm{T}} \boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{a} - \mathbf{a}^{\mathrm{T}} \boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} + \frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t} + \frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{a}     (6.5)

where t = (t_1, ..., t_N)ᵀ. We now define the Gram matrix K = ΦΦᵀ, which is an N × N symmetric matrix with elements

K_{nm} = \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m)     (6.6)

where we have introduced the kernel function k(x, x') defined by (6.1). In terms of the Gram matrix, the sum-of-squares error function can be written as

J(\mathbf{a}) = \frac{1}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K}\mathbf{K} \mathbf{a} - \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{t} + \frac{1}{2} \mathbf{t}^{\mathrm{T}} \mathbf{t} + \frac{\lambda}{2} \mathbf{a}^{\mathrm{T}} \mathbf{K} \mathbf{a}.     (6.7)

Setting the gradient of J(a) with respect to a to zero, we obtain the following solution

\mathbf{a} = \left( \mathbf{K} + \lambda \mathbf{I}_N \right)^{-1} \mathbf{t}.     (6.8)

If we substitute this back into the linear regression model, we obtain the following prediction for a new input x

y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) = \mathbf{a}^{\mathrm{T}} \boldsymbol{\Phi} \boldsymbol{\phi}(\mathbf{x}) = \mathbf{k}(\mathbf{x})^{\mathrm{T}} \left( \mathbf{K} + \lambda \mathbf{I}_N \right)^{-1} \mathbf{t}     (6.9)

where we have defined the vector k(x) with elements k_n(x) = k(x_n, x). Thus we see that the dual formulation allows the solution to the least-squares problem to be expressed entirely in terms of the kernel function k(x, x'). This is known as a dual formulation because, by noting that the solution for a can be expressed as a linear combination of the elements of φ(x), we recover the original formulation in terms of the parameter vector w. Note that the prediction at x is given by a linear combination of the target values from the training set (Exercise 6.1). In fact, we have already obtained this result, using a slightly different notation, in Section 3.3.3.
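The dual solution (6.8)-(6.9) is only a few lines of code. The following Python sketch (illustrative; the Gaussian kernel, anticipating (6.23), the data, and the regularization value are arbitrary choices for the example) fits and evaluates a kernelized regression model without ever forming the feature vectors φ(x):

    import numpy as np

    def gaussian_kernel(X1, X2, s=0.3):
        d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
              - 2.0 * X1 @ X2.T)
        return np.exp(-d2 / (2.0 * s**2))

    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, (30, 1))
    t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)

    lam = 1e-2
    K = gaussian_kernel(X, X)                           # Gram matrix, eq. (6.6)
    a = np.linalg.solve(K + lam * np.eye(len(X)), t)    # eq. (6.8)

    X_new = np.linspace(0.0, 1.0, 5)[:, None]
    y_new = gaussian_kernel(X_new, X) @ a               # eq. (6.9)

Note that the only quantities ever computed are kernel evaluations between data points, which is exactly what makes the dual formulation useful for high- or infinite-dimensional feature spaces.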

314 In the dual formulation, we determine the parameter vector a by inverting an N × N matrix, whereas in the original parameter space formulation we had to invert an M × M matrix in order to determine w. Because N is typically much larger than M, the dual formulation does not seem to be particularly useful. However, the advantage of the dual formulation, as we shall see, is that it is expressed entirely in terms of the kernel function k(x, x'). We can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x), which allows us implicitly to use feature spaces of high, even infinite, dimensionality.

The existence of a dual representation based on the Gram matrix is a property of many linear models, including the perceptron (Exercise 6.2). In Section 6.4, we will develop a duality between probabilistic linear models for regression and the technique of Gaussian processes. Duality will also play an important role when we discuss support vector machines in Chapter 7.

6.2. Constructing Kernels

In order to exploit kernel substitution, we need to be able to construct valid kernel functions. One approach is to choose a feature space mapping φ(x) and then use this to find the corresponding kernel, as is illustrated in Figure 6.1. Here the kernel function is defined for a one-dimensional input space by

k(x, x') = \boldsymbol{\phi}(x)^{\mathrm{T}} \boldsymbol{\phi}(x') = \sum_{i=1}^{M} \phi_i(x) \, \phi_i(x')     (6.10)

where φ_i(x) are the basis functions.

An alternative approach is to construct kernel functions directly. In this case, we must ensure that the function we choose is a valid kernel, in other words that it corresponds to a scalar product in some (perhaps infinite dimensional) feature space. As a simple example, consider a kernel function given by

k(\mathbf{x}, \mathbf{z}) = \left( \mathbf{x}^{\mathrm{T}} \mathbf{z} \right)^2.     (6.11)

315 6.2. Constructing Kernels

In order to exploit kernel substitution, we need to be able to construct valid kernel functions. One approach is to choose a feature space mapping $\phi(x)$ and then use this to find the corresponding kernel, as is illustrated in Figure 6.1. Here the kernel function is defined for a one-dimensional input space by

$$k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x)\, \phi_i(x') \tag{6.10}$$

where $\phi_i(x)$ are the basis functions.

[Figure 6.1: Illustration of the construction of kernel functions starting from a corresponding set of basis functions. In each column the lower plot shows the kernel function $k(x, x')$ defined by (6.10) plotted as a function of $x$ for $x' = 0$, while the upper plot shows the corresponding basis functions given by polynomials (left column), 'Gaussians' (centre column), and logistic sigmoids (right column).]

An alternative approach is to construct kernel functions directly. In this case, we must ensure that the function we choose is a valid kernel, in other words that it corresponds to a scalar product in some (perhaps infinite dimensional) feature space. As a simple example, consider a kernel function given by

$$k(x, z) = (x^T z)^2. \tag{6.11}$$

If we take the particular case of a two-dimensional input space $x = (x_1, x_2)$ we can expand out the terms and thereby identify the corresponding nonlinear feature mapping

$$k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = \phi(x)^T \phi(z). \tag{6.12}$$

We see that the feature mapping takes the form $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$ and therefore comprises all possible second order terms, with a specific weighting between them.

More generally, however, we need a simple way to test whether a function constitutes a valid kernel without having to construct the function $\phi(x)$ explicitly. A necessary and sufficient condition for a function $k(x, x')$ to be a valid kernel (Shawe-Taylor and Cristianini, 2004) is that the Gram matrix $K$, whose elements are given by $k(x_n, x_m)$, should be positive semidefinite for all possible choices of the set $\{x_n\}$. Note that a positive semidefinite matrix is not the same thing as a matrix whose elements are nonnegative. (Appendix C)
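As a quick numerical sanity check of (6.12), the short sketch below (illustrative, not from the text) verifies that the explicit feature map reproduces the kernel (6.11) on a pair of points.

```python
import numpy as np

def phi(x):
    # feature map from (6.12): all second-order terms with sqrt(2) weighting
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))  # (6.11) equals phi(x)^T phi(z)
```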

316 One powerful technique for constructing new kernels is to build them out of simpler kernels as building blocks. This can be done using the following properties.

Techniques for Constructing New Kernels. Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following new kernels will also be valid:

$$k(x, x') = c\, k_1(x, x') \tag{6.13}$$
$$k(x, x') = f(x)\, k_1(x, x')\, f(x') \tag{6.14}$$
$$k(x, x') = q(k_1(x, x')) \tag{6.15}$$
$$k(x, x') = \exp(k_1(x, x')) \tag{6.16}$$
$$k(x, x') = k_1(x, x') + k_2(x, x') \tag{6.17}$$
$$k(x, x') = k_1(x, x')\, k_2(x, x') \tag{6.18}$$
$$k(x, x') = k_3(\phi(x), \phi(x')) \tag{6.19}$$
$$k(x, x') = x^T A x' \tag{6.20}$$
$$k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b') \tag{6.21}$$
$$k(x, x') = k_a(x_a, x_a')\, k_b(x_b, x_b') \tag{6.22}$$

where $c > 0$ is a constant, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\phi(x)$ is a function from $x$ to $\mathbb{R}^M$, $k_3(\cdot, \cdot)$ is a valid kernel in $\mathbb{R}^M$, $A$ is a symmetric positive semidefinite matrix, $x_a$ and $x_b$ are variables (not necessarily disjoint) with $x = (x_a, x_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces.

Equipped with these properties, we can now embark on the construction of more complex kernels appropriate to specific applications. We require that the kernel $k(x, x')$ be symmetric and positive semidefinite and that it expresses the appropriate form of similarity between $x$ and $x'$ according to the intended application. Here we consider a few common examples of kernel functions. For a more extensive discussion of 'kernel engineering', see Shawe-Taylor and Cristianini (2004).

We saw that the simple polynomial kernel $k(x, x') = (x^T x')^2$ contains only terms of degree two. If we consider the slightly generalized kernel $k(x, x') = (x^T x' + c)^2$ with $c > 0$, then the corresponding feature mapping $\phi(x)$ contains constant and linear terms as well as terms of order two. Similarly, $k(x, x') = (x^T x')^M$ contains all monomials of order $M$. For instance, if $x$ and $x'$ are two images, then the kernel represents a particular weighted sum of all possible products of $M$ pixels in the first image with $M$ pixels in the second image. This can similarly be generalized to include all terms up to degree $M$ by considering $k(x, x') = (x^T x' + c)^M$ with $c > 0$. Using the results (6.17) and (6.18) for combining kernels we see that these will all be valid kernel functions.
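These closure properties can also be checked numerically: any kernel assembled from (6.13)–(6.18) should yield a Gram matrix with no negative eigenvalues, whatever the point set. A small sketch (my example, assuming a linear base kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))

lin = X @ X.T               # linear kernel x^T x', valid by construction
k = 2.0 * lin               # (6.13): scaling by c > 0
k = k + lin ** 3            # (6.17) with (6.15): add q(k1) for q(u) = u^3
k = np.exp(0.01 * k)        # (6.13) then (6.16): exponentiation preserves validity
eigs = np.linalg.eigvalsh(k)
print(eigs.min())           # nonnegative, up to floating-point round-off
```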

317 Another commonly used kernel takes the form

$$k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right) \tag{6.23}$$

and is often called a 'Gaussian' kernel. Note, however, that in this context it is not interpreted as a probability density, and hence the normalization coefficient is omitted. We can see that this is a valid kernel by expanding the square

$$\|x - x'\|^2 = x^T x + (x')^T x' - 2\, x^T x' \tag{6.24}$$

to give

$$k(x, x') = \exp\left(-x^T x / 2\sigma^2\right)\exp\left(x^T x' / \sigma^2\right)\exp\left(-(x')^T x' / 2\sigma^2\right) \tag{6.25}$$

and then making use of (6.14) and (6.16), together with the validity of the linear kernel $k(x, x') = x^T x'$. Note that the feature vector that corresponds to the Gaussian kernel has infinite dimensionality. (Exercise 6.11)

The Gaussian kernel is not restricted to the use of Euclidean distance. If we use kernel substitution in (6.24) to replace $x^T x'$ with a nonlinear kernel $\kappa(x, x')$, we obtain

$$k(x, x') = \exp\left\{ -\frac{1}{2\sigma^2}\left( \kappa(x, x) + \kappa(x', x') - 2\kappa(x, x') \right) \right\}. \tag{6.26}$$

An important contribution to arise from the kernel viewpoint has been the extension to inputs that are symbolic, rather than simply vectors of real numbers. Kernel functions can be defined over objects as diverse as graphs, sets, strings, and text documents. Consider, for instance, a fixed set and define a nonvectorial space consisting of all possible subsets of this set. If $A_1$ and $A_2$ are two such subsets then one simple choice of kernel would be

$$k(A_1, A_2) = 2^{|A_1 \cap A_2|} \tag{6.27}$$

where $A_1 \cap A_2$ denotes the intersection of sets $A_1$ and $A_2$, and $|A|$ denotes the number of elements in $A$. This is a valid kernel function because it can be shown to correspond to an inner product in a feature space. (Exercise 6.12)

One powerful approach to the construction of kernels starts from a probabilistic generative model (Haussler, 1999), which allows us to apply generative models in a discriminative setting. Generative models can deal naturally with missing data and in the case of hidden Markov models can handle sequences of varying length. By contrast, discriminative models generally give better performance on discriminative tasks than generative models. It is therefore of some interest to combine these two approaches (Lasserre et al., 2006). One way to combine them is to use a generative model to define a kernel, and then use this kernel in a discriminative approach.

Given a generative model $p(x)$ we can define a kernel by

$$k(x, x') = p(x)\, p(x'). \tag{6.28}$$

This is clearly a valid kernel function because we can interpret it as an inner product in the one-dimensional feature space defined by the mapping $p(x)$. It says that two inputs $x$ and $x'$ are similar if they both have high probabilities. We can use (6.13) and (6.17) to extend this class of kernels by considering sums over products of different probability distributions, with positive weighting coefficients $p(i)$, of the form

$$k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i). \tag{6.29}$$
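To make (6.29) concrete, here is a small self-contained sketch (my own illustration, with arbitrarily chosen component parameters) of a kernel built from a two-component mixture of one-dimensional Gaussians.

```python
import numpy as np

def norm_pdf(x, mu, sd):
    # density of a univariate Gaussian N(x | mu, sd^2)
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# component densities p(x|i) and weights p(i); values chosen purely for illustration
means, sds, weights = [-1.0, 2.0], [0.5, 1.0], [0.3, 0.7]

def k_gen(x, xp):
    # (6.29): k(x, x') = sum_i p(x|i) p(x'|i) p(i)
    return sum(w * norm_pdf(x, m, s) * norm_pdf(xp, m, s)
               for m, s, w in zip(means, sds, weights))

print(k_gen(-1.0, -0.9), k_gen(-1.0, 2.0))  # inputs sharing a high-probability component score higher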

318 This is equivalent, up to an overall multiplicative constant, to a mixture distribution in which the components factorize, with the index $i$ playing the role of a 'latent' variable. Two inputs $x$ and $x'$ will give a large value for the kernel function, and hence appear similar, if they have significant probability under a range of different components. (Section 9.2) Taking the limit of an infinite sum, we can also consider kernels of the form

$$k(x, x') = \int p(x \mid z)\, p(x' \mid z)\, p(z)\, dz \tag{6.30}$$

where $z$ is a continuous latent variable.

Now suppose that our data consists of ordered sequences of length $L$ so that an observation is given by $X = \{x_1, \ldots, x_L\}$. A popular generative model for sequences is the hidden Markov model, which expresses the distribution $p(X)$ as a marginalization over a corresponding sequence of hidden states $Z = \{z_1, \ldots, z_L\}$. (Section 13.2) We can use this approach to define a kernel function measuring the similarity of two sequences $X$ and $X'$ by extending the mixture representation (6.29) to give

$$k(X, X') = \sum_Z p(X \mid Z)\, p(X' \mid Z)\, p(Z) \tag{6.31}$$

so that both observed sequences are generated by the same hidden sequence $Z$. This model can easily be extended to allow sequences of differing length to be compared.

An alternative technique for using generative models to define kernel functions is known as the Fisher kernel (Jaakkola and Haussler, 1999). Consider a parametric generative model $p(x \mid \theta)$ where $\theta$ denotes the vector of parameters. The goal is to find a kernel that measures the similarity of two input vectors $x$ and $x'$ induced by the generative model. Jaakkola and Haussler (1999) consider the gradient with respect to $\theta$, which defines a vector in a 'feature' space having the same dimensionality as $\theta$. In particular, they consider the Fisher score

$$g(\theta, x) = \nabla_\theta \ln p(x \mid \theta) \tag{6.32}$$

from which the Fisher kernel is defined by

$$k(x, x') = g(\theta, x)^T F^{-1} g(\theta, x'). \tag{6.33}$$

Here $F$ is the Fisher information matrix, given by

$$F = \mathbb{E}_x\left[ g(\theta, x)\, g(\theta, x)^T \right] \tag{6.34}$$

where the expectation is with respect to $x$ under the distribution $p(x \mid \theta)$. This can be motivated from the perspective of information geometry (Amari, 1998), which considers the differential geometry of the space of model parameters. Here we simply note that the presence of the Fisher information matrix causes this kernel to be invariant under a nonlinear re-parameterization of the density model $\theta \to \psi(\theta)$. (Exercise 6.13)

In practice, it is often infeasible to evaluate the Fisher information matrix. One approach is simply to replace the expectation in the definition of the Fisher information with the sample average, giving

$$F \simeq \frac{1}{N}\sum_{n=1}^{N} g(\theta, x_n)\, g(\theta, x_n)^T. \tag{6.35}$$
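For the simple case of a univariate Gaussian with unknown mean $\mu$ and unit variance, the Fisher score (6.32) is $g(\mu, x) = x - \mu$ and the Fisher information is $F = 1$, so the sample estimate (6.35) can be checked directly. A sketch under those assumptions (illustrative, not from the book):

```python
import numpy as np

def fisher_kernel(x, xp, mu, F):
    # (6.33) with scalar parameter theta = mu: k(x, x') = g(x) F^{-1} g(x')
    g = lambda u: u - mu  # Fisher score of N(x | mu, 1): d/dmu ln p = x - mu
    return g(x) * g(xp) / F

rng = np.random.default_rng(2)
xs = rng.normal(0.5, 1.0, 10000)
F_hat = np.mean((xs - 0.5) ** 2)  # sample estimate (6.35); the exact value is 1
print(F_hat, fisher_kernel(1.2, 0.1, 0.5, F_hat))
```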

319 This is the covariance matrix of the Fisher scores, and so the Fisher kernel corresponds to a whitening of these scores. More simply, we can just omit the Fisher information matrix altogether and use the noninvariant kernel (Section 12.1.3)

$$k(x, x') = g(\theta, x)^T g(\theta, x'). \tag{6.36}$$

An application of Fisher kernels to document retrieval is given by Hofmann (2000).

A final example of a kernel function is the sigmoidal kernel given by

$$k(x, x') = \tanh\left( a\, x^T x' + b \right) \tag{6.37}$$

whose Gram matrix in general is not positive semidefinite. This form of kernel has, however, been used in practice (Vapnik, 1995), possibly because it gives kernel expansions such as the support vector machine a superficial resemblance to neural network models. As we shall see, in the limit of an infinite number of basis functions, a Bayesian neural network with an appropriate prior reduces to a Gaussian process, thereby providing a deeper link between neural networks and kernel methods. (Section 6.4.7)

6.3. Radial Basis Function Networks

In Chapter 3, we discussed regression models based on linear combinations of fixed basis functions, although we did not discuss in detail what form those basis functions might take. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a centre $\mu_j$, so that $\phi_j(x) = h(\|x - \mu_j\|)$.

Historically, radial basis functions were introduced for the purpose of exact function interpolation (Powell, 1987). Given a set of input vectors $\{x_1, \ldots, x_N\}$ along with corresponding target values $\{t_1, \ldots, t_N\}$, the goal is to find a smooth function $f(x)$ that fits every target value exactly, so that $f(x_n) = t_n$ for $n = 1, \ldots, N$. This is achieved by expressing $f(x)$ as a linear combination of radial basis functions, one centred on every data point

$$f(x) = \sum_{n=1}^{N} w_n h(\|x - x_n\|). \tag{6.38}$$

The values of the coefficients $\{w_n\}$ are found by least squares, and because there are the same number of coefficients as there are constraints, the result is a function that fits every target value exactly. In pattern recognition applications, however, the target values are generally noisy, and exact interpolation is undesirable because this corresponds to an over-fitted solution.
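Exact interpolation with one basis function per data point amounts to solving the $N \times N$ linear system implied by $f(x_n) = t_n$. A minimal one-dimensional sketch (illustrative; a Gaussian choice of $h$ is assumed):

```python
import numpy as np

def rbf_interpolate(X, t, h=lambda r: np.exp(-r ** 2)):
    # solve H w = t with H_{nm} = h(|x_n - x_m|), so that f(x_n) = t_n exactly (6.38)
    R = np.abs(X[:, None] - X[None, :])
    w = np.linalg.solve(h(R), t)
    return lambda x: h(np.abs(x - X)) @ w

X = np.array([0.0, 0.5, 1.0, 1.5])
t = np.array([0.0, 1.0, 0.0, -1.0])
f = rbf_interpolate(X, t)
print([round(f(x), 6) for x in X])  # reproduces the targets exactly
```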

320 Expansions in radial basis functions also arise from regularization theory (Poggio and Girosi, 1990; Bishop, 1995a). For a sum-of-squares error function with a regularizer defined in terms of a differential operator, the optimal solution is given by an expansion in the Green's functions of the operator (which are analogous to the eigenvectors of a discrete matrix), again with one basis function centred on each data point. If the differential operator is isotropic then the Green's functions depend only on the radial distance from the corresponding data point. Due to the presence of the regularizer, the solution no longer interpolates the training data exactly.

Another motivation for radial basis functions comes from a consideration of the interpolation problem when the input (rather than the target) variables are noisy (Webb, 1994; Bishop, 1995a). If the noise on the input variable $x$ is described by a variable $\xi$ having a distribution $\nu(\xi)$, then the sum-of-squares error function becomes

$$E = \frac{1}{2}\sum_{n=1}^{N}\int \left\{ y(x_n + \xi) - t_n \right\}^2 \nu(\xi)\, d\xi. \tag{6.39}$$

Using the calculus of variations, we can optimize with respect to the function $y(x)$ (Appendix D, Exercise 6.17) to give

$$y(x) = \sum_{n=1}^{N} t_n h(x - x_n) \tag{6.40}$$

where the basis functions are given by

$$h(x - x_n) = \frac{\nu(x - x_n)}{\sum_{n=1}^{N} \nu(x - x_n)}. \tag{6.41}$$

We see that there is one basis function centred on every data point. This is known as the Nadaraya-Watson model and will be derived again from a different perspective in Section 6.3.1. If the noise distribution $\nu(\xi)$ is isotropic, so that it is a function only of $\|\xi\|$, then the basis functions will be radial.

Note that the basis functions (6.41) are normalized, so that $\sum_n h(x - x_n) = 1$ for any value of $x$. The effect of such normalization is shown in Figure 6.2. Normalization is sometimes used in practice as it avoids having regions of input space where all of the basis functions take small values, which would necessarily lead to predictions in such regions that are either small or controlled purely by the bias parameter. Another situation in which expansions in normalized radial basis functions arise is in the application of kernel density estimation to the problem of regression, as we shall discuss in Section 6.3.1.

Because there is one basis function associated with every data point, the corresponding model can be computationally costly to evaluate when making predictions for new data points. Models have therefore been proposed (Broomhead and Lowe, 1988; Moody and Darken, 1989; Poggio and Girosi, 1990), which retain the expansion in radial basis functions but where the number $M$ of basis functions is smaller than the number $N$ of data points. Typically, the number of basis functions, and the locations $\mu_i$ of their centres, are determined based on the input data $\{x_n\}$ alone. The basis functions are then kept fixed and the coefficients $\{w_i\}$ are determined by least squares by solving the usual set of linear equations, as discussed in Section 3.1.1.

321 [Figure 6.2: Plot of a set of Gaussian basis functions on the left, together with the corresponding normalized basis functions on the right.]

One of the simplest ways of choosing basis function centres is to use a randomly chosen subset of the data points. A more systematic approach is called orthogonal least squares (Chen et al., 1991). This is a sequential selection process in which at each step the next data point to be chosen as a basis function centre corresponds to the one that gives the greatest reduction in the sum-of-squares error. Values for the expansion coefficients are determined as part of the algorithm. Clustering algorithms such as $K$-means have also been used, which give a set of basis function centres that no longer coincide with training data points. (Section 9.1)

6.3.1 Nadaraya-Watson model

In Section 3.3.3, we saw that the prediction of a linear regression model for a new input $x$ takes the form of a linear combination of the training set target values with coefficients given by the 'equivalent kernel' (3.62), where the equivalent kernel satisfies the summation constraint (3.64).

We can motivate the kernel regression model (3.61) from a different perspective, starting with kernel density estimation. Suppose we have a training set $\{x_n, t_n\}$ and we use a Parzen density estimator to model the joint distribution $p(x, t)$, so that (Section 2.5.1)

$$p(x, t) = \frac{1}{N}\sum_{n=1}^{N} f(x - x_n,\, t - t_n) \tag{6.42}$$

where $f(x, t)$ is the component density function, and there is one such component centred on each data point.

322 We now find an expression for the regression function $y(x)$, corresponding to the conditional average of the target variable conditioned on the input variable, which is given by

$$y(x) = \mathbb{E}[t \mid x] = \int_{-\infty}^{\infty} t\, p(t \mid x)\, dt = \frac{\int t\, p(x, t)\, dt}{\int p(x, t)\, dt} = \frac{\sum_n \int t\, f(x - x_n,\, t - t_n)\, dt}{\sum_m \int f(x - x_m,\, t - t_m)\, dt}. \tag{6.43}$$

We now assume for simplicity that the component density functions have zero mean so that

$$\int_{-\infty}^{\infty} f(x, t)\, t\, dt = 0 \tag{6.44}$$

for all values of $x$. Using a simple change of variable, we then obtain

$$y(x) = \frac{\sum_n g(x - x_n)\, t_n}{\sum_m g(x - x_m)} = \sum_n k(x, x_n)\, t_n \tag{6.45}$$

where $n, m = 1, \ldots, N$ and the kernel function $k(x, x_n)$ is given by

$$k(x, x_n) = \frac{g(x - x_n)}{\sum_m g(x - x_m)} \tag{6.46}$$

and we have defined

$$g(x) = \int_{-\infty}^{\infty} f(x, t)\, dt. \tag{6.47}$$

The result (6.45) is known as the Nadaraya-Watson model, or kernel regression (Nadaraya, 1964; Watson, 1964). For a localized kernel function, it has the property of giving more weight to the data points $x_n$ that are close to $x$. Note that the kernel (6.46) satisfies the summation constraint

$$\sum_{n=1}^{N} k(x, x_n) = 1.$$
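For a Gaussian component density, (6.45) and (6.46) reduce to a weighted average of the targets with normalized weights. A minimal sketch (my illustration, not from the book):

```python
import numpy as np

def nadaraya_watson(X, t, x, sigma=0.1):
    # (6.45)/(6.46): y(x) = sum_n k(x, x_n) t_n with normalized Gaussian weights
    g = np.exp(-((x - X) ** 2) / (2 * sigma ** 2))  # g(x - x_n) for a Gaussian f
    k = g / g.sum()                                 # weights sum to one, as in (6.46)
    return k @ t

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 50))
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(50)
print(nadaraya_watson(X, t, 0.25))  # close to sin(pi/2) = 1 for a localized kernel
```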

323 [Figure 6.3: Illustration of the Nadaraya-Watson kernel regression model using isotropic Gaussian kernels, for the sinusoidal data set. The original sine function is shown by the green curve, the data points are shown in blue, and each is the centre of an isotropic Gaussian kernel. The resulting regression function, given by the conditional mean, is shown by the red line, along with the two-standard-deviation region for the conditional distribution $p(t \mid x)$ shown by the red shading. The blue ellipse around each data point shows one standard deviation contour for the corresponding kernel. These appear noncircular due to the different scales on the horizontal and vertical axes.]

In fact, this model defines not only a conditional expectation but also a full conditional distribution given by

$$p(t \mid x) = \frac{p(t, x)}{\int p(t, x)\, dt} = \frac{\sum_n f(x - x_n,\, t - t_n)}{\sum_m \int f(x - x_m,\, t - t_m)\, dt} \tag{6.48}$$

from which other expectations can be evaluated.

As an illustration we consider the case of a single input variable $x$ in which $f(x, t)$ is given by a zero-mean isotropic Gaussian over the variable $z = (x, t)$ with variance $\sigma^2$. The corresponding conditional distribution (6.48) is given by a Gaussian mixture, and is shown, together with the conditional mean, for the sinusoidal synthetic data set in Figure 6.3. (Exercise 6.18)

An obvious extension of this model is to allow for more flexible forms of Gaussian components, for instance having different variance parameters for the input and target variables. More generally, we could model the joint distribution $p(t, x)$ using a Gaussian mixture model, trained using techniques discussed in Chapter 9 (Ghahramani and Jordan, 1994), and then find the corresponding conditional distribution $p(t \mid x)$. In this latter case we no longer have a representation in terms of kernel functions evaluated at the training set data points. However, the number of components in the mixture model can be smaller than the number of training set points, resulting in a model that is faster to evaluate for test data points. We have thereby accepted an increased computational cost during the training phase in order to have a model that is faster at making predictions.

6.4. Gaussian Processes

In Section 6.1, we introduced kernels by applying the concept of duality to a non-probabilistic model for regression. Here we extend the role of kernels to probabilistic discriminative models, leading to the framework of Gaussian processes. We shall thereby see how kernels arise naturally in a Bayesian setting.

324 In Chapter 3, we considered linear regression models of the form $y(x, w) = w^T \phi(x)$ in which $w$ is a vector of parameters and $\phi(x)$ is a vector of fixed nonlinear basis functions that depend on the input vector $x$. We showed that a prior distribution over $w$ induced a corresponding prior distribution over functions $y(x, w)$. Given a training data set, we then evaluated the posterior distribution over $w$ and thereby obtained the corresponding posterior distribution over regression functions, which in turn (with the addition of noise) implies a predictive distribution $p(t \mid x)$ for new input vectors $x$.

In the Gaussian process viewpoint, we dispense with the parametric model and instead define a prior probability distribution over functions directly. At first sight, it might seem difficult to work with a distribution over the uncountably infinite space of functions. However, as we shall see, for a finite training set we only need to consider the values of the function at the discrete set of input values $x_n$ corresponding to the training set and test set data points, and so in practice we can work in a finite space.

Models equivalent to Gaussian processes have been widely studied in many different fields. For instance, in the geostatistics literature Gaussian process regression is known as kriging (Cressie, 1993). Similarly, ARMA (autoregressive moving average) models, Kalman filters, and radial basis function networks can all be viewed as forms of Gaussian process models. Reviews of Gaussian processes from a machine learning perspective can be found in MacKay (1998), Williams (1999), and MacKay (2003), and a comparison of Gaussian process models with alternative approaches is given in Rasmussen (1996). See also Rasmussen and Williams (2006) for a recent textbook on Gaussian processes.

6.4.1 Linear regression revisited

In order to motivate the Gaussian process viewpoint, let us return to the linear regression example and re-derive the predictive distribution by working in terms of distributions over functions $y(x, w)$. This will provide a specific example of a Gaussian process.

Consider a model defined in terms of a linear combination of $M$ fixed basis functions given by the elements of the vector $\phi(x)$ so that

$$y(x) = w^T \phi(x) \tag{6.49}$$

where $x$ is the input vector and $w$ is the $M$-dimensional weight vector. Now consider a prior distribution over $w$ given by an isotropic Gaussian of the form

$$p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I) \tag{6.50}$$

governed by the hyperparameter $\alpha$, which represents the precision (inverse variance) of the distribution. For any given value of $w$, the definition (6.49) defines a particular function of $x$. The probability distribution over $w$ defined by (6.50) therefore induces a probability distribution over functions $y(x)$.

325 In practice, we wish to evaluate this function at specific values of $x$, for example at the training data points $x_1, \ldots, x_N$. We are therefore interested in the joint distribution of the function values $y(x_1), \ldots, y(x_N)$, which we denote by the vector $y$ with elements $y_n = y(x_n)$ for $n = 1, \ldots, N$. From (6.49), this vector is given by

$$y = \Phi w \tag{6.51}$$

where $\Phi$ is the design matrix with elements $\Phi_{nk} = \phi_k(x_n)$. We can find the probability distribution of $y$ as follows. First of all we note that $y$ is a linear combination of Gaussian distributed variables given by the elements of $w$ and hence is itself Gaussian. We therefore need only to find its mean and covariance (Exercise 2.31), which are given from (6.50) by

$$\mathbb{E}[y] = \Phi\, \mathbb{E}[w] = 0 \tag{6.52}$$

$$\operatorname{cov}[y] = \mathbb{E}\left[ y y^T \right] = \Phi\, \mathbb{E}\left[ w w^T \right] \Phi^T = \frac{1}{\alpha} \Phi\Phi^T = K \tag{6.53}$$

where $K$ is the Gram matrix with elements

$$K_{nm} = k(x_n, x_m) = \frac{1}{\alpha}\, \phi(x_n)^T \phi(x_m) \tag{6.54}$$

and $k(x, x')$ is the kernel function.

This model provides us with a particular example of a Gaussian process. In general, a Gaussian process is defined as a probability distribution over functions $y(x)$ such that the set of values of $y(x)$ evaluated at an arbitrary set of points $x_1, \ldots, x_N$ jointly have a Gaussian distribution. In cases where the input vector $x$ is two dimensional, this may also be known as a Gaussian random field. More generally, a stochastic process $y(x)$ is specified by giving the joint probability distribution for any finite set of values $y(x_1), \ldots, y(x_N)$ in a consistent manner.

A key point about Gaussian stochastic processes is that the joint distribution over $N$ variables $y_1, \ldots, y_N$ is specified completely by the second-order statistics, namely the mean and the covariance. In most applications, we will not have any prior knowledge about the mean of $y(x)$ and so by symmetry we take it to be zero. This is equivalent to choosing the mean of the prior over weight values $p(w \mid \alpha)$ to be zero in the basis function viewpoint. The specification of the Gaussian process is then completed by giving the covariance of $y(x)$ evaluated at any two values of $x$, which is given by the kernel function

$$\mathbb{E}[y(x_n)\, y(x_m)] = k(x_n, x_m). \tag{6.55}$$

For the specific case of a Gaussian process defined by the linear regression model (6.49) with a weight prior (6.50), the kernel function is given by (6.54).

We can also define the kernel function directly, rather than indirectly through a choice of basis function. Figure 6.4 shows samples of functions drawn from Gaussian processes for two different choices of kernel function. The first of these is a 'Gaussian' kernel of the form (6.23), and the second is the exponential kernel given by

$$k(x, x') = \exp\left( -\theta\, |x - x'| \right) \tag{6.56}$$

which corresponds to the Ornstein-Uhlenbeck process originally introduced by Uhlenbeck and Ornstein (1930) to describe Brownian motion.
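Samples such as those in Figure 6.4 can be generated by evaluating the kernel on a grid of input points and drawing from the resulting multivariate Gaussian. A sketch (illustrative; the kernel parameters are arbitrary choices of mine):

```python
import numpy as np

x = np.linspace(-1, 1, 200)
K_gauss = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1 ** 2))  # 'Gaussian' kernel (6.23)
K_exp = np.exp(-4.0 * np.abs(x[:, None] - x[None, :]))              # exponential kernel (6.56), theta = 4

rng = np.random.default_rng(4)
for K in (K_gauss, K_exp):
    # a small diagonal jitter keeps the covariance numerically positive semidefinite
    y = rng.multivariate_normal(np.zeros_like(x), K + 1e-6 * np.eye(len(x)))
    print(y[:3])  # one sample path y ~ N(0, K) from the process prior
```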

326 [Figure 6.4: Samples from Gaussian processes for a 'Gaussian' kernel (left) and an exponential kernel (right).]

6.4.2 Gaussian processes for regression

In order to apply Gaussian process models to the problem of regression, we need to take account of the noise on the observed target values, which are given by

$$t_n = y_n + \epsilon_n \tag{6.57}$$

where $y_n = y(x_n)$, and $\epsilon_n$ is a random noise variable whose value is chosen independently for each observation $n$. Here we shall consider noise processes that have a Gaussian distribution, so that

$$p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1}) \tag{6.58}$$

where $\beta$ is a hyperparameter representing the precision of the noise. Because the noise is independent for each data point, the joint distribution of the target values $t = (t_1, \ldots, t_N)^T$ conditioned on the values of $y = (y_1, \ldots, y_N)^T$ is given by an isotropic Gaussian of the form

$$p(t \mid y) = \mathcal{N}(t \mid y, \beta^{-1} I_N) \tag{6.59}$$

where $I_N$ denotes the $N \times N$ unit matrix. From the definition of a Gaussian process, the marginal distribution $p(y)$ is given by a Gaussian whose mean is zero and whose covariance is defined by a Gram matrix $K$ so that

$$p(y) = \mathcal{N}(y \mid 0, K). \tag{6.60}$$

The kernel function that determines $K$ is typically chosen to express the property that, for points $x_n$ and $x_m$ that are similar, the corresponding values $y(x_n)$ and $y(x_m)$ will be more strongly correlated than for dissimilar points. Here the notion of similarity will depend on the application.

In order to find the marginal distribution $p(t)$, conditioned on the input values $x_1, \ldots, x_N$, we need to integrate over $y$. This can be done by making use of the results from Section 2.3.3 for the linear-Gaussian model. Using (2.115), we see that the marginal distribution of $t$ is given by

$$p(t) = \int p(t \mid y)\, p(y)\, dy = \mathcal{N}(t \mid 0, C) \tag{6.61}$$

327 where the covariance matrix $C$ has elements

$$C(x_n, x_m) = k(x_n, x_m) + \beta^{-1}\delta_{nm}. \tag{6.62}$$

This result reflects the fact that the two Gaussian sources of randomness, namely that associated with $y(x)$ and that associated with $\epsilon$, are independent and so their covariances simply add.

One widely used kernel function for Gaussian process regression is given by the exponential of a quadratic form, with the addition of constant and linear terms to give

$$k(x_n, x_m) = \theta_0 \exp\left\{ -\frac{\theta_1}{2} \|x_n - x_m\|^2 \right\} + \theta_2 + \theta_3\, x_n^T x_m. \tag{6.63}$$

Note that the term involving $\theta_3$ corresponds to a parametric model that is a linear function of the input variables. Samples from this prior are plotted for various values of the parameters $\theta_0, \ldots, \theta_3$ in Figure 6.5, and Figure 6.6 shows a set of points sampled from the joint distribution (6.60) along with the corresponding values defined by (6.61).

So far, we have used the Gaussian process viewpoint to build a model of the joint distribution over sets of data points. Our goal in regression, however, is to make predictions of the target variables for new inputs, given a set of training data. Let us suppose that $t_N = (t_1, \ldots, t_N)^T$, corresponding to input values $x_1, \ldots, x_N$, comprise the observed training set, and our goal is to predict the target variable $t_{N+1}$ for a new input vector $x_{N+1}$. This requires that we evaluate the predictive distribution $p(t_{N+1} \mid t_N)$. Note that this distribution is conditioned also on the variables $x_1, \ldots, x_N$ and $x_{N+1}$. However, to keep the notation simple we will not show these conditioning variables explicitly.

To find the conditional distribution $p(t_{N+1} \mid t_N)$, we begin by writing down the joint distribution $p(t_{N+1})$, where $t_{N+1}$ denotes the vector $(t_1, \ldots, t_N, t_{N+1})^T$. We then apply the results from Section 2.3.1 to obtain the required conditional distribution, as illustrated in Figure 6.7. From (6.61), the joint distribution over $t_1, \ldots, t_{N+1}$ will be given by

$$p(t_{N+1}) = \mathcal{N}(t_{N+1} \mid 0, C_{N+1}) \tag{6.64}$$

where $C_{N+1}$ is an $(N+1) \times (N+1)$ covariance matrix with elements given by (6.62). Because this joint distribution is Gaussian, we can apply the results from Section 2.3.1 to find the conditional Gaussian distribution. To do this, we partition the covariance matrix as follows

$$C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix} \tag{6.65}$$

where $C_N$ is the $N \times N$ covariance matrix with elements given by (6.62) for $n, m = 1, \ldots, N$, the vector $k$ has elements $k(x_n, x_{N+1})$ for $n = 1, \ldots, N$, and the scalar $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$.

328 [Figure 6.5: Samples from a Gaussian process prior defined by the covariance function (6.63). The title above each plot denotes $(\theta_0, \theta_1, \theta_2, \theta_3)$; the six panels correspond to (1.00, 4.00, 0.00, 0.00), (9.00, 4.00, 0.00, 0.00), (1.00, 64.00, 0.00, 0.00), (1.00, 0.25, 0.00, 0.00), (1.00, 4.00, 10.00, 0.00), and (1.00, 4.00, 0.00, 5.00).]

Using the results (2.81) and (2.82), we see that the conditional distribution $p(t_{N+1} \mid t)$ is a Gaussian distribution with mean and covariance given by

$$m(x_{N+1}) = k^T C_N^{-1} t \tag{6.66}$$

$$\sigma^2(x_{N+1}) = c - k^T C_N^{-1} k. \tag{6.67}$$

These are the key results that define Gaussian process regression. Because the vector $k$ is a function of the test point input value $x_{N+1}$, we see that the predictive distribution is a Gaussian whose mean and variance both depend on $x_{N+1}$. An example of Gaussian process regression is shown in Figure 6.8.
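Equations (6.66) and (6.67) constitute the complete prediction recipe, and translate directly into numpy. The sketch below is my illustration, not the book's; the Gaussian kernel and all parameter values are assumptions.

```python
import numpy as np

def gp_predict(X, t, X_new, beta=25.0, sigma=0.2):
    # predictive mean (6.66) and variance (6.67) for GP regression
    sq = lambda A, B: (A[:, None] - B[None, :]) ** 2
    k_f = lambda A, B: np.exp(-sq(A, B) / (2 * sigma ** 2))
    C_N = k_f(X, X) + np.eye(len(X)) / beta   # covariance (6.62)
    k = k_f(X, X_new)                         # columns k with k_n = k(x_n, x_{N+1})
    v = np.linalg.solve(C_N, k)               # C_N^{-1} k
    mean = v.T @ t                            # k^T C_N^{-1} t
    c = 1.0 + 1.0 / beta                      # k(x*, x*) + beta^{-1} for this kernel
    var = c - np.sum(k * v, axis=0)           # c - k^T C_N^{-1} k
    return mean, var

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(20)
m, v = gp_predict(X, t, np.array([0.25, 2.0]))
print(m, v)  # the variance grows far from the data, as in Figure 6.8
```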

329 [Figure 6.6: Illustration of the sampling of data points $\{t_n\}$ from a Gaussian process. The blue curve shows a sample function from the Gaussian process prior over functions, and the red points show the values of $y_n$ obtained by evaluating the function at a set of input values $\{x_n\}$. The corresponding values of $\{t_n\}$, shown in green, are obtained by adding independent Gaussian noise to each of the $\{y_n\}$.]

The only restriction on the kernel function is that the covariance matrix given by (6.62) must be positive definite. If $\lambda_i$ is an eigenvalue of $K$, then the corresponding eigenvalue of $C$ will be $\lambda_i + \beta^{-1}$. It is therefore sufficient that the kernel matrix $k(x_n, x_m)$ be positive semidefinite for any pair of points $x_n$ and $x_m$, so that $\lambda_i \geqslant 0$, because any eigenvalue $\lambda_i$ that is zero will still give rise to a positive eigenvalue for $C$ because $\beta > 0$. This is the same restriction on the kernel function discussed earlier, and so we can again exploit all of the techniques in Section 6.2 to construct suitable kernels.

Note that the mean (6.66) of the predictive distribution can be written, as a function of $x_{N+1}$, in the form

$$m(x_{N+1}) = \sum_{n=1}^{N} a_n k(x_n, x_{N+1}) \tag{6.68}$$

where $a_n$ is the $n$th component of $C_N^{-1} t$. Thus, if the kernel function $k(x_n, x_m)$ depends only on the distance $\|x_n - x_m\|$, then we obtain an expansion in radial basis functions.

The results (6.66) and (6.67) define the predictive distribution for Gaussian process regression with an arbitrary kernel function $k(x_n, x_m)$. In the particular case in which the kernel function $k(x, x')$ is defined in terms of a finite set of basis functions, we can derive the results obtained previously in Section 3.3.2 for linear regression starting from the Gaussian process viewpoint. (Exercise 6.21) For such models, we can therefore obtain the predictive distribution either by taking a parameter space viewpoint and using the linear regression result or by taking a function space viewpoint and using the Gaussian process result.

The central computational operation in using Gaussian processes will involve the inversion of a matrix of size $N \times N$, for which standard methods require $O(N^3)$ computations. By contrast, in the basis function model we have to invert a matrix $S_N$ of size $M \times M$, which has $O(M^3)$ computational complexity. Note that for both viewpoints, the matrix inversion must be performed once for the given training set. For each new test point, both methods require a vector-matrix multiply, which has cost $O(N^2)$ in the Gaussian process case and $O(M^2)$ for the linear basis function model. If the number $M$ of basis functions is smaller than the number $N$ of data points, it will be computationally more efficient to work in the basis function framework.

330 [Figure 6.7: Illustration of the mechanism of Gaussian process regression for the case of one training point and one test point, in which the red ellipses show contours of the joint distribution $p(t_1, t_2)$. Here $t_1$ is the training data point, and conditioning on the value of $t_1$, corresponding to the vertical blue line, we obtain $p(t_2 \mid t_1)$ shown as a function of $t_2$ by the green curve.]

However, an advantage of a Gaussian processes viewpoint is that we can consider covariance functions that can only be expressed in terms of an infinite number of basis functions.

For large training data sets, however, the direct application of Gaussian process methods can become infeasible, and so a range of approximation schemes have been developed that have better scaling with training set size than the exact approach (Gibbs, 1997; Tresp, 2001; Smola and Bartlett, 2001; Williams and Seeger, 2001; Csató and Opper, 2002; Seeger et al., 2003). Practical issues in the application of Gaussian processes are discussed in Bishop and Nabney (2008).

We have introduced Gaussian process regression for the case of a single target variable. The extension of this formalism to multiple target variables, known as co-kriging (Cressie, 1993), is straightforward. (Exercise 6.23) Various other extensions of Gaussian process regression have also been considered, for purposes such as modelling the distribution over low-dimensional manifolds for unsupervised learning (Bishop et al., 1998a) and the solution of stochastic differential equations (Graepel, 2003).

[Figure 6.8: Illustration of Gaussian process regression applied to the sinusoidal data set in Figure A.6 in which the three right-most data points have been omitted. The green curve shows the sinusoidal function from which the data points, shown in blue, are obtained by sampling and addition of Gaussian noise. The red line shows the mean of the Gaussian process predictive distribution, and the shaded region corresponds to plus and minus two standard deviations. Notice how the uncertainty increases in the region to the right of the data points.]

331 6.4.3 Learning the hyperparameters

The predictions of a Gaussian process model will depend, in part, on the choice of covariance function. In practice, rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data. These parameters govern such things as the length scale of the correlations and the precision of the noise and correspond to the hyperparameters in a standard parametric model.

Techniques for learning the hyperparameters are based on the evaluation of the likelihood function $p(t \mid \theta)$ where $\theta$ denotes the hyperparameters of the Gaussian process model. The simplest approach is to make a point estimate of $\theta$ by maximizing the log likelihood function. Because $\theta$ represents a set of hyperparameters for the regression problem, this can be viewed as analogous to the type 2 maximum likelihood procedure for linear regression models. (Section 3.5) Maximization of the log likelihood can be done using efficient gradient-based optimization algorithms such as conjugate gradients (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008).

The log likelihood function for a Gaussian process regression model is easily evaluated using the standard form for a multivariate Gaussian distribution, giving

$$\ln p(t \mid \theta) = -\frac{1}{2}\ln|C_N| - \frac{1}{2}\, t^T C_N^{-1} t - \frac{N}{2}\ln(2\pi). \tag{6.69}$$

For nonlinear optimization, we also need the gradient of the log likelihood function with respect to the parameter vector $\theta$. We shall assume that evaluation of the derivatives of $C_N$ is straightforward, as would be the case for the covariance functions considered in this chapter. Making use of the result (C.21) for the derivative of $C_N^{-1}$, together with the result (C.22) for the derivative of $\ln|C_N|$, we obtain

$$\frac{\partial}{\partial\theta_i}\ln p(t \mid \theta) = -\frac{1}{2}\operatorname{Tr}\left( C_N^{-1}\frac{\partial C_N}{\partial\theta_i} \right) + \frac{1}{2}\, t^T C_N^{-1}\frac{\partial C_N}{\partial\theta_i} C_N^{-1} t. \tag{6.70}$$

Because $\ln p(t \mid \theta)$ will in general be a nonconvex function, it can have multiple maxima.

It is straightforward to introduce a prior over $\theta$ and to maximize the log posterior using gradient-based methods. In a fully Bayesian treatment, we need to evaluate marginals over $\theta$ weighted by the product of the prior $p(\theta)$ and the likelihood function $p(t \mid \theta)$. In general, however, exact marginalization will be intractable, and we must resort to approximations.
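The log likelihood (6.69) and its gradient (6.70) are straightforward to implement. In the sketch below (illustrative, not the book's code), the covariance matrix and its derivative with respect to one parameter are assumed to be supplied as arrays; a finite-difference comparison could be used to verify the gradient.

```python
import numpy as np

def log_marginal(C_N, t):
    # (6.69): -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - (N/2) ln(2 pi)
    _, logdet = np.linalg.slogdet(C_N)
    alpha = np.linalg.solve(C_N, t)
    return -0.5 * logdet - 0.5 * t @ alpha - 0.5 * len(t) * np.log(2 * np.pi)

def log_marginal_grad(C_N, dC, t):
    # (6.70): -1/2 Tr(C_N^{-1} dC) + 1/2 t^T C_N^{-1} dC C_N^{-1} t
    Cinv = np.linalg.inv(C_N)
    alpha = Cinv @ t
    return -0.5 * np.trace(Cinv @ dC) + 0.5 * alpha @ dC @ alpha

# toy usage with a covariance of the form (6.62)
X = np.linspace(0, 1, 10)
C_N = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.08) + np.eye(10) / 25.0
print(log_marginal(C_N, np.sin(2 * np.pi * X)))
```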

332 [Figure 6.9: Samples from the ARD prior for Gaussian processes, in which the kernel function is given by (6.71). The left plot corresponds to $\eta_1 = \eta_2 = 1$, and the right plot corresponds to $\eta_1 = 1$, $\eta_2 = 0.01$.]

The Gaussian process regression model gives a predictive distribution whose mean and variance are functions of the input vector $x$. However, we have assumed that the contribution to the predictive variance arising from the additive noise, governed by the parameter $\beta$, is a constant. For some problems, known as heteroscedastic, the noise variance itself will also depend on $x$. To model this, we can extend the Gaussian process framework by introducing a second Gaussian process to represent the dependence of $\beta$ on the input $x$ (Goldberg et al., 1998). Because $\beta$ is a precision, and hence nonnegative, we use the Gaussian process to model $\ln\beta(x)$.

6.4.4 Automatic relevance determination

In the previous section, we saw how maximum likelihood could be used to determine a value for the correlation length-scale parameter in a Gaussian process. This technique can usefully be extended by incorporating a separate parameter for each input variable (Rasmussen and Williams, 2006). The result, as we shall see, is that the optimization of these parameters by maximum likelihood allows the relative importance of different inputs to be inferred from the data. This represents an example in the Gaussian process context of automatic relevance determination, or ARD, which was originally formulated in the framework of neural networks (MacKay, 1994; Neal, 1996). The mechanism by which appropriate inputs are preferred is discussed in Section 7.2.2.

Consider a Gaussian process with a two-dimensional input space $x = (x_1, x_2)$, having a kernel function of the form

$$k(x, x') = \theta_0 \exp\left\{ -\frac{1}{2}\sum_{i=1}^{2} \eta_i (x_i - x_i')^2 \right\}. \tag{6.71}$$

Samples from the resulting prior over functions $y(x)$ are shown for two different settings of the precision parameters $\eta_i$ in Figure 6.9. We see that, as a particular parameter $\eta_i$ becomes small, the function becomes relatively insensitive to the corresponding input variable $x_i$. By adapting these parameters to a data set using maximum likelihood, it becomes possible to detect input variables that have little effect on the predictive distribution, because the corresponding values of $\eta_i$ will be small. This can be useful in practice because it allows such inputs to be discarded.
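The ARD kernel (6.71) is simple to write down and probe directly; the sketch below (my illustration) shows how a small $\eta_2$ makes the kernel, and hence the prior over functions, nearly insensitive to displacements along $x_2$.

```python
import numpy as np

def k_ard(x, xp, eta, theta0=1.0):
    # (6.71): theta_0 exp(-1/2 sum_i eta_i (x_i - x'_i)^2)
    return theta0 * np.exp(-0.5 * np.sum(eta * (x - xp) ** 2))

x = np.array([0.0, 0.0])
dx1 = np.array([1.0, 0.0])   # unit displacement along x_1
dx2 = np.array([0.0, 1.0])   # unit displacement along x_2
eta = np.array([1.0, 0.01])
print(k_ard(x, x + dx1, eta), k_ard(x, x + dx2, eta))
# the second value stays near theta_0, so the prior barely varies with x_2
```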

333 [Figure 6.10: Illustration of automatic relevance determination in a Gaussian process for a synthetic problem having three inputs $x_1$, $x_2$, and $x_3$, for which the curves show the corresponding values of the hyperparameters $\eta_1$ (red), $\eta_2$ (green), and $\eta_3$ (blue) as a function of the number of iterations when optimizing the marginal likelihood. Details are given in the text. Note the logarithmic scale on the vertical axis.]

ARD is illustrated using a simple synthetic data set having three inputs $x_1$, $x_2$ and $x_3$ (Nabney, 2002) in Figure 6.10. The target variable $t$ is generated by sampling 100 values of $x_1$ from a Gaussian, evaluating the function $\sin(2\pi x_1)$, and then adding Gaussian noise. Values of $x_2$ are given by copying the corresponding values of $x_1$ and adding noise, and values of $x_3$ are sampled from an independent Gaussian distribution. Thus $x_1$ is a good predictor of $t$, $x_2$ is a more noisy predictor of $t$, and $x_3$ has only chance correlations with $t$. The marginal likelihood for a Gaussian process with ARD parameters $\eta_1, \eta_2, \eta_3$ is optimized using the scaled conjugate gradients algorithm. We see from Figure 6.10 that $\eta_1$ converges to a relatively large value, $\eta_2$ converges to a much smaller value, and $\eta_3$ becomes very small indicating that $x_3$ is irrelevant for predicting $t$.

The ARD framework is easily incorporated into the exponential-quadratic kernel (6.63) to give the following form of kernel function, which has been found useful for applications of Gaussian processes to a range of regression problems

$$k(x_n, x_m) = \theta_0 \exp\left\{ -\frac{1}{2}\sum_{i=1}^{D} \eta_i (x_{ni} - x_{mi})^2 \right\} + \theta_2 + \theta_3 \sum_{i=1}^{D} x_{ni} x_{mi} \tag{6.72}$$

where $D$ is the dimensionality of the input space.

6.4.5 Gaussian processes for classification

In a probabilistic approach to classification, our goal is to model the posterior probabilities of the target variable for a new input vector, given a set of training data. These probabilities must lie in the interval $(0, 1)$, whereas a Gaussian process model makes predictions that lie on the entire real axis. However, we can easily adapt Gaussian processes to classification problems by transforming the output of the Gaussian process using an appropriate nonlinear activation function.

Consider first the two-class problem with a target variable $t \in \{0, 1\}$. If we define a Gaussian process over a function $a(x)$ and then transform the function using a logistic sigmoid $y = \sigma(a)$, given by (4.59), then we will obtain a non-Gaussian stochastic process over functions $y(x)$ where $y \in (0, 1)$.

334 [Figure 6.11: The left plot shows a sample from a Gaussian process prior over functions $a(x)$, and the right plot shows the result of transforming this sample using a logistic sigmoid function.]

This is illustrated for the case of a one-dimensional input space in Figure 6.11, in which the probability distribution over the target variable $t$ is then given by the Bernoulli distribution

$$p(t \mid a) = \sigma(a)^t (1 - \sigma(a))^{1-t}. \tag{6.73}$$

As usual, we denote the training set inputs by $x_1, \ldots, x_N$ with corresponding observed target variables $t = (t_1, \ldots, t_N)^T$. We also consider a single test point $x_{N+1}$ with target value $t_{N+1}$. Our goal is to determine the predictive distribution $p(t_{N+1} \mid t)$, where we have left the conditioning on the input variables implicit. To do this we introduce a Gaussian process prior over the vector $a_{N+1}$, which has components $a(x_1), \ldots, a(x_{N+1})$. This in turn defines a non-Gaussian process over $t_{N+1}$, and by conditioning on the training data we obtain the required predictive distribution. The Gaussian process prior for $a_{N+1}$ takes the form

$$p(a_{N+1}) = \mathcal{N}(a_{N+1} \mid 0, C_{N+1}). \tag{6.74}$$

Unlike the regression case, the covariance matrix no longer includes a noise term because we assume that all of the training data points are correctly labelled. However, for numerical reasons it is convenient to introduce a noise-like term governed by a parameter $\nu$ that ensures that the covariance matrix is positive definite. Thus the covariance matrix $C_{N+1}$ has elements given by

$$C(x_n, x_m) = k(x_n, x_m) + \nu\delta_{nm} \tag{6.75}$$

where $k(x_n, x_m)$ is any positive semidefinite kernel function of the kind considered in Section 6.2, and the value of $\nu$ is typically fixed in advance. We shall assume that the kernel function $k(x, x')$ is governed by a vector $\theta$ of parameters, and we shall later discuss how $\theta$ may be learned from the training data.

335 For two-class problems, it is sufficient to predict $p(t_{N+1} = 1 \mid t_N)$ because the value of $p(t_{N+1} = 0 \mid t_N)$ is then given by $1 - p(t_{N+1} = 1 \mid t_N)$. The required predictive distribution is given by

$$p(t_{N+1} = 1 \mid t_N) = \int p(t_{N+1} = 1 \mid a_{N+1})\, p(a_{N+1} \mid t_N)\, da_{N+1} \tag{6.76}$$

where $p(t_{N+1} = 1 \mid a_{N+1}) = \sigma(a_{N+1})$.

This integral is analytically intractable, and so may be approximated using sampling methods (Neal, 1997). Alternatively, we can consider techniques based on an analytical approximation. In Section 4.5.2, we derived the approximate formula (4.153) for the convolution of a logistic sigmoid with a Gaussian distribution. We can use this result to evaluate the integral in (6.76) provided we have a Gaussian approximation to the posterior distribution $p(a_{N+1} \mid t_N)$. The usual justification for a Gaussian approximation to a posterior distribution is that the true posterior will tend to a Gaussian as the number of data points increases as a consequence of the central limit theorem. (Section 2.3) In the case of Gaussian processes, the number of variables grows with the number of data points, and so this argument does not apply directly. However, if we consider increasing the number of data points falling in a fixed region of $x$ space, then the corresponding uncertainty in the function $a(x)$ will decrease, again leading asymptotically to a Gaussian (Williams and Barber, 1998).

Three different approaches to obtaining a Gaussian approximation have been considered. One technique is based on variational inference (Gibbs and MacKay, 2000) and makes use of the local variational bound (10.144) on the logistic sigmoid. (Section 10.1) This allows the product of sigmoid functions to be approximated by a product of Gaussians thereby allowing the marginalization over $a_N$ to be performed analytically. The approach also yields a lower bound on the likelihood function $p(t_N \mid \theta)$. The variational framework for Gaussian process classification can also be extended to multiclass ($K > 2$) problems by using a Gaussian approximation to the softmax function (Gibbs, 1997).

A second approach uses expectation propagation (Opper and Winther, 2000b; Minka, 2001b; Seeger, 2003). (Section 10.7) Because the true posterior distribution is unimodal, as we shall see shortly, the expectation propagation approach can give good results.

6.4.6 Laplace approximation

The third approach to Gaussian process classification is based on the Laplace approximation, which we now consider in detail. (Section 4.4) In order to evaluate the predictive distribution (6.76), we seek a Gaussian approximation to the posterior distribution over $a_{N+1}$, which, using Bayes' theorem, is given by

$$\begin{aligned} p(a_{N+1} \mid t_N) &= \int p(a_{N+1}, a_N \mid t_N)\, da_N \\ &= \frac{1}{p(t_N)}\int p(a_{N+1}, a_N)\, p(t_N \mid a_{N+1}, a_N)\, da_N \\ &= \frac{1}{p(t_N)}\int p(a_{N+1} \mid a_N)\, p(a_N)\, p(t_N \mid a_N)\, da_N \\ &= \int p(a_{N+1} \mid a_N)\, p(a_N \mid t_N)\, da_N \end{aligned} \tag{6.77}$$

336 where we have used $p(t_N \mid a_{N+1}, a_N) = p(t_N \mid a_N)$. The conditional distribution $p(a_{N+1} \mid a_N)$ is obtained by invoking the results (6.66) and (6.67) for Gaussian process regression, to give

$$p(a_{N+1} \mid a_N) = \mathcal{N}\left( a_{N+1} \mid k^T C_N^{-1} a_N,\; c - k^T C_N^{-1} k \right). \tag{6.78}$$

We can therefore evaluate the integral in (6.77) by finding a Laplace approximation for the posterior distribution $p(a_N \mid t_N)$, and then using the standard result for the convolution of two Gaussian distributions.

The prior $p(a_N)$ is given by a zero-mean Gaussian process with covariance matrix $C_N$, and the data term (assuming independence of the data points) is given by

$$p(t_N \mid a_N) = \prod_{n=1}^{N} \sigma(a_n)^{t_n}(1 - \sigma(a_n))^{1-t_n} = \prod_{n=1}^{N} e^{a_n t_n}\sigma(-a_n). \tag{6.79}$$

We then obtain the Laplace approximation by Taylor expanding the logarithm of $p(a_N \mid t_N)$, which up to an additive normalization constant is given by the quantity

$$\Psi(a_N) = \ln p(a_N) + \ln p(t_N \mid a_N) = -\frac{1}{2}\, a_N^T C_N^{-1} a_N - \frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|C_N| + t_N^T a_N - \sum_{n=1}^{N}\ln(1 + e^{a_n}) + \text{const}. \tag{6.80}$$

First we need to find the mode of the posterior distribution, and this requires that we evaluate the gradient of $\Psi(a_N)$, which is given by

$$\nabla\Psi(a_N) = t_N - \sigma_N - C_N^{-1} a_N \tag{6.81}$$

where $\sigma_N$ is a vector with elements $\sigma(a_n)$. We cannot simply find the mode by setting this gradient to zero, because $\sigma_N$ depends nonlinearly on $a_N$, and so we resort to an iterative scheme based on the Newton-Raphson method, which gives rise to an iterative reweighted least squares (IRLS) algorithm. (Section 4.3.3) This requires the second derivatives of $\Psi(a_N)$, which we also require for the Laplace approximation anyway, and which are given by

$$\nabla\nabla\Psi(a_N) = -W_N - C_N^{-1} \tag{6.82}$$

where $W_N$ is a diagonal matrix with elements $\sigma(a_n)(1 - \sigma(a_n))$, and we have used the result (4.88) for the derivative of the logistic sigmoid function. Note that these diagonal elements lie in the range $(0, 1/4)$, and hence $W_N$ is a positive definite matrix. Because $C_N$ (and hence its inverse) is positive definite by construction, and because the sum of two positive definite matrices is also positive definite, we see (Exercise 6.24) that the Hessian matrix $A = -\nabla\nabla\Psi(a_N)$ is positive definite and so the posterior distribution $p(a_N \mid t_N)$ is log concave and therefore has a single mode that is the global maximum.

337 The posterior distribution is not Gaussian, however, because the Hessian is a function of $a_N$.

Using the Newton-Raphson formula (4.92), the iterative update equation for $a_N$ (Exercise 6.25) is given by

$$a_N^{\text{new}} = C_N (I + W_N C_N)^{-1}\left\{ t_N - \sigma_N + W_N a_N \right\}. \tag{6.83}$$

These equations are iterated until they converge to the mode which we denote by $a_N^\star$. At the mode, the gradient $\nabla\Psi(a_N)$ will vanish, and hence $a_N^\star$ will satisfy

$$a_N^\star = C_N (t_N - \sigma_N). \tag{6.84}$$

Once we have found the mode $a_N^\star$ of the posterior, we can evaluate the Hessian matrix given by

$$H = -\nabla\nabla\Psi(a_N) = W_N + C_N^{-1} \tag{6.85}$$

where the elements of $W_N$ are evaluated using $a_N^\star$. This defines our Gaussian approximation to the posterior distribution $p(a_N \mid t_N)$ given by

$$q(a_N) = \mathcal{N}(a_N \mid a_N^\star, H^{-1}). \tag{6.86}$$

We can now combine this with (6.78) and hence evaluate the integral (6.77). Because this corresponds to a linear-Gaussian model, we can use the general result (2.115) to give (Exercise 6.26)

$$\mathbb{E}[a_{N+1} \mid t_N] = k^T (t_N - \sigma_N) \tag{6.87}$$

$$\operatorname{var}[a_{N+1} \mid t_N] = c - k^T (W_N^{-1} + C_N)^{-1} k. \tag{6.88}$$

Now that we have a Gaussian distribution for $p(a_{N+1} \mid t_N)$, we can approximate the integral (6.76) using the result (4.153). As with the Bayesian logistic regression model of Section 4.5, if we are only interested in the decision boundary corresponding to $p(t_{N+1} \mid t_N) = 0.5$, then we need only consider the mean and we can ignore the effect of the variance.

We also need to determine the parameters $\theta$ of the covariance function. One approach is to maximize the likelihood function given by $p(t_N \mid \theta)$ for which we need expressions for the log likelihood and its gradient. If desired, suitable regularization terms can also be added, leading to a penalized maximum likelihood solution. The likelihood function is defined by

$$p(t_N \mid \theta) = \int p(t_N \mid a_N)\, p(a_N \mid \theta)\, da_N. \tag{6.89}$$

This integral is analytically intractable, so again we make use of the Laplace approximation. Using the result (4.135), we obtain the following approximation for the log of the likelihood function

$$\ln p(t_N \mid \theta) = \Psi(a_N^\star) - \frac{1}{2}\ln\left| W_N + C_N^{-1} \right| + \frac{N}{2}\ln(2\pi) \tag{6.90}$$

where $\Psi(a_N^\star) = \ln p(a_N^\star) + \ln p(t_N \mid a_N^\star)$.
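The mode-finding iteration (6.83) is again only a few lines of numpy. The sketch below is my illustration (the covariance and the binary targets are toy assumptions); it iterates the update and then checks the self-consistency condition (6.84) at the mode.

```python
import numpy as np

def laplace_mode(C_N, t, n_iter=50):
    # Newton-Raphson / IRLS update (6.83) for the posterior mode over a_N
    a = np.zeros_like(t)
    I = np.eye(len(t))
    for _ in range(n_iter):
        sigma = 1.0 / (1.0 + np.exp(-a))       # logistic sigmoid, elementwise
        W = np.diag(sigma * (1.0 - sigma))     # diagonal weights from (6.82)
        a = C_N @ np.linalg.solve(I + W @ C_N, t - sigma + W @ a)
    return a

# toy usage with a Gaussian-kernel covariance plus jitter, as in (6.75)
rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, 15)
C_N = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1) + 1e-4 * np.eye(15)
t = (X > 0).astype(float)
a_star = laplace_mode(C_N, t)
# at the mode, a* = C_N (t - sigma(a*)) as required by (6.84)
print(np.allclose(a_star, C_N @ (t - 1 / (1 + np.exp(-a_star))), atol=1e-6))
```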

338 We also need to evaluate the gradient of $\ln p(t_N \mid \theta)$ with respect to the parameter vector $\theta$. Note that changes in $\theta$ will cause changes in $a_N^\star$, leading to additional terms in the gradient. Thus, when we differentiate (6.90) with respect to $\theta$, we obtain two sets of terms, the first arising from the dependence of the covariance matrix $C_N$ on $\theta$, and the rest arising from dependence of $a_N^\star$ on $\theta$.

The terms arising from the explicit dependence on $\theta$ can be found by using (6.80) together with the results (C.21) and (C.22), and are given by

$$\frac{\partial\ln p(t_N \mid \theta)}{\partial\theta_j} = \frac{1}{2}\, a_N^{\star T} C_N^{-1}\frac{\partial C_N}{\partial\theta_j} C_N^{-1} a_N^\star - \frac{1}{2}\operatorname{Tr}\left[ (I + C_N W_N)^{-1} W_N\frac{\partial C_N}{\partial\theta_j} \right]. \tag{6.91}$$

To compute the terms arising from the dependence of $a_N^\star$ on $\theta$, we note that the Laplace approximation has been constructed such that $\Psi(a_N)$ has zero gradient at $a_N = a_N^\star$, and so $\Psi(a_N^\star)$ gives no contribution to the gradient as a result of its dependence on $a_N^\star$. This leaves the following contribution to the derivative with respect to a component $\theta_j$ of $\theta$

$$-\frac{1}{2}\sum_{n=1}^{N}\frac{\partial\ln\left| W_N + C_N^{-1} \right|}{\partial a_n^\star}\frac{\partial a_n^\star}{\partial\theta_j} = -\frac{1}{2}\sum_{n=1}^{N}\left[ (I + C_N W_N)^{-1} C_N \right]_{nn} \sigma_n^\star (1 - \sigma_n^\star)(1 - 2\sigma_n^\star)\frac{\partial a_n^\star}{\partial\theta_j} \tag{6.92}$$

where $\sigma_n^\star = \sigma(a_n^\star)$, and again we have used the result (C.22) together with the definition of $W_N$. We can evaluate the derivative of $a_N^\star$ with respect to $\theta_j$ by differentiating the relation (6.84) with respect to $\theta_j$ to give

$$\frac{\partial a_N^\star}{\partial\theta_j} = \frac{\partial C_N}{\partial\theta_j}(t_N - \sigma_N) - C_N W_N\frac{\partial a_N^\star}{\partial\theta_j}. \tag{6.93}$$

Rearranging then gives

$$\frac{\partial a_N^\star}{\partial\theta_j} = (I + C_N W_N)^{-1}\frac{\partial C_N}{\partial\theta_j}(t_N - \sigma_N). \tag{6.94}$$

Combining (6.91), (6.92), and (6.94), we can evaluate the gradient of the log likelihood function, which can be used with standard nonlinear optimization algorithms in order to determine a value for $\theta$.

We can illustrate the application of the Laplace approximation for Gaussian processes using the synthetic two-class data set shown in Figure 6.12. (Appendix A) Extension of the Laplace approximation to Gaussian processes involving $K > 2$ classes, using the softmax activation function, is straightforward (Williams and Barber, 1998).

339 [Figure 6.12: Illustration of the use of a Gaussian process for classification, showing the data on the left together with the optimal decision boundary from the true distribution in green, and the decision boundary from the Gaussian process classifier in black. On the right is the predicted posterior probability for the blue and red classes together with the Gaussian process decision boundary.]

6.4.7 Connection to neural networks

We have seen that the range of functions which can be represented by a neural network is governed by the number $M$ of hidden units, and that, for sufficiently large $M$, a two-layer network can approximate any given function with arbitrary accuracy. In the framework of maximum likelihood, the number of hidden units needs to be limited (to a level dependent on the size of the training set) in order to avoid over-fitting. However, from a Bayesian perspective it makes little sense to limit the number of parameters in the network according to the size of the training set.

In a Bayesian neural network, the prior distribution over the parameter vector $w$, in conjunction with the network function $f(x, w)$, produces a prior distribution over functions $y(x)$ where $y$ is the vector of network outputs. Neal (1996) has shown that, for a broad class of prior distributions over $w$, the distribution of functions generated by a neural network will tend to a Gaussian process in the limit $M \to \infty$. It should be noted, however, that in this limit the output variables of the neural network become independent. One of the great merits of neural networks is that the outputs share the hidden units and so they can 'borrow statistical strength' from each other, that is, the weights associated with each hidden unit are influenced by all of the output variables not just by one of them. This property is therefore lost in the Gaussian process limit.

We have seen that a Gaussian process is determined by its covariance (kernel) function. Williams (1998) has given explicit forms for the covariance in the case of two specific choices for the hidden unit activation function (probit and Gaussian). These kernel functions $k(x, x')$ are nonstationary, i.e. they cannot be expressed as a function of the difference $x - x'$, as a consequence of the Gaussian weight prior being centred on zero which breaks translation invariance in weight space.

By working directly with the covariance function we have implicitly marginalized over the distribution of weights. If the weight prior is governed by hyperparameters, then their values will determine the length scales of the distribution over functions, as can be understood by studying the examples in Figure 5.11 for the case of a finite number of hidden units. Note that we cannot marginalize out the hyperparameters analytically, and must instead resort to techniques of the kind discussed in Section 6.4.

Exercises

6.1 www Consider the dual formulation of the least squares linear regression problem given in Section 6.1. Show that the solution for the components a_n of the vector a can be expressed as a linear combination of the elements of the vector φ(x_n). Denoting these coefficients by the vector w, show that the dual of the dual formulation is given by the original representation in terms of the parameter vector w.

6.2 In this exercise, we develop a dual formulation of the perceptron learning algorithm. Using the perceptron learning rule (4.55), show that the learned weight vector w can be written as a linear combination of the vectors t_n φ(x_n) where t_n ∈ {−1, +1}. Denote the coefficients of this linear combination by α_n and derive a formulation of the perceptron learning algorithm, and the predictive function for the perceptron, in terms of the α_n. Show that the feature vector φ(x) enters only in the form of the kernel function k(x, x′) = φ(x)^T φ(x′).

6.3 The nearest-neighbour classifier (Section 2.5.2) assigns a new input vector x to the same class as that of the nearest input vector x_n from the training set, where in the simplest case, the distance is defined by the Euclidean metric ‖x − x_n‖^2. By expressing this rule in terms of scalar products and then making use of kernel substitution, formulate the nearest-neighbour classifier for a general nonlinear kernel.

6.4 In Appendix C, we give an example of a matrix that has positive elements but that has a negative eigenvalue and hence that is not positive definite. Find an example of the converse property, namely a 2 × 2 matrix with positive eigenvalues yet that has at least one negative element.

6.5 www Verify the results (6.13) and (6.14) for constructing valid kernels.

6.6 Verify the results (6.15) and (6.16) for constructing valid kernels.

6.7 www Verify the results (6.17) and (6.18) for constructing valid kernels.

6.8 Verify the results (6.19) and (6.20) for constructing valid kernels.

6.9 Verify the results (6.21) and (6.22) for constructing valid kernels.

6.10 Show that an excellent choice of kernel for learning a function f(x) is given by k(x, x′) = f(x) f(x′) by showing that a linear learning machine based on this kernel will always find a solution proportional to f(x).

6.11 By making use of the expansion (6.25), and then expanding the middle factor as a power series, show that the Gaussian kernel (6.23) can be expressed as the inner product of an infinite-dimensional feature vector.

6.12 www Consider the space of all possible subsets A of a given fixed set D. Show that the kernel function (6.27) corresponds to an inner product in a feature space of dimensionality 2^{|D|} defined by the mapping φ(A) where A is a subset of D and the element φ_U(A), indexed by the subset U, is given by

φ_U(A) = { 1, if U ⊆ A; 0, otherwise.    (6.95)

Here U ⊆ A denotes that U is either a subset of A or is equal to A.

6.13 Show that the Fisher kernel, defined by (6.33), remains invariant if we make a nonlinear transformation of the parameter vector θ → ψ(θ), where the function ψ(·) is invertible and differentiable.

6.14 www Write down the form of the Fisher kernel, defined by (6.33), for the case of a distribution p(x | μ) = N(x | μ, S) that is Gaussian with mean μ and fixed covariance S.

6.15 By considering the determinant of a 2 × 2 Gram matrix, show that a positive-definite kernel function k(x, x′) satisfies the Cauchy-Schwartz inequality

k(x_1, x_2)^2 ≤ k(x_1, x_1) k(x_2, x_2).    (6.96)

6.16 Consider a parametric model governed by the parameter vector w together with a data set of input values x_1, ..., x_N, and a nonlinear feature mapping φ(x). Suppose that the dependence of the error function on w takes the form

J(w) = f(w^T φ(x_1), ..., w^T φ(x_N)) + g(w^T w)    (6.97)

where g(·) is a monotonically increasing function. By writing w in the form

w = Σ_{n=1}^{N} α_n φ(x_n) + w_⊥    (6.98)

show that the value of w that minimizes J(w) takes the form of a linear combination of the basis functions φ(x_n) for n = 1, ..., N.

6.17 www Consider the sum-of-squares error function (6.39) for data having noisy inputs, where ν(ξ) is the distribution of the noise. Use the calculus of variations to minimize this error function with respect to the function y(x), and hence show that the optimal solution is given by an expansion of the form (6.40) in which the basis functions are given by (6.41).

6.18 Consider a Nadaraya-Watson model with one input variable x and one target variable t having Gaussian components with isotropic covariances, so that the covariance matrix is given by σ^2 I where I is the unit matrix. Write down expressions for the conditional density p(t | x) and for the conditional mean E[t | x] and variance var[t | x], in terms of the kernel function k(x, x_n).

6.19 Another viewpoint on kernel regression comes from a consideration of regression problems in which the input variables as well as the target variables are corrupted with additive noise. Suppose each target value t_n is generated as usual by taking a function y(z_n) evaluated at a point z_n, and adding Gaussian noise. The value of z_n is not directly observed, however, but only a noise corrupted version x_n = z_n + ξ_n where the random variable ξ is governed by some distribution g(ξ). Consider a set of observations {x_n, t_n}, where n = 1, ..., N, together with a corresponding sum-of-squares error function defined by averaging over the distribution of input noise to give

E = (1/2) Σ_{n=1}^{N} ∫ { y(x_n − ξ_n) − t_n }^2 g(ξ_n) dξ_n.    (6.99)

By minimizing E with respect to the function y(z) using the calculus of variations (Appendix D), show that the optimal solution for y(x) is given by a Nadaraya-Watson kernel regression solution of the form (6.45) with a kernel of the form (6.46).

6.20 www Verify the results (6.66) and (6.67).

6.21 www Consider a Gaussian process regression model in which the kernel function is defined in terms of a fixed set of nonlinear basis functions. Show that the predictive distribution is identical to the result (3.58) obtained in Section 3.3.2 for the Bayesian linear regression model. To do this, note that both models have Gaussian predictive distributions, and so it is only necessary to show that the conditional mean and variance are the same. For the mean, make use of the matrix identity (C.6), and for the variance, make use of the matrix identity (C.7).

6.22 Consider a regression problem with N training set input vectors x_1, ..., x_N and L test set input vectors x_{N+1}, ..., x_{N+L}, and suppose we define a Gaussian process prior over functions t(x). Derive an expression for the joint predictive distribution for t(x_{N+1}), ..., t(x_{N+L}), given the values of t(x_1), ..., t(x_N). Show that the marginal of this distribution for one of the test observations t_j where N + 1 ≤ j ≤ N + L is given by the usual Gaussian process regression result (6.66) and (6.67).

6.23 www Consider a Gaussian process regression model in which the target variable t has dimensionality D. Write down the conditional distribution of t_{N+1} for a test input vector x_{N+1}, given a training set of input vectors x_1, ..., x_N and corresponding target observations t_1, ..., t_N.

6.24 Show that a diagonal matrix W whose elements satisfy 0 < W_{nn} < 1 is positive definite. Show that the sum of two positive definite matrices is itself positive definite.

6.25 www Using the Newton-Raphson formula (4.92), derive the iterative update formula (6.83) for finding the mode a_N^* of the posterior distribution in the Gaussian process classification model.

6.26 Using the result (2.115), derive the expressions (6.87) and (6.88) for the mean and variance of the posterior distribution p(a_{N+1} | t_N) in the Gaussian process classification model.

6.27 Derive the result (6.90) for the log likelihood function in the Laplace approximation framework for Gaussian process classification. Similarly, derive the results (6.91), (6.92), and (6.94) for the terms in the gradient of the log likelihood.


7 Sparse Kernel Machines

In the previous chapter, we explored a variety of learning algorithms based on nonlinear kernels. One of the significant limitations of many such algorithms is that the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points, which can be computationally infeasible during training and can lead to excessive computation times when making predictions for new data points. In this chapter we shall look at kernel-based algorithms that have sparse solutions, so that predictions for new inputs depend only on the kernel function evaluated at a subset of the training data points.

We begin by looking in some detail at the support vector machine (SVM), which became popular some years ago for solving problems in classification, regression, and novelty detection. An important property of support vector machines is that the determination of the model parameters corresponds to a convex optimization problem, and so any local solution is also a global optimum. Because the discussion of support vector machines makes extensive use of Lagrange multipliers, the reader is

encouraged to review the key concepts covered in Appendix E. Additional information on support vector machines can be found in Vapnik (1995), Burges (1998), Cristianini and Shawe-Taylor (2000), Müller et al. (2001), Schölkopf and Smola (2002), and Herbrich (2002).

The SVM is a decision machine and so does not provide posterior probabilities. We have already discussed some of the benefits of determining probabilities in Section 1.5.4. An alternative sparse kernel technique, known as the relevance vector machine (RVM), is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having typically much sparser solutions than the SVM (Section 7.2).

7.1. Maximum Margin Classifiers

We begin our discussion of support vector machines by returning to the two-class classification problem using linear models of the form

y(x) = w^T φ(x) + b    (7.1)

where φ(x) denotes a fixed feature-space transformation, and we have made the bias parameter b explicit. Note that we shall shortly introduce a dual representation expressed in terms of kernel functions, which avoids having to work explicitly in feature space. The training data set comprises N input vectors x_1, ..., x_N, with corresponding target values t_1, ..., t_N where t_n ∈ {−1, 1}, and new data points x are classified according to the sign of y(x).

We shall assume for the moment that the training data set is linearly separable in feature space, so that by definition there exists at least one choice of the parameters w and b such that a function of the form (7.1) satisfies y(x_n) > 0 for points having t_n = +1 and y(x_n) < 0 for points having t_n = −1, so that t_n y(x_n) > 0 for all training data points.

There may of course exist many such solutions that separate the classes exactly. In Section 4.1.7, we described the perceptron algorithm that is guaranteed to find a solution in a finite number of steps. The solution that it finds, however, will be dependent on the (arbitrary) initial values chosen for w and b as well as on the order in which the data points are presented. If there are multiple solutions all of which classify the training data set exactly, then we should try to find the one that will give the smallest generalization error. The support vector machine approaches this problem through the concept of the margin, which is defined to be the smallest distance between the decision boundary and any of the samples, as illustrated in Figure 7.1.

In support vector machines the decision boundary is chosen to be the one for which the margin is maximized. The maximum margin solution can be motivated using computational learning theory, also known as statistical learning theory (Section 7.1.5). However, a simple insight into the origins of maximum margin has been given by Tong and Koller (2000) who consider a framework for classification based on a hybrid of generative and discriminative approaches. They first model the distribution over input vectors x for each class using a Parzen density estimator with Gaussian kernels having a common parameter σ^2.

Figure 7.1 The margin is defined as the perpendicular distance between the decision boundary and the closest of the data points, as shown on the left figure. Maximizing the margin leads to a particular choice of decision boundary, as shown on the right. The location of this boundary is determined by a subset of the data points, known as support vectors, which are indicated by the circles.

Together with the class priors, this defines an optimal misclassification-rate decision boundary. However, instead of using this optimal boundary, they determine the best hyperplane by minimizing the probability of error relative to the learned density model. In the limit σ^2 → 0, the optimal hyperplane is shown to be the one having maximum margin. The intuition behind this result is that as σ^2 is reduced, the hyperplane is increasingly dominated by nearby data points relative to more distant ones. In the limit, the hyperplane becomes independent of data points that are not support vectors.

We shall see in Figure 10.13 that marginalization with respect to the prior distribution of the parameters in a Bayesian approach for a simple linearly separable data set leads to a decision boundary that lies in the middle of the region separating the data points. The large margin solution has similar behaviour.

Recall from Figure 4.1 that the perpendicular distance of a point x from a hyperplane defined by y(x) = 0 where y(x) takes the form (7.1) is given by |y(x)|/‖w‖. Furthermore, we are only interested in solutions for which all data points are correctly classified, so that t_n y(x_n) > 0 for all n. Thus the distance of a point x_n to the decision surface is given by

t_n y(x_n)/‖w‖ = t_n (w^T φ(x_n) + b)/‖w‖.    (7.2)

The margin is given by the perpendicular distance to the closest point x_n from the data set, and we wish to optimize the parameters w and b in order to maximize this distance. Thus the maximum margin solution is found by solving

arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }    (7.3)

where we have taken the factor 1/‖w‖ outside the optimization over n because w does not depend on n.

Direct solution of this optimization problem would be very complex, and so we shall convert it into an equivalent problem that is much easier to solve. To do this we note that if we make the rescaling w → κw and b → κb, then the distance from any point x_n to the decision surface, given by t_n y(x_n)/‖w‖, is unchanged. We can use this freedom to set

t_n (w^T φ(x_n) + b) = 1    (7.4)

for the point that is closest to the surface. In this case, all data points will satisfy the constraints

t_n (w^T φ(x_n) + b) ≥ 1,    n = 1, ..., N.    (7.5)

This is known as the canonical representation of the decision hyperplane. In the case of data points for which the equality holds, the constraints are said to be active, whereas for the remainder they are said to be inactive. By definition, there will always be at least one active constraint, because there will always be a closest point, and once the margin has been maximized there will be at least two active constraints. The optimization problem then simply requires that we maximize ‖w‖^{-1}, which is equivalent to minimizing ‖w‖^2, and so we have to solve the optimization problem

arg min_{w,b} (1/2)‖w‖^2    (7.6)

subject to the constraints given by (7.5). The factor of 1/2 in (7.6) is included for later convenience. This is an example of a quadratic programming problem in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints. It appears that the bias parameter b has disappeared from the optimization. However, it is determined implicitly via the constraints, because these require that changes to ‖w‖ be compensated by changes to b. We shall see how this works shortly.

In order to solve this constrained optimization problem, we introduce Lagrange multipliers a_n ≥ 0, with one multiplier a_n for each of the constraints in (7.5), giving the Lagrangian function (Appendix E)

L(w, b, a) = (1/2)‖w‖^2 − Σ_{n=1}^{N} a_n { t_n (w^T φ(x_n) + b) − 1 }    (7.7)

where a = (a_1, ..., a_N)^T. Note the minus sign in front of the Lagrange multiplier term, because we are minimizing with respect to w and b, and maximizing with respect to a. Setting the derivatives of L(w, b, a) with respect to w and b equal to zero, we obtain the following two conditions

w = Σ_{n=1}^{N} a_n t_n φ(x_n)    (7.8)

0 = Σ_{n=1}^{N} a_n t_n.    (7.9)

Eliminating w and b from L(w, b, a) using these conditions then gives the dual representation of the maximum margin problem in which we maximize

L̃(a) = Σ_{n=1}^{N} a_n − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)    (7.10)

with respect to a subject to the constraints

a_n ≥ 0,    n = 1, ..., N,    (7.11)

Σ_{n=1}^{N} a_n t_n = 0.    (7.12)

Here the kernel function is defined by k(x, x′) = φ(x)^T φ(x′). Again, this takes the form of a quadratic programming problem in which we optimize a quadratic function of a subject to a set of inequality constraints. We shall discuss techniques for solving such quadratic programming problems in Section 7.1.1.

The solution to a quadratic programming problem in M variables in general has computational complexity that is O(M^3). In going to the dual formulation we have turned the original optimization problem, which involved minimizing (7.6) over M variables, into the dual problem (7.10), which has N variables. For a fixed set of basis functions whose number M is smaller than the number N of data points, the move to the dual problem appears disadvantageous. However, it allows the model to be reformulated using kernels, and so the maximum margin classifier can be applied efficiently to feature spaces whose dimensionality exceeds the number of data points, including infinite feature spaces. The kernel formulation also makes clear the role of the constraint that the kernel function k(x, x′) be positive definite, because this ensures that the Lagrangian function L̃(a) is bounded above, giving rise to a well-defined optimization problem.

In order to classify new data points using the trained model, we evaluate the sign of y(x) defined by (7.1). This can be expressed in terms of the parameters {a_n} and the kernel function by substituting for w using (7.8) to give

y(x) = Σ_{n=1}^{N} a_n t_n k(x, x_n) + b.    (7.13)

Joseph-Louis Lagrange (1736–1813). Although widely considered to be a French mathematician, Lagrange was born in Turin in Italy. By the age of nineteen, he had already made important contributions to mathematics and had been appointed as Professor at the Royal Artillery School in Turin. For many years, Euler worked hard to persuade Lagrange to move to Berlin, which he eventually did in 1766 where he succeeded Euler as Director of Mathematics at the Berlin Academy. Later he moved to Paris, narrowly escaping with his life during the French revolution thanks to the personal intervention of Lavoisier (the French chemist who discovered oxygen), who himself was later executed at the guillotine. Lagrange made key contributions to the calculus of variations and the foundations of dynamics.
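Because (7.10)-(7.12) is a generic quadratic program, it can be handed to any constrained optimizer. The sketch below is not from the text: it uses SciPy's SLSQP routine to maximize the dual (by minimizing its negative); the kernel and data are placeholders, and no attempt is made at the efficiency of the specialized methods discussed in Section 7.1.1.

import numpy as np
from scipy.optimize import minimize

def fit_hard_margin_dual(X, t, kernel):
    """Maximize L~(a) of (7.10) subject to a_n >= 0 and sum_n a_n t_n = 0."""
    N = len(t)
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    Q = (t[:, None] * t[None, :]) * K          # Q_nm = t_n t_m k(x_n, x_m)

    def neg_dual(a):                           # minimize the negative dual
        return 0.5 * a @ Q @ a - a.sum()

    def neg_dual_grad(a):
        return Q @ a - 1.0

    res = minimize(neg_dual, np.zeros(N), jac=neg_dual_grad, method="SLSQP",
                   bounds=[(0.0, None)] * N,                     # constraint (7.11)
                   constraints=[{"type": "eq", "fun": lambda a: a @ t}])  # (7.12)
    return res.x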

In Appendix E, we show that a constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which in this case require that the following three properties hold

a_n ≥ 0    (7.14)

t_n y(x_n) − 1 ≥ 0    (7.15)

a_n { t_n y(x_n) − 1 } = 0.    (7.16)

Thus for every data point, either a_n = 0 or t_n y(x_n) = 1. Any data point for which a_n = 0 will not appear in the sum in (7.13) and hence plays no role in making predictions for new data points. The remaining data points are called support vectors, and because they satisfy t_n y(x_n) = 1, they correspond to points that lie on the maximum margin hyperplanes in feature space, as illustrated in Figure 7.1. This property is central to the practical applicability of support vector machines. Once the model is trained, a significant proportion of the data points can be discarded and only the support vectors retained.

Having solved the quadratic programming problem and found a value for a, we can then determine the value of the threshold parameter b by noting that any support vector x_n satisfies t_n y(x_n) = 1. Using (7.13) this gives

t_n ( Σ_{m∈S} a_m t_m k(x_n, x_m) + b ) = 1    (7.17)

where S denotes the set of indices of the support vectors. Although we can solve this equation for b using an arbitrarily chosen support vector x_n, a numerically more stable solution is obtained by first multiplying through by t_n, making use of t_n^2 = 1, and then averaging these equations over all support vectors and solving for b to give

b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )    (7.18)

where N_S is the total number of support vectors.

For later comparison with alternative models, we can express the maximum-margin classifier in terms of the minimization of an error function, with a simple quadratic regularizer, in the form

Σ_{n=1}^{N} E_∞(y(x_n) t_n − 1) + λ‖w‖^2    (7.19)

where E_∞(z) is a function that is zero if z ≥ 0 and ∞ otherwise and ensures that the constraints (7.5) are satisfied. Note that as long as the regularization parameter satisfies λ > 0, its precise value plays no role.
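Continuing the solver sketch above, the threshold then follows from (7.18) and new points are classified through (7.13). The tolerance used to detect support vectors is a heuristic of ours, not something prescribed by the text.

import numpy as np

def bias_and_predict(a, X, t, kernel, x_new, tol=1e-6):
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    S = np.where(a > tol)[0]                   # indices of support vectors
    b = np.mean([t[n] - np.sum(a[S] * t[S] * K[n, S]) for n in S])   # (7.18)
    y = sum(a[n] * t[n] * kernel(x_new, X[n]) for n in S) + b        # (7.13)
    return b, np.sign(y)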

Figure 7.2 Example of synthetic data from two classes in two dimensions showing contours of constant y(x) obtained from a support vector machine having a Gaussian kernel function. Also shown are the decision boundary, the margin boundaries, and the support vectors.

Figure 7.2 shows an example of the classification resulting from training a support vector machine on a simple synthetic data set using a Gaussian kernel of the form (6.23). Although the data set is not linearly separable in the two-dimensional data space x, it is linearly separable in the nonlinear feature space defined implicitly by the nonlinear kernel function. Thus the training data points are perfectly separated in the original data space.

This example also provides a geometrical insight into the origin of sparsity in the SVM. The maximum margin hyperplane is defined by the location of the support vectors. Other data points can be moved around freely (so long as they remain outside the margin region) without changing the decision boundary, and so the solution will be independent of such data points.

7.1.1 Overlapping class distributions

So far, we have assumed that the training data points are linearly separable in the feature space φ(x). The resulting support vector machine will give exact separation of the training data in the original input space x, although the corresponding decision boundary will be nonlinear. In practice, however, the class-conditional distributions may overlap, in which case exact separation of the training data can lead to poor generalization.

We therefore need a way to modify the support vector machine so as to allow some of the training points to be misclassified. From (7.19) we see that in the case of separable classes, we implicitly used an error function that gave infinite error if a data point was misclassified and zero error if it was classified correctly, and then optimized the model parameters to maximize the margin. We now modify this approach so that data points are allowed to be on the 'wrong side' of the margin boundary, but with a penalty that increases with the distance from that boundary. For the subsequent optimization problem, it is convenient to make this penalty a linear function of this distance. To do this, we introduce slack variables, ξ_n ≥ 0 where n = 1, ..., N, with one slack variable for each training data point (Bennett, 1992; Cortes and Vapnik, 1995). These are defined by ξ_n = 0 for data points that are on or inside the correct margin boundary and ξ_n = |t_n − y(x_n)| for other points. Thus a data point that is on the decision boundary y(x_n) = 0 will have ξ_n = 1, and points with ξ_n > 1 will be misclassified.

Figure 7.3 Illustration of the slack variables ξ_n ≥ 0. Data points with circles around them are support vectors.

The exact classification constraints (7.5) are then replaced with

t_n y(x_n) ≥ 1 − ξ_n,    n = 1, ..., N    (7.20)

in which the slack variables are constrained to satisfy ξ_n ≥ 0. Data points for which ξ_n = 0 are correctly classified and are either on the margin or on the correct side of the margin. Points for which 0 < ξ_n ≤ 1 lie inside the margin, but on the correct side of the decision boundary, and those data points for which ξ_n > 1 lie on the wrong side of the decision boundary and are misclassified, as illustrated in Figure 7.3. This is sometimes described as relaxing the hard margin constraint to give a soft margin and allows some of the training set data points to be misclassified. Note that while slack variables allow for overlapping class distributions, this framework is still sensitive to outliers because the penalty for misclassification increases linearly with ξ.

Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize

C Σ_{n=1}^{N} ξ_n + (1/2)‖w‖^2    (7.21)

where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin. Because any point that is misclassified has ξ_n > 1, it follows that Σ_n ξ_n is an upper bound on the number of misclassified points. The parameter C is therefore analogous to (the inverse of) a regularization coefficient because it controls the trade-off between minimizing training errors and controlling model complexity. In the limit C → ∞, we will recover the earlier support vector machine for separable data.

We now wish to minimize (7.21) subject to the constraints (7.20) together with ξ_n ≥ 0. The corresponding Lagrangian is given by

L(w, b, a) = (1/2)‖w‖^2 + C Σ_{n=1}^{N} ξ_n − Σ_{n=1}^{N} a_n { t_n y(x_n) − 1 + ξ_n } − Σ_{n=1}^{N} μ_n ξ_n    (7.22)

where {a_n ≥ 0} and {μ_n ≥ 0} are Lagrange multipliers. The corresponding set of KKT conditions are given by (Appendix E)

a_n ≥ 0    (7.23)

t_n y(x_n) − 1 + ξ_n ≥ 0    (7.24)

a_n (t_n y(x_n) − 1 + ξ_n) = 0    (7.25)

μ_n ≥ 0    (7.26)

ξ_n ≥ 0    (7.27)

μ_n ξ_n = 0    (7.28)

where n = 1, ..., N.

We now optimize out w, b, and {ξ_n}, making use of the definition (7.1) of y(x), to give

∂L/∂w = 0 ⇒ w = Σ_{n=1}^{N} a_n t_n φ(x_n)    (7.29)

∂L/∂b = 0 ⇒ Σ_{n=1}^{N} a_n t_n = 0    (7.30)

∂L/∂ξ_n = 0 ⇒ a_n = C − μ_n.    (7.31)

Using these results to eliminate w, b, and {ξ_n} from the Lagrangian, we obtain the dual Lagrangian in the form

L̃(a) = Σ_{n=1}^{N} a_n − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)    (7.32)

which is identical to the separable case, except that the constraints are somewhat different. To see what these constraints are, we note that a_n ≥ 0 is required because these are Lagrange multipliers. Furthermore, (7.31) together with μ_n ≥ 0 implies a_n ≤ C. We therefore have to maximize (7.32) with respect to the dual variables {a_n} subject to

0 ≤ a_n ≤ C    (7.33)

Σ_{n=1}^{N} a_n t_n = 0    (7.34)

for n = 1, ..., N, where (7.33) are known as box constraints. This again represents a quadratic programming problem. If we substitute (7.29) into (7.1), we see that predictions for new data points are again made by using (7.13).
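In code, the only change relative to the separable-case sketch earlier is the box constraint (7.33): each multiplier is capped at C. Again this is a toy illustration using a general-purpose solver, not one of the specialized algorithms discussed in Section 7.1.1.

import numpy as np
from scipy.optimize import minimize

def fit_soft_margin_dual(X, t, kernel, C):
    N = len(t)
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    Q = (t[:, None] * t[None, :]) * K
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),   # negative of (7.32)
                   np.zeros(N), jac=lambda a: Q @ a - 1.0, method="SLSQP",
                   bounds=[(0.0, C)] * N,                 # box constraints (7.33)
                   constraints=[{"type": "eq", "fun": lambda a: a @ t}])  # (7.34)
    return res.x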

We can now interpret the resulting solution. As before, a subset of the data points may have a_n = 0, in which case they do not contribute to the predictive model (7.13). The remaining data points constitute the support vectors. These have a_n > 0 and hence from (7.25) must satisfy

t_n y(x_n) = 1 − ξ_n.    (7.35)

If a_n < C, then (7.31) implies that μ_n > 0, which from (7.28) requires ξ_n = 0 and hence such points lie on the margin. Points with a_n = C can lie inside the margin and can either be correctly classified if ξ_n ≤ 1 or misclassified if ξ_n > 1.

To determine the parameter b in (7.1), we note that those support vectors for which 0 < a_n < C have ξ_n = 0, so that t_n y(x_n) = 1 and hence they satisfy

t_n ( Σ_{m∈S} a_m t_m k(x_n, x_m) + b ) = 1.    (7.36)

Again, a numerically stable solution is obtained by averaging to give

b = (1/N_M) Σ_{n∈M} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )    (7.37)

where M denotes the set of indices of data points having 0 < a_n < C.

An alternative, equivalent formulation of the support vector machine, known as the ν-SVM, has been proposed by Schölkopf et al. (2000). This involves maximizing

L̃(a) = −(1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)    (7.38)

subject to the constraints

0 ≤ a_n ≤ 1/N    (7.39)

Σ_{n=1}^{N} a_n t_n = 0    (7.40)

Σ_{n=1}^{N} a_n ≥ ν.    (7.41)

This approach has the advantage that the parameter ν, which replaces C, can be interpreted as both an upper bound on the fraction of margin errors (points for which ξ_n > 0, which lie on the wrong side of the margin boundary and may or may not be misclassified) and a lower bound on the fraction of support vectors. An example of the ν-SVM applied to a synthetic data set is shown in Figure 7.4. Here Gaussian kernels of the form exp(−γ‖x − x′‖^2) have been used, with γ = 0.45.

Although predictions for new inputs are made using only the support vectors, the training phase (i.e., the determination of the parameters a and b) makes use of the whole data set, and so it is important to have efficient algorithms for solving

Figure 7.4 Illustration of the ν-SVM applied to a nonseparable data set in two dimensions. The support vectors are indicated by circles.

the quadratic programming problem. We first note that the objective function L̃(a) given by (7.10) or (7.32) is quadratic and so any local optimum will also be a global optimum provided the constraints define a convex region (which they do as a consequence of being linear). Direct solution of the quadratic programming problem using traditional techniques is often infeasible due to the demanding computation and memory requirements, and so more practical approaches need to be found. The technique of chunking (Vapnik, 1982) exploits the fact that the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero. This allows the full quadratic programming problem to be broken down into a series of smaller ones, whose goal is eventually to identify all of the nonzero Lagrange multipliers and discard the others. Chunking can be implemented using protected conjugate gradients (Burges, 1998). Although chunking reduces the size of the matrix in the quadratic function from the number of data points squared to approximately the number of nonzero Lagrange multipliers squared, even this may be too big to fit in memory for large-scale applications. Decomposition methods (Osuna et al., 1996) also solve a series of smaller quadratic programming problems but are designed so that each of these is of a fixed size, and so the technique can be applied to arbitrarily large data sets. However, it still involves numerical solution of quadratic programming subproblems and these can be problematic and expensive. One of the most popular approaches to training support vector machines is called sequential minimal optimization, or SMO (Platt, 1999). It takes the concept of chunking to the extreme limit and considers just two Lagrange multipliers at a time. In this case, the subproblem can be solved analytically, thereby avoiding numerical quadratic programming altogether. Heuristics are given for choosing the pair of Lagrange multipliers to be considered at each step. In practice, SMO is found to have a scaling with the number of data points that is somewhere between linear and quadratic depending on the particular application.

We have seen that kernel functions correspond to inner products in feature spaces that can have high, or even infinite, dimensionality. By working directly in terms of the kernel function, without introducing the feature space explicitly, it might therefore seem that support vector machines somehow manage to avoid the curse of dimensionality.

This is not the case, however, because there are constraints amongst the feature values that restrict the effective dimensionality of feature space (Section 1.4). To see this consider a simple second-order polynomial kernel that we can expand in terms of its components

k(x, z) = (1 + x^T z)^2 = (1 + x_1 z_1 + x_2 z_2)^2
        = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
        = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2)(1, √2 z_1, √2 z_2, z_1^2, √2 z_1 z_2, z_2^2)^T
        = φ(x)^T φ(z).    (7.42)

This kernel function therefore represents an inner product in a feature space having six dimensions, in which the mapping from input space to feature space is described by the vector function φ(x). However, the coefficients weighting these different features are constrained to have specific forms. Thus any set of points in the original two-dimensional space x would be constrained to lie exactly on a two-dimensional nonlinear manifold embedded in the six-dimensional feature space.

We have already highlighted the fact that the support vector machine does not provide probabilistic outputs but instead makes classification decisions for new input vectors. Veropoulos et al. (1999) discuss modifications to the SVM to allow the trade-off between false positive and false negative errors to be controlled. However, if we wish to use the SVM as a module in a larger probabilistic system, then probabilistic predictions of the class label t for new inputs x are required.

To address this issue, Platt (2000) has proposed fitting a logistic sigmoid to the outputs of a previously trained support vector machine. Specifically, the required conditional probability is assumed to be of the form

p(t = 1 | x) = σ( A y(x) + B )    (7.43)

where y(x) is defined by (7.1). Values for the parameters A and B are found by minimizing the cross-entropy error function defined by a training set consisting of pairs of values y(x_n) and t_n. The data used to fit the sigmoid needs to be independent of that used to train the original SVM in order to avoid severe over-fitting. This two-stage approach is equivalent to assuming that the output y(x) of the support vector machine represents the log-odds of x belonging to class t = 1. Because the SVM training procedure is not specifically intended to encourage this, the SVM can give a poor approximation to the posterior probabilities (Tipping, 2001).

7.1.2 Relation to logistic regression

As with the separable case, we can re-cast the SVM for nonseparable distributions in terms of the minimization of a regularized error function. This will also allow us to highlight similarities, and differences, compared to the logistic regression model (Section 4.3.2).
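The expansion (7.42) can be checked numerically in a few lines; the sketch below simply verifies that the kernel value agrees with the inner product of the explicit six-dimensional feature vectors.

import numpy as np

def phi(v):
    v1, v2 = v
    return np.array([1.0, np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1**2, np.sqrt(2) * v1 * v2, v2**2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
assert np.isclose((1.0 + x @ z) ** 2, phi(x) @ phi(z))   # (7.42) holds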

Figure 7.5 Plot of the 'hinge' error function E(z) used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.

We have seen that for data points that are on the correct side of the margin boundary, and which therefore satisfy y_n t_n ≥ 1, we have ξ_n = 0, and for the remaining points we have ξ_n = 1 − y_n t_n. Thus the objective function (7.21) can be written (up to an overall multiplicative constant) in the form

Σ_{n=1}^{N} E_SV(y_n t_n) + λ‖w‖^2    (7.44)

where λ = (2C)^{-1}, and E_SV(·) is the hinge error function defined by

E_SV(y_n t_n) = [1 − y_n t_n]_+    (7.45)

where [·]_+ denotes the positive part. The hinge error function, so-called because of its shape, is plotted in Figure 7.5. It can be viewed as an approximation to the misclassification error, i.e., the error function that ideally we would like to minimize, which is also shown in Figure 7.5.

When we considered the logistic regression model in Section 4.3.2, we found it convenient to work with target variable t ∈ {0, 1}. For comparison with the support vector machine, we first reformulate maximum likelihood logistic regression using the target variable t ∈ {−1, 1}. To do this, we note that p(t = 1 | y) = σ(y) where y(x) is given by (7.1), and σ(y) is the logistic sigmoid function defined by (4.59). It follows that p(t = −1 | y) = 1 − σ(y) = σ(−y), where we have used the properties of the logistic sigmoid function, and so we can write

p(t | y) = σ(yt).    (7.46)

From this we can construct an error function by taking the negative logarithm of the likelihood function that, with a quadratic regularizer, takes the form (Exercise 7.6)

Σ_{n=1}^{N} E_LR(y_n t_n) + λ‖w‖^2    (7.47)

where

E_LR(yt) = ln(1 + exp(−yt)).    (7.48)
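The error functions (7.45) and (7.48) are trivial to express in code, which also makes the rescaling used in Figure 7.5 explicit:

import numpy as np

def hinge(yt):                    # E_SV(yt) = [1 - yt]_+, equation (7.45)
    return np.maximum(0.0, 1.0 - yt)

def logistic(yt):                 # E_LR(yt) = ln(1 + exp(-yt)), equation (7.48)
    return np.log1p(np.exp(-yt))

z = np.linspace(-2.0, 2.0, 9)
print(hinge(z))
print(logistic(z) / np.log(2))    # rescaled so that it passes through (0, 1)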

For comparison with other error functions, we can divide by ln(2) so that the error function passes through the point (0, 1). This rescaled error function is also plotted in Figure 7.5 and we see that it has a similar form to the support vector error function. The key difference is that the flat region in E_SV(yt) leads to sparse solutions.

Both the logistic error and the hinge loss can be viewed as continuous approximations to the misclassification error. Another continuous error function that has sometimes been used to solve classification problems is the squared error, which is again plotted in Figure 7.5. It has the property, however, of placing increasing emphasis on data points that are correctly classified but that are a long way from the decision boundary on the correct side. Such points will be strongly weighted at the expense of misclassified points, and so if the objective is to minimize the misclassification rate, then a monotonically decreasing error function would be a better choice.

7.1.3 Multiclass SVMs

The support vector machine is fundamentally a two-class classifier. In practice, however, we often have to tackle problems involving K > 2 classes. Various methods have therefore been proposed for combining multiple two-class SVMs in order to build a multiclass classifier.

One commonly used approach (Vapnik, 1998) is to construct K separate SVMs, in which the kth model y_k(x) is trained using the data from class C_k as the positive examples and the data from the remaining K − 1 classes as the negative examples. This is known as the one-versus-the-rest approach. However, in Figure 4.2 we saw that using the decisions of the individual classifiers can lead to inconsistent results in which an input is assigned to multiple classes simultaneously. This problem is sometimes addressed by making predictions for new inputs x using

y(x) = max_k y_k(x).    (7.49)

Unfortunately, this heuristic approach suffers from the problem that the different classifiers were trained on different tasks, and there is no guarantee that the real-valued quantities y_k(x) for different classifiers will have appropriate scales.

Another problem with the one-versus-the-rest approach is that the training sets are imbalanced. For instance, if we have ten classes each with equal numbers of training data points, then the individual classifiers are trained on data sets comprising 90% negative examples and only 10% positive examples, and the symmetry of the original problem is lost. A variant of the one-versus-the-rest scheme was proposed by Lee et al. (2001) who modify the target values so that the positive class has target +1 and the negative class has target −1/(K − 1).

Weston and Watkins (1999) define a single objective function for training all K SVMs simultaneously, based on maximizing the margin from each to remaining classes. However, this can result in much slower training because, instead of solving K separate optimization problems each over N data points with an overall cost of O(K N^2), a single optimization problem of size (K − 1)N must be solved giving an overall cost of O(K^2 N^2).
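The heuristic decision rule (7.49) amounts to an argmax over the K real-valued outputs. The sketch below assumes `scorers` is a list of K trained two-class decision functions y_k(x), a placeholder of ours; as noted above, their scales need not be comparable, which is precisely the weakness of this rule.

import numpy as np

def one_versus_rest_predict(scorers, x):
    # Evaluate every y_k(x) and assign x to the class with the largest score.
    scores = np.array([y_k(x) for y_k in scorers])
    return int(np.argmax(scores))     # heuristic: scores may not share a scale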

Another approach is to train K(K − 1)/2 different 2-class SVMs on all possible pairs of classes, and then to classify test points according to which class has the highest number of 'votes', an approach that is sometimes called one-versus-one. Again, we saw in Figure 4.2 that this can lead to ambiguities in the resulting classification. Also, for large K this approach requires significantly more training time than the one-versus-the-rest approach. Similarly, to evaluate test points, significantly more computation is required.

The latter problem can be alleviated by organizing the pairwise classifiers into a directed acyclic graph (not to be confused with a probabilistic graphical model) leading to the DAGSVM (Platt et al., 2000). For K classes, the DAGSVM has a total of K(K − 1)/2 classifiers, and to classify a new test point only K − 1 pairwise classifiers need to be evaluated, with the particular classifiers used depending on which path through the graph is traversed.

A different approach to multiclass classification, based on error-correcting output codes, was developed by Dietterich and Bakiri (1995) and applied to support vector machines by Allwein et al. (2000). This can be viewed as a generalization of the voting scheme of the one-versus-one approach in which more general partitions of the classes are used to train the individual classifiers. The K classes themselves are represented as particular sets of responses from the two-class classifiers chosen, and together with a suitable decoding scheme, this gives robustness to errors and to ambiguity in the outputs of the individual classifiers. Although the application of SVMs to multiclass classification problems remains an open issue, in practice the one-versus-the-rest approach is the most widely used in spite of its ad-hoc formulation and its practical limitations.

There are also single-class support vector machines, which solve an unsupervised learning problem related to probability density estimation. Instead of modelling the density of data, however, these methods aim to find a smooth boundary enclosing a region of high density. The boundary is chosen to represent a quantile of the density, that is, the probability that a data point drawn from the distribution will land inside that region is given by a fixed number between 0 and 1 that is specified in advance. This is a more restricted problem than estimating the full density but may be sufficient in specific applications. Two approaches to this problem using support vector machines have been proposed. The algorithm of Schölkopf et al. (2001) tries to find a hyperplane that separates all but a fixed fraction ν of the training data from the origin while at the same time maximizing the distance (margin) of the hyperplane from the origin, while Tax and Duin (1999) look for the smallest sphere in feature space that contains all but a fraction ν of the data points. For kernels k(x, x′) that are functions only of x − x′, the two algorithms are equivalent.

7.1.4 SVMs for regression

We now extend support vector machines to regression problems while at the same time preserving the property of sparseness.

Figure 7.6 Plot of an ε-insensitive error function (in red) in which the error increases linearly with distance beyond the insensitive region. Also shown for comparison is the quadratic error function (in green).

In simple linear regression (Section 3.1.4), we minimize a regularized error function given by

(1/2) Σ_{n=1}^{N} { y_n − t_n }^2 + (λ/2)‖w‖^2.    (7.50)

To obtain sparse solutions, the quadratic error function is replaced by an ε-insensitive error function (Vapnik, 1995), which gives zero error if the absolute difference between the prediction y(x) and the target t is less than ε where ε > 0. A simple example of an ε-insensitive error function, having a linear cost associated with errors outside the insensitive region, is given by

E_ε(y(x) − t) = { 0, if |y(x) − t| < ε; |y(x) − t| − ε, otherwise    (7.51)

and is illustrated in Figure 7.6. We therefore minimize a regularized error function given by

C Σ_{n=1}^{N} E_ε(y(x_n) − t_n) + (1/2)‖w‖^2    (7.52)

where y(x) is given by (7.1). By convention the (inverse) regularization parameter, denoted C, appears in front of the error term.

As before, we can re-express the optimization problem by introducing slack variables. For each data point x_n, we now need two slack variables ξ_n ≥ 0 and ξ̂_n ≥ 0, where ξ_n > 0 corresponds to a point for which t_n > y(x_n) + ε, and ξ̂_n > 0 corresponds to a point for which t_n < y(x_n) − ε. The condition for a target point to lie inside the ε-tube is that y(x_n) − ε ≤ t_n ≤ y(x_n) + ε. Introducing the slack variables allows points to lie outside the tube provided the slack variables are nonzero, and the corresponding conditions are

t_n ≤ y(x_n) + ε + ξ_n    (7.53)

t_n ≥ y(x_n) − ε − ξ̂_n.    (7.54)
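The ε-insensitive error (7.51) in code is simply a shifted absolute value clipped at zero:

import numpy as np

def eps_insensitive(residual, eps):
    # E_eps of (7.51): zero inside the tube, linear outside it.
    return np.maximum(0.0, np.abs(residual) - eps)

print(eps_insensitive(np.array([-0.3, 0.05, 0.8]), eps=0.1))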

Figure 7.7 Illustration of SVM regression, showing the regression curve y(x) together with the ε-insensitive 'tube'. Also shown are examples of the slack variables ξ and ξ̂. Points above the ε-tube have ξ > 0 and ξ̂ = 0, points below the ε-tube have ξ = 0 and ξ̂ > 0, and points inside the ε-tube have ξ = ξ̂ = 0.

The error function for support vector regression can then be written as

C Σ_{n=1}^{N} (ξ_n + ξ̂_n) + (1/2)‖w‖^2    (7.55)

which must be minimized subject to the constraints ξ_n ≥ 0 and ξ̂_n ≥ 0 as well as (7.53) and (7.54). This can be achieved by introducing Lagrange multipliers a_n ≥ 0, â_n ≥ 0, μ_n ≥ 0, and μ̂_n ≥ 0, and optimizing the Lagrangian

L = C Σ_{n=1}^{N} (ξ_n + ξ̂_n) + (1/2)‖w‖^2 − Σ_{n=1}^{N} (μ_n ξ_n + μ̂_n ξ̂_n)
    − Σ_{n=1}^{N} a_n (ε + ξ_n + y_n − t_n) − Σ_{n=1}^{N} â_n (ε + ξ̂_n − y_n + t_n).    (7.56)

We now substitute for y(x) using (7.1) and then set the derivatives of the Lagrangian with respect to w, b, ξ_n, and ξ̂_n to zero, giving

∂L/∂w = 0 ⇒ w = Σ_{n=1}^{N} (a_n − â_n) φ(x_n)    (7.57)

∂L/∂b = 0 ⇒ Σ_{n=1}^{N} (a_n − â_n) = 0    (7.58)

∂L/∂ξ_n = 0 ⇒ a_n + μ_n = C    (7.59)

∂L/∂ξ̂_n = 0 ⇒ â_n + μ̂_n = C.    (7.60)

Using these results to eliminate the corresponding variables from the Lagrangian, we see that the dual problem involves maximizing (Exercise 7.7)

L̃(a, â) = −(1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} (a_n − â_n)(a_m − â_m) k(x_n, x_m)
    − ε Σ_{n=1}^{N} (a_n + â_n) + Σ_{n=1}^{N} (a_n − â_n) t_n    (7.61)

with respect to {a_n} and {â_n}, where we have introduced the kernel k(x, x′) = φ(x)^T φ(x′). Again, this is a constrained maximization, and to find the constraints we note that a_n ≥ 0 and â_n ≥ 0 are both required because these are Lagrange multipliers. Also μ_n ≥ 0 and μ̂_n ≥ 0 together with (7.59) and (7.60), require a_n ≤ C and â_n ≤ C, and so again we have the box constraints

0 ≤ a_n ≤ C    (7.62)

0 ≤ â_n ≤ C    (7.63)

together with the condition (7.58).

Substituting (7.57) into (7.1), we see that predictions for new inputs can be made using

y(x) = Σ_{n=1}^{N} (a_n − â_n) k(x, x_n) + b    (7.64)

which is again expressed in terms of the kernel function.

The corresponding Karush-Kuhn-Tucker (KKT) conditions, which state that at the solution the product of the dual variables and the constraints must vanish, are given by

a_n (ε + ξ_n + y_n − t_n) = 0    (7.65)

â_n (ε + ξ̂_n − y_n + t_n) = 0    (7.66)

(C − a_n) ξ_n = 0    (7.67)

(C − â_n) ξ̂_n = 0.    (7.68)

From these we can obtain several useful results. First of all, we note that a coefficient a_n can only be nonzero if ε + ξ_n + y_n − t_n = 0, which implies that the data point either lies on the upper boundary of the ε-tube (ξ_n = 0) or lies above the upper boundary (ξ_n > 0). Similarly, a nonzero value for â_n implies ε + ξ̂_n − y_n + t_n = 0, and such points must lie either on or below the lower boundary of the ε-tube.

Furthermore, the two constraints ε + ξ_n + y_n − t_n = 0 and ε + ξ̂_n − y_n + t_n = 0 are incompatible, as is easily seen by adding them together and noting that ξ_n and ξ̂_n are nonnegative while ε is strictly positive, and so for every data point x_n, either a_n or â_n (or both) must be zero.
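Given dual coefficients from a solved quadratic program, the predictive model (7.64) can be sketched as follows; the coefficients a and a_hat and the bias b are placeholders, and the tolerance used to pick out the nonzero terms is a heuristic of ours.

import numpy as np

def svr_predict(a, a_hat, b, X, kernel, x_new, tol=1e-8):
    coeff = a - a_hat                            # only support vectors are nonzero
    sv = np.where(np.abs(coeff) > tol)[0]
    return sum(coeff[n] * kernel(x_new, X[n]) for n in sv) + b   # (7.64)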

The support vectors are those data points that contribute to predictions given by (7.64), in other words those for which either a_n ≠ 0 or â_n ≠ 0. These are points that lie on the boundary of the ε-tube or outside the tube. All points within the tube have a_n = â_n = 0. We again have a sparse solution, and the only terms that have to be evaluated in the predictive model (7.64) are those that involve the support vectors.

The parameter b can be found by considering a data point for which 0 < a_n < C, which from (7.67) must have ξ_n = 0, and from (7.65) must therefore satisfy ε + y_n − t_n = 0. Using (7.1) and solving for b, we obtain

b = t_n − ε − w^T φ(x_n) = t_n − ε − Σ_{m=1}^{N} (a_m − â_m) k(x_n, x_m)    (7.69)

where we have used (7.57). An analogous result can be obtained by considering a point for which 0 < â_n < C.

As with classification, there is an alternative formulation of the SVM for regression in which the parameter governing complexity has a more intuitive interpretation (Schölkopf et al., 2000). In particular, instead of fixing the width ε of the insensitive region, we fix instead a parameter ν that bounds the fraction of points lying outside the tube. This involves maximizing

L̃(a, â) = −(1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} (a_n − â_n)(a_m − â_m) k(x_n, x_m) + Σ_{n=1}^{N} (a_n − â_n) t_n    (7.70)

subject to the constraints

0 ≤ a_n ≤ C/N    (7.71)

0 ≤ â_n ≤ C/N    (7.72)

Σ_{n=1}^{N} (a_n − â_n) = 0    (7.73)

Σ_{n=1}^{N} (a_n + â_n) ≤ νC.    (7.74)

It can be shown that there are at most νN data points falling outside the insensitive tube, while at least νN data points are support vectors and so lie either on the tube or outside it. The use of a support vector machine to solve a regression problem is illustrated using the sinusoidal data set in Figure 7.8. In practice, the values of ν and C would typically be determined by cross-validation.

Figure 7.8 Illustration of the ν-SVM for regression applied to the sinusoidal synthetic data set using Gaussian kernels. The predicted regression curve is shown by the red line, and the ε-insensitive tube corresponds to the shaded region. Also, the data points are shown in green, and those corresponding to support vectors are indicated by blue circles.

7.1.5 Computational learning theory

Historically, support vector machines have largely been motivated and analysed using a theoretical framework known as computational learning theory, also sometimes called statistical learning theory (Anthony and Biggs, 1992; Kearns and Vazirani, 1994; Vapnik, 1995; Vapnik, 1998). This has its origins with Valiant (1984) who formulated the probably approximately correct, or PAC, learning framework. The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization. It also gives bounds for the computational cost of learning, although we do not consider these here.

Suppose that a data set D of size N is drawn from some joint distribution p(x, t) where x is the input variable and t represents the class label, and that we restrict attention to 'noise free' situations in which the class labels are determined by some (unknown) deterministic function t = g(x). In PAC learning we say that a function f(x; D), drawn from a space F of such functions on the basis of the training set D, has good generalization if its expected error rate is below some pre-specified threshold ε, so that

E_{x,t}[ I( f(x; D) ≠ t ) ] < ε    (7.75)

where I(·) is the indicator function, and the expectation is with respect to the distribution p(x, t). The quantity on the left-hand side is a random variable, because it depends on the training set D, and the PAC framework requires that (7.75) holds, with probability greater than 1 − δ, for a data set D drawn randomly from p(x, t). Here δ is another pre-specified parameter, and the terminology 'probably approximately correct' comes from the requirement that with high probability (greater than 1 − δ), the error rate be small (less than ε). For a given choice of model space F, and for given parameters ε and δ, PAC learning aims to provide bounds on the minimum size N of data set needed to meet this criterion. A key quantity in PAC learning is the Vapnik-Chervonenkis dimension, or VC dimension, which provides a measure of the complexity of a space of functions, and which allows the PAC framework to be extended to spaces containing an infinite number of functions.

The bounds derived within the PAC framework are often described as worst-case, because they apply to any choice for the distribution p(x, t), so long as both the training and the test examples are drawn (independently) from the same distribution, and for any choice for the function f(x) so long as it belongs to F. In real-world applications of machine learning, we deal with distributions that have significant regularity, for example in which large regions of input space carry the same class label. As a consequence of the lack of any assumptions about the form of the distribution, the PAC bounds are very conservative, in other words they strongly over-estimate the size of data sets required to achieve a given generalization performance. For this reason, PAC bounds have found few, if any, practical applications.

One attempt to improve the tightness of the PAC bounds is the PAC-Bayesian framework (McAllester, 2003), which considers a distribution over the space F of functions, somewhat analogous to the prior in a Bayesian treatment. This still considers any possible choice for p(x, t), and so although the bounds are tighter, they are still very conservative.

7.2. Relevance Vector Machines

Support vector machines have been used in a variety of classification and regression applications. Nevertheless, they suffer from a number of limitations, several of which have been highlighted already in this chapter. In particular, the outputs of an SVM represent decisions rather than posterior probabilities. Also, the SVM was originally formulated for two classes, and the extension to K > 2 classes is problematic. There is a complexity parameter C, or ν (as well as a parameter ε in the case of regression), that must be found using a hold-out method such as cross-validation. Finally, predictions are expressed as linear combinations of kernel functions that are centred on training data points and that are required to be positive definite.

The relevance vector machine or RVM (Tipping, 2001) is a Bayesian sparse kernel technique for regression and classification that shares many of the characteristics of the SVM whilst avoiding its principal limitations. Additionally, it typically leads to much sparser models resulting in correspondingly faster performance on test data whilst maintaining comparable generalization error.

In contrast to the SVM we shall find it more convenient to introduce the regression form of the RVM first and then consider the extension to classification tasks.

7.2.1 RVM for regression

The relevance vector machine for regression is a linear model of the form studied in Chapter 3 but with a modified prior that results in sparse solutions. The model defines a conditional distribution for a real-valued target variable t, given an input vector x, which takes the form

p(t | x, w, β) = N(t | y(x), β^{-1})    (7.76)

where β = σ^{-2} is the noise precision (inverse noise variance), and the mean is given by a linear model of the form

y(x) = Σ_{i=1}^{M} w_i φ_i(x) = w^T φ(x)    (7.77)

with fixed nonlinear basis functions φ_i(x), which will typically include a constant term so that the corresponding weight parameter represents a 'bias'.

The relevance vector machine is a specific instance of this model, which is intended to mirror the structure of the support vector machine. In particular, the basis functions are given by kernels, with one kernel associated with each of the data points from the training set. The general expression (7.77) then takes the SVM-like form

y(x) = Σ_{n=1}^{N} w_n k(x, x_n) + b    (7.78)

where b is a bias parameter. The number of parameters in this case is M = N + 1, and y(x) has the same form as the predictive model (7.64) for the SVM, except that the coefficients a_n are here denoted w_n. It should be emphasized that the subsequent analysis is valid for arbitrary choices of basis function, and for generality we shall work with the form (7.77). In contrast to the SVM, there is no restriction to positive-definite kernels, nor are the basis functions tied in either number or location to the training data points.

Suppose we are given a set of N observations of the input vector x, which we denote collectively by a data matrix X whose nth row is x_n^T with n = 1, ..., N. The corresponding target values are given by t = (t_1, ..., t_N)^T. Thus, the likelihood function is given by

p(t | X, w, β) = Π_{n=1}^{N} p(t_n | x_n, w, β).    (7.79)

Next we introduce a prior distribution over the parameter vector w and, as in Chapter 3, we shall consider a zero-mean Gaussian prior. However, the key difference in the RVM is that we introduce a separate hyperparameter α_i for each of the weight parameters w_i instead of a single shared hyperparameter. Thus the weight prior takes the form

p(w | α) = Π_{i=1}^{M} N(w_i | 0, α_i^{-1})    (7.80)

where α_i represents the precision of the corresponding parameter w_i, and α denotes (α_1, ..., α_M)^T.
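For the SVM-like form (7.78), the design matrix used shortly in the posterior equations has one kernel column per training point plus a constant column for the bias, so that M = N + 1. A minimal construction, with the kernel left as a placeholder:

import numpy as np

def rvm_design_matrix(X, kernel):
    # One basis function per training point, plus a constant 'bias' column.
    N = len(X)
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return np.column_stack([np.ones(N), K])    # shape (N, N + 1)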

367 Using the result (3.49) for linear regression models, we see that the posterior distribution for the weights is again Gaussian and takes the form

p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \boldsymbol{\Sigma}) \qquad (7.81)

where the mean and covariance are given by

\mathbf{m} = \beta \boldsymbol{\Sigma} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \qquad (7.82)

\boldsymbol{\Sigma} = \left( \mathbf{A} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \qquad (7.83)

where Φ is the N × M design matrix with elements Φ_{ni} = φ_i(x_n), and A = diag(α_i). Note that in the specific case of the model (7.78), we have Φ = K, where K is the symmetric (N + 1) × (N + 1) kernel matrix with elements k(x_n, x_m).

The values of α and β are determined using type-2 maximum likelihood, also known as the evidence approximation (Section 3.5), in which we maximize the marginal likelihood function obtained by integrating out the weight parameters

p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \boldsymbol{\alpha}) \, \mathrm{d}\mathbf{w}. \qquad (7.84)

Because this represents the convolution of two Gaussians, it is readily evaluated to give the log marginal likelihood in the form (Exercise 7.10)

\ln p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \beta) = \ln \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C}) = -\tfrac{1}{2} \left\{ N \ln(2\pi) + \ln |\mathbf{C}| + \mathbf{t}^{\mathrm{T}} \mathbf{C}^{-1} \mathbf{t} \right\} \qquad (7.85)

where t = (t_1, ..., t_N)^T, and we have defined the N × N matrix C given by

\mathbf{C} = \beta^{-1} \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}}. \qquad (7.86)

Our goal is now to maximize (7.85) with respect to the hyperparameters α and β. This requires only a small modification to the results obtained in Section 3.5 for the evidence approximation in the linear regression model. Again, we can identify two approaches. In the first, we simply set the required derivatives of the marginal likelihood to zero and obtain the following re-estimation equations (Exercise 7.12)

\alpha_i^{\mathrm{new}} = \frac{\gamma_i}{m_i^2} \qquad (7.87)

(\beta^{\mathrm{new}})^{-1} = \frac{\lVert \mathbf{t} - \boldsymbol{\Phi}\mathbf{m} \rVert^2}{N - \sum_i \gamma_i} \qquad (7.88)

where m_i is the i-th component of the posterior mean m defined by (7.82).
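A minimal sketch of the posterior computation (7.81)–(7.83), reusing numpy from the earlier block:

```python
def posterior(Phi, t, alpha, beta):
    # Posterior over the weights, (7.81)-(7.83), with A = diag(alpha_i).
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # (7.83)
    m = beta * Sigma @ Phi.T @ t                    # (7.82)
    return m, Sigma
```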

368 The quantity γ_i measures how well the corresponding parameter w_i is determined by the data (Section 3.5.3) and is defined by

\gamma_i = 1 - \alpha_i \Sigma_{ii} \qquad (7.89)

in which Σ_{ii} is the i-th diagonal component of the posterior covariance Σ given by (7.83). Learning therefore proceeds by choosing initial values for α and β, evaluating the mean and covariance of the posterior using (7.82) and (7.83), respectively, and then alternately re-estimating the hyperparameters, using (7.87) and (7.88), and re-estimating the posterior mean and covariance, using (7.82) and (7.83), until a suitable convergence criterion is satisfied.

The second approach is to use the EM algorithm, and is discussed in Section 9.3.4. These two approaches to finding the values of the hyperparameters that maximize the evidence are formally equivalent (Exercise 9.23). Numerically, however, it is found that the direct optimization approach corresponding to (7.87) and (7.88) gives somewhat faster convergence (Tipping, 2001).

As a result of the optimization, we find that a proportion of the hyperparameters {α_i} are driven to large (in principle infinite) values (Section 7.2.2), and so the weight parameters w_i corresponding to these hyperparameters have posterior distributions with mean and variance both zero. Thus those parameters, and the corresponding basis functions φ_i(x), are removed from the model and play no role in making predictions for new inputs. In the case of models of the form (7.78), the inputs x_n corresponding to the remaining nonzero weights are called relevance vectors, because they are identified through the mechanism of automatic relevance determination, and are analogous to the support vectors of an SVM. It is worth emphasizing, however, that this mechanism for achieving sparsity in probabilistic models through automatic relevance determination is quite general and can be applied to any model expressed as an adaptive linear combination of basis functions.

Having found values α* and β* for the hyperparameters that maximize the marginal likelihood, we can evaluate the predictive distribution over t for a new input x. Using (7.76) and (7.81), this is given by (Exercise 7.14)

p(t \mid \mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^{\star}, \beta^{\star}) = \int p(t \mid \mathbf{x}, \mathbf{w}, \beta^{\star}) \, p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^{\star}, \beta^{\star}) \, \mathrm{d}\mathbf{w} = \mathcal{N}\!\left( t \mid \mathbf{m}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \sigma^2(\mathbf{x}) \right). \qquad (7.90)

Thus the predictive mean is given by (7.76) with w set equal to the posterior mean m, and the variance of the predictive distribution is given by

\sigma^2(\mathbf{x}) = (\beta^{\star})^{-1} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \boldsymbol{\Sigma} \boldsymbol{\phi}(\mathbf{x}) \qquad (7.91)

where Σ is given by (7.83) in which α and β are set to their optimized values α* and β*. This is just the familiar result (3.59) obtained in the context of linear regression. Recall that for localized basis functions, the predictive variance for linear regression models becomes small in regions of input space where there are no basis functions. In the case of an RVM with the basis functions centred on data points, the model will therefore become increasingly certain of its predictions when extrapolating outside the domain of the data (Rasmussen and Quiñonero-Candela, 2005), which of course is undesirable. The predictive distribution in Gaussian process regression (Section 6.4.2) does not suffer from this problem.
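The alternating re-estimation scheme and the predictive equations might be sketched as follows; the initializations, the convergence test, and the finite cap standing in for infinite α are practical assumptions, not part of the text.

```python
def fit_rvm(Phi, t, n_iter=500, tol=1e-6, alpha_cap=1e9):
    # Alternate (7.82)-(7.83) with the re-estimates (7.87)-(7.89).
    # 'alpha_cap' is a stand-in for infinity: basis functions whose
    # alpha reaches it are effectively pruned.
    N, M = Phi.shape
    alpha, beta = np.ones(M), 1.0
    for _ in range(n_iter):
        m, Sigma = posterior(Phi, t, alpha, beta)
        gamma = 1.0 - alpha * np.diag(Sigma)                   # (7.89)
        alpha_new = np.minimum(gamma / (m ** 2), alpha_cap)    # (7.87)
        beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)  # (7.88)
        converged = np.max(np.abs(np.log(alpha_new) - np.log(alpha))) < tol
        alpha = alpha_new
        if converged:
            break
    m, Sigma = posterior(Phi, t, alpha, beta)
    return alpha, beta, m, Sigma

def predict(phi_x, m, Sigma, beta):
    # Predictive mean and variance for a new input, (7.90)-(7.91).
    return m @ phi_x, 1.0 / beta + phi_x @ Sigma @ phi_x
```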

369 Figure 7.9  Illustration of RVM regression using the same data set, and the same Gaussian kernel functions, as used in Figure 7.8 for the ν-SVM regression model. The mean of the predictive distribution for the RVM is shown by the red line, and the one-standard-deviation predictive distribution is shown by the shaded region. Also, the data points are shown in green, and the relevance vectors are indicated by blue circles. Note that there are only 3 relevance vectors compared to 7 support vectors for the ν-SVM in Figure 7.8.

However, the computational cost of making predictions with a Gaussian process is typically much higher than with an RVM.

Figure 7.9 shows an example of the RVM applied to the sinusoidal regression data set. Here the noise precision parameter β is also determined through evidence maximization. We see that the number of relevance vectors in the RVM is significantly smaller than the number of support vectors used by the SVM. For a wide range of regression and classification tasks, the RVM is found to give models that are typically an order of magnitude more compact than the corresponding support vector machine, resulting in a significant improvement in the speed of processing on test data. Remarkably, this greater sparsity is achieved with little or no reduction in generalization error compared with the corresponding SVM.

The principal disadvantage of the RVM compared to the SVM is that training involves optimizing a nonconvex function, and training times can be longer than for a comparable SVM. For a model with M basis functions, the RVM requires inversion of a matrix of size M × M, which in general requires O(M³) computation. In the specific case of the SVM-like model (7.78), we have M = N + 1. As we have noted, there are techniques for training SVMs whose cost is roughly quadratic in N. Of course, in the case of the RVM we always have the option of starting with a smaller number of basis functions than N + 1. More significantly, in the relevance vector machine the parameters governing complexity and noise variance are determined automatically from a single training run, whereas in the support vector machine the parameters C and ε (or ν) are generally found using cross-validation, which involves multiple training runs. Furthermore, in the next section we shall derive an alternative procedure for training the relevance vector machine that improves training speed significantly.

7.2.2 Analysis of sparsity

We have noted earlier that the mechanism of automatic relevance determination causes a subset of parameters to be driven to zero. We now examine in more detail the mechanism of sparsity in the context of the relevance vector machine. In the process, we will arrive at a significantly faster procedure for optimizing the hyperparameters compared to the direct techniques given above.

370 Figure 7.10  Illustration of the mechanism for sparsity in a Bayesian linear regression model, showing a training set vector of target values given by t = (t_1, t_2)^T, indicated by the cross, for a model with one basis vector φ = (φ(x_1), φ(x_2))^T, which is poorly aligned with the target data vector t. On the left we see a model having only isotropic noise, so that C = β^{-1} I, corresponding to α = ∞, with β set to its most probable value. On the right we see the same model but with a finite value of α. In each case the red ellipse corresponds to unit Mahalanobis distance, with |C| taking the same value for both plots, while the dashed green circle shows the contribution arising from the noise term β^{-1}. We see that any finite value of α reduces the probability of the observed data, and so for the most probable solution the basis vector is removed.

Before proceeding with a mathematical analysis, we first give some informal insight into the origin of sparsity in Bayesian linear models. Consider a data set comprising N = 2 observations t_1 and t_2, together with a model having a single basis function φ(x), with hyperparameter α, along with isotropic noise having precision β. From (7.85), the marginal likelihood is given by p(t | α, β) = N(t | 0, C) in which the covariance matrix takes the form

\mathbf{C} = \frac{1}{\alpha} \boldsymbol{\varphi} \boldsymbol{\varphi}^{\mathrm{T}} + \frac{1}{\beta} \mathbf{I} \qquad (7.92)

where φ denotes the N-dimensional vector (φ(x_1), φ(x_2))^T, and similarly t = (t_1, t_2)^T. Notice that this is just a zero-mean Gaussian process model over t with covariance C. Given a particular observation for t, our goal is to find α* and β* by maximizing the marginal likelihood. We see from Figure 7.10 that, if there is a poor alignment between the direction of φ and that of the training data vector t, then the corresponding hyperparameter α will be driven to ∞, and the basis vector will be pruned from the model. This arises because any finite value for α will always assign a lower probability to the data, thereby decreasing the value of the density at t, provided that β is set to its optimal value. We see that any finite value for α would cause the distribution to be elongated in a direction away from the data, thereby increasing the probability mass in regions away from the observed data and hence reducing the value of the density at the target data vector itself. For the more general case of M basis vectors φ_1, ..., φ_M, a similar intuition holds, namely that if a particular basis vector is poorly aligned with the data vector t, then it is likely to be pruned from the model.

371 We now investigate the mechanism for sparsity from a more mathematical perspective, for a general case involving M basis functions. To motivate this analysis we first note that, in the result (7.87) for re-estimating the parameter α_i, the terms on the right-hand side are themselves also functions of α_i. These results therefore represent implicit solutions, and iteration would be required even to determine a single α_i with all other α_j for j ≠ i fixed.

This suggests a different approach to solving the optimization problem for the RVM, in which we make explicit all of the dependence of the marginal likelihood (7.85) on a particular α_i and then determine its stationary points explicitly (Faul and Tipping, 2002; Tipping and Faul, 2003). To do this, we first pull out the contribution from α_i in the matrix C defined by (7.86) to give

\mathbf{C} = \beta^{-1} \mathbf{I} + \sum_{j \neq i} \alpha_j^{-1} \boldsymbol{\varphi}_j \boldsymbol{\varphi}_j^{\mathrm{T}} + \alpha_i^{-1} \boldsymbol{\varphi}_i \boldsymbol{\varphi}_i^{\mathrm{T}} = \mathbf{C}_{-i} + \alpha_i^{-1} \boldsymbol{\varphi}_i \boldsymbol{\varphi}_i^{\mathrm{T}} \qquad (7.93)

where φ_i denotes the i-th column of Φ, in other words the N-dimensional vector with elements (φ_i(x_1), ..., φ_i(x_N)), in contrast to φ_n, which denotes the n-th row of Φ. The matrix C_{-i} represents the matrix C with the contribution from basis function i removed. Using the matrix identities (C.7) and (C.15), the determinant and inverse of C can then be written

|\mathbf{C}| = |\mathbf{C}_{-i}| \, \left| 1 + \alpha_i^{-1} \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{C}_{-i}^{-1} \boldsymbol{\varphi}_i \right| \qquad (7.94)

\mathbf{C}^{-1} = \mathbf{C}_{-i}^{-1} - \frac{\mathbf{C}_{-i}^{-1} \boldsymbol{\varphi}_i \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{C}_{-i}^{-1}}{\alpha_i + \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{C}_{-i}^{-1} \boldsymbol{\varphi}_i}. \qquad (7.95)

Using these results, we can then write the log marginal likelihood function (7.85) in the form (Exercise 7.15)

L(\boldsymbol{\alpha}) = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i) \qquad (7.96)

where L(α_{-i}) is simply the log marginal likelihood with basis function φ_i omitted, and the quantity λ(α_i) is defined by

\lambda(\alpha_i) = \frac{1}{2} \left[ \ln \alpha_i - \ln(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i} \right] \qquad (7.97)

and contains all of the dependence on α_i. Here we have introduced the two quantities

s_i = \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{C}_{-i}^{-1} \boldsymbol{\varphi}_i \qquad (7.98)

q_i = \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{C}_{-i}^{-1} \mathbf{t}. \qquad (7.99)

Here s_i is called the sparsity and q_i is known as the quality of φ_i, and as we shall see, a large value of s_i relative to the value of q_i means that the basis function φ_i is more likely to be pruned from the model.
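A direct (and deliberately naive) evaluation of the sparsity and quality factors, continuing the earlier sketches; the use of a finite or infinite alpha to mark pruned functions is an implementation assumption:

```python
def sparsity_quality(Phi, t, alpha, beta, i):
    # s_i (7.98) and q_i (7.99), computed from C_{-i}, i.e. the
    # covariance (7.86) with basis function i removed. This costs
    # O(N^3) per call; the quantities (7.102)-(7.103) introduced
    # later avoid that expense.
    N, M = Phi.shape
    mask = np.arange(M) != i
    C_minus = np.eye(N) / beta + (Phi[:, mask] / alpha[mask]) @ Phi[:, mask].T
    u = np.linalg.solve(C_minus, Phi[:, i])   # C_{-i}^{-1} phi_i
    return Phi[:, i] @ u, t @ u               # s_i, q_i
```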

372 Figure 7.11  Plots of the log marginal likelihood λ(α_i) versus ln α_i showing, on the left, the single maximum at a finite α_i for q_i² = 4 and s_i = 1 (so that q_i² > s_i) and, on the right, the maximum at α_i = ∞ for q_i² = 1 and s_i = 2 (so that q_i² < s_i).

The stationary points of λ(α_i) occur when its derivative

\frac{\mathrm{d}\lambda(\alpha_i)}{\mathrm{d}\alpha_i} = \frac{\alpha_i^{-1} s_i^2 - (q_i^2 - s_i)}{2(\alpha_i + s_i)^2} \qquad (7.100)

is equal to zero. Recalling that α_i ≥ 0, there are two possible forms for the solution. If q_i² < s_i, then α_i → ∞ provides a solution. Conversely, if q_i² > s_i, we can solve for α_i to obtain

\alpha_i = \frac{s_i^2}{q_i^2 - s_i}. \qquad (7.101)

These two cases are illustrated in Figure 7.11, and we see that the relative size of the quality and sparsity terms determines whether a particular basis vector will be pruned from the model or not (Exercise 7.16).

This analysis gives insight into the origin of sparsity in the RVM, and it also leads to a practical algorithm for optimizing the hyperparameters that has significant speed advantages, because each α_i can be fully optimized in closed form given the values of the remaining hyperparameters. The resulting sequential sparse learning algorithm (Tipping and Faul, 2003) maintains a set of active basis vectors and cycles through the candidate basis functions, adding, updating, or deleting them one at a time:

1. If solving a regression problem, initialize β.
2. Initialize using one basis function φ_1, with hyperparameter α_1 set using (7.101), and with the remaining hyperparameters α_j for j ≠ 1 set to infinity, so that only φ_1 is included in the model.

373 3. Evaluate Σ and m, along with q_i and s_i, for all basis functions.
4. Select a candidate basis function φ_i.
5. If q_i² > s_i, and α_i < ∞, so that the basis vector φ_i is already included in the model, then update α_i using (7.101).
6. If q_i² > s_i, and α_i = ∞, then add φ_i to the model, and evaluate hyperparameter α_i using (7.101).
7. If q_i² ≤ s_i, and α_i < ∞, then remove basis function φ_i from the model, and set α_i = ∞.
8. If solving a regression problem, update β.
9. If converged terminate, otherwise go to 3.

Note that if q_i² ≤ s_i and α_i = ∞, then the basis function φ_i is already excluded from the model and no action is required.

In practice, it is convenient to evaluate the quantities

Q_i = \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{C}^{-1} \mathbf{t} \qquad (7.102)

S_i = \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{C}^{-1} \boldsymbol{\varphi}_i. \qquad (7.103)

The quality and sparseness variables can then be expressed in the form

q_i = \frac{\alpha_i Q_i}{\alpha_i - S_i} \qquad (7.104)

s_i = \frac{\alpha_i S_i}{\alpha_i - S_i}. \qquad (7.105)

Note that when α_i = ∞, we have q_i = Q_i and s_i = S_i. Using (C.7), we can write (Exercise 7.17)

Q_i = \beta \boldsymbol{\varphi}_i^{\mathrm{T}} \mathbf{t} - \beta^2 \boldsymbol{\varphi}_i^{\mathrm{T}} \boldsymbol{\Phi} \boldsymbol{\Sigma} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \qquad (7.106)

S_i = \beta \boldsymbol{\varphi}_i^{\mathrm{T}} \boldsymbol{\varphi}_i - \beta^2 \boldsymbol{\varphi}_i^{\mathrm{T}} \boldsymbol{\Phi} \boldsymbol{\Sigma} \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\varphi}_i \qquad (7.107)

where Φ and Σ involve only those basis vectors that correspond to finite hyperparameters α_i. At each stage the required computations therefore scale like O(M³), where M is the number of active basis vectors in the model and is typically much smaller than the number N of training patterns.
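A sketch of the cheap evaluation of these quantities; Phi_act and Sigma_act denote the design matrix and posterior covariance restricted to the active (finite-alpha) basis vectors, names introduced here for illustration:

```python
def Q_and_S(phi_i, Phi_act, Sigma_act, t, beta):
    # Q_i and S_i via (7.106)-(7.107), using only active basis vectors.
    u = Phi_act.T @ phi_i
    v = Phi_act.T @ t
    Q = beta * (phi_i @ t) - beta ** 2 * (u @ Sigma_act @ v)
    S = beta * (phi_i @ phi_i) - beta ** 2 * (u @ Sigma_act @ u)
    return Q, S

def q_and_s(Q, S, alpha_i):
    # Quality and sparseness from (7.104)-(7.105); alpha_i = np.inf
    # recovers q_i = Q_i and s_i = S_i.
    if np.isinf(alpha_i):
        return Q, S
    return alpha_i * Q / (alpha_i - S), alpha_i * S / (alpha_i - S)
```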

374 7.2.3 RVM for classification

We can extend the relevance vector machine framework to classification problems by applying the ARD prior over weights to a probabilistic linear classification model of the kind studied in Chapter 4. To start with, we consider two-class problems with a binary target variable t ∈ {0, 1}. The model now takes the form of a linear combination of basis functions transformed by a logistic sigmoid function

y(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \right) \qquad (7.108)

where σ(·) is the logistic sigmoid function defined by (4.59). If we introduce a Gaussian prior over the weight vector w, then we obtain the model that has been considered already in Chapter 4. The difference here is that in the RVM, this model uses the ARD prior (7.80) in which there is a separate precision hyperparameter associated with each weight parameter.

In contrast to the regression model, we can no longer integrate analytically over the parameter vector w. Here we follow Tipping (2001) and use the Laplace approximation (Section 4.4), which was applied to the closely related problem of Bayesian logistic regression in Section 4.5.1.

We begin by initializing the hyperparameter vector α. For this given value of α, we then build a Gaussian approximation to the posterior distribution and thereby obtain an approximation to the marginal likelihood. Maximization of this approximate marginal likelihood then leads to a re-estimated value for α, and the process is repeated until convergence.

Let us consider the Laplace approximation for this model in more detail. For a fixed value of α, the mode of the posterior distribution over w is obtained by maximizing

\ln p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}) = \ln \left\{ p(\mathbf{t} \mid \mathbf{w}) \, p(\mathbf{w} \mid \boldsymbol{\alpha}) \right\} - \ln p(\mathbf{t} \mid \boldsymbol{\alpha}) = \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} - \tfrac{1}{2} \mathbf{w}^{\mathrm{T}} \mathbf{A} \mathbf{w} + \text{const} \qquad (7.109)

where A = diag(α_i). This can be done using iterative reweighted least squares (IRLS) as discussed in Section 4.3.3. For this, we need the gradient vector and Hessian matrix of the log posterior distribution, which from (7.109) are given by (Exercise 7.18)

\nabla \ln p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}) = \boldsymbol{\Phi}^{\mathrm{T}} (\mathbf{t} - \mathbf{y}) - \mathbf{A} \mathbf{w} \qquad (7.110)

\nabla\nabla \ln p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}) = -\left( \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{B} \boldsymbol{\Phi} + \mathbf{A} \right) \qquad (7.111)

where B is an N × N diagonal matrix with elements b_n = y_n(1 − y_n), the vector y = (y_1, ..., y_N)^T, and Φ is the design matrix with elements Φ_{ni} = φ_i(x_n). Here we have used the property (4.88) for the derivative of the logistic sigmoid function. At convergence of the IRLS algorithm, the negative Hessian represents the inverse covariance matrix for the Gaussian approximation to the posterior distribution.

The mode of the resulting approximation to the posterior distribution, corresponding to the mean of the Gaussian approximation, is obtained by setting (7.110) to zero, giving the mean and covariance of the Laplace approximation in the form

\mathbf{w}^{\star} = \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}} (\mathbf{t} - \mathbf{y}) \qquad (7.112)

\boldsymbol{\Sigma} = \left( \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{B} \boldsymbol{\Phi} + \mathbf{A} \right)^{-1}. \qquad (7.113)
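A minimal Newton/IRLS sketch of this mode-finding step, under the assumption of a fixed iteration count rather than a formal convergence test:

```python
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_mode(Phi, t, alpha, n_iter=25):
    # Ascend the log posterior (7.109) with Newton steps built from
    # the gradient (7.110) and Hessian (7.111); return the Laplace
    # mean and the covariance (7.113).
    A = np.diag(alpha)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (t - y) - A @ w                       # (7.110)
        H = Phi.T @ np.diag(y * (1 - y)) @ Phi + A           # minus the Hessian
        w = w + np.linalg.solve(H, grad)
    y = sigmoid(Phi @ w)
    Sigma = np.linalg.inv(Phi.T @ np.diag(y * (1 - y)) @ Phi + A)  # (7.113)
    return w, Sigma
```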

375 We can now use this Laplace approximation to evaluate the marginal likelihood. Using the general result (4.135) for an integral evaluated using the Laplace approximation, we have

p(\mathbf{t} \mid \boldsymbol{\alpha}) = \int p(\mathbf{t} \mid \mathbf{w}) \, p(\mathbf{w} \mid \boldsymbol{\alpha}) \, \mathrm{d}\mathbf{w} \simeq p(\mathbf{t} \mid \mathbf{w}^{\star}) \, p(\mathbf{w}^{\star} \mid \boldsymbol{\alpha}) \, (2\pi)^{M/2} |\boldsymbol{\Sigma}|^{1/2}. \qquad (7.114)

If we substitute for p(t | w*) and p(w* | α) and then set the derivative of the marginal likelihood with respect to α_i equal to zero, we obtain (Exercise 7.19)

-\tfrac{1}{2} (w_i^{\star})^2 + \frac{1}{2\alpha_i} - \tfrac{1}{2} \Sigma_{ii} = 0. \qquad (7.115)

Defining γ_i = 1 − α_i Σ_{ii} and rearranging then gives

\alpha_i^{\mathrm{new}} = \frac{\gamma_i}{(w_i^{\star})^2} \qquad (7.116)

which is identical to the re-estimation formula (7.87) obtained for the regression RVM.

If we define

\widehat{\mathbf{t}} = \boldsymbol{\Phi} \mathbf{w}^{\star} + \mathbf{B}^{-1} (\mathbf{t} - \mathbf{y}) \qquad (7.117)

we can write the approximate log marginal likelihood in the form

\ln p(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = -\tfrac{1}{2} \left\{ N \ln(2\pi) + \ln |\mathbf{C}| + \widehat{\mathbf{t}}^{\mathrm{T}} \mathbf{C}^{-1} \widehat{\mathbf{t}} \right\} \qquad (7.118)

where

\mathbf{C} = \mathbf{B}^{-1} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}}. \qquad (7.119)

This takes the same form as (7.85) in the regression case, and so we can apply the same analysis of sparsity and obtain the same fast learning algorithm in which we fully optimize a single hyperparameter α_i at each step.

Figure 7.12 shows the relevance vector machine applied to a synthetic classification data set (Appendix A). We see that the relevance vectors tend not to lie in the region of the decision boundary, in contrast to the support vector machine. This is consistent with our earlier discussion of sparsity in the RVM, because a basis function φ_i(x) centred on a data point near the boundary will have a vector φ_i that is poorly aligned with the training data vector t.

One of the potential advantages of the relevance vector machine compared with the SVM is that it makes probabilistic predictions. For example, this allows the RVM to be used to help construct an emission density in a nonlinear extension of the linear dynamical system (Section 13.3) for tracking faces in video sequences (Williams et al., 2005).

So far, we have considered the RVM for binary classification problems. For K > 2 classes, we again make use of the probabilistic approach in Section 4.3.4 in which there are K linear models of the form

a_k = \mathbf{w}_k^{\mathrm{T}} \mathbf{x} \qquad (7.120)

376 Figure 7.12  Example of the relevance vector machine applied to a synthetic data set, in which the left-hand plot shows the decision boundary and the data points, with the relevance vectors indicated by circles. Comparison with the results shown in Figure 7.4 for the corresponding support vector machine shows that the RVM gives a much sparser model. The right-hand plot shows the posterior probability given by the RVM output in which the proportion of red (blue) ink indicates the probability of that point belonging to the red (blue) class.

which are combined using a softmax function to give outputs

y_k(\mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}. \qquad (7.121)

The likelihood function is then given by

p(\mathbf{T} \mid \mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}} \qquad (7.122)

where the target values t_{nk} have a 1-of-K coding for each data point n, and T is a matrix with elements t_{nk}. Again, the Laplace approximation can be used to optimize the hyperparameters (Tipping, 2001), in which the mode of the posterior and its Hessian are found using IRLS. This gives a more principled approach to multiclass classification than the pairwise method used in the support vector machine and also provides probabilistic predictions for new data points. The principal disadvantage is that the Hessian matrix has size MK × MK, where M is the number of active basis functions, which gives an additional factor of K³ in the computational cost of training compared with the two-class RVM.

The principal disadvantage of the relevance vector machine is the relatively long training times compared with the SVM. This is offset, however, by the avoidance of cross-validation runs to set the model complexity parameters. Furthermore, because it yields sparser models, the computation time on test points, which is usually the more important consideration in practice, is typically much less.
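The softmax outputs (7.121) and the log of the likelihood (7.122) might be computed as follows; the max-shift is a standard numerical-stability device, not part of the text.

```python
def softmax(A):
    # Row-wise softmax of activations a_nk, equation (7.121).
    A = A - A.max(axis=1, keepdims=True)   # stabilizes exp()
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_log_likelihood(T, Phi, W):
    # Log of the likelihood (7.122); T holds the 1-of-K targets t_nk
    # and W holds one weight column per class.
    Y = softmax(Phi @ W)
    return np.sum(T * np.log(Y))
```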

377 Exercises

7.1 www  Suppose we have a data set of input vectors {x_n} with corresponding target values t_n ∈ {−1, 1}, and suppose that we model the density of input vectors within each class separately using a Parzen kernel density estimator (see Section 2.5.1) with a kernel k(x, x′). Write down the minimum misclassification-rate decision rule assuming the two classes have equal prior probability. Show also that, if the kernel is chosen to be k(x, x′) = x^T x′, then the classification rule reduces to simply assigning a new input vector to the class having the closest mean. Finally, show that, if the kernel takes the form k(x, x′) = φ(x)^T φ(x′), the classification is based on the closest mean in the feature space φ(x).

7.2  Show that, if the 1 on the right-hand side of the constraint (7.5) is replaced by some arbitrary constant γ > 0, the solution for the maximum margin hyperplane is unchanged.

7.3  Show that, irrespective of the dimensionality of the data space, a data set consisting of just two data points, one from each class, is sufficient to determine the location of the maximum-margin hyperplane.

7.4 www  Show that the value ρ of the margin for the maximum-margin hyperplane is given by

\frac{1}{\rho^2} = \sum_{n=1}^{N} a_n \qquad (7.123)

where {a_n} are given by maximizing (7.10) subject to the constraints (7.11) and (7.12).

7.5  Show that the values of ρ and {a_n} in the previous exercise also satisfy

\frac{1}{\rho^2} = 2 \widetilde{L}(\mathbf{a}) \qquad (7.124)

where \widetilde{L}(\mathbf{a}) is defined by (7.10). Similarly, show that

\frac{1}{\rho^2} = \lVert \mathbf{w} \rVert^2. \qquad (7.125)

7.6  Consider the logistic regression model with a target variable t ∈ {−1, 1}. If we define p(t = 1 | y) = σ(y) where y(x) is given by (7.1), show that the negative log likelihood, with the addition of a quadratic regularization term, takes the form (7.47).

7.7  Consider the Lagrangian (7.56) for the regression support vector machine. By setting the derivatives of the Lagrangian with respect to w, b, ξ_n, and ξ̂_n to zero and then back substituting to eliminate the corresponding variables, show that the dual Lagrangian is given by (7.61).

378 7.8 www  For the regression support vector machine considered in Section 7.1.4, show that all training data points for which ξ_n > 0 will have a_n = C, and similarly all points for which ξ̂_n > 0 will have â_n = C.

7.9  Verify the results (7.82) and (7.83) for the mean and covariance of the posterior distribution over weights in the regression RVM.

7.10 www  Derive the result (7.85) for the marginal likelihood function in the regression RVM, by performing the Gaussian integral over w in (7.84) using the technique of completing the square in the exponential.

7.11  Repeat the above exercise, but this time make use of the general result (2.115).

7.12 www  Show that direct maximization of the log marginal likelihood (7.85) for the regression relevance vector machine leads to the re-estimation equations (7.87) and (7.88), where γ_i is defined by (7.89).

7.13  In the evidence framework for RVM regression, we obtained the re-estimation formulae (7.87) and (7.88) by maximizing the marginal likelihood given by (7.85). Extend this approach by inclusion of hyperpriors given by gamma distributions of the form (B.26) and obtain the corresponding re-estimation formulae for α and β by maximizing the corresponding posterior probability p(t, α, β | X) with respect to α and β.

7.14  Derive the result (7.90) for the predictive distribution in the relevance vector machine for regression. Show that the predictive variance is given by (7.91).

7.15 www  Using the results (7.94) and (7.95), show that the marginal likelihood (7.85) can be written in the form (7.96), where λ(α_i) is defined by (7.97) and the sparsity and quality factors are defined by (7.98) and (7.99), respectively.

7.16  By taking the second derivative of the log marginal likelihood (7.97) for the regression RVM with respect to the hyperparameter α_i, show that the stationary point given by (7.101) is a maximum of the marginal likelihood.

7.17  Using (7.83) and (7.86), together with the matrix identity (C.7), show that the quantities Q_i and S_i defined by (7.102) and (7.103) can be written in the form (7.106) and (7.107).

7.18 www  Show that the gradient vector and Hessian matrix of the log posterior distribution (7.109) for the classification relevance vector machine are given by (7.110) and (7.111).

7.19  Verify that maximization of the approximate log marginal likelihood function (7.114) for the classification relevance vector machine leads to the result (7.116) for re-estimation of the hyperparameters.

379 8 Graphical Models

Probabilities play a central role in modern pattern recognition. We have seen in Chapter 1 that probability theory can be expressed in terms of two simple equations corresponding to the sum rule and the product rule. All of the probabilistic inference and learning manipulations discussed in this book, no matter how complex, amount to repeated application of these two equations. We could therefore proceed to formulate and solve complicated probabilistic models purely by algebraic manipulation. However, we shall find it highly advantageous to augment the analysis using diagrammatic representations of probability distributions, called probabilistic graphical models. These offer several useful properties:

1. They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.
2. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.

380 3. Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.

A graph comprises nodes (also called vertices) connected by links (also known as edges or arcs). In a probabilistic graphical model, each node represents a random variable (or group of random variables), and the links express probabilistic relationships between these variables. The graph then captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors each depending only on a subset of the variables. We shall begin by discussing Bayesian networks, also known as directed graphical models, in which the links of the graphs have a particular directionality indicated by arrows. The other major class of graphical models are Markov random fields, also known as undirected graphical models, in which the links do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables. For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a factor graph.

In this chapter, we shall focus on the key aspects of graphical models as needed for applications in pattern recognition and machine learning. More general treatments of graphical models can be found in the books by Whittaker (1990), Lauritzen (1996), Jensen (1996), Castillo et al. (1997), Jordan (1999), Cowell et al. (1999), and Jordan (2007).

8.1. Bayesian Networks

In order to motivate the use of directed graphs to describe probability distributions, consider first an arbitrary joint distribution p(a, b, c) over three variables a, b, and c. Note that at this stage, we do not need to specify anything further about these variables, such as whether they are discrete or continuous. Indeed, one of the powerful aspects of graphical models is that a specific graph can make probabilistic statements for a broad class of distributions. By application of the product rule of probability (1.11), we can write the joint distribution in the form

p(a, b, c) = p(c \mid a, b) \, p(a, b). \qquad (8.1)

A second application of the product rule, this time to the second term on the right-hand side of (8.1), gives

p(a, b, c) = p(c \mid a, b) \, p(b \mid a) \, p(a). \qquad (8.2)

Note that this decomposition holds for any choice of the joint distribution. We now represent the right-hand side of (8.2) in terms of a simple graphical model as follows. First we introduce a node for each of the random variables a, b, and c and associate each node with the corresponding conditional distribution on the right-hand side of (8.2).

381 Figure 8.1  A directed graphical model representing the joint probability distribution over three variables a, b, and c, corresponding to the decomposition on the right-hand side of (8.2).

Then, for each conditional distribution we add directed links (arrows) to the graph from the nodes corresponding to the variables on which the distribution is conditioned. Thus for the factor p(c | a, b), there will be links from nodes a and b to node c, whereas for the factor p(a) there will be no incoming links. The result is the graph shown in Figure 8.1. If there is a link going from a node a to a node b, then we say that node a is the parent of node b, and we say that node b is the child of node a. Note that we shall not make any formal distinction between a node and the variable to which it corresponds but will simply use the same symbol to refer to both.

An interesting point to note about (8.2) is that the left-hand side is symmetrical with respect to the three variables a, b, and c, whereas the right-hand side is not. Indeed, in making the decomposition in (8.2), we have implicitly chosen a particular ordering, namely a, b, c, and had we chosen a different ordering we would have obtained a different decomposition and hence a different graphical representation. We shall return to this point later.

For the moment let us extend the example of Figure 8.1 by considering the joint distribution over K variables given by p(x_1, ..., x_K). By repeated application of the product rule of probability, this joint distribution can be written as a product of conditional distributions, one for each of the variables

p(x_1, \ldots, x_K) = p(x_K \mid x_1, \ldots, x_{K-1}) \cdots p(x_2 \mid x_1) \, p(x_1). \qquad (8.3)

For a given choice of K, we can again represent this as a directed graph having K nodes, one for each conditional distribution on the right-hand side of (8.3), with each node having incoming links from all lower numbered nodes. We say that this graph is fully connected because there is a link between every pair of nodes.

So far, we have worked with completely general joint distributions, so that the decompositions, and their representations as fully connected graphs, will be applicable to any choice of distribution. As we shall see shortly, it is the absence of links in the graph that conveys interesting information about the properties of the class of distributions that the graph represents. Consider the graph shown in Figure 8.2. This is not a fully connected graph because, for instance, there is no link from x_1 to x_2 or from x_3 to x_7.

We shall now go from this graph to the corresponding representation of the joint probability distribution written in terms of the product of a set of conditional distributions, one for each node in the graph. Each such conditional distribution will be conditioned only on the parents of the corresponding node in the graph. For instance, x_5 will be conditioned on x_1 and x_3.

382 Figure 8.2  Example of a directed acyclic graph describing the joint distribution over variables x_1, ..., x_7. The corresponding decomposition of the joint distribution is given by (8.4).

The joint distribution of all 7 variables is therefore given by

p(x_1) \, p(x_2) \, p(x_3) \, p(x_4 \mid x_1, x_2, x_3) \, p(x_5 \mid x_1, x_3) \, p(x_6 \mid x_4) \, p(x_7 \mid x_4, x_5). \qquad (8.4)

The reader should take a moment to study carefully the correspondence between (8.4) and Figure 8.2.

We can now state in general terms the relationship between a given directed graph and the corresponding distribution over the variables. The joint distribution defined by a graph is given by the product, over all of the nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of that node in the graph. Thus, for a graph with K nodes, the joint distribution is given by

p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k) \qquad (8.5)

where pa_k denotes the set of parents of x_k, and x = {x_1, ..., x_K}. This key equation expresses the factorization properties of the joint distribution for a directed graphical model. Although we have considered each node to correspond to a single variable, we can equally well associate sets of variables and vector-valued variables with the nodes of a graph. It is easy to show that the representation on the right-hand side of (8.5) is always correctly normalized provided the individual conditional distributions are normalized (Exercise 8.1).

The directed graphs that we are considering are subject to an important restriction, namely that there must be no directed cycles, in other words there are no closed paths within the graph such that we can move from node to node along links following the direction of the arrows and end up back at the starting node. Such graphs are also called directed acyclic graphs, or DAGs. This is equivalent to the statement that there exists an ordering of the nodes such that there are no links that go from any node to any lower numbered node (Exercise 8.2).
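A small sketch of the factorization (8.5) for the DAG of Figure 8.2, assuming binary variables and arbitrary illustrative conditional tables (the encoding of nodes and tables is a choice made here, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# The DAG of Figure 8.2 over seven binary variables, 0-based, so that
# parents[3] = [0, 1, 2] encodes p(x4 | x1, x2, x3), and so on.
parents = {0: [], 1: [], 2: [], 3: [0, 1, 2], 4: [0, 2], 5: [3], 6: [3, 4]}

# One conditional table per node: the probability that x_k = 1 for each
# joint setting of its parents (the numbers are arbitrary illustrations).
cpt = {k: rng.random(2 ** len(ps)) for k, ps in parents.items()}

def joint(x, parents, cpt):
    # Evaluate p(x) as the product of p(x_k | pa_k), equation (8.5).
    p = 1.0
    for k, ps in parents.items():
        idx = sum(x[j] << i for i, j in enumerate(ps))
        p *= cpt[k][idx] if x[k] == 1 else 1.0 - cpt[k][idx]
    return p

# The factorization is automatically normalized (Exercise 8.1): the sum
# of the joint over all 2^7 configurations is 1.
configs = ([(n >> k) & 1 for k in range(7)] for n in range(2 ** 7))
assert abs(sum(joint(x, parents, cpt) for x in configs) - 1.0) < 1e-10
```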

383 8.1.1 Example: Polynomial regression

As an illustration of the use of directed graphs to describe probability distributions, we consider the Bayesian polynomial regression model introduced in Section 1.2.6.

Figure 8.3  Directed graphical model representing the joint distribution (8.6) corresponding to the Bayesian polynomial regression model introduced in Section 1.2.6.

The random variables in this model are the vector of polynomial coefficients w and the observed data t = (t_1, ..., t_N)^T. In addition, this model contains the input data x = (x_1, ..., x_N)^T, the noise variance σ², and the hyperparameter α representing the precision of the Gaussian prior over w, all of which are parameters of the model rather than random variables. Focussing just on the random variables for the moment, we see that the joint distribution is given by the product of the prior p(w) and N conditional distributions p(t_n | w) for n = 1, ..., N, so that

p(\mathbf{t}, \mathbf{w}) = p(\mathbf{w}) \prod_{n=1}^{N} p(t_n \mid \mathbf{w}). \qquad (8.6)

This joint distribution can be represented by the graphical model shown in Figure 8.3.

When we start to deal with more complex models later in the book, we shall find it inconvenient to have to write out multiple nodes of the form t_1, ..., t_N explicitly as in Figure 8.3. We therefore introduce a graphical notation that allows such multiple nodes to be expressed more compactly, in which we draw a single representative node t_n and then surround this with a box, called a plate, labelled with N indicating that there are N nodes of this kind. Re-writing the graph of Figure 8.3 in this way, we obtain the graph shown in Figure 8.4.

Figure 8.4  An alternative, more compact, representation of the graph shown in Figure 8.3 in which we have introduced a plate (the box labelled N) that represents N nodes of which only a single example t_n is shown explicitly.

We shall sometimes find it helpful to make the parameters of a model, as well as its stochastic variables, explicit. In this case, (8.6) becomes

p(\mathbf{t}, \mathbf{w} \mid \mathbf{x}, \alpha, \sigma^2) = p(\mathbf{w} \mid \alpha) \prod_{n=1}^{N} p(t_n \mid \mathbf{w}, x_n, \sigma^2).

Correspondingly, we can make x and α explicit in the graphical representation. To do this, we shall adopt the convention that random variables will be denoted by open circles, and deterministic parameters will be denoted by smaller solid circles. If we take the graph of Figure 8.4 and include the deterministic parameters, we obtain the graph shown in Figure 8.5.

When we apply a graphical model to a problem in machine learning or pattern recognition, we will typically set some of the random variables to specific observed values, for example the variables {t_n} from the training set in the case of polynomial curve fitting.

384 Figure 8.5  This shows the same model as in Figure 8.4 but with the deterministic parameters shown explicitly by the smaller solid nodes.

In a graphical model, we will denote such observed variables by shading the corresponding nodes. Thus the graph corresponding to Figure 8.5 in which the variables {t_n} are observed is shown in Figure 8.6. Note that the value of w is not observed, and so w is an example of a latent variable, also known as a hidden variable. Such variables play a crucial role in many probabilistic models and will form the focus of Chapters 9 and 12.

Having observed the values {t_n} we can, if desired, evaluate the posterior distribution of the polynomial coefficients w as discussed in Section 1.2.5. For the moment, we note that this involves a straightforward application of Bayes' theorem

p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{w}) \prod_{n=1}^{N} p(t_n \mid \mathbf{w}) \qquad (8.7)

where again we have omitted the deterministic parameters in order to keep the notation uncluttered.

In general, model parameters such as w are of little direct interest in themselves, because our ultimate goal is to make predictions for new input values. Suppose we are given a new input value x̂ and we wish to find the corresponding probability distribution for t̂ conditioned on the observed data. The graphical model that describes this problem is shown in Figure 8.7, and the corresponding joint distribution of all of the random variables in this model, conditioned on the deterministic parameters, is then given by

p(\widehat{t}, \mathbf{t}, \mathbf{w} \mid \widehat{x}, \mathbf{x}, \alpha, \sigma^2) = \left[ \prod_{n=1}^{N} p(t_n \mid x_n, \mathbf{w}, \sigma^2) \right] p(\mathbf{w} \mid \alpha) \, p(\widehat{t} \mid \widehat{x}, \mathbf{w}, \sigma^2). \qquad (8.8)

Figure 8.6  As in Figure 8.5 but with the nodes {t_n} shaded to indicate that the corresponding random variables have been set to their observed (training set) values.

385 Figure 8.7  The polynomial regression model, corresponding to Figure 8.6, showing also a new input value x̂ together with the corresponding model prediction t̂.

The required predictive distribution for t̂ is then obtained, from the sum rule of probability, by integrating out the model parameters w so that

p(\widehat{t} \mid \widehat{x}, \mathbf{x}, \mathbf{t}, \alpha, \sigma^2) \propto \int p(\widehat{t}, \mathbf{t}, \mathbf{w} \mid \widehat{x}, \mathbf{x}, \alpha, \sigma^2) \, \mathrm{d}\mathbf{w}

where we are implicitly setting the random variables in t to the specific values observed in the data set. The details of this calculation were discussed in Chapter 3.

8.1.2 Generative models

There are many situations in which we wish to draw samples from a given probability distribution. Although we shall devote the whole of Chapter 11 to a detailed discussion of sampling methods, it is instructive to outline here one technique, called ancestral sampling, which is particularly relevant to graphical models. Consider a joint distribution p(x_1, ..., x_K) over K variables that factorizes according to (8.5) corresponding to a directed acyclic graph. We shall suppose that the variables have been ordered such that there are no links from any node to any lower numbered node, in other words each node has a higher number than any of its parents. Our goal is to draw a sample x̂_1, ..., x̂_K from the joint distribution.

To do this, we start with the lowest-numbered node and draw a sample from the distribution p(x_1), which we call x̂_1. We then work through each of the nodes in order, so that for node n we draw a sample from the conditional distribution p(x_n | pa_n) in which the parent variables have been set to their sampled values. Note that at each stage, these parent values will always be available because they correspond to lower-numbered nodes that have already been sampled. Techniques for sampling from specific distributions will be discussed in detail in Chapter 11. Once we have sampled from the final variable x_K, we will have achieved our objective of obtaining a sample from the joint distribution. To obtain a sample from some marginal distribution corresponding to a subset of the variables, we simply take the sampled values for the required nodes and ignore the sampled values for the remaining nodes. For example, to draw a sample from the distribution p(x_2, x_4), we simply sample from the full joint distribution and then retain the values x̂_2, x̂_4 and discard the remaining values {x̂_j, j ≠ 2, 4}.
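Ancestral sampling is easy to sketch for the discrete DAG of the earlier example (reusing its parents, cpt, and rng objects); the fixed sample count is an illustrative choice:

```python
def ancestral_sample(parents, cpt, rng):
    # Sample each node in order, conditioning on the already-sampled
    # values of its parents (nodes are assumed topologically ordered).
    x = [0] * len(parents)
    for k, ps in parents.items():
        idx = sum(x[j] << i for i, j in enumerate(ps))
        x[k] = int(rng.random() < cpt[k][idx])
    return x

# Samples from the marginal p(x2, x4) (0-based indices 1 and 3):
# sample the full joint, retain the required components, discard the rest.
pairs = [(x[1], x[3]) for x in
         (ancestral_sample(parents, cpt, rng) for _ in range(1000))]
```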

386 Figure 8.8  A graphical model representing the process by which images of objects are created, in which the identity of an object (a discrete variable) and the position and orientation of that object (continuous variables) have independent prior probabilities. The image (a vector of pixel intensities) has a probability distribution that is dependent on the identity of the object as well as on its position and orientation.

For practical applications of probabilistic models, it will typically be the higher-numbered variables corresponding to terminal nodes of the graph that represent the observations, with lower-numbered nodes corresponding to latent variables. The primary role of the latent variables is to allow a complicated distribution over the observed variables to be represented in terms of a model constructed from simpler (typically exponential family) conditional distributions.

We can interpret such models as expressing the processes by which the observed data arose. For instance, consider an object recognition task in which each observed data point corresponds to an image (comprising a vector of pixel intensities) of one of the objects. In this case, the latent variables might have an interpretation as the position and orientation of the object. Given a particular observed image, our goal is to find the posterior distribution over objects, in which we integrate over all possible positions and orientations. We can represent this problem using a graphical model of the form shown in Figure 8.8.

The graphical model captures the causal process (Pearl, 1988) by which the observed data was generated. For this reason, such models are often called generative models. By contrast, the polynomial regression model described by Figure 8.5 is not generative because there is no probability distribution associated with the input variable x, and so it is not possible to generate synthetic data points from this model. We could make it generative by introducing a suitable prior distribution p(x), at the expense of a more complex model.

The hidden variables in a probabilistic model need not, however, have any explicit physical interpretation but may be introduced simply to allow a more complex joint distribution to be constructed from simpler components. In either case, the technique of ancestral sampling applied to a generative model mimics the creation of the observed data and would therefore give rise to 'fantasy' data whose probability distribution (if the model were a perfect representation of reality) would be the same as that of the observed data. In practice, producing synthetic observations from a generative model can prove informative in understanding the form of the probability distribution represented by that model.

8.1.3 Discrete variables

We have discussed the importance of probability distributions that are members of the exponential family (Section 2.4), and we have seen that this family includes many well-known distributions as particular cases. Although such distributions are relatively simple, they form useful building blocks for constructing more complex probability distributions, and the framework of graphical models is very useful in expressing the way in which these building blocks are linked together.

387 Figure 8.9  (a) This fully-connected graph describes a general distribution over two K-state discrete variables having a total of K² − 1 parameters. (b) By dropping the link between the nodes, the number of parameters is reduced to 2(K − 1).

Such models have particularly nice properties if we choose the relationship between each parent-child pair in a directed graph to be conjugate, and we shall explore several examples of this shortly. Two cases are particularly worthy of note, namely when the parent and child node each correspond to discrete variables and when they each correspond to Gaussian variables, because in these two cases the relationship can be extended hierarchically to construct arbitrarily complex directed acyclic graphs. We begin by examining the discrete case.

The probability distribution p(x | μ) for a single discrete variable x having K possible states (using the 1-of-K representation) is given by

p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \qquad (8.9)

and is governed by the parameters μ = (μ_1, ..., μ_K)^T. Due to the constraint Σ_k μ_k = 1, only K − 1 values for μ_k need to be specified in order to define the distribution.

Now suppose that we have two discrete variables, x_1 and x_2, each of which has K states, and we wish to model their joint distribution. We denote the probability of observing both x_{1k} = 1 and x_{2l} = 1 by the parameter μ_{kl}, where x_{1k} denotes the k-th component of x_1, and similarly for x_{2l}. The joint distribution can be written

p(\mathbf{x}_1, \mathbf{x}_2 \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \prod_{l=1}^{K} \mu_{kl}^{x_{1k} x_{2l}}.

Because the parameters μ_{kl} are subject to the constraint Σ_k Σ_l μ_{kl} = 1, this distribution is governed by K² − 1 parameters. It is easily seen that the total number of parameters that must be specified for an arbitrary joint distribution over M variables is K^M − 1 and therefore grows exponentially with the number M of variables.

Using the product rule, we can factor the joint distribution p(x_1, x_2) in the form p(x_2 | x_1) p(x_1), which corresponds to a two-node graph with a link going from the x_1 node to the x_2 node as shown in Figure 8.9(a). The marginal distribution p(x_1) is governed by K − 1 parameters, as before. Similarly, the conditional distribution p(x_2 | x_1) requires the specification of K − 1 parameters for each of the K possible values of x_1. The total number of parameters that must be specified in the joint distribution is therefore (K − 1) + K(K − 1) = K² − 1, as before.

Now suppose that the variables x_1 and x_2 were independent, corresponding to the graphical model shown in Figure 8.9(b). Each variable is then described by a separate multinomial distribution, and the total number of parameters would be 2(K − 1).

388 Figure 8.10  This chain of M discrete nodes, each having K states, requires the specification of K − 1 + (M − 1)K(K − 1) parameters, which grows linearly with the length M of the chain. In contrast, a fully connected graph of M nodes would have K^M − 1 parameters, which grows exponentially with M.

For a distribution over M independent discrete variables, each having K states, the total number of parameters would be M(K − 1), which therefore grows linearly with the number of variables. From a graphical perspective, we have reduced the number of parameters by dropping links in the graph, at the expense of having a restricted class of distributions.

More generally, if we have M discrete variables x_1, ..., x_M, we can model the joint distribution using a directed graph with one variable corresponding to each node. The conditional distribution at each node is given by a set of nonnegative parameters subject to the usual normalization constraint. If the graph is fully connected then we have a completely general distribution having K^M − 1 parameters, whereas if there are no links in the graph the joint distribution factorizes into the product of the marginals, and the total number of parameters is M(K − 1). Graphs having intermediate levels of connectivity allow for more general distributions than the fully factorized one while requiring fewer parameters than the general joint distribution. As an illustration, consider the chain of nodes shown in Figure 8.10. The marginal distribution p(x_1) requires K − 1 parameters, whereas each of the M − 1 conditional distributions p(x_i | x_{i−1}), for i = 2, ..., M, requires K(K − 1) parameters. This gives a total parameter count of K − 1 + (M − 1)K(K − 1), which is quadratic in K and which grows linearly (rather than exponentially) with the length M of the chain.

An alternative way to reduce the number of independent parameters in a model is by sharing parameters (also known as tying of parameters). For instance, in the chain example of Figure 8.10, we can arrange that all of the conditional distributions p(x_i | x_{i−1}), for i = 2, ..., M, are governed by the same set of K(K − 1) parameters. Together with the K − 1 parameters governing the distribution of x_1, this gives a total of K² − 1 parameters that must be specified in order to define the joint distribution.

We can turn a graph over discrete variables into a Bayesian model by introducing Dirichlet priors for the parameters. From a graphical point of view, each node then acquires an additional parent representing the Dirichlet distribution over the parameters associated with the corresponding discrete node. This is illustrated for the chain model in Figure 8.11. The corresponding model in which we tie the parameters governing the conditional distributions p(x_i | x_{i−1}), for i = 2, ..., M, is shown in Figure 8.12.

Another way of controlling the exponential growth in the number of parameters in models of discrete variables is to use parameterized models for the conditional distributions instead of complete tables of conditional probability values. To illustrate this idea, consider the graph in Figure 8.13 in which all of the nodes represent binary variables. Each of the parent variables x_i is governed by a single parameter μ_i representing the probability p(x_i = 1), giving M parameters in total for the parent nodes.

389 Figure 8.11  An extension of the model of Figure 8.10 to include Dirichlet priors over the parameters governing the discrete distributions.

Figure 8.12  As in Figure 8.11 but with a single set of parameters μ shared amongst all of the conditional distributions p(x_i | x_{i−1}).

Figure 8.13  A graph comprising M parents x_1, ..., x_M and a single child y, used to illustrate the idea of parameterized conditional distributions for discrete variables.

The conditional distribution p(y | x_1, ..., x_M), however, would require 2^M parameters representing the probability p(y = 1) for each of the 2^M possible settings of the parent variables. Thus in general the number of parameters required to specify this conditional distribution will grow exponentially with M. We can obtain a more parsimonious form for the conditional distribution by using a logistic sigmoid function (Section 2.4) acting on a linear combination of the parent variables, giving

p(y = 1 \mid x_1, \ldots, x_M) = \sigma\!\left( w_0 + \sum_{i=1}^{M} w_i x_i \right) = \sigma(\mathbf{w}^{\mathrm{T}} \mathbf{x}) \qquad (8.10)

where σ(a) = (1 + exp(−a))^{-1} is the logistic sigmoid, x = (x_0, x_1, ..., x_M)^T is an (M + 1)-dimensional vector of parent states augmented with an additional variable x_0 whose value is clamped to 1, and w = (w_0, w_1, ..., w_M)^T is a vector of M + 1 parameters. This is a more restricted form of conditional distribution than the general case but is now governed by a number of parameters that grows linearly with M. In this sense, it is analogous to the choice of a restrictive form of covariance matrix (for example, a diagonal matrix) in a multivariate Gaussian distribution. The motivation for the logistic sigmoid representation was discussed in Section 4.2.
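The parameterized conditional (8.10) in one short function; the function name is introduced here for illustration:

```python
def p_y_given_parents(x_parents, w):
    # Parameterized conditional (8.10): M + 1 weights instead of a
    # table of 2^M probabilities; x_0 = 1 supplies the bias term w_0.
    x_aug = np.concatenate(([1.0], x_parents))
    return 1.0 / (1.0 + np.exp(-(w @ x_aug)))
```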

390 8.1.4 Linear-Gaussian models

In the previous section, we saw how to construct joint probability distributions over a set of discrete variables by expressing the variables as nodes in a directed acyclic graph. Here we show how a multivariate Gaussian can be expressed as a directed graph corresponding to a linear-Gaussian model over the component variables. This allows us to impose interesting structure on the distribution, with the general Gaussian and the diagonal covariance Gaussian representing opposite extremes. Several widely used techniques are examples of linear-Gaussian models, such as probabilistic principal component analysis, factor analysis, and linear dynamical systems (Roweis and Ghahramani, 1999). We shall make extensive use of the results of this section in later chapters when we consider some of these techniques in detail.

Consider an arbitrary directed acyclic graph over D variables in which node i represents a single continuous random variable x_i having a Gaussian distribution. The mean of this distribution is taken to be a linear combination of the states of its parent nodes pa_i of node i

p(x_i \mid \mathrm{pa}_i) = \mathcal{N}\!\left( x_i \,\Big|\, \sum_{j \in \mathrm{pa}_i} w_{ij} x_j + b_i, \; v_i \right) \qquad (8.11)

where w_{ij} and b_i are parameters governing the mean, and v_i is the variance of the conditional distribution for x_i. The log of the joint distribution is then the log of the product of these conditionals over all nodes in the graph and hence takes the form

\ln p(\mathbf{x}) = \sum_{i=1}^{D} \ln p(x_i \mid \mathrm{pa}_i) \qquad (8.12)

= -\sum_{i=1}^{D} \frac{1}{2 v_i} \left( x_i - \sum_{j \in \mathrm{pa}_i} w_{ij} x_j - b_i \right)^{2} + \text{const} \qquad (8.13)

where x = (x_1, ..., x_D)^T and 'const' denotes terms independent of x. We see that this is a quadratic function of the components of x, and hence the joint distribution p(x) is a multivariate Gaussian.

We can determine the mean and covariance of the joint distribution recursively as follows. Each variable x_i has (conditional on the states of its parents) a Gaussian distribution of the form (8.11) and so

x_i = \sum_{j \in \mathrm{pa}_i} w_{ij} x_j + b_i + \sqrt{v_i}\, \epsilon_i \qquad (8.14)

where ε_i is a zero mean, unit variance Gaussian random variable satisfying E[ε_i] = 0 and E[ε_i ε_j] = I_{ij}, where I_{ij} is the i, j element of the identity matrix. Taking the expectation of (8.14), we have

\mathbb{E}[x_i] = \sum_{j \in \mathrm{pa}_i} w_{ij} \, \mathbb{E}[x_j] + b_i. \qquad (8.15)

Thus we can find the components of E[x] = (E[x_1], ..., E[x_D])^T by starting at the lowest numbered node and working recursively through the graph (here we again assume that the nodes are numbered such that each node has a higher number than its parents).
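The mean recursion (8.15) in code; parents is a dict of parent indices per node and W[i] maps each parent index to its weight w_{ij}, an encoding chosen here for illustration:

```python
def linear_gaussian_mean(parents, W, b):
    # E[x_i] via the recursion (8.15); nodes are assumed numbered so
    # that every parent precedes its children.
    mu = np.zeros(len(b))
    for i in range(len(b)):
        mu[i] = b[i] + sum(W[i][j] * mu[j] for j in parents[i])
    return mu
```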

391 Figure 8.14  A directed graph over three Gaussian variables, with one missing link.

Similarly, we can use (8.14) and (8.15) to obtain the i, j element of the covariance matrix for p(x) in the form of a recursion relation

\mathrm{cov}[x_i, x_j] = \mathbb{E}\!\left[ (x_i - \mathbb{E}[x_i])(x_j - \mathbb{E}[x_j]) \right] = \mathbb{E}\!\left[ (x_i - \mathbb{E}[x_i]) \left\{ \sum_{k \in \mathrm{pa}_j} w_{jk} (x_k - \mathbb{E}[x_k]) + \sqrt{v_j}\, \epsilon_j \right\} \right] = \sum_{k \in \mathrm{pa}_j} w_{jk} \, \mathrm{cov}[x_i, x_k] + I_{ij} v_j \qquad (8.16)

and so the covariance can similarly be evaluated recursively starting from the lowest numbered node.

Let us consider two extreme cases. First of all, suppose that there are no links in the graph, which therefore comprises D isolated nodes. In this case, there are no parameters w_{ij} and so there are just D parameters b_i and D parameters v_i. From the recursion relations (8.15) and (8.16), we see that the mean of p(x) is given by (b_1, ..., b_D)^T and the covariance matrix is diagonal of the form diag(v_1, ..., v_D). The joint distribution has a total of 2D parameters and represents a set of D independent univariate Gaussian distributions.

Now consider a fully connected graph in which each node has all lower numbered nodes as parents. The matrix w_{ij} then has i − 1 entries on the i-th row and hence is a lower triangular matrix (with no entries on the leading diagonal). Then the total number of parameters w_{ij} is obtained by taking the number D² of elements in a D × D matrix, subtracting D to account for the absence of elements on the leading diagonal, and then dividing by 2 because the matrix has elements only below the diagonal, giving a total of D(D − 1)/2. The total number of independent parameters {w_{ij}} and {v_i} in the covariance matrix is therefore D(D + 1)/2 (Section 2.3), corresponding to a general symmetric covariance matrix.

Graphs having some intermediate level of complexity correspond to joint Gaussian distributions with partially constrained covariance matrices. Consider for example the graph shown in Figure 8.14, which has a link missing between variables x_1 and x_3. Using the recursion relations (8.15) and (8.16), we see that the mean and covariance of the joint distribution are given by (Exercise 8.7)

\boldsymbol{\mu} = \left( b_1, \; b_2 + w_{21} b_1, \; b_3 + w_{32} b_2 + w_{32} w_{21} b_1 \right)^{\mathrm{T}} \qquad (8.17)

\boldsymbol{\Sigma} = \begin{pmatrix} v_1 & w_{21} v_1 & w_{32} w_{21} v_1 \\ w_{21} v_1 & v_2 + w_{21}^2 v_1 & w_{32} (v_2 + w_{21}^2 v_1) \\ w_{32} w_{21} v_1 & w_{32} (v_2 + w_{21}^2 v_1) & v_3 + w_{32}^2 (v_2 + w_{21}^2 v_1) \end{pmatrix}. \qquad (8.18)
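The covariance recursion (8.16), together with a check against the Figure 8.14 chain; the specific parameter values are arbitrary illustrations:

```python
def linear_gaussian_cov(parents, W, v):
    # cov[x_i, x_j] via (8.16), filling pairs i <= j in increasing j so
    # that every entry the recursion needs already exists.
    D = len(v)
    cov = np.zeros((D, D))
    for j in range(D):
        for i in range(j + 1):
            c = sum(W[j][k] * cov[i, k] for k in parents[j])
            if i == j:
                c += v[j]       # the I_ij v_j term of (8.16)
            cov[i, j] = cov[j, i] = c
    return cov

# The chain of Figure 8.14 (x1 -> x2 -> x3, 0-based indices).
chain_parents = {0: [], 1: [0], 2: [1]}
W = {0: {}, 1: {0: 0.5}, 2: {1: -0.3}}
b, v = np.array([1.0, 0.0, 2.0]), np.array([1.0, 2.0, 1.5])
mu = linear_gaussian_mean(chain_parents, W, b)
Sigma = linear_gaussian_cov(chain_parents, W, v)
# mu and Sigma reproduce (8.17) and (8.18) for these parameter values.
```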

We can readily extend the linear-Gaussian graphical model to the case in which the nodes of the graph represent multivariate Gaussian variables. In this case, we can write the conditional distribution for node $i$ in the form

$$ p(\mathbf{x}_i \mid \mathrm{pa}_i) = \mathcal{N}\Bigl( \mathbf{x}_i \,\Big|\, \sum_{j \in \mathrm{pa}_i} \mathbf{W}_{ij} \mathbf{x}_j + \mathbf{b}_i,\; \boldsymbol{\Sigma}_i \Bigr) \qquad (8.19) $$

where now $\mathbf{W}_{ij}$ is a matrix (which is nonsquare if $\mathbf{x}_i$ and $\mathbf{x}_j$ have different dimensionalities). Again it is easy to verify that the joint distribution over all variables is Gaussian.

Note that we have already encountered a specific example of the linear-Gaussian relationship when we saw that the conjugate prior for the mean $\mu$ of a Gaussian variable $x$ is itself a Gaussian distribution over $\mu$ (Section 2.3.6). The joint distribution over $x$ and $\mu$ is therefore Gaussian. This corresponds to a simple two-node graph in which the node representing $\mu$ is the parent of the node representing $x$. The mean of the distribution over $\mu$ is a parameter controlling a prior, and so it can be viewed as a hyperparameter. Because the value of this hyperparameter may itself be unknown, we can again treat it from a Bayesian perspective by introducing a prior over the hyperparameter, sometimes called a hyperprior, which is again given by a Gaussian distribution. This type of construction can be extended in principle to any level and is an illustration of a hierarchical Bayesian model, of which we shall encounter further examples in later chapters.

8.2. Conditional Independence

An important concept for probability distributions over multiple variables is that of conditional independence (Dawid, 1980). Consider three variables $a$, $b$, and $c$, and suppose that the conditional distribution of $a$, given $b$ and $c$, is such that it does not depend on the value of $b$, so that

$$ p(a \mid b, c) = p(a \mid c). \qquad (8.20) $$

We say that $a$ is conditionally independent of $b$ given $c$. This can be expressed in a slightly different way if we consider the joint distribution of $a$ and $b$ conditioned on $c$, which we can write in the form

$$ p(a, b \mid c) = p(a \mid b, c)\, p(b \mid c) = p(a \mid c)\, p(b \mid c) \qquad (8.21) $$

where we have used the product rule of probability together with (8.20). Thus we see that, conditioned on $c$, the joint distribution of $a$ and $b$ factorizes into the product of the marginal distribution of $a$ and the marginal distribution of $b$ (again both conditioned on $c$). This says that the variables $a$ and $b$ are statistically independent, given $c$. Note that our definition of conditional independence requires that (8.20), or equivalently (8.21), hold for every possible value of $c$, and not just for some values.

Figure 8.15: The first of three examples of graphs over three variables $a$, $b$, and $c$ used to discuss conditional independence properties of directed graphical models.

We shall sometimes use a shorthand notation for conditional independence (Dawid, 1979) in which

$$ a \perp\!\!\!\perp b \mid c \qquad (8.22) $$

denotes that $a$ is conditionally independent of $b$ given $c$, and is equivalent to (8.20).

Conditional independence properties play an important role in using probabilistic models for pattern recognition by simplifying both the structure of a model and the computations needed to perform inference and learning under that model. We shall see examples of this shortly.

If we are given an expression for the joint distribution over a set of variables in terms of a product of conditional distributions (i.e., the mathematical representation underlying a directed graph), then we could in principle test whether any potential conditional independence property holds by repeated application of the sum and product rules of probability. In practice, such an approach would be very time consuming. An important and elegant feature of graphical models is that conditional independence properties of the joint distribution can be read directly from the graph without having to perform any analytical manipulations. The general framework for achieving this is called d-separation, where the 'd' stands for 'directed' (Pearl, 1988). Here we shall motivate the concept of d-separation and give a general statement of the d-separation criterion. A formal proof can be found in Lauritzen (1996).

8.2.1 Three example graphs

We begin our discussion of the conditional independence properties of directed graphs by considering three simple examples, each involving a graph having just three nodes. Together, these will motivate and illustrate the key concepts of d-separation. The first of the three examples is shown in Figure 8.15, and the joint distribution corresponding to this graph is easily written down using the general result (8.5) to give

$$ p(a, b, c) = p(a \mid c)\, p(b \mid c)\, p(c). \qquad (8.23) $$

If none of the variables are observed, then we can investigate whether $a$ and $b$ are independent by marginalizing both sides of (8.23) with respect to $c$ to give

$$ p(a, b) = \sum_{c} p(a \mid c)\, p(b \mid c)\, p(c). \qquad (8.24) $$

In general, this does not factorize into the product $p(a)\, p(b)$, and so

$$ a \not\perp\!\!\!\perp b \mid \emptyset \qquad (8.25) $$

where $\emptyset$ denotes the empty set, and the symbol $\not\perp\!\!\!\perp$ means that the conditional independence property does not hold in general. Of course, it may hold for a particular distribution by virtue of the specific numerical values associated with the various conditional probabilities, but it does not follow in general from the structure of the graph.

Figure 8.16: As in Figure 8.15 but where we have conditioned on the value of variable $c$.

Now suppose we condition on the variable $c$, as represented by the graph of Figure 8.16. From (8.23), we can easily write down the conditional distribution of $a$ and $b$, given $c$, in the form

$$ p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = p(a \mid c)\, p(b \mid c) $$

and so we obtain the conditional independence property

$$ a \perp\!\!\!\perp b \mid c. $$

We can provide a simple graphical interpretation of this result by considering the path from node $a$ to node $b$ via $c$. The node $c$ is said to be tail-to-tail with respect to this path because the node is connected to the tails of the two arrows, and the presence of such a path connecting nodes $a$ and $b$ causes these nodes to be dependent. However, when we condition on node $c$, as in Figure 8.16, the conditioned node 'blocks' the path from $a$ to $b$ and causes $a$ and $b$ to become (conditionally) independent.

We can similarly consider the graph shown in Figure 8.17. The joint distribution corresponding to this graph is again obtained from our general formula (8.5) to give

$$ p(a, b, c) = p(a)\, p(c \mid a)\, p(b \mid c). \qquad (8.26) $$

First of all, suppose that none of the variables are observed. Again, we can test to see if $a$ and $b$ are independent by marginalizing over $c$ to give

$$ p(a, b) = p(a) \sum_{c} p(c \mid a)\, p(b \mid c) = p(a)\, p(b \mid a) $$

which in general does not factorize into $p(a)\, p(b)$, and so

$$ a \not\perp\!\!\!\perp b \mid \emptyset \qquad (8.27) $$

as before.

Figure 8.17: The second of our three examples of 3-node graphs ($a \to c \to b$) used to motivate the conditional independence framework for directed graphical models.

Figure 8.18: As in Figure 8.17 but now conditioning on node $c$.

Now suppose we condition on node $c$, as shown in Figure 8.18. Using Bayes' theorem, together with (8.26), we obtain

$$ p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(c \mid a)\, p(b \mid c)}{p(c)} = p(a \mid c)\, p(b \mid c) $$

and so again we obtain the conditional independence property

$$ a \perp\!\!\!\perp b \mid c. $$

As before, we can interpret this result graphically. The node $c$ is said to be head-to-tail with respect to the path from node $a$ to node $b$. Such a path connects nodes $a$ and $b$ and renders them dependent. If we now observe $c$, as in Figure 8.18, then this observation 'blocks' the path from $a$ to $b$, and so we obtain the conditional independence property $a \perp\!\!\!\perp b \mid c$.

Finally, we consider the third of our 3-node examples, shown by the graph in Figure 8.19. As we shall see, this has a more subtle behaviour than the two previous graphs. The joint distribution can again be written down using our general result (8.5) to give

$$ p(a, b, c) = p(a)\, p(b)\, p(c \mid a, b). \qquad (8.28) $$

Consider first the case where none of the variables are observed. Marginalizing both sides of (8.28) over $c$, we obtain

$$ p(a, b) = p(a)\, p(b) $$

and so $a$ and $b$ are independent with no variables observed, in contrast to the two previous examples. We can write this result as

$$ a \perp\!\!\!\perp b \mid \emptyset. \qquad (8.29) $$

Figure 8.19: The last of our three examples of 3-node graphs ($a \to c \leftarrow b$) used to explore conditional independence properties in graphical models. This graph has rather different properties from the two previous examples.

Figure 8.20: As in Figure 8.19 but conditioning on the value of node $c$. In this graph, the act of conditioning induces a dependence between $a$ and $b$.

Now suppose we condition on $c$, as indicated in Figure 8.20. The conditional distribution of $a$ and $b$ is then given by

$$ p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(b)\, p(c \mid a, b)}{p(c)} $$

which in general does not factorize into the product $p(a)\, p(b)$, and so

$$ a \not\perp\!\!\!\perp b \mid c. $$

Thus our third example has the opposite behaviour from the first two. Graphically, we say that node $c$ is head-to-head with respect to the path from $a$ to $b$ because it connects to the heads of the two arrows. When node $c$ is unobserved, it 'blocks' the path, and the variables $a$ and $b$ are independent. However, conditioning on $c$ 'unblocks' the path and renders $a$ and $b$ dependent.

There is one more subtlety associated with this third example that we need to consider. First we introduce some more terminology. We say that node $y$ is a descendant of node $x$ if there is a path from $x$ to $y$ in which each step of the path follows the directions of the arrows. Then it can be shown that a head-to-head path will become unblocked if either the node, or any of its descendants, is observed (Exercise 8.10).

In summary, a tail-to-tail node or a head-to-tail node leaves a path unblocked unless it is observed, in which case it blocks the path. By contrast, a head-to-head node blocks a path if it is unobserved, but once the node, and/or at least one of its descendants, is observed the path becomes unblocked.

It is worth spending a moment to understand further the unusual behaviour of the graph of Figure 8.20. Consider a particular instance of such a graph corresponding to a problem with three binary random variables relating to the fuel system on a car, as shown in Figure 8.21. The variables are called $B$, representing the state of a battery that is either charged ($B = 1$) or flat ($B = 0$); $F$, representing the state of the fuel tank that is either full of fuel ($F = 1$) or empty ($F = 0$); and $G$, which is the state of an electric fuel gauge and which indicates either full ($G = 1$) or empty ($G = 0$).

Figure 8.21: An example of a 3-node graph used to illustrate the phenomenon of 'explaining away'. The three nodes represent the state of the battery ($B$), the state of the fuel tank ($F$), and the reading on the electric fuel gauge ($G$). See the text for details.

The battery is either charged or flat, and independently the fuel tank is either full or empty, with prior probabilities

$$ p(B = 1) = 0.9, \qquad p(F = 1) = 0.9. $$

Given the state of the fuel tank and the battery, the fuel gauge reads full with probabilities given by

$$ p(G = 1 \mid B = 1, F = 1) = 0.8 $$
$$ p(G = 1 \mid B = 1, F = 0) = 0.2 $$
$$ p(G = 1 \mid B = 0, F = 1) = 0.2 $$
$$ p(G = 1 \mid B = 0, F = 0) = 0.1 $$

so this is a rather unreliable fuel gauge! All remaining probabilities are determined by the requirement that probabilities sum to one, and so we have a complete specification of the probabilistic model.

Before we observe any data, the prior probability of the fuel tank being empty is $p(F = 0) = 0.1$. Now suppose that we observe the fuel gauge and discover that it reads empty, i.e., $G = 0$, corresponding to the middle graph in Figure 8.21. We can use Bayes' theorem to evaluate the posterior probability of the fuel tank being empty. First we evaluate the denominator for Bayes' theorem, given by

$$ p(G = 0) = \sum_{B \in \{0,1\}} \sum_{F \in \{0,1\}} p(G = 0 \mid B, F)\, p(B)\, p(F) = 0.315 \qquad (8.30) $$

and similarly we evaluate

$$ p(G = 0 \mid F = 0) = \sum_{B \in \{0,1\}} p(G = 0 \mid B, F = 0)\, p(B) = 0.81 \qquad (8.31) $$

and using these results we have

$$ p(F = 0 \mid G = 0) = \frac{p(G = 0 \mid F = 0)\, p(F = 0)}{p(G = 0)} \simeq 0.257 \qquad (8.32) $$

and so $p(F = 0 \mid G = 0) > p(F = 0)$. Thus observing that the gauge reads empty makes it more likely that the tank is indeed empty, as we would intuitively expect.

Next suppose that we also check the state of the battery and find that it is flat, i.e., $B = 0$. We have now observed the states of both the fuel gauge and the battery, as shown by the right-hand graph in Figure 8.21. The posterior probability that the fuel tank is empty given the observations of both the fuel gauge and the battery state is then given by

$$ p(F = 0 \mid G = 0, B = 0) = \frac{p(G = 0 \mid B = 0, F = 0)\, p(F = 0)}{\displaystyle\sum_{F \in \{0,1\}} p(G = 0 \mid B = 0, F)\, p(F)} \simeq 0.111 \qquad (8.33) $$

where the prior probability $p(B = 0)$ has cancelled between numerator and denominator. Thus the probability that the tank is empty has decreased (from $0.257$ to $0.111$) as a result of the observation of the state of the battery. This accords with our intuition that finding out that the battery is flat explains away the observation that the fuel gauge reads empty. We see that the state of the fuel tank and that of the battery have indeed become dependent on each other as a result of observing the reading on the fuel gauge. In fact, this would also be the case if, instead of observing the fuel gauge directly, we observed the state of some descendant of $G$. Note that the probability $p(F = 0 \mid G = 0, B = 0) \simeq 0.111$ is greater than the prior probability $p(F = 0) = 0.1$, because the observation that the fuel gauge reads empty still provides some evidence in favour of an empty fuel tank.
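These posterior probabilities follow mechanically from the sum and product rules, and it can be instructive to reproduce them numerically. The short script below is a sketch: the table encoding is ours, but the probabilities are exactly those specified above, and it recovers both $0.257$ and $0.111$.

```python
# Numerical check of the 'explaining away' calculation.
pB = {1: 0.9, 0: 0.1}                      # battery charged / flat
pF = {1: 0.9, 0: 0.1}                      # tank full / empty
pG1 = {(1, 1): 0.8, (1, 0): 0.2,           # p(G=1 | B, F)
       (0, 1): 0.2, (0, 0): 0.1}
pG = lambda g, b, f: pG1[(b, f)] if g == 1 else 1.0 - pG1[(b, f)]

# (8.30): p(G=0) = sum_{B,F} p(G=0|B,F) p(B) p(F)
pG0 = sum(pG(0, b, f) * pB[b] * pF[f] for b in (0, 1) for f in (0, 1))

# (8.31) and (8.32): p(F=0 | G=0) by Bayes' theorem
pG0_F0 = sum(pG(0, b, 0) * pB[b] for b in (0, 1))
print(round(pG0_F0 * pF[0] / pG0, 3))                    # 0.257

# (8.33): p(F=0 | G=0, B=0) -- the flat battery explains away the reading
num = pG(0, 0, 0) * pF[0]
den = sum(pG(0, 0, f) * pF[f] for f in (0, 1))
print(round(num / den, 3))                               # 0.111
```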

8.2.2 D-separation

We now give a general statement of the d-separation property (Pearl, 1988) for directed graphs. Consider a general directed graph in which $A$, $B$, and $C$ are arbitrary nonintersecting sets of nodes (whose union may be smaller than the complete set of nodes in the graph). We wish to ascertain whether a particular conditional independence statement $A \perp\!\!\!\perp B \mid C$ is implied by a given directed acyclic graph. To do so, we consider all possible paths from any node in $A$ to any node in $B$. Any such path is said to be blocked if it includes a node such that either

(a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set $C$, or

(b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set $C$.

If all paths are blocked, then $A$ is said to be d-separated from $B$ by $C$, and the joint distribution over all of the variables in the graph will satisfy $A \perp\!\!\!\perp B \mid C$.

Figure 8.22: Illustration of the concept of d-separation. See the text for details.

The concept of d-separation is illustrated in Figure 8.22. In graph (a), the path from $a$ to $b$ is not blocked by node $f$, because it is a tail-to-tail node for this path and is not observed; nor is it blocked by node $e$, because, although the latter is a head-to-head node, it has a descendant $c$ that is in the conditioning set. Thus the conditional independence statement $a \perp\!\!\!\perp b \mid c$ does not follow from this graph. In graph (b), the path from $a$ to $b$ is blocked by node $f$, because this is a tail-to-tail node that is observed, and so the conditional independence property $a \perp\!\!\!\perp b \mid f$ will be satisfied by any distribution that factorizes according to this graph. Note that this path is also blocked by node $e$, because $e$ is a head-to-head node and neither it nor its descendant are in the conditioning set.

For the purposes of d-separation, parameters such as $\alpha$ and $\sigma^2$ in Figure 8.5, indicated by small filled circles, behave in the same way as observed nodes. However, there are no marginal distributions associated with such nodes. Consequently parameter nodes never themselves have parents, and so all paths through these nodes will always be tail-to-tail and hence blocked. Consequently they play no role in d-separation.

Another example of conditional independence and d-separation is provided by the concept of i.i.d. (independent identically distributed) data introduced in Section 1.2.4. Consider the problem of finding the posterior distribution for the mean of a univariate Gaussian distribution (Section 2.3). This can be represented by the directed graph shown in Figure 8.23, in which the joint distribution is defined by a prior $p(\mu)$ together with a set of conditional distributions $p(x_n \mid \mu)$ for $n = 1, \ldots, N$. In practice, we observe $\mathcal{D} = \{x_1, \ldots, x_N\}$ and our goal is to infer $\mu$. Suppose, for a moment, that we condition on $\mu$ and consider the joint distribution of the observations. Using d-separation, we note that there is a unique path from any $x_i$ to any other $x_{j \neq i}$, and that this path is tail-to-tail with respect to the observed node $\mu$. Every such path is blocked, and so the observations $\mathcal{D} = \{x_1, \ldots, x_N\}$ are independent given $\mu$, so that

$$ p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu). \qquad (8.34) $$

Figure 8.23: (a) Directed graph corresponding to the problem of inferring the mean $\mu$ of a univariate Gaussian distribution from observations $x_1, \ldots, x_N$. (b) The same graph drawn using the plate notation.
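The criterion above can be turned into an algorithm. The sketch below uses a standard reachability procedure (often called 'Bayes ball'; see, e.g., Koller and Friedman, 2009) rather than anything given in the text: it first marks the ancestors of the conditioning set, then explores (node, direction) pairs so that head-to-head nodes are traversed only when they or a descendant are observed.

```python
# d-separation test for a DAG given as a dict mapping each node to its parents.
def d_separated(parents, A, B, C):
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    # Ancestors of the conditioning set C (including C itself): a head-to-head
    # node unblocks a path iff it lies in this set.
    anc, stack = set(C), list(C)
    while stack:
        for p in parents[stack.pop()]:
            if p not in anc:
                anc.add(p)
                stack.append(p)
    # Traverse (node, direction) pairs; 'up' = entered from a child,
    # 'down' = entered from a parent.
    visited, agenda = set(), [(a, 'up') for a in A]
    while agenda:
        node, direction = agenda.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in C and node in B:
            return False                      # an unblocked path reaches B
        if direction == 'up' and node not in C:
            agenda += [(p, 'up') for p in parents[node]]
            agenda += [(c, 'down') for c in children[node]]
        elif direction == 'down':
            if node not in C:                 # tail-to-tail / head-to-tail: pass
                agenda += [(c, 'down') for c in children[node]]
            if node in anc:                   # head-to-head with observed descendant
                agenda += [(p, 'up') for p in parents[node]]
    return True

# The three examples of Section 8.2.1 (c tail-to-tail, head-to-tail, head-to-head):
print(d_separated({'a': ['c'], 'b': ['c'], 'c': []}, {'a'}, {'b'}, {'c'}))   # True
print(d_separated({'a': [], 'c': ['a'], 'b': ['c']}, {'a'}, {'b'}, {'c'}))   # True
print(d_separated({'a': [], 'b': [], 'c': ['a', 'b']}, {'a'}, {'b'}, set())) # True
print(d_separated({'a': [], 'b': [], 'c': ['a', 'b']}, {'a'}, {'b'}, {'c'})) # False
```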

Figure 8.24: A graphical representation of the 'naive Bayes' model for classification. Conditioned on the class label $\mathbf{z}$, the components of the observed vector $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$ are assumed to be independent.

However, if we integrate over $\mu$, the observations are in general no longer independent

$$ p(\mathcal{D}) = \int p(\mathcal{D} \mid \mu)\, p(\mu)\, \mathrm{d}\mu \neq \prod_{n=1}^{N} p(x_n). \qquad (8.35) $$

Here $\mu$ is a latent variable, because its value is not observed.

Another example of a model representing i.i.d. data is the graph in Figure 8.7 corresponding to Bayesian polynomial regression. Here the stochastic nodes correspond to $\{t_n\}$, $\mathbf{w}$ and $\widehat{t}$. We see that the node for $\mathbf{w}$ is tail-to-tail with respect to the path from $\widehat{t}$ to any one of the nodes $t_n$, and so we have the following conditional independence property

$$ \widehat{t} \perp\!\!\!\perp t_n \mid \mathbf{w}. \qquad (8.36) $$

Thus, conditioned on the polynomial coefficients $\mathbf{w}$, the predictive distribution for $\widehat{t}$ is independent of the training data $\{t_1, \ldots, t_N\}$. We can therefore first use the training data to determine the posterior distribution over the coefficients $\mathbf{w}$, and then we can discard the training data and use the posterior distribution for $\mathbf{w}$ to make predictions of $\widehat{t}$ for new input observations $\widehat{x}$ (Section 3.3).

A related graphical structure arises in an approach to classification called the naive Bayes model, in which we use conditional independence assumptions to simplify the model structure. Suppose our observed variable consists of a $D$-dimensional vector $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$, and we wish to assign observed values of $\mathbf{x}$ to one of $K$ classes. Using the 1-of-$K$ encoding scheme, we can represent these classes by a $K$-dimensional binary vector $\mathbf{z}$. We can then define a generative model by introducing a multinomial prior $p(\mathbf{z} \mid \boldsymbol{\mu})$ over the class labels, where the $k$th component $\mu_k$ of $\boldsymbol{\mu}$ is the prior probability of class $\mathcal{C}_k$, together with a conditional distribution $p(\mathbf{x} \mid \mathbf{z})$ for the observed vector $\mathbf{x}$. The key assumption of the naive Bayes model is that, conditioned on the class $\mathbf{z}$, the distributions of the input variables $x_1, \ldots, x_D$ are independent. The graphical representation of this model is shown in Figure 8.24. We see that observation of $\mathbf{z}$ blocks the path between $x_i$ and $x_j$ for $j \neq i$ (because such paths are tail-to-tail at the node $\mathbf{z}$), and so $x_i$ and $x_j$ are conditionally independent given $\mathbf{z}$. If, however, we marginalize out $\mathbf{z}$ (so that $\mathbf{z}$ is unobserved), the tail-to-tail path from $x_i$ to $x_j$ is no longer blocked. This tells us that in general the marginal density $p(\mathbf{x})$ will not factorize with respect to the components of $\mathbf{x}$. We encountered a simple application of the naive Bayes model in the context of fusing data from different sources for medical diagnosis in Section 1.5.

If we are given a labelled training set, comprising inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ together with their class labels, then we can fit the naive Bayes model to the training data using maximum likelihood, assuming that the data are drawn independently from the model. The solution is obtained by fitting the model for each class separately using the correspondingly labelled data.
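As a concrete illustration, here is a minimal sketch of maximum-likelihood fitting for the naive Bayes model with Gaussian class-conditionals (anticipating the example discussed next). The function names and the synthetic data are ours; the key point is that each class and each input dimension is fitted independently, so the class-conditional covariances come out diagonal.

```python
import numpy as np

def fit_naive_bayes(X, z):
    """X: (N, D) inputs; z: (N,) integer class labels in 0..K-1."""
    K = z.max() + 1
    priors = np.array([(z == k).mean() for k in range(K)])          # p(z = k)
    means = np.stack([X[z == k].mean(axis=0) for k in range(K)])    # per class, per dim
    vars_ = np.stack([X[z == k].var(axis=0) for k in range(K)])     # diagonal covariances
    return priors, means, vars_

def predict_log_posterior(x, priors, means, vars_):
    # log p(z=k | x) up to a constant: log prior plus a sum of independent
    # univariate Gaussian log densities, one per input dimension.
    log_lik = -0.5 * (np.log(2 * np.pi * vars_)
                      + (x - means) ** 2 / vars_).sum(axis=1)
    return np.log(priors) + log_lik

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
z = np.repeat([0, 1], 50)
params = fit_naive_bayes(X, z)
print(predict_log_posterior(np.array([2.5, 2.5]), *params).argmax())  # class 1
```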

As an example, suppose that the probability density within each class is chosen to be Gaussian. In this case, the naive Bayes assumption implies that the covariance matrix for each Gaussian is diagonal, and the contours of constant density within each class will be axis-aligned ellipsoids. The marginal density, however, is given by a superposition of diagonal Gaussians (with weighting coefficients given by the class priors) and so will no longer factorize with respect to its components.

The naive Bayes assumption is helpful when the dimensionality $D$ of the input space is high, making density estimation in the full $D$-dimensional space more challenging. It is also useful if the input vector contains both discrete and continuous variables, since each can be represented separately using appropriate models (e.g., Bernoulli distributions for binary observations or Gaussians for real-valued variables). The conditional independence assumption of this model is clearly a strong one that may lead to rather poor representations of the class-conditional densities. Nevertheless, even if this assumption is not precisely satisfied, the model may still give good classification performance in practice, because the decision boundaries can be insensitive to some of the details in the class-conditional densities, as illustrated in Figure 1.27.

We have seen that a particular directed graph represents a specific decomposition of a joint probability distribution into a product of conditional probabilities. The graph also expresses a set of conditional independence statements obtained through the d-separation criterion, and the d-separation theorem is really an expression of the equivalence of these two properties. In order to make this clear, it is helpful to think of a directed graph as a filter. Suppose we consider a particular joint probability distribution $p(\mathbf{x})$ over the variables $\mathbf{x}$ corresponding to the (nonobserved) nodes of the graph. The filter will allow this distribution to pass through if, and only if, it can be expressed in terms of the factorization (8.5) implied by the graph. If we present to the filter the set of all possible distributions $p(\mathbf{x})$ over the set of variables $\mathbf{x}$, then the subset of distributions that are passed by the filter will be denoted $\mathcal{DF}$, for directed factorization. This is illustrated in Figure 8.25. Alternatively, we can use the graph as a different kind of filter by first listing all of the conditional independence properties obtained by applying the d-separation criterion to the graph, and then allowing a distribution to pass only if it satisfies all of these properties. If we present all possible distributions $p(\mathbf{x})$ to this second kind of filter, then the d-separation theorem tells us that the set of distributions that will be allowed through is precisely the set $\mathcal{DF}$.

It should be emphasized that the conditional independence properties obtained from d-separation apply to any probabilistic model described by that particular directed graph. This will be true, for instance, whether the variables are discrete or continuous or a combination of these. Again, we see that a particular graph is describing a whole family of probability distributions.
At one extreme we have a fully connected graph that exhibits no conditional independence properties at all, and which can represent any possible joint probability distribution over the given variables. The set $\mathcal{DF}$ will then contain all possible distributions $p(\mathbf{x})$.

Figure 8.25: We can view a graphical model (in this case a directed graph) as a filter in which a probability distribution $p(\mathbf{x})$ is allowed through the filter if, and only if, it satisfies the directed factorization property (8.5). The set of all possible probability distributions $p(\mathbf{x})$ that pass through the filter is denoted $\mathcal{DF}$. We can alternatively use the graph to filter distributions according to whether they respect all of the conditional independencies implied by the d-separation properties of the graph. The d-separation theorem says that it is the same set of distributions $\mathcal{DF}$ that will be allowed through this second kind of filter.

At the other extreme, we have the fully disconnected graph, i.e., one having no links at all. This corresponds to joint distributions which factorize into the product of the marginal distributions over the variables comprising the nodes of the graph.

Note that for any given graph, the set of distributions $\mathcal{DF}$ will include any distributions that have additional independence properties beyond those described by the graph. For instance, a fully factorized distribution will always be passed through the filter implied by any graph over the corresponding set of variables.

We end our discussion of conditional independence properties by exploring the concept of a Markov blanket or Markov boundary. Consider a joint distribution $p(\mathbf{x}_1, \ldots, \mathbf{x}_D)$ represented by a directed graph having $D$ nodes, and consider the conditional distribution of a particular node with variables $\mathbf{x}_i$ conditioned on all of the remaining variables $\mathbf{x}_{j \neq i}$. Using the factorization property (8.5), we can express this conditional distribution in the form

$$ p(\mathbf{x}_i \mid \mathbf{x}_{\{j \neq i\}}) = \frac{p(\mathbf{x}_1, \ldots, \mathbf{x}_D)}{\displaystyle\int p(\mathbf{x}_1, \ldots, \mathbf{x}_D)\, \mathrm{d}\mathbf{x}_i} = \frac{\displaystyle\prod_k p(\mathbf{x}_k \mid \mathrm{pa}_k)}{\displaystyle\int \prod_k p(\mathbf{x}_k \mid \mathrm{pa}_k)\, \mathrm{d}\mathbf{x}_i} $$

in which the integral is replaced by a summation in the case of discrete variables. We now observe that any factor $p(\mathbf{x}_k \mid \mathrm{pa}_k)$ that does not have any functional dependence on $\mathbf{x}_i$ can be taken outside the integral over $\mathbf{x}_i$, and will therefore cancel between numerator and denominator. The only factors that remain will be the conditional distribution $p(\mathbf{x}_i \mid \mathrm{pa}_i)$ for node $\mathbf{x}_i$ itself, together with the conditional distributions $p(\mathbf{x}_k \mid \mathrm{pa}_k)$ for any nodes $\mathbf{x}_k$ such that node $\mathbf{x}_i$ is in the conditioning set of $p(\mathbf{x}_k \mid \mathrm{pa}_k)$, in other words for which $\mathbf{x}_i$ is a parent of $\mathbf{x}_k$. The conditional $p(\mathbf{x}_i \mid \mathrm{pa}_i)$ will depend on the parents of node $\mathbf{x}_i$, whereas the conditionals $p(\mathbf{x}_k \mid \mathrm{pa}_k)$ will depend on the children of $\mathbf{x}_i$ as well as on the co-parents, in other words variables corresponding to parents of node $\mathbf{x}_k$ other than node $\mathbf{x}_i$. The set of nodes comprising the parents, the children and the co-parents is called the Markov blanket, and is illustrated in Figure 8.26. We can think of the Markov blanket of a node $\mathbf{x}_i$ as being the minimal set of nodes that isolates $\mathbf{x}_i$ from the rest of the graph. Note that it is not sufficient to include only the parents and children of node $\mathbf{x}_i$, because the phenomenon of explaining away means that observations of the child nodes will not block paths to the co-parents. We must therefore observe the co-parent nodes also.
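Reading off the Markov blanket from a parent list is a one-liner worth making explicit. The small sketch below (not from the text; node names are illustrative) collects parents, children, and co-parents.

```python
# Markov blanket of a node in a directed graph given by parent lists.
def markov_blanket(parents, i):
    children = [k for k, ps in parents.items() if i in ps]
    co_parents = {p for k in children for p in parents[k] if p != i}
    return set(parents[i]) | set(children) | co_parents

# A fragment with a v-structure: x3 has parents x1 and x2, and children x4, x5.
parents = {'x1': [], 'x2': [], 'x3': ['x1', 'x2'], 'x4': ['x3'], 'x5': ['x3']}
print(sorted(markov_blanket(parents, 'x3')))   # ['x1', 'x2', 'x4', 'x5']
print(sorted(markov_blanket(parents, 'x1')))   # ['x2', 'x3']  (x2 is a co-parent)
```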

Figure 8.26: The Markov blanket of a node $\mathbf{x}_i$ comprises the set of parents, children and co-parents of the node. It has the property that the conditional distribution of $\mathbf{x}_i$, conditioned on all the remaining variables in the graph, is dependent only on the variables in the Markov blanket.

8.3. Markov Random Fields

We have seen that directed graphical models specify a factorization of the joint distribution over a set of variables into a product of local conditional distributions. They also define a set of conditional independence properties that must be satisfied by any distribution that factorizes according to the graph. We turn now to the second major class of graphical models, which are described by undirected graphs and which again specify both a factorization and a set of conditional independence relations.

A Markov random field, also known as a Markov network or an undirected graphical model (Kindermann and Snell, 1980), has a set of nodes each of which corresponds to a variable or group of variables, as well as a set of links each of which connects a pair of nodes. The links are undirected, that is, they do not carry arrows. In the case of undirected graphs, it is convenient to begin with a discussion of conditional independence properties.

8.3.1 Conditional independence properties

In the case of directed graphs, we saw that it was possible to test whether a particular conditional independence property holds by applying a graphical test called d-separation (Section 8.2). This involved testing whether or not the paths connecting two sets of nodes were 'blocked'. The definition of blocked, however, was somewhat subtle due to the presence of paths having head-to-head nodes. We might ask whether it is possible to define an alternative graphical semantics for probability distributions such that conditional independence is determined by simple graph separation. This is indeed the case and corresponds to undirected graphical models. By removing the directionality from the links of the graph, the asymmetry between parent and child nodes is removed, and so the subtleties associated with head-to-head nodes no longer arise.

Figure 8.27: An example of an undirected graph in which every path from any node in set $A$ to any node in set $B$ passes through at least one node in set $C$. Consequently the conditional independence property $A \perp\!\!\!\perp B \mid C$ holds for any probability distribution described by this graph.

Suppose that in an undirected graph we identify three sets of nodes, denoted $A$, $B$, and $C$, and that we consider the conditional independence property

$$ A \perp\!\!\!\perp B \mid C. \qquad (8.37) $$

To test whether this property is satisfied by a probability distribution defined by a graph, we consider all possible paths that connect nodes in set $A$ to nodes in set $B$. If all such paths pass through one or more nodes in set $C$, then all such paths are 'blocked' and so the conditional independence property holds. However, if there is at least one such path that is not blocked, then the property does not necessarily hold, or more precisely there will exist at least some distributions corresponding to the graph that do not satisfy this conditional independence relation. This is illustrated with an example in Figure 8.27. Note that this is exactly the same as the d-separation criterion except that there is no 'explaining away' phenomenon. Testing for conditional independence in undirected graphs is therefore simpler than in directed graphs.

An alternative way to view the conditional independence test is to imagine removing all nodes in set $C$ from the graph, together with any links that connect to those nodes. We then ask if there exists a path that connects any node in $A$ to any node in $B$. If there are no such paths, then the conditional independence property must hold.

The Markov blanket for an undirected graph takes a particularly simple form, because a node will be conditionally independent of all other nodes conditioned only on the neighbouring nodes, as illustrated in Figure 8.28.
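The node-removal view translates directly into code: delete the nodes in $C$, then run an ordinary graph search. The sketch below (helper name and example graph ours) makes the contrast with d-separation concrete, since no direction or descendant bookkeeping is needed.

```python
# Graph-separation test for an undirected graph.
def separated(neighbours, A, B, C):
    """neighbours: dict mapping each node to the set of its neighbours."""
    frontier, seen = list(set(A) - set(C)), set(A) - set(C)
    while frontier:
        node = frontier.pop()
        if node in B:
            return False              # an unblocked path exists
        for nb in neighbours[node]:
            if nb not in C and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return True                       # every path from A to B passes through C

# A chain a - c1 - c2 - b with a side branch c1 - d.
g = {'a': {'c1'}, 'c1': {'a', 'c2', 'd'}, 'c2': {'c1', 'b'},
     'b': {'c2'}, 'd': {'c1'}}
print(separated(g, {'a'}, {'b'}, {'c2'}))   # True:  c2 blocks the only path
print(separated(g, {'a'}, {'b'}, {'d'}))    # False: the path a-c1-c2-b is open
```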

Figure 8.28: For an undirected graph, the Markov blanket of a node $x_i$ consists of the set of neighbouring nodes. It has the property that the conditional distribution of $x_i$, conditioned on all the remaining variables in the graph, is dependent only on the variables in the Markov blanket.

8.3.2 Factorization properties

We now seek a factorization rule for undirected graphs that will correspond to the above conditional independence test. Again, this will involve expressing the joint distribution $p(\mathbf{x})$ as a product of functions defined over sets of variables that are local to the graph. We therefore need to decide what is the appropriate notion of locality in this case.

If we consider two nodes $x_i$ and $x_j$ that are not connected by a link, then these variables must be conditionally independent given all other nodes in the graph. This follows from the fact that there is no direct path between the two nodes, and all other paths pass through nodes that are observed, and hence those paths are blocked. This conditional independence property can be expressed as

$$ p(x_i, x_j \mid \mathbf{x}_{\setminus \{i,j\}}) = p(x_i \mid \mathbf{x}_{\setminus \{i,j\}})\, p(x_j \mid \mathbf{x}_{\setminus \{i,j\}}) \qquad (8.38) $$

where $\mathbf{x}_{\setminus \{i,j\}}$ denotes the set $\mathbf{x}$ of all variables with $x_i$ and $x_j$ removed. The factorization of the joint distribution must therefore be such that $x_i$ and $x_j$ do not appear in the same factor, in order for the conditional independence property to hold for all possible distributions belonging to the graph.

This leads us to consider a graphical concept called a clique, which is defined as a subset of the nodes in a graph such that there exists a link between all pairs of nodes in the subset. In other words, the set of nodes in a clique is fully connected. Furthermore, a maximal clique is a clique such that it is not possible to include any other nodes from the graph in the set without it ceasing to be a clique. These concepts are illustrated by the undirected graph over four variables shown in Figure 8.29. This graph has five cliques of two nodes, given by $\{x_1, x_2\}$, $\{x_2, x_3\}$, $\{x_3, x_4\}$, $\{x_4, x_2\}$, and $\{x_1, x_3\}$, as well as two maximal cliques given by $\{x_1, x_2, x_3\}$ and $\{x_2, x_3, x_4\}$. The set $\{x_1, x_2, x_3, x_4\}$ is not a clique because of the missing link from $x_1$ to $x_4$.

We can therefore define the factors in the decomposition of the joint distribution to be functions of the variables in the cliques. In fact, we can consider functions of the maximal cliques, without loss of generality, because other cliques must be subsets of maximal cliques. Thus, if $\{x_1, x_2, x_3\}$ is a maximal clique and we define an arbitrary function over this clique, then including another factor defined over a subset of these variables would be redundant.

Figure 8.29: A four-node undirected graph showing a clique (outlined in green) and a maximal clique (outlined in blue).

Let us denote a clique by $C$ and the set of variables in that clique by $\mathbf{x}_C$. Then the joint distribution is written as a product of potential functions $\psi_C(\mathbf{x}_C)$ over the maximal cliques of the graph

$$ p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C). \qquad (8.39) $$

Here the quantity $Z$, sometimes called the partition function, is a normalization constant and is given by

$$ Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C) \qquad (8.40) $$

which ensures that the distribution $p(\mathbf{x})$ given by (8.39) is correctly normalized. By considering only potential functions which satisfy $\psi_C(\mathbf{x}_C) \geqslant 0$ we ensure that $p(\mathbf{x}) \geqslant 0$. In (8.40) we have assumed that $\mathbf{x}$ comprises discrete variables, but the framework is equally applicable to continuous variables, or a combination of the two, in which case the summation is replaced by the appropriate combination of summation and integration.

Note that we do not restrict the choice of potential functions to those that have a specific probabilistic interpretation as marginal or conditional distributions. This is in contrast to directed graphs, in which each factor represents the conditional distribution of the corresponding variable, conditioned on the state of its parents. However, in special cases, for instance where the undirected graph is constructed by starting with a directed graph, the potential functions may indeed have such an interpretation, as we shall see shortly.

One consequence of the generality of the potential functions $\psi_C(\mathbf{x}_C)$ is that their product will in general not be correctly normalized. We therefore have to introduce an explicit normalization factor given by (8.40). Recall that for directed graphs, the joint distribution was automatically normalized as a consequence of the normalization of each of the conditional distributions in the factorization.

The presence of this normalization constant is one of the major limitations of undirected graphs. If we have a model with $M$ discrete nodes each having $K$ states, then the evaluation of the normalization term involves summing over $K^M$ states and so (in the worst case) is exponential in the size of the model. The partition function is needed for parameter learning, because it will be a function of any parameters that govern the potential functions $\psi_C(\mathbf{x}_C)$. However, for evaluation of local conditional distributions, the partition function is not needed, because a conditional is the ratio of two marginals, and the partition function cancels between numerator and denominator when evaluating this ratio. Similarly, for evaluating local marginal probabilities we can work with the unnormalized joint distribution and then normalize the marginals explicitly at the end. Provided the marginals only involve a small number of variables, the evaluation of their normalization coefficients will be feasible.
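For a small model the sum in (8.40) can be carried out exactly, which makes the $K^M$ cost concrete. In the sketch below the clique potentials are arbitrary positive tables invented for illustration, one per maximal clique of a four-node chain.

```python
import itertools
import numpy as np

K = 3                                          # states per variable
cliques = [(0, 1), (1, 2), (2, 3)]             # maximal cliques of a 4-node chain
rng = np.random.default_rng(1)
psi = {c: rng.uniform(0.5, 2.0, (K, K)) for c in cliques}

# (8.40): Z = sum over all K**4 joint states of the product of clique potentials.
Z = sum(np.prod([psi[(i, j)][x[i], x[j]] for (i, j) in cliques])
        for x in itertools.product(range(K), repeat=4))

def p(x):
    """Normalized joint probability (8.39) of a configuration x."""
    return np.prod([psi[(i, j)][x[i], x[j]] for (i, j) in cliques]) / Z

print(Z, sum(p(x) for x in itertools.product(range(K), repeat=4)))  # second value: 1.0
```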

So far, we have discussed the notion of conditional independence based on simple graph separation, and we have proposed a factorization of the joint distribution that is intended to correspond to this conditional independence structure. However, we have not made any formal connection between conditional independence and factorization for undirected graphs. To do so we need to restrict attention to potential functions $\psi_C(\mathbf{x}_C)$ that are strictly positive (i.e., never zero or negative for any choice of $\mathbf{x}_C$). Given this restriction, we can make a precise relationship between factorization and conditional independence.

To do this we again return to the concept of a graphical model as a filter, corresponding to Figure 8.25. Consider the set of all possible distributions defined over a fixed set of variables corresponding to the nodes of a particular undirected graph. We can define $\mathcal{UI}$ to be the set of such distributions that are consistent with the set of conditional independence statements that can be read from the graph using graph separation. Similarly, we can define $\mathcal{UF}$ to be the set of such distributions that can be expressed as a factorization of the form (8.39) with respect to the maximal cliques of the graph. The Hammersley-Clifford theorem (Clifford, 1990) states that the sets $\mathcal{UI}$ and $\mathcal{UF}$ are identical.

Because we are restricted to potential functions which are strictly positive, it is convenient to express them as exponentials, so that

$$ \psi_C(\mathbf{x}_C) = \exp\{ -E(\mathbf{x}_C) \} \qquad (8.41) $$

where $E(\mathbf{x}_C)$ is called an energy function, and the exponential representation is called the Boltzmann distribution. The joint distribution is defined as the product of potentials, and so the total energy is obtained by adding the energies of each of the maximal cliques.

In contrast to the factors in the joint distribution for a directed graph, the potentials in an undirected graph do not have a specific probabilistic interpretation. Although this gives greater flexibility in choosing the potential functions, because there is no normalization constraint, it does raise the question of how to motivate a choice of potential function for a particular application. This can be done by viewing the potential function as expressing which configurations of the local variables are preferred to others. Global configurations that have a relatively high probability are those that find a good balance in satisfying the (possibly conflicting) influences of the clique potentials. We turn now to a specific example to illustrate the use of undirected graphs.

8.3.3 Illustration: Image de-noising

We can illustrate the application of undirected graphs using an example of noise removal from a binary image (Besag, 1974; Geman and Geman, 1984; Besag, 1986). Although a very simple example, this is typical of more sophisticated applications. Let the observed noisy image be described by an array of binary pixel values $y_i \in \{-1, +1\}$, where the index $i = 1, \ldots, D$ runs over all pixels. We shall suppose that the image is obtained by taking an unknown noise-free image, described by binary pixel values $x_i \in \{-1, +1\}$, and randomly flipping the sign of pixels with some small probability. An example binary image, together with a noise-corrupted image obtained by flipping the sign of the pixels with probability 10%, is shown in Figure 8.30. Given the noisy image, our goal is to recover the original noise-free image.

Because the noise level is small, we know that there will be a strong correlation between $x_i$ and $y_i$. We also know that neighbouring pixels $x_i$ and $x_j$ in an image are strongly correlated.

Figure 8.30: Illustration of image de-noising using a Markov random field. The top row shows the original binary image on the left and the corrupted image after randomly changing 10% of the pixels on the right. The bottom row shows the restored images obtained using iterated conditional modes (ICM) on the left and using the graph-cut algorithm on the right. ICM produces an image where 96% of the pixels agree with the original image, whereas the corresponding number for graph-cut is 99%.

This prior knowledge can be captured using the Markov random field model whose undirected graph is shown in Figure 8.31. This graph has two types of cliques, each of which contains two variables. The cliques of the form $\{x_i, y_i\}$ have an associated energy function that expresses the correlation between these variables. We choose a very simple energy function for these cliques of the form $-\eta x_i y_i$, where $\eta$ is a positive constant. This has the desired effect of giving a lower energy (thus encouraging a higher probability) when $x_i$ and $y_i$ have the same sign and a higher energy when they have the opposite sign.

The remaining cliques comprise pairs of variables $\{x_i, x_j\}$, where $i$ and $j$ are indices of neighbouring pixels. Again, we want the energy to be lower when the pixels have the same sign than when they have the opposite sign, and so we choose an energy given by $-\beta x_i x_j$, where $\beta$ is a positive constant.

Because a potential function is an arbitrary, nonnegative function over a maximal clique, we can multiply it by any nonnegative functions of subsets of the clique, or equivalently we can add the corresponding energies.

Figure 8.31: An undirected graphical model representing a Markov random field for image de-noising, in which $x_i$ is a binary variable denoting the state of pixel $i$ in the unknown noise-free image, and $y_i$ denotes the corresponding value of pixel $i$ in the observed noisy image.

In this example, this allows us to add an extra term $h x_i$ for each pixel $i$ in the noise-free image. Such a term has the effect of biasing the model towards pixel values that have one particular sign in preference to the other. The complete energy function for the model then takes the form

$$ E(\mathbf{x}, \mathbf{y}) = h \sum_i x_i - \beta \sum_{\{i,j\}} x_i x_j - \eta \sum_i x_i y_i \qquad (8.42) $$

which defines a joint distribution over $\mathbf{x}$ and $\mathbf{y}$ given by

$$ p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \exp\{ -E(\mathbf{x}, \mathbf{y}) \}. \qquad (8.43) $$

We now fix the elements of $\mathbf{y}$ to the observed values given by the pixels of the noisy image, which implicitly defines a conditional distribution $p(\mathbf{x} \mid \mathbf{y})$ over noise-free images. This is an example of the Ising model, which has been widely studied in statistical physics. For the purposes of image restoration, we wish to find an image $\mathbf{x}$ having a high probability (ideally the maximum probability). To do this we shall use a simple iterative technique called iterated conditional modes, or ICM (Kittler and Föglein, 1984), which is simply an application of coordinate-wise gradient ascent. The idea is first to initialize the variables $\{x_i\}$, which we do by simply setting $x_i = y_i$ for all $i$. Then we take one node $x_j$ at a time and we evaluate the total energy for the two possible states $x_j = +1$ and $x_j = -1$, keeping all other node variables fixed, and set $x_j$ to whichever state has the lower energy. This will either leave the probability unchanged, if $x_j$ is unchanged, or will increase it. Because only one variable is changed, this is a simple local computation that can be performed efficiently (Exercise 8.13). We then repeat the update for another site, and so on, until some suitable stopping criterion is satisfied. The nodes may be updated in a systematic way, for instance by repeatedly raster scanning through the image, or by choosing nodes at random. If we have a sequence of updates in which every site is visited at least once, and in which no changes to the variables are made, then by definition the algorithm will have converged to a local maximum of the probability. This need not, however, correspond to the global maximum.
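A compact sketch of ICM for the energy (8.42) on a synthetic image is given below. The parameter values follow the text ($\beta = 1.0$, $\eta = 2.1$, $h = 0$), but the toy image and function name are ours; only the terms of (8.42) containing the pixel being updated need to be compared for its two states.

```python
import numpy as np

def icm_denoise(y, beta=1.0, eta=2.1, h=0.0, sweeps=10):
    x = y.copy()                                   # initialize x_i = y_i
    R, C = x.shape
    for _ in range(sweeps):
        changed = False
        for i in range(R):
            for j in range(C):
                # Local part of (8.42): x_ij * (h - beta * sum(neighbours) - eta * y_ij),
                # so x_ij = +1 has lower energy iff (-h + beta*nb + eta*y_ij) > 0.
                nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < R and 0 <= b < C)
                best = 1 if (-h + beta * nb + eta * y[i, j]) > 0 else -1
                if best != x[i, j]:
                    x[i, j], changed = best, True
        if not changed:
            break                                  # a local maximum of p(x | y)
    return x

rng = np.random.default_rng(0)
clean = np.ones((32, 32), int); clean[8:24, 8:24] = -1        # a square 'image'
noisy = np.where(rng.random(clean.shape) < 0.1, -clean, clean)
restored = icm_denoise(noisy)
print((noisy != clean).mean(), (restored != clean).mean())    # error rate drops
```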

Figure 8.32: (a) Example of a directed graph: a chain $x_1 \to x_2 \to \cdots \to x_{N-1} \to x_N$. (b) The equivalent undirected graph.

For the purposes of this simple illustration, we have fixed the parameters to be $\beta = 1.0$, $\eta = 2.1$ and $h = 0$. Note that leaving $h = 0$ simply means that the prior probabilities of the two states of $x_i$ are equal. Starting with the observed noisy image as the initial configuration, we run ICM until convergence, leading to the de-noised image shown in the lower left panel of Figure 8.30. Note that if we set $\beta = 0$, which effectively removes the links between neighbouring pixels, then the global most probable solution is given by $x_i = y_i$ for all $i$, corresponding to the observed noisy image (Exercise 8.14).

Later we shall discuss a more effective algorithm for finding high probability solutions called the max-product algorithm (Section 8.4), which typically leads to better solutions, although this is still not guaranteed to find the global maximum of the posterior distribution. However, for certain classes of model, including the one given by (8.42), there exist efficient algorithms based on graph cuts that are guaranteed to find the global maximum (Greig et al., 1989; Boykov et al., 2001; Kolmogorov and Zabih, 2004). The lower right panel of Figure 8.30 shows the result of applying a graph-cut algorithm to the de-noising problem.

8.3.4 Relation to directed graphs

We have introduced two graphical frameworks for representing probability distributions, corresponding to directed and undirected graphs, and it is instructive to discuss the relation between these. Consider first the problem of taking a model that is specified using a directed graph and trying to convert it to an undirected graph. In some cases this is straightforward, as in the simple example in Figure 8.32. Here the joint distribution for the directed graph is given as a product of conditionals in the form

$$ p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_N \mid x_{N-1}). \qquad (8.44) $$

Now let us convert this to an undirected graph representation, as shown in Figure 8.32. In the undirected graph, the maximal cliques are simply the pairs of neighbouring nodes, and so from (8.39) we wish to write the joint distribution in the form

$$ p(\mathbf{x}) = \frac{1}{Z}\, \psi_{1,2}(x_1, x_2)\, \psi_{2,3}(x_2, x_3) \cdots \psi_{N-1,N}(x_{N-1}, x_N). \qquad (8.45) $$

Figure 8.33: Example of a simple directed graph (a) and the corresponding moral graph (b).

This is easily done by identifying

$$ \psi_{1,2}(x_1, x_2) = p(x_1)\, p(x_2 \mid x_1) $$
$$ \psi_{2,3}(x_2, x_3) = p(x_3 \mid x_2) $$
$$ \vdots $$
$$ \psi_{N-1,N}(x_{N-1}, x_N) = p(x_N \mid x_{N-1}) $$

where we have absorbed the marginal $p(x_1)$ for the first node into the first potential function. Note that in this case the partition function $Z = 1$.

Let us consider how to generalize this construction, so that we can convert any distribution specified by a factorization over a directed graph into one specified by a factorization over an undirected graph. This can be achieved if the clique potentials of the undirected graph are given by the conditional distributions of the directed graph. In order for this to be valid, we must ensure that the set of variables that appears in each of the conditional distributions is a member of at least one clique of the undirected graph. For nodes on the directed graph having just one parent, this is achieved simply by replacing the directed link with an undirected link. However, for nodes in the directed graph having more than one parent, this is not sufficient. These are nodes that have 'head-to-head' paths encountered in our discussion of conditional independence. Consider the simple directed graph over 4 nodes shown in Figure 8.33. The joint distribution for the directed graph takes the form

$$ p(\mathbf{x}) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3). \qquad (8.46) $$

We see that the factor $p(x_4 \mid x_1, x_2, x_3)$ involves the four variables $x_1$, $x_2$, $x_3$, and $x_4$, and so these must all belong to a single clique if this conditional distribution is to be absorbed into a clique potential. To ensure this, we add extra links between all pairs of parents of the node $x_4$. Anachronistically, this process of 'marrying the parents' has become known as moralization, and the resulting undirected graph, after dropping the arrows, is called the moral graph. It is important to observe that the moral graph in this example is fully connected and so exhibits no conditional independence properties, in contrast to the original directed graph.

Thus in general, to convert a directed graph into an undirected graph, we first add additional undirected links between all pairs of parents for each node in the graph and then drop the arrows on the original links to give the moral graph.
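The moralization recipe just described is only a few lines of code. The sketch below (function name ours) links all pairs of parents of every node and drops the arrows, returning the moral graph as undirected neighbour sets.

```python
import itertools

def moralize(parents):
    nbrs = {n: set() for n in parents}
    for child, ps in parents.items():
        for p in ps:                                 # drop arrows: keep parent-child links
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in itertools.combinations(ps, 2):   # 'marry' the parents
            nbrs[p].add(q)
            nbrs[q].add(p)
    return nbrs

# The directed graph of Figure 8.33(a): x4 has parents x1, x2 and x3.
parents = {'x1': [], 'x2': [], 'x3': [], 'x4': ['x1', 'x2', 'x3']}
moral = moralize(parents)
print(sorted(moral['x1']))   # ['x2', 'x3', 'x4'] -- the moral graph is fully connected
```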

Then we initialize all of the clique potentials of the moral graph to 1. We then take each conditional distribution factor in the original directed graph and multiply it into one of the clique potentials. There will always exist at least one maximal clique that contains all of the variables in the factor, as a result of the moralization step. Note that in all cases the partition function is given by $Z = 1$.

The process of converting a directed graph into an undirected graph plays an important role in exact inference techniques such as the junction tree algorithm (Section 8.4). Converting from an undirected to a directed representation is much less common and in general presents problems due to the normalization constraints.

We saw that in going from a directed to an undirected representation we had to discard some conditional independence properties from the graph. Of course, we could always trivially convert any distribution over a directed graph into one over an undirected graph by simply using a fully connected undirected graph. This would, however, discard all conditional independence properties and so would be vacuous. The process of moralization adds the fewest extra links and so retains the maximum number of independence properties.

We have seen that the procedure for determining the conditional independence properties is different between directed and undirected graphs. It turns out that the two types of graph can express different conditional independence properties, and it is worth exploring this issue in more detail. To do so, we return to the view of a specific (directed or undirected) graph as a filter (Section 8.2), so that the set of all possible distributions over the given variables is reduced to the subset that respects the conditional independencies implied by the graph. A graph is said to be a D map (for 'dependency map') of a distribution if every conditional independence statement satisfied by the distribution is reflected in the graph. Thus a completely disconnected graph (no links) will be a trivial D map for any distribution.

Alternatively, we can consider a specific distribution and ask which graphs have the appropriate conditional independence properties. If every conditional independence statement implied by a graph is satisfied by a specific distribution, then the graph is said to be an I map (for 'independence map') of that distribution. Clearly a fully connected graph will be a trivial I map for any distribution.

If it is the case that every conditional independence property of the distribution is reflected in the graph, and vice versa, then the graph is said to be a perfect map for that distribution.

Figure 8.34: Venn diagram illustrating the set of all distributions $P$ over a given set of variables, together with the set of distributions $D$ that can be represented as a perfect map using a directed graph, and the set $U$ that can be represented as a perfect map using an undirected graph.

Figure 8.35: A directed graph whose conditional independence properties cannot be expressed using an undirected graph over the same three variables.

A perfect map is therefore both an I map and a D map. Consider the set of distributions such that for each distribution there exists a directed graph that is a perfect map. This set is distinct from the set of distributions such that for each distribution there exists an undirected graph that is a perfect map. In addition there are distributions for which neither directed nor undirected graphs offer a perfect map. This is illustrated as a Venn diagram in Figure 8.34.

Figure 8.35 shows an example of a directed graph that is a perfect map for a distribution satisfying the conditional independence properties $A \perp\!\!\!\perp B \mid \emptyset$ and $A \not\perp\!\!\!\perp B \mid C$. There is no corresponding undirected graph over the same three variables that is a perfect map.

Conversely, consider the undirected graph over four variables shown in Figure 8.36. This graph exhibits the properties $A \not\perp\!\!\!\perp B \mid \emptyset$, $C \perp\!\!\!\perp D \mid A \cup B$ and $A \perp\!\!\!\perp B \mid C \cup D$. There is no directed graph over four variables that implies the same set of conditional independence properties.

The graphical framework can be extended in a consistent way to graphs that include both directed and undirected links. These are called chain graphs (Lauritzen and Wermuth, 1989; Frydenberg, 1990), and contain the directed and undirected graphs considered so far as special cases. Although such graphs can represent a broader class of distributions than either directed or undirected graphs alone, there remain distributions for which even a chain graph cannot provide a perfect map. Chain graphs are not discussed further in this book.

Figure 8.36: An undirected graph whose conditional independence properties cannot be expressed in terms of a directed graph over the same variables.

8.4. Inference in Graphical Models

We turn now to the problem of inference in graphical models, in which some of the nodes in a graph are clamped to observed values, and we wish to compute the posterior distributions of one or more subsets of other nodes. As we shall see, we can exploit the graphical structure both to find efficient algorithms for inference and to make the structure of those algorithms transparent. Specifically, we shall see that many algorithms can be expressed in terms of the propagation of local messages around the graph. In this section, we shall focus primarily on techniques for exact inference, and in Chapter 10 we shall consider a number of approximate inference algorithms.

Figure 8.37: A graphical representation of Bayes' theorem. See the text for details.

To start with, let us consider the graphical interpretation of Bayes' theorem. Suppose we decompose the joint distribution $p(x, y)$ over two variables $x$ and $y$ into a product of factors in the form $p(x, y) = p(x)\, p(y \mid x)$. This can be represented by the directed graph shown in Figure 8.37(a). Now suppose we observe the value of $y$, as indicated by the shaded node in Figure 8.37(b). We can view the marginal distribution $p(x)$ as a prior over the latent variable $x$, and our goal is to infer the corresponding posterior distribution over $x$. Using the sum and product rules of probability we can evaluate

$$ p(y) = \sum_{x'} p(y \mid x')\, p(x') \qquad (8.47) $$

which can then be used in Bayes' theorem to calculate

$$ p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}. \qquad (8.48) $$

Thus the joint distribution is now expressed in terms of $p(y)$ and $p(x \mid y)$. From a graphical perspective, the joint distribution $p(x, y)$ is now represented by the graph shown in Figure 8.37(c), in which the direction of the arrow is reversed. This is the simplest example of an inference problem for a graphical model.
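For discrete variables, (8.47) and (8.48) are two lines of array arithmetic. In the sketch below the probability tables are invented for illustration; the final check confirms that the reversed factorization $p(y)\,p(x \mid y)$ reproduces the same joint distribution.

```python
import numpy as np

px = np.array([0.6, 0.4])                     # p(x)
py_x = np.array([[0.7, 0.3],                  # p(y | x): rows indexed by x
                 [0.2, 0.8]])

py = px @ py_x                                # (8.47): p(y) = sum_x p(y|x) p(x)
px_y = (py_x * px[:, None]) / py              # (8.48): columns are p(x | y)

# The reversed graph of Figure 8.37(c) represents the same joint distribution:
print(np.allclose(px[:, None] * py_x, py * px_y))   # True
```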

The joint distribution for this graph takes the form

$$p(\mathbf{x}) = \frac{1}{Z}\, \psi_{1,2}(x_1, x_2)\, \psi_{2,3}(x_2, x_3) \cdots \psi_{N-1,N}(x_{N-1}, x_N). \tag{8.49}$$

We shall consider the specific case in which the $N$ nodes represent discrete variables each having $K$ states, in which case each potential function $\psi_{n-1,n}(x_{n-1}, x_n)$ comprises a $K \times K$ table, and so the joint distribution has $(N-1)K^2$ parameters.

Let us consider the inference problem of finding the marginal distribution $p(x_n)$ for a specific node $x_n$ that is part way along the chain. Note that, for the moment, there are no observed nodes. By definition, the required marginal is obtained by summing the joint distribution over all variables except $x_n$, so that

$$p(x_n) = \sum_{x_1} \cdots \sum_{x_{n-1}} \sum_{x_{n+1}} \cdots \sum_{x_N} p(\mathbf{x}). \tag{8.50}$$

In a naive implementation, we would first evaluate the joint distribution and then perform the summations explicitly. The joint distribution can be represented as a set of numbers, one for each possible value of $\mathbf{x}$. Because there are $N$ variables each with $K$ states, there are $K^N$ values for $\mathbf{x}$, and so evaluation and storage of the joint distribution, as well as marginalization to obtain $p(x_n)$, all involve storage and computation that scale exponentially with the length $N$ of the chain.

We can, however, obtain a much more efficient algorithm by exploiting the conditional independence properties of the graphical model. If we substitute the factorized expression (8.49) for the joint distribution into (8.50), then we can rearrange the order of the summations and the multiplications to allow the required marginal to be evaluated much more efficiently. Consider for instance the summation over $x_N$. The potential $\psi_{N-1,N}(x_{N-1}, x_N)$ is the only one that depends on $x_N$, and so we can perform the summation

$$\sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N) \tag{8.51}$$

first to give a function of $x_{N-1}$. We can then use this to perform the summation over $x_{N-1}$, which will involve only this new function together with the potential $\psi_{N-2,N-1}(x_{N-2}, x_{N-1})$, because this is the only other place that $x_{N-1}$ appears. Similarly, the summation over $x_1$ involves only the potential $\psi_{1,2}(x_1, x_2)$ and so can be performed separately to give a function of $x_2$, and so on. Because each summation effectively removes a variable from the distribution, this can be viewed as the removal of a node from the graph.
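For concreteness, here is what the naive evaluation of (8.50) looks like in code: the sum runs over all $K^N$ joint configurations. The chain length, the number of states, and the random potential tables are illustrative choices only.

```python
import itertools
import numpy as np

# Naive evaluation of (8.50): enumerate all K**N joint configurations.
# Chain length, state count, and potential tables are illustrative.
rng = np.random.default_rng(0)
N, K = 6, 3
psi = [rng.random((K, K)) for _ in range(N - 1)]   # psi[i] = psi_{i+1,i+2}

n = 2                                              # query node (zero-based)
p_xn = np.zeros(K)
for x in itertools.product(range(K), repeat=N):    # K**N = 729 terms here
    weight = 1.0
    for i in range(N - 1):
        weight *= psi[i][x[i], x[i + 1]]
    p_xn[x[n]] += weight
p_xn /= p_xn.sum()                                 # normalizing supplies 1/Z
```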

If we group the potentials and summations together in this way, we can express the desired marginal in the form

$$p(x_n) = \frac{1}{Z}\;
\underbrace{\left[ \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \cdots \left[ \sum_{x_2} \psi_{2,3}(x_2, x_3) \left[ \sum_{x_1} \psi_{1,2}(x_1, x_2) \right] \right] \cdots \right]}_{\mu_\alpha(x_n)}$$
$$\times\;
\underbrace{\left[ \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \cdots \left[ \sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N) \right] \cdots \right]}_{\mu_\beta(x_n)}. \tag{8.52}$$

The reader is encouraged to study this re-ordering carefully as the underlying idea forms the basis for the later discussion of the general sum-product algorithm. Here the key concept that we are exploiting is that multiplication is distributive over addition, so that

$$ab + ac = a(b + c) \tag{8.53}$$

in which the left-hand side involves three arithmetic operations whereas the right-hand side reduces this to two operations.

Let us work out the computational cost of evaluating the required marginal using this re-ordered expression. We have to perform $N - 1$ summations, each of which is over $K$ states and each of which involves a function of two variables. For instance, the summation over $x_1$ involves only the function $\psi_{1,2}(x_1, x_2)$, which is a table of $K \times K$ numbers. We have to sum this table over $x_1$ for each value of $x_2$, and so this has $O(K^2)$ cost. The resulting vector of $K$ numbers is multiplied by the matrix of numbers $\psi_{2,3}(x_2, x_3)$, and so is again $O(K^2)$. Because there are $N - 1$ summations and multiplications of this kind, the total cost of evaluating the marginal $p(x_n)$ is $O(NK^2)$. This is linear in the length of the chain, in contrast to the exponential cost of a naive approach. We have therefore been able to exploit the many conditional independence properties of this simple graph in order to obtain an efficient calculation. If the graph had been fully connected, there would have been no conditional independence properties, and we would have been forced to work directly with the full joint distribution.

We now give a powerful interpretation of this calculation in terms of the passing of local messages around on the graph. From (8.52) we see that the expression for the marginal $p(x_n)$ decomposes into the product of two factors times the normalization constant

$$p(x_n) = \frac{1}{Z}\, \mu_\alpha(x_n)\, \mu_\beta(x_n). \tag{8.54}$$

We shall interpret $\mu_\alpha(x_n)$ as a message passed forwards along the chain from node $x_{n-1}$ to node $x_n$. Similarly, $\mu_\beta(x_n)$ can be viewed as a message passed backwards along the chain to node $x_n$ from node $x_{n+1}$.

Figure 8.38 The marginal distribution $p(x_n)$ for a node $x_n$ along the chain is obtained by multiplying the two messages $\mu_\alpha(x_n)$ and $\mu_\beta(x_n)$, and then normalizing. These messages can themselves be evaluated recursively by passing messages from both ends of the chain towards node $x_n$.

Note that each of the messages comprises a set of $K$ values, one for each choice of $x_n$, and so the product of two messages should be interpreted as the point-wise multiplication of the elements of the two messages to give another set of $K$ values.

The message $\mu_\alpha(x_n)$ can be evaluated recursively because

$$\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \left[ \sum_{x_{n-2}} \cdots \right] = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\alpha(x_{n-1}). \tag{8.55}$$

We therefore first evaluate

$$\mu_\alpha(x_2) = \sum_{x_1} \psi_{1,2}(x_1, x_2) \tag{8.56}$$

and then apply (8.55) repeatedly until we reach the desired node. Note carefully the structure of the message passing equation. The outgoing message $\mu_\alpha(x_n)$ in (8.55) is obtained by multiplying the incoming message $\mu_\alpha(x_{n-1})$ by the local potential involving the node variable and the outgoing variable and then summing over the node variable.

Similarly, the message $\mu_\beta(x_n)$ can be evaluated recursively by starting with node $x_N$ and using

$$\mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \left[ \sum_{x_{n+2}} \cdots \right] = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\, \mu_\beta(x_{n+1}). \tag{8.57}$$

This recursive message passing is illustrated in Figure 8.38. The normalization constant $Z$ is easily evaluated by summing the right-hand side of (8.54) over all states of $x_n$, an operation that requires only $O(K)$ computation.

Graphs of the form shown in Figure 8.38 are called Markov chains, and the corresponding message passing equations represent an example of the Chapman-Kolmogorov equations for Markov processes (Papoulis, 1984).
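The recursions (8.55)-(8.57) amount to nothing more than one matrix-vector product per link, giving the $O(NK^2)$ cost derived above. The sketch below uses the same illustrative random potentials and query node as the naive version earlier, so the two answers can be compared directly.

```python
import numpy as np

# The recursions (8.55)-(8.57): one matrix-vector product per link, so
# O(N K^2) overall. Same illustrative tables and query node as the naive
# sketch above, so the two answers can be compared directly.
rng = np.random.default_rng(0)
N, K = 6, 3
psi = [rng.random((K, K)) for _ in range(N - 1)]

n = 2                                     # query node (zero-based)
mu_alpha = np.ones(K)                     # plays the role of mu_a at node 0
for i in range(n):
    mu_alpha = psi[i].T @ mu_alpha        # (8.55)
mu_beta = np.ones(K)                      # mu_b at the far end of the chain
for i in range(N - 2, n - 1, -1):
    mu_beta = psi[i] @ mu_beta            # (8.57)

p_xn = mu_alpha * mu_beta                 # (8.54), up to 1/Z
p_xn /= p_xn.sum()
```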

Now suppose we wish to evaluate the marginals $p(x_n)$ for every node $n \in \{1, \ldots, N\}$ in the chain. Simply applying the above procedure separately for each node will have computational cost that is $O(N^2 K^2)$. However, such an approach would be very wasteful of computation. For instance, to find $p(x_1)$ we need to propagate a message $\mu_\beta(\cdot)$ from node $x_N$ back to node $x_2$. Similarly, to evaluate $p(x_2)$ we need to propagate a message $\mu_\beta(\cdot)$ from node $x_N$ back to node $x_3$. This will involve much duplicated computation because most of the messages will be identical in the two cases.

Suppose instead we first launch a message $\mu_\beta(x_{N-1})$ starting from node $x_N$ and propagate corresponding messages all the way back to node $x_1$, and suppose we similarly launch a message $\mu_\alpha(x_2)$ starting from node $x_1$ and propagate the corresponding messages all the way forward to node $x_N$. Provided we store all of the intermediate messages along the way, then any node can evaluate its marginal simply by applying (8.54). The computational cost is only twice that for finding the marginal of a single node, rather than $N$ times as much. Observe that a message has passed once in each direction across each link in the graph. Note also that the normalization constant $Z$ need be evaluated only once, using any convenient node.

If some of the nodes in the graph are observed, then the corresponding variables are simply clamped to their observed values and there is no summation. To see this, note that the effect of clamping a variable $x_n$ to an observed value $\widehat{x}_n$ can be expressed by multiplying the joint distribution by (one or more copies of) an additional function $I(x_n, \widehat{x}_n)$, which takes the value 1 when $x_n = \widehat{x}_n$ and the value 0 otherwise. One such function can then be absorbed into each of the potentials that contain $x_n$. Summations over $x_n$ then contain only one term, in which $x_n = \widehat{x}_n$.

Now suppose we wish to calculate the joint distribution $p(x_{n-1}, x_n)$ for two neighbouring nodes on the chain. This is similar to the evaluation of the marginal for a single node, except that there are now two variables that are not summed out. A few moments thought will show that the required joint distribution can be written (Exercise 8.15) in the form

$$p(x_{n-1}, x_n) = \frac{1}{Z}\, \mu_\alpha(x_{n-1})\, \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\beta(x_n). \tag{8.58}$$

Thus we can obtain the joint distributions over all of the sets of variables in each of the potentials directly once we have completed the message passing required to obtain the marginals.

This is a useful result because in practice we may wish to use parametric forms for the clique potentials, or equivalently for the conditional distributions if we started from a directed graph. In order to learn the parameters of these potentials in situations where not all of the variables are observed, we can employ the EM algorithm (Chapter 9), and it turns out that the local joint distributions of the cliques, conditioned on any observed data, are precisely what is needed in the E step. We shall consider some examples of this in detail in Chapter 13.
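A sketch of this two-sweep procedure, with observed nodes handled by the indicator-function trick just described, might look as follows. The tables and the choice of observed node are again illustrative assumptions of ours.

```python
import numpy as np

# Two sweeps with stored messages give every marginal at twice the cost
# of one. Observed nodes are clamped with the indicator trick described
# above. The tables and the choice of observed node are illustrative.
rng = np.random.default_rng(0)
N, K = 6, 3
psi = [rng.random((K, K)) for _ in range(N - 1)]
observed = {4: 1}                          # clamp node 4 to state 1 (assumed)

def clamp(vec, node):
    # Multiply by I(x_n, x_hat_n): zero out all but the observed state.
    if node in observed:
        out = np.zeros_like(vec)
        out[observed[node]] = vec[observed[node]]
        return out
    return vec

alpha = [clamp(np.ones(K), 0)]             # forward sweep, storing messages
for i in range(N - 1):
    alpha.append(clamp(psi[i].T @ alpha[-1], i + 1))

beta = [clamp(np.ones(K), N - 1)]          # backward sweep
for i in range(N - 2, -1, -1):
    beta.insert(0, clamp(psi[i] @ beta[0], i))

marginals = [a * b for a, b in zip(alpha, beta)]   # (8.54), unnormalized
Z = marginals[0].sum()                     # any node yields the same Z
marginals = [m / Z for m in marginals]
```

Note that summing the unnormalized product at any single node recovers the same constant, just as the text observes.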

8.4.2 Trees

Figure 8.39 Examples of tree-structured graphs, showing (a) an undirected tree, (b) a directed tree, and (c) a directed polytree.

We have seen that exact inference on a graph comprising a chain of nodes can be performed efficiently in time that is linear in the number of nodes, using an algorithm that can be interpreted in terms of messages passed along the chain. More generally, inference can be performed efficiently using local message passing on a broader class of graphs called trees. In particular, we shall shortly generalize the message passing formalism derived above for chains to give the sum-product algorithm, which provides an efficient framework for exact inference in tree-structured graphs.

In the case of an undirected graph, a tree is defined as a graph in which there is one, and only one, path between any pair of nodes. Such graphs therefore do not have loops. In the case of directed graphs, a tree is defined such that there is a single node, called the root, which has no parents, and all other nodes have one parent. If we convert a directed tree into an undirected graph, we see that the moralization step will not add any links, as all nodes have at most one parent, and as a consequence the corresponding moralized graph will be an undirected tree. Examples of undirected and directed trees are shown in Figure 8.39(a) and 8.39(b). Note that a distribution represented as a directed tree can easily be converted into one represented by an undirected tree, and vice versa (Exercise 8.18).

If there are nodes in a directed graph that have more than one parent, but there is still only one path (ignoring the direction of the arrows) between any two nodes, then the graph is called a polytree, as illustrated in Figure 8.39(c). Such a graph will have more than one node with the property of having no parents, and furthermore, the corresponding moralized undirected graph will have loops.

8.4.3 Factor graphs

The sum-product algorithm that we derive in the next section is applicable to undirected and directed trees and to polytrees. It can be cast in a particularly simple and general form if we first introduce a new graphical construction called a factor graph (Frey, 1998; Kschischang et al., 2001).

Both directed and undirected graphs allow a global function of several variables to be expressed as a product of factors over subsets of those variables. Factor graphs make this decomposition explicit by introducing additional nodes for the factors themselves in addition to the nodes representing the variables. They also allow us to be more explicit about the details of the factorization, as we shall see.

Let us write the joint distribution over a set of variables in the form of a product of factors

$$p(\mathbf{x}) = \prod_s f_s(\mathbf{x}_s) \tag{8.59}$$

where $\mathbf{x}_s$ denotes a subset of the variables.

Figure 8.40 Example of a factor graph, which corresponds to the factorization (8.60).

For convenience, we shall denote the individual variables by $x_i$; as in earlier discussions, however, these can comprise groups of variables (such as vectors or matrices). Each factor $f_s$ is a function of a corresponding set of variables $\mathbf{x}_s$.

Directed graphs, whose factorization is defined by (8.5), represent special cases of (8.59) in which the factors $f_s(\mathbf{x}_s)$ are local conditional distributions. Similarly, undirected graphs, given by (8.39), are a special case in which the factors are potential functions over the maximal cliques (the normalizing coefficient $1/Z$ can be viewed as a factor defined over the empty set of variables).

In a factor graph, there is a node (depicted as usual by a circle) for every variable in the distribution, as was the case for directed and undirected graphs. There are also additional nodes (depicted by small squares) for each factor $f_s(\mathbf{x}_s)$ in the joint distribution. Finally, there are undirected links connecting each factor node to all of the variable nodes on which that factor depends. Consider, for example, a distribution that is expressed in terms of the factorization

$$p(\mathbf{x}) = f_a(x_1, x_2)\, f_b(x_1, x_2)\, f_c(x_2, x_3)\, f_d(x_3). \tag{8.60}$$

This can be expressed by the factor graph shown in Figure 8.40. Note that there are two factors $f_a(x_1, x_2)$ and $f_b(x_1, x_2)$ that are defined over the same set of variables. In an undirected graph, the product of two such factors would simply be lumped together into the same clique potential. Similarly, $f_c(x_2, x_3)$ and $f_d(x_3)$ could be combined into a single potential over $x_2$ and $x_3$. The factor graph, however, keeps such factors explicit and so is able to convey more detailed information about the underlying factorization.

Figure 8.41 (a) An undirected graph with a single clique potential $\psi(x_1, x_2, x_3)$. (b) A factor graph with factor $f(x_1, x_2, x_3) = \psi(x_1, x_2, x_3)$ representing the same distribution as the undirected graph. (c) A different factor graph representing the same distribution, whose factors satisfy $f_a(x_1, x_2, x_3)\, f_b(x_2, x_3) = \psi(x_1, x_2, x_3)$.
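As an illustration, a factor graph such as that of (8.60) can be represented in code simply by recording, for each factor, which variables it touches together with a table of values. The class layout, the zero-based variable indices, and the random tables below are our own conveniences, not notation from the text.

```python
import itertools
import numpy as np

# The factor graph of Figure 8.40 / (8.60) encoded directly: each factor
# records which variables it touches and a table of values. The class,
# zero-based indices, and random tables are our own conveniences.
class Factor:
    def __init__(self, variables, table):
        self.variables = variables            # indices of the subset x_s
        self.table = np.asarray(table)        # one axis per variable

K = 2
rng = np.random.default_rng(0)
factors = [
    Factor((0, 1), rng.random((K, K))),       # f_a(x_1, x_2)
    Factor((0, 1), rng.random((K, K))),       # f_b(x_1, x_2)
    Factor((1, 2), rng.random((K, K))),       # f_c(x_2, x_3)
    Factor((2,),   rng.random(K)),            # f_d(x_3)
]

def joint(assignment):
    # (8.59)-(8.60): the product of all factor values, up to 1/Z.
    value = 1.0
    for f in factors:
        value *= f.table[tuple(assignment[v] for v in f.variables)]
    return value

Z = sum(joint(x) for x in itertools.product(range(K), repeat=3))
```

Note that $f_a$ and $f_b$ are kept as separate entries even though they touch the same variables, which is exactly the extra information a factor graph preserves over an undirected graph.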

Figure 8.42 (a) A directed graph with the factorization $p(x_1)\,p(x_2)\,p(x_3 \mid x_1, x_2)$. (b) A factor graph representing the same distribution as the directed graph, whose factor satisfies $f(x_1, x_2, x_3) = p(x_1)\,p(x_2)\,p(x_3 \mid x_1, x_2)$. (c) A different factor graph representing the same distribution with factors $f_a(x_1) = p(x_1)$, $f_b(x_2) = p(x_2)$, and $f_c(x_1, x_2, x_3) = p(x_3 \mid x_1, x_2)$.

Factor graphs are said to be bipartite because they consist of two distinct kinds of nodes, and all links go between nodes of opposite type. In general, factor graphs can therefore always be drawn as two rows of nodes (variable nodes at the top and factor nodes at the bottom) with links between the rows, as shown in the example in Figure 8.40. In some situations, however, other ways of laying out the graph may be more intuitive, for example when the factor graph is derived from a directed or undirected graph, as we shall see.

If we are given a distribution that is expressed in terms of an undirected graph, then we can readily convert it to a factor graph. To do this, we create variable nodes corresponding to the nodes in the original undirected graph, and then create additional factor nodes corresponding to the maximal cliques $\mathbf{x}_s$. The factors $f_s(\mathbf{x}_s)$ are then set equal to the clique potentials. Note that there may be several different factor graphs that correspond to the same undirected graph. These concepts are illustrated in Figure 8.41.

Similarly, to convert a directed graph to a factor graph, we simply create variable nodes in the factor graph corresponding to the nodes of the directed graph, then create factor nodes corresponding to the conditional distributions, and then finally add the appropriate links. Again, there can be multiple factor graphs all of which correspond to the same directed graph. The conversion of a directed graph to a factor graph is illustrated in Figure 8.42.

We have already noted the importance of tree-structured graphs for performing efficient inference. If we take a directed or undirected tree and convert it into a factor graph, then the result will again be a tree (in other words, the factor graph will have no loops, and there will be one and only one path connecting any two nodes). In the case of a directed polytree, conversion to an undirected graph results in loops due to the moralization step, whereas conversion to a factor graph again results in a tree, as illustrated in Figure 8.43. In fact, local cycles in a directed graph due to links connecting parents of a node can be removed on conversion to a factor graph by defining the appropriate factor function, as shown in Figure 8.44.

We have seen that multiple different factor graphs can represent the same directed or undirected graph. This allows factor graphs to be more specific about the precise form of the factorization.

Figure 8.43 (a) A directed polytree. (b) The result of converting the polytree into an undirected graph showing the creation of loops. (c) The result of converting the polytree into a factor graph, which retains the tree structure.

Figure 8.44 (a) A fragment of a directed graph having a local cycle. (b) Conversion to a fragment of a factor graph having a tree structure, in which $f(x_1, x_2, x_3) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)$.

Figure 8.45 shows an example of a fully connected undirected graph along with two different factor graphs. In (b), the joint distribution is given by a general form $p(\mathbf{x}) = f(x_1, x_2, x_3)$, whereas in (c), it is given by the more specific factorization $p(\mathbf{x}) = f_a(x_1, x_2)\, f_b(x_1, x_3)\, f_c(x_2, x_3)$. It should be emphasized that the factorization in (c) does not correspond to any conditional independence properties.

8.4.4 The sum-product algorithm

We shall now make use of the factor graph framework to derive a powerful class of efficient, exact inference algorithms that are applicable to tree-structured graphs. Here we shall focus on the problem of evaluating local marginals over nodes or subsets of nodes, which will lead us to the sum-product algorithm. Later we shall modify the technique to allow the most probable state to be found, giving rise to the max-sum algorithm.

Also we shall suppose that all of the variables in the model are discrete, and so marginalization corresponds to performing sums. The framework, however, is equally applicable to linear-Gaussian models, in which case marginalization involves integration, and we shall consider an example of this in detail when we discuss linear dynamical systems (Section 13.3).

Figure 8.45 (a) A fully connected undirected graph. (b) and (c) Two factor graphs each of which corresponds to the undirected graph in (a).

There is an algorithm for exact inference on directed graphs without loops known as belief propagation (Pearl, 1988; Lauritzen and Spiegelhalter, 1988), which is equivalent to a special case of the sum-product algorithm. Here we shall consider only the sum-product algorithm because it is simpler to derive and to apply, as well as being more general.

We shall assume that the original graph is an undirected tree or a directed tree or polytree, so that the corresponding factor graph has a tree structure. We first convert the original graph into a factor graph so that we can deal with both directed and undirected models using the same framework. Our goal is to exploit the structure of the graph to achieve two things: (i) to obtain an efficient, exact inference algorithm for finding marginals; (ii) in situations where several marginals are required, to allow computations to be shared efficiently.

We begin by considering the problem of finding the marginal $p(x)$ for a particular variable node $x$. For the moment, we shall suppose that all of the variables are hidden. Later we shall see how to modify the algorithm to incorporate evidence corresponding to observed variables. By definition, the marginal is obtained by summing the joint distribution over all variables except $x$, so that

$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}) \tag{8.61}$$

where $\mathbf{x} \setminus x$ denotes the set of variables in $\mathbf{x}$ with variable $x$ omitted. The idea is to substitute for $p(\mathbf{x})$ using the factor graph expression (8.59) and then interchange summations and products in order to obtain an efficient algorithm. Consider the fragment of graph shown in Figure 8.46, in which we see that the tree structure of the graph allows us to partition the factors in the joint distribution into groups, with one group associated with each of the factor nodes that is a neighbour of the variable node $x$. We see that the joint distribution can be written as a product of the form

$$p(\mathbf{x}) = \prod_{s \in \text{ne}(x)} F_s(x, X_s) \tag{8.62}$$

where $\text{ne}(x)$ denotes the set of factor nodes that are neighbours of $x$, $X_s$ denotes the set of all variables in the subtree connected to the variable node $x$ via the factor node $f_s$, and $F_s(x, X_s)$ represents the product of all the factors in the group associated with factor $f_s$.

Figure 8.46 A fragment of a factor graph illustrating the evaluation of the marginal $p(x)$.

Substituting (8.62) into (8.61) and interchanging the sums and products, we obtain

$$p(x) = \prod_{s \in \text{ne}(x)} \left[ \sum_{X_s} F_s(x, X_s) \right] = \prod_{s \in \text{ne}(x)} \mu_{f_s \to x}(x). \tag{8.63}$$

Here we have introduced a set of functions $\mu_{f_s \to x}(x)$, defined by

$$\mu_{f_s \to x}(x) \equiv \sum_{X_s} F_s(x, X_s) \tag{8.64}$$

which can be viewed as messages from the factor nodes $f_s$ to the variable node $x$. We see that the required marginal $p(x)$ is given by the product of all the incoming messages arriving at node $x$.

In order to evaluate these messages, we again turn to Figure 8.46 and note that each factor $F_s(x, X_s)$ is described by a factor (sub-)graph and so can itself be factorized. In particular, we can write

$$F_s(x, X_s) = f_s(x, x_1, \ldots, x_M)\, G_1(x_1, X_{s1}) \cdots G_M(x_M, X_{sM}) \tag{8.65}$$

where, for convenience, we have denoted the variables associated with factor $f_s$, in addition to $x$, by $x_1, \ldots, x_M$. This factorization is illustrated in Figure 8.47. Note that the set of variables $\{x, x_1, \ldots, x_M\}$ is the set of variables on which the factor $f_s$ depends, and so it can also be denoted $\mathbf{x}_s$, using the notation of (8.59).

Substituting (8.65) into (8.64) we obtain

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \text{ne}(f_s) \setminus x} \left[ \sum_{X_{sm}} G_m(x_m, X_{sm}) \right]
= \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \text{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m) \tag{8.66}$$

Figure 8.47 Illustration of the factorization of the subgraph associated with factor node $f_s$.

where $\text{ne}(f_s)$ denotes the set of variable nodes that are neighbours of the factor node $f_s$, and $\text{ne}(f_s) \setminus x$ denotes the same set but with node $x$ removed. Here we have defined the following messages from variable nodes to factor nodes

$$\mu_{x_m \to f_s}(x_m) \equiv \sum_{X_{sm}} G_m(x_m, X_{sm}). \tag{8.67}$$

We have therefore introduced two distinct kinds of message, those that go from factor nodes to variable nodes denoted $\mu_{f \to x}(x)$, and those that go from variable nodes to factor nodes denoted $\mu_{x \to f}(x)$. In each case, we see that messages passed along a link are always a function of the variable associated with the variable node that link connects to.

The result (8.66) says that to evaluate the message sent by a factor node to a variable node along the link connecting them, take the product of the incoming messages along all other links coming into the factor node, multiply by the factor associated with that node, and then marginalize over all of the variables associated with the incoming messages. This is illustrated in Figure 8.47. It is important to note that a factor node can send a message to a variable node once it has received incoming messages from all other neighbouring variable nodes.

Finally, we derive an expression for evaluating the messages from variable nodes to factor nodes, again by making use of the (sub-)graph factorization. From Figure 8.48, we see that the term $G_m(x_m, X_{sm})$ associated with node $x_m$ is given by a product of terms $F_l(x_m, X_{ml})$ each associated with one of the factor nodes $f_l$ linked to node $x_m$ (excluding node $f_s$), so that

$$G_m(x_m, X_{sm}) = \prod_{l \in \text{ne}(x_m) \setminus f_s} F_l(x_m, X_{ml}) \tag{8.68}$$

where the product is taken over all neighbours of node $x_m$ except for node $f_s$. Note that each of the factors $F_l(x_m, X_{ml})$ represents a subtree of the original graph of precisely the same kind as introduced in (8.62).

Figure 8.48 Illustration of the evaluation of the message sent by a variable node to an adjacent factor node.

Substituting (8.68) into (8.67), we then obtain

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \text{ne}(x_m) \setminus f_s} \left[ \sum_{X_{ml}} F_l(x_m, X_{ml}) \right] = \prod_{l \in \text{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m) \tag{8.69}$$

where we have used the definition (8.64) of the messages passed from factor nodes to variable nodes. Thus to evaluate the message sent by a variable node to an adjacent factor node along the connecting link, we simply take the product of the incoming messages along all of the other links. Note that any variable node that has only two neighbours performs no computation but simply passes messages through unchanged. Also, we note that a variable node can send a message to a factor node once it has received incoming messages from all other neighbouring factor nodes.

Recall that our goal is to calculate the marginal for variable node $x$, and that this marginal is given by the product of incoming messages along all of the links arriving at that node. Each of these messages can be computed recursively in terms of other messages. In order to start this recursion, we can view the node $x$ as the root of the tree and begin at the leaf nodes. From the definition (8.69), we see that if a leaf node is a variable node, then the message that it sends along its one and only link is given by

$$\mu_{x \to f}(x) = 1 \tag{8.70}$$

as illustrated in Figure 8.49(a). Similarly, if the leaf node is a factor node, we see from (8.66) that the message sent should take the form

$$\mu_{f \to x}(x) = f(x) \tag{8.71}$$

as illustrated in Figure 8.49(b).

Figure 8.49 The sum-product algorithm begins with messages sent by the leaf nodes, which depend on whether the leaf node is (a) a variable node, or (b) a factor node.
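The two message equations (8.66) and (8.69), together with the leaf rules (8.70) and (8.71), translate almost line by line into a recursive implementation. The sketch below encodes a factor graph as a list of (variables, table) pairs, which is our own convention, and relies on the graph being a tree so that the recursion terminates; messages are recomputed rather than stored, a point taken up shortly.

```python
import itertools
import numpy as np

def msg_v2f(x, f, factors, K):
    # (8.69): product over x's other neighbouring factors; an empty
    # product gives the leaf rule (8.70).
    out = np.ones(K)
    for g, (vars_g, _) in enumerate(factors):
        if g != f and x in vars_g:
            out = out * msg_f2v(g, x, factors, K)
    return out

def msg_f2v(f, x, factors, K):
    # (8.66): multiply the incoming messages into the factor table,
    # then sum out every variable other than x. With no other
    # neighbours this reduces to the leaf rule (8.71).
    vars_f, table = factors[f]
    incoming = {v: msg_v2f(v, f, factors, K) for v in vars_f if v != x}
    out = np.zeros(K)
    for idx in itertools.product(range(K), repeat=len(vars_f)):
        w = table[idx]
        for pos, v in enumerate(vars_f):
            if v != x:
                w = w * incoming[v][idx[pos]]
        out[idx[vars_f.index(x)]] += w
    return out

def marginal(x, factors, K):
    # (8.63): the product of all incoming factor messages, normalized.
    p = np.ones(K)
    for f, (vars_f, _) in enumerate(factors):
        if x in vars_f:
            p = p * msg_f2v(f, x, factors, K)
    return p / p.sum()

# Example: the chain x0 - x1 - x2 with pairwise factors (illustrative).
K = 3
rng = np.random.default_rng(0)
factors = [((0, 1), rng.random((K, K))), ((1, 2), rng.random((K, K)))]
print(marginal(1, factors, K))
```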

At this point, it is worth pausing to summarize the particular version of the sum-product algorithm obtained so far for evaluating the marginal $p(x)$. We start by viewing the variable node $x$ as the root of the factor graph and initiating messages at the leaves of the graph using (8.70) and (8.71). The message passing steps (8.66) and (8.69) are then applied recursively until messages have been propagated along every link, and the root node has received messages from all of its neighbours. Each node can send a message towards the root once it has received messages from all of its other neighbours. Once the root node has received messages from all of its neighbours, the required marginal can be evaluated using (8.63). We shall illustrate this process shortly.

To see that each node will always receive enough messages to be able to send out a message, we can use a simple inductive argument as follows. Clearly, for a graph comprising a variable root node connected directly to several factor leaf nodes, the algorithm trivially involves sending messages of the form (8.71) directly from the leaves to the root. Now imagine building up a general graph by adding nodes one at a time, and suppose that for some particular graph we have a valid algorithm. When one more (variable or factor) node is added, it can be connected only by a single link because the overall graph must remain a tree, and so the new node will be a leaf node. It therefore sends a message to the node to which it is linked, which in turn will therefore receive all the messages it requires in order to send its own message towards the root, and so again we have a valid algorithm, thereby completing the proof.

Now suppose we wish to find the marginals for every variable node in the graph. This could be done by simply running the above algorithm afresh for each such node. However, this would be very wasteful as many of the required computations would be repeated. We can obtain a much more efficient procedure by 'overlaying' these multiple message passing algorithms to obtain the general sum-product algorithm as follows. Arbitrarily pick any (variable or factor) node and designate it as the root. Propagate messages from the leaves to the root as before. At this point, the root node will have received messages from all of its neighbours. It can therefore send out messages to all of its neighbours. These in turn will then have received messages from all of their neighbours and so can send out messages along the links going away from the root, and so on. In this way, messages are passed outwards from the root all the way to the leaves. By now, a message will have passed in both directions across every link in the graph, and every node will have received a message from all of its neighbours. Again a simple inductive argument can be used to verify the validity of this message passing protocol (Exercise 8.20). Because every variable node will have received messages from all of its neighbours, we can readily calculate the marginal distribution for every variable in the graph. The number of messages that have to be computed is given by twice the number of links in the graph and so involves only twice the computation involved in finding a single marginal. By comparison, if we had run the sum-product algorithm separately for each node, the amount of computation would grow quadratically with the size of the graph.
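The protocol just described can be made explicit by firing a directed edge as soon as its source node has heard from all of its other neighbours. The sketch below does this for an undirected tree with pairwise potentials rather than a full factor graph, purely to keep the bookkeeping short; the tree and its tables are illustrative assumptions.

```python
import numpy as np

# An explicit schedule: a directed edge fires once its source has heard
# from all of its other neighbours. For brevity this uses an undirected
# tree with pairwise potentials instead of a full factor graph; the
# edges and tables are illustrative.
K = 3
rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (1, 3)]                 # a small tree (assumed)
psi = {e: rng.random((K, K)) for e in edges}     # psi[(a, b)][x_a, x_b]

def neighbours(n):
    return [b for a, b in edges if a == n] + [a for a, b in edges if b == n]

msgs = {}
todo = [(a, b) for e in edges for a, b in (e, e[::-1])]
while todo:
    for a, b in list(todo):
        if all((c, a) in msgs for c in neighbours(a) if c != b):
            table = psi[(a, b)] if (a, b) in psi else psi[(b, a)].T
            m = np.ones(K)                       # product of other incoming
            for c in neighbours(a):
                if c != b:
                    m = m * msgs[(c, a)]
            msgs[(a, b)] = table.T @ m           # sum over the source x_a
            todo.remove((a, b))

marginals = {n: np.ones(K) for n in range(4)}
for n in marginals:
    for c in neighbours(n):
        marginals[n] *= msgs[(c, n)]
    marginals[n] /= marginals[n].sum()
```

Exactly one message passes in each direction along each of the three links, for six messages in total, matching the counting argument above.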
Note that this algorithm is in fact independent of which node was designated as the root, and indeed the notion of one node having a special status was introduced only as a convenient way to explain the message passing protocol.

Figure 8.50 The sum-product algorithm can be viewed purely in terms of messages sent out by factor nodes to other factor nodes. In this example, the outgoing message shown by the blue arrow is obtained by taking the product of all the incoming messages shown by green arrows, multiplying by the factor $f_s$, and marginalizing over the variables $x_1$ and $x_2$.

Next suppose we wish to find the marginal distributions $p(\mathbf{x}_s)$ associated with the sets of variables belonging to each of the factors. By a similar argument to that used above, it is easy to see (Exercise 8.21) that the marginal associated with a factor is given by the product of messages arriving at the factor node and the local factor at that node

$$p(\mathbf{x}_s) = f_s(\mathbf{x}_s) \prod_{i \in \text{ne}(f_s)} \mu_{x_i \to f_s}(x_i) \tag{8.72}$$

in complete analogy with the marginals at the variable nodes. If the factors are parameterized functions and we wish to learn the values of the parameters using the EM algorithm, then these marginals are precisely the quantities we will need to calculate in the E step, as we shall see in detail when we discuss the hidden Markov model in Chapter 13.

The message sent by a variable node to a factor node, as we have seen, is simply the product of the incoming messages on other links. We can if we wish view the sum-product algorithm in a slightly different form by eliminating messages from variable nodes to factor nodes and simply considering messages that are sent out by factor nodes. This is most easily seen by considering the example in Figure 8.50.

So far, we have rather neglected the issue of normalization. If the factor graph was derived from a directed graph, then the joint distribution is already correctly normalized, and so the marginals obtained by the sum-product algorithm will similarly be normalized correctly. However, if we started from an undirected graph, then in general there will be an unknown normalization coefficient $1/Z$. As with the simple chain example of Figure 8.38, this is easily handled by working with an unnormalized version $\widetilde{p}(\mathbf{x})$ of the joint distribution, where $p(\mathbf{x}) = \widetilde{p}(\mathbf{x})/Z$. We first run the sum-product algorithm to find the corresponding unnormalized marginals $\widetilde{p}(x_i)$. The coefficient $1/Z$ is then easily obtained by normalizing any one of these marginals, and this is computationally efficient because the normalization is done over a single variable rather than over the entire set of variables, as would be required to normalize $\widetilde{p}(\mathbf{x})$ directly.

At this point, it may be helpful to consider a simple example to illustrate the operation of the sum-product algorithm.

Figure 8.51 A simple factor graph used to illustrate the sum-product algorithm.

Figure 8.51 shows a simple 4-node factor graph whose unnormalized joint distribution is given by

$$\widetilde{p}(\mathbf{x}) = f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4). \tag{8.73}$$

In order to apply the sum-product algorithm to this graph, let us designate node $x_3$ as the root, in which case there are two leaf nodes $x_1$ and $x_4$. Starting with the leaf nodes, we then have the following sequence of six messages

$$\mu_{x_1 \to f_a}(x_1) = 1 \tag{8.74}$$
$$\mu_{f_a \to x_2}(x_2) = \sum_{x_1} f_a(x_1, x_2) \tag{8.75}$$
$$\mu_{x_4 \to f_c}(x_4) = 1 \tag{8.76}$$
$$\mu_{f_c \to x_2}(x_2) = \sum_{x_4} f_c(x_2, x_4) \tag{8.77}$$
$$\mu_{x_2 \to f_b}(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2) \tag{8.78}$$
$$\mu_{f_b \to x_3}(x_3) = \sum_{x_2} f_b(x_2, x_3)\, \mu_{x_2 \to f_b}(x_2). \tag{8.79}$$

The direction of flow of these messages is illustrated in Figure 8.52. Once this message propagation is complete, we can then propagate messages from the root node out to the leaf nodes, and these are given by

$$\mu_{x_3 \to f_b}(x_3) = 1 \tag{8.80}$$
$$\mu_{f_b \to x_2}(x_2) = \sum_{x_3} f_b(x_2, x_3) \tag{8.81}$$
$$\mu_{x_2 \to f_a}(x_2) = \mu_{f_b \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2) \tag{8.82}$$
$$\mu_{f_a \to x_1}(x_1) = \sum_{x_2} f_a(x_1, x_2)\, \mu_{x_2 \to f_a}(x_2) \tag{8.83}$$
$$\mu_{x_2 \to f_c}(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_b \to x_2}(x_2) \tag{8.84}$$
$$\mu_{f_c \to x_4}(x_4) = \sum_{x_2} f_c(x_2, x_4)\, \mu_{x_2 \to f_c}(x_2). \tag{8.85}$$
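These twelve messages can be written out directly as matrix-vector operations. In the sketch below the tables $f_a$, $f_b$, $f_c$ are random placeholders, and the final assertion anticipates the check (8.86) performed next: the product of the three messages arriving at $x_2$ equals the brute-force marginal of the joint.

```python
import numpy as np

# Messages (8.74)-(8.85) for the graph of Figure 8.51, written as
# matrix-vector products. Tables are random placeholders; indices are
# zero-based, so fa[i, j] stands for f_a(x_1 = i, x_2 = j), and so on.
K = 2
rng = np.random.default_rng(0)
fa = rng.random((K, K))                    # f_a(x_1, x_2)
fb = rng.random((K, K))                    # f_b(x_2, x_3)
fc = rng.random((K, K))                    # f_c(x_2, x_4)

# From the leaves x_1, x_4 towards the root x_3: (8.74)-(8.79)
mu_x1_fa = np.ones(K)                      # (8.74)
mu_fa_x2 = fa.T @ mu_x1_fa                 # (8.75)
mu_x4_fc = np.ones(K)                      # (8.76)
mu_fc_x2 = fc @ mu_x4_fc                   # (8.77)
mu_x2_fb = mu_fa_x2 * mu_fc_x2             # (8.78)
mu_fb_x3 = fb.T @ mu_x2_fb                 # (8.79)

# From the root back out to the leaves: (8.80)-(8.85)
mu_x3_fb = np.ones(K)                      # (8.80)
mu_fb_x2 = fb @ mu_x3_fb                   # (8.81)
mu_x2_fa = mu_fb_x2 * mu_fc_x2             # (8.82)
mu_fa_x1 = fa @ mu_x2_fa                   # (8.83)
mu_x2_fc = mu_fa_x2 * mu_fb_x2             # (8.84)
mu_fc_x4 = fc.T @ mu_x2_fc                 # (8.85)

# Check that the message product at x_2 matches the summed joint,
# anticipating (8.86). Einsum axes: a=x_1, b=x_2, c=x_3, d=x_4.
p2 = mu_fa_x2 * mu_fb_x2 * mu_fc_x2
joint = np.einsum('ab,bc,bd->abcd', fa, fb, fc)
assert np.allclose(p2, joint.sum(axis=(0, 2, 3)))
```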

Figure 8.52 Flow of messages for the sum-product algorithm applied to the example graph in Figure 8.51. (a) From the leaf nodes $x_1$ and $x_4$ towards the root node $x_3$. (b) From the root node towards the leaf nodes.

One message has now passed in each direction across each link, and we can now evaluate the marginals. As a simple check, let us verify that the marginal $p(x_2)$ is given by the correct expression. Using (8.63) and substituting for the messages using the above results, we have

$$\widetilde{p}(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_b \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2)$$
$$= \left[ \sum_{x_1} f_a(x_1, x_2) \right] \left[ \sum_{x_3} f_b(x_2, x_3) \right] \left[ \sum_{x_4} f_c(x_2, x_4) \right]$$
$$= \sum_{x_1} \sum_{x_3} \sum_{x_4} f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4)$$
$$= \sum_{x_1} \sum_{x_3} \sum_{x_4} \widetilde{p}(\mathbf{x}) \tag{8.86}$$

as required.

So far, we have assumed that all of the variables in the graph are hidden. In most practical applications, a subset of the variables will be observed, and we wish to calculate posterior distributions conditioned on these observations. Observed nodes are easily handled within the sum-product algorithm as follows. Suppose we partition $\mathbf{x}$ into hidden variables $\mathbf{h}$ and observed variables $\mathbf{v}$, and that the observed value of $\mathbf{v}$ is denoted $\widehat{\mathbf{v}}$. Then we simply multiply the joint distribution $p(\mathbf{x})$ by $\prod_i I(v_i, \widehat{v}_i)$, where $I(v, \widehat{v}) = 1$ if $v = \widehat{v}$ and $I(v, \widehat{v}) = 0$ otherwise. This product corresponds to $p(\mathbf{h}, \mathbf{v} = \widehat{\mathbf{v}})$ and hence is an unnormalized version of $p(\mathbf{h} \mid \mathbf{v} = \widehat{\mathbf{v}})$. By running the sum-product algorithm, we can efficiently calculate the posterior marginals $p(h_i \mid \mathbf{v} = \widehat{\mathbf{v}})$ up to a normalization coefficient whose value can be found efficiently using a local computation. Any summations over variables in $\mathbf{v}$ then collapse into a single term.

We have assumed throughout this section that we are dealing with discrete variables. However, there is nothing specific to discrete variables either in the graphical framework or in the probabilistic construction of the sum-product algorithm. For continuous variables the summations are simply replaced by integrations. We shall give an example of the sum-product algorithm applied to a graph of linear-Gaussian variables when we consider linear dynamical systems (Section 13.3).

Table 8.1 Example of a joint distribution over two binary variables for which the maximum of the joint distribution occurs for different variable values compared to the maxima of the two marginals.

          x = 0   x = 1
  y = 0    0.3     0.4
  y = 1    0.3     0.0

8.4.5 The max-sum algorithm

The sum-product algorithm allows us to take a joint distribution $p(\mathbf{x})$ expressed as a factor graph and efficiently find marginals over the component variables. Two other common tasks are to find a setting of the variables that has the largest probability and to find the value of that probability. These can be addressed through a closely related algorithm called max-sum, which can be viewed as an application of dynamic programming in the context of graphical models (Cormen et al., 2001).

A simple approach to finding latent variable values having high probability would be to run the sum-product algorithm to obtain the marginals $p(x_i)$ for every variable, and then, for each marginal in turn, to find the value $x_i^\star$ that maximizes that marginal. However, this would give the set of values that are individually the most probable. In practice, we typically wish to find the set of values that jointly have the largest probability, in other words the vector $\mathbf{x}^{\max}$ that maximizes the joint distribution, so that

$$\mathbf{x}^{\max} = \arg\max_{\mathbf{x}} p(\mathbf{x}) \tag{8.87}$$

for which the corresponding value of the joint probability will be given by

$$p(\mathbf{x}^{\max}) = \max_{\mathbf{x}} p(\mathbf{x}). \tag{8.88}$$

In general, $\mathbf{x}^{\max}$ is not the same as the set of $x_i^\star$ values, as we can easily show using a simple example. Consider the joint distribution $p(x, y)$ over two binary variables $x, y \in \{0, 1\}$ given in Table 8.1. The joint distribution is maximized by setting $x = 1$ and $y = 0$, corresponding to the value 0.4. However, the marginal for $x$, obtained by summing over both values of $y$, is given by $p(x = 0) = 0.6$ and $p(x = 1) = 0.4$, and similarly the marginal for $y$ is given by $p(y = 0) = 0.7$ and $p(y = 1) = 0.3$, and so the marginals are maximized by $x = 0$ and $y = 0$, which corresponds to a value of 0.3 for the joint distribution. In fact, it is not difficult to construct examples for which the set of individually most probable values has probability zero under the joint distribution (Exercise 8.27).
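The Table 8.1 example is quickly verified in code; the array below is simply the table, indexed as [y, x].

```python
import numpy as np

# The joint of Table 8.1, indexed [y, x] with x, y in {0, 1}.
p = np.array([[0.3, 0.4],     # y = 0
              [0.3, 0.0]])    # y = 1

y_max, x_max = np.unravel_index(p.argmax(), p.shape)
print(x_max, y_max, p[y_max, x_max])   # x=1, y=0: joint value 0.4
print(p.sum(axis=0))                   # p(x) = [0.6, 0.4], argmax x=0
print(p.sum(axis=1))                   # p(y) = [0.7, 0.3], argmax y=0
print(p[0, 0])                         # joint at (x=0, y=0) is only 0.3
```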

We therefore seek an efficient algorithm for finding the value of $\mathbf{x}$ that maximizes the joint distribution $p(\mathbf{x})$ and that will allow us to obtain the value of the joint distribution at its maximum. To address the second of these problems, we shall simply write out the max operator in terms of its components

$$\max_{\mathbf{x}} p(\mathbf{x}) = \max_{x_1} \cdots \max_{x_M} p(\mathbf{x}) \tag{8.89}$$

where $M$ is the total number of variables, and then substitute for $p(\mathbf{x})$ using its expansion in terms of a product of factors. In deriving the sum-product algorithm, we made use of the distributive law (8.53) for multiplication. Here we make use of the analogous law for the max operator

$$\max(ab, ac) = a \max(b, c) \tag{8.90}$$

which holds if $a \geqslant 0$ (as will always be the case for the factors in a graphical model). This allows us to exchange products with maximizations.

Consider first the simple example of a chain of nodes described by (8.49). The evaluation of the probability maximum can be written as

$$\max_{\mathbf{x}} p(\mathbf{x}) = \frac{1}{Z} \max_{x_1} \cdots \max_{x_N} \left[ \psi_{1,2}(x_1, x_2) \cdots \psi_{N-1,N}(x_{N-1}, x_N) \right]
= \frac{1}{Z} \max_{x_1} \left[ \psi_{1,2}(x_1, x_2) \left[ \cdots \max_{x_N} \psi_{N-1,N}(x_{N-1}, x_N) \right] \right].$$

As with the calculation of marginals, we see that exchanging the max and product operators results in a much more efficient computation, and one that is easily interpreted in terms of messages passed from node $x_N$ backwards along the chain to node $x_1$.

We can readily generalize this result to arbitrary tree-structured factor graphs by substituting the expression (8.59) for the factor graph expansion into (8.89) and again exchanging maximizations with products. The structure of this calculation is identical to that of the sum-product algorithm, and so we can simply translate those results into the present context. In particular, suppose that we designate a particular variable node as the 'root' of the graph. Then we start a set of messages propagating inwards from the leaves of the tree towards the root, with each node sending its message towards the root once it has received all incoming messages from its other neighbours. The final maximization is performed over the product of all messages arriving at the root node, and gives the maximum value for $p(\mathbf{x})$. This could be called the max-product algorithm and is identical to the sum-product algorithm except that summations are replaced by maximizations. Note that at this stage, messages have been sent from leaves to the root, but not in the other direction.

In practice, products of many small probabilities can lead to numerical underflow problems, and so it is convenient to work with the logarithm of the joint distribution. The logarithm is a monotonic function, so that if $a > b$ then $\ln a > \ln b$, and hence the max operator and the logarithm function can be interchanged, so that

$$\ln \left( \max_{\mathbf{x}} p(\mathbf{x}) \right) = \max_{\mathbf{x}} \ln p(\mathbf{x}). \tag{8.91}$$

The distributive property is preserved because

$$\max(a + b, a + c) = a + \max(b, c). \tag{8.92}$$

Thus taking the logarithm simply has the effect of replacing the products in the max-product algorithm with sums, and so we obtain the max-sum algorithm.

From the results (8.66) and (8.69) derived earlier for the sum-product algorithm, we can readily write down the max-sum algorithm in terms of message passing simply by replacing 'sum' with 'max' and replacing products with sums of logarithms to give

$$\mu_{f \to x}(x) = \max_{x_1, \ldots, x_M} \left[ \ln f(x, x_1, \ldots, x_M) + \sum_{m \in \text{ne}(f) \setminus x} \mu_{x_m \to f}(x_m) \right] \tag{8.93}$$

$$\mu_{x \to f}(x) = \sum_{l \in \text{ne}(x) \setminus f} \mu_{f_l \to x}(x). \tag{8.94}$$

The initial messages sent by the leaf nodes are obtained by analogy with (8.70) and (8.71) and are given by

$$\mu_{x \to f}(x) = 0 \tag{8.95}$$
$$\mu_{f \to x}(x) = \ln f(x) \tag{8.96}$$

while at the root node the maximum probability can then be computed, by analogy with (8.63), using

$$p^{\max} = \max_x \left[ \sum_{s \in \text{ne}(x)} \mu_{f_s \to x}(x) \right]. \tag{8.97}$$

So far, we have seen how to find the maximum of the joint distribution by propagating messages from the leaves to an arbitrarily chosen root node. The result will be the same irrespective of which node is chosen as the root. Now we turn to the second problem of finding the configuration of the variables for which the joint distribution attains this maximum value. So far, we have sent messages from the leaves to the root. The process of evaluating (8.97) will also give the value $x^{\max}$ for the most probable value of the root node variable, defined by

$$x^{\max} = \arg\max_x \left[ \sum_{s \in \text{ne}(x)} \mu_{f_s \to x}(x) \right]. \tag{8.98}$$

At this point, we might be tempted simply to continue with the message passing algorithm and send messages from the root back out to the leaves, using (8.93) and (8.94), and then apply (8.98) to all of the remaining variable nodes. However, because we are now maximizing rather than summing, it is possible that there may be multiple configurations of $\mathbf{x}$ all of which give rise to the maximum value for $p(\mathbf{x})$. In such cases, this strategy can fail because it is possible for the individual variable values obtained by maximizing the product of messages at each node to belong to different maximizing configurations, giving an overall configuration that no longer corresponds to a maximum.

The problem can be resolved by adopting a rather different kind of message passing from the root node to the leaves.
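On a chain, this back-tracking scheme is the familiar Viterbi algorithm. The sketch below is a minimal log-domain max-sum pass with stored back-pointers, anticipating the construction that the following discussion develops; the potentials are random placeholders, and the name phi for the back-pointer tables is our own.

```python
import numpy as np

# A minimal log-domain max-sum pass along the chain (8.49), with stored
# back-pointers for the back-tracking step developed below. The
# potentials are random placeholders; phi is our own name for the
# back-pointer tables.
rng = np.random.default_rng(0)
N, K = 6, 3
log_psi = [np.log(rng.random((K, K))) for _ in range(N - 1)]

msg = np.zeros(K)                          # (8.95): leaf message is zero
phi = []                                   # phi[n][k]: best x_n given x_{n+1}=k
for n in range(N - 1):
    scores = log_psi[n] + msg[:, None]     # scores[x_n, x_{n+1}]
    phi.append(scores.argmax(axis=0))
    msg = scores.max(axis=0)               # (8.93) specialized to a chain

x = np.empty(N, dtype=int)
x[-1] = msg.argmax()                       # (8.98) at the root x_N
log_p_max = msg.max()                      # (8.97), up to the -ln Z constant
for n in range(N - 2, -1, -1):
    x[n] = phi[n][x[n + 1]]                # back-track one consistent maximum
```

Because each back-pointer is stored during the forward pass, the recovered configuration is guaranteed to be a single consistent maximizer, avoiding the failure mode described above.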

To see how this works, let us return once again to the simple chain example of $N$ variables $x_1, \ldots, x_N$, each having $K$ states,

Figure 8.53 A lattice, or trellis, diagram showing explicitly the $K$ possible states (one per row of the diagram) for each of the variables $x_n$ in the chain. In this example $K = 3$. The ar-