Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

Information Theory, Inference, and Learning Algorithms

David J.C. MacKay

Information Theory, Inference, and Learning Algorithms
David J.C. MacKay
[email protected]

(c) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005 David J.C. MacKay
(c) Cambridge University Press 2003

Version 7.2 (fourth printing) March 28, 2005

Please send feedback on this book via http://www.inference.phy.cam.ac.uk/mackay/itila/

Version 6.0 of this book was published by C.U.P. in September 2003. It will remain viewable on-screen on the above website, in postscript, djvu, and pdf formats.

In the second printing (version 6.6) minor typos were corrected, and the book design was slightly altered to modify the placement of section numbers.

In the third printing (version 7.0) minor typos were corrected, and chapter 8 was renamed 'Dependent random variables' (instead of 'Correlated').

In the fourth printing (version 7.2) minor typos were corrected.

(C.U.P. replace this page with their own page ii.)

Contents

   Preface . . . v

 1 Introduction to Information Theory . . . 3
 2 Probability, Entropy, and Inference . . . 22
 3 More about Inference . . . 48

I  Data Compression . . . 65
 4 The Source Coding Theorem . . . 67
 5 Symbol Codes . . . 91
 6 Stream Codes . . . 110
 7 Codes for Integers . . . 132

II Noisy-Channel Coding . . . 137
 8 Dependent Random Variables . . . 138
 9 Communication over a Noisy Channel . . . 146
10 The Noisy-Channel Coding Theorem . . . 162
11 Error-Correcting Codes and Real Channels . . . 177

III Further Topics in Information Theory . . . 191
12 Hash Codes: Codes for Efficient Information Retrieval . . . 193
13 Binary Codes . . . 206
14 Very Good Linear Codes Exist . . . 229
15 Further Exercises on Information Theory . . . 233
16 Message Passing . . . 241
17 Communication over Constrained Noiseless Channels . . . 248
18 Crosswords and Codebreaking . . . 260
19 Why have Sex? Information Acquisition and Evolution . . . 269

IV Probabilities and Inference . . . 281
20 An Example Inference Task: Clustering . . . 284
21 Exact Inference by Complete Enumeration . . . 293
22 Maximum Likelihood and Clustering . . . 300
23 Useful Probability Distributions . . . 311
24 Exact Marginalization . . . 319
25 Exact Marginalization in Trellises . . . 324
26 Exact Marginalization in Graphs . . . 334
27 Laplace's Method . . . 341

28 Model Comparison and Occam's Razor . . . 343
29 Monte Carlo Methods . . . 357
30 Efficient Monte Carlo Methods . . . 387
31 Ising Models . . . 400
32 Exact Monte Carlo Sampling . . . 413
33 Variational Methods . . . 422
34 Independent Component Analysis and Latent Variable Modelling . . . 437
35 Random Inference Topics . . . 445
36 Decision Theory . . . 451
37 Bayesian Inference and Sampling Theory . . . 457

V  Neural networks . . . 467
38 Introduction to Neural Networks . . . 468
39 The Single Neuron as a Classifier . . . 471
40 Capacity of a Single Neuron . . . 483
41 Learning as Inference . . . 492
42 Hopfield Networks . . . 505
43 Boltzmann Machines . . . 522
44 Supervised Learning in Multilayer Networks . . . 527
45 Gaussian Processes . . . 535
46 Deconvolution . . . 549

VI Sparse Graph Codes . . . 555
47 Low-Density Parity-Check Codes . . . 557
48 Convolutional Codes and Turbo Codes . . . 574
49 Repeat-Accumulate Codes . . . 582
50 Digital Fountain Codes . . . 589

VII Appendices . . . 597
 A Notation . . . 598
 B Some Physics . . . 601
 C Some Mathematics . . . 605

   Bibliography . . . 613
   Index . . . 620

Preface

This book is aimed at senior undergraduates and graduate students in Engineering, Science, Mathematics, and Computing. It expects familiarity with calculus, probability theory, and linear algebra as taught in a first- or second-year undergraduate course on mathematics for scientists and engineers.

Conventional courses on information theory cover not only the beautiful theoretical ideas of Shannon, but also practical solutions to communication problems. This book goes further, bringing in Bayesian data modelling, Monte Carlo methods, variational methods, clustering algorithms, and neural networks.

Why unify information theory and machine learning? Because they are two sides of the same coin. In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

How to use this book

The essential dependencies between chapters are indicated in the figure on the next page. An arrow from one chapter to another indicates that the second chapter requires some of the first.

Within Parts I, II, IV, and V of this book, chapters on advanced or optional topics are towards the end. All chapters of Part III are optional on a first reading, except perhaps for Chapter 16 (Message Passing).

The same system sometimes applies within a chapter: the final sections often deal with advanced topics that can be skipped on a first reading. For example, in two key chapters (Chapter 4, The Source Coding Theorem, and Chapter 10, The Noisy-Channel Coding Theorem) the first-time reader should detour at section 4.5 and section 10.4 respectively.

Pages vii-x show a few ways to use this book. First, I give the roadmap for a course that I teach in Cambridge: 'Information theory, pattern recognition, and neural networks'. The book is also intended as a textbook for traditional courses in information theory. The second roadmap shows the chapters for an introductory information theory course and the third for a course aimed at an understanding of state-of-the-art error-correcting codes. The fourth roadmap shows how to use the text in a conventional course on machine learning.

[Figure (p. vi): 'Dependencies' — the diagram of essential dependencies between Chapters 1-50 across Parts I-VI, referred to in the preface.]

[Roadmap (p. vii): chapters covered in my Cambridge course on 'Information Theory, Pattern Recognition, and Neural Networks'.]

[Roadmap (p. viii): chapters covered in a short course on information theory.]

[Roadmap (p. ix): chapters covered in an advanced course on information theory and coding.]

[Roadmap (p. x): chapters covered in a course on Bayesian inference and machine learning.]

About the exercises

You can only understand a subject by creating it for yourself. The exercises play an essential role in this book. For guidance, each has a rating (similar to that used by Knuth (1968)) from 1 to 5 to indicate its difficulty.

In addition, exercises that are especially recommended are marked by a marginal encouraging rat. Some exercises that require the use of a computer are marked with a C.

Answers to many exercises are provided. Use them wisely. Where a solution is provided, this is indicated by including its page number alongside the difficulty rating.

Solutions to many of the other exercises will be supplied to instructors using this book in their teaching; please email [email protected].

Summary of codes for exercises

  (rat)    Especially recommended
  C        Requires a computer
  [p. 42]  Solution provided on page 42
  [1]      Simple (one minute)
  [2]      Medium (quarter hour)
  [3]      Moderately hard
  [4]      Hard
  [5]      Research project

Internet resources

The website http://www.inference.phy.cam.ac.uk/mackay/itila contains several resources:

1. Software. Teaching software that I use in lectures, interactive software, and research software, written in perl, octave, tcl, C, and gnuplot. Also some animations.

2. Corrections to the book. Thank you in advance for emailing these!

3. This book. The book is provided in postscript, pdf, and djvu formats for on-screen viewing. The same copyright restrictions apply as to a normal book.

About this edition

This is the fourth printing of the first edition. In the second printing, the design of the book was altered slightly. Page-numbering generally remained unchanged, except in chapters 1, 6, and 28, where a few paragraphs, figures, and equations moved around. All equation, section, and exercise numbers were unchanged. In the third printing, chapter 8 was renamed 'Dependent Random Variables', instead of 'Correlated', which was sloppy.

Acknowledgments

I am most grateful to the organizations who have supported me while this book gestated: the Royal Society and Darwin College, who gave me a fantastic research fellowship in the early years; the University of Cambridge; the Keck Centre at the University of California in San Francisco, where I spent a productive sabbatical; and the Gatsby Charitable Foundation, whose support gave me the freedom to break out of the Escher staircase that book-writing had become.

My work has depended on the generosity of free software authors. I wrote the book in LaTeX 2e. Three cheers for Donald Knuth and Leslie Lamport! Our computers run the GNU/Linux operating system. I use emacs, perl, and gnuplot every day. Thank you Richard Stallman, thank you Linus Torvalds, thank you everyone.

Many readers, too numerous to name here, have given feedback on the book, and to them all I extend my sincere acknowledgments. I especially wish to thank all the students and colleagues at Cambridge University who have attended my lectures on information theory and machine learning over the last nine years.

The members of the Inference research group have given immense support, and I thank them all for their generosity and patience over the last ten years: Mark Gibbs, Michelle Povinelli, Simon Wilson, Coryn Bailer-Jones, Matthew Davey, Katriona Macphee, James Miskin, David Ward, Edward Ratzer, Seb Wills, John Barry, John Winn, Phil Cowans, Hanna Wallach, Matthew Garrett, and especially Sanjoy Mahajan. Thank you too to Graeme Mitchison, Mike Cates, and Davin Yap.

Finally I would like to express my debt to my personal heroes, the mentors from whom I have learned so much: Yaser Abu-Mostafa, Andrew Blake, John Bridle, Peter Cheeseman, Steve Gull, Geoff Hinton, John Hopfield, Steve Luttrell, Robert MacKay, Bob McEliece, Radford Neal, Roger Sewell, and John Skilling.

Dedication

This book is dedicated to the campaign against the arms trade.

www.caat.org.uk

Peace cannot be kept by force. It can only be achieved through understanding.
- Albert Einstein

About Chapter 1

In the first chapter, you will need to be familiar with the binomial distribution. And to solve the exercises in the text, which I urge you to do, you will need to know Stirling's approximation for the factorial function, $x! \simeq x^x e^{-x}$, and be able to apply it to $\binom{N}{r} = \frac{N!}{(N-r)!\,r!}$. These topics are reviewed below. (Unfamiliar notation? See Appendix A, p.598.)

The binomial distribution

Example 1.1. A bent coin has probability $f$ of coming up heads. The coin is tossed $N$ times. What is the probability distribution of the number of heads, $r$? What are the mean and variance of $r$?

Solution. The number of heads has a binomial distribution.

$$P(r \mid f, N) = \binom{N}{r} f^r (1-f)^{N-r}. \qquad (1.1)$$

[Figure 1.1. The binomial distribution $P(r \mid f = 0.3, N = 10)$.]

The mean, $E[r]$, and variance, $\mathrm{var}[r]$, of this distribution are defined by

$$E[r] \equiv \sum_{r=0}^{N} P(r \mid f, N)\, r \qquad (1.2)$$

$$\mathrm{var}[r] \equiv E\!\left[(r - E[r])^2\right] \qquad (1.3)$$

$$= E[r^2] - (E[r])^2 = \sum_{r=0}^{N} P(r \mid f, N)\, r^2 - (E[r])^2. \qquad (1.4)$$

Rather than evaluating the sums over $r$ in (1.2) and (1.4) directly, it is easiest to obtain the mean and variance by noting that $r$ is the sum of $N$ independent random variables, namely, the number of heads in the first toss (which is either zero or one), the number of heads in the second toss, and so forth. In general,

$$E[x + y] = E[x] + E[y] \ \text{for any random variables $x$ and $y$;} \qquad (1.5)$$
$$\mathrm{var}[x + y] = \mathrm{var}[x] + \mathrm{var}[y] \ \text{if $x$ and $y$ are independent.}$$

So the mean of $r$ is the sum of the means of those random variables, and the variance of $r$ is the sum of their variances. The mean number of heads in a single toss is $f \cdot 1 + (1-f) \cdot 0 = f$, and the variance of the number of heads in a single toss is

$$\left[ f \cdot 1^2 + (1-f) \cdot 0^2 \right] - f^2 = f - f^2 = f(1-f), \qquad (1.6)$$

so the mean and variance of $r$ are:

$$E[r] = Nf \quad \text{and} \quad \mathrm{var}[r] = Nf(1-f). \qquad (1.7)$$
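As a quick numerical check (an illustration, not part of the original text), the following Python sketch evaluates the sums (1.2) and (1.4) directly for $f = 0.3$, $N = 10$ and compares them with the closed forms $Nf$ and $Nf(1-f)$:

```python
from math import comb

def binomial_pmf(r, f, N):
    """Equation (1.1): P(r | f, N) = C(N, r) f^r (1 - f)^(N - r)."""
    return comb(N, r) * f**r * (1 - f) ** (N - r)

f, N = 0.3, 10

# Mean and variance via the direct sums (1.2) and (1.4)
mean = sum(binomial_pmf(r, f, N) * r for r in range(N + 1))
var = sum(binomial_pmf(r, f, N) * r**2 for r in range(N + 1)) - mean**2

print(mean, N * f)           # mean agrees with E[r] = Nf = 3.0
print(var, N * f * (1 - f))  # variance agrees with var[r] = Nf(1-f) = 2.1
```

Summing over all $r$ from 0 to $N$ confirms the shortcut argument via independent single-toss variables.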

Approximating $x!$ and $\binom{N}{r}$

Let's derive Stirling's approximation by an unconventional route. We start from the Poisson distribution with mean $\lambda$,

$$P(r \mid \lambda) = e^{-\lambda} \frac{\lambda^r}{r!}, \quad r \in \{0, 1, 2, \ldots\}. \qquad (1.8)$$

[Figure 1.2. The Poisson distribution $P(r \mid \lambda = 15)$.]

For large $\lambda$, this distribution is well approximated, at least in the vicinity of $r \simeq \lambda$, by a Gaussian distribution with mean $\lambda$ and variance $\lambda$:

$$e^{-\lambda} \frac{\lambda^r}{r!} \simeq \frac{1}{\sqrt{2\pi\lambda}}\, e^{-\frac{(r-\lambda)^2}{2\lambda}}. \qquad (1.9)$$

Let's plug $r = \lambda$ into this formula, then rearrange it.

$$e^{-\lambda} \frac{\lambda^\lambda}{\lambda!} \simeq \frac{1}{\sqrt{2\pi\lambda}} \qquad (1.10)$$

$$\Rightarrow \lambda! \simeq \lambda^\lambda e^{-\lambda} \sqrt{2\pi\lambda}. \qquad (1.11)$$

This is Stirling's approximation for the factorial function.

$$x! \simeq x^x e^{-x} \sqrt{2\pi x} \iff \ln x! \simeq x \ln x - x + \tfrac{1}{2} \ln 2\pi x. \qquad (1.12)$$

We have derived not only the leading order behaviour, $x! \simeq x^x e^{-x}$, but also, at no cost, the next-order correction term $\sqrt{2\pi x}$. We now apply Stirling's approximation to $\ln \binom{N}{r}$:

$$\ln \binom{N}{r} \equiv \ln \frac{N!}{(N-r)!\, r!} \simeq (N-r) \ln \frac{N}{N-r} + r \ln \frac{N}{r}. \qquad (1.13)$$

Since all the terms in this equation are logarithms, this result can be rewritten in any base. We will denote natural logarithms ($\log_e$) by 'ln', and logarithms to base 2 ($\log_2$) by 'log'. (Recall that $\log_2 x = \log_e x / \log_e 2$. Note that $\frac{\partial \log_2 x}{\partial x} = \frac{1}{x \log_e 2}$.)

If we introduce the binary entropy function,

$$H_2(x) \equiv x \log \frac{1}{x} + (1-x) \log \frac{1}{1-x}, \qquad (1.14)$$

then we can rewrite the approximation (1.13) as

$$\log \binom{N}{r} \simeq N H_2(r/N), \qquad (1.15)$$

or, equivalently,

$$\binom{N}{r} \simeq 2^{N H_2(r/N)}. \qquad (1.16)$$

[Figure 1.3. The binary entropy function.]

If we need a more accurate approximation, we can include terms of the next order from Stirling's approximation (1.12):

$$\log \binom{N}{r} \simeq N H_2(r/N) - \tfrac{1}{2} \log \left[ 2\pi N\, \frac{N-r}{N}\, \frac{r}{N} \right]. \qquad (1.17)$$
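To get a feel for the accuracy of (1.15) and its next-order refinement (1.17), one can compare both against the exact value of $\log \binom{N}{r}$; a small Python sketch (illustrative, not from the book; the choice $N = 1000$, $r = 300$ is arbitrary):

```python
from math import comb, log2, pi

def H2(x):
    # Binary entropy function, equation (1.14), in bits
    return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

N, r = 1000, 300
exact = log2(comb(N, r))    # exact value of log2 C(N, r)
leading = N * H2(r / N)     # leading-order approximation (1.15)
# Next-order approximation (1.17)
corrected = leading - 0.5 * log2(2 * pi * N * ((N - r) / N) * (r / N))

print(exact, leading, corrected)
# 'leading' overshoots by a few bits; 'corrected' is accurate
# to a small fraction of a bit.
```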

1  Introduction to Information Theory

    The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
                                                (Claude Shannon, 1948)

In the first half of this book we study how to measure information content; we learn how to compress data; and we learn how to communicate perfectly over imperfect communication channels.

We start by getting a feeling for this last problem.

1.1 How can we achieve perfect communication over an imperfect, noisy communication channel?

Some examples of noisy communication channels are:

- an analogue telephone line, over which two modems communicate digital information;  [modem -> phone line -> modem]
- the radio communication link from Galileo, the Jupiter-orbiting spacecraft, to earth;  [Galileo -> radio waves -> Earth]
- reproducing cells, in which the daughter cells' DNA contains information from the parent cells;  [parent cell -> daughter cells]
- a disk drive.  [computer memory -> disk drive -> computer memory]

The last example shows that communication doesn't have to involve information going from one place to another. When we write a file on a disk drive, we'll read it off in the same location, but at a later time.

These channels are noisy. A telephone line suffers from cross-talk with other lines; the hardware in the line distorts and adds noise to the transmitted signal. The deep space network that listens to Galileo's puny transmitter receives background radiation from terrestrial and cosmic sources. DNA is subject to mutations and damage. A disk drive, which writes a binary digit (a one or zero, known as a bit) by aligning a patch of magnetic material in one of two orientations, may later fail to read out the stored binary digit: the patch of material might spontaneously flip magnetization, or a glitch of background noise might cause the reading circuit to report the wrong value for the binary digit, or the writing head might not induce the magnetization in the first place because of interference from neighbouring bits.

In all these cases, if we transmit data, e.g., a string of bits, over the channel, there is some probability that the received message will not be identical to the

We would prefer to have a communication channel for which this probability was zero, or so close to zero that for practical purposes it is indistinguishable from zero.

Let's consider a noisy disk drive that transmits each bit correctly with probability (1 - f) and incorrectly with probability f. This model communication channel is known as the binary symmetric channel (figure 1.4).

Figure 1.4. The binary symmetric channel. The transmitted symbol is x and the received symbol is y. The noise level, the probability that a bit is flipped, is f.
  P(y=0 | x=0) = 1 - f;    P(y=0 | x=1) = f;
  P(y=1 | x=0) = f;        P(y=1 | x=1) = 1 - f.

Figure 1.5. A binary data sequence of length 10 000 transmitted over a binary symmetric channel with noise level f = 0.1. [Dilbert image Copyright 1997 United Feature Syndicate, Inc., used with permission.]

As an example, let's imagine that f = 0.1, that is, ten per cent of the bits are flipped (figure 1.5). A useful disk drive would flip no bits at all in its entire lifetime. If we expect to read and write a gigabyte per day for ten years, we require a bit error probability of the order of 10^-15, or smaller. There are two approaches to this goal.

The physical solution

The physical solution is to improve the physical characteristics of the communication channel to reduce its error probability. We could improve our disk drive by

1. using more reliable components in its circuitry;
2. evacuating the air from the disk enclosure so as to eliminate the turbulence that perturbs the reading head from the track;
3. using a larger magnetic patch to represent each bit; or
4. using higher-power signals or cooling the circuitry in order to reduce thermal noise.

These physical modifications typically increase the cost of the communication channel.

The 'system' solution

Information theory and coding theory offer an alternative (and much more exciting) approach: we accept the given noisy channel as it is and add communication systems to it so that we can detect and correct the errors introduced by the channel. As shown in figure 1.6, we add an encoder before the channel and a decoder after it. The encoder encodes the source message s into a transmitted message t, adding redundancy to the original message in some way. The channel adds noise to the transmitted message, yielding a received message r. The decoder uses the known redundancy introduced by the encoding system to infer both the original signal s and the added noise.
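The binary symmetric channel described above is easy to simulate. The following sketch (not from the book; the function name is mine) flips each transmitted bit independently with probability f:

```python
import random

def bsc(bits, f, rng=random):
    """Pass a list of 0/1 bits through a binary symmetric channel:
    each bit is flipped independently with probability f."""
    return [b ^ (1 if rng.random() < f else 0) for b in bits]

# With f = 0.1, roughly ten per cent of the bits are flipped on average.
random.seed(0)
sent = [0] * 10000
received = bsc(sent, 0.1)
print(sum(received))  # number of flipped bits, close to 1000
```

Setting f = 0 gives a perfect channel, and f = 0.5 destroys all information, since each received bit is then independent of the transmitted one.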

Figure 1.6. The 'system' solution for achieving reliable communication over a noisy channel. The encoding system introduces systematic redundancy into the transmitted vector t. The decoding system uses this known redundancy to deduce from the received vector r both the original source vector and the noise introduced by the channel.

Whereas physical solutions give incremental channel improvements only at ever-increasing cost, system solutions can turn noisy channels into reliable communication channels with the only cost being a computational requirement at the encoder and decoder.

Information theory is concerned with the theoretical limitations and potentials of such systems. 'What is the best error-correcting performance we could achieve?'

Coding theory is concerned with the creation of practical encoding and decoding systems.

1.2  Error-correcting codes for the binary symmetric channel

We now consider examples of encoding and decoding systems. What is the simplest way to add useful redundancy to a transmission? [To make the rules of the game clear: we want to be able to detect and correct errors; and retransmission is not an option. We get only one chance to encode, transmit, and decode.]

Repetition codes

A straightforward idea is to repeat every bit of the message a prearranged number of times, for example, three times, as shown in table 1.7. We call this repetition code 'R3'.

Table 1.7. The repetition code R3.
  Source   Transmitted
    0         000
    1         111

Imagine that we transmit the source message

  s = 0 0 1 0 1 1 0

over a binary symmetric channel with noise level f = 0.1 using this repetition code. We can describe the channel as 'adding' a sparse noise vector n to the transmitted vector, adding in modulo-2 arithmetic, i.e., the binary algebra in which 1 + 1 = 0. A possible noise vector n and received vector r = t + n are shown in figure 1.8.

Figure 1.8. An example transmission using R3.
  s   0    0    1    0    1    1    0
  t  000  000  111  000  111  111  000
  n  000  001  000  000  101  000  000
  r  000  001  111  000  010  111  000

How should we decode this received vector? The optimal algorithm looks at the received bits three at a time and takes a majority vote (algorithm 1.9).
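The transmission of figure 1.8 can be reproduced in a few lines (a sketch; function and variable names are mine, not the book's):

```python
def encode_R3(source):
    """Repetition code R3: repeat each source bit three times."""
    return [b for s in source for b in (s, s, s)]

def add_noise(t, n):
    """Binary symmetric channel action: r = t + n, modulo 2."""
    return [a ^ b for a, b in zip(t, n)]

s = [0, 0, 1, 0, 1, 1, 0]
t = encode_R3(s)
# The particular sparse noise vector of figure 1.8:
n = [0,0,0, 0,0,1, 0,0,0, 0,0,0, 1,0,1, 0,0,0, 0,0,0]
r = add_noise(t, n)
print(r)  # grouped in triplets: 000 001 111 000 010 111 000, as in figure 1.8
```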

Algorithm 1.9. Majority-vote decoding algorithm for R3. Also shown are the likelihood ratios (1.23), assuming the channel is a binary symmetric channel; γ ≡ (1-f)/f.

  Received sequence r   Likelihood ratio P(r|s=1)/P(r|s=0)   Decoded sequence ŝ
  000                   γ^-3                                  0
  001                   γ^-1                                  0
  010                   γ^-1                                  0
  100                   γ^-1                                  0
  101                   γ^1                                   1
  110                   γ^1                                   1
  011                   γ^1                                   1
  111                   γ^3                                   1

At the risk of explaining the obvious, let's prove this result. The optimal decoding decision (optimal in the sense of having the smallest probability of being wrong) is to find which value of s is most probable, given r. Consider the decoding of a single bit s, which was encoded as t(s) and gave rise to three received bits r = r1 r2 r3. By Bayes' theorem, the posterior probability of s is

  P(s | r1 r2 r3) = P(r1 r2 r3 | s) P(s) / P(r1 r2 r3).   (1.18)

We can spell out the posterior probability of the two alternatives thus:

  P(s=1 | r1 r2 r3) = P(r1 r2 r3 | s=1) P(s=1) / P(r1 r2 r3);   (1.19)

  P(s=0 | r1 r2 r3) = P(r1 r2 r3 | s=0) P(s=0) / P(r1 r2 r3).   (1.20)

This posterior probability is determined by two factors: the prior probability P(s), and the data-dependent term P(r1 r2 r3 | s), which is called the likelihood of s. The normalizing constant P(r1 r2 r3) needn't be computed when finding the optimal decoding decision, which is to guess ŝ = 0 if P(s=0 | r) > P(s=1 | r), and ŝ = 1 otherwise.

To find P(s=0 | r) and P(s=1 | r), we must make an assumption about the prior probabilities of the two hypotheses s=0 and s=1, and we must make an assumption about the probability of r given s. We assume that the prior probabilities are equal: P(s=0) = P(s=1) = 0.5; then maximizing the posterior probability P(s | r) is equivalent to maximizing the likelihood P(r | s). And we assume that the channel is a binary symmetric channel with noise level f < 0.5, so that the likelihood is

  P(r | s) = P(r | t(s)) = \prod_{n=1}^{N} P(r_n | t_n(s)),   (1.21)

where N = 3 is the number of transmitted bits in the block we are considering, and

  P(r_n | t_n) = (1-f)  if r_n = t_n,
                 f      if r_n ≠ t_n.   (1.22)

Thus the likelihood ratio for the two hypotheses is

  P(r | s=1) / P(r | s=0) = \prod_{n=1}^{N} P(r_n | t_n(1)) / P(r_n | t_n(0));   (1.23)

each factor P(r_n | t_n(1)) / P(r_n | t_n(0)) equals (1-f)/f if r_n = 1 and f/(1-f) if r_n = 0. The ratio γ ≡ (1-f)/f is greater than 1, since f < 0.5, so the winning hypothesis is the one with the most 'votes', each vote counting for a factor of γ in the likelihood ratio.
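The equivalence between the likelihood-ratio decision and the majority vote can be checked numerically. A sketch (names are mine) that enumerates all eight received triplets for f = 0.1:

```python
from math import prod

def likelihood(r_triplet, s, f):
    """P(r | s) for R3 over a binary symmetric channel with noise level f:
    each received bit equals t_n = s with probability (1 - f)."""
    return prod((1 - f) if r == s else f for r in r_triplet)

def decode_majority(r_triplet):
    """Majority vote: decode to 1 if two or more received bits are 1."""
    return 1 if sum(r_triplet) >= 2 else 0

f = 0.1
for r in [(0,0,0),(0,0,1),(0,1,0),(1,0,0),(1,0,1),(1,1,0),(0,1,1),(1,1,1)]:
    ratio = likelihood(r, 1, f) / likelihood(r, 0, f)
    # The likelihood-ratio decision agrees with the majority vote:
    assert (ratio > 1) == (decode_majority(r) == 1)
print("majority vote matches the likelihood-ratio rule for all 8 triplets")
```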

Thus the majority-vote decoder shown in algorithm 1.9 is the optimal decoder if we assume that the channel is a binary symmetric channel and that the two possible source messages 0 and 1 have equal prior probability.

We now apply the majority vote decoder to the received vector of figure 1.8. The first three received bits are all 0, so we decode this triplet as a 0. In the second triplet of figure 1.8, there are two 0s and one 1, so we decode this triplet as a 0, which in this case corrects the error. Not all errors are corrected, however. If we are unlucky and two errors fall in a single block, as in the fifth triplet of figure 1.8, then the decoding rule gets the wrong answer, as shown in figure 1.10.

Figure 1.10. Decoding the received vector from figure 1.8.
  s   0    0    1    0    1    1    0
  t  000  000  111  000  111  111  000
  n  000  001  000  000  101  000  000
  r  000  001  111  000  010  111  000
  ŝ   0    0    1    0    0    1    0
The error in the second triplet is corrected; the two errors in the fifth triplet lead to an undetected decoding error.

Exercise 1.2. [2, p.16] Show that the error probability is reduced by the use of R3 by computing the error probability of this code for a binary symmetric channel with noise level f.

[The exercise's rating, e.g. '[2]', indicates its difficulty: '1' exercises are the easiest. Exercises that are accompanied by a marginal rat are especially recommended. If a solution or partial solution is provided, the page is indicated after the difficulty rating; for this exercise, the solution is on page 16.]

The error probability is dominated by the probability that two bits in a block of three are flipped, which scales as f^2. In the case of the binary symmetric channel with f = 0.1, the R3 code has a probability of error, after decoding, of pb ≈ 0.03 per bit. Figure 1.11 shows the result of transmitting a binary image over a binary symmetric channel using the repetition code.

Figure 1.11. Transmitting 10 000 source bits over a binary symmetric channel with f = 10% using a repetition code and the majority vote decoding algorithm. The probability of decoded bit error has fallen to about 3%; the rate has fallen to 1/3.
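The figure of pb ≈ 0.03 can be checked directly: the majority-vote decoder errs exactly when a majority of the repeated bits are flipped. A numerical sketch (function name is mine) for the general repetition code R_N with odd N:

```python
from math import comb

def p_error_repetition(N, f):
    """p_b for the repetition code R_N over a BSC with noise level f:
    the decoder errs when a majority of the N copies are flipped."""
    return sum(comb(N, n) * f**n * (1 - f)**(N - n)
               for n in range((N + 1) // 2, N + 1))

f = 0.1
print(p_error_repetition(3, f))  # 0.028, the 'about 0.03' quoted above

# How many repetitions push the error probability down to 10^-15?
N = 3
while p_error_repetition(N, f) > 1e-15:
    N += 2
print(N)  # roughly sixty, anticipating the disk-drive discussion below
```

For N = 3 the sum reduces to 3 f^2 (1 - f) + f^3, the two terms worked out in the solutions section.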

Figure 1.12. Error probability pb versus rate for repetition codes over a binary symmetric channel with f = 0.1. The right-hand figure shows pb on a logarithmic scale. We would like the rate to be large and pb to be small.

The repetition code R3 has therefore reduced the probability of error, as desired. Yet we have lost something: our rate of information transfer has fallen by a factor of three. So if we use a repetition code to communicate data over a telephone line, it will reduce the error frequency, but it will also reduce our communication rate. We will have to pay three times as much for each phone call. Similarly, we would need three of the original noisy gigabyte disk drives in order to create a one-gigabyte disk drive with pb = 0.03.

Can we push the error probability lower, to the values required for a sellable disk drive, 10^-15? We could achieve lower error probabilities by using repetition codes with more repetitions.

Exercise 1.3. [3, p.16]
(a) Show that the probability of error of R_N, the repetition code with N repetitions, is

  p_b = \sum_{n=(N+1)/2}^{N} \binom{N}{n} f^n (1-f)^{N-n},   (1.24)

for odd N.
(b) Assuming f = 0.1, which of the terms in this sum is the biggest? How much bigger is it than the second-biggest term?
(c) Use Stirling's approximation (p.2) to approximate the \binom{N}{n} in the largest term, and find, approximately, the probability of error of the repetition code with N repetitions.
(d) Assuming f = 0.1, find how many repetitions are required to get the probability of error down to 10^-15. [Answer: about 60.]

So to build a single gigabyte disk drive with the required reliability from noisy gigabyte drives with f = 0.1, we would need sixty of the noisy disk drives. The tradeoff between error probability and rate for repetition codes is shown in figure 1.12.

Block codes: the (7,4) Hamming code

We would like to communicate with tiny probability of error and at a substantial rate. Can we improve on repetition codes? What if we add redundancy to blocks of data instead of encoding one bit at a time? We now study a simple block code.

A block code is a rule for converting a sequence of source bits s, of length K, say, into a transmitted sequence t of length N bits. To add redundancy, we make N greater than K. In a linear block code, the extra N - K bits are linear functions of the original K bits; these extra bits are called parity-check bits. An example linear block code is the (7,4) Hamming code, which transmits N = 7 bits for every K = 4 source bits.

Figure 1.13. Pictorial representation of encoding for the (7,4) Hamming code. (a) The seven transmitted bits arranged in three intersecting circles. (b) The transmitted codeword for the case s = 1000.

The encoding operation for the code is shown pictorially in figure 1.13. We arrange the seven transmitted bits in three intersecting circles. The first four transmitted bits, t1 t2 t3 t4, are set equal to the four source bits, s1 s2 s3 s4. The parity-check bits t5 t6 t7 are set so that the parity within each circle is even: the first parity-check bit is the parity of the first three source bits (that is, it is 0 if the sum of those bits is even, and 1 if the sum is odd); the second is the parity of the last three; and the third parity bit is the parity of source bits one, three and four.

As an example, figure 1.13b shows the transmitted codeword for the case s = 1000. Table 1.14 shows the codewords generated by each of the 2^4 = sixteen settings of the four source bits. These codewords have the special property that any pair differ from each other in at least three bits.

Table 1.14. The sixteen codewords {t} of the (7,4) Hamming code. Any pair of codewords differ from each other in at least three bits.
  s     t         s     t         s     t         s     t
  0000  0000000   0100  0100110   1000  1000101   1100  1100011
  0001  0001011   0101  0101101   1001  1001110   1101  1101000
  0010  0010111   0110  0110001   1010  1010010   1110  1110100
  0011  0011100   0111  0111010   1011  1011001   1111  1111111

Because the Hamming code is a linear code, it can be written compactly in terms of matrices as follows. The transmitted codeword t is obtained from the source sequence s by a linear operation,

  t = G^T s,   (1.25)

where G is the generator matrix of the code,

  G^T = [ 1 0 0 0
          0 1 0 0
          0 0 1 0
          0 0 0 1
          1 1 1 0
          0 1 1 1
          1 0 1 1 ],   (1.26)

and the encoding operation (1.25) uses modulo-2 arithmetic (1 + 1 = 0, 0 + 1 = 1, etc.).

In the encoding operation (1.25) I have assumed that s and t are column vectors. If instead they are row vectors, then this equation is replaced by

  t = sG,   (1.27)
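The encoding rule and the minimum-distance property of table 1.14 are easy to verify numerically. A sketch (names are mine) using the row-vector convention of equation (1.28):

```python
from itertools import product

# Generator matrix G of the (7,4) Hamming code, row-vector convention.
G = [[1,0,0,0,1,0,1],
     [0,1,0,0,1,1,0],
     [0,0,1,0,1,1,1],
     [0,0,0,1,0,1,1]]

def encode_hamming74(s):
    """t = sG in modulo-2 arithmetic."""
    return [sum(s[k] * G[k][n] for k in range(4)) % 2 for n in range(7)]

codewords = [encode_hamming74(list(s)) for s in product([0, 1], repeat=4)]
assert encode_hamming74([1, 0, 0, 0]) == [1, 0, 0, 0, 1, 0, 1]  # figure 1.13b

# Any pair of distinct codewords differ in at least three bits:
dmin = min(sum(a != b for a, b in zip(u, v))
           for u in codewords for v in codewords if u != v)
print(dmin)  # 3
```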

where

  G = [ 1 0 0 0 1 0 1
        0 1 0 0 1 1 0
        0 0 1 0 1 1 1
        0 0 0 1 0 1 1 ].   (1.28)

I find it easier to relate to the right-multiplication (1.25) than the left-multiplication (1.27). Many coding theory texts use the left-multiplying conventions (1.27-1.28), however.

The rows of the generator matrix (1.28) can be viewed as defining four basis vectors lying in a seven-dimensional binary space. The sixteen codewords are obtained by making all possible linear combinations of these vectors.

Decoding the (7,4) Hamming code

When we invent a more complex encoder s → t, the task of decoding the received vector r becomes less straightforward. Remember that any of the bits may have been flipped, including the parity bits.

If we assume that the channel is a binary symmetric channel and that all source vectors are equiprobable, then the optimal decoder identifies the source vector s whose encoding t(s) differs from the received vector r in the fewest bits. [Refer to the likelihood function (1.23) to see why this is so.] We could solve the decoding problem by measuring how far r is from each of the sixteen codewords in table 1.14, then picking the closest. Is there a more efficient way of finding the most probable source vector?

Syndrome decoding for the Hamming code

For the (7,4) Hamming code there is a pictorial solution to the decoding problem, based on the encoding picture, figure 1.13.

As a first example, let's assume the transmission was t = 1000101 and the noise flips the second bit, so the received vector is r = 1000101 + 0100000 = 1100101. We write the received vector into the three circles as shown in figure 1.15a, and look at each of the three circles to see whether its parity is even. The circles whose parity is not even are shown by dashed lines in figure 1.15b. The decoding task is to find the smallest set of flipped bits that can account for these violations of the parity rules. [The pattern of violations of the parity checks is called the syndrome, and can be written as a binary vector; for example, in figure 1.15b, the syndrome is z = (1, 1, 0), because the first two circles are 'unhappy' (parity 1) and the third circle is 'happy' (parity 0).]

To solve the decoding task, we ask the question: can we find a unique bit that lies inside all the 'unhappy' circles and outside all the 'happy' circles? If so, the flipping of that bit would account for the observed syndrome. In the case shown in figure 1.15b, the bit r2 lies inside the two unhappy circles and outside the happy circle; no other single bit has this property, so r2 is the only single bit capable of explaining the syndrome.

Let's work through a couple more examples. Figure 1.15c shows what happens if one of the parity bits, t5, is flipped by the noise. Just one of the checks is violated. Only r5 lies inside this unhappy circle and outside the other two happy circles, so r5 is identified as the only single bit capable of explaining the syndrome.

If the central bit r3 is received flipped, figure 1.15d shows that all three checks are violated; only r3 lies inside all three circles, so r3 is identified as the suspect bit.

Figure 1.15. Pictorial representation of decoding of the Hamming (7,4) code. The received vector is written into the diagram as shown in (a). In (b,c,d,e), the received vector is shown, assuming that the transmitted vector was as in figure 1.13b and the bits labelled by * were flipped. The violated parity checks are highlighted by dashed circles. One of the seven bits is the most probable suspect to account for each 'syndrome', i.e., each pattern of violated and satisfied parity checks. In examples (b), (c), and (d), the most probable suspect is the one bit that was flipped. In example (e), two bits have been flipped, s3 and t7. The most probable suspect is r2, marked by a circle in (e'), which shows the output of the decoding algorithm.

Algorithm 1.16. Actions taken by the optimal decoder for the (7,4) Hamming code, assuming a binary symmetric channel with small noise level f. The syndrome vector z lists whether each parity check is violated (1) or satisfied (0), going through the checks in the order of the bits r5, r6, and r7.

  Syndrome z        000   001  010  011  100  101  110  111
  Unflip this bit   none  r7   r6   r4   r5   r1   r2   r3

If you try flipping any one of the seven bits, you'll find that a different syndrome is obtained in each case: seven non-zero syndromes, one for each bit. There is only one other syndrome, the all-zero syndrome. So if the channel is a binary symmetric channel with a small noise level f, the optimal decoder unflips at most one bit, depending on the syndrome, as shown in algorithm 1.16. Each syndrome could have been caused by other noise patterns too, but any other noise pattern that has the same syndrome must be less probable because it involves a larger number of noise events.

What happens if the noise actually flips more than one bit? Figure 1.15e shows the situation when two bits, r3 and r7, are received flipped. The syndrome, 110, makes us suspect the single bit r2; so our optimal decoding algorithm flips this bit, giving a decoded pattern with three errors as shown in figure 1.15e'. If we use the optimal decoding algorithm, any two-bit error pattern will lead to a decoded seven-bit vector that contains three errors.

General view of decoding for linear codes: syndrome decoding

We can also describe the decoding problem for a linear code in terms of matrices. The first four received bits, r1 r2 r3 r4, purport to be the four source bits; and the received bits r5 r6 r7 purport to be the parities of the source bits, as defined by the generator matrix G. We evaluate the three parity-check bits for the received bits, r1 r2 r3 r4, and see whether they match the three received bits, r5 r6 r7. The differences (modulo 2) between these two triplets are called the syndrome of the received vector. If the syndrome is zero, that is, if all three parity checks are happy, then the received vector is a codeword, and the most probable decoding is given by reading out its first four bits.

If the syndrome is non-zero, then the noise sequence for this block was non-zero, and the syndrome is our pointer to the most probable error pattern.

Figure 1.17. Transmitting 10 000 source bits over a binary symmetric channel with f = 10% using a (7,4) Hamming code. The probability of decoded bit error is about 7%.

The computation of the syndrome vector is a linear operation. If we define the 3x4 matrix P such that the matrix of equation (1.26) is

  G^T = [ I_4
          P  ],   (1.29)

where I_4 is the 4x4 identity matrix, then the syndrome vector is z = Hr, where H is the parity-check matrix H = [-P I_3]; in modulo 2 arithmetic, -1 ≡ 1, so

  H = [ P I_3 ] = [ 1 1 1 0 1 0 0
                    0 1 1 1 0 1 0
                    1 0 1 1 0 0 1 ].   (1.30)

All the codewords t = G^T s of the code satisfy

  Ht = [ 0
         0
         0 ].   (1.31)

Exercise 1.4. [1] Prove that this is so by evaluating the 3x4 matrix HG^T.

Since the received vector r is given by r = G^T s + n, the syndrome-decoding problem is to find the most probable noise vector n satisfying the equation

  Hn = z.   (1.32)

A decoding algorithm that solves this problem is called a maximum-likelihood decoder. We will discuss decoding problems like this in later chapters.

Summary of the (7,4) Hamming code's properties

Every possible received vector of length 7 bits is either a codeword, or it's one flip away from a codeword.

Since there are three parity constraints, each of which might or might not be violated, there are 2 x 2 x 2 = 8 distinct syndromes. They can be divided into seven non-zero syndromes, one for each of the one-bit error patterns, and the all-zero syndrome, corresponding to the zero-noise case. The optimal decoder takes no action if the syndrome is zero; otherwise it uses this mapping of non-zero syndromes onto one-bit error patterns to unflip the suspect bit.
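The decoder of algorithm 1.16 can be implemented directly from the parity-check matrix of equation (1.30). This sketch (names are mine) finds the unflip target by searching the one-bit error patterns rather than hard-coding the lookup table, which gives the same mapping:

```python
# Parity-check matrix H of the (7,4) Hamming code, equation (1.30).
H = [[1,1,1,0,1,0,0],
     [0,1,1,1,0,1,0],
     [1,0,1,1,0,0,1]]

def syndrome(r):
    """z = Hr, modulo 2."""
    return tuple(sum(h * b for h, b in zip(row, r)) % 2 for row in H)

def decode_hamming74(r):
    """Optimal decoder for a BSC with small f: unflip the unique bit
    whose one-bit error pattern explains the syndrome, if it is non-zero."""
    z = syndrome(r)
    r = list(r)
    if any(z):
        for i in range(7):
            e = [0] * 7
            e[i] = 1
            if syndrome(e) == z:  # the single-bit flip matching algorithm 1.16
                r[i] ^= 1
                break
    return r[:4]  # the first four bits purport to be the source bits

# The worked example: t = 1000101 with the second bit flipped.
print(decode_hamming74([1, 1, 0, 0, 1, 0, 1]))  # [1, 0, 0, 0]
```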

There is a decoding error if the four decoded bits ŝ1, ŝ2, ŝ3, ŝ4 do not all match the source bits s1, s2, s3, s4. The probability of block error pB is the probability that one or more of the decoded bits in one block fail to match the corresponding source bits,

  p_B = P(ŝ ≠ s).   (1.33)

The probability of bit error pb is the average probability that a decoded bit fails to match the corresponding source bit,

  p_b = (1/K) \sum_{k=1}^{K} P(ŝ_k ≠ s_k).   (1.34)

In the case of the Hamming code, a decoding error will occur whenever the noise has flipped more than one bit in a block of seven. The probability of block error is thus the probability that two or more bits are flipped in a block. This probability scales as O(f^2), as did the probability of error for the repetition code R3. But notice that the Hamming code communicates at a greater rate, R = 4/7.

Figure 1.17 shows a binary image transmitted over a binary symmetric channel using the (7,4) Hamming code. About 7% of the decoded bits are in error. Notice that the errors are correlated: often two or three successive decoded bits are flipped.

Exercise 1.5. [1] This exercise and the next three refer to the (7,4) Hamming code. Decode the received strings:
(a) r = 1101011
(b) r = 0110110
(c) r = 0100111
(d) r = 1111111.

Exercise 1.6. [2, p.17]
(a) Calculate the probability of block error pB of the (7,4) Hamming code as a function of the noise level f and show that to leading order it goes as 21 f^2.
(b) [3] Show that to leading order the probability of bit error pb goes as 9 f^2.
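Since a block is decoded correctly exactly when the noise flips zero bits or one bit, the block error probability has a simple closed form, which can be checked against the leading-order behaviour 21 f^2 claimed in exercise 1.6(a). A sketch (the function name is mine):

```python
def p_block_error(f):
    """P(two or more of the 7 bits flipped) for the (7,4) Hamming code
    over a BSC with noise level f: the complement of zero or one flips."""
    return 1 - (1 - f)**7 - 7 * f * (1 - f)**6

for f in [0.1, 0.01, 0.001]:
    print(f, p_block_error(f), 21 * f**2)
# As f shrinks, p_block_error(f) approaches the leading-order value 21 f^2.
```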
Exercise 1.7. [2, p.19] Find some noise vectors that give the all-zero syndrome (that is, noise vectors that leave all the parity checks unviolated). How many such noise vectors are there?

Exercise 1.8. [2] I asserted above that a block decoding error will result whenever two or more bits are flipped in a single block. Show that this is indeed so. [In principle, there might be error patterns that, after decoding, led only to the corruption of the parity bits, with no source bits incorrectly decoded.]

Summary of codes' performances

Figure 1.18 shows the performance of repetition codes and the Hamming code. It also shows the performance of a family of linear block codes that are generalizations of Hamming codes, called BCH codes.

This figure shows that we can, using linear block codes, achieve better performance than repetition codes; but the asymptotic situation still looks grim.

Figure 1.18. Error probability pb versus rate for repetition codes, the (7,4) Hamming code and BCH codes with blocklengths up to 1023 over a binary symmetric channel with f = 0.1. The right-hand figure shows pb on a logarithmic scale.

Exercise 1.9. [4, p.19] Design an error-correcting code and a decoding algorithm for it, estimate its probability of error, and add it to figure 1.18. [Don't worry if you find it difficult to make a code better than the Hamming code, or if you find it difficult to find a good decoder for your code; that's the point of this exercise.]

Exercise 1.10. [3, p.20] A (7,4) Hamming code can correct any one error; might there be a (14,8) code that can correct any two errors?
Optional extra: Does the answer to this question depend on whether the code is linear or nonlinear?

Exercise 1.11. [4, p.21] Design an error-correcting code, other than a repetition code, that can correct any two errors in a block of size N.

1.3  What performance can the best codes achieve?

There seems to be a trade-off between the decoded bit-error probability pb (which we would like to reduce) and the rate R (which we would like to keep large). How can this trade-off be characterized? What points in the (R, pb) plane are achievable? This question was addressed by Claude Shannon in his pioneering paper of 1948, in which he both created the field of information theory and solved most of its fundamental problems.

At that time there was a widespread belief that the boundary between achievable and nonachievable points in the (R, pb) plane was a curve passing through the origin (R, pb) = (0, 0); if this were so, then, in order to achieve a vanishingly small error probability pb, one would have to reduce the rate correspondingly close to zero. 'No pain, no gain.'

However, Shannon proved the remarkable result that the boundary between achievable and nonachievable points meets the R axis at a non-zero value R = C, as shown in figure 1.19. For any channel, there exist codes that make it possible to communicate with arbitrarily small probability of error pb at non-zero rates. The first half of this book (Parts I-III) will be devoted to understanding this remarkable result, which is called the noisy-channel coding theorem.

Example: f = 0.1

The maximum rate at which communication is possible with arbitrarily small pb is called the capacity of the channel.

Figure 1.19. Shannon's noisy-channel coding theorem. The solid curve shows the Shannon limit on achievable values of (R, pb) for the binary symmetric channel with f = 0.1. Rates up to R = C are achievable with arbitrarily small pb. The points show the performance of some textbook codes, as in figure 1.18. The equation defining the Shannon limit (the solid curve) is R = C/(1 - H2(pb)), where C and H2 are defined in equation (1.35).

The formula for the capacity of a binary symmetric channel with noise level f is

  C(f) = 1 - H2(f) = 1 - [ f log2(1/f) + (1-f) log2(1/(1-f)) ];   (1.35)

the channel we were discussing earlier with noise level f = 0.1 has capacity C ≈ 0.53. Let us consider what this means in terms of noisy disk drives. The repetition code R3 could communicate over this channel with pb = 0.03 at a rate R = 1/3. Thus we know how to build a single gigabyte disk drive with pb = 0.03 from three noisy gigabyte disk drives. We also know how to make a single gigabyte disk drive with pb ≈ 10^-15 from sixty noisy one-gigabyte drives (exercise 1.3, p.8). And now Shannon passes by, notices us juggling with disk drives and codes, and says:

  'What performance are you trying to achieve? 10^-15? You don't need sixty disk drives: you can get that performance with just two disk drives (since 1/2 is less than 0.53). And if you want pb = 10^-18 or 10^-24 or anything, you can get there with two disk drives too!'
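The capacity formula (1.35), and the claim that rate 1/2 lies below the capacity of this channel, can be checked numerically (a sketch; function names are mine):

```python
from math import log2

def H2(p):
    """Binary entropy function, in bits."""
    if p in (0, 1):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

def capacity_bsc(f):
    """Capacity of the binary symmetric channel, equation (1.35)."""
    return 1 - H2(f)

print(capacity_bsc(0.1))  # approximately 0.53
# Shannon's point: rate 1/2 (two disk drives) is below capacity, so
# arbitrarily small error probability is achievable at that rate.
assert 0.5 < capacity_bsc(0.1)
```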
[Strictly, the above statements might not be quite right, since, as we shall see, Shannon proved his noisy-channel coding theorem by studying sequences of block codes with ever-increasing blocklengths, and the required blocklength might be bigger than a gigabyte (the size of our disk drive), in which case, Shannon might say 'well, you can't do it with those tiny disk drives, but if you had two noisy terabyte drives, you could make a single high-quality terabyte drive from them'.]

1.4 Summary

The (7,4) Hamming code

By including three parity-check bits in a block of 7 bits it is possible to detect and correct any single bit error in each block.

Shannon's noisy-channel coding theorem

Information can be communicated over a noisy channel at a non-zero rate with arbitrarily small error probability.
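The capacity formula (1.35) is easy to evaluate numerically. The short Python sketch below (the helper names are mine, not the book's) checks that f = 0.1 gives C ≈ 0.53, and that a rate of 1/2 — two disk drives — is indeed below this capacity:

```python
from math import log2

def binary_entropy(f):
    """H_2(f) = f log2(1/f) + (1-f) log2(1/(1-f)), as in equation (1.35)."""
    if f in (0.0, 1.0):
        return 0.0
    return f * log2(1 / f) + (1 - f) * log2(1 / (1 - f))

def bsc_capacity(f):
    """Capacity of the binary symmetric channel with flip probability f."""
    return 1.0 - binary_entropy(f)

C = bsc_capacity(0.1)
print(round(C, 2))   # capacity of the f = 0.1 channel, approximately 0.53
print(1 / 2 < C)     # rate 1/2 (two disk drives) is below capacity
```

A completely noisy channel (f = 1/2) has capacity zero, as the formula confirms.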

Information theory addresses both the limitations and the possibilities of communication. The noisy-channel coding theorem, which we will prove in Chapter 10, asserts both that reliable communication at any rate beyond the capacity is impossible, and that reliable communication at all rates up to capacity is possible.

The next few chapters lay the foundations for this result by discussing how to measure information content and the intimately related topic of data compression.

1.5 Further exercises

Exercise 1.12. [2, p.21] Consider the repetition code R_9. One way of viewing this code is as a concatenation of R_3 with R_3. We first encode the source stream with R_3, then encode the resulting output with R_3. We could call this code 'R_3^2'. This idea motivates an alternative decoding algorithm, in which we decode the bits three at a time using the decoder for R_3; then decode the decoded bits from that first decoder using the decoder for R_3.

Evaluate the probability of error for this decoder and compare it with the probability of error for the optimal decoder for R_9.

Do the concatenated encoder and decoder for R_3^2 have advantages over those for R_9?

1.6 Solutions

Solution to exercise 1.2 (p.7). An error is made by R_3 if two or more bits are flipped in a block of three. So the error probability of R_3 is a sum of two terms: the probability that all three bits are flipped, f^3; and the probability that exactly two bits are flipped, 3 f^2 (1-f). [If these expressions are not obvious, see example 1.1 (p.1): the expressions are P(r = 3 | f, N = 3) and P(r = 2 | f, N = 3).]
So

  p_b = p_B = 3 f^2 (1-f) + f^3 = 3 f^2 - 2 f^3.   (1.36)

This probability is dominated for small f by the term 3 f^2. See exercise 2.38 (p.39) for further discussion of this problem.

Solution to exercise 1.3 (p.8). The probability of error for the repetition code R_N is dominated by the probability that \lceil N/2 \rceil bits are flipped, which goes (for odd N) as

  \binom{N}{(N+1)/2} f^{(N+1)/2} (1-f)^{(N-1)/2}.   (1.37)

[Notation: \lceil N/2 \rceil denotes the smallest integer greater than or equal to N/2.]

The term \binom{N}{K} can be approximated using the binary entropy function:

  \frac{1}{N+1} 2^{N H_2(K/N)} \le \binom{N}{K} \le 2^{N H_2(K/N)}  \Rightarrow  \binom{N}{K} \simeq 2^{N H_2(K/N)},   (1.38)

where this approximation introduces an error of order \sqrt{N} - as shown in equation (1.17). So

  p_b = p_B \simeq 2^N ( f (1-f) )^{N/2} = ( 4 f (1-f) )^{N/2}.   (1.39)

Setting this equal to the required value of 10^{-15} we find N \simeq 2 \log 10^{-15} / \log 4f(1-f) = 68. This answer is a little out because the approximation we used overestimated \binom{N}{K} and we did not distinguish between N/2 and \lceil N/2 \rceil.

A slightly more careful answer (short of explicit computation) goes as follows. Taking the approximation for \binom{N}{K} to the next order, we find:

  \binom{N}{N/2} \simeq 2^N \frac{1}{\sqrt{2\pi N/4}}.   (1.40)

This approximation can be proved from an accurate version of Stirling's approximation (1.12), or by considering the binomial distribution with p = 1/2 and noting

  1 = \sum_K \binom{N}{K} 2^{-N} \simeq 2^{-N} \binom{N}{N/2} \sum_{r=-N/2}^{N/2} e^{-r^2/(2\sigma^2)} \simeq 2^{-N} \binom{N}{N/2} \sqrt{2\pi}\, \sigma,   (1.41)

where \sigma = \sqrt{N/4}, from which equation (1.40) follows. The distinction between \lceil N/2 \rceil and N/2 is not important in this term since \binom{N}{K} has a maximum at K = N/2.

Then the probability of error (for odd N) is to leading order

  p_b \simeq \binom{N}{(N+1)/2} f^{(N+1)/2} (1-f)^{(N-1)/2}   (1.42)
      \simeq 2^N \frac{1}{\sqrt{\pi N/2}}\, f\, [ f(1-f) ]^{(N-1)/2} \simeq \frac{1}{\sqrt{\pi N/8}}\, f\, [ 4f(1-f) ]^{(N-1)/2}.   (1.43)

The equation p_b = 10^{-15} can be written

  (N-1)/2 \simeq \frac{ \log 10^{-15} + \log \frac{\sqrt{\pi N/8}}{f} }{ \log 4f(1-f) },   (1.44)

[In equation (1.44), the logarithms can be taken to any base, as long as it's the same base throughout. In equation (1.45), I use base 10.]

which may be solved for N iteratively, the first iteration starting from \hat{N}_1 = 68:

  (\hat{N}_2 - 1)/2 \simeq \frac{-15 + 1.7}{-0.44} = 29.9  \Rightarrow  \hat{N}_2 \simeq 60.9.   (1.45)

This answer is found to be stable, so N \simeq 61 is the blocklength at which p_b \simeq 10^{-15}.

Solution to exercise 1.6 (p.13).

(a) The probability of block error of the Hamming code is a sum of six terms - the probabilities that 2, 3, 4, 5, 6, or 7 errors occur in one block.
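The iterative estimate above can be checked against an exact calculation. The sketch below (Python; the function name is mine) sums the binomial tail for the majority-vote decoder of R_N and confirms that at N = 61 the error probability of the f = 0.1 repetition code is within a small factor of 10^{-15}:

```python
from math import comb

def repetition_error(N, f):
    """Exact p_b for R_N (odd N) with majority-vote decoding on a BSC(f):
    the probability that (N+1)/2 or more of the N copies are flipped."""
    return sum(comb(N, k) * f**k * (1 - f)**(N - k)
               for k in range((N + 1) // 2, N + 1))

f = 0.1
for N in (3, 61):
    print(N, repetition_error(N, f))
```

For N = 3 this reproduces p_b = 3f^2 - 2f^3 = 0.028 from equation (1.36), and the N = 61 value is consistent with the estimate derived above.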
So

  p_B = \sum_{r=2}^{7} \binom{7}{r} f^r (1-f)^{7-r}.   (1.46)

To leading order, this goes as

  p_B \simeq \binom{7}{2} f^2 = 21 f^2.   (1.47)

(b) The probability of bit error of the Hamming code is smaller than the probability of block error because a block error rarely corrupts all bits in the decoded block. The leading-order behaviour is found by considering the outcome in the most probable case where the noise vector has weight two. The decoder will erroneously flip a third bit, so that the modified received vector (of length 7) differs in three bits from the transmitted vector. That means, if we average over all seven bits, the probability that a randomly chosen bit is flipped is 3/7 times the block error probability, to leading order. Now, what we really care about is the probability that

a source bit is flipped. Are parity bits or source bits more likely to be among these three flipped bits, or are all seven bits equally likely to be corrupted when the noise vector has weight two? The Hamming code is in fact completely symmetric in the protection it affords to the seven bits (assuming a binary symmetric channel). [This symmetry can be proved by showing that the role of a parity bit can be exchanged with a source bit and the resulting code is still a (7,4) Hamming code; see below.] The probability that any one bit ends up corrupted is the same for all seven bits. So the probability of bit error (for the source bits) is simply three sevenths of the probability of block error.

  p_b \simeq \frac{3}{7} p_B \simeq 9 f^2.   (1.48)

Symmetry of the Hamming (7,4) code

To prove that the (7,4) code protects all bits equally, we start from the parity-check matrix

      [ 1 1 1 0 1 0 0 ]
  H = [ 0 1 1 1 0 1 0 ]   (1.49)
      [ 1 0 1 1 0 0 1 ]

The symmetry among the seven transmitted bits will be easiest to see if we reorder the seven bits using the permutation (t_1 t_2 t_3 t_4 t_5 t_6 t_7) -> (t_5 t_2 t_3 t_4 t_1 t_6 t_7). Then we can rewrite H thus:

      [ 1 1 1 0 1 0 0 ]
  H = [ 0 1 1 1 0 1 0 ]   (1.50)
      [ 0 0 1 1 1 0 1 ]

Now, if we take any two parity constraints that t satisfies and add them together, we get another parity constraint.
For example, row 1 asserts t_5 + t_2 + t_3 + t_1 = even, and row 2 asserts t_2 + t_3 + t_4 + t_6 = even, and the sum of these two constraints is

  t_5 + 2 t_2 + 2 t_3 + t_1 + t_4 + t_6 = even;   (1.51)

we can drop the terms 2 t_2 and 2 t_3, since they are even whatever t_2 and t_3 are; thus we have derived the parity constraint t_5 + t_1 + t_4 + t_6 = even, which we can if we wish add into the parity-check matrix as a fourth row. [The set of vectors satisfying Ht = 0 will not be changed.] We thus define

       [ 1 1 1 0 1 0 0 ]
  H' = [ 0 1 1 1 0 1 0 ]   (1.52)
       [ 0 0 1 1 1 0 1 ]
       [ 1 0 0 1 1 1 0 ]

The fourth row is the sum (modulo two) of the top two rows. Notice that the second, third, and fourth rows are all cyclic shifts of the top row. If, having added the fourth redundant constraint, we drop the first constraint, we obtain a new parity-check matrix H'',

        [ 0 1 1 1 0 1 0 ]
  H'' = [ 0 0 1 1 1 0 1 ]   (1.53)
        [ 1 0 0 1 1 1 0 ]

which still satisfies H'' t = 0 for all codewords, and which looks just like the starting H in (1.50), except that the columns have all shifted along one

to the right, and the rightmost column has reappeared at the left (a cyclic permutation of the columns).

This establishes the symmetry among the seven bits. Iterating the above procedure five more times, we can make a total of seven different H matrices for the same code, each of which assigns each bit to a different role.

We may also construct the super-redundant seven-row parity-check matrix for the code,

         [ 1 1 1 0 1 0 0 ]
         [ 0 1 1 1 0 1 0 ]
         [ 0 0 1 1 1 0 1 ]
  H''' = [ 1 0 0 1 1 1 0 ]   (1.54)
         [ 0 1 0 0 1 1 1 ]
         [ 1 0 1 0 0 1 1 ]
         [ 1 1 0 1 0 0 1 ]

This matrix is 'redundant' in the sense that the space spanned by its rows is only three-dimensional, not seven. This matrix is also a cyclic matrix. Every row is a cyclic permutation of the top row.

Cyclic codes: if there is an ordering of the bits t_1 ... t_N such that a linear code has a cyclic parity-check matrix, then the code is called a cyclic code. The codewords of such a code also have cyclic properties: any cyclic permutation of a codeword is a codeword.

For example, the Hamming (7,4) code, with its bits ordered as above, consists of all seven cyclic shifts of the codewords 1110100 and 1011000, and the codewords 0000000 and 1111111.

Cyclic codes are a cornerstone of the algebraic approach to error-correcting codes. We won't use them again in this book, however, as they have been superseded by sparse-graph codes (Part VI).

Solution to exercise 1.7 (p.13). There are fifteen non-zero noise vectors which give the all-zero syndrome; these are precisely the fifteen non-zero codewords of the Hamming code. Notice that because the Hamming code is linear, the sum of any two codewords is a codeword.
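Several of this section's claims can be checked by brute force: enumerating the null space of the matrix H in (1.50) should yield 16 codewords closed under cyclic shifts, and enumerating all 2^7 noise vectors under syndrome decoding should reproduce the leading-order error probabilities (1.47) and (1.48). The Python sketch below does both (the function names are mine, not the book's):

```python
from itertools import product

# Parity-check matrix (1.50), with the bits reordered as in the text.
H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [0, 0, 1, 1, 1, 0, 1]]

def syndrome(v):
    return tuple(sum(h * b for h, b in zip(row, v)) % 2 for row in H)

# 1. The code: all length-7 vectors with zero syndrome.
codewords = [v for v in product((0, 1), repeat=7) if syndrome(v) == (0, 0, 0)]
assert len(codewords) == 16                       # 2^4 codewords
rotate = lambda v: v[-1:] + v[:-1]
assert all(rotate(c) in codewords for c in codewords)   # it is a cyclic code
assert (1, 1, 1, 0, 1, 0, 0) in codewords and (1, 0, 1, 1, 0, 0, 0) in codewords

# 2. Syndrome decoding: each non-zero syndrome points at one bit to flip.
flip_for = {syndrome(tuple(1 if i == j else 0 for i in range(7))): j
            for j in range(7)}

f = 0.01
pB = pb = 0.0
for n in product((0, 1), repeat=7):   # all noise vectors, zero codeword sent
    prob = f**sum(n) * (1 - f)**(7 - sum(n))
    z = syndrome(n)
    residual = list(n)
    if z != (0, 0, 0):
        residual[flip_for[z]] ^= 1    # the decoder's single-bit correction
    if any(residual):
        pB += prob                    # block error
        pb += prob * sum(residual) / 7   # average fraction of bits left wrong
print(pB / (21 * f**2), pb / (9 * f**2))   # both ratios should be near 1
```

By the symmetry just proved, the per-bit error rate averaged over all seven bits equals the source-bit error rate, so the second ratio tests (1.48) directly.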
Graphs corresponding to codes

Solution to exercise 1.9 (p.14). When answering this question, you will probably find that it is easier to invent new codes than to find optimal decoders for them. There are many ways to design codes, and what follows is just one possible train of thought. We make a linear block code that is similar to the (7,4) Hamming code, but bigger.

Many codes can be conveniently expressed in terms of graphs. In figure 1.13, we introduced a pictorial representation of the (7,4) Hamming code. If we replace that figure's big circles, each of which shows that the parity of four particular bits is even, by a 'parity-check node' that is connected to the four bits, then we obtain the representation of the (7,4) Hamming code by a bipartite graph as shown in figure 1.20.

[Figure 1.20. The graph of the (7,4) Hamming code. The 7 circles are the bit nodes and the 3 squares are the parity-check nodes.]

The 7 circles are the 7 transmitted bits. The 3 squares are the parity-check nodes (not to be confused with the 3 parity-check bits, which are the three most peripheral circles). The graph is a 'bipartite' graph because its nodes fall into two classes - bits and checks

- and there are edges only between nodes in different classes. The graph and the code's parity-check matrix (1.30) are simply related to each other: each parity-check node corresponds to a row of H and each bit node corresponds to a column of H; for every 1 in H, there is an edge between the corresponding pair of nodes.

Having noticed this connection between linear codes and graphs, one way to invent linear codes is simply to think of a bipartite graph. For example, a pretty bipartite graph can be obtained from a dodecahedron by calling the vertices of the dodecahedron the parity-check nodes, and putting a transmitted bit on each edge in the dodecahedron. This construction defines a parity-check matrix in which every column has weight 2 and every row has weight 3. [The weight of a binary vector is the number of 1s it contains.]

[Figure 1.21. The graph defining the (30,11) dodecahedron code. The circles are the 30 transmitted bits and the triangles are the 20 parity checks. One parity check is redundant.]

This code has N = 30 bits, and it appears to have M_apparent = 20 parity-check constraints. Actually, there are only M = 19 independent constraints; the 20th constraint is redundant (that is, if 19 constraints are satisfied, then the 20th is automatically satisfied); so the number of source bits is K = N - M = 11. The code is a (30,11) code.

It is hard to find a decoding algorithm for this code, but we can estimate its probability of error by finding its lowest-weight codewords.
If we flip all the bits surrounding one face of the original dodecahedron, then all the parity checks will be satisfied; so the code has 12 codewords of weight 5, one for each face. Since the lowest-weight codewords have weight 5, we say that the code has distance d = 5; the (7,4) Hamming code had distance 3 and could correct all single bit-flip errors. A code with distance 5 can correct all double bit-flip errors, but there are some triple bit-flip errors that it cannot correct. So the error probability of this code, assuming a binary symmetric channel, will be dominated, at least for low noise levels f, by a term of order f^3, perhaps something like

  12 \binom{5}{3} f^3 (1-f)^{27}.   (1.55)

[Figure 1.22. Graph of a rate-1/4 low-density parity-check code (Gallager code) with blocklength N = 16 and M = 12 parity-check constraints. Each white circle represents a transmitted bit. Each bit participates in j = 3 constraints, represented by squares. The edges between the nodes were placed at random.]

Of course, there is no obligation to make codes whose graphs can be represented on a plane, as this one can; the best linear codes, which have simple graphical descriptions, have graphs that are more tangled, as illustrated by the tiny (16,4) code of figure 1.22. (See Chapter 47 for more.)

Furthermore, there is no reason for sticking to linear codes; indeed some nonlinear codes - codes whose codewords cannot be defined by a linear equation like Ht = 0 - have very good properties. But the encoding and decoding of a nonlinear code are even trickier tasks.

Solution to exercise 1.10 (p.14). First let's assume we are making a linear code and decoding it with syndrome decoding. If there are N transmitted bits, the number of possible error patterns of weight up to two is

  \binom{N}{2} + \binom{N}{1} + \binom{N}{0}.   (1.56)

For N = 14, that's 91 + 14 + 1 = 106 patterns.
Now, every distinguishable error pattern must give rise to a distinct syndrome; and the syndrome is a list of M bits, so the maximum possible number of syndromes is 2^M. For a (14,8) code, M = 6, so there are at most 2^6 = 64 syndromes. The number of possible error patterns of weight up to two, 106, is bigger than the number of syndromes, 64, so we can immediately rule out the possibility that there is a (14,8) code that is 2-error-correcting.

The same counting argument works fine for nonlinear codes too. When the decoder receives r = t + n, his aim is to deduce both t and n from r. If it is the case that the sender can select any transmission t from a code of size S_t, and the channel can select any noise vector from a set of size S_n, and those two selections can be recovered from the received bit string r, which is one of at most 2^N possible strings, then it must be the case that

  S_t S_n \le 2^N.   (1.57)

So, whether the code is linear or nonlinear, for a (N,K) two-error-correcting code,

  2^K \left[ \binom{N}{0} + \binom{N}{1} + \binom{N}{2} \right] \le 2^N.   (1.58)

Solution to exercise 1.11 (p.14). There are various strategies for making codes that can correct multiple errors, and I strongly recommend you think out one or two of them for yourself.

If your approach uses a linear code, e.g., one with a collection of M parity checks, it is helpful to bear in mind the counting argument given in the previous exercise, in order to anticipate how many parity checks, M, you might need.

Examples of codes that can correct any two errors are the (30,11) dodecahedron code on page 20, and the (15,6) pentagonful code to be introduced on p.221. Further simple ideas for making codes that can correct multiple errors from codes that can correct only one error are discussed in section 13.7.
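The counting argument of exercises 1.10-1.11 is easy to automate. The sketch below (the helper names are mine) checks the bound (1.58): it reproduces the 106-versus-64 comparison that rules out a two-error-correcting (14,8) code, and shows that at N = 15 the bound itself no longer forbids K = 8. (The bound is only a necessary condition, not a sufficient one, so this does not prove such a code exists.)

```python
from math import comb

def patterns_up_to(N, t):
    """Number of error patterns of weight 0..t on N bits."""
    return sum(comb(N, w) for w in range(t + 1))

def bound_allows(N, K, t=2):
    """Necessary condition (1.58): 2^K * (number of patterns) <= 2^N."""
    return 2**K * patterns_up_to(N, t) <= 2**N

print(patterns_up_to(14, 2))   # 106 error patterns of weight up to two
print(2**(14 - 8))             # but only 64 syndromes for a (14,8) code
print(bound_allows(14, 8))     # False: no 2-error-correcting (14,8) code
print(bound_allows(15, 8))     # True: the counting bound alone no longer objects
```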
Solution to exercise 1.12 (p.16). The probability of error of R_3^2 is, to leading order,

  p_b(R_3^2) \simeq 3 [ p_b(R_3) ]^2 = 3 (3 f^2)^2 = 27 f^4,   (1.59)

whereas the probability of error of R_9 is dominated by the probability of five flips,

  p_b(R_9) \simeq \binom{9}{5} f^5 (1-f)^4 \simeq 126 f^5.   (1.60)

The R_3^2 decoding procedure is therefore suboptimal, since there are noise vectors of weight four that cause it to make a decoding error.

It has the advantage, however, of requiring smaller computational resources: only memorization of three bits, and counting up to three, rather than counting up to nine.

This simple code illustrates an important concept. Concatenated codes are widely used in practice because concatenation allows large codes to be implemented using simple encoding and decoding hardware. Some of the best known practical codes are concatenated codes.
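A quick numerical comparison of (1.59) and (1.60), using exact binomial tails rather than only the leading-order terms (the helper name is mine):

```python
from math import comb

def bsc_tail(N, f, k_min):
    """P(k_min or more of N bits flipped) on a BSC with flip probability f."""
    return sum(comb(N, k) * f**k * (1 - f)**(N - k) for k in range(k_min, N + 1))

f = 0.01
p3 = bsc_tail(3, f, 2)     # R_3: majority vote fails when >= 2 of 3 are flipped
p33 = bsc_tail(3, p3, 2)   # R_3^2: the outer R_3 decoder sees flip rate p3
p9 = bsc_tail(9, f, 5)     # R_9: majority vote over all 9 copies
print(p33 > p9)            # True: the concatenated decoder is suboptimal
print(p33 / (27 * f**4), p9 / (126 * f**5))   # both ratios are near 1
```

The first print confirms that the two-stage decoder loses to the optimal R_9 decoder; the ratios confirm the leading-order formulas.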

2

Probability, Entropy, and Inference

This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between the song, the name of the song, and what the name of the song was called (Carroll, 1998), we will sometimes need to be careful to distinguish between a random variable, the value of the random variable, and the proposition that asserts that the random variable has a particular value. In any particular chapter, however, I will use the most simple and friendly notation possible, at the risk of upsetting pure-minded readers. For example, if something is 'true with probability 1', I will usually simply say that it is 'true'.

2.1 Probabilities and ensembles

An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value of a random variable, which takes on one of a set of possible values, A_X = {a_1, a_2, ..., a_i, ..., a_I}, having probabilities P_X = {p_1, p_2, ..., p_I}, with P(x = a_i) = p_i, p_i \ge 0 and \sum_{a_i \in A_X} P(x = a_i) = 1.

The name A is mnemonic for 'alphabet'. One example of an ensemble is a letter that is randomly selected from an English document. This ensemble is shown in figure 2.1. There are twenty-seven possible letters: a-z, and a space character '-'.

[Figure 2.1. Probability distribution over the 27 outcomes for a randomly selected letter in an English language document (estimated from The Frequently Asked Questions Manual for Linux). The picture shows the probabilities by the areas of white squares.]

  i   a_i  p_i        i   a_i  p_i        i   a_i  p_i
  1   a   0.0575     10   j   0.0006     19   s   0.0567
  2   b   0.0128     11   k   0.0084     20   t   0.0706
  3   c   0.0263     12   l   0.0335     21   u   0.0334
  4   d   0.0285     13   m   0.0235     22   v   0.0069
  5   e   0.0913     14   n   0.0596     23   w   0.0119
  6   f   0.0173     15   o   0.0689     24   x   0.0073
  7   g   0.0133     16   p   0.0192     25   y   0.0164
  8   h   0.0313     17   q   0.0008     26   z   0.0007
  9   i   0.0599     18   r   0.0508     27   -   0.1928

Abbreviations. Briefer notation will sometimes be used. For example, P(x = a_i) may be written as P(a_i) or P(x).
Probability of a subset. If T is a subset of A_X then:

  P(T) = P(x \in T) = \sum_{a_i \in T} P(x = a_i).   (2.1)

For example, if we define V to be vowels from figure 2.1, V = {a, e, i, o, u}, then

  P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31.   (2.2)

A joint ensemble XY is an ensemble in which each outcome is an ordered pair x, y with x \in A_X = {a_1, ..., a_I} and y \in A_Y = {b_1, ..., b_J}.

We call P(x, y) the joint probability of x and y.

Commas are optional when writing ordered pairs, so xy <=> x, y.

N.B. In a joint ensemble XY the two variables are not necessarily independent.
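Equation (2.1) is easy to apply to the monogram probabilities of figure 2.1. The Python fragment below (dictionary transcribed from the figure) recomputes P(V); the unrounded value is 0.311, which the text rounds to 0.31:

```python
# Monogram probabilities transcribed from figure 2.1 ('-' is the space).
p = {'a': 0.0575, 'b': 0.0128, 'c': 0.0263, 'd': 0.0285, 'e': 0.0913,
     'f': 0.0173, 'g': 0.0133, 'h': 0.0313, 'i': 0.0599, 'j': 0.0006,
     'k': 0.0084, 'l': 0.0335, 'm': 0.0235, 'n': 0.0596, 'o': 0.0689,
     'p': 0.0192, 'q': 0.0008, 'r': 0.0508, 's': 0.0567, 't': 0.0706,
     'u': 0.0334, 'v': 0.0069, 'w': 0.0119, 'x': 0.0073, 'y': 0.0164,
     'z': 0.0007, '-': 0.1928}

assert abs(sum(p.values()) - 1.0) < 0.001   # normalized, up to rounding

P_V = sum(p[letter] for letter in 'aeiou')  # equation (2.1) with T = vowels
print(round(P_V, 3))                         # 0.311, i.e. P(V) = 0.31
```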

[Figure 2.2. The probability distribution over the 27 x 27 possible bigrams xy in an English language document, The Frequently Asked Questions Manual for Linux.]

Marginal probability. We can obtain the marginal probability P(x) from the joint probability P(x, y) by summation:

  P(x = a_i) \equiv \sum_{y \in A_Y} P(x = a_i, y).   (2.3)

Similarly, using briefer notation, the marginal probability of y is:

  P(y) \equiv \sum_{x \in A_X} P(x, y).   (2.4)

Conditional probability

  P(x = a_i | y = b_j) \equiv \frac{P(x = a_i, y = b_j)}{P(y = b_j)}  if P(y = b_j) \ne 0.   (2.5)

[If P(y = b_j) = 0 then P(x = a_i | y = b_j) is undefined.]

We pronounce P(x = a_i | y = b_j) 'the probability that x equals a_i, given y equals b_j'.

Example 2.1. An example of a joint ensemble is the ordered pair XY consisting of two successive letters in an English document. The possible outcomes are ordered pairs such as aa, ab, ac, and zz; of these, we might expect ab and ac to be more probable than aa and zz. An estimate of the joint probability distribution for two neighbouring characters is shown graphically in figure 2.2.

This joint ensemble has the special property that its two marginal distributions, P(x) and P(y), are identical. They are both equal to the monogram distribution shown in figure 2.1.

From this joint ensemble P(x, y) we can obtain conditional distributions, P(y|x) and P(x|y), by normalizing the rows and columns, respectively (figure 2.3). The probability P(y | x = q) is the probability distribution of the second letter given that the first letter is a q.
As you can see in figure 2.3a, the two most probable values for the second letter y given

that the first letter x is q are u and -. (The space is common after q because the source document makes heavy use of the word FAQ.)

[Figure 2.3. Conditional probability distributions. (a) P(y|x): Each row shows the conditional distribution of the second letter, y, given the first letter, x, in a bigram xy. (b) P(x|y): Each column shows the conditional distribution of the first letter, x, given the second letter, y.]

The probability P(x | y = u) is the probability distribution of the first letter x given that the second letter y is a u. As you can see in figure 2.3b the two most probable values for x given y = u are n and o.

Rather than writing down the joint probability directly, we often define an ensemble in terms of a collection of conditional probabilities. The following rules of probability theory will be useful. (H denotes assumptions on which the probabilities are based.)

Product rule - obtained from the definition of conditional probability:

  P(x, y | H) = P(x | y, H) P(y | H) = P(y | x, H) P(x | H).   (2.6)

This rule is also known as the chain rule.

Sum rule - a rewriting of the marginal probability definition:

  P(x | H) = \sum_y P(x, y | H)   (2.7)
           = \sum_y P(x | y, H) P(y | H).   (2.8)

Bayes' theorem - obtained from the product rule:

  P(y | x, H) = \frac{P(x | y, H) P(y | H)}{P(x | H)}   (2.9)
              = \frac{P(x | y, H) P(y | H)}{\sum_{y'} P(x | y', H) P(y' | H)}.   (2.10)

Independence.
Two random variables X and Y are independent (sometimes written X \perp Y) if and only if

  P(x, y) = P(x) P(y).   (2.11)

Exercise 2.2. [1, p.40] Are the random variables X and Y in the joint ensemble of figure 2.2 independent?
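The sum rule, product rule, and Bayes' theorem (2.6)-(2.10) can be checked mechanically on any small joint distribution. Below is a sketch using an arbitrary made-up 2x2 joint (not the bigram data of figure 2.2, which is not tabulated here):

```python
# A made-up joint distribution P(x, y) over x in {0,1}, y in {0,1}.
P = {(0, 0): 0.30, (0, 1): 0.20,
     (1, 0): 0.10, (1, 1): 0.40}

Px = {x: sum(P[(x, y)] for y in (0, 1)) for x in (0, 1)}   # sum rule (2.7)
Py = {y: sum(P[(x, y)] for x in (0, 1)) for y in (0, 1)}   # marginal (2.4)

# Conditional (2.5); the product rule (2.6) then reassembles the joint.
Px_given_y = {(x, y): P[(x, y)] / Py[y] for (x, y) in P}
for (x, y) in P:
    assert abs(Px_given_y[(x, y)] * Py[y] - P[(x, y)]) < 1e-12

# Bayes' theorem (2.10): recover P(y | x) from P(x | y) and P(y).
x = 1
post = {y: Px_given_y[(x, y)] * Py[y] for y in (0, 1)}
Z = sum(post.values())                  # the denominator, which equals P(x)
post = {y: post[y] / Z for y in post}
assert abs(Z - Px[x]) < 1e-12
print(round(post[1], 3))                # P(y=1 | x=1) = 0.4/0.5 = 0.8
```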

I said that we often define an ensemble in terms of a collection of conditional probabilities. The following example illustrates this idea.

Example 2.3. Jo has a test for a nasty disease. We denote Jo's state of health by the variable a and the test result by b.

  a = 1  Jo has the disease
  a = 0  Jo does not have the disease.   (2.12)

The result of the test is either 'positive' (b = 1) or 'negative' (b = 0); the test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained. The final piece of background information is that 1% of people of Jo's age and background have the disease.

OK - Jo has the test, and the result is positive. What is the probability that Jo has the disease?

Solution. We write down all the provided probabilities. The test reliability specifies the conditional probability of b given a:

  P(b = 1 | a = 1) = 0.95    P(b = 1 | a = 0) = 0.05
  P(b = 0 | a = 1) = 0.05    P(b = 0 | a = 0) = 0.95;   (2.13)

and the disease prevalence tells us about the marginal probability of a:

  P(a = 1) = 0.01    P(a = 0) = 0.99.   (2.14)

From the marginal P(a) and the conditional probability P(b | a) we can deduce the joint probability P(a, b) = P(a) P(b | a) and any other probabilities we are interested in.
For example, by the sum rule, the marginal probability of b = 1 - the probability of getting a positive result - is

  P(b = 1) = P(b = 1 | a = 1) P(a = 1) + P(b = 1 | a = 0) P(a = 0).   (2.15)

Jo has received a positive result b = 1 and is interested in how plausible it is that she has the disease (i.e., that a = 1). The man in the street might be duped by the statement 'the test is 95% reliable, so Jo's positive result implies that there is a 95% chance that Jo has the disease', but this is incorrect; the correct solution to an inference problem is found using Bayes' theorem.

  P(a = 1 | b = 1) = \frac{P(b = 1 | a = 1) P(a = 1)}{P(b = 1 | a = 1) P(a = 1) + P(b = 1 | a = 0) P(a = 0)}   (2.16)
                   = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99}   (2.17)
                   = 0.16.   (2.18)

So in spite of the positive result, the probability that Jo has the disease is only 16%.  (end of solution)

2.2 The meaning of probability

Probabilities can be used in two ways.

Probabilities can describe frequencies of outcomes in random experiments, but giving noncircular definitions of the terms 'frequency' and 'random' is a challenge - what does it mean to say that the frequency of a tossed coin's
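The arithmetic in (2.16)-(2.18) is worth automating, if only to experiment with other reliabilities and prevalences; here is a minimal sketch (the function name is mine, not the book's):

```python
def posterior_disease(prior, sensitivity, specificity):
    """P(a=1 | b=1) via Bayes' theorem, as in equations (2.16)-(2.18).

    prior       = P(a=1), the disease prevalence
    sensitivity = P(b=1 | a=1), true-positive rate
    specificity = P(b=0 | a=0), true-negative rate
    """
    evidence = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / evidence

p = posterior_disease(prior=0.01, sensitivity=0.95, specificity=0.95)
print(round(p, 2))   # 0.16: a positive result leaves only a 16% chance
```

Raising the prevalence to 50% makes the posterior 95%, which is exactly the man-in-the-street answer: his reasoning implicitly assumes a uniform prior.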

[Box 2.4. The Cox axioms.

Notation. Let 'the degree of belief in proposition x' be denoted by B(x). The negation of x (not-x) is written \bar{x}. The degree of belief in a conditional proposition, 'x, assuming proposition y to be true', is represented by B(x | y).

Axiom 1. Degrees of belief can be ordered; if B(x) is 'greater' than B(y), and B(y) is 'greater' than B(z), then B(x) is 'greater' than B(z). [Consequence: beliefs can be mapped onto real numbers.]

Axiom 2. The degree of belief in a proposition x and its negation \bar{x} are related. There is a function f such that B(x) = f[B(\bar{x})].

Axiom 3. The degree of belief in a conjunction of propositions x, y (x and y) is related to the degree of belief in the conditional proposition x | y and the degree of belief in the proposition y. There is a function g such that B(x, y) = g[B(x | y), B(y)].

If a set of beliefs satisfy these axioms then they can be mapped onto probabilities satisfying P(false) = 0, P(true) = 1, 0 \le P(x) \le 1, and the rules of probability: P(x) = 1 - P(\bar{x}), and P(x, y) = P(x | y) P(y).]

coming up heads is 1/2? If we say that this frequency is the average fraction of heads in long sequences, we have to define 'average'; and it is hard to define 'average' without using a word synonymous to probability! I will not attempt to cut this philosophical knot.

Probabilities can also be used, more generally, to describe degrees of belief in propositions that do not involve random variables - for example 'the probability that Mr. S. was the murderer of Mrs.
S., given the evidence' (he either was or wasn't, and it's the jury's job to assess how probable it is that he was); 'the probability that Thomas Jefferson had a child by one of his slaves'; 'the probability that Shakespeare's plays were written by Francis Bacon'; or, to pick a modern-day example, 'the probability that a particular signature on a particular cheque is genuine'.

The man in the street is happy to use probabilities in both these ways, but some books on probability restrict probabilities to refer only to frequencies of outcomes in repeatable random experiments.

Nevertheless, degrees of belief can be mapped onto probabilities if they satisfy simple consistency rules known as the Cox axioms (Cox, 1946) (figure 2.4). Thus probabilities can be used to describe assumptions, and to describe inferences given those assumptions. The rules of probability ensure that if two people make the same assumptions and receive the same data then they will draw identical conclusions. This more general use of probability to quantify beliefs is known as the Bayesian viewpoint. It is also known as the subjective interpretation of probability, since the probabilities depend on assumptions. Advocates of a Bayesian approach to data modelling and pattern recognition do not view this subjectivity as a defect, since in their view, you cannot do inference without making assumptions.

In this book it will from time to time be taken for granted that a Bayesian approach makes sense, but the reader is warned that this is not yet a globally held view - the field of statistics was dominated for most of the 20th century by non-Bayesian methods in which probabilities are allowed to describe only random variables. The big difference between the two approaches is that

39 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. Forw ard and inverse probabilities 27 2.3: probabilities use probabilities e infer ences . Bayesians also to describ and 2.3 inverse Forward probabilities probabilities into one fall forw ard prob- often y calculations Probabilit of two categories: inverse abilit y . Here is an example of a forw ard probabilit y y and probabilit problem: 2, p.40 ] [ An urn con tains K balls, of whic h B are blac k and W Exercise 2.4. = B are white. Fred dra ws a ball at random from the urn and replaces K times. it, N probabilit y distribution of the num ber of times a blac What (a) is the k wn, n is dra ? ball B is the n ectation of What (b) ? What is the variance of n exp ? What B B deviation of n is the ? Giv standard answ ers for the e numerical B N = 5 and N = 400, when cases = 2 and K = 10. B Forw probabilit y problems involve a generativ e mo del that describ es a pro- ard that is assumed to some data; the task is to compute the cess to give rise ends or exp quan tity that dep of some on the y distribution probabilit ectation is another example of a forw ard probabilit y problem: data. Here [ p.40 ] 2, An urn con tains K balls, 2.5. h B are blac k and W = Exercise of whic B are white. We de ne the fraction f N K B=K . Fred dra ws B blac from exactly as in exercise 2.4, obtaining n the urn, ks, and times B computes the quan tity 2 ( ) n f N B B (2.19) : z = f ) Nf (1 B B is the exp ectation What z ? In the case N = 5 and f 5, what = 1 = of B is the probabilit y distribution of z ? What is the probabilit y that z < 1? [Hin t: compare with the quan tities computed in the previous exercise.] 
z e forw ard y problems, inverse probability problems involve a Lik probabilit the del but instead of computing cess, probabilit y distri- e mo of a pro generativ quan tity produc ed by the pro cess, we compute the conditional bution of some or more probabilit unobserve d variables in the pro cess, given y of one of the observ This invariably requires the use of Bayes' theorem. the ed variables. 2.6. There are elev en urns lab Example by u 2f 0 ; 1 ; 2 ;::: ; 10 g , eac h con- elled taining balls. Urn u con tains u blac k balls and 10 u white balls. ten selects dra urn u at random and Fred ws N times with replacemen t an N from obtaining n friend, blac ks and urn, n Fred's whites. that B B Bill, looks on. If after N = 10 dra ws n wn, = 3 blac ks have been dra B what probabilit y that the urn Fred is using is urn u , from Bill's is the (Bill kno w the value of u .) point of view? doesn't The join t probabilit y distribution of the random variables u and n Solution. B can be written j ( (2.20) j N ) = P ( P : u;n u;N ) P ( u ) n B B From the join t probabilit y of u and n conditional , we can obtain the B u given n : distribution of B ) N j u;n ( P B (2.21) P n ) = j u ( ;N B ( n j N ) P B u;N u n P j ( ) P ( ) B (2.22) : = ( n j N ) P B

The marginal probability of u is P(u) = 1/11 for all u. You wrote down the probability of n_B given u and N, P(n_B | u, N), when you solved exercise 2.4 (p.27). [You are doing the highly recommended exercises, aren't you?] If we define f_u ≡ u/10 then

    P(n_B | u, N) = (N choose n_B) f_u^{n_B} (1 − f_u)^{N − n_B}.   (2.23)

What about the denominator, P(n_B | N)? This is the marginal probability of n_B, which we can obtain using the sum rule:

    P(n_B | N) = Σ_u P(u, n_B | N) = Σ_u P(u) P(n_B | u, N).   (2.24)

So the conditional probability of u given n_B is

    P(u | n_B, N) = P(u) P(n_B | u, N) / P(n_B | N)   (2.25)
                  = (1/11) (1 / P(n_B | N)) (N choose n_B) f_u^{n_B} (1 − f_u)^{N − n_B}.   (2.26)

[Figure 2.5 (not reproduced here) shows the joint probability of u and n_B for Bill and Fred's urn problem, after N = 10 draws.]

This conditional distribution can be found by normalizing column 3 of figure 2.5 and is shown in figure 2.6. The normalizing constant, the marginal probability of n_B, is P(n_B = 3 | N = 10) = 0.083.

Figure 2.6. Conditional probability of u given n_B = 3 and N = 10:

    u     P(u | n_B = 3, N = 10)
    0     0
    1     0.063
    2     0.22
    3     0.29
    4     0.24
    5     0.13
    6     0.047
    7     0.0099
    8     0.00086
    9     0.0000096
    10    0

The posterior probability (2.26) is correct for all u, including the end-points u = 0 and u = 10, where f_u = 0 and f_u = 1 respectively. The posterior probability that u = 0 given n_B = 3 is equal to zero, because if Fred were drawing from urn 0 it would be impossible for any black balls to be drawn. The posterior probability that u = 10 is also zero, because there are no white balls in that urn. The other hypotheses u = 1, u = 2, ..., u = 9 all have non-zero posterior probability. □

Terminology of inverse probability

In inverse probability problems it is convenient to give names to the probabilities appearing in Bayes' theorem. In equation (2.25), we call the marginal probability P(u) the prior probability of u, and P(n_B | u, N) is called the likelihood of u. It is important to note that the terms likelihood and probability are not synonyms. The quantity P(n_B | u, N) is a function of both n_B and u. For fixed u, P(n_B | u, N) defines a probability over n_B. For fixed n_B, P(n_B | u, N) defines the likelihood of u.
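This distinction is easy to see numerically. The following sketch (in Python; the code and function names are illustrative, not part of the text) uses the binomial model of equation (2.23): for fixed u the values P(n_B | u, N) sum to one over the outcomes n_B, but for fixed data n_B = 3 the same quantity, viewed as the likelihood of u, does not sum to one over u.

```python
from math import comb

def p_nb_given_u(n_b, u, N=10):
    """Binomial P(n_B | u, N) of equation (2.23), with f_u = u/10."""
    f_u = u / 10
    return comb(N, n_b) * f_u**n_b * (1 - f_u)**(N - n_b)

# Fixed u: a probability distribution over the outcomes n_B = 0..10.
print(sum(p_nb_given_u(n, u=3) for n in range(11)))   # 1.0 (up to rounding)

# Fixed data n_B = 3: the likelihood of u, not a distribution over u.
print(sum(p_nb_given_u(3, u) for u in range(11)))     # about 0.91
```

Dividing likelihood-times-prior by its sum over u is exactly the normalization performed in equation (2.25).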

Never say 'the likelihood of the data'. Always say 'the likelihood of the parameters'. The likelihood function is not a probability distribution. (If you want to mention the data that a likelihood function is associated with, you may say 'the likelihood of the parameters given the data'.)

The conditional probability P(u | n_B, N) is called the posterior probability of u given n_B. The normalizing constant P(n_B | N) has no u-dependence so its value is not important if we simply wish to evaluate the relative probabilities of the alternative hypotheses u. However, in most data-modelling problems of any complexity, this quantity becomes important, and it is given various names: P(n_B | N) is known as the evidence or the marginal likelihood.

If θ denotes the unknown parameters, D denotes the data, and H denotes the overall hypothesis space, the general equation:

    P(θ | D, H) = P(D | θ, H) P(θ | H) / P(D | H)   (2.27)

is written:

    posterior = likelihood × prior / evidence.   (2.28)

Inverse probability and prediction

Example 2.6 (continued). Assuming again that Bill has observed n_B = 3 blacks in N = 10 draws, let Fred draw another ball from the same urn. What is the probability that the next drawn ball is a black? [You should make use of the posterior probabilities in figure 2.6.]

Solution. By the sum rule,

    P(ball_{N+1} is black | n_B, N) = Σ_u P(ball_{N+1} is black | u, n_B, N) P(u | n_B, N).   (2.29)

Since the balls are drawn with replacement from the chosen urn, the probability P(ball_{N+1} is black | u, n_B, N) is just f_u = u/10, whatever n_B and N are. So

    P(ball_{N+1} is black | n_B, N) = Σ_u f_u P(u | n_B, N).   (2.30)

Using the values of P(u | n_B, N) given in figure 2.6 we obtain

    P(ball_{N+1} is black | n_B = 3, N = 10) = 0.333.   (2.31)

Comment. Notice the difference between this prediction obtained using probability theory, and the widespread practice in statistics of making predictions by first selecting the most plausible hypothesis (which here would be that the urn is urn u = 3) and then making the predictions assuming that hypothesis to be true (which would give a probability of 0.3 that the next ball is black). The correct prediction is the one that takes into account the uncertainty by marginalizing over the possible values of the hypothesis u. Marginalization here leads to slightly more moderate, less extreme predictions. □
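The whole of example 2.6, including the prediction, fits in a few lines of code. A sketch in Python (illustrative code, not from the text), which recovers the posterior of figure 2.6 and the predictive probability (2.31):

```python
from math import comb

N, n_B = 10, 3
urns = range(11)                     # urn u holds u black balls out of ten

# Likelihood (2.23) times the uniform prior P(u) = 1/11:
joint = [comb(N, n_B) * (u / 10) ** n_B * (1 - u / 10) ** (N - n_B) / 11
         for u in urns]

evidence = sum(joint)                        # P(n_B | N), about 0.083
posterior = [j / evidence for j in joint]    # equation (2.26)
print(round(posterior[3], 2))                # 0.29, as in figure 2.6

# Prediction (2.30): average f_u = u/10 under the posterior.
p_black = sum((u / 10) * posterior[u] for u in urns)
print(round(p_black, 3))                     # 0.333, equation (2.31)
```

Note how the end-point urns drop out automatically: posterior[0] and posterior[10] are exactly zero.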

Inference as inverse probability

Now consider the following exercise, which has the character of a simple scientific investigation.

Example 2.7. Bill tosses a bent coin N times, obtaining a sequence of heads and tails. We assume that the coin has a probability f_H of coming up heads; we do not know f_H. If n_H heads have occurred in N tosses, what is the probability distribution of f_H? (For example, N might be 10, and n_H might be 3; or, after a lot more tossing, we might have N = 300 and n_H = 29.) What is the probability that the (N+1)th outcome will be a head, given n_H heads in N tosses?

Unlike example 2.6 (p.27), this problem has a subjective element. Given a restricted definition of probability that says 'probabilities are the frequencies of random variables', this example is different from the eleven-urns example. Whereas the urn u was a random variable, the bias f_H of the coin would not normally be called a random variable. It is just a fixed but unknown parameter that we are interested in. Yet don't the two examples 2.6 and 2.7 seem to have an essential similarity? [Especially when N = 10 and n_H = 3!]

To solve example 2.7, we have to make an assumption about what the bias of the coin f_H might be. This prior probability distribution over f_H, P(f_H), corresponds to the prior over u in the eleven-urns problem. [Here, P(f_H) denotes a probability density, rather than a probability distribution.] In that example, the helpful problem definition specified P(u). In real life, we have to make assumptions in order to assign priors; these assumptions will be subjective, and answers will depend on them. Exactly the same can be said for the other probabilities in our generative model too. We are assuming, for example, that the balls are drawn from an urn independently; but could there not be correlations in the sequence because Fred's ball-drawing action is not perfectly random? Indeed there could be, so the likelihood function that we use depends on assumptions too. In real data modelling problems, priors are subjective and so are likelihoods.

We are now using P() to denote probability densities over continuous variables as well as probabilities over discrete variables and probabilities of logical propositions. The probability that a continuous variable v lies between values a and b (where b > a) is defined to be ∫_a^b dv P(v). P(v)dv is dimensionless. The density P(v) is a dimensional quantity, having dimensions inverse to the dimensions of v -- in contrast to discrete probabilities, which are dimensionless. Don't be surprised to see probability densities greater than 1. This is normal, and nothing is wrong, as long as ∫_a^b dv P(v) ≤ 1 for any interval (a, b). Conditional and joint probability densities are defined in just the same way as conditional and joint probabilities.

Exercise 2.8. [2] Assuming a uniform prior on f_H, P(f_H) = 1, solve the problem posed in example 2.7 (p.30). Sketch the posterior distribution of f_H and compute the probability that the (N+1)th outcome will be a head, for
(a) N = 3 and n_H = 0;
(b) N = 3 and n_H = 2;
(c) N = 10 and n_H = 3;
(d) N = 300 and n_H = 29.

You will find the beta integral useful:

    ∫_0^1 dp_a p_a^{F_a} (1 − p_a)^{F_b} = Γ(F_a + 1) Γ(F_b + 1) / Γ(F_a + F_b + 2) = F_a! F_b! / (F_a + F_b + 1)!   (2.32)
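The beta integral (2.32) can be sanity-checked numerically before being trusted in exercise 2.8. A sketch in Python (illustrative; the step count is an arbitrary choice), comparing a midpoint-rule sum with the factorial formula:

```python
from math import factorial

def beta_integral(F_a, F_b, steps=100000):
    """Midpoint-rule estimate of the integral of p^F_a (1-p)^F_b over [0, 1]."""
    dp = 1.0 / steps
    return sum(((i + 0.5) * dp) ** F_a * (1 - (i + 0.5) * dp) ** F_b * dp
               for i in range(steps))

for F_a, F_b in [(0, 0), (3, 7), (29, 271)]:
    exact = factorial(F_a) * factorial(F_b) / factorial(F_a + F_b + 1)
    # The two agree to better than one part in ten thousand:
    assert abs(beta_integral(F_a, F_b) / exact - 1) < 1e-4
```

The case (F_a, F_b) = (29, 271) matches the data of part (d), n_H = 29 heads in N = 300 tosses.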

You may also find it instructive to look back at example 2.6 (p.27) and equation (2.31).

People sometimes confuse assigning a prior distribution to an unknown parameter such as f_H with making an initial guess of the value of the parameter. But the prior over f_H, P(f_H), is not a simple statement like 'initially, I would guess f_H = 1/2'. The prior is a probability density over f_H which specifies the prior degree of belief that f_H lies in any interval (f, f + δf). It may well be the case that our prior for f_H is symmetric about 1/2, so that the mean of f_H under the prior is 1/2. In this case, the predictive distribution for the first toss x_1 would indeed be

    P(x_1 = head) = ∫ df_H P(f_H) P(x_1 = head | f_H) = ∫ df_H P(f_H) f_H = 1/2.   (2.33)

But the prediction for subsequent tosses will depend on the whole prior distribution, not just its mean.

Data compression and inverse probability

Consider the following task.

Example 2.9. Write a computer program capable of compressing binary files like this one:

    [a string of 300 binary digits, mostly 0s, not reproduced here]

The string shown contains n_1 = 29 1s and n_0 = 271 0s.

Intuitively, compression works by taking advantage of the predictability of a file. In this case, the source of the file appears more likely to emit 0s than 1s. A data compression program that compresses this file must, implicitly or explicitly, be addressing the question 'What is the probability that the next character in this file is a 1?'

Do you think this problem is similar in character to example 2.7 (p.30)? I do. One of the themes of this book is that data compression and data modelling are one and the same, and that they should both be addressed, like the urn of example 2.6, using inverse probability. Example 2.9 is solved in Chapter 6.

The likelihood principle

Please solve the following two exercises.

Example 2.10. Urn A contains three balls: one black, and two white; urn B contains three balls: two black, and one white. One of the urns is selected at random and one ball is drawn. The ball is black. What is the probability that the selected urn is urn A?

[Figure 2.7. Urns for example 2.10.]

Example 2.11. Urn A contains five balls: one black, two white, one green and one pink; urn B contains five hundred balls: two hundred black, one hundred white, 40 yellow, 50 silver, 25 green, 25 sienna, 30 cyan, 20 gold, and 10 purple. [One fifth of A's balls are black; two-fifths of B's are black.] One of the urns is selected at random and one ball is drawn. The ball is black. What is the probability that the urn is urn A?

[Figure 2.8. Urns for example 2.11.]

What do you notice about your solutions? Does each answer depend on the detailed contents of each urn?

The details of the other possible outcomes and their probabilities are irrelevant. All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle.

The likelihood principle: given a generative model for data d given parameters θ, P(d | θ), and having observed a particular outcome d_1, all inferences and predictions should depend only on the function P(d_1 | θ).

In spite of the simplicity of this principle, many classical statistical methods violate it.

2.4 Definition of entropy and related functions

The Shannon information content of an outcome x is defined to be

    h(x) = log_2 (1 / P(x)).   (2.34)

It is measured in bits. [The word 'bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.]

In the next few chapters, we will establish that the Shannon information content h(a_i) is indeed a natural measure of the information content of the event x = a_i. At that point, we will shorten the name of this quantity to 'the information content'.

Table 2.9. Shannon information contents of the outcomes a–z.

    i     a_i    p_i      h(p_i)
    1     a     .0575    4.1
    2     b     .0128    6.3
    3     c     .0263    5.2
    4     d     .0285    5.1
    5     e     .0913    3.5
    6     f     .0173    5.9
    7     g     .0133    6.2
    8     h     .0313    5.0
    9     i     .0599    4.1
    10    j     .0006    10.7
    11    k     .0084    6.9
    12    l     .0335    4.9
    13    m     .0235    5.4
    14    n     .0596    4.1
    15    o     .0689    3.9
    16    p     .0192    5.7
    17    q     .0008    10.3
    18    r     .0508    4.3
    19    s     .0567    4.1
    20    t     .0706    3.8
    21    u     .0334    4.9
    22    v     .0069    7.2
    23    w     .0119    6.4
    24    x     .0073    7.1
    25    y     .0164    5.9
    26    z     .0007    10.4
    27    -     .1928    2.4

    Σ_i p_i log_2 (1/p_i) = 4.1

The fourth column in table 2.9 shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome x = z has a Shannon information content of 10.4 bits, and x = e has an information content of 3.5 bits.

The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:

    H(X) ≡ Σ_{x ∈ A_X} P(x) log_2 (1 / P(x)),   (2.35)

with the convention for P(x) = 0 that 0 × log 1/0 ≡ 0, since lim_{θ→0+} θ log 1/θ = 0.

Like the information content, entropy is measured in bits.

When it is convenient, we may also write H(X) as H(p), where p is the vector (p_1, p_2, ..., p_I). Another name for the entropy of X is the uncertainty of X.

Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in table 2.9. We obtain this number by averaging log_2 1/p_i (shown in the fourth column) under the probability distribution p_i (shown in the third column).
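Definitions (2.34) and (2.35) translate directly into code. A sketch in Python (illustrative, using a two-outcome ensemble rather than the full table 2.9):

```python
from math import log2

def info_content(p):
    """Shannon information content h(x) = log2(1/P(x)), equation (2.34)."""
    return log2(1 / p)

def entropy(ps):
    """Entropy H(X) of equation (2.35); zero-probability outcomes are skipped,
    implementing the convention 0 * log(1/0) = 0."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

# A bent coin: the improbable outcome carries more information,
# and the entropy is the average information content.
print(round(info_content(0.1), 2))    # 3.32 bits
print(round(info_content(0.9), 2))    # 0.15 bits
print(round(entropy([0.9, 0.1]), 3))  # 0.469 bits
```

Averaging the fourth column of table 2.9 under the third in the same way reproduces the 4.11 bits of example 2.12.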

We now note some properties of the entropy function.

• H(X) ≥ 0 with equality iff p_i = 1 for one i. ['iff' means 'if and only if'.]

• Entropy is maximized if p is uniform:

    H(X) ≤ log(|A_X|) with equality iff p_i = 1/|A_X| for all i.   (2.36)

Notation: the vertical bars '| |' have two meanings. If A_X is a set, |A_X| denotes the number of elements in A_X; if x is a number, then |x| is the absolute value of x.

The redundancy measures the fractional difference between H(X) and its maximum possible value, log(|A_X|). The redundancy of X is:

    1 − H(X) / log |A_X|.   (2.37)

We won't make use of 'redundancy' in this book, so I have not assigned a symbol to it.

The joint entropy of X, Y is:

    H(X, Y) = Σ_{xy ∈ A_X A_Y} P(x, y) log_2 (1 / P(x, y)).   (2.38)

Entropy is additive for independent random variables:

    H(X, Y) = H(X) + H(Y) iff P(x, y) = P(x) P(y).   (2.39)

Our definitions for information content so far apply only to discrete probability distributions over finite sets A_X. The definitions can be extended to infinite sets, though the entropy may then be infinite. The case of a probability density over a continuous set is addressed in section 11.3. Further important definitions and exercises to do with entropy will come along in section 8.1.

2.5 Decomposability of the entropy

The entropy function satisfies a recursive property that can be very useful when computing entropies. For convenience, we'll stretch our notation so that we can write H(X) as H(p), where p is the probability vector associated with the ensemble X.

Let's illustrate the property by an example first. Imagine that a random variable x ∈ {0, 1, 2} is created by first flipping a fair coin to determine whether x = 0; then, if x is not 0, flipping a fair coin a second time to determine whether x is 1 or 2. The probability distribution of x is

    P(x = 0) = 1/2; P(x = 1) = 1/4; P(x = 2) = 1/4.   (2.40)

What is the entropy of X? We can either compute it by brute force:

    H(X) = 1/2 log 2 + 1/4 log 4 + 1/4 log 4 = 1.5;   (2.41)

or we can use the following decomposition, in which the value of x is revealed gradually. Imagine first learning whether x = 0, and then, if x is not 0, learning which non-zero value is the case. The revelation of whether x = 0 or not entails revealing a binary variable whose probability distribution is {1/2, 1/2}. This revelation has an entropy H(1/2, 1/2) = 1/2 log 2 + 1/2 log 2 = 1 bit. If x is not 0, we learn the value of the second coin flip. This too is a binary variable whose probability distribution is {1/2, 1/2}, and whose entropy is 1 bit. We only get to experience the second revelation half the time, however, so the entropy can be written:

    H(X) = H(1/2, 1/2) + 1/2 H(1/2, 1/2).   (2.42)

Generalizing, the observation we are making about the entropy of any probability distribution p = {p_1, p_2, ..., p_I} is that

    H(p) = H(p_1, 1 − p_1) + (1 − p_1) H( p_2/(1 − p_1), p_3/(1 − p_1), ..., p_I/(1 − p_1) ).   (2.43)

When written as a formula, this property looks regrettably ugly; nevertheless it is a simple property and one that you should make use of.

Generalizing further, the entropy has the property for any m that

    H(p) = H[ (p_1 + p_2 + ... + p_m), (p_{m+1} + p_{m+2} + ... + p_I) ]
           + (p_1 + ... + p_m) H( p_1/(p_1 + ... + p_m), ..., p_m/(p_1 + ... + p_m) )
           + (p_{m+1} + ... + p_I) H( p_{m+1}/(p_{m+1} + ... + p_I), ..., p_I/(p_{m+1} + ... + p_I) ).   (2.44)

Example 2.13. A source produces a character x from the alphabet A = {0, 1, ..., 9, a, b, ..., z}; with probability 1/3, x is a numeral (0, ..., 9); with probability 1/3, x is a vowel (a, e, i, o, u); and with probability 1/3 it's one of the 21 consonants. All numerals are equiprobable, and the same goes for vowels and consonants. Estimate the entropy of X.

Solution. log 3 + (1/3)(log 10 + log 5 + log 21) = log 3 + (1/3) log 1050 ≃ log 30 bits. □

2.6 Gibbs' inequality

[The 'ei' in Leibler is pronounced the same as in heist.]

The relative entropy or Kullback–Leibler divergence between two probability distributions P(x) and Q(x) that are defined over the same alphabet A_X is

    D_KL(P || Q) = Σ_x P(x) log ( P(x) / Q(x) ).   (2.45)

The relative entropy satisfies Gibbs' inequality

    D_KL(P || Q) ≥ 0   (2.46)

with equality only if P = Q. Note that in general the relative entropy is not symmetric under interchange of the distributions P and Q: in general D_KL(P || Q) ≠ D_KL(Q || P), so D_KL, although it is sometimes called the 'KL distance', is not strictly a distance. The relative entropy is important in pattern recognition and neural networks, as well as in information theory.

Gibbs' inequality is probably the most important inequality in this book. It, and many other inequalities, can be proved using the concept of convexity.
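Both the decomposition (2.42) and Gibbs' inequality can be confirmed numerically. A sketch in Python (illustrative code; the distributions P and Q below are arbitrary examples):

```python
from math import log2

def entropy(ps):
    """H(p), equation (2.35), skipping zero-probability outcomes."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

def d_kl(ps, qs):
    """Relative entropy D_KL(P||Q) of equation (2.45); assumes Q(x) > 0
    wherever P(x) > 0."""
    return sum(p * log2(p / q) for p, q in zip(ps, qs) if p > 0)

# Decomposition (2.42): H(1/2, 1/4, 1/4) = H(1/2, 1/2) + (1/2) H(1/2, 1/2).
assert entropy([0.5, 0.25, 0.25]) == entropy([0.5, 0.5]) + 0.5 * entropy([0.5, 0.5])

# Gibbs' inequality (2.46), and the asymmetry of D_KL:
P, Q = [0.5, 0.25, 0.25], [0.4, 0.4, 0.2]
print(d_kl(P, Q), d_kl(Q, P))   # two different positive numbers
assert d_kl(P, Q) > 0 and d_kl(Q, P) > 0 and d_kl(P, P) == 0
```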

2.7 Jensen's inequality for convex functions

The words 'convex' and 'concave' may be pronounced 'convex-smile' and 'concave-frown'. This terminology has useful redundancy: while one may forget which way up 'convex' and 'concave' are, it is harder to confuse a smile with a frown.

Convex functions. A function f(x) is convex over (a, b) if every chord of the function lies above the function, as shown in figure 2.10; that is, for all x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

    f(λ x_1 + (1 − λ) x_2) ≤ λ f(x_1) + (1 − λ) f(x_2).   (2.47)

A function f is strictly convex if, for all x_1, x_2 ∈ (a, b), the equality holds only for λ = 0 and λ = 1.

Similar definitions apply to concave and strictly concave functions.

[Figure 2.10. Definition of convexity: the chord from (x_1, f(x_1)) to (x_2, f(x_2)) lies above the curve at x* = λ x_1 + (1 − λ) x_2.]

Some strictly convex functions are

• x^2, e^x and e^{−x} for all x;
• log(1/x) and x log x for x > 0.

[Figure 2.11. Convex functions: x^2, e^{−x}, log(1/x), x log x.]

Jensen's inequality. If f is a convex function and x is a random variable then:

    E[f(x)] ≥ f(E[x]),   (2.48)

where E denotes expectation. If f is strictly convex and E[f(x)] = f(E[x]), then the random variable x is a constant.

Jensen's inequality can also be rewritten for a concave function, with the direction of the inequality reversed.

A physical version of Jensen's inequality runs as follows. If a collection of masses p_i are placed on a convex curve f(x) at locations (x_i, f(x_i)), then the centre of gravity of those masses, which is at (E[x], E[f(x)]), lies above the curve. If this fails to convince you, then feel free to do the following exercise.

Exercise 2.14. [2, p.41] Prove Jensen's inequality.

Example 2.15. Three squares have average area Ā = 100 m^2. The average of the lengths of their sides is l̄ = 10 m. What can be said about the size of the largest of the three squares? [Use Jensen's inequality.]

Solution. Let x be the length of the side of a square, and let the probability of x be 1/3, 1/3, 1/3 over the three lengths l_1, l_2, l_3. Then the information that we have is that E[x] = 10 and E[f(x)] = 100, where f(x) = x^2 is the function mapping lengths to areas. This is a strictly convex function. We notice that the equality E[f(x)] = f(E[x]) holds, therefore x is a constant, and the three lengths must all be equal. The area of the largest square is 100 m^2. □
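Jensen's inequality is also easy to test empirically. A sketch in Python (illustrative; random four-outcome ensembles and the strictly convex function x^2):

```python
import random

random.seed(1)

def f(x):
    return x * x   # a strictly convex function

for _ in range(1000):
    xs = [random.uniform(-10, 10) for _ in range(4)]   # outcomes
    ws = [random.random() for _ in range(4)]
    ps = [w / sum(ws) for w in ws]                     # their probabilities
    E_x = sum(p * x for p, x in zip(ps, xs))
    E_fx = sum(p * f(x) for p, x in zip(ps, xs))
    assert E_fx >= f(E_x) - 1e-9    # equation (2.48), up to rounding
```

For f(x) = x^2 the gap E[f(x)] − f(E[x]) is exactly the variance of x, which is why equality forces x to be a constant, as in example 2.15.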

Convexity and concavity also relate to maximization

If f(x) is concave and there exists a point at which

    ∂f/∂x_k = 0 for all k,   (2.49)

then f(x) has its maximum value at that point.

The converse does not hold: if a concave f(x) is maximized at some x it is not necessarily true that the gradient ∇f(x) is equal to zero there. For example, f(x) = −|x| is maximized at x = 0 where its derivative is undefined; and f(p) = log(p), for a probability p ∈ (0, 1), is maximized on the boundary of the range, at p = 1, where the gradient df(p)/dp = 1.

2.8 Exercises

Sums of random variables

Exercise 2.16. [3, p.41]
(a) Two ordinary dice with faces labelled 1, ..., 6 are thrown. What is the probability distribution of the sum of the values? What is the probability distribution of the absolute difference between the values?
(b) One hundred ordinary dice are thrown. What, roughly, is the probability distribution of the sum of the values? Sketch the distribution and estimate its mean and standard deviation.
(c) How can two cubical dice be labelled using the numbers {0, 1, 2, 3, 4, 5, 6} so that when the two dice are thrown the sum has a uniform probability distribution over the integers 1–12?
(d) Is there any way that one hundred dice could be labelled with integers such that the probability distribution of the sum is uniform?

[Margin note: This exercise is intended to help you think about the central-limit theorem, which says that if independent random variables x_1, x_2, ..., x_N have means μ_n and finite variances σ_n^2, then, in the limit of large N, the sum Σ_n x_n has a distribution that tends to a normal (Gaussian) distribution with mean Σ_n μ_n and variance Σ_n σ_n^2.]

Inference problems

Exercise 2.17. [2, p.41] If q = 1 − p and a = ln(p/q), show that

    p = 1 / (1 + exp(−a)).   (2.50)

Sketch this function and find its relationship to the hyperbolic tangent function tanh(u) = (e^u − e^{−u}) / (e^u + e^{−u}).

It will be useful to be fluent in base-2 logarithms also. If b = log_2(p/q), what is p as a function of b?

Exercise 2.18. [2, p.42] Let x and y be dependent random variables with x a binary variable taking values in A_X = {0, 1}. Use Bayes' theorem to show that the log posterior probability ratio for x given y is

    log [ P(x=1 | y) / P(x=0 | y) ] = log [ P(y | x=1) / P(y | x=0) ] + log [ P(x=1) / P(x=0) ].   (2.51)

Exercise 2.19. [2, p.42] Let x, d_1 and d_2 be random variables such that d_1 and d_2 are conditionally independent given a binary variable x. Use Bayes' theorem to show that the posterior probability ratio for x given {d_i} is

    P(x=1 | {d_i}) / P(x=0 | {d_i}) = [ P(d_1 | x=1) / P(d_1 | x=0) ] [ P(d_2 | x=1) / P(d_2 | x=0) ] [ P(x=1) / P(x=0) ].   (2.52)
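Equations (2.50)–(2.52) are the standard machinery for combining independent evidence about a binary variable: likelihood ratios multiply, so log likelihood ratios add, and (2.50) converts the resulting log ratio back into a probability. A sketch in Python (the likelihood values here are made up purely for illustration):

```python
from math import exp, log

def sigmoid(a):
    """p = 1 / (1 + exp(-a)), the inverse of a = ln(p/q) in equation (2.50)."""
    return 1 / (1 + exp(-a))

log_prior_ratio = log(0.5 / 0.5)    # a flat prior on the binary variable x

# Hypothetical log likelihood ratios for two conditionally
# independent observations d1 and d2:
a1 = log(0.8 / 0.3)                 # ln P(d1 | x=1) / P(d1 | x=0)
a2 = log(0.6 / 0.4)                 # ln P(d2 | x=1) / P(d2 | x=0)

# Equation (2.52), taken through a logarithm:
a = log_prior_ratio + a1 + a2
print(round(sigmoid(a), 3))         # 0.8 = posterior probability that x = 1
```

The same answer comes from multiplying the ratios directly: (0.8 × 0.6) / (0.3 × 0.4) = 4, and 4/(4+1) = 0.8.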

Life in high-dimensional spaces

Probability distributions and volumes have some unexpected properties in high-dimensional spaces.

Exercise 2.20. [2, p.42] Consider a sphere of radius r in an N-dimensional real space. Show that the fraction of the volume of the sphere that is in the surface shell lying at values of the radius between r − ε and r, where 0 < ε < r, is:

    f = 1 − (1 − ε/r)^N.   (2.53)

Evaluate f for the cases N = 2, N = 10 and N = 1000, with (a) ε/r = 0.01; (b) ε/r = 0.5.

Implication: points that are uniformly distributed in a sphere in N dimensions, where N is large, are very likely to be in a thin shell near the surface.

Expectations and entropies

You are probably familiar with the idea of computing the expectation of a function of x,

    E[f(x)] = ⟨f(x)⟩ = Σ_x P(x) f(x).   (2.54)

Maybe you are not so comfortable with computing this expectation in cases where the function f(x) depends on the probability P(x). The next few examples address this concern.

Exercise 2.21. [1, p.43] Let p_a = 0.1, p_b = 0.2, and p_c = 0.7. Let f(a) = 10, f(b) = 5, and f(c) = 10/7. What is E[f(x)]? What is E[1/P(x)]?

Exercise 2.22. [2, p.43] For an arbitrary ensemble, what is E[1/P(x)]?

Exercise 2.23. [1, p.43] Let p_a = 0.1, p_b = 0.2, and p_c = 0.7. Let g(a) = 0, g(b) = 1, and g(c) = 0. What is E[g(x)]?

Exercise 2.24. [1, p.43] Let p_a = 0.1, p_b = 0.2, and p_c = 0.7. What is the probability that P(x) ∈ [0.15, 0.5]? What is P(log [P(x)/0.2] > 0.05)?

Exercise 2.25. [3, p.43] Prove the assertion that H(X) ≤ log(|A_X|) with equality iff p_i = 1/|A_X| for all i. (|A_X| denotes the number of elements in the set A_X.) [Hint: use Jensen's inequality (2.48); if your first attempt to use Jensen does not succeed, remember that Jensen involves both a random variable and a function, and you have quite a lot of freedom in choosing these; think about whether your chosen function f should be convex or concave.]

Exercise 2.26. [3, p.44] Prove that the relative entropy (equation (2.45)) satisfies D_KL(P || Q) ≥ 0 (Gibbs' inequality) with equality only if P = Q.

Exercise 2.27. [2] Prove that the entropy is indeed decomposable as described in equations (2.43–2.44).

Exercise 2.28.[2, p.45] A random variable $x \in \{0, 1, 2, 3\}$ is selected by flipping a bent coin with bias $f$ to determine whether the outcome is in $\{0, 1\}$ or $\{2, 3\}$; then either flipping a second bent coin with bias $g$ or a third bent coin with bias $h$, respectively. Write down the probability distribution of $x$. Use the decomposability of the entropy (2.44) to find the entropy of $X$. [Notice how compact an expression is obtained if you make use of the binary entropy function $H_2(x)$, compared with writing out the four-term entropy explicitly.] Find the derivative of $H(X)$ with respect to $f$. [Hint: $\mathrm{d}H_2(x)/\mathrm{d}x = \log((1-x)/x)$.]

[Margin figure: a tree in which the $f$-coin selects between $\{0, 1\}$ and $\{2, 3\}$, then the $g$-coin picks 0 or 1 and the $h$-coin picks 2 or 3.]

Exercise 2.29.[2, p.45] An unbiased coin is flipped until one head is thrown. What is the entropy of the random variable $x \in \{1, 2, 3, \ldots\}$, the number of flips? Repeat the calculation for the case of a biased coin with probability $f$ of coming up heads. [Hint: solve the problem both directly and by using the decomposability of the entropy (2.43).]

2.9 Further exercises

Forward probability

Exercise 2.30.[1] An urn contains $w$ white balls and $b$ black balls. Two balls are drawn, one after the other, without replacement. Prove that the probability that the first ball is white is equal to the probability that the second is white.

Exercise 2.31.[2] A circular coin of diameter $a$ is thrown onto a square grid whose squares are $b \times b$ ($a < b$). What is the probability that the coin will lie entirely within one square? [Ans: $(1 - a/b)^2$]

Exercise 2.32.[3] Buffon's needle. A needle of length $a$ is thrown onto a plane covered with equally spaced parallel lines with separation $b$. What is the probability that the needle will cross a line? [Ans, if $a < b$: $\frac{2a}{\pi b}$] [Generalization, Buffon's noodle: on average, a random curve of length $A$ is expected to intersect the lines $\frac{2A}{\pi b}$ times.]

Exercise 2.33.[2] Two points are selected at random on a straight line segment of length 1. What is the probability that a triangle can be constructed out of the three resulting segments?

Exercise 2.34.[2, p.45] An unbiased coin is flipped until one head is thrown. What is the expected number of heads and the expected number of tails? Fred, who doesn't know that the coin is unbiased, estimates the bias using $\hat{f} = h/(h + t)$, where $h$ and $t$ are the numbers of heads and tails tossed. Compute and sketch the probability distribution of $\hat{f}$. N.B., this is a forward probability problem, a sampling theory problem, not an inference problem. Don't use Bayes' theorem.

Exercise 2.35.[2, p.45] Fred rolls an unbiased six-sided die once per second, noting the occasions when the outcome is a six.

(a) What is the mean number of rolls from one six to the next six?

(b) Between two rolls, the clock strikes one. What is the mean number of rolls until the next six?

(c) Now think back before the clock struck. What is the mean number of rolls, going back in time, until the most recent six?

(d) What is the mean number of rolls from the six before the clock struck to the next six?

(e) Is your answer to (d) different from your answer to (a)? Explain.

Another version of this exercise refers to Fred waiting for a bus at a bus-stop in Poissonville where buses arrive independently at random (a Poisson process), with, on average, one bus every six minutes. What is the average wait for a bus, after Fred arrives at the stop? [6 minutes.] So what is the time between the two buses, the one that Fred just missed, and the one that he catches? [12 minutes.] Explain the apparent paradox. Note the contrast with the situation in Clockville, where the buses are spaced exactly 6 minutes apart. There, as you can confirm, the mean wait at a bus-stop is 3 minutes, and the time between the missed bus and the next one is 6 minutes.

Conditional probability

Exercise 2.36.[2] You meet Fred. Fred tells you he has two brothers, Alf and Bob. What is the probability that Fred is older than Bob? Fred tells you that he is older than Alf. Now, what is the probability that Fred is older than Bob? (That is, what is the conditional probability that $F > B$ given that $F > A$?)

Exercise 2.37.[2] The inhabitants of an island tell the truth one third of the time. They lie with probability 2/3. On an occasion, after one of them made a statement, you ask another 'was that statement true?' and he says 'yes'. What is the probability that the statement was indeed true?

Exercise 2.38.[2, p.46] Compare two ways of computing the probability of error of the repetition code $R_3$, assuming a binary symmetric channel (you did this once for exercise 1.2 (p.7)) and confirm that they give the same answer.

Binomial distribution method. Add the probability that all three bits are flipped to the probability that exactly two bits are flipped.

Sum rule method. Using the sum rule, compute the marginal probability that $\mathbf{r}$ takes on each of the eight possible values, $P(\mathbf{r})$. [$P(\mathbf{r}) = \sum_s P(s) P(\mathbf{r} \mid s)$.] Then compute the posterior probability of $s$ for each of the eight values of $\mathbf{r}$. [In fact, by symmetry, only two example cases, $\mathbf{r} = (000)$ and $\mathbf{r} = (001)$, need be considered.] Equation (1.18) gives the posterior probability of the input $s$, given the received vector $\mathbf{r}$. Notice that some of the inferred bits are better determined than others. From the posterior probability $P(s \mid \mathbf{r})$ you can read out the case-by-case error probability, the probability that the more probable hypothesis is not correct, $P(\mathrm{error} \mid \mathbf{r})$. Find the average error probability using the sum rule,

  $P(\mathrm{error}) = \sum_{\mathbf{r}} P(\mathbf{r}) \, P(\mathrm{error} \mid \mathbf{r})$.   (2.55)
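The two methods can be cross-checked mechanically before working them by hand. This sketch is my own; the flip probability $f = 0.1$ and the variable names are assumptions of the example:

```python
from itertools import product

f = 0.1  # assumed flip probability of the binary symmetric channel

# Binomial distribution method: P(two flips) + P(three flips)
p_binomial = 3 * f**2 * (1 - f) + f**3

# Sum rule method: average the case-by-case error probability over r
p_sum_rule = 0.0
for r in product([0, 1], repeat=3):
    # joint probabilities P(s, r) for s = 0 (sent 000) and s = 1 (sent 111)
    joint = {}
    for s in (0, 1):
        like = 1.0
        for bit in r:
            like *= f if bit != s else (1 - f)
        joint[s] = 0.5 * like                  # uniform prior P(s) = 1/2
    p_r = joint[0] + joint[1]                  # marginal P(r), by the sum rule
    p_error_given_r = min(joint.values()) / p_r  # less probable hypothesis
    p_sum_rule += p_r * p_error_given_r

assert abs(p_binomial - p_sum_rule) < 1e-12
```

The loop is exactly the eight-case computation the exercise describes; the symmetry argument in the exercise collapses it to the two cases 000 and 001.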

Exercise 2.39.[3C, p.46] The frequency $p_n$ of the $n$th most frequent word in English is roughly approximated by

  $p_n \approx \begin{cases} \frac{0.1}{n} & \text{for } n \in 1, \ldots, 12\,367 \\ 0 & \text{for } n > 12\,367. \end{cases}$   (2.56)

[This remarkable $1/n$ law is known as Zipf's law, and applies to the word frequencies of many languages (Zipf, 1949).] If we assume that English is generated by picking words at random according to this distribution, what is the entropy of English (per word)? [This calculation can be found in 'Prediction and entropy of printed English', C.E. Shannon, Bell Syst. Tech. J. 30, pp. 50-64 (1950), but, inexplicably, the great man made numerical errors in it.]

2.10 Solutions

Solution to exercise 2.2 (p.24). No, they are not independent. If they were, then all the conditional distributions $P(y \mid x)$ would be identical functions of $y$, regardless of $x$ (cf. figure 2.3).

Solution to exercise 2.4 (p.27). We define the fraction $f_B \equiv B/K$.

(a) The number of black balls has a binomial distribution.

  $P(n_B \mid f_B, N) = \binom{N}{n_B} f_B^{n_B} (1 - f_B)^{N - n_B}$.   (2.57)

(b) The mean and variance of this distribution are:

  $E[n_B] = N f_B$   (2.58)

  $\mathrm{var}[n_B] = N f_B (1 - f_B)$.   (2.59)

These results were derived in example 1.1 (p.1). The standard deviation of $n_B$ is $\sqrt{\mathrm{var}[n_B]} = \sqrt{N f_B (1 - f_B)}$.

When $B/K = 1/5$ and $N = 5$, the expectation and variance of $n_B$ are 1 and 4/5. The standard deviation is 0.89.

When $B/K = 1/5$ and $N = 400$, the expectation and variance of $n_B$ are 80 and 64. The standard deviation is 8.

Solution to exercise 2.5 (p.27). The numerator of the quantity

  $z = \frac{(n_B - f_B N)^2}{N f_B (1 - f_B)}$

can be recognized as $(n_B - E[n_B])^2$; the denominator is equal to the variance of $n_B$ (2.59), which is by definition the expectation of the numerator. So the expectation of $z$ is 1. [A random variable like $z$, which measures the deviation of data from the expected value, is sometimes called $\chi^2$ (chi-squared).]

In the case $N = 5$ and $f_B = 1/5$, $N f_B$ is 1, and $\mathrm{var}[n_B]$ is 4/5. The numerator has five possible values, only one of which is smaller than 1: $(n_B - f_B N)^2 = 0$ has probability $P(n_B = 1) = 0.4096$; so the probability that $z < 1$ is 0.4096.
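The numbers quoted in solutions 2.4 and 2.5 can be verified directly from the binomial distribution; this sketch is mine, not the book's:

```python
from math import comb, sqrt

def binom_pmf(n, N, f):
    """Binomial probability P(n_B = n | f_B = f, N), eq. (2.57)."""
    return comb(N, n) * f**n * (1 - f)**(N - n)

f_B = 1 / 5

# N = 5: expectation 1, variance 4/5, standard deviation ~0.89
mean5 = sum(n * binom_pmf(n, 5, f_B) for n in range(6))
var5 = sum((n - mean5)**2 * binom_pmf(n, 5, f_B) for n in range(6))
assert abs(mean5 - 1.0) < 1e-12 and abs(var5 - 0.8) < 1e-12
assert abs(sqrt(var5) - 0.89) < 0.01

# N = 400: expectation 80, variance 64, standard deviation 8
assert abs(400 * f_B - 80) < 1e-9
assert abs(400 * f_B * (1 - f_B) - 64) < 1e-9

# Solution 2.5: z < 1 only when n_B = 1, so P(z < 1) = P(n_B = 1) = 0.4096
assert abs(binom_pmf(1, 5, f_B) - 0.4096) < 1e-9
```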

Solution to exercise 2.14 (p.35). We wish to prove, given the property

  $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$,   (2.60)

that, if $\sum_i p_i = 1$ and $p_i \ge 0$,

  $f\Bigl(\sum_{i=1}^I p_i x_i\Bigr) \le \sum_{i=1}^I p_i f(x_i)$.   (2.61)

We proceed by recursion, working from the right-hand side. (This proof does not handle cases where some $p_i = 0$; such details are left to the pedantic reader.) At the first line we use the definition of convexity (2.60) with $\lambda = p_1 / \sum_{i=1}^I p_i = p_1$; at the second line, $\lambda = p_2 / \sum_{i=2}^I p_i$.

  $f\Bigl(\sum_{i=1}^I p_i x_i\Bigr) = f\Bigl(p_1 x_1 + \sum_{i=2}^I p_i x_i\Bigr)$

  $\le p_1 f(x_1) + \Bigl[\sum_{i=2}^I p_i\Bigr] \, f\Bigl(\Bigl[\sum_{i=2}^I p_i x_i\Bigr] \Big/ \Bigl[\sum_{i=2}^I p_i\Bigr]\Bigr)$   (2.62)

  $\le p_1 f(x_1) + \Bigl[\sum_{i=2}^I p_i\Bigr] \Bigl[\frac{p_2}{\sum_{i=2}^I p_i} f(x_2) + \frac{\sum_{i=3}^I p_i}{\sum_{i=2}^I p_i} \, f\Bigl(\frac{\sum_{i=3}^I p_i x_i}{\sum_{i=3}^I p_i}\Bigr)\Bigr]$,

and so forth. $\Box$

Solution to exercise 2.16 (p.36).

(a) For the outcomes $\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}$, the probabilities are $P = \{\frac{1}{36}, \frac{2}{36}, \frac{3}{36}, \frac{4}{36}, \frac{5}{36}, \frac{6}{36}, \frac{5}{36}, \frac{4}{36}, \frac{3}{36}, \frac{2}{36}, \frac{1}{36}\}$.

(b) The value of one die has mean 3.5 and variance 35/12. So the sum of one hundred has mean 350 and variance $3500/12 \simeq 292$, and by the central-limit theorem the probability distribution is roughly Gaussian (but confined to the integers), with this mean and variance.

(c) In order to obtain a sum that has a uniform distribution we have to start from random variables some of which have a spiky distribution with the probability mass concentrated at the extremes. The unique solution is to have one ordinary die and one with faces 6, 6, 6, 0, 0, 0.

(d) Yes, a uniform distribution can be created in several ways, for example by labelling the $r$th die with the numbers $\{0, 1, 2, 3, 4, 5\} \times 6^r$. To think about: does this uniform distribution contradict the central-limit theorem?

Solution to exercise 2.17 (p.36).

  $a = \ln \frac{p}{q} \;\Rightarrow\; \frac{p}{q} = e^a$   (2.63)

and $q = 1 - p$ gives

  $\frac{p}{1 - p} = e^a$   (2.64)

  $\Rightarrow\; p = \frac{e^a}{e^a + 1} = \frac{1}{1 + \exp(-a)}$.   (2.65)

The hyperbolic tangent is

  $\tanh(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}$   (2.66)

so

  $f(a) \equiv \frac{1}{1 + e^{-a}} = \frac{1}{2}\left(\frac{1 - e^{-a}}{1 + e^{-a}} + 1\right) = \frac{1}{2}\left(\frac{e^{a/2} - e^{-a/2}}{e^{a/2} + e^{-a/2}} + 1\right) = \frac{1}{2}\bigl(\tanh(a/2) + 1\bigr)$.   (2.67)

In the case $b = \log_2 \frac{p}{q}$, we can repeat steps (2.63-2.65), replacing $e$ by 2, to obtain

  $p = \frac{1}{1 + 2^{-b}}$.   (2.68)

Solution to exercise 2.18 (p.36).

  $P(x \mid y) = \frac{P(y \mid x) P(x)}{P(y)}$   (2.69)

  $\Rightarrow\; \frac{P(x=1 \mid y)}{P(x=0 \mid y)} = \frac{P(y \mid x=1)}{P(y \mid x=0)} \, \frac{P(x=1)}{P(x=0)}$   (2.70)

  $\Rightarrow\; \log \frac{P(x=1 \mid y)}{P(x=0 \mid y)} = \log \frac{P(y \mid x=1)}{P(y \mid x=0)} + \log \frac{P(x=1)}{P(x=0)}$.   (2.71)

Solution to exercise 2.19 (p.36). The conditional independence of $d_1$ and $d_2$ given $x$ means

  $P(x, d_1, d_2) = P(x) P(d_1 \mid x) P(d_2 \mid x)$.   (2.72)

This gives a separation of the posterior probability ratio into a series of factors, one for each data point, times the prior probability ratio.

  $\frac{P(x=1 \mid \{d_i\})}{P(x=0 \mid \{d_i\})} = \frac{P(\{d_i\} \mid x=1)}{P(\{d_i\} \mid x=0)} \, \frac{P(x=1)}{P(x=0)}$   (2.73)

  $= \frac{P(d_1 \mid x=1)}{P(d_1 \mid x=0)} \, \frac{P(d_2 \mid x=1)}{P(d_2 \mid x=0)} \, \frac{P(x=1)}{P(x=0)}$.   (2.74)

Life in high-dimensional spaces

Solution to exercise 2.20 (p.37). The volume of a hypersphere of radius $r$ in $N$ dimensions is in fact

  $V(r, N) = \frac{\pi^{N/2}}{(N/2)!} r^N$,   (2.75)

but you don't need to know this. For this question all that we need is the $r$-dependence, $V(r, N) \propto r^N$. So the fractional volume in $(r - \epsilon, r)$ is

  $\frac{r^N - (r - \epsilon)^N}{r^N} = 1 - \left(1 - \frac{\epsilon}{r}\right)^N$.   (2.76)

The fractional volumes in the shells for the required cases are:

  $N$                    2       10      1000
  $\epsilon/r = 0.01$    0.02    0.096   0.99996
  $\epsilon/r = 0.5$     0.75    0.999   $1 - 2^{-1000}$

Notice that no matter how small $\epsilon$ is, for large enough $N$ essentially all the probability mass is in the surface shell of thickness $\epsilon$.
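The shell-fraction table can be regenerated in a few lines; this sketch is mine:

```python
# Fraction of an N-ball's volume in a shell of thickness eps at the surface:
# f = 1 - (1 - eps/r)**N  (only the r-dependence V(r, N) ~ r^N is needed).
def shell_fraction(eps_over_r, N):
    return 1.0 - (1.0 - eps_over_r) ** N

# the values quoted in the table above
for N, expected in [(2, 0.02), (10, 0.096), (1000, 0.99996)]:
    assert abs(shell_fraction(0.01, N) - expected) < 5e-4
for N, expected in [(2, 0.75), (10, 0.999), (1000, 1.0)]:
    assert abs(shell_fraction(0.5, N) - expected) < 5e-3
```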

Solution to exercise 2.21 (p.37). $p_a = 0.1$, $p_b = 0.2$, $p_c = 0.7$. $f(a) = 10$, $f(b) = 5$, and $f(c) = 10/7$.

  $E[f(x)] = 0.1 \times 10 + 0.2 \times 5 + 0.7 \times 10/7 = 3$.   (2.77)

For each $x$, $f(x) = 1/P(x)$, so

  $E[1/P(x)] = E[f(x)] = 3$.   (2.78)

Solution to exercise 2.22 (p.37). For general $X$,

  $E[1/P(x)] = \sum_{x \in \mathcal{A}_X} P(x) \, \frac{1}{P(x)} = \sum_{x \in \mathcal{A}_X} 1 = |\mathcal{A}_X|$.   (2.79)

Solution to exercise 2.23 (p.37). $p_a = 0.1$, $p_b = 0.2$, $p_c = 0.7$. $g(a) = 0$, $g(b) = 1$, and $g(c) = 0$.

  $E[g(x)] = p_b = 0.2$.   (2.80)

Solution to exercise 2.24 (p.37).

  $P\bigl(P(x) \in [0.15, 0.5]\bigr) = p_b = 0.2$.   (2.81)

  $P\bigl(|\log_2 P(x)| > 0.05\bigr) = p_a + p_c = 0.8$.   (2.82)

Solution to exercise 2.25 (p.37). This type of question can be approached in two ways: either by differentiating the function to be maximized, finding the maximum, and proving it is a global maximum; this strategy is somewhat risky since it is possible for the maximum of a function to be at the boundary of the space, at a place where the derivative is not zero. Alternatively, a carefully chosen inequality can establish the answer. The second method is much neater.

Proof by differentiation (not the recommended method). Since it is slightly easier to differentiate $\ln 1/p$ than $\log_2 1/p$, we temporarily define $H(X)$ to be measured using natural logarithms, thus scaling it down by a factor of $\log_2 e$.

  $H(X) = \sum_i p_i \ln \frac{1}{p_i}$   (2.83)

  $\frac{\partial H(X)}{\partial p_i} = \ln \frac{1}{p_i} - 1$   (2.84)

we maximize subject to the constraint $\sum_i p_i = 1$, which can be enforced with a Lagrange multiplier:

  $G(\mathbf{p}) \equiv H(X) + \lambda \Bigl(\sum_i p_i - 1\Bigr)$   (2.85)

  $\frac{\partial G(\mathbf{p})}{\partial p_i} = \ln \frac{1}{p_i} - 1 + \lambda$.   (2.86)

At a maximum,

  $\ln \frac{1}{p_i} - 1 + \lambda = 0$   (2.87)

  $\Rightarrow\; \ln \frac{1}{p_i} = 1 - \lambda$,   (2.88)

so all the $p_i$ are equal. That this extremum is indeed a maximum is established by finding the curvature:

  $\frac{\partial^2 G(\mathbf{p})}{\partial p_i \partial p_j} = -\frac{1}{p_i} \delta_{ij}$,   (2.89)

which is negative definite. $\Box$
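Whichever proof one prefers, the inequality $H(X) \le \log_2 |\mathcal{A}_X|$ is easy to spot-check numerically; the example distributions below are my own choices:

```python
import math

def entropy_bits(p):
    """H(X) = sum_i p_i log2(1/p_i), with the convention 0 log 1/0 = 0."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

# H(X) <= log2 |A_X| for a few arbitrary distributions
for p in [(0.1, 0.2, 0.7), (0.25, 0.25, 0.25, 0.25), (0.5, 0.3, 0.1, 0.1)]:
    assert entropy_bits(p) <= math.log2(len(p)) + 1e-12

# equality is attained at the uniform distribution
uniform = (0.25,) * 4
assert abs(entropy_bits(uniform) - math.log2(4)) < 1e-12
```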

Proof using Jensen's inequality (recommended method). First a reminder of the inequality.

If $f$ is a convex function and $x$ is a random variable then:

  $E[f(x)] \ge f(E[x])$.

If $f$ is strictly convex and $E[f(x)] = f(E[x])$, then the random variable $x$ is a constant (with probability 1).

The secret of a proof using Jensen's inequality is to choose the right function and the right random variable. We could define

  $f(u) = \log \frac{1}{u} = -\log u$   (2.90)

(which is a convex function) and think of $H(X) = \sum_i p_i \log \frac{1}{p_i}$ as the mean of $f(u)$ where $u = P(x)$, but this would not get us there; it would give us an inequality in the wrong direction. If instead we define

  $u = 1/P(x)$   (2.91)

then we find:

  $H(X) = -E[f(1/P(x))] \le -f(E[1/P(x)])$;   (2.92)

now we know from exercise 2.22 (p.37) that $E[1/P(x)] = |\mathcal{A}_X|$, so

  $H(X) \le -f(|\mathcal{A}_X|) = \log |\mathcal{A}_X|$.   (2.93)

Equality holds only if the random variable $u = 1/P(x)$ is a constant, which means $P(x)$ is a constant for all $x$. $\Box$

Solution to exercise 2.26 (p.37).

  $D_{\mathrm{KL}}(P\|Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$.   (2.94)

We prove Gibbs' inequality using Jensen's inequality. Let $f(u) = \log 1/u$ and $u = \frac{Q(x)}{P(x)}$. Then

  $D_{\mathrm{KL}}(P\|Q) = E\bigl[f\bigl(Q(x)/P(x)\bigr)\bigr]$   (2.95)

  $\ge f\Bigl(\sum_x P(x) \frac{Q(x)}{P(x)}\Bigr) = \log \frac{1}{\sum_x Q(x)} = 0$,   (2.96)

with equality only if $u = \frac{Q(x)}{P(x)}$ is a constant, that is, if $Q(x) = P(x)$. $\Box$

Second solution. In the above proof the expectations were with respect to the probability distribution $P(x)$. A second solution uses Jensen's inequality with $Q(x)$ instead. We define $f(u) = u \log u$ and let $u = \frac{P(x)}{Q(x)}$. Then

  $D_{\mathrm{KL}}(P\|Q) = \sum_x Q(x) \frac{P(x)}{Q(x)} \log \frac{P(x)}{Q(x)} = \sum_x Q(x) \, f\Bigl(\frac{P(x)}{Q(x)}\Bigr)$   (2.97)

  $\ge f\Bigl(\sum_x Q(x) \frac{P(x)}{Q(x)}\Bigr) = f(1) = 0$,   (2.98)

with equality only if $u = \frac{P(x)}{Q(x)}$ is a constant, that is, if $Q(x) = P(x)$. $\Box$
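Gibbs' inequality can likewise be spot-checked; the distributions $P$ and $Q$ below are my own examples, not the book's:

```python
import math

def kl_bits(P, Q):
    """Relative entropy D_KL(P||Q) = sum_x P(x) log2(P(x)/Q(x)), eq. (2.94)."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = (0.5, 0.3, 0.2)
Q = (0.2, 0.2, 0.6)

assert kl_bits(P, Q) >= 0.0            # Gibbs' inequality
assert abs(kl_bits(P, P)) < 1e-12      # equality iff P = Q
assert abs(kl_bits(P, Q) - kl_bits(Q, P)) > 1e-6  # note: D_KL is not symmetric
```

The asymmetry checked in the last line is why the two Jensen proofs above, one averaging under $P$ and one under $Q$, are genuinely different arguments.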

Solution to exercise 2.28 (p.38).

  $H(X) = H_2(f) + f H_2(g) + (1 - f) H_2(h)$.   (2.99)

Solution to exercise 2.29 (p.38). The probability that there are $x - 1$ tails and then one head (so we get the first head on the $x$th toss) is

  $P(x) = (1 - f)^{x-1} f$.   (2.100)

If the first toss is a tail, the probability distribution for the future looks just like it did before we made the first toss. Thus we have a recursive expression for the entropy:

  $H(X) = H_2(f) + (1 - f) H(X)$.   (2.101)

Rearranging,

  $H(X) = H_2(f)/f$.   (2.102)

Solution to exercise 2.34 (p.38). The probability of the number of tails $t$ is

  $P(t) = \left(\frac{1}{2}\right)^t \frac{1}{2}$ for $t \ge 0$.   (2.103)

The expected number of heads is 1, by definition of the problem. The expected number of tails is

  $E[t] = \sum_{t=0}^{\infty} \left(\frac{1}{2}\right)^t \frac{1}{2} \, t$,   (2.104)

which may be shown to be 1 in a variety of ways. For example, since the situation after one tail is thrown is equivalent to the opening situation, we can write down the recurrence relation

  $E[t] = \frac{1}{2}(1 + E[t]) + \frac{1}{2} \cdot 0 \;\Rightarrow\; E[t] = 1$.   (2.105)

The probability distribution of the 'estimator' $\hat{f} = 1/(1 + t)$, given that $f = 1/2$, is plotted in figure 2.12. The probability of $\hat{f}$ is simply the probability of the corresponding value of $t$.

[Figure 2.12. The probability distribution of the estimator $\hat{f} = 1/(1 + t)$, given that $f = 1/2$.]

Solution to exercise 2.35 (p.38).

(a) The mean number of rolls from one six to the next six is six (assuming we start counting rolls after the first of the two sixes). The probability that the next six occurs on the $r$th roll is the probability of not getting a six for $r - 1$ rolls multiplied by the probability of then getting a six:

  $P(r_1 = r) = \left(\frac{5}{6}\right)^{r-1} \frac{1}{6}$, for $r \in \{1, 2, 3, \ldots\}$.   (2.106)

This probability distribution of the number of rolls, $r$, may be called an exponential distribution, since

  $P(r_1 = r) = e^{-\alpha r} / Z$,   (2.107)

where $\alpha = \ln(6/5)$, and $Z$ is a normalizing constant.

(b) The mean number of rolls from the clock until the next six is six.

(c) The mean number of rolls, going back in time, until the most recent six is six.

(d) The mean number of rolls from the six before the clock struck to the six after the clock struck is the sum of the answers to (b) and (c), less one, that is, eleven.

(e) Rather than explaining the difference between (a) and (d), let me give another hint. Imagine that the buses in Poissonville arrive independently at random (a Poisson process), with, on average, one bus every six minutes. Imagine that passengers turn up at bus-stops at a uniform rate, and are scooped up by the bus without delay, so the interval between two buses determines the number of passengers on the second bus. Buses that follow gaps bigger than six minutes become overcrowded. The passengers' representative complains that two-thirds of all passengers found themselves on overcrowded buses. The bus operator claims, 'no, no, only one third of our buses are overcrowded'. Can both claims be true?

[Figure 2.13. The probability distribution of the number of rolls $r_1$ from one six to the next (solid line), $P(r_1 = r) = \left(\frac{5}{6}\right)^{r-1} \frac{1}{6}$, and the probability distribution (dashed line) of the number of rolls from the six before 1pm to the next six, $r_{\mathrm{tot}}$, $P(r_{\mathrm{tot}} = r) = r \left(\frac{5}{6}\right)^{r-1} \left(\frac{1}{6}\right)^2$. The probability $P(r_1 > 6)$ is about 1/3; the probability $P(r_{\mathrm{tot}} > 6)$ is about 2/3. The mean of $r_1$ is 6, and the mean of $r_{\mathrm{tot}}$ is 11.]

Solution to exercise 2.38 (p.39).

Binomial distribution method. From the solution to exercise 1.2, $p_B = 3 f^2 (1 - f) + f^3$.

Sum rule method. The marginal probabilities of the eight values of $\mathbf{r}$ are illustrated by:

  $P(\mathbf{r} = 000) = \frac{1}{2} f^3 + \frac{1}{2} (1 - f)^3$,   (2.108)

  $P(\mathbf{r} = 001) = \frac{1}{2} f^2 (1 - f) + \frac{1}{2} f (1 - f)^2 = \frac{1}{2} f (1 - f)$.   (2.109)

The posterior probabilities are represented by

  $P(s = 1 \mid \mathbf{r} = 000) = \frac{f^3}{(1 - f)^3 + f^3}$   (2.110)

and

  $P(s = 1 \mid \mathbf{r} = 001) = \frac{f^2 (1 - f)}{f^2 (1 - f) + f (1 - f)^2} = f$.   (2.111)

The probabilities of error in these representative cases are thus

  $P(\mathrm{error} \mid \mathbf{r} = 000) = \frac{f^3}{(1 - f)^3 + f^3}$   (2.112)

and

  $P(\mathrm{error} \mid \mathbf{r} = 001) = f$.   (2.113)

Notice that the average probability of error of $R_3$ is about $3 f^2$, while the probability (given $\mathbf{r}$) that any particular bit is wrong is either about $f^3$ or $f$.

The average error probability, using the sum rule, is

  $P(\mathrm{error}) = \sum_{\mathbf{r}} P(\mathbf{r}) \, P(\mathrm{error} \mid \mathbf{r}) = 2 \left[\frac{1}{2}\bigl((1 - f)^3 + f^3\bigr)\right] \frac{f^3}{(1 - f)^3 + f^3} + 6 \left[\frac{1}{2} f (1 - f)\right] f$.

[The first two terms are for the cases $\mathbf{r} = 000$ and $111$; the remaining 6 are for the other outcomes, which share the same probability of occurring and identical error probability, $f$.]

So

  $P(\mathrm{error}) = f^3 + 3 f^2 (1 - f)$.

Solution to exercise 2.39 (p.40). The entropy is 9.7 bits per word.
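The figure of 9.7 bits per word can be reproduced directly from the Zipf distribution (2.56); this computation is my own sketch, not Shannon's or the book's:

```python
import math

N = 12367
Z = sum(0.1 / n for n in range(1, N + 1))       # normalizer; 0.1/n sums to ~1
p = [0.1 / (n * Z) for n in range(1, N + 1)]    # normalized word probabilities
H = sum(pi * math.log2(1.0 / pi) for pi in p)   # entropy per word, in bits

# Zipf's 0.1/n is remarkably close to normalized already over 12 367 words
assert abs(Z - 1.0) < 0.01
# the entropy of English, per word, under this model
assert abs(H - 9.7) < 0.05
```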

About Chapter 3

If you are eager to get on to information theory, data compression, and noisy channels, you can skip to Chapter 4. Data compression and data modelling are intimately connected, however, so you'll probably want to come back to this chapter by the time you get to Chapter 6. Before reading Chapter 3, it might be good to look at the following exercises.

Exercise 3.1.[2, p.59] A die is selected at random from two twenty-faced dice on which the symbols 1-10 are written with nonuniform frequency as follows.

  Symbol                     1  2  3  4  5  6  7  8  9  10
  Number of faces of die A   6  4  3  2  1  1  1  1  1  0
  Number of faces of die B   3  3  2  2  2  2  2  2  1  1

The randomly chosen die is rolled 7 times, with the following outcomes: 5, 3, 9, 3, 8, 4, 7. What is the probability that the die is die A?

Exercise 3.2.[2, p.59] Assume that there is a third twenty-faced die, die C, on which the symbols 1-20 are written once each. As above, one of the three dice is selected at random and rolled 7 times, giving the outcomes: 3, 5, 4, 8, 3, 9, 7. What is the probability that the die is (a) die A, (b) die B, (c) die C?

Exercise 3.3.[3, p.48] Inferring a decay constant

Unstable particles are emitted from a source and decay at a distance $x$, a real number that has an exponential probability distribution with characteristic length $\lambda$. Decay events can be observed only if they occur in a window extending from $x = 1$ cm to $x = 20$ cm. $N$ decays are observed at locations $\{x_1, \ldots, x_N\}$. What is $\lambda$?

  * * * * * * * * *  ----> x

Exercise 3.4.[3, p.55] Forensic evidence

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type 'O' blood. The blood groups of the two traces are found to be of type 'O' (a common type in the local population, having frequency 60%) and of type 'AB' (a rare type, with frequency 1%). Do these data (type 'O' and 'AB' blood were found at scene) give evidence in favour of the proposition that Oliver was one of the two people present at the crime?

3 More about Inference

It is not a controversial statement that Bayes' theorem provides the correct language for describing the inference of a message communicated over a noisy channel, as we used it in Chapter 1 (p.6). But strangely, when it comes to other inference problems, the use of Bayes' theorem is not so widespread.

3.1 A first inference problem

When I was an undergraduate in Cambridge, I was privileged to receive supervisions from Steve Gull. Sitting at his desk in a dishevelled office in St. John's College, I asked him how one ought to answer an old Tripos question (exercise 3.3):

  Unstable particles are emitted from a source and decay at a distance $x$, a real number that has an exponential probability distribution with characteristic length $\lambda$. Decay events can be observed only if they occur in a window extending from $x = 1$ cm to $x = 20$ cm. $N$ decays are observed at locations $\{x_1, \ldots, x_N\}$. What is $\lambda$?

  * * * * * * * * *  ----> x

I had scratched my head over this for some time. My education had provided me with a couple of approaches to solving such inference problems: constructing 'estimators' of the unknown parameters; or 'fitting' the model to the data, or to a processed version of the data.

Since the mean of an unconstrained exponential distribution is $\lambda$, it seemed reasonable to examine the sample mean $\bar{x} = \sum_n x_n / N$ and see if an estimator $\hat{\lambda}$ could be obtained from it. It was evident that the estimator $\hat{\lambda} = \bar{x} - 1$ would be appropriate for $\lambda \ll 20$ cm, but not for cases where the truncation of the distribution at the right-hand side is significant; with a little ingenuity and the introduction of ad hoc bins, promising estimators for $\lambda \gg 20$ cm could be constructed. But there was no obvious estimator that would work under all conditions.

Nor could I find a satisfactory approach based on fitting the density $P(x \mid \lambda)$ to a histogram derived from the data. I was stuck.

What is the general solution to this problem and others like it? Is it always necessary, when confronted by a new inference problem, to grope in the dark for appropriate 'estimators' and worry about finding the 'best' estimator (whatever that means)?

[Figure 3.1. The probability density $P(x \mid \lambda)$ as a function of $x$, for $\lambda = 2, 5, 10$.]

[Figure 3.2. The probability density $P(x \mid \lambda)$ as a function of $\lambda$, for three different values of $x$. When plotted this way round, the function is known as the likelihood function of $\lambda$. The marks indicate three values of $\lambda$, $\lambda = 2, 5, 10$, that were used in the preceding figure.]

Steve wrote down the probability of one data point, given $\lambda$:

  $P(x \mid \lambda) = \begin{cases} \dfrac{1}{\lambda} e^{-x/\lambda} / Z(\lambda) & 1 < x < 20 \\ 0 & \text{otherwise} \end{cases}$   (3.1)

where

  $Z(\lambda) = \int_1^{20} \mathrm{d}x \, \frac{1}{\lambda} e^{-x/\lambda} = e^{-1/\lambda} - e^{-20/\lambda}$.   (3.2)

This seemed obvious enough. Then he wrote Bayes' theorem:

  $P(\lambda \mid \{x_1, \ldots, x_N\}) = \frac{P(\{x\} \mid \lambda) \, P(\lambda)}{P(\{x\})}$   (3.3)

  $\propto \frac{1}{(\lambda Z(\lambda))^N} \exp\Bigl(-\sum_1^N x_n / \lambda\Bigr) P(\lambda)$.   (3.4)

Suddenly, the straightforward distribution $P(\{x_1, \ldots, x_N\} \mid \lambda)$, defining the probability of the data given the hypothesis $\lambda$, was being turned on its head so as to define the probability of a hypothesis given the data. A simple figure showed the probability of a single data point $P(x \mid \lambda)$ as a familiar function of $x$, for different values of $\lambda$ (figure 3.1). Each curve was an innocent exponential, normalized to have area 1. Plotting the same function as a function of $\lambda$ for a fixed value of $x$, something remarkable happens: a peak emerges (figure 3.2). To help understand these two points of view of the one function, figure 3.3 shows a surface plot of $P(x \mid \lambda)$ as a function of $x$ and $\lambda$.

[Figure 3.3. The probability density $P(x \mid \lambda)$ as a function of $x$ and $\lambda$. Figures 3.1 and 3.2 are vertical sections through this surface.]

For a dataset consisting of several points, e.g., the six points $\{x_n\}_{n=1}^N = \{1.5, 2, 3, 4, 5, 12\}$, the likelihood function $P(\{x\} \mid \lambda)$ is the product of the $N$ functions of $\lambda$, $P(x_n \mid \lambda)$ (figure 3.4).

[Figure 3.4. The likelihood function in the case of a six-point dataset, $P(\{x\} = \{1.5, 2, 3, 4, 5, 12\} \mid \lambda)$, as a function of $\lambda$.]
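The likelihood for this six-point dataset can be explored numerically. The sketch below is mine; it evaluates the log likelihood on a grid of my choosing and simply locates the peak visible in figure 3.4:

```python
import math

data = [1.5, 2, 3, 4, 5, 12]

def log_likelihood(lam):
    # P(x|lam) = exp(-x/lam) / (lam * Z(lam)) on the window 1 < x < 20
    Z = math.exp(-1.0 / lam) - math.exp(-20.0 / lam)   # eq. (3.2)
    return sum(-x / lam - math.log(lam * Z) for x in data)

grid = [0.1 * i for i in range(10, 1000)]   # lambda from 1 to 100
best = max(grid, key=log_likelihood)

# the likelihood peaks at a moderate lambda, well inside the interval (2, 10)
assert 2.0 < best < 10.0
```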

Steve summarized Bayes' theorem as embodying the fact that

  what you know about $\lambda$ after the data arrive is what you knew before [$P(\lambda)$], and what the data told you [$P(\{x\} \mid \lambda)$].

Probabilities are used here to quantify degrees of belief. To nip possible confusion in the bud, it must be emphasized that the hypothesis $\lambda$ that correctly describes the situation is not a stochastic variable, and the fact that the Bayesian uses a probability distribution $P$ does not mean that he thinks of the world as stochastically changing its nature between the states described by the different hypotheses. He uses the notation of probabilities to represent his beliefs about the mutually exclusive micro-hypotheses (here, values of $\lambda$), of which only one is actually true. That probabilities can denote degrees of belief, given assumptions, seemed reasonable to me.

The posterior probability distribution (3.4) represents the unique and complete solution to the problem. There is no need to invent 'estimators'; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem.

[Margin note: If you have any difficulty understanding this chapter I recommend ensuring you are happy with exercises 3.1 and 3.2 (p.47), then noting their similarity to exercise 3.3.]

Assumptions in inference

Our inference is conditional on our assumptions [for example, the prior $P(\lambda)$]. Critics view such priors as a difficulty because they are 'subjective', but I don't see how it could be otherwise. How can one perform inference without making assumptions? I believe that it is of great value that Bayesian methods force one to make these tacit assumptions explicit.

First, once assumptions are made, the inferences are objective and unique, reproducible with complete agreement by anyone who has the same information and makes the same assumptions. For example, given the assumptions listed above, $\mathcal{H}$, and the data $D$, everyone will agree about the posterior probability of the decay length $\lambda$:

  $P(\lambda \mid D, \mathcal{H}) = \frac{P(D \mid \lambda, \mathcal{H}) \, P(\lambda \mid \mathcal{H})}{P(D \mid \mathcal{H})}$.   (3.5)

Second, when the assumptions are explicit, they are easier to criticize, and easier to modify; indeed, we can quantify the sensitivity of our inferences to the details of the assumptions. For example, we can note from the likelihood curves in figure 3.2 that in the case of a single data point at $x = 5$, the likelihood function is less strongly peaked than in the case $x = 3$; the details of the prior $P(\lambda)$ become increasingly important as the sample mean $\bar{x}$ gets closer to the middle of the window, 10.5. In the case $x = 12$, the likelihood function doesn't have a peak at all; such data merely rule out small values of $\lambda$, and don't give any information about the relative probabilities of large values. So in this case, the details of the prior at the small-$\lambda$ end of things are not important, but at the large-$\lambda$ end, the prior is important.

Third, when we are not sure which of various alternative assumptions is the most appropriate for a problem, we can treat this question as another inference task. Thus, given data $D$, we can compare alternative assumptions $\mathcal{H}$ using Bayes' theorem:

  $P(\mathcal{H} \mid D, I) = \frac{P(D \mid \mathcal{H}, I) \, P(\mathcal{H} \mid I)}{P(D \mid I)}$,   (3.6)

51   3.2: The bent coin

where I denotes the highest assumptions, which we are not questioning.

Fourth, we can take into account our uncertainty regarding such assumptions when we make subsequent predictions. Rather than choosing one particular assumption H*, and working out our predictions about some quantity t, P(t | D, H*, I), we obtain predictions that take into account our uncertainty about H by using the sum rule:

  P(t | D, I) = Σ_H P(t | D, H, I) P(H | D, I).   (3.7)

This is another contrast with orthodox statistics, in which it is conventional to 'test' a default model, and then, if the test 'accepts the model' at some 'significance level', to use exclusively that model to make predictions.

Steve thus persuaded me that

  probability theory reaches parts that ad hoc methods cannot reach.

Let's look at a few more examples of simple inference problems.

3.2 The bent coin

A bent coin is tossed F times; we observe a sequence s of heads and tails (which we'll denote by the symbols a and b). We wish to know the bias of the coin, and predict the probability that the next toss will result in a head. We first encountered this task in example 2.7 (p.30), and we will encounter it again in Chapter 6, when we discuss adaptive data compression. It is also the original inference problem studied by Thomas Bayes in his essay published in 1763.

As in exercise 2.8 (p.30), we will assume a uniform prior distribution and obtain a posterior distribution by multiplying by the likelihood. A critic might object, 'where did this prior come from?' I will not claim that the uniform prior is in any way fundamental; indeed we'll give examples of nonuniform priors later. The prior is a subjective assumption. One of the themes of this book is:

  you can't do inference, or data compression, without making assumptions.

We give the name H_1 to our assumptions. [We'll be introducing an alternative set of assumptions in a moment.] The probability, given p_a, that F tosses result in a sequence s that contains counts {F_a, F_b} of the two outcomes is

  P(s | p_a, F, H_1) = p_a^{F_a} (1 - p_a)^{F_b}.   (3.8)

[For example, P(s = aaba | p_a, F = 4, H_1) = p_a p_a (1 - p_a) p_a.] Our first model assumes a uniform prior distribution for p_a,

  P(p_a | H_1) = 1,   p_a ∈ [0, 1],   (3.9)

and p_b ≡ 1 - p_a.

Inferring unknown parameters

Given a string s of length F of which F_a are a's and F_b are b's, we are interested in (a) inferring what p_a might be; (b) predicting whether the next character is

52   3 | More about Inference

an a or a b. [Predictions are always expressed as probabilities. So 'predicting whether the next character is an a' is the same as computing the probability that the next character is an a.]

Assuming H_1 to be true, the posterior probability of p_a, given a string s of length F that has counts {F_a, F_b}, is, by Bayes' theorem,

  P(p_a | s, F, H_1) = P(s | p_a, F, H_1) P(p_a | H_1) / P(s | F, H_1).   (3.10)

The factor P(s | p_a, F, H_1), which, as a function of p_a, is known as the likelihood function, was given in equation (3.8); the prior P(p_a | H_1) was given in equation (3.9). Our inference of p_a is thus:

  P(p_a | s, F, H_1) = p_a^{F_a} (1 - p_a)^{F_b} / P(s | F, H_1).   (3.11)

The normalizing constant is given by the beta integral

  P(s | F, H_1) = ∫_0^1 dp_a p_a^{F_a} (1 - p_a)^{F_b} = Γ(F_a + 1) Γ(F_b + 1) / Γ(F_a + F_b + 2) = F_a! F_b! / (F_a + F_b + 1)!.   (3.12)

Exercise 3.5.[2, p.59] Sketch the posterior probability P(p_a | s = aba, F = 3). What is the most probable value of p_a (i.e., the value that maximizes the posterior probability density)? What is the mean value of p_a under this distribution?

  Answer the same questions for the posterior probability P(p_a | s = bbb, F = 3).

From inferences to predictions

Our prediction about the next toss, the probability that the next toss is an a, is obtained by integrating over p_a. This has the effect of taking into account our uncertainty about p_a when making predictions. By the sum rule,

  P(a | s, F) = ∫ dp_a P(a | p_a) P(p_a | s, F).   (3.13)

The probability of an a given p_a is simply p_a, so

  P(a | s, F) = ∫ dp_a p_a [p_a^{F_a} (1 - p_a)^{F_b} / P(s | F)]   (3.14)
             = ∫ dp_a p_a^{F_a + 1} (1 - p_a)^{F_b} / P(s | F)   (3.15)
             = [(F_a + 1)! F_b! / (F_a + F_b + 2)!] / [F_a! F_b! / (F_a + F_b + 1)!] = (F_a + 1) / (F_a + F_b + 2),   (3.16)

which is known as Laplace's rule.

3.3 The bent coin and model comparison

Imagine that a scientist introduces another theory for our data. He asserts that the source is not really a bent coin but is really a perfectly formed die with one face painted heads ('a') and the other five faces painted tails ('b'). Thus the parameter p_a, which in the original model, H_1, could take any value between 0 and 1, is according to the new hypothesis not a free parameter at all; rather, it is equal to 1/6. [This hypothesis is termed H_0 so that the suffix of each model indicates its number of free parameters.]

How can we compare these two models in the light of data? We wish to infer how probable H_1 is relative to H_0.
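Before moving on, the two closed-form results of section 3.2, the beta-integral evidence (3.12) and Laplace's rule (3.16), are easy to sanity-check numerically. A minimal sketch (Python; the helper names are mine, not the book's):

```python
from math import factorial

def evidence(Fa, Fb):
    # Closed form of the beta integral, equation (3.12).
    return factorial(Fa) * factorial(Fb) / factorial(Fa + Fb + 1)

def laplace_rule(Fa, Fb):
    # Closed-form predictive probability of an 'a', equation (3.16).
    return (Fa + 1) / (Fa + Fb + 2)

def numeric_check(Fa, Fb, n=100000):
    # Midpoint-rule versions of the same two quantities.
    h = 1.0 / n
    den = num = 0.0
    for i in range(n):
        p = (i + 0.5) * h
        w = p**Fa * (1 - p)**Fb     # unnormalized posterior (3.11)
        den += w * h                # evidence, the integral in (3.12)
        num += p * w * h            # the integral in (3.14)
    return den, num / den

ev, pred = numeric_check(2, 1)      # the s = aba example: Fa = 2, Fb = 1
print(ev, evidence(2, 1))           # both close to 1/12
print(pred, laplace_rule(2, 1))     # both close to 3/5
```

The predictive value 3/5 for s = aba is the same posterior mean of p_a asked for in exercise 3.5.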

53   3.3: The bent coin and model comparison

Model comparison as inference

In order to perform model comparison, we write down Bayes' theorem again, but this time with a different argument on the left-hand side. We wish to know how probable H_1 is given the data. By Bayes' theorem,

  P(H_1 | s, F) = P(s | F, H_1) P(H_1) / P(s | F).   (3.17)

Similarly, the posterior probability of H_0 is

  P(H_0 | s, F) = P(s | F, H_0) P(H_0) / P(s | F).   (3.18)

The normalizing constant in both cases is P(s | F), which is the total probability of getting the observed data. If H_1 and H_0 are the only models under consideration, this probability is given by the sum rule:

  P(s | F) = P(s | F, H_1) P(H_1) + P(s | F, H_0) P(H_0).   (3.19)

To evaluate the posterior probabilities of the hypotheses we need to assign values to the prior probabilities P(H_1) and P(H_0); in this case, we might set these to 1/2 each. And we need to evaluate the data-dependent terms P(s | F, H_1) and P(s | F, H_0). We can give names to these quantities. The quantity P(s | F, H_1) is a measure of how much the data favour H_1, and we call it the evidence for model H_1. We already encountered this quantity in equation (3.10), where it appeared as the normalizing constant of the first inference we made, the inference of p_a given the data s.

  How model comparison works: The evidence for a model is usually the normalizing constant of an earlier Bayesian inference.

We evaluated the normalizing constant for model H_1 in (3.12). The evidence for model H_0 is very simple because this model has no parameters to infer. Defining p_0 to be 1/6, we have

  P(s | F, H_0) = p_0^{F_a} (1 - p_0)^{F_b}.   (3.20)

Thus the posterior probability ratio of model H_1 to model H_0 is

  P(H_1 | s, F) / P(H_0 | s, F) = [P(s | F, H_1) P(H_1)] / [P(s | F, H_0) P(H_0)]   (3.21)
                                = [F_a! F_b! / (F_a + F_b + 1)!] / [p_0^{F_a} (1 - p_0)^{F_b}].   (3.22)

Some values of this posterior probability ratio are illustrated in table 3.5. The first five lines illustrate that some outcomes favour one model, and some favour the other. No outcome is completely incompatible with either model. With small amounts of data (six tosses, say) it is typically not the case that one of the two models is overwhelmingly more probable than the other. But with more data, the evidence against H_0 given by any data set with the ratio F_a : F_b differing from 1:5 mounts up. You can't predict in advance how much data are needed to be pretty sure which theory is true. It depends what p_a is.

The simpler model, H_0, since it has no adjustable parameters, is able to lose out by the biggest margin. The odds may be hundreds to one against it. The more complex model can never lose out by a large margin; there's no data set that is actually unlikely given model H_1.
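Equation (3.22) is a one-liner to evaluate. The sketch below (Python; function name is mine) reproduces the entries of table 3.5, and the same function, called with p_0 = 1/2, answers the euro-coin question of exercise 3.15 (a free-bias model H_1 against a fair coin):

```python
from math import factorial

def posterior_odds(Fa, Fb, p0=1/6):
    # Equation (3.22), assuming equal priors P(H1) = P(H0) = 1/2, so the
    # posterior odds equal the evidence ratio.
    evidence_H1 = factorial(Fa) * factorial(Fb) / factorial(Fa + Fb + 1)
    evidence_H0 = p0**Fa * (1 - p0)**Fb
    return evidence_H1 / evidence_H0

# Reproduce table 3.5:
for Fa, Fb in [(5, 1), (3, 3), (2, 4), (1, 5), (0, 6),
               (10, 10), (3, 17), (0, 20)]:
    print((Fa, Fb), round(posterior_odds(Fa, Fb), 3))

# Exercise 3.15 (the Belgian euro): free-bias model versus a fair coin.
print(posterior_odds(140, 110, p0=1/2))
```

For the euro data the ratio comes out at about 0.48, i.e. weak evidence in favour of the fair-coin hypothesis, despite the 'suspicious' sampling-theory p-value quoted in the newspaper.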

54   3 | More about Inference

Table 3.5. Outcome of model comparison between models H_1 and H_0 for the 'bent coin'. Model H_0 states that p_a = 1/6, p_b = 5/6.

  Data (F_a, F_b)    F     P(H_1 | s, F) / P(H_0 | s, F)
  (5, 1)             6     222.2
  (3, 3)             6     2.67
  (2, 4)             6     0.71 = 1/1.4
  (1, 5)             6     0.356 = 1/2.8
  (0, 6)             6     0.427 = 1/2.3
  (10, 10)          20     96.5
  (3, 17)           20     0.2 = 1/5
  (0, 20)           20     1.83

[Figure 3.6. Typical behaviour of the evidence in favour of H_1 as bent coin tosses accumulate under three different conditions (columns 1, 2, 3): column 1, H_0 is true (p_a = 1/6); columns 2 and 3, H_1 is true with p_a = 0.25 and p_a = 0.5. Horizontal axis is the number of tosses, F. The vertical axis on the left is ln [P(s | F, H_1) / P(s | F, H_0)]; the right-hand vertical axis shows the values of P(s | F, H_1) / P(s | F, H_0). The three rows show independent simulated experiments. (See also figure 3.8, p.60.)]

Exercise 3.6.[2] Show that after F tosses have taken place, the biggest value that the log evidence ratio

  log [P(s | F, H_1) / P(s | F, H_0)]   (3.23)

can have scales linearly with F if H_1 is more probable, but the evidence in favour of H_0 can grow at most as log F.

Exercise 3.7.[3, p.60] Putting your sampling theory hat on, assuming F_a has not yet been measured, compute a plausible range that the log evidence ratio might lie in, as a function of F and the true value of p_a, and sketch it as a function of F for p_a = p_0 = 1/6, p_a = 0.25, and p_a = 1/2. [Hint: sketch the log evidence as a function of the random variable F_a and work out the mean and standard deviation of F_a.]

Typical behaviour of the evidence

Figure 3.6 shows the log evidence ratio as a function of the number of tosses, F, in a number of simulated experiments. In the left-hand experiments, H_0 was true. In the right-hand ones, H_1 was true, and the value of p_a was either 0.25 or 0.5.

We will discuss model comparison more in a later chapter.
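A simulation in the style of figure 3.6 is only a few lines. The sketch below (Python; my own illustration, with lgamma standing in for the factorials so that large F is safe) tosses a coin of known bias and reports the resulting log evidence ratio:

```python
import math
import random

def log_evidence_ratio(Fa, Fb, p0=1/6):
    # log [ P(s|F,H1) / P(s|F,H0) ]; lgamma(n+1) = log(n!).
    log_ev1 = (math.lgamma(Fa + 1) + math.lgamma(Fb + 1)
               - math.lgamma(Fa + Fb + 2))
    log_ev0 = Fa * math.log(p0) + Fb * math.log(1 - p0)
    return log_ev1 - log_ev0

def simulate(pa, F, seed=0):
    # One simulated experiment: toss a coin with bias pa a total of F
    # times, then report the final log evidence ratio.
    rng = random.Random(seed)
    Fa = sum(rng.random() < pa for _ in range(F))
    return log_evidence_ratio(Fa, F - Fa)

print(simulate(1/6, 200))   # H0 true: the log evidence ratio stays small
print(simulate(0.5, 200))   # H1 true: the log evidence ratio grows with F
```

Running this for many seeds and many values of F reproduces the qualitative behaviour of figure 3.6 and exercise 3.6: roughly linear growth in favour of H_1 when H_1 is true, at most logarithmic growth in favour of H_0 when H_0 is true.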

55   3.4: An example of legal evidence

3.4 An example of legal evidence

The following example illustrates that there is more to Bayesian inference than the priors.

  Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type 'O' blood. The blood groups of the two traces are found to be of type 'O' (a common type in the local population, having frequency 60%) and of type 'AB' (a rare type, with frequency 1%). Do these data (type 'O' and 'AB' blood were found at scene) give evidence in favour of the proposition that Oliver was one of the two people present at the crime?

A careless lawyer might claim that the fact that the suspect's blood type was found at the scene is positive evidence for the theory that he was present. But this is not so.

Denote the proposition 'the suspect and one unknown person were present' by S. The alternative, S̄, states 'two unknown people from the population were present'. The prior in this problem is the prior probability ratio between the propositions S and S̄. This quantity is important to the final verdict and would be based on all other available information in the case. Our task here is just to evaluate the contribution made by the data D, that is, the likelihood ratio, P(D | S, H) / P(D | S̄, H). In my view, a jury's task should generally be to multiply together carefully evaluated likelihood ratios from each independent piece of admissible evidence with an equally carefully reasoned prior probability.
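The recipe in the last sentence is simply Bayes' theorem in odds form: posterior odds = prior odds times the product of the independent likelihood ratios. A minimal sketch (Python; the numbers are hypothetical, chosen only to illustrate the arithmetic, not taken from any real case):

```python
def combine_evidence(prior_odds, likelihood_ratios):
    # Posterior odds = prior odds times the product of the likelihood
    # ratios contributed by independent pieces of evidence.
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

# Hypothetical numbers: prior odds of 1:100 in favour of S, one piece of
# evidence with likelihood ratio 50 and another with likelihood ratio 0.83.
print(combine_evidence(0.01, [50, 0.83]))   # posterior odds of about 0.415
```

Note that a likelihood ratio below 1 lowers the odds: evidence can count against a proposition even when the data are compatible with it, which is exactly the point of the blood-stain example that follows.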
[This view is shared by many statisticians but learned British appeal judges recently disagreed and actually overturned the verdict of a trial because the jurors had been taught to use Bayes' theorem to handle complicated DNA evidence.]

The probability of the data given S is the probability that one unknown person drawn from the population has blood type AB:

  P(D | S, H) = p_AB   (3.24)

(since, given S, we already know that one trace will be of type O). The probability of the data given S̄ is the probability that two unknown people drawn from the population have types O and AB:

  P(D | S̄, H) = 2 p_O p_AB.   (3.25)

In these equations H denotes the assumptions that two people were present and left blood there, and that the probability distribution of the blood groups of unknown people in an explanation is the same as the population frequencies.

Dividing, we obtain the likelihood ratio:

  P(D | S, H) / P(D | S̄, H) = 1 / (2 p_O) = 1 / (2 × 0.6) = 0.83.   (3.26)

Thus the data in fact provide weak evidence against the supposition that Oliver was present.

This result may be found surprising, so let us examine it from various points of view. First consider the case of another suspect, Alberto, who has type AB. Intuitively, the data do provide evidence in favour of the theory S', that this suspect was present, relative to the null hypothesis S̄. And indeed the likelihood ratio in this case is:

  P(D | S', H) / P(D | S̄, H) = 1 / (2 p_AB) = 50.   (3.27)

56   3 | More about Inference

Now let us change the situation slightly; imagine that 99% of people are of blood type O, and the rest are of type AB. Only these two blood types exist in the population. The data at the scene are the same as before. Consider again how these data influence our beliefs about Oliver, a suspect of type O, and Alberto, a suspect of type AB. Intuitively, we still believe that the presence of the rare AB blood provides positive evidence that Alberto was there. But does the fact that type O blood was detected at the scene favour the hypothesis that Oliver was present? If this were the case, that would mean that regardless of who the suspect is, the data make it more probable they were present; everyone in the population would be under greater suspicion, which would be absurd. The data may be compatible with any suspect of either blood type being present, but if they provide evidence for some theories, they must also provide evidence against other theories.

Here is another way of thinking about this: imagine that instead of two people's blood stains there are ten, and that in the entire local population of one hundred, there are ninety type O suspects and ten type AB suspects. Consider a particular type O suspect, Oliver: without any other information, and before the blood test results come in, there is a one in 10 chance that he was at the scene, since we know that 10 out of the 100 suspects were present.
We now get the results of the blood tests, and find that nine of the ten stains are of type AB, and one of the stains is of type O. Does this make it more likely that Oliver was there? No, there is now only a one in ninety chance that he was there, since we know that only one person present was of type O.

Maybe the intuition is aided finally by writing down the formulae for the general case where n_O blood stains of individuals of type O are found, and n_AB of type AB, a total of N individuals in all, and unknown people come from a large population with fractions p_O, p_AB. (There may be other blood types too.) The task is to evaluate the likelihood ratio for the two hypotheses: S, 'the type O suspect (Oliver) and N - 1 unknown others left N stains'; and S̄, 'N unknowns left N stains'. The probability of the data under hypothesis S̄ is just the probability of getting n_O, n_AB individuals of the two types when N individuals are drawn at random from the population:

  P(n_O, n_AB | S̄) = [N! / (n_O! n_AB!)] p_O^{n_O} p_AB^{n_AB}.   (3.28)

In the case of hypothesis S, we need the distribution of the N - 1 other individuals:

  P(n_O, n_AB | S) = [(N - 1)! / ((n_O - 1)! n_AB!)] p_O^{n_O - 1} p_AB^{n_AB}.   (3.29)

The likelihood ratio is:

  P(n_O, n_AB | S) / P(n_O, n_AB | S̄) = (n_O / N) / p_O.   (3.30)

This is an instructive result. The likelihood ratio, i.e. the contribution of these data to the question of whether Oliver was present, depends simply on a comparison of the frequency of his blood type in the observed data with the background frequency in the population. There is no dependence on the counts of the other types found at the scene, or their frequencies in the population.
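Equations (3.28)-(3.30) can be checked in a few lines. A sketch (Python; function name is mine, and for simplicity it assumes the stains contain only the two types O and AB, as in the examples above):

```python
from math import factorial

def likelihood_ratio(n_O, n_AB, p_O, p_AB):
    N = n_O + n_AB   # assuming only types O and AB appear among the stains
    # P(n_O, n_AB | S): Oliver plus N-1 unknowns, equation (3.29).
    P_S = (factorial(N - 1) / (factorial(n_O - 1) * factorial(n_AB))) \
          * p_O**(n_O - 1) * p_AB**n_AB
    # P(n_O, n_AB | S-bar): N unknowns, equation (3.28).
    P_Sbar = (factorial(N) / (factorial(n_O) * factorial(n_AB))) \
             * p_O**n_O * p_AB**n_AB
    return P_S / P_Sbar   # simplifies to (n_O / N) / p_O, equation (3.30)

# The original two-stain case: one O stain, one AB stain, p_O = 0.6.
print(likelihood_ratio(1, 1, 0.6, 0.01))   # about 0.83, matching (3.26)
# The ten-stain case: one O stain among ten.
print(likelihood_ratio(1, 9, 0.6, 0.01))   # (1/10)/0.6, about 0.17
```

Changing p_AB in either call leaves the result unchanged, which is the 'no dependence on the other types' point made above.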

57   3.5: Exercises

If there are more type O stains than the average number expected under hypothesis S̄, then the data give evidence in favour of the presence of Oliver. Conversely, if there are fewer type O stains than the expected number under S̄, then the data reduce the probability of the hypothesis that he was there. In the special case n_O/N = p_O, the data contribute no evidence either way, regardless of the fact that the data are compatible with the hypothesis S.

3.5 Exercises

Exercise 3.8.[2, p.60] The three doors, normal rules. On a game show, a contestant is told the rules as follows:

  There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host will open one of the other two doors, and he will do so in such a way as not to reveal the prize. For example, if you first choose door 1, he will then open one of doors 2 and 3, and it is guaranteed that he will choose which one to open so that the prize will not be revealed.

  At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door.

Imagine that the contestant chooses door 1 first; then the gameshow host opens door 3, revealing nothing behind the door, as promised. Should the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does it make no difference?

Exercise 3.9.[2, p.61] The three doors, earthquake scenario. Imagine that the game happens again and just as the gameshow host is about to open one of the doors a violent earthquake rattles the building and one of the three doors flies open. It happens to be door 3, and it happens not to have the prize behind it. The contestant had initially chosen door 1.

Repositioning his toupee, the host suggests, 'OK, since you chose door 1 initially, door 3 is a valid door for me to open, according to the rules of the game; I'll let door 3 stay open. Let's carry on as if nothing happened.'

Should the contestant stick with door 1, or switch to door 2, or does it make no difference? Assume that the prize was placed randomly, that the gameshow host does not know where it is, and that the door flew open because its latch was broken by the earthquake.

[A similar alternative scenario is a gameshow whose confused host forgets the rules, and where the prize is, and opens one of the unchosen doors at random. He opens door 3, and the prize is not revealed. Should the contestant choose what's behind door 1 or door 2? Does the optimal decision for the contestant depend on the contestant's beliefs about whether the gameshow host is confused or not?]

Exercise 3.10.[2] Another example in which the emphasis is not on priors. You visit a family whose three children are all at the local school. You don't

58   3 | More about Inference

know anything about the sexes of the three children. While walking clumsily round the home, you stumble through one of the three unlabelled bedroom doors that you know belong, one each, to the three children, and find that the bedroom contains girlie stuff in sufficient quantities to convince you that the child who lives in that bedroom is a girl. Later, you sneak a look at a letter addressed to the parents, which reads 'From the Headmaster: we are sending this letter to all parents who have male children at the school to inform them about the following boyish matters. . . '.

These two sources of evidence establish that at least one of the three children is a girl, and that at least one of the children is a boy. What are the probabilities that there are (a) two girls and one boy; (b) two boys and one girl?

Exercise 3.11.[2, p.61] Mrs S is found stabbed in her family garden. Mr S behaves strangely after her death and is considered as a suspect. On investigation of police and social records it is found that Mr S had beaten up his wife on at least nine previous occasions. The prosecution advances this data as evidence in favour of the hypothesis that Mr S is guilty of the murder. 'Ah no,' says Mr S's highly paid lawyer, 'statistically, only one in a thousand wife-beaters actually goes on to murder his wife.¹ So the wife-beating is not strong evidence at all. In fact, given the wife-beating evidence alone, it's extremely unlikely that he would be the murderer of his wife, only a 1/1000 chance. You should therefore find him innocent.'

Is the lawyer right to imply that the history of wife-beating does not point to Mr S's being the murderer?
Or is the lawyer a slimy trickster? If the latter, what is wrong with his argument?

[Having received an indignant letter from a lawyer about the preceding paragraph, I'd like to add an extra inference exercise at this point: Does my suggestion that Mr. S.'s lawyer may have been a slimy trickster imply that I believe all lawyers are slimy tricksters? (Answer: No.)]

Exercise 3.12.[2] A bag contains one counter, known to be either white or black. A white counter is put in, the bag is shaken, and a counter is drawn out, which proves to be white. What is now the chance of drawing a white counter? [Notice that the state of the bag, after the operations, is exactly identical to its state before.]

Exercise 3.13.[2, p.62] You move into a new house; the phone is connected, and you're pretty sure that the phone number is 740511, but not as sure as you would like to be. As an experiment, you pick up the phone and dial 740511; you obtain a 'busy' signal. Are you now more sure of your phone number? If so, how much?

Exercise 3.14.[1] In a game, two coins are tossed. If either of the coins comes up heads, you have won a prize. To claim the prize, you must point to one of your coins that is a head and say 'look, that coin's a head, I've won'. You watch Fred play the game. He tosses the two coins, and he

¹ In the U.S.A. it is estimated that 2 million women are abused each year by their partners. In 1994, 4739 women were victims of homicide; of those, 1326 women (28%) were slain by husbands and boyfriends. (Sources: http://www.umn.edu/mincava/papers/factoid.htm, http://www.gunfree.inter.net/vpc/womenfs.htm)

59   3.6: Solutions

points to a coin and says 'look, that coin's a head, I've won'. What is the probability that the other coin is a head?

Exercise 3.15.[2, p.63] A statistical statement appeared in The Guardian on Friday January 4, 2002:

  When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. 'It looks very suspicious to me', said Barry Blight, a statistics lecturer at the London School of Economics. 'If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%'.

But do these data give evidence that the coin is biased rather than fair? [Hint: see equation (3.22).]

3.6 Solutions

Solution to exercise 3.1 (p.47). Let the data be D. Assuming equal prior probabilities,

  P(A | D) / P(B | D) = (1·3·1·3·1·2·1) / (2·2·2·2·2·2·1) = 9/32,   (3.31)

and P(A | D) = 9/41.

Solution to exercise 3.2 (p.47). The probability of the data given each hypothesis is:

  P(D | A) = (1·3·1·3·1·2·1) / 20^7 = 18 / 20^7;   (3.32)
  P(D | B) = (2·2·2·2·2·2·1) / 20^7 = 64 / 20^7;   (3.33)
  P(D | C) = (1·1·1·1·1·1·1) / 20^7 = 1 / 20^7.   (3.34)

So

  P(A | D) = 18 / (18 + 64 + 1) = 18/83;   P(B | D) = 64/83;   P(C | D) = 1/83.   (3.35)

[Figure 3.7. Posterior probability for the bias p_a of a bent coin given two different data sets: (a) P(p_a | s = aba, F = 3) ∝ p_a^2 (1 - p_a); (b) P(p_a | s = bbb, F = 3) ∝ (1 - p_a)^3.]

Solution to exercise 3.5 (p.52).

(a) P(p_a | s = aba, F = 3) ∝ p_a^2 (1 - p_a). The most probable value of p_a (i.e., the value that maximizes the posterior probability density) is 2/3. The mean value of p_a is 3/5.

  See figure 3.7a.

60   3 | More about Inference

(b) P(p_a | s = bbb, F = 3) ∝ (1 - p_a)^3. The most probable value of p_a (i.e., the value that maximizes the posterior probability density) is 0. The mean value of p_a is 1/5.

  See figure 3.7b.

[Figure 3.8. Range of plausible values of the log evidence in favour of H_1 as a function of F, for p_a = 1/6, p_a = 0.25, and p_a = 0.5. The vertical axis on the left is ln [P(s | F, H_1) / P(s | F, H_0)]; the right-hand vertical axis shows the values of P(s | F, H_1) / P(s | F, H_0). The solid line shows the log evidence if the random variable F_a takes on its mean value, F_a = p_a F. The dotted lines show (approximately) the log evidence if F_a is at its 2.5th or 97.5th percentile. (See also figure 3.6, p.54.)]

Solution to exercise 3.7 (p.54). The curves in figure 3.8 were found by finding the mean and standard deviation of F_a, then setting F_a to the mean ± two standard deviations to get a 95% plausible range for F_a, and computing the three corresponding values of the log evidence ratio.

Solution to exercise 3.8 (p.57). Let H_i denote the hypothesis that the prize is behind door i. We make the following assumptions: the three hypotheses H_1, H_2 and H_3 are equiprobable a priori, i.e.,

  P(H_1) = P(H_2) = P(H_3) = 1/3.   (3.36)

The datum we receive, after choosing door 1, is one of D = 3 and D = 2 (meaning door 3 or 2 is opened, respectively). We assume that these two possible outcomes have the following probabilities. If the prize is behind door 1 then the host has a free choice; in this case we assume that the host selects at random between D = 2 and D = 3. Otherwise the choice of the host is forced and the probabilities are 0 and 1.

  P(D = 2 | H_1) = 1/2    P(D = 2 | H_2) = 0    P(D = 2 | H_3) = 1
  P(D = 3 | H_1) = 1/2    P(D = 3 | H_2) = 1    P(D = 3 | H_3) = 0   (3.37)

Now, using Bayes' theorem, we evaluate the posterior probabilities of the hypotheses:

  P(H_i | D = 3) = P(D = 3 | H_i) P(H_i) / P(D = 3)   (3.38)

  P(H_1 | D = 3) = (1/2)(1/3) / P(D = 3)    P(H_2 | D = 3) = (1)(1/3) / P(D = 3)    P(H_3 | D = 3) = (0)(1/3) / P(D = 3)   (3.39)

The denominator P(D = 3) is (1/2) because it is the normalizing constant for this posterior distribution. So

  P(H_1 | D = 3) = 1/3    P(H_2 | D = 3) = 2/3    P(H_3 | D = 3) = 0.   (3.40)

So the contestant should switch to door 2 in order to have the biggest chance of getting the prize.

Many people find this outcome surprising. There are two ways to make it more intuitive. One is to play the game thirty times with a friend and keep track of the frequency with which switching gets the prize. Alternatively, you can perform a thought experiment in which the game is played with a million doors. The rules are now that the contestant chooses one door, then the

game show host opens 999,998 doors in such a way as not to reveal the prize, leaving the contestant's selected door and one other door closed. The contestant may now stick or switch. Imagine the contestant confronted by a million doors, of which doors 1 and 234,598 have not been opened, door 1 having been the contestant's initial guess. Where do you think the prize is?

Solution to exercise 3.9 (p.57). If door 3 is opened by an earthquake, the inference comes out differently, even though visually the scene looks the same. The nature of the data, and the probability of the data, are both now different. The possible data outcomes are, firstly, that any number of the doors might have opened. We could label the eight possible outcomes d = (0,0,0), (0,0,1), (0,1,0), (1,0,0), (0,1,1), ..., (1,1,1). Secondly, it might be that the prize is visible after the earthquake has opened one or more doors. So the data D consists of the value of d, and a statement of whether the prize was revealed. It is hard to say what the probabilities of these outcomes are, since they depend on our beliefs about the reliability of the door latches and the properties of earthquakes, but it is possible to extract the desired posterior probability without naming the values of P(d | H_i) for each d. All that matters are the relative values of the quantities P(D | H_1), P(D | H_2), P(D | H_3), for the value of D that actually occurred. [This is the likelihood principle, which we met in section 2.3.] The value of D that actually occurred is 'd = (0,0,1), and no prize visible'. First, it is clear that P(D | H_3) = 0, since the datum
that no prize is visible is incompatible with H_3. Now, assuming that the contestant selected door 1, how does the probability P(D | H_1) compare with P(D | H_2)? Assuming that earthquakes are not sensitive to the decisions of game show contestants, these two quantities have to be equal, by symmetry. We don't know how likely it is that door 3 falls off its hinges; but however likely it is, it's just as likely to do so whether the prize is behind door 1 or door 2. So, if P(D | H_1) and P(D | H_2) are equal, we obtain:

  P(H_1 | D) = P(D | H_1)(1/3)/P(D) = 1/2
  P(H_2 | D) = P(D | H_2)(1/3)/P(D) = 1/2
  P(H_3 | D) = P(D | H_3)(1/3)/P(D) = 0.   (3.41)

The two possible hypotheses are now equally likely.

If we assume that the host knows where the prize is and might be acting deceptively, then the answer might be further modified, because we have to view the host's words as part of the data.

Confused? It's well worth making sure you understand these two gameshow problems. Don't worry, I slipped up on the first problem, the first time I met it.

There is a general rule which helps immensely when you have a confusing probability problem:

  Always write down the probability of everything.   (Steve Gull)

From this joint probability, any desired inference can be mechanically obtained (figure 3.9).

[Figure 3.9: the probability of everything, for the second three-door problem, assuming an earthquake has just occurred. Each hypothesis about where the prize is (door 1, 2, or 3) is combined with each possible set of doors opened (none; 3 alone; 1,2,3; ...). Here p_3 is the probability that door 3 alone is opened by an earthquake, so each joint event 'prize behind door i and door 3 alone falls open' has probability p_3/3.]

Solution to exercise 3.11 (p.58). The statistic quoted by the lawyer indicates the probability that a randomly selected wife-beater will also murder his wife.
The probability that the husband was the murderer, given that the wife has been murdered, is a completely different quantity.
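Before leaving the gameshow problems, the posteriors in equations (3.40) and (3.41) can be checked mechanically, in the spirit of 'write down the probability of everything'. This is an illustrative sketch, not part of the original text; the priors and likelihoods are exactly those assumed in the solutions above.

```python
# Posterior over hypotheses from priors and likelihoods (Bayes' theorem).
def posterior(priors, likelihoods):
    joint = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joint)                # the normalizing constant P(D)
    return [j / z for j in joint]

priors = [1/3, 1/3, 1/3]

# Game-show host opens door 3 (equation 3.37): P(D=3 | H_1,2,3) = 1/2, 1, 0.
print(posterior(priors, [1/2, 1, 0]))   # [1/3, 2/3, 0]: switch to door 2

# Earthquake opens door 3: P(D | H_1) = P(D | H_2) by symmetry, P(D | H_3) = 0.
p3 = 0.01   # any common value cancels in the normalization
print(posterior(priors, [p3, p3, 0]))   # [1/2, 1/2, 0]: no reason to switch
```

Note that in the earthquake case the unknown value p_3 drops out of the answer, which is exactly why the solution above never needed to name it.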

To deduce the latter, we need to make further assumptions about the probability that the wife is murdered by someone else. If she lives in a neighbourhood with frequent random murders, then this probability is large and the posterior probability that the husband did it (in the absence of other evidence) may not be very large. But in more peaceful regions, it may well be the case that the most likely person to have murdered you, if you are found murdered, is one of your closest relatives.

Let's work out some illustrative numbers with the help of the statistics on page 58. Let m = 1 denote the proposition that a woman has been murdered; h = 1, the proposition that the husband did it; and b = 1, the proposition that he beat her in the year preceding the murder. The proposition 'someone else did it' is denoted by h = 0. We need to define P(h | m = 1), P(b | h = 1, m = 1), and P(b = 1 | h = 0, m = 1) in order to compute the posterior probability P(h = 1 | b = 1, m = 1). From the statistics, we can read out P(h = 1 | m = 1) = 0.28. And if two million women out of 100 million are beaten, then P(b = 1 | h = 0, m = 1) = 0.02. Finally, we need a value for P(b | h = 1, m = 1): if a man murders his wife, how likely is it that this is the first time he laid a finger on her? I expect it's pretty unlikely; so maybe P(b = 1 | h = 1, m = 1) is 0.9 or larger.

By Bayes' theorem, then,

  P(h = 1 | b = 1, m = 1) = (.9 x .28) / (.9 x .28 + .02 x .72) = 95%, approximately.   (3.42)

One way to make obvious the sliminess of the lawyer on p.58 is to construct arguments, with the same logical structure as his, that are clearly wrong. For example, the lawyer could say 'Not only was Mrs.
S murdered, she was murdered between 4.02pm and 4.03pm. Statistically, only one in a million wife-beaters actually goes on to murder his wife between 4.02pm and 4.03pm. So the wife-beating is not strong evidence at all. In fact, given the wife-beating evidence alone, it's extremely unlikely that he would murder his wife in this way: only a 1/1,000,000 chance.'

Solution to exercise 3.13 (p.58). There are two hypotheses. H_0: your number is 740511; H_1: it is another number. The data, D, are 'when I dialed 740511, I got a busy signal'. What is the probability of D, given each hypothesis? If your number is 740511, then we expect a busy signal with certainty:

  P(D | H_0) = 1.

On the other hand, if H_1 is true, then the probability that the number dialled returns a busy signal is smaller than 1, since various other outcomes were also possible (a ringing tone, or a number-unobtainable signal, for example). The value of this probability P(D | H_1) will depend on the probability that a random phone number similar to your own phone number would be a valid phone number, and on the probability that you get a busy signal when you dial a valid phone number.

I estimate from the size of my phone book that Cambridge has about 75 000 valid phone numbers, all of length six digits. The probability that a random six-digit number is valid is therefore about 75 000/10^6 = 0.075. If we exclude numbers beginning with 0, 1, and 9 from the random choice, the probability is about 75 000/700 000 = 0.1, roughly. If we assume that telephone numbers are clustered then a misremembered number might be more likely to be valid than a randomly chosen number; so the probability that our guessed number would be valid, assuming H_1 is true, might be bigger than 0.1.

Anyway, it must be somewhere between 0.1 and 1. We can carry forward this uncertainty in the probability and see how much it matters at the end.

The probability that you get a busy signal when you dial a valid phone number is equal to the fraction of phones you think are in use or off-the-hook when you make your tentative call. This fraction varies from town to town and with the time of day. In Cambridge, during the day, I would guess that about 1% of phones are in use. At 4am, maybe 0.1%, or fewer.

The probability P(D | H_1) is the product of these two probabilities, that is, about 0.1 x 0.01 = 10^-3. According to our estimates, there's about a one-in-a-thousand chance of getting a busy signal when you dial a random number; or one-in-a-hundred, if valid numbers are strongly clustered; or one-in-10^4, if you dial in the wee hours.

How do the data affect your beliefs about your phone number? The posterior probability ratio is the likelihood ratio times the prior probability ratio:

  P(H_0 | D) / P(H_1 | D) = [P(D | H_0) / P(D | H_1)] x [P(H_0) / P(H_1)].   (3.43)

The likelihood ratio is about 100-to-1 or 1000-to-1, so the posterior probability ratio is swung in favour of H_0 by a factor of 100 or 1000. If the prior probability of H_0 was 0.5 then the posterior probability is

  P(H_0 | D) = 1 / (1 + P(D | H_1)/P(D | H_0)) = 0.99 or 0.999, approximately.   (3.44)

Solution to exercise 3.15 (p.59). We compare the models H_0 (the coin is fair) and H_1 (the coin is biased, with the prior on its bias set to the uniform distribution P(p | H_1) = 1).
[The use of a uniform prior seems reasonable to me, since I know that some coins, such as American pennies, have severe biases when spun on edge; so the situations p = 0.01 or p = 0.99 would not surprise me.]

When I mention H_0 (the coin is fair), a pedant would say, 'how absurd to even consider that the coin is fair; any coin is surely biased to some extent'. And of course I would agree. So will pedants kindly understand H_0 as meaning 'the coin is fair to within one part in a thousand, i.e., p = 0.5 plus or minus 0.001'.

[Figure 3.10: the probability distribution of the number of heads given the two hypotheses, that the coin is fair, and that it is biased, with the prior distribution of the bias being uniform. The outcome (D = 140 heads) gives weak evidence in favour of H_0, the hypothesis that the coin is fair.]

The likelihood ratio is:

  P(D | H_1) / P(D | H_0) = [140! 110! / 251!] / (1/2^250) = 0.48.   (3.45)

Thus the data give scarcely any evidence either way; in fact they give weak evidence (two to one) in favour of H_0, the hypothesis that the coin is fair!

'No, no', objects the believer in bias, 'your silly uniform prior doesn't represent my prior beliefs about the bias of biased coins: I was expecting only a small bias'. To be as generous as possible to H_1, let's see how well it could fare if the prior were presciently set. Let us allow a prior of the form

  P(p | H_1, alpha) = (1/Z(alpha)) p^(alpha-1) (1-p)^(alpha-1),   where Z(alpha) = Gamma(alpha)^2 / Gamma(2 alpha)   (3.46)

(a Beta distribution, with the original uniform prior reproduced by setting alpha = 1). By tweaking alpha, the likelihood ratio for H_1 over H_0,

  P(D | H_1, alpha) / P(D | H_0) = [Gamma(140+alpha) Gamma(110+alpha) Gamma(2 alpha) / (Gamma(250+2 alpha) Gamma(alpha)^2)] x 2^250,   (3.47)

76 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 3 | about Inference 64 More be increased for of in gure 3.11. values several a little. can wn It is sho ; ) P jH ( D 1 favourable most the ( of choice of od ratio a likeliho yield can 50) ' Even ) jH ( P D 0 only in favour two to one . H of 1 .37 .25 They not `very suspicious'. the can In conclusion, data are be construed 1.0 .48 two of the or other of one in favour evidence two-to-one at most as giving 2.7 .82 hypotheses. 1.3 7.4 1.8 20 of over-restrictiv e priors? Are od ratios these wimp y likeliho the fault Is there 55 1.9 a `very any way of pro is best- that prior The conclusion? suspicious' ducing 1.7 148 1.3 403 p sets that prior od, is the of likeliho in terms data, to f to matc hed the 1.1 1096 . The call this mo del H y one. is likeliho od ratio 140 = 250 with probabilit Let's 140 250 110 ) = 2 f ( D jH evidence ) =P ( D jH strongest P (1 f ) = 6 : 1. So the that 0 that is hypothesis the bias is no there these data can possibly muster against for Figure . Lik eliho 3.11 od ratio six-to-one. prior of the choices various hyperparameter . distribution's ers we are absurdly misleading answ the that `sampling the- While noticing statistics pro duces, suc h as the p ory' of 7% in the exercise we just solv ed, -value let's k the boot in. If we mak stic to the data set, increasing e a tiny change the num ber of heads in 250 tosses from 140 to 141, we nd that the p -value goes below the mystical (the p -value is 0.0497). The sampling value of 0.05 0 theory squeak `the statistician y of getting a result as would happily probabilit ( D P jH ; ) 1 0 0.05 null the { we thus reject hypothesis than is smaller heads extreme as 141 P jH ) D ( 0 The several at a signi cance er is sho level of 5%'. 
The correct answer is shown for several values of alpha in figure 3.12. The values worth highlighting from this table are, first, the likelihood ratio when H_1 uses the standard uniform prior, which is 1:0.61 in favour of the null hypothesis H_0. Second, the most favourable choice of alpha, from the point of view of H_1, can only yield a likelihood ratio of about 2.3:1 in favour of H_1.

Be warned! A p-value of 0.05 is often interpreted as implying that the odds are stacked about twenty-to-one against the null hypothesis. But the truth in this case is that the evidence either slightly favours the null hypothesis, or disfavours it by at most 2.3 to one, depending on the choice of prior.

The p-values and 'significance levels' of classical statistics should be treated with extreme caution. Shun them! Here ends the sermon.

[Figure 3.12: likelihood ratio P(D | H_1, alpha)/P(D | H_0) for various choices of the prior distribution's hyperparameter alpha, when the data are D = 141 heads in 250 trials.

  alpha:  .37   1.0   2.7   7.4   20    55    148   403   1096
  ratio:  .32   .61   1.0   1.6   2.2   2.3   1.9   1.4   1.2]
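The likelihood ratios tabulated in figures 3.11 and 3.12 all follow from equation (3.47). Working in log space with lgamma avoids overflowing the huge factorials; this is an illustrative sketch, not part of the original text.

```python
from math import lgamma, exp, log

def likelihood_ratio(heads, tails, alpha):
    """P(D | H1, alpha) / P(D | H0) of equation (3.47), computed in log space."""
    n = heads + tails
    log_ratio = (lgamma(heads + alpha) + lgamma(tails + alpha) - lgamma(n + 2*alpha)
                 + lgamma(2*alpha) - 2*lgamma(alpha)
                 + n * log(2))
    return exp(log_ratio)

print(likelihood_ratio(140, 110, 1.0))   # ~0.48: the uniform prior, equation (3.45)
print(likelihood_ratio(141, 109, 1.0))   # ~0.61: figure 3.12, uniform prior
print(likelihood_ratio(140, 110, 50))    # ~1.9: near the most favourable alpha
```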

Part I

Data Compression

About Chapter 4

In this chapter we discuss how to measure the information content of the outcome of a random experiment.

This chapter has some tough bits. If you find the mathematical details hard, skim through them and keep going: you'll be able to enjoy Chapters 5 and 6 without this chapter's tools.

Before reading Chapter 4, you should have read Chapter 2 and worked on exercises 2.21-2.25 and 2.16 (pp. 36-37), and exercise 4.1 below.

Notation:
  x is a member of the set A
  S is a subset of the set A
  S is a subset of, or equal to, the set A
  V = B union A: V is the union of the sets B and A
  V = B intersect A: V is the intersection of the sets B and A
  |A|: number of elements in set A

The following exercise is intended to help you think about how to measure information content.

Exercise 4.1. [2, p.69] Please work on this problem before reading Chapter 4.

You are given 12 balls, all equal in weight except for one that is either heavier or lighter. You are also given a two-pan balance to use. In each use of the balance you may put any number of the 12 balls on the left pan, and the same number on the right pan, and push a button to initiate the weighing; there are three possible outcomes: either the weights are equal, or the balls on the left are heavier, or the balls on the left are lighter. Your task is to design a strategy to determine which is the odd ball and whether it is heavier or lighter than the others in as few uses of the balance as possible.

While thinking about this problem, you may find it helpful to consider the following questions:

(a) How can one measure information?

(b) When you have identified the odd ball and whether it is heavy or light, how much information
have you gained?

(c) Once you have designed a strategy, draw a tree showing, for each of the possible outcomes of a weighing, what weighing you perform next. At each node in the tree, how much information have the outcomes so far given you, and how much information remains to be gained?

(d) How much information is gained when you learn (i) the state of a flipped coin; (ii) the states of two flipped coins; (iii) the outcome when a four-sided die is rolled?

(e) How much information is gained on the first step of the weighing problem if 6 balls are weighed against the other 6? How much is gained if 4 are weighed against 4 on the first step, leaving out 4 balls?

4

The Source Coding Theorem

4.1 How to measure the information content of a random variable?

In the next few chapters, we'll be talking about probability distributions and random variables. Most of the time we can get by with sloppy notation, but occasionally, we will need precise notation. Here is the notation that we established in Chapter 2.

An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value of a random variable, which takes on one of a set of possible values, A_X = {a_1, a_2, ..., a_i, ..., a_I}, having probabilities P_X = {p_1, p_2, ..., p_I}, with P(x = a_i) = p_i, p_i >= 0 and the sum over a_i in A_X of P(x = a_i) equal to 1.

How can we measure the information content of an outcome x = a_i from such an ensemble? In this chapter we examine the assertions

1. that the Shannon information content,

  h(x = a_i) = log2 (1/p_i),   (4.1)

is a sensible measure of the information content of the outcome x = a_i, and

2. that the entropy of the ensemble,

  H(X) = sum over i of p_i log2 (1/p_i),   (4.2)

is a sensible measure of the ensemble's average information content.

[Figure 4.1: the Shannon information content h(p) = log2(1/p) and the binary entropy function H_2(p) = H(p, 1-p), as functions of p. Some values:

  p      h(p)   H_2(p)
  0.001  10.0   0.011
  0.01    6.6   0.081
  0.1     3.3   0.47
  0.2     2.3   0.72
  0.5     1.0   1.0]

Figure 4.1 shows the Shannon information content of an outcome with probability p, as a function of p. The less probable an outcome is, the greater its Shannon information content.
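Definitions (4.1) and (4.2) translate directly into code; this small sketch (an illustration, not part of the original text) checks a couple of the values in figure 4.1.

```python
from math import log2

def h(p):
    """Shannon information content, in bits, of an outcome of probability p."""
    return log2(1 / p)

def entropy(probs):
    """Entropy of an ensemble: the average Shannon information content."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(h(0.1))                  # ~3.3 bits, as tabulated in figure 4.1
print(entropy([0.1, 0.9]))     # ~0.47 bits: the binary entropy H_2(0.1)
print(entropy([0.25] * 4))     # 2.0 bits: a uniform four-way choice
```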
Figure 4.1 also shows the binary entropy function,

  H_2(p) = H(p, 1-p) = p log2(1/p) + (1-p) log2(1/(1-p)),   (4.3)

which is the entropy of the ensemble X whose alphabet and probability distribution are A_X = {a, b}, P_X = {p, (1-p)}.
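Equation (4.3) is just the two-outcome special case of the entropy (4.2); a quick check of the H_2 values tabulated in figure 4.1 (a sketch, assuming nothing beyond the definition):

```python
from math import log2

def H2(p):
    """Binary entropy function of equation (4.3), in bits."""
    if p in (0, 1):
        return 0.0   # the limit of x*log2(1/x) as x -> 0 is 0
    return p * log2(1/p) + (1 - p) * log2(1/(1 - p))

print(H2(0.2))   # ~0.72, matching the table in figure 4.1
print(H2(0.5))   # 1.0: one full bit for a fair coin flip
```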

Information content of independent random variables

Why should log 1/p_i have anything to do with the information content? Why not some other function of p_i? We'll explore this question in detail shortly, but first, notice a nice property of this particular function h(x) = log 1/P(x).

Imagine learning the value of two independent random variables, x and y. The definition of independence is that the probability distribution is separable into a product:

  P(x, y) = P(x) P(y).   (4.4)

Intuitively, we might want any measure of the 'amount of information gained' to have the property of additivity; that is, for independent random variables x and y, the information gained when we learn x and y should equal the sum of the information gained if x alone were learned and the information gained if y alone were learned.

The Shannon information content of the outcome x, y is

  h(x, y) = log 1/P(x, y) = log 1/(P(x)P(y)) = log 1/P(x) + log 1/P(y),   (4.5)

so it does indeed satisfy

  h(x, y) = h(x) + h(y), if x and y are independent.   (4.6)

Exercise 4.2. [1, p.86] Show that, if x and y are independent, the entropy of the outcome x, y satisfies

  H(X, Y) = H(X) + H(Y).   (4.7)

In words, entropy is additive for independent variables.

We now explore these ideas with some examples; then, in section 4.4 and in Chapters 5 and 6, we prove that the Shannon information content and the entropy are related to the number of bits needed to describe the outcome of an experiment.

The weighing problem: designing informative experiments

Have you solved the weighing problem (exercise 4.1, p.66) yet? Are you sure?
Notice that in three uses of the balance (which reads either 'left heavier', 'right heavier', or 'balanced') the number of conceivable outcomes is 3^3 = 27, whereas the number of possible states of the world is 24: the odd ball could be any of the twelve balls, and it could be heavy or light. So in principle, the problem might be solvable in three weighings, but not in two, since 3^2 < 24.

If you know how you can determine the odd weight and whether it is heavy or light in three weighings, then you may read on. If you haven't found a strategy that always gets there in three weighings, I encourage you to think about exercise 4.1 some more.

Why is your strategy optimal? What is it about your series of weighings that allows useful information to be gained as quickly as possible? The answer is that at each step of an optimal procedure, the three outcomes ('left heavier', 'right heavier', and 'balance') are as close as possible to equiprobable. An optimal solution is shown in figure 4.2.

Suboptimal strategies, such as weighing balls 1-6 against 7-12 on the first step, do not achieve all outcomes with equal probability: these two sets of balls can never balance, so the only possible outcomes are 'left heavy' and 'right heavy'. Such a binary outcome rules out only half of the possible hypotheses,

[Figure 4.2: an optimal solution to the weighing problem. At each step there are two boxes: the left box shows which hypotheses are still possible; the right box shows the balls involved in the next weighing. The 24 hypotheses are written 1+, ..., 12-, with, e.g., 1+ denoting that 1 is the odd ball and it is heavy. Weighings are written by listing the balls on the two pans, separated by a line; for example, in the first weighing, balls 1, 2, 3, and 4 are put on the left-hand side and 5, 6, 7, and 8 on the right. In each triplet of arrows the upper arrow leads to the situation when the left side is heavier, the middle arrow to the situation when the right side is heavier, and the lower arrow to the situation when the outcome is balanced. The three points labelled with a star are impossible outcomes.]

so a strategy that uses such outcomes must sometimes take longer to find the right answer.

The insight that the outcomes should be as near as possible to equiprobable makes it easier to search for an optimal strategy: the first weighing must divide the 24 possible hypotheses into three groups of eight. Then the second weighing must be chosen so that there is a 3:3:2 split of the hypotheses.

Thus we might conclude:

  the outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform.

This conclusion agrees with the property of the entropy that you proved when you solved exercise 2.25 (p.37): the entropy of an ensemble X is biggest if all the outcomes have equal probability p_i = 1/|A_X|.

Guessing games

In the game of twenty questions, one player thinks of an object, and the other player attempts to guess what the object is by asking questions that have yes/no answers, for example, 'is it alive?', or 'is it human?' The aim is to identify the object with as few questions as possible. What is the best strategy for playing this game? For simplicity, imagine that we are playing the rather dull version of twenty questions called 'sixty-three'.

Example 4.3. The game 'sixty-three'. What's the smallest number of yes/no questions needed to identify an integer x between 0 and 63?

Intuitively, the best questions successively divide the 64 possibilities into equal sized sets. Six questions suffice. One reasonable strategy asks the following questions:

  1: is x >= 32?
  2: is x mod 32 >= 16?
  3: is x mod 16 >= 8?
  4: is x mod 8 >= 4?
  5: is x mod 4 >= 2?
  6: is x mod 2 = 1?

[The notation x mod 32, pronounced 'x modulo 32', denotes the remainder when x is divided by 32; for example, 35 mod 32 = 3 and 32 mod 32 = 0.]

The answers to these questions, if translated from {yes, no} to {1, 0}, give the binary expansion of x, for example 35 => 100011.

What are the Shannon information contents of the outcomes in this example? If we assume that all values of x are equally likely, then the answers to the questions are independent and each has Shannon information content log2(1/0.5) = 1 bit; the total Shannon information gained is always six bits. Furthermore, the number x that we learn from these questions is a six-bit binary number. Our questioning strategy defines a way of encoding the random variable x as a binary file.

So far, the Shannon information content makes sense: it measures the length of a binary file that encodes x. However, we have not yet studied ensembles where the outcomes have unequal probabilities. Does the Shannon information content make sense there too?
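The six questions of example 4.3 can be sketched in a few lines (an illustration, not part of the original text); the yes/no answers, read as bits, reproduce the binary expansion of x.

```python
def sixty_three(x):
    """Ask the six questions of example 4.3; return the answers as bits."""
    answers = []
    for m in (64, 32, 16, 8, 4, 2):
        # question: is (x mod m) >= m/2?  yes -> 1, no -> 0
        answers.append(1 if (x % m) >= m // 2 else 0)
    return answers

bits = sixty_three(35)
print(bits)                                # [1, 0, 0, 0, 1, 1], i.e. 100011
print(int(''.join(map(str, bits)), 2))     # 35: the six answers encode x exactly
```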

The game of submarine: how many bits can one bit convey?

In the game of battleships, each player hides a fleet of ships in a sea represented by a square grid. On each turn, one player attempts to hit the other's ships by firing at one square in the opponent's sea. The response to a selected square such as 'G3' is either 'miss', 'hit', or 'hit and destroyed'.

In a boring version of battleships called submarine, each player hides just one submarine in one square of an eight-by-eight grid. Figure 4.3 shows a few pictures of this game in progress: the circle represents the square that is being fired at, the crosses show squares in which the outcome was a miss, x = n; the submarine is hit (outcome x = y) on the 49th attempt.

[Figure 4.3: a game of submarine. The submarine is hit on the 49th attempt.

  move #      1        2        32       48       49
  question    G3       B1       E5       F3       H3
  outcome     x = n    x = n    x = n    x = n    x = y
  P(x)        63/64    62/63    32/33    16/17    1/16
  h(x)        0.0227   0.0230   0.0443   0.0874   4.0
  Total info  0.0227   0.0458   1.0      2.0      6.0]

Each shot made by a player defines an ensemble. The two possible outcomes are {y, n}, corresponding to a hit and a miss, and their probabilities depend on the state of the board. At the beginning, P(y) = 1/64 and P(n) = 63/64. At the second shot, if the first shot missed, P(y) = 1/63 and P(n) = 62/63. At the third shot, if the first two shots missed, P(y) = 1/62 and P(n) = 61/62.

The Shannon information gained from an outcome x is h(x) = log(1/P(x)).
If we are lucky, and hit the submarine on the first shot, then

  h(x) = h(y) = log2 64 = 6 bits.   (4.8)

Now, it might seem a little strange that one binary outcome can convey six bits. But we have learnt the hiding place, which could have been any of 64 squares; so we have, by one lucky binary question, indeed learnt six bits.

What if the first shot misses? The Shannon information that we gain from this outcome is

  h(x) = h(n) = log2 (64/63) = 0.0227 bits.   (4.9)

Does this make sense? It is not so obvious. Let's keep going. If our second shot also misses, the Shannon information content of the second outcome is

  h(n) = log2 (63/62) = 0.0230 bits.   (4.10)

If we miss thirty-two times (firing at a new square each time), the total Shannon information gained is

  log2 (64/63) + log2 (63/62) + ... + log2 (33/32)
    = 0.0227 + 0.0230 + ... + 0.0443 = 1.0 bits.   (4.11)
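The sum in (4.11) telescopes, and is easy to check numerically: after any run of misses that reduces 64 candidate squares to n, the information gained is log2(64/n), and a hit then adds log2 n to complete 6 bits. A sketch (not part of the original text):

```python
from math import log2

def info_from_misses(total, misses):
    """Total Shannon information from `misses` consecutive misses,
    starting with `total` equally likely squares."""
    return sum(log2(n / (n - 1)) for n in range(total, total - misses, -1))

print(info_from_misses(64, 32))              # 1.0 bits: equation (4.11)
print(info_from_misses(64, 48) + log2(16))   # 6.0 bits: 48 misses, then a hit
```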

Why this number? Well, what have we learnt? We now know that the submarine is not in any of the 32 squares we fired at; learning that fact is just like playing a game of sixty-three (p.70), asking as our first question 'is x one of the thirty-two numbers corresponding to these squares I fired at?', and receiving the answer 'no'. This answer rules out half of the hypotheses, so it gives us one bit.

After 48 unsuccessful shots, the information gained is 2 bits: the unknown location has been narrowed down to one quarter of the original hypothesis space.

What if we hit the submarine on the 49th shot, when there were 16 squares left? The Shannon information content of this outcome is

  h(y) = log2 16 = 4.0 bits.   (4.12)

The total Shannon information content of all the outcomes is

  log2 (64/63) + log2 (63/62) + ... + log2 (17/16) + log2 (16/1)
    = 0.0227 + 0.0230 + ... + 0.0874 + 4.0 = 6.0 bits.   (4.13)

So once we know where the submarine is, the total Shannon information content gained is 6 bits.

This result holds regardless of when we hit the submarine. If we hit it when there are n squares left to choose from (n was 16 in equation (4.13)), then the total information gained is:

  log2 (64/63) + log2 (63/62) + ... + log2 ((n+1)/n) + log2 (n/1)
    = log2 (64/n) + log2 (n/1) = log2 64 = 6 bits.   (4.14)

What have we learned from the examples so far? I think the submarine example makes quite a convincing case for the claim that the Shannon information content is a sensible measure of information content. And the sixty-three example shows that the Shannon information content can be intimately
size of a le that enco des the outcomes of a random exp connected eri- to the odrcr 16 384 a possible connection to data compression. t, thus suggesting men . . not In case let's look con at one more example. vinced, you're . . . . language The Wenglish zatnt 32 737 . . sen to English. similar is a language Wenglish of words consist Wenglish tences . 15 zxast 32 768 = 32,768 wn at random from the Wenglish dictionary , whic h con tains 2 dra 5 characters. all words, Eac h word in the Wenglish dictionary was of length y distribution constructed at random by picking ve letters from the probabilit Figure 4.4 . The Wenglish dictionary . ::: a over depicted 2.1. z in gure Some from the dictionary are sho wn in alphab etical order in g- entries words ure that the num ber of Notice in the dictionary (32,768) is 4.4. much smaller than the total num ber of possible words of length 5 letters, 5 26 12,000,000. ' the of the y of the letter z is about 1 = 1000, only 32 Because probabilit z words begin with the letter dictionary . In con trast, the probabilit y in the of the letter a is about 0 : 0625, and 2048 of the words begin with the letter a . , and Of words, two start az 2048 128 start aa . those Let's imagine that we are reading a Wenglish documen t, and let's discuss the Shannon information con ten t of the characters as we acquire them. If we

are given the text one word at a time, the Shannon information content of each five-character word is \log_2 32,768 = 15 bits, since Wenglish uses all its words with equal probability. The average information content per character is therefore 3 bits.

Now let's look at the information content if we read the document one character at a time. If, say, the first letter of a word is a, the Shannon information content is \log_2 (1/0.0625) ≃ 4 bits. If the first letter is z, the Shannon information content is \log_2 (1/0.001) ≃ 10 bits. The information content is thus highly variable at the first character. The total information content of the 5 characters in a word, however, is exactly 15 bits; so the letters that follow an initial z have lower average information content per character than the letters that follow an initial a. A rare initial letter such as z indeed conveys more information about what the word is than a common initial letter.

Similarly, in English, if rare characters occur at the start of a word (e.g. xyl...), then often we can identify the whole word immediately; whereas words that start with common characters (e.g. pro...) require more characters before we can identify them.

4.2 Data compression

The preceding examples justify the idea that the Shannon information content of an outcome is a natural measure of its information content. Improbable outcomes do convey more information than probable outcomes. We now discuss the information content of a source by considering how many bits are needed to describe the outcome of an experiment.
If we can show that we can compress data from a particular source into a file of L bits per source symbol and recover the data reliably, then we will say that the average information content of that source is at most L bits per symbol.

Example: compression of text files

A file is composed of a sequence of bytes. A byte is composed of 8 bits and can have a decimal value between 0 and 255. A typical text file is composed of characters from the ASCII character set (decimal values 0 to 127). This character set uses only seven of the eight bits in a byte. [Here we use the word 'bit' with its meaning, 'a symbol with two values', not to be confused with the unit of information content.]

Exercise 4.4.[1, p.86] By how much could the size of a file be reduced given that it is an ASCII file? How would you achieve this reduction?

Intuitively, it seems reasonable to assert that an ASCII file contains 7/8 as much information as an arbitrary file of the same size, since we already know one out of every eight bits before we even look at the file. This is a simple example of redundancy. Most sources of data have further redundancy: English text files use the ASCII characters with non-equal frequency; certain pairs of letters are more probable than others; and entire words can be predicted given the context and a semantic understanding of the text.

Some simple data compression methods that define measures of information content

One way of measuring the information content of a random variable is simply to count the number of possible outcomes, |A_X|. (The number of elements in a set A is denoted by |A|.) If we gave a binary name to each outcome,

the length of each name would be \log_2 |A_X| bits, if |A_X| happened to be a power of 2. We thus make the following definition.

The raw bit content of X is

    H_0(X) = \log_2 |A_X|.    (4.15)

H_0(X) is a lower bound for the number of binary questions that are always guaranteed to identify an outcome from the ensemble X. It is an additive quantity: the raw bit content of an ordered pair x, y, having |A_X||A_Y| possible outcomes, satisfies

    H_0(X, Y) = H_0(X) + H_0(Y).    (4.16)

This measure of information content does not include any probabilistic element, and the encoding rule it corresponds to does not 'compress' the source data, it simply maps each outcome to a constant-length binary string.

Exercise 4.5.[2, p.86] Could there be a compressor that maps an outcome x to a binary code c(x), and a decompressor that maps c back to x, such that every possible outcome is compressed into a binary code of length shorter than H_0(X) bits?

Even though a simple counting argument shows that it is impossible to make a reversible compression program that reduces the size of all files, amateur compression enthusiasts frequently announce that they have invented a program that can do this, indeed that they can further compress compressed files by putting them through their compressor several times. Stranger yet, patents have been granted to these modern-day alchemists. See the comp.compression frequently asked questions for further reading.^1

There are only two ways in which a 'compressor' can actually compress files:

1. A lossy compressor compresses some files, but maps some files to the same encoding.
We'll assume that the user requires perfect recovery of the source file, so the occurrence of one of these confusable files leads to a failure (though in applications such as image compression, lossy compression is viewed as satisfactory). We'll denote by δ the probability that the source string is one of the confusable files, so a lossy compressor has a probability δ of failure. If δ can be made very small then a lossy compressor may be practically useful.

2. A lossless compressor maps all files to different encodings; if it shortens some files, it necessarily makes others longer. We try to design the compressor so that the probability that a file is lengthened is very small, and the probability that it is shortened is large.

In this chapter we discuss a simple lossy compressor. In subsequent chapters we discuss lossless compression methods.

4.3 Information content defined in terms of lossy compression

Whichever type of compressor we construct, we need somehow to take into account the probabilities of the different outcomes. Imagine comparing the information contents of two text files, one in which all 128 ASCII characters

^1 http://sunsite.org.uk/public/usenet/news-faqs/comp.compression/

are used with equal probability, and one in which the characters are used with their frequencies in English text. Can we define a measure of information content that distinguishes between these two files? Intuitively, the latter file contains less information per character because it is more predictable.

One simple way to use our knowledge that some symbols have a smaller probability is to imagine recoding the observations into a smaller alphabet, thus losing the ability to encode some of the more improbable symbols, and then measuring the raw bit content of the new alphabet. For example, we might take a risk when compressing English text, guessing that the most infrequent characters won't occur, and make a reduced ASCII code that omits the characters { !, @, #, %, ^, *, ~, <, >, /, \, _, {, }, [, ], | }, thereby reducing the size of the alphabet by seventeen. The larger the risk we are willing to take, the smaller our final alphabet becomes.

We introduce a parameter δ that describes the risk we are taking when using this compression method: δ is the probability that there will be no name for an outcome x.

Example 4.6. Let

    A_X = { a, b, c, d, e, f, g, h },    (4.17)

and

    P_X = { 1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64 }.

The raw bit content of this ensemble is 3 bits, corresponding to 8 binary names. But notice that P(x ∈ {a, b, c, d}) = 15/16. So if we are willing
to run a risk of δ = 1/16 of not having a name for x, then we can get by with four names, half as many names as are needed if every x ∈ A_X has a name.

Table 4.5 shows binary names that could be given to the different outcomes in the two cases δ = 0 and δ = 1/16. When δ = 0 we need 3 bits to encode the outcome; when δ = 1/16 we need only 2 bits.

    [Table 4.5. Binary names for the outcomes, for the cases δ = 0 and δ = 1/16.
        δ = 0:     a 000, b 001, c 010, d 011, e 100, f 101, g 110, h 111
        δ = 1/16:  a 00,  b 01,  c 10,  d 11]

Let us now formalize this idea. To make a compression strategy with risk δ, we make the smallest possible subset S_δ such that the probability that x is not in S_δ is less than or equal to δ, i.e., P(x ∉ S_δ) ≤ δ. For each value of δ we can then define a new measure of information content, the log of the size of this smallest subset S_δ. [In ensembles in which several elements have the same probability, there may be several smallest subsets that contain different elements, but all that matters is their sizes (which are equal), so we will not dwell on this ambiguity.]

The smallest δ-sufficient subset S_δ is the smallest subset of A_X satisfying

    P(x ∈ S_δ) ≥ 1 − δ.    (4.18)

The subset S_δ can be constructed by ranking the elements of A_X in order of decreasing probability and adding successive elements starting from the most probable elements until the total probability is ≥ (1 − δ).

We can make a data compression code by assigning a binary name to each element of the smallest sufficient subset. This compression scheme motivates the following measure of information content:

The essential bit content of X is:

    H_δ(X) = \log_2 |S_δ|.    (4.19)

Note that H_0(X) is the special case of H_δ(X) with δ = 0 (if P(x) > 0 for all x ∈ A_X). [Caution: do not confuse H_0(X) and H_δ(X) with the function H_2(p) displayed in figure 4.1.]

Figure 4.6 shows H_δ(X) for the ensemble of example 4.6 as a function of δ.
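The greedy construction in (4.18) translates directly into a few lines of code. The following sketch is ours, not the book's (the name essential_bit_content is our own); applied to the ensemble of example 4.6 it reproduces H_0(X) = 3 bits and H_{1/16}(X) = 2 bits:

```python
from math import log2

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|: rank outcomes by decreasing probability
    and add them until the accumulated mass reaches 1 - delta (eq. 4.18)."""
    total, size = 0.0, 0
    for p in sorted(probs, reverse=True):
        if total >= 1 - delta:
            break
        total += p
        size += 1
    return log2(size)

p_X = [1/4, 1/4, 1/4, 3/16] + [1/64] * 4   # the ensemble of example 4.6
print(essential_bit_content(p_X, 0))        # 3.0 bits
print(essential_bit_content(p_X, 1/16))     # 2.0 bits
```

With δ = 1/16, the four most probable outcomes carry exactly 15/16 of the mass, so four names suffice.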

[Figure 4.6. (a) The outcomes of X (from example 4.6 (p.75)), ranked by their probability. (b) The essential bit content H_δ(X). The labels on the graph show the smallest sufficient set as a function of δ. Note H_0(X) = 3 bits and H_{1/16}(X) = 2 bits.]

Extended ensembles

Is this compression method any more useful if we compress blocks of symbols from a source?

We now turn to examples where the outcome x = (x_1, x_2, ..., x_N) is a string of N independent identically distributed random variables from a single ensemble X. We will denote by X^N the ensemble (X_1, X_2, ..., X_N). Remember that entropy is additive for independent variables (exercise 4.2 (p.68)), so H(X^N) = N H(X).

Example 4.7. Consider a string of N flips of a bent coin, x = (x_1, x_2, ..., x_N), where x_n ∈ {0, 1}, with probabilities p_0 = 0.9, p_1 = 0.1. The most probable strings x are those with most 0s. If r(x) is the number of 1s in x then

    P(x) = p_0^{N − r(x)} p_1^{r(x)}.    (4.20)

To evaluate H_δ(X^N) we must find the smallest sufficient subset S_δ. This subset will contain all x with r(x) = 0, 1, 2, ..., up to some r_max(δ) − 1, and some of the x with r(x) = r_max(δ). Figures 4.7 and 4.8 show graphs of H_δ(X^N) against δ for the cases N = 4 and N = 10. The steps are the values of δ at which |S_δ| changes by 1, and the cusps where the slope of the staircase changes are the points where r_max changes by 1.
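The staircases of figures 4.7 and 4.8 can be recomputed by brute force: enumerate the bent-coin string probabilities via the binomial counts and fill the smallest sufficient subset greedily. This sketch is ours (the name H_delta is not the book's), and is feasible only for small N since it enumerates all 2^N string probabilities:

```python
from math import log2, comb

def H_delta(N, delta, p1=0.1):
    """H_delta(X^N) for the bent coin of example 4.7, by brute force."""
    # Each string with r ones has probability p1^r (1-p1)^(N-r), eq. (4.20),
    # and there are C(N, r) such strings.
    probs = []
    for r in range(N + 1):
        probs += [p1**r * (1 - p1)**(N - r)] * comb(N, r)
    probs.sort(reverse=True)
    # Greedily build the smallest delta-sufficient subset, eq. (4.18).
    total, size = 0.0, 0
    for p in probs:
        if total >= 1 - delta:
            break
        total += p
        size += 1
    return log2(size)

print(H_delta(10, 0.01))   # a bit below the raw bit content of 10 bits
print(H_delta(10, 0.5))    # much smaller for a larger tolerated risk
```

Varying delta over a fine grid and plotting the results reproduces the staircase of figure 4.8.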
Exercise 4.8.[2, p.86] What are the mathematical shapes of the curves between the cusps?

For the examples shown in figures 4.6-4.8, H_δ(X^N) depends strongly on the value of δ, so it might not seem a fundamental or useful definition of information content. But we will consider what happens as N, the number of independent variables in X^N, increases. We will find the remarkable result that H_δ(X^N) becomes almost independent of δ, and for all δ it is very close to N H(X), where H(X) is the entropy of one of the random variables.

Figure 4.9 illustrates this asymptotic tendency for the binary ensemble of example 4.7. As N increases, (1/N) H_δ(X^N) becomes an increasingly flat function,

[Figure 4.7. (a) The sixteen outcomes of the ensemble X^4 with p_1 = 0.1, ranked by probability. (b) The essential bit content H_δ(X^4). The upper schematic diagram indicates the strings' probabilities by the vertical lines' lengths (not to scale).]

[Figure 4.8. H_δ(X^N) for N = 10 binary variables with p_1 = 0.1.]

[Figure 4.9. (1/N) H_δ(X^N) for N = 10, 210, 410, 610, 810, 1010 binary variables with p_1 = 0.1.]

except for tails close to δ = 0 and 1. As long as we are allowed a tiny probability of error δ, compression down to N H bits is possible. Even if we are allowed a large probability of error, we still can compress only down to N H bits. This is the source coding theorem.

Theorem 4.1. Shannon's source coding theorem. Let X be an ensemble with entropy H(X) = H bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer N_0 such that for N > N_0,

    | (1/N) H_δ(X^N) − H | < ε.    (4.21)

4.4 Typicality

Why does increasing N help? Let's examine long strings from X^N.

[Figure 4.10. The top 15 strings are samples from X^100, where p_1 = 0.1 and p_0 = 0.9. The bottom two are the most and least probable strings in this ensemble. The final column shows the log-probabilities of the random strings, which may be compared with the entropy H(X^100) = 46.9 bits.]
Table 4.10 shows fifteen samples from X^N for N = 100 and p_1 = 0.1. The probability of a string x that contains r 1s and N − r 0s is

    P(x) = p_1^r (1 − p_1)^{N − r}.    (4.22)

The number of strings that contain r 1s is

    n(r) = \binom{N}{r}.    (4.23)

So the number of 1s, r, has a binomial distribution:

    P(r) = \binom{N}{r} p_1^r (1 − p_1)^{N − r}.    (4.24)

These functions are shown in figure 4.11. The mean of r is N p_1, and its standard deviation is \sqrt{N p_1 (1 − p_1)} (p.1). If N is 100 then

    r ∼ N p_1 ± \sqrt{N p_1 (1 − p_1)} ≃ 10 ± 3.    (4.25)
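Equations (4.25) and (4.26) in numbers, as a quick check (this snippet is ours; note that for N = 1000 the book rounds √90 ≈ 9.49 up to 10):

```python
from math import sqrt

# The mean of r grows like N while its standard deviation grows only like
# sqrt(N), so the fraction of 1s, r/N, concentrates as N increases.
p1 = 0.1
for N in (100, 1000):
    mean = N * p1
    sd = sqrt(N * p1 * (1 - p1))
    print(N, mean, round(sd, 2), round(sd / mean, 3))  # relative spread shrinks
```

The last column, sd/mean, shrinks like 1/√N: that shrinking relative spread is the concentration the next paragraph describes.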

[Figure 4.11. Anatomy of the typical set T. For p_1 = 0.1 and N = 100 and N = 1000, these graphs show n(r), the number of strings containing r 1s; the probability P(x) of a single string that contains r 1s; the same probability on a log scale; and the total probability n(r) P(x) of all strings that contain r 1s. The number r is on the horizontal axis. The plot of \log_2 P(x) also shows by a dotted line the mean value of \log_2 P(x) = −N H_2(p_1), which equals −46.9 when N = 100 and −469 when N = 1000. The typical set includes only the strings that have \log_2 P(x) close to this value. The range marked T shows the set T_{Nβ} (as defined in section 4.4) for N = 100, β = 0.29 (left) and N = 1000, β = 0.09 (right).]

If N = 1000 then

    r ∼ 100 ± 10.    (4.26)

Notice that as N gets bigger, the probability distribution of r becomes more concentrated, in the sense that while the range of possible values of r grows as N, the standard deviation of r grows only as \sqrt{N}. That r is most likely to fall in a small range of values implies that the outcome x is also most likely to fall in a corresponding small subset of outcomes that we will call the typical set.

Definition of the typical set

Let us define typicality for an arbitrary ensemble X with alphabet A_X. Our definition of a typical string will involve the string's probability. A long string of N symbols will usually contain about p_1 N occurrences of the first symbol, p_2 N occurrences of the second, etc. Hence the probability of this string is roughly

    P(x)_typ = P(x_1) P(x_2) P(x_3) ... P(x_N) ≃ p_1^{p_1 N} p_2^{p_2 N} ... p_I^{p_I N}    (4.27)

so that the information content of a typical string is

    \log_2 (1/P(x)) ≃ N Σ_i p_i \log_2 (1/p_i) = N H.    (4.28)

So the random variable \log_2 (1/P(x)), which is the information content of x, is very likely to be close in value to N H. We build our definition of typicality on this observation.

We define the typical elements of A_X^N to be those elements that have probability close to 2^{−NH}. (Note that the typical set, unlike the smallest sufficient subset, does not include the most probable elements of A_X^N, but we will show that these most probable elements contribute negligible probability.)

We introduce a parameter β that defines how close the probability has to be to 2^{−NH}
for an element to be 'typical'. We call the set of typical elements the typical set, T_{Nβ}:

    T_{Nβ} ≡ { x ∈ A_X^N : | (1/N) \log_2 (1/P(x)) − H | < β }.    (4.29)

We will show that whatever value of β we choose, the typical set contains almost all the probability as N increases.

This important result is sometimes called the 'asymptotic equipartition' principle.

'Asymptotic equipartition' principle. For an ensemble of N independent identically distributed (i.i.d.) random variables X^N ≡ (X_1, X_2, ..., X_N), with N sufficiently large, the outcome x = (x_1, x_2, ..., x_N) is almost certain to belong to a subset of A_X^N having only 2^{N H(X)} members, each having probability 'close to' 2^{−N H(X)}.

Notice that if H(X) < H_0(X) then 2^{N H(X)} is a tiny fraction of the number of possible outcomes |A_X^N| = |A_X|^N = 2^{N H_0(X)}.

The term equipartition is chosen to describe the idea that the members of the typical set have roughly equal probability. [This should not be taken too literally, hence my use of quotes around 'asymptotic equipartition'; see page 83.]

A second meaning for equipartition, in thermal physics, is the idea that each degree of freedom of a classical system has equal average energy, (1/2) kT. This second meaning is not intended here.
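For the bent coin, the probability inside T_{Nβ} can be computed exactly by summing over r with the binomial counts rather than over individual strings. The sketch below is ours (the name typical_mass is not the book's); it uses the parameters of the ranges marked T in figure 4.11:

```python
from math import log2, comb

def typical_mass(N, beta, p1=0.1):
    """P(x in T_{N,beta}) for the bent coin: every string with r ones has the
    same information content, so sum binomial shells that satisfy (4.29)."""
    H = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))  # entropy per symbol
    mass = 0.0
    for r in range(N + 1):
        info_per_symbol = -(r * log2(p1) + (N - r) * log2(1 - p1)) / N
        if abs(info_per_symbol - H) < beta:
            mass += comb(N, r) * p1**r * (1 - p1)**(N - r)
    return mass

print(typical_mass(100, 0.29))   # already close to 1
print(typical_mass(1000, 0.09))  # close to 1 with a much tighter beta
```

Even with the much smaller β used at N = 1000, the typical set still captures nearly all of the probability, which is the concentration the 'asymptotic equipartition' principle asserts.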

[Figure 4.12. Schematic diagram showing all strings in the ensemble X^N ranked by their probability, and the typical set T_{Nβ}.]

The 'asymptotic equipartition' principle is equivalent to:

Shannon's source coding theorem (verbal statement). N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.

These two theorems are equivalent because we can define a compression algorithm that gives a distinct name of length N H(X) bits to each x in the typical set.

4.5 Proofs

This section may be skipped if found tough going.

The law of large numbers

Our proof of the source coding theorem uses the law of large numbers.

Mean and variance of a real random variable are E[u] = ū = Σ_u P(u) u and var(u) = σ_u^2 = E[(u − ū)^2] = Σ_u P(u) (u − ū)^2.

    Technical note: strictly I am assuming here that u is a function u(x) of a sample x from a finite discrete ensemble X. Then the summations Σ_u P(u) f(u) should be written Σ_x P(x) f(u(x)). This means that P(u) is a finite sum of delta functions. This restriction guarantees that the mean and variance of u do exist, which is not necessarily the case for general P(u).

Chebyshev's inequality 1. Let t be a non-negative real random variable, and let α be a positive real number. Then

    P(t ≥ α) ≤ t̄ / α.    (4.30)

Proof: P(t ≥ α) = Σ_{t ≥ α} P(t). We multiply each term by t/α ≥ 1 and obtain: P(t ≥ α) ≤ Σ_{t ≥ α} P(t) t/α. We add the (non-negative) missing terms and obtain: P(t ≥ α) ≤ Σ_t P(t) t/α = t̄/α. □
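Inequality (4.30) can be sanity-checked on any small discrete distribution. The distribution below is a made-up example of ours, not from the book:

```python
# A quick numerical check of Chebyshev's inequality 1, eq. (4.30):
# P(t >= alpha) <= mean(t) / alpha for a non-negative random variable t.
probs = {0: 0.5, 1: 0.2, 4: 0.2, 25: 0.1}       # a hypothetical P(t)
mean = sum(t * p for t, p in probs.items())      # t-bar = 3.5

for alpha in (1, 4, 10, 25):
    tail = sum(p for t, p in probs.items() if t >= alpha)
    print(alpha, tail, "<=", mean / alpha)       # the bound always holds
```

Note how loose the bound typically is (e.g. at α = 1 it gives 3.5, which says nothing); its value is that it holds for every distribution, which is all the proof of the source coding theorem needs.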

Chebyshev's inequality 2. Let x be a random variable, and let α be a positive real number. Then

    P( (x − x̄)^2 ≥ α ) ≤ σ_x^2 / α.    (4.31)

Proof: Take t = (x − x̄)^2 and apply the previous proposition. □

Weak law of large numbers. Take x to be the average of N independent random variables h_1, ..., h_N, having common mean h̄ and common variance σ_h^2: x = (1/N) Σ_{n=1}^N h_n. Then

    P( (x − h̄)^2 ≥ α ) ≤ σ_h^2 / (α N).    (4.32)

Proof: obtained by showing that x̄ = h̄ and that σ_x^2 = σ_h^2 / N. □

We are interested in x being very close to the mean (α very small). No matter how large σ_h^2 is, and no matter how small the required α is, and no matter how small the desired probability that (x − h̄)^2 ≥ α, we can always achieve it by taking N large enough.

Proof of theorem 4.1 (p.78)

We apply the law of large numbers to the random variable (1/N) \log_2 (1/P(x)) defined for x drawn from the ensemble X^N. This random variable can be written as the average of N information contents h_n = \log_2 (1/P(x_n)), each of which is a random variable with mean H = H(X) and variance σ^2 ≡ var[\log_2 (1/P(x_n))]. (Each term h_n is the Shannon information content of the nth outcome.)

We again define the typical set with parameters N and β thus:

    T_{Nβ} = { x ∈ A_X^N : [ (1/N) \log_2 (1/P(x)) − H ]^2 < β^2 }.    (4.33)

For all x ∈ T_{Nβ}, the probability of x satisfies

    2^{−N(H+β)} < P(x) < 2^{−N(H−β)}.    (4.34)

And by the law of large numbers,

    P(x ∈ T_{Nβ}) ≥ 1 − σ^2 / (β^2 N).    (4.35)

We have thus proved the 'asymptotic equipartition' principle. As N increases, the probability that x falls in T_{Nβ} approaches 1, for any β. How does this result relate to source coding?

We must relate T_{Nβ} to H_δ(X^N). We will show that for any given δ there is a sufficiently big N such that H_δ(X^N) ≃ N H.

Part 1: (1/N) H_δ(X^N) < H + ε.

The set T_{Nβ} is not the best subset for compression. So the size of T_{Nβ} gives an upper bound on H_δ. We show how small H_δ(X^N) must be by calculating how big T_{Nβ} could possibly be. We are free to set β to any convenient value. The smallest possible probability that a member of T_{Nβ} can have is 2^{−N(H+β)}, and the total probability contained by T_{Nβ} can't be any bigger than 1. So

    |T_{Nβ}| 2^{−N(H+β)} < 1,    (4.36)

that is, the size of the typical set is bounded by

    |T_{Nβ}| < 2^{N(H+β)}.    (4.37)

If we set β = ε and N_0 such that σ^2 / (ε^2 N_0) ≤ δ, then P(T_{Nβ}) ≥ 1 − δ, and the set T_{Nβ} becomes a witness to the fact that H_δ(X^N) ≤ \log_2 |T_{Nβ}| < N(H + ε).

[Figure 4.13. Schematic illustration of the two parts of the theorem. Given any δ and ε, we show that for large enough N, (1/N) H_δ(X^N) lies (1) below the line H + ε and (2) above the line H − ε.]
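The counting bound (4.37) can be confirmed exactly for the bent coin, because there the typical set is a union of constant-r shells of size \binom{N}{r}. The sketch below is ours (the name typical_set_size is not the book's):

```python
from math import log2, comb

def typical_set_size(N, beta, p1=0.1):
    """Exact |T_{N,beta}| for the bent coin, counting whole binomial shells
    whose per-symbol information content satisfies (4.29). Returns (size, H)."""
    H = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))
    size = 0
    for r in range(N + 1):
        info = -(r * log2(p1) + (N - r) * log2(1 - p1)) / N
        if abs(info - H) < beta:
            size += comb(N, r)
    return size, H

for N in (100, 1000):
    size, H = typical_set_size(N, 0.2)
    print(N, round(log2(size), 1), "<", round(N * (H + 0.2), 1))  # bound (4.37)
```

The bound must hold by the argument above: every member of T_{Nβ} has probability exceeding 2^{−N(H+β)}, and the total probability cannot exceed 1.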

Part 2: (1/N) H_δ(X^N) > H − ε.

Imagine that someone claims this second part is not so, that, for any N, the smallest δ-sufficient subset S_δ is smaller than the above inequality would allow. We can make use of our typical set to show that they must be mistaken.

Remember that we are free to set β to any value we choose. We will set β = ε/2, so that our task is to prove that a subset S′ having |S′| ≤ 2^{N(H − 2β)} and achieving P(x ∈ S′) ≥ 1 − δ cannot exist (for N greater than an N_0 that we will specify).

So, let us consider the probability of falling in this rival smaller subset S′. The probability of the subset S′ is

    P(x ∈ S′) = P(x ∈ S′ ∩ T_{Nβ}) + P(x ∈ S′ ∩ T̄_{Nβ}),    (4.38)

where T̄_{Nβ} denotes the complement { x ∉ T_{Nβ} }. The maximum value of the first term is found if S′ ∩ T_{Nβ} contains 2^{N(H − 2β)} outcomes all with the maximum probability, 2^{−N(H − β)}. The maximum value the second term can have is P(x ∉ T_{Nβ}). So:

    P(x ∈ S′) ≤ 2^{N(H − 2β)} 2^{−N(H − β)} + σ^2 / (β^2 N) = 2^{−Nβ} + σ^2 / (β^2 N).    (4.39)

We can now set β = ε/2 and N_0 such that P(x ∈ S′) < 1 − δ, which shows that S′ cannot satisfy the definition of a sufficient subset S_δ. Thus any subset S′ with size |S′| ≤ 2^{N(H − ε)} has probability less than 1 − δ, so by the definition of H_δ, H_δ(X^N) > N(H − ε).

Thus for large enough N, the function (1/N) H_δ(X^N) is essentially a constant function of δ, for 0 < δ < 1, as illustrated in figures 4.9 and 4.13. □
Both results are N rst part tells us that even if the probabilit y of error is extremely The 1 N ( H a long X ) needed to specify the small, per sym bol ber of bits num N -sym bol string x with vanishingly small error probabilit y does not have to N H to have only bits. We need exceed a tiny tolerance for error, and the + ber of bits drops signi can tly from H num ( X ) to ( H + required ). 0 What ens if we are yet more happ t to compression errors? Part 2 toleran tells us that even if is very close to 1, so that errors are made most of the time, the num ber of bits per sym bol needed to specify x must still be average H bits. These two extremes tell us that regardless of our speci c at least bol needed wance the num ber of bits per sym error, to specify x is H allo for no more and no less. bits; at regarding Cave equip artition ' `asymptotic the t `asymptotic equipartition' in quotes because it is imp ortan I put words set to think the elemen ts of the typical not T have roughly really do that N the same probabilit y as eac h other. They are similar in probabilit y only in 1 Now, as h other. of eac are within 2 N the their sense values of log that 2 x ( P ) is decreased, how does N have to increase, if we are to keep our bound on 2 w gro must N , constan t? 2 T the ) 1 mass of the typical set, P ( x 2 N N p 2 , then , so, if we write in terms as 1 N as = = N , for some constan t of

96 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 4 | The Source Theorem 84 Coding p N times greater probable 2 the typical set will be of order string most in the probable decreases, in the typical set. As least N increases, than the string p N 2 and this ratio onen exp . Thus we have `equipartition' only in ws gro tially a weak sense! we intr oduc e the typic al set? Why did choice The for blo ck compression is (by de nition) S best , not a of subset set. we bother introducing the typical why did The answ er is, typical So set? the typic al set . We kno w that all we elemen ts have `almost iden- can count its N H probabilit ), and we kno w the y (2 set has probabilit y almost tical' whole N H set must have roughly 2 Without elemen ts. 1, so the typical help of the the set (whic h is very similar to S typical ) it would have been hard to coun t how man are in S y elemen . ts there 4.7 Exercises Weighing problems [ 1 ] Exercise While some 4.9. when they rst people, ter the weighing . encoun with balls and the 12 balance (exercise 4.1 problem three-outcome ), think that weighing six balls against six balls is a good rst (p.66) others weighing, weighing six against six con veys no informa- say `no, are at all'. second group why they to the both righ t and tion Explain Compute the information gained about which is the odd wrong. , ball and information gained about which is the odd ball and the it is whether heavy or light . [ 2 ] are . Exercise Solv e the weighing problem for the case where there 4.10. 39 balls of whic is kno wn to be odd. h one 2 ] [ Exercise You are given 16 balls, all of whic h are equal in weigh t except . 4.11. are that vier or ligh ter. 
You are also given a bizarre two-pan balance that can report only two outcomes: 'the two sides balance' or 'the two sides do not balance'. Design a strategy to determine which is the odd ball in as few uses of the balance as possible.

Exercise 4.12.[2] You have a two-pan balance; your job is to weigh out bags of flour with integer weights 1 to 40 pounds inclusive. How many weights do you need? [You are allowed to put weights on either pan. You're only allowed to put one flour bag on the balance at a time.]

Exercise 4.13.[4, p.86] (a) Is it possible to solve exercise 4.1 (p.66) (the weighing problem with 12 balls and the three-outcome balance) using a sequence of three fixed weighings, such that the balls chosen for the second weighing do not depend on the outcome of the first, and the third weighing does not depend on the first or second?

(b) Find a solution to the general N-ball weighing problem in which exactly one of N balls is odd. Show that in W weighings, an odd ball can be identified from among N = (3^W − 3)/2 balls.

Exercise 4.14.[3] You are given 12 balls and the three-outcome balance of exercise 4.1; this time, two of the balls are odd; each odd ball may be heavy or light, and we don't know which. We want to identify the odd balls and in which direction they are odd.

(a) Estimate how many weighings are required by the optimal strategy. And what if there are three odd balls?

(b) How do your answers change if it is known that all the regular balls weigh 100 g, that light balls weigh 99 g, and heavy ones weigh 110 g?

Source coding with a lossy compressor, with loss δ

Exercise 4.15.[2, p.87] Let P_X = {0.2, 0.8}. Sketch (1/N) H_δ(X^N) as a function of δ for N = 1, 2 and 1000.

Exercise 4.16.[2] Let P_Y = {0.5, 0.5}. Sketch (1/N) H_δ(Y^N) as a function of δ for N = 1, 2, 3 and 100.

Exercise 4.17.[2, p.87] (For physics students.) Discuss the relationship between the proof of the 'asymptotic equipartition' principle and the equivalence (for large systems) of the Boltzmann entropy and the Gibbs entropy.

Distributions that don't obey the law of large numbers

The law of large numbers, which we used in this chapter, shows that the mean of a set of N i.i.d. random variables has a probability distribution that becomes narrower, with width proportional to 1/√N, as N increases. However, we have proved this property only for discrete random variables, that is, for real numbers taking on a finite set of possible values. While many random variables with continuous probability distributions also satisfy the law of large numbers, there are important distributions that do not. Some continuous distributions do not have a mean or variance.

Exercise 4.18.[3, p.88] Sketch the Cauchy distribution

P(x) = (1/Z) · 1/(x² + 1),  x ∈ (−∞, ∞).   (4.40)

What is its normalizing constant Z? Can you evaluate its mean or variance?

Consider the sum z = x₁ + x₂, where x₁ and x₂ are independent random variables from a Cauchy distribution. What is P(z)?
What is the probability distribution of the mean of x₁ and x₂, x̄ = (x₁ + x₂)/2? What is the probability distribution of the mean of N samples from this Cauchy distribution?

Other asymptotic properties

Exercise 4.19.[3] Chernoff bound. We derived the weak law of large numbers from Chebyshev's inequality (4.30) by letting the random variable t in the inequality P(t ≥ α) ≤ t̄/α be a function, t = (x − x̄)², of the random variable x we were interested in.

Other useful inequalities can be obtained by using other functions. The Chernoff bound, which is useful for bounding the tails of a distribution, is obtained by letting t = exp(sx).

Show that

P(x ≥ a) ≤ e^{−sa} g(s),  for any s > 0   (4.41)

and

P(x ≤ a) ≤ e^{−sa} g(s),  for any s < 0   (4.42)

where g(s) is the moment-generating function of x,

g(s) = Σ_x P(x) e^{sx}.   (4.43)

Curious functions related to p log 1/p

Exercise 4.20.[4, p.89] This exercise has no purpose at all; it's included for the enjoyment of those who like mathematical curiosities.

Sketch the function

f(x) = x^{x^{x^{x^{...}}}}.   (4.44)

Hint: Work out the inverse function to f, that is, the function g(y) such that if x = g(y) then y = f(x); it's closely related to p log 1/p.

4.8 Solutions

Solution to exercise 4.2 (p.68). Let P(x, y) = P(x)P(y). Then

H(X, Y) = Σ_{xy} P(x)P(y) log [1/(P(x)P(y))]   (4.45)
        = Σ_{xy} P(x)P(y) log [1/P(x)] + Σ_{xy} P(x)P(y) log [1/P(y)]   (4.46)
        = Σ_x P(x) log [1/P(x)] + Σ_y P(y) log [1/P(y)]   (4.47)
        = H(X) + H(Y).   (4.48)

Solution to exercise 4.4 (p.73). An ASCII file can be reduced in size by a factor of 7/8. This reduction could be achieved by a block code that maps 8-byte blocks into 7-byte blocks by copying the 56 information-carrying bits into 7 bytes, and ignoring the last bit of every character.

Solution to exercise 4.5 (p.74). The pigeon-hole principle states: you can't put 16 pigeons into 15 holes without using one of the holes twice. Similarly, you can't give A_X outcomes unique binary names of some length l shorter than log₂|A_X| bits, because there are only 2^l such binary names, and l < log₂|A_X| implies 2^l < |A_X|, so at least two different inputs to the compressor would compress to the same output file.

Solution to exercise 4.8 (p.76). Between the cusps, all the changes in probability are equal, and the number of elements in T changes by one at each step.
So H_δ varies logarithmically with (1 − δ).

Solution to exercise 4.13 (p.84). This solution was found by Dyson and Lyness in 1946 and presented in the following elegant form by John Conway in 1999. Be warned: the symbols A, B, and C are used to name the balls, to name the pans of the balance, to name the outcomes, and to name the possible states of the odd ball!

(a) Label the 12 balls by the sequences

AAB ABA ABB ABC BBC BCA BCB BCC CAA CAB CAC CCA

and in the

1st weighing put AAB CAA CAB CAC in pan A and ABA ABB ABC BBC in pan B;
2nd weighing put AAB ABA ABB ABC in pan A and BBC BCA BCB BCC in pan B;
3rd weighing put ABA BCA CAA CCA in pan A and AAB ABB BCB CAB in pan B.

Now in a given weighing, a pan will either end up in the Canonical position (C) that it assumes when the pans are balanced, or Above that position (A), or Below it (B), so the three weighings determine for each pan a sequence of three of the letters A, B, C.

If both sequences are CCC, then there's no odd ball. Otherwise, for just one of the two pans, the sequence is among the 12 above, and names the odd ball, whose weight is Above or Below the proper one according as the pan is A or B.

(b) In W weighings the odd ball can be identified from among

N = (3^W − 3)/2   (4.49)

balls in the same way, by labelling them with all the non-constant sequences of W letters from A, B, C whose first change is A-to-B or B-to-C or C-to-A, and at the w-th weighing putting those whose w-th letter is A in pan A and those whose w-th letter is B in pan B.

Solution to exercise 4.15 (p.85). The curves (1/N) H_δ(X^N) as a function of δ for N = 1, 2 and 1000 are shown in figure 4.14. Note that H_2(0.2) = 0.72 bits.

[Figure 4.14. (1/N) H_δ(X) (vertical axis) against δ (horizontal axis), for N = 1, 2 and 1000 binary variables; the accompanying tables give the step values of H_δ(X) and (1/2) H_δ(X²).]

Solution to exercise 4.17 (p.85). The Gibbs entropy is k_B Σ_i p_i ln(1/p_i), where i runs over all states of the system. This entropy is equivalent (apart from the factor k_B
) to the Shannon entropy of the ensemble.

Whereas the Gibbs entropy can be defined for any ensemble, the Boltzmann entropy is only defined for microcanonical ensembles, which have a probability distribution that is uniform over a set of accessible states. The Boltzmann entropy is defined to be S_B = k_B ln Ω, where Ω is the number of accessible states of the microcanonical ensemble. This is equivalent (apart from the factor k_B) to the perfect information content H₀ of that constrained ensemble. The Gibbs entropy of a microcanonical ensemble is trivially equal to the Boltzmann entropy.

We now consider a thermal distribution (the canonical ensemble), where the probability of a state x is

P(x) = (1/Z) exp(−E(x)/(k_B T)).   (4.50)

With this canonical ensemble we can associate a corresponding microcanonical ensemble, an ensemble with total energy fixed to the mean energy of the canonical ensemble (fixed to within some precision ε). Now, fixing the total energy to a precision ε is equivalent to fixing the value of ln(1/P(x)) to within ε/(k_B T). Our definition of the typical set T_{Nβ} was precisely that it consisted of all elements that have a value of log P(x) very close to the mean value of log P(x) under the canonical ensemble, −N H(X). Thus the microcanonical ensemble is equivalent to a uniform distribution over the typical set of the canonical ensemble.

Our proof of the 'asymptotic equipartition' principle thus proves (for the case of a system whose energy is separable into a sum of independent terms) that the Boltzmann entropy of the microcanonical ensemble is very close (for large N) to the Gibbs entropy of the canonical ensemble, if the energy of the microcanonical ensemble is constrained to equal the mean energy of the canonical ensemble.

Solution to exercise 4.18 (p.85). The normalizing constant of the Cauchy distribution

P(x) = (1/Z) · 1/(x² + 1)

is

Z = ∫_{−∞}^{∞} dx/(x² + 1) = [tan⁻¹ x]_{−∞}^{∞} = π/2 − (−π/2) = π.   (4.51)

The mean and variance of this distribution are both undefined. (The distribution is symmetrical about zero, but this does not imply that its mean is zero. The mean is the value of a divergent integral.)
The sum z = x₁ + x₂, where x₁ and x₂ both have Cauchy distributions, has probability density given by the convolution

P(z) = (1/π²) ∫_{−∞}^{∞} dx₁ · 1/(x₁² + 1) · 1/((z − x₁)² + 1),   (4.52)

which after a considerable labour using standard methods gives

P(z) = (1/π²) · 2π/(z² + 4) = (1/π) · 2/(z² + 2²),   (4.53)

which we recognize as a Cauchy distribution with width parameter 2 (where the original distribution has width parameter 1). This implies that the mean of the two points, x̄ = (x₁ + x₂)/2 = z/2, has a Cauchy distribution with width parameter 1. Generalizing, the mean of N samples from a Cauchy distribution is Cauchy-distributed with the same parameters as the individual samples. The probability distribution of the mean does not become narrower as 1/√N.

The central-limit theorem does not apply to the Cauchy distribution, because it does not have a finite variance.

An alternative neat method for getting to equation (4.53) makes use of the Fourier transform of the Cauchy distribution, which is a biexponential e^{−|ω|}. Convolution in real space corresponds to multiplication in Fourier space, so the Fourier transform of z is simply e^{−2|ω|}. Reversing the transform, we obtain equation (4.53).
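This non-narrowing of the sample mean is easy to check numerically. The following sketch (not from the text; function names are illustrative) estimates the interquartile range of the mean of N standard Cauchy samples, which stays near the single-sample value of 2 instead of shrinking like 1/√N:

```python
import math
import random

def cauchy_sample(rng):
    # A standard Cauchy variate is the tangent of a uniformly distributed angle.
    return math.tan(math.pi * (rng.random() - 0.5))

def interquartile_range(values):
    s = sorted(values)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]

def iqr_of_means(n_samples, n_trials=2000, seed=0):
    # Interquartile range of the mean of n_samples Cauchy variates,
    # estimated over n_trials independent trials.
    rng = random.Random(seed)
    means = [sum(cauchy_sample(rng) for _ in range(n_samples)) / n_samples
             for _ in range(n_trials)]
    return interquartile_range(means)
```

The quartiles of a width-1 Cauchy distribution sit at ±1, so both `iqr_of_means(1)` and `iqr_of_means(100)` come out close to 2, in line with equation (4.53); for a distribution with finite variance the second figure would have shrunk tenfold.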

Solution to exercise 4.20 (p.86). The function f(x) has inverse function

g(y) = y^{1/y}.   (4.54)

Note

log g(y) = (1/y) log y.   (4.55)

I obtained a tentative graph of f(x) by plotting g(y) with y along the vertical axis and g(y) along the horizontal axis. The resulting graph suggests that f(x) is single-valued and surprisingly well-behaved and ordinary for x ∈ (0, 1); for x ∈ (1, e^{1/e}), f(x) is two-valued. f(√2) is equal both to 2 and 4. For x > e^{1/e} (which is about 1.44), f(x) is infinite. However, it might be argued that this approach to sketching f(x) is only partly valid, if we define f as the limit of the sequence of functions x, x^x, x^{x^x}, ...; this sequence does not have a limit for 0 ≤ x ≤ (1/e)^e ≈ 0.07 on account of a pitchfork bifurcation at x = (1/e)^e; and for x ∈ (1, e^{1/e}), the sequence's limit is single-valued, the lower of the two values sketched in the figure.

[Figure 4.15. f(x) = x^{x^{x^{...}}}, shown at three different scales.]
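Returning to the weighing problems: the counting claim in solution 4.13(b), that the labelling scheme covers exactly (3^W − 3)/2 balls, can be checked by brute-force enumeration. A minimal sketch (the function name is ours, not the text's):

```python
from itertools import product

def conway_labels(W):
    # Non-constant sequences of W letters from A, B, C whose first
    # change of letter is A-to-B, B-to-C or C-to-A (solution 4.13(b)).
    good_first_change = {('A', 'B'), ('B', 'C'), ('C', 'A')}
    labels = []
    for seq in product('ABC', repeat=W):
        changes = [(a, b) for a, b in zip(seq, seq[1:]) if a != b]
        if changes and changes[0] in good_first_change:
            labels.append(''.join(seq))
    return labels
```

For W = 3 this reproduces the 12 labels listed in the solution, and in general the count is (3^W − 3)/2: the letter swap A↔B pairs each admissible non-constant sequence with exactly one inadmissible one.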

About Chapter 5

In the last chapter, we saw a proof of the fundamental status of the entropy as a measure of average information content. We defined a data compression scheme using fixed length block codes, and proved that as N increases, it is possible to encode N i.i.d. variables x = (x₁, ..., x_N) into a block of N(H(X) + ε) bits with vanishing probability of error, whereas if we attempt to encode X^N into N(H(X) − ε) bits, the probability of error is virtually 1.

We thus verified the possibility of data compression, but the block coding defined in the proof did not give a practical algorithm. In this chapter and the next, we study practical data compression algorithms. Whereas the last chapter's compression scheme used large blocks of fixed size and was lossy, in the next chapter we discuss variable-length compression schemes that are practical for small block sizes and that are not lossy.

Imagine a rubber glove filled with water. If we compress two fingers of the glove, some other part of the glove has to expand, because the total volume of water is constant. (Water is essentially incompressible.) Similarly, when we shorten the codewords for some outcomes, there must be other codewords that get longer, if the scheme is not lossy. In this chapter we will discover the information-theoretic equivalent of water volume.

Before reading Chapter 5, you should have worked on exercise 2.26 (p.37).

We will use the following notation for intervals:
x ∈ [1, 2) means that x ≥ 1 and x < 2;
x ∈ (1, 2] means that x > 1 and x ≤ 2.

5 Symbol Codes

In this chapter, we discuss variable-length symbol codes, which encode one source symbol at a time, instead of encoding huge strings of N source symbols. These codes are lossless: unlike the last chapter's block codes, they are guaranteed to compress and decompress without any errors; but there is a chance that the codes may sometimes produce encoded strings longer than the original source string.

The idea is that we can achieve compression, on average, by assigning shorter encodings to the more probable outcomes and longer encodings to the less probable.

The key issues are:

What are the implications if a symbol code is lossless? If some codewords are shortened, by how much do other codewords have to be lengthened?

Making compression practical. How can we ensure that a symbol code is easy to decode?

Optimal symbol codes. How should we assign codelengths to achieve the best compression, and what is the best achievable compression?

We again verify the fundamental status of the Shannon information content and the entropy, proving:

Source coding theorem (symbol codes). There exists a variable-length encoding C of an ensemble X such that the average length of an encoded symbol, L(C, X), satisfies L(C, X) ∈ [H(X), H(X) + 1).

The average length is equal to the entropy H(X) only if the codelength for each outcome is equal to its Shannon information content.

We will also define a constructive procedure, the Huffman coding algorithm, that produces optimal symbol codes.

Notation for alphabets. A^N denotes the set of ordered N-tuples of elements from the set A, i.e., all strings of length N.
The symbol A⁺ will denote the set of all strings of finite length composed of elements from the set A.

Example 5.1. {0, 1}³ = {000, 001, 010, 011, 100, 101, 110, 111}.

Example 5.2. {0, 1}⁺ = {0, 1, 00, 01, 10, 11, 000, 001, ...}.
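Example 5.1 can be reproduced mechanically; a small sketch (our naming) that enumerates A^N for any alphabet:

```python
from itertools import product

def strings_of_length(alphabet, N):
    # A^N: all ordered N-tuples over the alphabet, read off as strings.
    return [''.join(t) for t in product(alphabet, repeat=N)]
```

A⁺, by contrast, is the union of A^N over all N ≥ 1, so it can only be generated lazily, not enumerated exhaustively.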

5.1 Symbol codes

A (binary) symbol code C for an ensemble X is a mapping from the range of x, A_X = {a₁, ..., a_I}, to {0, 1}⁺. c(x) will denote the codeword corresponding to x, and l(x) will denote its length, with l_i = l(a_i).

The extended code C⁺ is a mapping from A_X⁺ to {0, 1}⁺ obtained by concatenation, without punctuation, of the corresponding codewords:

c⁺(x₁x₂...x_N) = c(x₁)c(x₂)...c(x_N).   (5.1)

[The term 'mapping' here is a synonym for 'function'.]

Example 5.3. A symbol code for the ensemble X defined by

A_X = {a, b, c, d},
P_X = {1/2, 1/4, 1/8, 1/8},   (5.2)

is C₀, shown in the margin: a → 1000, b → 0100, c → 0010, d → 0001 (each l_i = 4).

Using the extended code, we may encode acdbac as

c⁺(acdbac) = 100000100001010010000010.   (5.3)

There are basic requirements for a useful symbol code. First, any encoded string must have a unique decoding. Second, the symbol code must be easy to decode. And third, the code should achieve as much compression as possible.

Any encoded string must have a unique decoding

A code C(X) is uniquely decodeable if, under the extended code C⁺, no two distinct strings have the same encoding, i.e.,

∀ x, y ∈ A_X⁺,  x ≠ y  ⇒  c⁺(x) ≠ c⁺(y).   (5.4)

The code C₀ defined above is an example of a uniquely decodeable code.

The symbol code must be easy to decode

A symbol code is easiest to decode if it is possible to identify the end of a codeword as soon as it arrives, which means that no codeword can be a prefix of another codeword. [A word c is a prefix of another word d if there exists a tail string t such that the concatenation ct is identical to d.
For example, 1 is a prefix of 101, and so is 10.]

We will show later that we don't lose any performance if we constrain our symbol code to be a prefix code.

A symbol code is called a prefix code if no codeword is a prefix of any other codeword.

A prefix code is also known as an instantaneous or self-punctuating code, because an encoded string can be decoded from left to right without looking ahead to subsequent codewords. The end of a codeword is immediately recognizable. A prefix code is uniquely decodeable.

Prefix codes are also known as 'prefix-free codes' or 'prefix condition codes'.

Prefix codes correspond to trees, as illustrated in the margin of the next page.
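The extended code of equation (5.1) is just codeword concatenation; a two-line sketch using C₀ from example 5.3:

```python
# C0 from example 5.3: four codewords of length 4.
C0 = {'a': '1000', 'b': '0100', 'c': '0010', 'd': '0001'}

def encode(code, symbols):
    # The extended code c+ of equation (5.1): concatenation without punctuation.
    return ''.join(code[s] for s in symbols)
```

`encode(C0, 'acdbac')` reproduces the 24-bit string of equation (5.3).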

Example 5.4. The code C₁ = {0, 101} is a prefix code because 0 is not a prefix of 101, nor is 101 a prefix of 0.

Example 5.5. Let C₂ = {1, 101}. This code is not a prefix code because 1 is a prefix of 101.

Example 5.6. The code C₃ = {0, 10, 110, 111} is a prefix code.

Example 5.7. The code C₄ = {00, 01, 10, 11} is a prefix code.

Exercise 5.8.[1, p.104] Is C₂ uniquely decodeable?

Example 5.9. Consider exercise 4.1 (p.66) and figure 4.2 (p.69). Any weighing strategy that identifies the odd ball and whether it is heavy or light can be viewed as assigning a ternary code to each of the 24 possible states. This code is a prefix code.

[Margin: the binary trees for the codes C₁ = {0, 101}, C₃ = {0, 10, 110, 111} and C₄ = {00, 01, 10, 11}.]

The code should achieve as much compression as possible

The expected length L(C, X) of a symbol code C for ensemble X is

L(C, X) = Σ_{x ∈ A_X} P(x) l(x).   (5.5)

[Margin: Prefix codes can be represented on binary trees. Complete prefix codes correspond to binary trees with no unused branches. C₁ is an incomplete code.]

We may also write this quantity as

L(C, X) = Σ_{i=1}^{I} p_i l_i,   (5.6)

where I = |A_X|.

Example 5.10. Let

A_X = {a, b, c, d},
and P_X = {1/2, 1/4, 1/8, 1/8},   (5.7)

and consider the code C₃. The entropy of X is 1.75 bits, and the expected length L(C₃, X) of this code is also 1.75 bits. The sequence of symbols x = (acdbac) is encoded as c⁺(x) = 0110111100110. C₃ is a prefix code and is therefore uniquely decodeable. Notice that the codeword lengths satisfy l_i = log₂(1/p_i), or equivalently, p_i = 2^{−l_i}.

[Margin table for C₃: a → 0 (p = 1/2, h(p) = 1.0, l = 1); b → 10 (1/4, 2.0, 2); c → 110 (1/8, 3.0, 3); d → 111 (1/8, 3.0, 3).]
Example 5.11. Consider the fixed length code for the same ensemble X, C₄. The expected length L(C₄, X) is 2 bits.

Example 5.12. Consider C₅. The expected length L(C₅, X) is 1.25 bits, which is less than H(X). But the code is not uniquely decodeable. The sequence x = (acdbac) encodes as 000111000, which can also be decoded as (cabdca).

Example 5.13. Consider the code C₆. The expected length L(C₆, X) of this code is 1.75 bits. The sequence of symbols x = (acdbac) is encoded as c⁺(x) = 0011111010011.

Is C₆ a prefix code? It is not, because c(a) = 0 is a prefix of both c(b) and c(c).

[Margin tables: C₄: a → 00, b → 01, c → 10, d → 11. C₅: a → 0, b → 1, c → 00, d → 11. C₆: a → 0 (p = 1/2, h(p) = 1.0, l = 1); b → 01 (1/4, 2.0, 2); c → 011 (1/8, 3.0, 3); d → 111 (1/8, 3.0, 3).]
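Examples 5.10 to 5.13 can be checked numerically; a small sketch (the helper names are ours) comparing the expected lengths of C₃, C₄, C₅ and C₆ against the entropy:

```python
import math

p = {'a': 1 / 2, 'b': 1 / 4, 'c': 1 / 8, 'd': 1 / 8}

def entropy(p):
    # H(X) = sum_i p_i log2(1/p_i)
    return sum(pi * math.log2(1 / pi) for pi in p.values())

def expected_length(code, p):
    # L(C,X) = sum_i p_i l_i, equation (5.6)
    return sum(p[s] * len(cw) for s, cw in code.items())

C3 = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
C4 = {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
C5 = {'a': '0', 'b': '1', 'c': '00', 'd': '11'}
C6 = {'a': '0', 'b': '01', 'c': '011', 'd': '111'}
```

`entropy(p)` and `expected_length(C3, p)` both give 1.75 bits; C₄ gives 2 bits, C₆ gives 1.75 bits, and the non-uniquely-decodeable C₅ gives 1.25 bits, below the entropy.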

Is C₆ uniquely decodeable? This is not so obvious. If you think that it might not be uniquely decodeable, try to prove it so by finding a pair of strings x and y that have the same encoding. [The definition of unique decodeability is given in equation (5.4).]

C₆ certainly isn't easy to decode. When we receive '00', it is possible that x could start 'aa', 'ab' or 'ac'. Once we have received '001111', the second symbol is still ambiguous, as x could be 'abd...' or 'acd...'. But eventually a unique decoding crystallizes, once the next 0 appears in the encoded stream.

C₆ is in fact uniquely decodeable. Comparing with the prefix code C₃, we see that the codewords of C₆ are the reverse of C₃'s. That C₃ is uniquely decodeable proves that C₆ is too, since any string from C₆ is identical to a string from C₃ read backwards.

5.2 What limit is imposed by unique decodeability?

We now ask, given a list of positive integers {l_i}, does there exist a uniquely decodeable code with those integers as its codeword lengths? At this stage, we ignore the probabilities of the different symbols; once we understand unique decodeability better, we'll reintroduce the probabilities and discuss how to make an optimal uniquely decodeable symbol code.

In the examples above, we have observed that if we take a code such as {00, 01, 10, 11}, and shorten one of its codewords, for example 00 → 0, then we can retain unique decodeability only if we lengthen other codewords. Thus there seems to be a constrained budget that we can spend on codewords, with shorter codewords being more expensive.

Let us explore the nature of this budget.
If we build a code purely from codewords of length l equal to three, how many codewords can we have and retain unique decodeability? The answer is 2^l = 8. Once we have chosen all eight of these codewords, is there any way we could add to the code another codeword of some other length and retain unique decodeability? It would seem not.

What if we make a code that includes a length-one codeword, '0', with the other codewords being of length three? How many length-three codewords can we have? If we restrict attention to prefix codes, then we can have only four codewords of length three, namely {100, 101, 110, 111}. What about other codes? Is there any other way of choosing codewords of length 3 that can give more codewords? Intuitively, we think this unlikely. A codeword of length 3 appears to have a cost that is 2² times smaller than a codeword of length 1.

Let's define a total budget of size 1, which we can spend on codewords. If we set the cost of a codeword whose length is l to 2^{−l}, then we have a pricing system that fits the examples discussed above. Codewords of length 3 cost 1/8 each; codewords of length 1 cost 1/2 each.

We can spend our budget on any codewords. If we go over our budget then the code will certainly not be uniquely decodeable. If, on the other hand,

Σ_i 2^{−l_i} ≤ 1,   (5.8)

then the code may be uniquely decodeable. This inequality is the Kraft inequality.

Kraft inequality. For any uniquely decodeable code C(X) over the binary

alphabet {0, 1}, the codeword lengths must satisfy:

Σ_{i=1}^{I} 2^{−l_i} ≤ 1,   (5.9)

where I = |A_X|.

Completeness. If a uniquely decodeable code satisfies the Kraft inequality with equality then it is called a complete code.

We want codes that are uniquely decodeable; prefix codes are uniquely decodeable, and are easy to decode. So life would be simpler for us if we could restrict attention to prefix codes. Fortunately, for any source there is an optimal symbol code that is also a prefix code.

Kraft inequality and prefix codes. Given a set of codeword lengths that satisfy the Kraft inequality, there exists a uniquely decodeable prefix code with these codeword lengths.

The Kraft inequality might be more accurately referred to as the Kraft–McMillan inequality: Kraft proved that if the inequality is satisfied, then a prefix code exists with the given lengths. McMillan (1956) proved the converse, that unique decodeability implies that the inequality holds.

Proof of the Kraft inequality. Define S = Σ_i 2^{−l_i}. Consider the quantity

S^N = [ Σ_i 2^{−l_i} ]^N = Σ_{i₁=1}^{I} Σ_{i₂=1}^{I} ... Σ_{i_N=1}^{I} 2^{−(l_{i₁} + l_{i₂} + ... + l_{i_N})}.   (5.10)

The quantity in the exponent, (l_{i₁} + l_{i₂} + ... + l_{i_N}), is the length of the encoding of the string x = a_{i₁} a_{i₂} ... a_{i_N}. For every string x of length N, there is one term in the above sum. Introduce an array A_l that counts how many strings x have encoded length l. Then, defining l_min = min_i l_i and l_max = max_i l_i:

S^N = Σ_{l = N l_min}^{N l_max} 2^{−l} A_l.   (5.11)

Now assume C is uniquely decodeable, so that for all x ≠ y, c⁺(x) ≠ c⁺(y). Concentrate on the x that have encoded length l.
There are a total of 2^l distinct bit strings of length l, so it must be the case that A_l ≤ 2^l. So

S^N = Σ_{l = N l_min}^{N l_max} 2^{−l} A_l ≤ Σ_{l = N l_min}^{N l_max} 1 ≤ N l_max.   (5.12)

Thus S^N ≤ l_max N for all N. Now if S were greater than 1, then as N increases, S^N would be an exponentially growing function, and for large enough N, an exponential always exceeds a polynomial such as l_max N. But our result (S^N ≤ l_max N) is true for any N. Therefore S ≤ 1.

Exercise 5.14.[3, p.104] Prove the result stated above, that for any set of codeword lengths {l_i} satisfying the Kraft inequality, there is a prefix code having those lengths.
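One standard construction for exercise 5.14 assigns, to the i-th shortest length l_i, the first l_i bits of the binary expansion of the budget already spent. A sketch under that approach (this is one possible proof strategy, not the book's solution; the helper names are ours):

```python
def kraft_sum(lengths):
    # Left-hand side of the Kraft inequality (5.9).
    return sum(2.0 ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    # Walk along the [0,1) codeword budget of figure 5.1: each codeword is
    # the l-bit binary expansion of the cumulative cost spent so far.
    assert kraft_sum(lengths) <= 1.0, 'lengths violate the Kraft inequality'
    codewords = []
    spent = 0.0
    for l in sorted(lengths):
        codewords.append(format(int(spent * (1 << l)), '0%db' % l))
        spent += 2.0 ** -l
    return codewords

def is_prefix_code(codewords):
    return not any(c != d and d.startswith(c)
                   for c in codewords for d in codewords)
```

For instance, `prefix_code_from_lengths([1, 2, 3, 3])` returns `['0', '10', '110', '111']`, which is exactly C₃.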

[Figure 5.1. The symbol coding budget. The 'cost' 2^{−l} of each codeword (with length l) is indicated by the size of the box it is written in. The total budget available when making a uniquely decodeable code is 1. You can think of this diagram as showing a codeword supermarket, with the codewords arranged in aisles by their length, the cost of each codeword indicated by the size of its box on the shelf. If the cost of the codewords that you take exceeds the budget then your code will not be uniquely decodeable.]

[Figure 5.2. Selections of codewords made by codes C₀, C₃, C₄ and C₆ from section 5.1.]

A pictorial view of the Kraft inequality may help you solve this exercise. Imagine that we are choosing the codewords to make a symbol code. We can draw the set of all candidate codewords in a supermarket that displays the 'cost' of each codeword by the area of a box (figure 5.1). The total budget available (the '1' on the right-hand side of the Kraft inequality) is shown at one side. Some of the codes discussed in section 5.1 are illustrated in figure 5.2. Notice that the codes that are prefix codes, C₀, C₃, and C₄, have the property that to the right of any selected codeword, there are no other selected codewords: prefix codes correspond to trees. Notice that a complete prefix code corresponds to a complete tree having no unused branches.

We are now ready to put back the symbols' probabilities {p_i}. Given a set of symbol probabilities (the English language probabilities of figure 2.1, for example), how do we make the best symbol code, one with the smallest possible expected length L(C, X)? And what is that smallest possible expected length? It's not obvious how to assign the codeword lengths. If we give short codewords to the more probable symbols then the expected length might be reduced; on the other hand, shortening some codewords necessarily causes others to lengthen, by the Kraft inequality.

5.3 What's the most compression that we can hope for?

We wish to minimize the expected length of a code,

L(C, X) = Σ_i p_i l_i.   (5.13)

As you might have guessed, the entropy appears as the lower bound on the expected length of a code.

Lower bound on expected length.
The exp ected length L ( C;X ) of a uniquely deco code is bounded below by H ( X ). deable P l l 0 i i 2 , so z = =z , where of. We de ne the implicit probabilities q Pro 2 0 i i P =q p 1 log log z . We then use Gibbs' that y, = log 1 =q l inequalit i i i i i P p log 1 =p z , with 1: y if q y = p inequalit , and the Kraft equalit i i i i i X X (5.14) z =q 1 log p log = l p C;X L ) = ( i i i i i i X p log log (5.15) =p z 1 i i i H ( X ) : (5.16) The equalit y L ( C;X ) = H ( X ) is achiev ed only if the Kraft equalit y z = 1 is satis ed, and codelengths satisfy l if the = log(1 =p 2 ). i i is an imp ortan t result so let's say it again: This source codelengths . The exp ected length is minimized and is Optimal to to ( X ) only if the codelengths are equal H the Shannon in- equal formation con ten ts : l (5.17) = log : (1 =p ) i i 2 probabilities by codelengths . Con versely , any choice Implicit de ned f l , of codelengths implicitly de nes a probabilit y distribution f q g g i i l i q 2 =z; (5.18) i whic h those codelengths for be the optimal codelengths. If the would code is complete then z = 1 and the implicit probabilities are given by l i q = 2 . i
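These definitions are easy to play with numerically. The small sketch below (plain Python, not the book's own code; the example is the code C_3 = {0, 10, 110, 111} that appears later in this chapter) computes the entropy, the expected length, and the implicit probabilities of equation (5.18), and confirms that the lower bound is met with equality when the lengths equal the Shannon information contents:

```python
from math import log2

def entropy(p):
    """H(X) = sum of p_i * log2(1/p_i), skipping zero-probability outcomes."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

def expected_length(p, lengths):
    """L(C, X) = sum of p_i * l_i, equation (5.13)."""
    return sum(pi * li for pi, li in zip(p, lengths))

def implicit_probabilities(lengths):
    """q_i = 2^(-l_i) / z with z = sum of 2^(-l_i), equation (5.18)."""
    z = sum(2.0 ** -l for l in lengths)
    return [2.0 ** -l / z for l in lengths], z

# The code C3 = {0, 10, 110, 111} with its matching dyadic ensemble.
p = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]

q, z = implicit_probabilities(lengths)
assert z == 1.0        # the code is complete
assert q == p          # the lengths equal the Shannon information contents
assert expected_length(p, lengths) == entropy(p) == 1.75   # L = H: bound is tight
```

With a non-dyadic ensemble, or with a different choice of lengths, the same functions show L strictly greater than H, the excess being the relative entropy between p and q.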

5.4 How much can we compress?

So, we can't compress below the entropy. How close can we expect to get to the entropy?

Theorem 5.1 Source coding theorem for symbol codes. For an ensemble X there exists a prefix code C with expected length satisfying

  H(X) ≤ L(C,X) < H(X) + 1.   (5.19)

Proof. We set the codelengths to integers slightly larger than the optimum lengths:

  l_i = ⌈log_2 (1/p_i)⌉   (5.20)

where ⌈l*⌉ denotes the smallest integer greater than or equal to l*. [We are not asserting that the optimal code necessarily uses these lengths; we are simply choosing these lengths because we can use them to prove the theorem.]

We check that there is a prefix code with these lengths by confirming that the Kraft inequality is satisfied:

  Σ_i 2^{-l_i} = Σ_i 2^{-⌈log_2(1/p_i)⌉} ≤ Σ_i 2^{-log_2(1/p_i)} = Σ_i p_i = 1.   (5.21)

Then we confirm

  L(C,X) = Σ_i p_i ⌈log(1/p_i)⌉ < Σ_i p_i (log(1/p_i) + 1) = H(X) + 1.  □   (5.22)

The cost of using the wrong codelengths

If we use a code whose lengths are not equal to the optimal codelengths, the average message length will be larger than the entropy.

If the true probabilities are {p_i} and we use a complete code with lengths l_i, we can view those lengths as defining implicit probabilities q_i = 2^{-l_i}. Continuing from equation (5.14), the average length is

  L(C,X) = H(X) + Σ_i p_i log (p_i/q_i),   (5.23)

i.e., it exceeds the entropy by the relative entropy D_KL(p||q) (as defined on p.34).

[Figure 5.3. An ensemble in need of a symbol code: the English monogram probabilities.

  x   P(x)       x   P(x)       x   P(x)
  a   0.0575     k   0.0084     u   0.0334
  b   0.0128     l   0.0335     v   0.0069
  c   0.0263     m   0.0235     w   0.0119
  d   0.0285     n   0.0596     x   0.0073
  e   0.0913     o   0.0689     y   0.0164
  f   0.0173     p   0.0192     z   0.0007
  g   0.0133     q   0.0008     -   0.1928
  h   0.0313     r   0.0508
  i   0.0599     s   0.0567
  j   0.0006     t   0.0706                ]

5.5 Optimal source coding with symbol codes: Huffman coding

Given a set of probabilities P, how can we design an optimal prefix code? For example, what is the best symbol code for the English language ensemble shown in figure 5.3? When we say 'optimal', let's assume our aim is to minimize the expected length L(C,X).

How not to do it

One might try to roughly split the set A_X in two, and continue bisecting the subsets so as to define a binary tree from the root. This construction has the right spirit, as in the weighing problem, but it is not necessarily optimal; it achieves L(C,X) ≤ H(X) + 2.
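The construction used in the proof of theorem 5.1 can be checked directly. A small sketch (plain Python; the four-symbol ensemble is an arbitrary illustrative choice, not one from the text):

```python
from math import ceil, log2

def shannon_lengths(p):
    """The integer codelengths l_i = ceil(log2(1/p_i)) from the proof of theorem 5.1."""
    return [ceil(log2(1 / pi)) for pi in p]

p = [0.45, 0.25, 0.2, 0.1]                 # an arbitrary example ensemble
lengths = shannon_lengths(p)
kraft = sum(2.0 ** -l for l in lengths)
H = sum(pi * log2(1 / pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

assert lengths == [2, 2, 3, 4]
assert kraft <= 1       # Kraft holds, so a prefix code with these lengths exists (5.21)
assert H <= L < H + 1   # the bound (5.19) of theorem 5.1
```

These lengths are generally not optimal (Huffman's lengths can be shorter), but they always satisfy the theorem's bound.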

The Huffman coding algorithm

We now present a beautifully simple algorithm for finding an optimal prefix code. The trick is to construct the code backwards starting from the tails of the codewords; we build the binary tree from its leaves.

Algorithm 5.4. Huffman coding algorithm.

1. Take the two least probable symbols in the alphabet. These two symbols will be given the longest codewords, which will have equal length, and differ only in the last digit.

2. Combine these two symbols into a single symbol, and repeat.

Since each step reduces the size of the alphabet by one, this algorithm will have assigned strings to all the symbols after |A_X| − 1 steps.

Example 5.15. Let A_X = {a, b, c, d, e} and P_X = {0.25, 0.25, 0.2, 0.15, 0.15}.

[Table 5.5. Code created by the Huffman algorithm. The successive merges combine 0.15 + 0.15 → 0.3, then 0.2 + 0.25 → 0.45, then 0.25 + 0.3 → 0.55, then 0.45 + 0.55 → 1.0.]

  a_i   p_i    h(p_i)   l_i   c(a_i)
  a     0.25   2.0      2     00
  b     0.25   2.0      2     10
  c     0.2    2.3      2     11
  d     0.15   2.7      3     010
  e     0.15   2.7      3     011

The codewords are then obtained by concatenating the binary digits in reverse order: C = {00, 10, 11, 010, 011}. The codelengths selected by the Huffman algorithm (column 4 of table 5.5) are in some cases longer and in some cases shorter than the ideal codelengths, the Shannon information contents log_2 1/p_i (column 3). The expected length of the code is L = 2.30 bits, whereas the entropy is H = 2.2855 bits. □

If at any point there is more than one way of selecting the two least probable symbols then the choice may be made in any manner; the expected length of the code will not depend on the choice.

Exercise 5.16. [3, p.105] Prove that there is no better symbol code for a source than the Huffman code.

Example 5.17. We can make a Huffman code for the probability distribution over the alphabet introduced in figure 2.1. The result is shown in figure 5.6. This code has an expected length of 4.15 bits; the entropy of the ensemble is 4.11 bits. Observe the disparities between the assigned codelengths and the ideal codelengths log_2 1/p_i.

Constructing a binary tree top-down is suboptimal

In previous chapters we studied weighing problems in which we built ternary or binary trees. We noticed that balanced trees, ones in which, at every step, the two possible outcomes were as close as possible to equiprobable, appeared to describe the most efficient experiments. This gave an intuitive motivation for entropy as a measure of information content.
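Algorithm 5.4 translates almost directly into code. The sketch below is a minimal Python illustration (not the book's own code); it tracks only the codelengths, using a heap with an integer tie-breaker, and reproduces the lengths of example 5.15:

```python
import heapq

def huffman_lengths(p):
    """Optimal codelengths by repeatedly merging the two least probable
    entries (algorithm 5.4).  p maps symbol -> probability."""
    # Each heap entry: (probability, tie-breaker, list of member symbols).
    heap = [(pi, i, [s]) for i, (s, pi) in enumerate(p.items())]
    heapq.heapify(heap)
    lengths, counter = {s: 0 for s in p}, len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:      # every symbol inside the merge gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

p = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}   # example 5.15
lengths = huffman_lengths(p)
L = sum(p[s] * lengths[s] for s in p)
assert sorted(lengths.values()) == [2, 2, 2, 3, 3]
assert abs(L - 2.30) < 1e-9          # matches table 5.5
```

Ties may be broken differently from the worked example, assigning different codewords, but, as noted above, the expected length comes out the same.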

[Figure 5.6. Huffman code for the English language ensemble (monogram statistics), listing for each symbol its probability p_i, the information content log_2 1/p_i, the assigned codelength l_i, and the codeword c(a_i).]

It is not the case, however, that optimal codes can always be constructed by a greedy top-down method in which the alphabet is successively divided into subsets that are as near as possible to equiprobable.

Example 5.18. Find the optimal binary symbol code for the ensemble:

  A_X = {a, b, c, d, e, f, g}
  P_X = {0.01, 0.24, 0.05, 0.20, 0.47, 0.01, 0.02}.   (5.24)

Notice that a greedy top-down method can split this set into two subsets {a, b, c, d} and {e, f, g} which both have probability 1/2, and that {a, b, c, d} can be divided into subsets {a, b} and {c, d}, which have probability 1/4; so a greedy method gives the code shown in the third column of table 5.7, which has expected length 2.53. The Huffman coding algorithm yields the code shown in the fourth column, which has expected length 1.97. □

[Table 5.7. A greedily-constructed code compared with the Huffman code.]

  a_i   p_i   Greedy   Huffman
  a     .01   000      000000
  b     .24   001      01
  c     .05   010      0001
  d     .20   011      001
  e     .47   10       1
  f     .01   110      000001
  g     .02   111      00001

5.6 Disadvantages of the Huffman code

The Huffman algorithm produces an optimal symbol code for an ensemble, but this is not the end of the story. Both the word 'ensemble' and the phrase 'symbol code' need careful attention.

Changing ensemble

If we wish to communicate a sequence of outcomes from one unchanging ensemble, then a Huffman code may be convenient. But often the appropriate ensemble changes.
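Returning to example 5.18, the gap between the greedy and Huffman constructions can be verified directly from the two columns of table 5.7; a quick Python check:

```python
p = {"a": 0.01, "b": 0.24, "c": 0.05, "d": 0.20,
     "e": 0.47, "f": 0.01, "g": 0.02}

# The two codes of table 5.7: greedy top-down bisection vs. Huffman.
greedy  = {"a": "000",    "b": "001",    "c": "010",  "d": "011",
           "e": "10",     "f": "110",    "g": "111"}
huffman = {"a": "000000", "b": "01",     "c": "0001", "d": "001",
           "e": "1",      "f": "000001", "g": "00001"}

def expected_length(p, code):
    """L(C, X) = sum of p_i * len(c(a_i))."""
    return sum(p[s] * len(code[s]) for s in p)

assert abs(expected_length(p, greedy)  - 2.53) < 1e-9
assert abs(expected_length(p, huffman) - 1.97) < 1e-9
```

The greedy code wastes nearly half a bit per symbol here because it forces the 0.47-probability symbol to share a length-2 subtree.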

If, for example, we are compressing text, then the symbol frequencies will vary with context: in English the letter u is much more probable after a q than after an e (figure 2.3). And furthermore, our knowledge of these context-dependent symbol frequencies will also change as we learn the statistical properties of the text source.

Huffman codes do not handle changing ensemble probabilities with any elegance. One brute-force approach would be to recompute the Huffman code every time the probability over symbols changes. Another attitude is to deny the option of adaptation, and instead run through the entire file in advance and compute a good probability distribution, which will then remain fixed throughout transmission. The code itself must also be communicated in this scenario. Such a technique is not only cumbersome and restrictive, it is also suboptimal, since the initial message specifying the code and the document itself are partially redundant. This technique therefore wastes bits.

The extra bit

An equally serious problem with Huffman codes is the innocuous-looking 'extra bit' relative to the ideal average length of H(X): a Huffman code achieves a length that satisfies H(X) ≤ L(C,X) < H(X) + 1, as proved in theorem 5.1. A Huffman code thus incurs an overhead of between 0 and 1 bits per symbol. If H(X) were large, then this would be an unimportant fractional increase. But for many applications, the entropy may be as low as one bit per symbol, or even smaller, so the overhead L(C,X) − H(X) may dominate the encoded file length.

Consider English text: in some contexts, long strings of characters may be highly predictable. For example, in the context 'strings_of_ch', one might predict the next nine symbols to be 'aracters_' with a probability of 0.99 each. A traditional Huffman code would be obliged to use at least one bit per character, making a total cost of nine bits where virtually no information is being conveyed (0.13 bits in total, to be precise). The entropy of English, given a good model, is about one bit per character (Shannon, 1948), so a Huffman code is likely to be highly inefficient.

A traditional patch-up of Huffman codes uses them to compress blocks of symbols, for example the 'extended sources' X^N we discussed in Chapter 4. The overhead per block is at most 1 bit, so the overhead per symbol is at most 1/N bits. For sufficiently large blocks, the problem of the extra bit may be removed, but only at the expense of (a) losing the elegant instantaneous decodeability of simple Huffman coding; and (b) having to compute the probabilities of all relevant strings and build the associated Huffman tree. One will end up explicitly computing the probabilities and codes for a huge number of strings, most of which will never actually occur. (See exercise 5.29 (p.103).)

Beyond symbol codes

Huffman codes, therefore, although widely trumpeted as 'optimal', have many defects for practical purposes. They are optimal symbol codes, but for practical purposes we don't want a symbol code. The defects of Huffman codes are rectified by arithmetic coding, which dispenses with the restriction that each symbol must translate into an integer number of bits. Arithmetic coding is the main topic of the next chapter.
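The block-coding patch-up can be watched numerically. The sketch below (plain Python, not the book's code; huffman_lengths is a minimal merge-the-two-least-probable routine in the spirit of algorithm 5.4) Huffman-codes the extended sources X^N for the P_X = {0.9, 0.1} ensemble of exercise 5.21 and shows the per-symbol expected length creeping down towards the entropy H(X) ≈ 0.469 bits:

```python
import heapq
from itertools import product
from math import log2

def huffman_lengths(p):
    """Optimal codelengths from repeatedly merging the two least probable entries."""
    heap = [(pi, i, [s]) for i, (s, pi) in enumerate(p.items())]
    heapq.heapify(heap)
    lengths, counter = {s: 0 for s in p}, len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:            # every symbol inside a merge gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

px = {"0": 0.9, "1": 0.1}
H = sum(pi * log2(1 / pi) for pi in px.values())     # about 0.469 bits per symbol

per_symbol = {}
for N in (1, 2, 4):
    # The extended source X^N: strings of N symbols with product probabilities.
    pN = {}
    for t in product(px, repeat=N):
        prob = 1.0
        for s in t:
            prob *= px[s]
        pN["".join(t)] = prob
    lengths = huffman_lengths(pN)
    per_symbol[N] = sum(pN[s] * lengths[s] for s in pN) / N

# Per-symbol expected lengths shrink towards H(X) as the block length grows.
assert abs(per_symbol[1] - 1.0) < 1e-12
assert abs(per_symbol[2] - 1.29 / 2) < 1e-9          # L(C, X^2) = 1.29 (solution 5.21)
assert abs(per_symbol[4] - 1.9702 / 4) < 1e-6        # L(C, X^4) = 1.9702
assert per_symbol[1] > per_symbol[2] > per_symbol[4] > H
```

Even at N = 4 the overhead is still noticeable (about 0.49 vs. 0.47 bits per symbol), while the alphabet has already grown to 2^4 strings, which is the combinatorial cost the text describes.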

5.7 Summary

Kraft inequality. If a code is uniquely decodeable its lengths must satisfy

  Σ_i 2^{-l_i} ≤ 1.   (5.25)

For any lengths satisfying the Kraft inequality, there exists a prefix code with those lengths.

Optimal source codelengths for an ensemble are equal to the Shannon information contents

  l_i = log_2 (1/p_i),   (5.26)

and conversely, any choice of codelengths defines implicit probabilities

  q_i = 2^{-l_i}/z.   (5.27)

The relative entropy D_KL(p||q) measures how many bits per symbol are wasted by using a code whose implicit probabilities are q, when the ensemble's true probability distribution is p.

Source coding theorem for symbol codes. For an ensemble X, there exists a prefix code whose expected length satisfies

  H(X) ≤ L(C,X) < H(X) + 1.   (5.28)

The Huffman coding algorithm generates an optimal symbol code iteratively. At each iteration, the two least probable symbols are combined.

5.8 Exercises

Exercise 5.19. [2] Is the code {00, 11, 0101, 111, 1010, 100100, 0110} uniquely decodeable?

Exercise 5.20. [2] Is the ternary code {00, 012, 0110, 0112, 100, 201, 212, 22} uniquely decodeable?

Exercise 5.21. [3, p.106] Make Huffman codes for X^2, X^3 and X^4 where A_X = {0, 1} and P_X = {0.9, 0.1}. Compute their expected lengths and compare them with the entropies H(X^2), H(X^3) and H(X^4).
Repeat this exercise for X^2 and X^4 where P_X = {0.6, 0.4}.

Exercise 5.22. [2, p.106] Find a probability distribution {p_1, p_2, p_3, p_4} such that there are two optimal codes that assign different lengths {l_i} to the four symbols.

Exercise 5.23. [3] (Continuation of exercise 5.22.) Assume that the four probabilities {p_1, p_2, p_3, p_4} are ordered such that p_1 ≥ p_2 ≥ p_3 ≥ p_4 ≥ 0. Let Q be the set of all probability vectors p such that there are two optimal codes with different lengths. Give a complete description of Q. Find three probability vectors q^{(1)}, q^{(2)}, q^{(3)}, which are the convex hull of Q, i.e., such that any p in Q can be written as

  p = μ_1 q^{(1)} + μ_2 q^{(2)} + μ_3 q^{(3)},   (5.29)

where {μ_i} are positive.

Exercise 5.24. [1] Write a short essay discussing how to play the game of twenty questions optimally. [In twenty questions, one player thinks of an object, and the other player has to guess the object using as few binary questions as possible, preferably fewer than twenty.]

Exercise 5.25. [2] Show that, if each probability p_i is equal to an integer power of 2, then there exists a source code whose expected length equals the entropy.

Exercise 5.26. [2, p.106] Make ensembles for which the difference between the entropy and the expected length of the Huffman code is as big as possible.

Exercise 5.27. [2, p.106] A source X has an alphabet of eleven characters

  {a, b, c, d, e, f, g, h, i, j, k},

all of which have equal probability, 1/11.
Find an optimal uniquely decodeable symbol code for this source. How much greater is the expected length of this optimal code than the entropy of X?

Exercise 5.28. [2] Consider the optimal symbol code for an ensemble X with alphabet size I from which all symbols have identical probability p = 1/I. I is not a power of 2.
Show that the fraction f^+ of the I symbols that are assigned codelengths equal to

  l^+ ≡ ⌈log_2 I⌉   (5.30)

satisfies

  f^+ = 2 − 2^{l^+}/I   (5.31)

and that the expected length of the optimal symbol code is

  L = l^+ − 1 + f^+.   (5.32)

By differentiating the excess length ΔL ≡ L − H(X) with respect to I, show that the excess length is bounded by

  ΔL ≤ 1 − ln(ln 2)/ln 2 − 1/ln 2 = 0.086.   (5.33)

Exercise 5.29. [2] Consider a sparse binary source with P_X = {0.99, 0.01}. Discuss how Huffman codes could be used to compress this source efficiently. Estimate how many codewords your proposed solutions require.

Exercise 5.30. [2] Scientific American carried the following puzzle in 1975.

The poisoned glass. 'Mathematicians are curious birds', the police commissioner said to his wife. 'You see, we had all those partly filled glasses lined up in rows on a table in the hotel kitchen. Only one contained poison, and we wanted to know which one before searching that glass for fingerprints. Our lab could test the liquid in each glass, but the tests take time and money, so we wanted to make as few of them as possible by simultaneously testing mixtures of small samples from groups of glasses. The university sent over a

mathematics professor to help us. He counted the glasses, smiled and said:

'"Pick any glass you want, Commissioner. We'll test it first."

'"But won't that waste a test?" I asked.

'"No," he said, "it's part of the best procedure. We can test one glass first. It doesn't matter which one."'

'How many glasses were there to start with?' the commissioner's wife asked.

'I don't remember. Somewhere between 100 and 200.'

What was the exact number of glasses?

Solve this puzzle and then explain why the professor was in fact wrong and the commissioner was right. What is in fact the optimal procedure for identifying the one poisoned glass? What is the expected waste relative to this optimum if one followed the professor's strategy? Explain the relationship to symbol coding.

Exercise 5.31. [2, p.106] Assume that a sequence of symbols from the ensemble X introduced at the beginning of this chapter is compressed using the code C_3. Imagine picking one bit at random from the binary encoded sequence c = c(x_1)c(x_2)c(x_3).... What is the probability that this bit is a 1?

[The code C_3, for reference:]

  a_i   c(a_i)   p_i   h(p_i)   l_i
  a     0        1/2   1.0      1
  b     10       1/4   2.0      2
  c     110      1/8   3.0      3
  d     111      1/8   3.0      3

Exercise 5.32. [2, p.107] How should the binary Huffman encoding scheme be modified to make optimal symbol codes in an encoding alphabet with q symbols? (Also known as 'radix q'.)

Mixture codes

It is a tempting idea to construct a 'metacode' from several symbol codes that assign different-length codewords to the alternative symbols, then switch from one code to another, choosing whichever code assigns the shortest codeword to the current symbol. Clearly we cannot do this for free. If one wishes to choose between two codes, then it is necessary to lengthen the message in a way that indicates which of the two codes is being used. If we indicate this choice by a single leading bit, it will be found that the resulting code is suboptimal because it is incomplete (that is, it fails the Kraft equality).

Exercise 5.33. [3, p.108] Prove that this metacode is incomplete, and explain why this combined code is suboptimal.

5.9 Solutions

Solution to exercise 5.8 (p.93). Yes, C_2 = {1, 101} is uniquely decodeable, even though it is not a prefix code, because no two different strings can map onto the same string; only the codeword c(a_2) = 101 contains the symbol 0.

Solution to exercise 5.14 (p.95). We wish to prove that for any set of codeword lengths {l_i} satisfying the Kraft inequality, there is a prefix code having those lengths. This is readily proved by thinking of the codewords illustrated in figure 5.8 as being in a 'codeword supermarket', with size indicating cost. We imagine purchasing codewords one at a time, starting from the shortest codewords (i.e., the biggest purchases), using the budget shown at the right of figure 5.8. We start at one side of the codeword supermarket, say the

top, and purchase the first codeword of the required length. We advance down the supermarket a distance 2^{-l}, and purchase the next codeword of the next required length, and so forth. Because the required codeword lengths are getting longer, and the corresponding intervals are getting shorter, we can always buy an adjacent codeword to the latest purchase, so there is no wasting of the budget. Thus at the I-th codeword we have advanced a distance Σ_{i=1}^{I} 2^{-l_i} down the supermarket; if Σ_i 2^{-l_i} ≤ 1, we will have purchased all the codewords without running out of budget.

[Figure 5.8. The codeword supermarket and the symbol coding budget. The 'cost' 2^{-l} of each codeword (with length l) is indicated by the size of the box it is written in. The total budget available when making a uniquely decodeable code is 1.]

[Figure 5.9. Proof that Huffman coding makes an optimal symbol code. We assume that the rival code, which is said to be optimal, assigns unequal length codewords to the two symbols with smallest probability, a and b. By interchanging codewords a and c of the rival code, where c is a symbol with rival codelength as long as b's, we can make a code better than the rival code. This shows that the rival code was not optimal.]
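The purchasing procedure in this proof is constructive. A sketch (plain Python; the function name and the sorted-lengths restriction are my own simplifications): walking a running position down the supermarket and giving each length-l codeword the next interval of width 2^{-l} yields what is often called a canonical prefix code:

```python
def prefix_code_from_lengths(lengths):
    """Walk down the 'codeword supermarket': keep a running position and give
    each length-l codeword the next interval of width 2^-l.  Assumes the
    lengths are already sorted in increasing order."""
    assert lengths == sorted(lengths)
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    codewords, position, prev = [], 0, 0   # position counted in units of 2^-l
    for l in lengths:
        position <<= l - prev              # rescale the running total to length l
        codewords.append(format(position, "0%db" % l))
        position += 1
        prev = l
    return codewords

# The lengths {1, 2, 3, 3} of code C3 recover a prefix code:
assert prefix_code_from_lengths([1, 2, 3, 3]) == ["0", "10", "110", "111"]
```

Because each codeword occupies the interval immediately after its predecessor, no codeword can be a prefix of a later, longer one, which is exactly the adjacency argument of the proof.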
Solution to exercise 5.16 (p.99). The proof that Huffman coding is optimal depends on proving that the key step in the algorithm, the decision to give the two symbols with smallest probability equal encoded lengths, cannot lead to a larger expected length than any other code. We can prove this by contradiction.

Assume that the two symbols with smallest probability, called a and b, to which the Huffman algorithm would assign equal length codewords, do not have equal lengths in some optimal symbol code. The optimal symbol code is some rival code in which these two codewords have unequal lengths l_a and l_b with l_a < l_b. Without loss of generality we can assume that this rival code is a complete prefix code, because any codelengths of a uniquely decodeable code can be realized by a prefix code.

In this rival code, there must be some other symbol c whose probability p_c is greater than p_a and whose length in the rival code is greater than or equal to l_b, because the code for b must have an adjacent codeword of equal or greater length; a complete prefix code never has a solo codeword of the maximum length.

Consider exchanging the codewords of a and c (figure 5.9), so that a is

encoded with the codeword that was c's, and c, which is more probable than a, gets the shorter codeword. Clearly this reduces the expected length of the code. The change in expected length is (p_a − p_c)(l_c − l_a). Thus we have contradicted the assumption that the rival code is optimal. Therefore it is valid to give the two symbols with smallest probability equal encoded lengths. Huffman coding produces optimal symbol codes. □

Solution to exercise 5.21 (p.102). A Huffman code for X^2 where A_X = {0, 1} and P_X = {0.9, 0.1} is {00, 01, 10, 11} → {1, 01, 000, 001}. This code has L(C, X^2) = 1.29, whereas the entropy H(X^2) is 0.938.

A Huffman code for X^3 is

  {000, 100, 010, 001, 101, 011, 110, 111} → {1, 011, 010, 001, 00000, 00001, 00010, 00011}.

This has expected length L(C, X^3) = 1.598, whereas the entropy H(X^3) is 1.4069.

A Huffman code for X^4 maps the sixteen source strings to the following codelengths:

  {0000, 1000, 0100, 0010, 0001, 1100, 0110, 0011, 0101, 1010, 1001, 1101, 1011, 0111, 1110, 1111}
  → {1, 3, 3, 3, 4, 6, 7, 7, 7, 7, 7, 9, 9, 9, 10, 10}.

This has expected length L(C, X^4) = 1.9702, whereas the entropy H(X^4) is 1.876.

When P_X = {0.6, 0.4}, the Huffman code for X^2 has lengths {2, 2, 2, 2}; the expected length is 2 bits, and the entropy is 1.94 bits. A Huffman code for X^4 is shown in table 5.10. The expected length is 3.92 bits, and the entropy is 3.88 bits.

[Table 5.10. Huffman code for X^4 when p_0 = 0.6, listing each source string, its probability, the assigned codelength, and the codeword. Some of the source strings whose probabilities are identical, e.g. the fourth and fifth, receive different codelengths.]

Solution to exercise 5.22 (p.102). The set of probabilities {p_1, p_2, p_3, p_4} = {1/3, 1/3, 1/6, 1/6} gives rise to two different optimal sets of codelengths, because at the second step of the Huffman coding algorithm we can choose any of three possible pairings. We may either put them in a constant length code {00, 01, 10, 11} or the code {000, 001, 01, 1}. Both codes have expected length 2.

Another solution is {p_1, p_2, p_3, p_4} = {2/5, 1/5, 1/5, 1/5}.

And a third is {p_1, p_2, p_3, p_4} = {1/3, 1/3, 1/3, 0}.

Solution to exercise 5.26 (p.103). Let p_max be the largest probability in p_1, ..., p_I. The difference between the expected length L and the entropy H can be no bigger than max(p_max, 0.086) (Gallager, 1978). See exercises 5.27–5.28 to understand where the curious 0.086 comes from.

Solution to exercise 5.27 (p.103). Length − entropy = 0.086.

Solution to exercise 5.31 (p.104). There are two ways to answer this problem correctly, and one popular way to answer it incorrectly. Let's give the incorrect answer first:

Erroneous answer. 'We can pick a random bit by first picking a random symbol x_i with probability p_i, then picking a random bit from c(x_i). If we define f_i to be the fraction of the bits of c(x_i) that are 1s, we find

  P(bit is 1) = Σ_i p_i f_i   (5.34)
              = 1/2 × 0 + 1/4 × 1/2 + 1/8 × 2/3 + 1/8 × 1 = 1/3.'   (5.35)
Column : = 0 when p 0 2 2 has ; X for code Hu man , the g 4 : 0 ; 6 : 0 f = lengths P When 2 ; 2 ; f g ; 2 the assigned and codelengths X Some ords. codew 4 the column code for ected exp length is 2 bits, and the entrop y is 1.94 bits. A Hu man the 4 whose strings are probabilities entrop is 3.92 and y the bits, X length ected exp The 5.10. in table wn is sho e.g., the fourth and tical, iden bits. is 3.88 e di eren t codelengths. receiv fth, to exercise 5.22 (p.102) . The set of probabilities f p = ;p Solution ;p g ;p 4 2 3 1 1 1 1 1 / / / / 3 t optimal because sets to two di eren g gives rise of codelengths, 3 ; ; 6 ; 6 f second step of the Hu man coding algorithm we can choose at the any of the three possible pairings. We may either put them in a constan t length code f 00 ; 01 ; 10 ; 11 g or the code f 000 ; 001 ; 01 ; 1 g . Both codes have exp ected length 2. 2 1 1 1 / / / / . g 5 5 ; ; 5 ; 5 Another ;p is ;p f g = f p solution ;p 4 3 1 2 1 1 1 / / / ; 3 . 0 g ; 3 ; 3 f is f = g And ;p a third ;p p ;p 1 2 4 3 to exercise 5.26 (p.103) . Let p Solution be the largest probabilit y in max p ;p y ;:::;p entrop . The di erence between the exp ected length L and the I 2 1 can be no bigger than max( p 1978). ; 0 : 086) (Gallager, H max See 5.27{5.28 to understand where the curious 0.086 comes from. exercises to exercise Length (p.103) . Solution entrop y = 0.086. 5.27 Solution 5.31 (p.104) . There are two ways to answ er this problem to exercise correctly , and one popular way to answ er it incorrectly . Let's give the incorrect answ er rst: er . \W e can answ bit by rst picking a random Erroneous pick a random ) a ( c a l p i i i i bit from y p bol source , then x probabilit picking with a random sym i i 1 ( x x ) that are 1 s, ( ). If we de ne f c to be the fraction of the bits of c / i i i a 0 2 1 1 / C : we nd 3 b 10 4 2 X 1 / 8 3 c 110 ) = P (bit is 1 (5.34) f p i i 1 / 3 8 d 111 i 1 1 1 1 2 1 1 / / / / / / / = 2 0 + 4 2 + 8 + 3 8 1 = 3 ." (5.35)

This answer is wrong because it falls for the bus-stop fallacy, which was introduced in exercise 2.35 (p.38): if buses arrive at random, and we are interested in 'the average time from one bus until the next', we must distinguish two possible averages: (a) the average time from a randomly chosen bus until the next; (b) the average time between the bus you just missed and the next bus. The second 'average' is twice as big as the first because, by waiting for a bus at a random time, you bias your selection of a bus in favour of buses that follow a large gap. You're unlikely to catch a bus that comes 10 seconds after a preceding bus! Similarly, the symbols c and d get encoded into longer-length binary strings than a, so when we pick a bit from the compressed string at random, we are more likely to land in a bit belonging to a c or a d than would be given by the probabilities p_i in the expectation (5.34). All the probabilities need to be scaled up by l_i, and renormalized.

Correct answer in the same style. Every time symbol x_i is encoded, l_i bits are added to the binary string, of which f_i l_i are 1s. The expected number of 1s added per symbol is

  Σ_i p_i f_i l_i,   (5.36)

and the expected total number of bits added per symbol is

  Σ_i p_i l_i.   (5.37)

So the fraction of 1s in the transmitted string is

  P(bit is 1) = (Σ_i p_i f_i l_i) / (Σ_i p_i l_i)   (5.38)
              = (1/2 × 0 + 1/4 × 1 + 1/8 × 2 + 1/8 × 3) / (7/4) = (7/8) / (7/4) = 1/2.

For a general symbol code and a general ensemble, the expectation (5.38) is the correct answer. But in this case, we can use a more powerful argument.
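Equations (5.36)-(5.38) can be checked with exact rational arithmetic; a quick Python sketch using code C_3:

```python
from fractions import Fraction as F

p    = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 8), "d": F(1, 8)}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}     # the code C3

ones = sum(p[s] * code[s].count("1") for s in p)   # expected 1s per symbol, eq (5.36)
bits = sum(p[s] * len(code[s]) for s in p)         # expected bits per symbol, eq (5.37)

assert bits == F(7, 4)
assert ones / bits == F(1, 2)                      # the correct answer, eq (5.38)

# The unweighted average of eq (5.34) instead gives the erroneous 1/3:
naive = sum(p[s] * F(code[s].count("1"), len(code[s])) for s in p)
assert naive == F(1, 3)
```

Using Fraction avoids any floating-point doubt about the 1/2 vs. 1/3 distinction.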
Information-theoretic answer. The encoded string c is the output of an optimal compressor that compresses samples from X down to an expected length of H(X) bits. We can't expect to compress this data any further. But if the probability P(bit is 1) were not equal to 1/2 then it would be possible to compress the binary string further (using a block compression code, say). Therefore P(bit is 1) must be equal to 1/2; indeed, the probability of any sequence of l bits in the compressed stream taking on any particular value must be 2^{-l}. The output of a perfect compressor is always perfectly random bits.

To put it another way, if the probability P(bit is 1) were not equal to 1/2, then the information content per bit of the compressed string would be at most H_2(P(1)), which would be less than 1; but this contradicts the fact that we can recover the original data from c, so the information content per bit of the compressed string must be H(X)/L(C,X) = 1.

Solution to exercise 5.32 (p.104). The general Huffman coding algorithm for an encoding alphabet with q symbols has one difference from the binary case. The process of combining q symbols into one symbol reduces the number of symbols by q − 1. So if we start with A symbols, we'll only end up with a

120 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. Sym 108 bol Codes 5 | tree -ary mo d ( q 1) is equal to 1. Otherwise, we kno w that q if A complete a num e, it must incomplete tree with be an ber we mak er pre x whatev code dulo ( q of missing to A mo d ( q 1) 1. For example, if leaves equal, mo 1), is built eigh t sym bols, then there will una for be one a ternary tree voidably in the tree. leaf missing q -ary code is made by putting these extra leaves in the longest The optimal h of the This can be achiev ed by adding the appropriate num ber branc tree. extra to the sym bol set, all of these source sym bols having bols original of sym The total num ber of leaves is then equal to r probabilit q 1) + 1, for y zero. ( integer . The sym bols are then rep r com bined by taking the q some eatedly bols with smallest probabilit y and replacing them by a single sym bol, as sym binary in the algorithm. Hu man coding 5.33 Solution We wish to sho w that a greedy metaco de, (p.104) to exercise . h gives the enco code whic ding, is actually sub optimal, whic shortest h picks the Kraft inequalit y. the because it violates that We'll h sym bol x is assigned lengths l assume ( x ) by eac h of the eac k C candidate . Let us assume there are codes alternativ e codes and that we K k enco de whic h code is being used with a header of length log K bits. Then can 0 metaco lengths l the ( x ) that are given by de assigns 0 ( ) = log l K + min (5.39) l x ( x ) : k 2 k the Kraft sum: We compute X X 1 0 ( x ) l ) x ( min l k k 2 = = S 2 : (5.40) K x x K divide the set A Let's into non-o verlapping subsets fA g suc h that subset X k =1 k tains all the sym bols x that the metaco con sends via code k . 
Then

    S = (1/K) sum_k sum_{x in A_k} 2^{−l_k(x)}.    (5.41)

Now if one sub-code k satisfies the Kraft equality sum_{x in A_X} 2^{−l_k(x)} = 1, then it must be the case that

    sum_{x in A_k} 2^{−l_k(x)} ≤ 1,    (5.42)

with equality only if all the symbols x are in A_k, which would mean that we are only using one of the K codes. So

    S ≤ (1/K) sum_{k=1}^K 1 = 1,    (5.43)

with equality only if equation (5.42) is an equality for all codes k. But it's impossible for all the symbols to be in all the non-overlapping subsets {A_k}_{k=1}^K, so we can't have equality (5.42) holding for all k. So S < 1.

Another way of seeing that a mixture code is suboptimal is to consider the binary tree that it defines. Think of the special case of two codes. The first bit we send identifies which code we are using. Now, in a complete code, any subsequent binary string is a valid string. But once we know that we are using, say, code A, we know that what follows can only be a codeword corresponding to a symbol x whose encoding is shorter under code A than under code B. So some strings are invalid continuations, and the mixture code is incomplete and suboptimal.

For further discussion of this issue and its relationship to probabilistic modelling, read about 'bits back coding' in section 28.3 and in Frey (1998).
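To see the Kraft argument in numbers, here is a small check (my own illustration, not from the book; the two codes are hypothetical) showing that a metacode built from two complete binary codes always has Kraft sum S < 1:

```python
from math import log2

# Two complete binary codes (each satisfies the Kraft equality) over four symbols.
# The length lists are made up for illustration; any pair of complete codes works.
lengths_A = [1, 2, 3, 3]
lengths_B = [3, 3, 2, 1]
K = 2  # number of candidate codes; the header costs log2(K) = 1 bit

for lengths in (lengths_A, lengths_B):
    assert abs(sum(2**-l for l in lengths) - 1.0) < 1e-12  # complete codes

# Metacode length for each symbol: header plus the shorter of the two encodings (5.39).
meta_lengths = [log2(K) + min(a, b) for a, b in zip(lengths_A, lengths_B)]

# Kraft sum of the metacode (5.40).
S = sum(2**-l for l in meta_lengths)
print(S)  # 0.75 < 1: the metacode wastes codeword space, so it is suboptimal
```

The sum comes out strictly below 1, confirming that the greedy metacode leaves part of the codeword space unused.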

About Chapter 6

Before reading Chapter 6, you should have read the previous chapter and worked on most of the exercises in it. We'll also make use of some Bayesian modelling ideas that arrived in the vicinity of exercise 2.8 (p.30).

109

6 Stream Codes

In this chapter we discuss two data compression schemes.

Arithmetic coding is a beautiful method that goes hand in hand with the philosophy that compression of data from a source entails probabilistic modelling of that source. As of 1999, the best compression methods for text files use arithmetic coding, and several state-of-the-art image compression systems use it too.

Lempel–Ziv coding is a 'universal' method, designed under the philosophy that we would like a single compression algorithm that will do a reasonable job for any source. In fact, for many real life sources, this algorithm's universal properties hold only in the limit of unfeasibly large amounts of data, but, all the same, Lempel–Ziv compression is widely used and often effective.

6.1 The guessing game

As a motivation for these two compression methods, consider the redundancy in a typical English text file. Such files have redundancy at several levels: for example, they contain the ASCII characters with non-equal frequency; certain consecutive pairs of letters are more probable than others; and entire words can be predicted given the context and a semantic understanding of the text.

To illustrate the redundancy of English, and a curious way in which it could be compressed, we can imagine a guessing game in which an English speaker repeatedly attempts to predict the next character in a text file.

For simplicity, let us assume that the allowed alphabet consists of the 26 upper case letters A, B, C, ..., Z and a space '-'. The game involves asking the subject to guess the next character repeatedly, the only feedback being whether the guess is correct or not, until the character is correctly guessed.
After a correct guess, we note the number of guesses that were made when the character was identified, and ask the subject to guess the next character in the same way.

One sentence gave the following result when a human was asked to guess the sentence. The numbers of guesses are listed below each character.

    T H E R E - I S - N O - R E  V E  R S E - O N - A - M O T O R C Y C L E -
    1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1

Notice that in many cases, the next letter is guessed immediately, in one guess. In other cases, particularly at the start of syllables, more guesses are needed.

What do this game and these results offer us? First, they demonstrate the redundancy of English from the point of view of an English speaker. Second, this game might be used in a data compression scheme, as follows.

110

Copyright Cambridge University Press 2003 (notice repeated on each page of the original).

111 | 6.2: Arithmetic codes

The string of numbers '1, 1, 1, 5, 1, ...', listed above, was obtained by presenting the text to the subject. The maximum number of guesses that the subject will make for a given letter is twenty-seven, so what the subject is doing for us is performing a time-varying mapping of the twenty-seven letters {A, B, C, ..., Z, -} onto the twenty-seven numbers {1, 2, 3, ..., 27}, which we can view as symbols in a new alphabet. The total number of symbols has not been reduced, but since he uses some of these symbols much more frequently than others – for example, 1 and 2 – it should be easy to compress this new string of symbols.

How would the uncompression of the sequence of numbers '1, 1, 1, 5, 1, ...' work? At uncompression time, we do not have the original string 'THERE...', we have only the encoded sequence. Imagine that our subject has an absolutely identical twin who also plays the guessing game with us, as if we knew the source text. If we stop him whenever he has made a number of guesses equal to the given number, then he will have just guessed the correct letter, and we can then say 'yes, that's right', and move to the next character. Alternatively, if the identical twin is not available, we could design a compression system with the help of just one human as follows. We choose a window length L, that is, a number of characters of context to show the human. For every one of the 27^L possible strings of length L, we ask them, 'What would you predict is the next character?', and 'If that prediction were wrong, what would your next guesses be?'.
After tabulating their answers to these 26 × 27^L questions, we could use two copies of these enormous tables at the encoder and the decoder in place of the two human twins. Such a language model is called an Lth order Markov model.

These systems are clearly unrealistic for practical compression, but they illustrate several principles that we will make use of now.

6.2 Arithmetic codes

When we discussed variable-length symbol codes, and the optimal Huffman algorithm for constructing them, we concluded by pointing out two practical and theoretical problems with Huffman codes (section 5.6).

These defects are rectified by arithmetic codes, which were invented by Elias, by Rissanen and by Pasco, and subsequently made practical by Witten et al. (1987). In an arithmetic code, the probabilistic modelling is clearly separated from the encoding operation. The system is rather similar to the guessing game. The human predictor is replaced by a probabilistic model of the source. As each symbol is produced by the source, the probabilistic model supplies a predictive distribution over all possible values of the next symbol, that is, a list of positive numbers {p_i} that sum to one. If we choose to model the source as producing i.i.d. symbols with some known distribution, then the predictive distribution is the same every time; but arithmetic coding can with equal ease handle complex adaptive models that produce context-dependent predictive distributions. The predictive model is usually implemented in a computer program.

The encoder makes use of the model's predictions to create a binary string. The decoder makes use of an identical twin of the model (just as in the guessing game) to interpret the binary string.

Let the source alphabet be A_X = {a_1, ..., a_I}, and let the Ith symbol a_I have the special meaning 'end of transmission'. The source spits out a sequence of symbols x_1, x_2, ..., x_n, ....
The source does not necessarily produce i.i.d. symbols. We will assume that a computer program is provided to the encoder that assigns

112 | 6: Stream Codes

a predictive probability distribution over a_i given the sequence that has occurred thus far, P(x_n = a_i | x_1, ..., x_{n−1}). The receiver has an identical program that produces the same predictive probability distribution P(x_n = a_i | x_1, ..., x_{n−1}).

[Figure 6.1: Binary strings define real intervals within the real line [0,1). We first encountered a picture like this when we discussed the symbol-code supermarket in Chapter 5.]

Concepts for understanding arithmetic coding

Notation for intervals. The interval [0.01, 0.10) is all numbers between 0.01 and 0.10, including 0.01 (i.e., 0.01000...) but not 0.10 (i.e., 0.10000...).

A binary transmission defines an interval within the real line from 0 to 1. For example, the string 01 is interpreted as a binary real number 0.01..., which corresponds to the interval [0.01, 0.10) in binary, i.e., the interval [0.25, 0.50) in base ten. The longer string 01101 corresponds to a smaller interval [0.01101, 0.01110). Because 01101 has the first string, 01, as a prefix, the new interval is a sub-interval of the interval [0.01, 0.10). A one-megabyte binary file (2^23 bits) is thus viewed as specifying a number between 0 and 1 to a precision of about two million decimal places – two million decimal digits, because each byte translates into a little more than two decimal digits.

Now, we can also divide the real line [0,1) into I intervals of lengths equal to the probabilities P(x_1 = a_i), as shown in figure 6.2.

[Figure 6.2: A probabilistic model defines real intervals within the real line [0,1): the interval for a_1 has length P(x_1 = a_1), the next interval length P(x_1 = a_2), and so on down to a_I.]
We may then take each interval a_i and subdivide it into intervals denoted a_i a_1, a_i a_2, ..., a_i a_I, such that the length of a_i a_j is proportional to P(x_2 = a_j | x_1 = a_i). Indeed the length of the interval a_i a_j will be precisely the joint probability

    P(x_1 = a_i, x_2 = a_j) = P(x_1 = a_i) P(x_2 = a_j | x_1 = a_i).    (6.1)

Iterating this procedure, the interval [0, 1) can be divided into a sequence of intervals corresponding to all possible finite length strings x_1 x_2 ... x_N, such that the length of an interval is equal to the probability of the string given our model.
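The interval notation above – a binary string naming the interval of all real numbers whose expansion begins with that string, with longer strings naming nested sub-intervals – can be checked with a few lines of Python (my own illustration, not code from the book):

```python
from fractions import Fraction

def binary_string_interval(bits):
    """Interval [u, v) of all reals whose binary expansion begins with `bits`."""
    u = Fraction(int(bits, 2), 2 ** len(bits))
    return u, u + Fraction(1, 2 ** len(bits))

lo, hi = binary_string_interval("01")
print(lo, hi)        # 1/4 1/2, i.e. [0.25, 0.50) as in the text

lo2, hi2 = binary_string_interval("01101")
print(lo2, hi2)      # 13/32 7/16

# A string's interval is a sub-interval of every prefix's interval.
assert lo <= lo2 and hi2 <= hi
```

Exact rational arithmetic (`Fraction`) keeps the nesting property visible without any floating-point rounding.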

113 | 6.2: Arithmetic codes

Algorithm 6.3. Arithmetic coding: iterative procedure to find the interval [u, v) for the string x_1 x_2 ... x_N.

    u := 0.0
    v := 1.0
    p := v − u
    for n = 1 to N {
        Compute the cumulative probabilities Q_n and R_n (6.2, 6.3)
        v := u + p R_n(x_n | x_1, ..., x_{n−1})
        u := u + p Q_n(x_n | x_1, ..., x_{n−1})
        p := v − u
    }

Formulae describing arithmetic coding

The process depicted in figure 6.2 can be written explicitly as follows. The intervals are defined in terms of the lower and upper cumulative probabilities

    Q_n(a_i | x_1, ..., x_{n−1}) ≡ sum_{i′=1}^{i−1} P(x_n = a_{i′} | x_1, ..., x_{n−1}),    (6.2)

    R_n(a_i | x_1, ..., x_{n−1}) ≡ sum_{i′=1}^{i} P(x_n = a_{i′} | x_1, ..., x_{n−1}).    (6.3)

As the nth symbol arrives, we subdivide the (n−1)th interval at the points defined by Q_n and R_n. For example, starting with the first symbol, the intervals 'a_1', 'a_2', and 'a_I' are

    a_1 ↔ [Q_1(a_1), R_1(a_1)) = [0, P(x_1 = a_1)),    (6.4)

    a_2 ↔ [Q_1(a_2), R_1(a_2)) = [P(x_1 = a_1), P(x_1 = a_1) + P(x_1 = a_2)),    (6.5)

and

    a_I ↔ [Q_1(a_I), R_1(a_I)) = [P(x_1 = a_1) + ... + P(x_1 = a_{I−1}), 1.0).    (6.6)

Algorithm 6.3 describes the general procedure.

To encode a string x_1 x_2 ... x_N, we locate the interval corresponding to x_1 x_2 ... x_N, and send a binary string whose interval lies within that interval. This encoding can be performed on the fly, as we now illustrate.

Example: compressing the tosses of a bent coin

Imagine that we watch as a bent coin is tossed some number of times (cf. example 2.7 (p.30) and section 3.2 (p.51)). The two outcomes when the coin is tossed are denoted a and b.
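Algorithm 6.3 translates directly into code. The sketch below is my own (it uses exact rational arithmetic rather than the scaled integer arithmetic a production coder would use), and works for any predictive model; the two-symbol i.i.d. model at the end is a made-up example:

```python
from fractions import Fraction

def interval_for_string(string, alphabet, predict):
    """Algorithm 6.3: find [u, v) for a string. `predict(context)` returns a
    dict of predictive probabilities for the next symbol given the context."""
    u, v = Fraction(0), Fraction(1)
    for n, x in enumerate(string):
        p = v - u
        probs = predict(string[:n])
        Q = Fraction(0)                  # lower cumulative probability (6.2)
        for a in alphabet:
            R = Q + probs[a]             # upper cumulative probability (6.3)
            if a == x:
                break
            Q = R
        v = u + p * R
        u = u + p * Q
    return u, v

# A hypothetical i.i.d. model: P(a) = 2/3, P(b) = 1/3 in every context.
iid = lambda context: {"a": Fraction(2, 3), "b": Fraction(1, 3)}
u, v = interval_for_string("ab", ("a", "b"), iid)
print(u, v, v - u)   # 4/9 2/3 2/9 : the interval length equals P(a)P(b)
```

The final interval length equals the probability of the whole string, exactly as equation (6.1) promises.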
A third possibility is that the experiment is halted, an event denoted by the 'end of file' symbol, '□'. Because the coin is bent, we expect that the probabilities of the outcomes a and b are not equal, though we don't know beforehand which is the more probable outcome.

Encoding

Let the source string be 'bbba□'. We pass along the string one symbol at a time and use our model to compute the probability distribution of the next

114 | 6: Stream Codes

symbol given the string thus far. Let these probabilities be:

    Context            Probability of next symbol
    (sequence thus far)
                       P(a) = 0.425        P(b) = 0.425        P(□) = 0.15
    b                  P(a|b) = 0.28       P(b|b) = 0.57       P(□|b) = 0.15
    bb                 P(a|bb) = 0.21      P(b|bb) = 0.64      P(□|bb) = 0.15
    bbb                P(a|bbb) = 0.17     P(b|bbb) = 0.68     P(□|bbb) = 0.15
    bbba               P(a|bbba) = 0.28    P(b|bbba) = 0.57    P(□|bbba) = 0.15

Figure 6.4 shows the corresponding intervals. The interval b is the middle 0.425 of [0, 1). The interval bb is the middle 0.567 of b, and so forth.

[Figure 6.4: Illustration of the arithmetic coding process as the sequence bbba□ is transmitted.]

When the first symbol 'b' is observed, the encoder knows that the encoded string will start '01', '10', or '11', but does not know which. The encoder writes nothing for the time being, and examines the next symbol, which is 'b'. The interval 'bb' lies wholly within interval '1', so the encoder can write the first bit: '1'. The third symbol 'b' narrows down the interval a little, but not quite enough for it to lie wholly within interval '10'.
Only when the next 'a' is read from the source can we transmit some more bits. Interval 'bbba' lies wholly within the interval '1001', so the encoder adds '001' to the '1' it has written. Finally when the '□' arrives, we need a procedure for terminating the encoding. Magnifying the interval 'bbba□' (figure 6.4, right) we note that the marked interval '100111101' is wholly contained by bbba□, so the encoding can be completed by appending '11101'.
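Putting the pieces together, the following sketch (my own reconstruction, not the book's software) reproduces this example. It assumes the model is P(□) = 0.15 in every context, with the remaining 0.85 split between a and b by Laplace's rule – an assumption consistent with the table of probabilities above once they are rounded to two decimal places. It computes the interval for bbba□ and then searches for the shortest binary string whose interval lies wholly inside it:

```python
from fractions import Fraction

def predict(context):
    """Assumed model: fixed P(eof) = 0.15, rest split by Laplace's rule."""
    Fa, Fb = context.count("a"), context.count("b")
    rest = Fraction(85, 100)
    return {"a": rest * Fraction(Fa + 1, Fa + Fb + 2),
            "b": rest * Fraction(Fb + 1, Fa + Fb + 2),
            ".": Fraction(15, 100)}        # "." stands in for the eof symbol

def interval(string):
    """Algorithm 6.3: the interval [u, v) for a source string."""
    u, v = Fraction(0), Fraction(1)
    for n, x in enumerate(string):
        p, probs, Q = v - u, predict(string[:n]), Fraction(0)
        for a in ("a", "b", "."):          # symbol ordering as in figure 6.4
            R = Q + probs[a]
            if a == x:
                break
            Q = R
        u, v = u + p * Q, u + p * R
    return u, v

u, v = interval("bbba.")

# Shortest binary string c_1..c_L whose interval [c/2^L, (c+1)/2^L) lies
# wholly inside [u, v); that string is the arithmetic code.
L = 1
while True:
    c = -((-u * 2**L) // 1)                # ceil(u * 2^L), an integer
    if Fraction(c + 1, 2**L) <= v:
        break
    L += 1
print(format(c, "b").zfill(L))             # -> 100111101
```

The nine-bit string 100111101 matches the transmission described in the text; no eight-bit string works, since no interval of width 2^−8 fits inside bbba□.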

115 | 6.2: Arithmetic codes

Exercise 6.1.[2, p.127] Show that the overhead required to terminate a message is never more than 2 bits, relative to the ideal message length given the probabilistic model H, h(x|H) = log[1/P(x|H)].

This is an important result. Arithmetic coding is very nearly optimal. The message length is always within two bits of the Shannon information content of the entire source string, so the expected message length is within two bits of the entropy of the entire message.

Decoding

The decoder receives the string '100111101' and passes along it one symbol at a time. First, the probabilities P(a), P(b), P(□) are computed using the identical program that the encoder used, and the intervals 'a', 'b' and '□' are deduced. Once the first two bits '10' have been examined, it is certain that the original string must have started with a 'b', since the interval '10' lies wholly within interval 'b'. The decoder can then use the model to compute P(a|b), P(b|b), P(□|b) and deduce the boundaries of the intervals 'ba', 'bb' and 'b□'. Continuing, we decode the second b once we reach '1001', the third b once we reach '100111', and so forth, with the unambiguous identification of 'bbba□' once the whole binary string has been read. With the convention that '□' denotes the end of the message, the decoder knows to stop decoding.

Transmission of multiple files

How might one use arithmetic coding to communicate several distinct files over the binary channel? Once the □ character has been transmitted, we imagine that the decoder is reset into its initial state. There is no transfer of the learnt statistics of the first file to the second file. If, however, we did believe that there is a relationship among the files that we are going to compress, we could
If, however, we did believ of the statistics le among that les is a relationship we are going to compress, we could there the alphab et di eren tly, introducing a second end-of- le character that de ne our the end of the le but instructs the enco der and deco der to con tinue marks using the probabilistic mo del. same big pictur The e the to comm a string of N that both unicate enco der and the Notice letters der needed to compute only N jAj conditional probabilities deco proba- { the bilities h possible letter in eac h con text actually encoun tered { just as in of eac trasted guessing This cost can be con game. with the alternativ e of using the a Hu man code with a large blo ck size (in order to reduce the possible one- bit-p er-sym discussed in section 5.6), where all blo ck sequences bol overhead could must be considered and their probabilities evaluated. that occur how exible arithmetic coding is: it can be used with any source Notice et and alphab ded alphab et. The size of the source alphab et and the any enco Arithmetic ded et can change with time. alphab coding can be used with enco any probabilit y distribution, whic h can change utterly from con text to con text. Furthermore, if we would sym bols of the enco ding alphab et (say, like the frequency and ) to be used with une qual 0 , that can easily be arranged by 1 sub dividing the righ t-hand interv al in prop ortion to the required frequencies. How probabilistic model might make its predictions the The technique of arithmetic coding does not force one to pro duce the predic- tive probabilit y in any particular way, but the predictiv e distributions migh t

116 | 6: Stream Codes

[Figure 6.5: Illustration of the intervals defined by a simple Bayesian probabilistic model. The size of an interval is proportional to the probability of the string. This model anticipates that the source is likely to be biased towards a or b, so sequences having lots of as or lots of bs have larger intervals than sequences of the same length that are 50:50 as and bs.]

naturally be produced by a Bayesian model.

Figure 6.4 was generated using a simple model that always assigns a probability of 0.15 to □, and assigns the remaining 0.85 to a and b, divided in proportion to probabilities given by Laplace's rule,

    P_L(a | x_1, ..., x_{n−1}) = (F_a + 1) / (F_a + F_b + 2),    (6.7)

where F_a(x_1, ..., x_{n−1}) is the number of times that a has occurred so far, and F_b is the count of bs. These predictions correspond to a simple Bayesian model that expects and adapts to a non-equal frequency of use of the source symbols a and b within a file.

Figure 6.5 displays the intervals corresponding to a number of strings of length up to five. Note that if the string so far has contained a large number of bs then the probability of b relative to a is increased, and conversely if many as occur then as are made more probable. Larger intervals, remember, require fewer bits to encode.
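Equation (6.7), combined with the fixed 0.15 for □, reproduces the table of predictive probabilities used in the bent-coin encoding example; the table's entries appear to be these values rounded to two decimal places. A small sketch (my own, not from the book):

```python
from fractions import Fraction

def laplace_predictions(context, p_eof=Fraction(15, 100)):
    """Predictive probabilities under the model described in the text:
    a fixed probability for the end-of-file symbol, the remainder shared
    between a and b by Laplace's rule (6.7)."""
    Fa, Fb = context.count("a"), context.count("b")
    rest = 1 - p_eof
    return (rest * Fraction(Fa + 1, Fa + Fb + 2),   # P(a | context)
            rest * Fraction(Fb + 1, Fa + Fb + 2),   # P(b | context)
            p_eof)                                  # P(eof | context)

for ctx in ["", "b", "bb", "bbb", "bbba"]:
    pa, pb, pe = laplace_predictions(ctx)
    # e.g. context "b": 0.2833..., 0.5666..., 0.15 (the table's 0.28 / 0.57 / 0.15)
    print(ctx or "(empty)", float(pa), float(pb), float(pe))
```

Note how the count of bs in the context pulls probability towards b, exactly the adaptive behaviour figure 6.5 illustrates.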
Larger interv then remem ber, require s occur few er bits to enco de. Details of the Bayesian model Having that any mo del could be used { arithmetic coding is not emphasized e to any particular set of probabilities { let me explain the simple adaptiv wedded

117 | 6.2: Arithmetic codes

probabilistic model used in the preceding example; we first encountered this model in exercise 2.8 (p.30).

Assumptions

The model will be described using parameters p_□, p_a and p_b, defined below, which should not be confused with the predictive probabilities in a particular context, for example, P(a | s = baa). A bent coin labelled a and b is tossed some number of times l, which we don't know beforehand. The coin's probability of coming up a when tossed is p_a, and p_b = 1 − p_a; the parameters p_a, p_b are not known beforehand. The source string s = baaba□ indicates that l was 5 and the sequence of outcomes was baaba.

1. It is assumed that the length of the string l has an exponential probability distribution

       P(l) = (1 − p_□)^l p_□.    (6.8)

   This distribution corresponds to assuming a constant probability p_□ for the termination symbol '□' at each character.

2. It is assumed that the non-terminal characters in the string are selected independently at random from an ensemble with probabilities P = {p_a, p_b}; the probability p_a is fixed throughout the string to some unknown value that could be anywhere between 0 and 1. The probability of an a occurring as the next symbol, given p_a (if only we knew it), is (1 − p_□) p_a. The probability, given p_a, that an unterminated string of length F is a given string s that contains {F_a, F_b} counts of the two outcomes is the Bernoulli distribution

       P(s | p_a, F) = p_a^{F_a} (1 − p_a)^{F_b}.    (6.9)

3. We assume a uniform prior distribution for p_a,

       P(p_a) = 1,   p_a ∈ [0, 1],    (6.10)

   and define p_b ≡ 1 − p_a. It would be easy to assume other priors on p_a, with beta distributions being the most convenient to handle.
This model was studied in section 3.2. The key result we require is the predictive distribution for the next symbol, given the string so far, s. The probability that the next character is a or b (assuming that it is not '□') was derived in equation (3.16) and is precisely Laplace's rule (6.7).

Exercise 6.2.[3] Compare the expected message length when an ASCII file is compressed by the following three methods.

Huffman-with-header. Read the whole file, find the empirical frequency of each symbol, construct a Huffman code for those frequencies, transmit the code by transmitting the lengths of the Huffman codewords, then transmit the file using the Huffman code. (The actual codewords don't need to be transmitted, since we can use a deterministic method for building the tree given the codelengths.)

Arithmetic code using the Laplace model.

    P_L(a | x_1, ..., x_{n−1}) = (F_a + 1) / (sum_{a′} (F_{a′} + 1)).    (6.11)

Arithmetic code using a Dirichlet model. This model's predictions are:

    P_D(a | x_1, ..., x_{n−1}) = (F_a + α) / (sum_{a′} (F_{a′} + α)),    (6.12)

118 | 6: Stream Codes

where α is fixed to a number such as 0.01. A small value of α corresponds to a more responsive version of the Laplace model; the probability over characters is expected to be more nonuniform; α = 1 reproduces the Laplace model.

Take care that the header of your Huffman message is self-delimiting. Special cases worth considering are (a) short files with just a few hundred characters; (b) large files in which some characters are never used.

6.3 Further applications of arithmetic coding

Efficient generation of random samples

Arithmetic coding not only offers a way to compress strings believed to come from a given model; it also offers a way to generate random strings from a model. Imagine sticking a pin into the unit interval at random, that line having been divided into subintervals in proportion to probabilities p_i; the probability that your pin will lie in interval i is p_i.

So to generate a sample from a model, all we need to do is feed ordinary random bits into an arithmetic decoder for that model. An infinite random bit sequence corresponds to the selection of a point at random from the line [0, 1), so the decoder will then select a string at random from the assumed distribution. This arithmetic method is guaranteed to use very nearly the smallest possible number of random bits to make the selection – an important point in communities where random numbers are expensive! [This is not a joke. Large amounts of money are spent on generating random bits in software and hardware. Random numbers are valuable.]

A simple example of the use of this technique is in the generation of random bits with a nonuniform distribution {p_0, p_1}.
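The pin-in-the-interval idea can be sketched in Python (my own illustration, not code from the book). This per-symbol version restarts the interval for every draw, so it never uses fewer than one bit per sample; the full method feeds one continuous bit stream through a complete arithmetic decoder, carrying the residual interval from symbol to symbol, which brings the cost down towards the entropy (about 0.081 bits per sample for {0.99, 0.01}):

```python
import random
from fractions import Fraction

def sample_symbol(probs, bit_source):
    """Select one symbol by 'sticking a pin' into [0,1): draw random bits to
    refine a dyadic interval until it lies wholly inside one symbol's interval.
    `probs` is a list of (symbol, probability) pairs."""
    bounds, c = [], Fraction(0)
    for sym, p in probs:
        c += p
        bounds.append((sym, c))          # cumulative upper boundaries
    lo, width = Fraction(0), Fraction(1)
    while True:
        prev = Fraction(0)
        for sym, upper in bounds:
            if prev <= lo and lo + width <= upper:
                return sym               # pin interval sits inside this symbol
            prev = upper
        width /= 2                       # otherwise read one more random bit
        if next(bit_source):
            lo += width

def bit_stream(rng):
    while True:
        yield rng.getrandbits(1)

probs = [("0", Fraction(99, 100)), ("1", Fraction(1, 100))]
bits = bit_stream(random.Random(0))
draws = [sample_symbol(probs, bits) for _ in range(1000)]
print(draws.count("1"))   # around ten 1s; most 0s cost only a single random bit
```

Even this crude version shows the point: a single random bit usually settles the draw, whereas the 'standard method' of the next exercise spends 32 bits per sample.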
Exercise 6.3.[2, p.128] Compare the following two techniques for generating random symbols from a nonuniform distribution {p_0, p_1} = {0.99, 0.01}:

(a) The standard method: use a standard random number generator to generate an integer between 1 and 2^32. Rescale the integer to (0, 1). Test whether this uniformly distributed random variable is less than 0.99, and emit a 0 or 1 accordingly.

(b) Arithmetic coding using the correct model, fed with standard random bits.

Roughly how many random bits will each method use to generate a thousand samples from this sparse distribution?

Efficient data-entry devices

Compression: bits → text. When we enter text into a computer, we make gestures of some sort – we tap a keyboard, or scribble with a pointer, or click with a mouse; an efficient text entry system is one where the number of gestures required to enter a given text string is small.

Writing can be viewed as an inverse process to data compression. In data compression, the aim is to map a given text string into a small number of bits. In text entry, we want a small sequence of gestures to produce our intended text.

By inverting an arithmetic coder, we can obtain an information-efficient text entry device that is driven by continuous pointing gestures (Ward et al.,

119 | 6.4: Lempel–Ziv coding

2000). In this system, called Dasher, the user zooms in on the unit interval to locate the interval corresponding to their intended string, in the same style as figure 6.4. A language model (exactly as used in text compression) controls the sizes of the intervals such that probable strings are quick and easy to identify. After an hour's practice, a novice user can write with one finger driving Dasher at about 25 words per minute – that's about half their normal ten-finger typing speed on a regular keyboard. It's even possible to write at 25 words per minute, hands-free, using gaze direction to drive Dasher (Ward and MacKay, 2002). Dasher is available as free software for various platforms.¹

6.4 Lempel–Ziv coding

The Lempel–Ziv algorithms, which are widely used for data compression (e.g., the compress and gzip commands), are different in philosophy to arithmetic coding. There is no separation between modelling and coding, and no opportunity for explicit modelling.

Basic Lempel–Ziv algorithm

The method of compression is to replace a substring with a pointer to an earlier occurrence of the same substring. For example if the string is 1011010100010..., we parse it into an ordered dictionary of substrings that have not appeared before as follows: λ, 1, 0, 11, 01, 010, 00, 10, .... We include the empty substring λ as the first substring in the dictionary and order the substrings in the dictionary by the order in which they emerged from the source. After every comma, we look along the next part of the input sequence until we have read a substring that has not been marked off before.
A moment's reflection will confirm that this substring is longer by one bit than a substring that has occurred earlier in the dictionary. This means that we can encode each substring by giving a pointer to the earlier occurrence of that prefix and then sending the extra bit by which the new substring in the dictionary differs from the earlier substring. If, at the nth bit, we have enumerated s(n) substrings, then we can give the value of the pointer in ⌈log_2 s(n)⌉ bits. The code for the above sequence is then as shown in the fourth line of the following table (with punctuation included for clarity), the upper lines indicating the source string and the values of s(n):

    source substrings   λ     1     0     11     01     010     00      10
    s(n)                0     1     2     3      4      5       6       7
    s(n) in binary      000   001   010   011    100    101     110     111
    (pointer, bit)           (,1)  (0,0) (01,1) (10,1) (100,0) (010,0) (001,0)

Notice that the first pointer we send is empty, because, given that there is only one substring in the dictionary – the string λ – no bits are needed to convey the 'choice' of that substring as the prefix. The encoded string is 100011101100001000010. The encoding, in this simple case, is actually a longer string than the source string, because there was no obvious redundancy in the source string.

Exercise 6.4.[2] Prove that any uniquely decodeable code from {0,1}⁺ to {0,1}⁺ necessarily makes some strings longer if it makes some strings shorter.

¹ http://www.inference.phy.cam.ac.uk/dasher/
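The parsing and pointer rule can be implemented in a few lines. This sketch is my own; it ignores the termination issue and assumes the source parses exactly into complete substrings, as this example does. It reproduces the encoded string from the table:

```python
from math import ceil, log2

def lz_encode(source):
    """Basic Lempel-Ziv encoding as described above: parse the source into
    substrings never seen before, sending each as (pointer to prefix, extra bit)."""
    dictionary = [""]            # the empty string lambda is entry 0
    out, w = [], ""
    for bit in source:
        w += bit
        if w not in dictionary:
            pointer_bits = ceil(log2(len(dictionary)))   # bits for the pointer
            pointer = dictionary.index(w[:-1])           # prefix seen earlier
            out.append(format(pointer, "b").zfill(pointer_bits) if pointer_bits else "")
            out.append(w[-1])                            # the one extra bit
            dictionary.append(w)
            w = ""                                       # start the next substring
    return "".join(out)

print(lz_encode("1011010100010"))   # -> 100011101100001000010
```

Running it on the demonstration string gives exactly the 21-bit code in the text, longer than the 13-bit source, as the text warns.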

120 | 6: Stream Codes

One reason why the algorithm described above lengthens a lot of strings is because it is inefficient – it transmits unnecessary bits; to put it another way, its code is not complete. Once a substring in the dictionary has been joined there by both of its children, then we can be sure that it will not be needed (except possibly as part of our protocol for terminating a message); so at that point we could drop it from our dictionary of substrings and shuffle them all along one, thereby reducing the length of subsequent pointer messages. Equivalently, we could write the second prefix into the dictionary at the point previously occupied by the parent. A second unnecessary overhead is the transmission of the new bit in these cases – the second time a prefix is used, we can be sure of the identity of the next bit.

Decoding

The decoder again involves an identical twin at the decoding end who constructs the dictionary of substrings as the data are decoded.

Exercise 6.5.[2, p.128] Encode the string 000000000000100000000000 using the basic Lempel–Ziv algorithm described above.

Exercise 6.6.[2, p.128] Decode the string 00101011101100100100011010101000011 that was encoded using the basic Lempel–Ziv algorithm.

Practicalities

In this description I have not discussed the method for terminating a string. There are many variations on the Lempel–Ziv algorithm, all exploiting the same idea but using different procedures for dictionary management, etc. The resulting programs are fast, but their performance on compression of English text, although useful, does not match the standards set in the arithmetic coding literature.
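The decoder's identical twin is equally short. This sketch is my own; it assumes the encoded stream contains a whole number of (pointer, bit) blocks, with no termination handling, and rebuilds the dictionary exactly as the encoder did:

```python
from math import ceil, log2

def lz_decode(encoded):
    """Inverse of the basic Lempel-Ziv encoding described above: rebuild the
    substring dictionary as the (pointer, bit) blocks arrive."""
    dictionary = [""]            # the empty string lambda is entry 0
    out, pos = [], 0
    while pos < len(encoded):
        pointer_bits = ceil(log2(len(dictionary)))   # same rule as the encoder
        pointer = int(encoded[pos:pos + pointer_bits] or "0", 2)
        bit = encoded[pos + pointer_bits]
        substring = dictionary[pointer] + bit        # prefix plus the extra bit
        dictionary.append(substring)
        out.append(substring)
        pos += pointer_bits + 1
    return "".join(out)

print(lz_decode("100011101100001000010"))   # -> 1011010100010
```

Feeding it the 21-bit code from the demonstration recovers the original source string, confirming that the encoder and its twin stay in lockstep.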
Theoretical properties

In contrast to the block code, Huffman code, and arithmetic coding methods we discussed in the last three chapters, the Lempel–Ziv algorithm is defined without making any mention of a probabilistic model for the source. Yet, given any ergodic source (i.e., one that is memoryless on sufficiently long timescales), the Lempel–Ziv algorithm can be proven asymptotically to compress down to the entropy of the source. This is why it is called a 'universal' compression algorithm. For a proof of this property, see Cover and Thomas (1991).

It achieves its compression, however, only by memorizing substrings that have happened so that it has a short name for them the next time they occur. The asymptotic timescale on which this universal performance is achieved may, for many sources, be unfeasibly long, because the number of typical substrings that need memorizing may be enormous. The useful performance of the algorithm in practice is a reflection of the fact that many files contain multiple repetitions of particular short sequences of characters, a form of redundancy to which the algorithm is well suited.

6.5: Demonstration   [p.121]

Common ground

I have emphasized the difference in philosophy behind arithmetic coding and Lempel–Ziv coding. There is common ground between them, though: in principle, one can design adaptive probabilistic models, and thence arithmetic codes, that are 'universal', that is, models that will asymptotically compress any source in some class to within some factor (preferably 1) of its entropy. However, for practical purposes, I think such universal models can only be constructed if the class of sources is severely restricted. A general purpose compressor that can discover the probability distribution of any source would be a general purpose artificial intelligence! A general purpose artificial intelligence does not yet exist.

6.5 Demonstration

An interactive aid for exploring arithmetic coding, dasher.tcl, is available.^2

A demonstration arithmetic-coding software package written by Radford Neal^3 consists of encoding and decoding modules to which the user adds a module defining the probabilistic model. It should be emphasized that there is no single general-purpose arithmetic-coding compressor; a new model has to be written for each type of source. Radford Neal's package includes a simple adaptive model similar to the Bayesian model demonstrated in section 6.2. The results using this Laplace model should be viewed as a basic benchmark since it is the simplest possible probabilistic model: it simply assumes the characters in the file come independently from a fixed ensemble. The counts {F_i} of the symbols {a_i} are rescaled and rounded as the file is read such that all the counts lie between 1 and 256.
A state-of-the-art compressor for documents containing text and images, DjVu^4, uses arithmetic coding. It uses a carefully designed approximate arithmetic coder for binary alphabets called the Z-coder (Bottou et al., 1998), which is much faster than the arithmetic coding software described above. One of the neat tricks the Z-coder uses is this: the adaptive model adapts only occasionally (to save on computer time), with the decision about when to adapt being pseudo-randomly controlled by whether the arithmetic encoder emitted a bit.

The JBIG image compression standard for binary images uses arithmetic coding with a context-dependent model, which adapts using a rule similar to Laplace's rule. PPM (Teahan, 1995) is a leading method for text compression, and it uses arithmetic coding.

There are many Lempel–Ziv-based programs. gzip is based on a version of Lempel–Ziv called 'LZ77' (Ziv and Lempel, 1977). compress is based on 'LZW' (Welch, 1984). In my experience the best is gzip, with compress being inferior on most files.

bzip is a block-sorting file compressor, which makes use of a neat hack called the Burrows–Wheeler transform (Burrows and Wheeler, 1994). This method is not based on an explicit probabilistic model, and it only works well for files larger than several thousand characters; but in practice it is a very effective compressor for files in which the context of a character is a good predictor for that character.^5

2 http://www.inference.phy.cam.ac.uk/mackay/itprnn/softwareI.html
3 ftp://ftp.cs.toronto.edu/pub/radford/www/ac.software.html
4 http://www.djvuzone.org/
5 There is a lot of information about the Burrows–Wheeler transform on the net. http://dogma.net/DataCompression/BWT.shtml
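The Burrows–Wheeler transform behind bzip can be stated very compactly. The naive quadratic version below is a sketch for illustration only (real implementations use suffix sorting and pair the transform with move-to-front coding and an entropy coder); the function names and the end-of-string sentinel are my choices.

```python
def bwt(s, eos="\0"):
    """Naive Burrows-Wheeler transform: sort all rotations of s + sentinel
    and read off the last column.  Characters that share a right-hand
    context end up adjacent, which is what makes block-sorting
    compression effective."""
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last, eos="\0"):
    """Invert by repeatedly prepending the last column and re-sorting;
    after len(last) rounds the table holds the sorted rotations, and the
    row ending in the sentinel is the original string."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(eos))[:-1]
```

The transform is a permutation, not a compressor: it rearranges the file so that a simple local model (like JBIG's context model, or run-length plus Huffman coding in bzip2) can then exploit the clustering.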

6 | Stream Codes   [p.122]

Compression of a text file

Table 6.6 gives the computer time in seconds taken, and the compression achieved, when these programs are applied to the LaTeX file containing the text of this chapter, of size 20,942 bytes.

Table 6.6. Comparison of compression algorithms applied to a text file.

  Method          Compression     Compressed size       Uncompression
                  time / sec      (%age of 20,942)      time / sec
  Laplace model   0.28            12 974 (61%)          0.32
  gzip            0.10             8 177 (39%)          0.01
  compress        0.05            10 816 (51%)          0.05
  bzip                             7 495 (36%)
  bzip2                            7 640 (36%)
  ppmz                             6 800 (32%)

Compression of a sparse file

Interestingly, gzip does not always do so well. Table 6.7 gives the compression achieved when these programs are applied to a text file containing 10^6 characters, each of which is either 0 or 1 with probabilities 0.99 and 0.01. The Laplace model is quite well matched to this source, and the benchmark arithmetic coder gives good performance, followed closely by compress; gzip is worst. An ideal model for this source would compress the file into about 10^6 H_2(0.01)/8 ≃ 10 100 bytes. The Laplace-model compressor falls short of this performance because it is implemented using only eight-bit precision. The ppmz compressor compresses best of all, but takes much more computer time.
Table 6.7. Comparison of compression algorithms applied to a random file of 10^6 characters, 99% 0s and 1% 1s.

  Method          Compression     Compressed size       Uncompression
                  time / sec      / bytes               time / sec
  Laplace model     0.45          14 143 (1.4%)           0.57
  gzip              0.22          20 646 (2.1%)           0.04
  gzip --best+      1.63          15 553 (1.6%)           0.05
  compress          0.13          14 785 (1.5%)           0.03
  bzip              0.30          10 903 (1.09%)          0.17
  bzip2             0.19          11 260 (1.12%)          0.05
  ppmz            533             10 447 (1.04%)        535

6.6 Summary

In the last three chapters we have studied three classes of data compression codes.

Fixed-length block codes (Chapter 4). These are mappings from a fixed number of source symbols to a fixed-length binary message. Only a tiny fraction of the source strings are given an encoding. These codes were fun for identifying the entropy as the measure of compressibility but they are of little practical use.

6.7: Exercises on stream codes   [p.123]

Symbol codes (Chapter 5). Symbol codes employ a variable-length code for each symbol in the source alphabet, the codelengths being integer lengths determined by the probabilities of the symbols. Huffman's algorithm constructs an optimal symbol code for a given set of symbol probabilities.

Every source string has a uniquely decodeable encoding, and if the source symbols come from the assumed distribution then the symbol code will compress to an expected length per character L lying in the interval [H, H+1). Statistical fluctuations in the source may make the actual length longer or shorter than this mean length.

If the source is not well matched to the assumed distribution then the mean length is increased by the relative entropy D_KL between the source distribution and the code's implicit distribution. For sources with small entropy, the symbol code must emit at least one bit per source symbol; compression below one bit per source symbol can be achieved only by the cumbersome procedure of putting the source data into blocks.

Stream codes. The distinctive property of stream codes, compared with symbol codes, is that they are not constrained to emit at least one bit for every symbol read from the source stream. So large numbers of source symbols may be coded into a smaller number of bits. This property could be obtained using a symbol code only if the source stream were somehow chopped into blocks.

  • Arithmetic codes combine a probabilistic model with an encoding algorithm that identifies each string with a sub-interval of [0, 1) of size equal to the probability of that string under the model.
This code is almost optimal in the sense that the compressed length of a string x closely matches the Shannon information content of x given the probabilistic model. Arithmetic codes fit with the philosophy that good data compression requires data modelling, in the form of an adaptive Bayesian model.

  • Lempel–Ziv codes are adaptive in the sense that they memorize strings that have already occurred. They are built on the philosophy that we don't know anything at all about what the probability distribution of the source will be, and we want a compression algorithm that will perform reasonably well whatever that distribution is.

Both arithmetic codes and Lempel–Ziv codes will fail to decode correctly if any of the bits of the compressed file are altered. So if compressed files are to be stored or transmitted over noisy media, error-correcting codes will be essential. Reliable communication over unreliable channels is the topic of Part II.

6.7 Exercises on stream codes

Exercise 6.7.[2] Describe an arithmetic coding algorithm to encode random bit strings of length N and weight K (i.e., K ones and N-K zeroes) where N and K are given.
For the case N = 5, K = 2, show in detail the intervals corresponding to all source substrings of lengths 1–5.

Exercise 6.8.[2, p.128] How many bits are needed to specify a selection of K objects from N objects? (N and K are assumed to be known and

6 | Stream Codes   [p.124]

the selection of K objects is unordered.) How might such a selection be made at random without being wasteful of random bits?

Exercise 6.9.[2] A binary source X emits independent identically distributed symbols with probability distribution {f_0, f_1}, where f_1 = 0.01. Find an optimal uniquely-decodeable symbol code for a string x = x_1 x_2 x_3 of three successive samples from this source.

Estimate (to one decimal place) the factor by which the expected length of this optimal code is greater than the entropy of the three-bit string x.

[H_2(0.01) ≃ 0.08, where H_2(x) = x log_2(1/x) + (1-x) log_2(1/(1-x)).]

An arithmetic code is used to compress a string of 1000 samples from the source X. Estimate the mean and standard deviation of the length of the compressed file.

Exercise 6.10.[2] Describe an arithmetic coding algorithm to generate random bit strings of length N with density f (i.e., each bit has probability f of being a one) where N is given.

Exercise 6.11.[2] Use a modified Lempel–Ziv algorithm in which, as discussed on p.120, the dictionary of prefixes is pruned by writing new prefixes into the space occupied by prefixes that will not be needed again. Such prefixes can be identified when both their children have been added to the dictionary of prefixes. (You may neglect the issue of termination of encoding.) Use this algorithm to encode the string 0100001000100010101000001. Highlight the bits that follow a prefix on the second occasion that that prefix is used. (As discussed earlier, these bits could be omitted.)

Exercise 6.12.[2, p.128] Show that this modified Lempel–Ziv code is still not 'complete', that is, there are binary strings that are not encodings of any string.

Exercise 6.13.[3, p.128] Give examples of simple sources that have low entropy but would not be compressed well by the Lempel–Ziv algorithm.

6.8 Further exercises on data compression

The following exercises may be skipped by the reader who is eager to learn about noisy channels.

Exercise 6.14.[3, p.130] Consider a Gaussian distribution in N dimensions,

  P(x) = (1/(2πσ²)^{N/2}) exp( - Σ_n x_n² / (2σ²) ).   (6.13)

Define the radius of a point x to be r = (Σ_n x_n²)^{1/2}. Estimate the mean and variance of the square of the radius, r² = Σ_n x_n².

You may find helpful the integral

  ∫ dx (1/(2πσ²)^{1/2}) x⁴ exp( - x²/(2σ²) ) = 3σ⁴,   (6.14)

though you should be able to estimate the required quantities without it.

6.8: Further exercises on data compression   [p.125]

Assuming that N is large, show that nearly all the probability of a Gaussian is contained in a thin shell of radius √N σ. Find the thickness of the shell.

Evaluate the probability density (6.13) at a point in that thin shell and at the origin x = 0 and compare. Use the case N = 1000 as an example.

Notice that nearly all the probability mass is located in a different part of the space from the region of highest probability density.

[Figure 6.8. Schematic representation of the typical set of an N-dimensional Gaussian.]

Exercise 6.15.[2] Explain what is meant by an optimal binary symbol code.
Find an optimal binary symbol code for the ensemble:

  A = {a, b, c, d, e, f, g, h, i, j},
  P = {1/100, 2/100, 4/100, 5/100, 6/100, 8/100, 9/100, 10/100, 25/100, 30/100},

and compute the expected length of the code.

Exercise 6.16.[2] A string y = x_1 x_2 consists of two independent samples from an ensemble X: A_X = {a, b, c}; P_X = {1/10, 3/10, 6/10}.
What is the entropy of y? Construct an optimal binary symbol code for the string y, and find its expected length.

Exercise 6.17.[2] Strings of N independent samples from an ensemble with P = {0.1, 0.9} are compressed using an arithmetic code that is matched to that ensemble. Estimate the mean and standard deviation of the compressed strings' lengths for the case N = 1000. [H_2(0.1) ≃ 0.47]
Exercise 6.18.[3] Source coding with variable-length symbols.

In the chapters on source coding, we assumed that we were encoding into a binary alphabet {0, 1} in which both symbols should be used with equal frequency. In this question we explore how the encoding alphabet should be used if the symbols take different times to transmit.

A poverty-stricken student communicates for free with a friend using a telephone by selecting an integer n ∈ {1, 2, 3, ...}, making the friend's phone ring n times, then hanging up in the middle of the nth ring. This process is repeated so that a string of symbols n_1 n_2 n_3 ... is received. What is the optimal way to communicate? If large integers n are selected then the message takes longer to communicate. If only small integers n are used then the information content per symbol is small. We aim to maximize the rate of information transfer, per unit time.

Assume that the time taken to transmit a number of rings n and to redial is l_n seconds. Consider a probability distribution over n, {p_n}. Defining the average duration per symbol to be

  L(p) = Σ_n p_n l_n   (6.15)

6 | Stream Codes   [p.126]

and the entropy per symbol to be

  H(p) = Σ_n p_n log_2 (1/p_n),   (6.16)

show that, for the average information rate per second to be maximized, the symbols must be used with probabilities of the form

  p_n = (1/Z) 2^{-β l_n}   (6.17)

where Z = Σ_n 2^{-β l_n}, and β satisfies the implicit equation

  β = H(p)/L(p),   (6.18)

that is, β is the rate of communication. Show that these two equations (6.17, 6.18) imply that β must be set such that

  log Z = 0.   (6.19)

Assuming that the channel has the property

  l_n = n seconds,   (6.20)

find the optimal distribution p and show that the maximal information rate is 1 bit per second.

How does this compare with the information rate per second achieved if p is set to (1/2, 1/2, 0, 0, 0, 0, ...), that is, only the symbols n = 1 and n = 2 are selected, and they have equal probability?

Discuss the relationship between the results (6.17, 6.19) derived above, and the Kraft inequality from source coding theory.

How might a random binary source be efficiently encoded into a sequence of symbols n_1 n_2 n_3 ... for transmission over the channel defined in equation (6.20)?

Exercise 6.19.[1] How many bits does it take to shuffle a pack of cards?

Exercise 6.20.[2] In the card game Bridge, the four players receive 13 cards each from the deck of 52 and start each game by looking at their own hand and bidding. The legal bids are, in ascending order, 1♣, 1♦, 1♥, 1♠, 1NT, 2♣, 2♦, ... 7♥, 7♠, 7NT, and successive bids must follow this order; a bid of, say, 2♥ may only be followed by higher bids such as 2♠ or 3♣ or 7NT. (Let us neglect the 'double' bid.)
The players have several aims when bidding. One of the aims is for two partners to communicate to each other as much as possible about what cards are in their hands. Let us concentrate on this task.

(a) After the cards have been dealt, how many bits are needed for North to convey to South what her hand is?

(b) Assuming that E and W do not bid at all, what is the maximum total information that N and S can convey to each other while bidding? Assume that N starts the bidding, and that once either N or S stops bidding, the bidding stops.
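The stationary condition (6.17)–(6.19) of exercise 6.18 can be checked numerically for the ringing channel of equation (6.20). The sketch below is illustrative only (the variable names and the truncation point are my choices): with l_n = n, choosing β = 1 makes Z = Σ 2^{-n} = 1, so log Z = 0 is satisfied; the resulting distribution p_n = 2^{-n} attains a rate of 1 bit per second, while the equal-probability two-symbol scheme attains only 2/3.

```python
from math import log2

N_MAX = 200                      # truncation point for the infinite sums

def rate(p):
    """Information rate H(p)/L(p), in bits per second, for l_n = n."""
    H = sum(pn * log2(1.0 / pn) for pn in p.values() if pn > 0)
    L = sum(n * pn for n, pn in p.items())
    return H / L

# Candidate optimum from (6.17) with beta = 1: p_n = 2^{-n}.
p_opt = {n: 2.0 ** (-n) for n in range(1, N_MAX + 1)}
assert abs(sum(p_opt.values()) - 1.0) < 1e-12   # Z = 1, i.e. log Z = 0

# The naive scheme: n = 1 and n = 2, each with probability 1/2.
p_naive = {1: 0.5, 2: 0.5}

print(rate(p_opt))    # 1 bit per second
print(rate(p_naive))  # 2/3 bit per second
```

Note that for p_n = 2^{-n} the information content of symbol n is exactly n bits, which equals its duration in seconds; that is why H(p) = L(p) and the rate is exactly 1.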

6.9: Solutions   [p.127]

Exercise 6.21.[2] My old 'arabic' microwave oven had 11 buttons for entering cooking times, and my new 'roman' microwave has just five. The buttons of the roman microwave are labelled '10 minutes', '1 minute', '10 seconds', '1 second', and 'Start'; I'll abbreviate these five strings to the symbols M, C, X, I, □. To enter one minute and twenty-three seconds (1:23), the arabic sequence is 123□, and the roman sequence is CXXIII□.

[Figure 6.9. Alternative keypads for microwave ovens.]

Each of these keypads defines a code mapping the 3599 cooking times from 0:01 to 59:59 into a string of symbols.

(a) Which times can be produced with two or three symbols? (For example, the time 0:20 can be produced by three symbols in either code: XX□ and 20□.)

(b) Are the two codes complete? Give a detailed answer.

(c) For each code, name a cooking time that it can produce in four symbols that the other code cannot.

(d) Discuss the implicit probability distributions over times to which each of these codes is best matched.

(e) Concoct a plausible probability distribution over times that a real user might use, and evaluate roughly the expected number of symbols, and maximum number of symbols, that each code requires. Discuss the ways in which each code is inefficient or efficient.

(f) Invent a more efficient cooking-time-encoding system for a microwave oven.

Exercise 6.22.[2, p.132] Is the standard binary representation for positive integers (e.g. c_b(5) = 101) a uniquely decodeable code?
Design a binary code for the positive integers, i.e., a mapping from n ∈ {1, 2, 3, ...} to c(n) ∈ {0,1}^+, that is uniquely decodeable. Try to design codes that are prefix codes and that satisfy the Kraft equality Σ_n 2^{-l_n} = 1.

Motivations: any data file terminated by a special end of file character can be mapped onto an integer, so a prefix code for integers can be used as a self-delimiting encoding of files too. Large files correspond to large integers. Also, one of the building blocks of a 'universal' coding scheme, that is, a coding scheme that will work OK for a large variety of sources, is the ability to encode integers. Finally, in microwave ovens, cooking times are positive integers!

Discuss criteria by which one might compare alternative codes for integers (or, equivalently, alternative self-delimiting codes for files).

6.9 Solutions

Solution to exercise 6.1 (p.115). The worst-case situation is when the interval to be represented lies just inside a binary interval. In this case, we may choose either of two binary intervals as shown in figure 6.10.

6 | Stream Codes   [p.128]

[Figure 6.10. Termination of arithmetic coding in the worst case, where there is a two bit overhead. Either of the two binary intervals marked on the right-hand side may be chosen. These binary intervals are no smaller than P(x|H)/4.]

These binary intervals are no smaller than P(x|H)/4, so the binary encoding has a length no greater than log_2 1/P(x|H) + log_2 4, which is two bits more than the ideal message length.

Solution to exercise 6.3 (p.118). The standard method uses 32 random bits per generated symbol and so requires 32 000 bits to generate one thousand samples.

Arithmetic coding uses on average about H_2(0.01) = 0.081 bits per generated symbol, and so requires about 83 bits to generate one thousand samples (assuming an overhead of roughly two bits associated with termination). Fluctuations in the number of 1s would produce variations around this mean with standard deviation 21.

Solution to exercise 6.5 (p.120). The encoding is 010100110010110001100, which comes from the parsing

  0, 00, 000, 0000, 001, 00000, 000000   (6.23)

which is encoded thus:

  (,0), (1,0), (10,0), (11,0), (010,1), (100,0), (110,0).   (6.24)

Solution to exercise 6.6 (p.120). The decoding is 0100001000100010101000001.

Solution to exercise 6.8 (p.123). This problem is equivalent to exercise 6.7 (p.123).

The selection of K objects from N objects requires ⌈log_2 (N choose K)⌉ bits ≃ N H_2(K/N) bits. This selection could be made using arithmetic coding. The selection corresponds to a binary string of length N in which the 1 bits represent which objects are selected.
Initially the probabilit y of a resen is K=N and 1 the y of a 0 is ( N K probabilit =N . Thereafter, given that the emitted string ) thus far, of length n , con tains k 1 s, the probabilit y of a 1 is ( K k ) = ( N n ) and the y of a 0 is 1 ( K k ) = ( N n ). probabilit not to exercise Solution . This mo di ed Lemp el{Ziv code is still 6.12 (p.124) `complete', because, for example, after ve pre xes have been collected, the pointer could be any of the strings 000 , 001 , 010 , 011 , 100 , but it cannot be strings 101 or 111 . Thus there are some binary 110 that cannot be pro duced , as enco dings. Solution to exercise 6.13 (p.124) . Sources with low entrop y that are not well compressed by Lemp el{Ziv include:

6.9: Solutions   [p.129]

(a) Sources with some symbols that have long range correlations and intervening random junk. An ideal model should capture what's correlated and compress it. Lempel–Ziv can compress the correlated features only by memorizing all cases of the intervening junk. As a simple example, consider a telephone book in which every line contains an (old number, new number) pair:

  285-3820:572-5892□
  258-8302:593-2010□

The number of characters per line is 18, drawn from the 13-character alphabet {0, 1, ..., 9, -, :, □}. The characters '-', ':' and '□' occur in a predictable sequence, so the true information content per line, assuming all the phone numbers are seven digits long, and assuming that they are random sequences, is about 14 bans. (A ban is the information content of a random integer between 0 and 9.) A finite state language model could easily capture the regularities in these data. A Lempel–Ziv algorithm will take a long time before it compresses such a file down to 14 bans per line, however, because in order for it to 'learn' that the string :ddd is always followed by -, for any three digits ddd, it will have to see all those strings. So near-optimal compression will only be achieved after thousands of lines of the file have been read.

[Figure 6.11. A source with low entropy that is not well compressed by Lempel–Ziv. The bit sequence is read from left to right. Each line differs from the line above in f = 5% of its bits. The image width is 400 pixels.]

(b) Sources with long range correlations, for example two-dimensional images that are represented by a sequence of pixels, row by row, so that vertically adjacent pixels are a distance w apart in the source stream, where w is the image width. Consider, for example, a fax transmission in which each line is very similar to the previous line (figure 6.11). The true entropy is only H_2(f) per pixel, where f is the probability that a pixel differs from its parent. Lempel–Ziv algorithms will only compress down to the entropy once all strings of length 2^w = 2^400 have occurred and their successors have been memorized. There are only about 2^300 particles in the universe, so we can confidently say that Lempel–Ziv codes will never capture the redundancy of such an image.

Another highly redundant texture is shown in figure 6.12. The image was made by dropping horizontal and vertical pins randomly on the plane. It contains both long-range vertical correlations and long-range horizontal correlations. There is no practical way that Lempel–Ziv, fed with a pixel-by-pixel scan of this image, could capture both these correlations.

Biological computational systems can readily identify the redundancy in these images and in images that are much more complex; thus we might anticipate that the best data compression algorithms will result from the development of artificial intelligence methods.

6 | Stream Codes   [p.130]

[Figure 6.12. A texture consisting of horizontal and vertical pins dropped at random on the plane.]

(c) Sources with intricate redundancy, such as files generated by computers. For example, a LaTeX file followed by its encoding into a PostScript file. The information content of this pair of files is roughly equal to the information content of the LaTeX file alone.

(d) A picture of the Mandelbrot set. The picture has an information content equal to the number of bits required to specify the range of the complex plane studied, the pixel sizes, and the colouring rule used.

(e) A picture of a ground state of a frustrated antiferromagnetic Ising model (figure 6.13), which we will discuss in Chapter 31. Like figure 6.12, this binary image has interesting correlations in two directions.

[Figure 6.13. Frustrated triangular Ising model in one of its ground states.]

(f) Cellular automata. Figure 6.14 shows the state history of 100 steps of a cellular automaton with 400 cells. The update rule, in which each cell's new state depends on the state of five preceding cells, was selected at random. The information content is equal to the information in the boundary (400 bits), and the propagation rule, which here can be described in 32 bits. An optimal compressor will thus give a compressed file length which is essentially constant, independent of the vertical height of the image. Lempel–Ziv would only give this zero-cost compression once the cellular automaton has entered a periodic limit cycle, which could easily take about 2^100 iterations.
In contrast, the JBIG compression method, which models the probability of a pixel given its local context and uses arithmetic coding, would do a good job on these images.

Solution to exercise 6.14 (p.124). For a one-dimensional Gaussian, the variance of x, E[x²], is σ². So the mean value of r² in N dimensions, since the components of x are independent random variables, is

  E[r²] = Nσ².   (6.25)

6.9: Solutions   [p.131]

[Figure 6.14. The time-history of a 100-step cellular automaton with 400 cells.]

The variance of r² is similarly N times the variance of x², where x is a one-dimensional Gaussian variable,

  var(x²) = ∫ dx (1/(2πσ²)^{1/2}) x⁴ exp( - x²/(2σ²) ) - σ⁴.   (6.26)

The integral is found to be 3σ⁴ (equation (6.14)), so var(x²) = 2σ⁴. Thus the variance of r² is 2Nσ⁴.

For large N, the central-limit theorem indicates that r² has a Gaussian distribution with mean Nσ² and standard deviation √(2N) σ², so the probability density of r must similarly be concentrated about r ≃ √N σ.

The thickness of this shell is given by turning the standard deviation of r² into a standard deviation on r: for small δr/r, δ log r = δr/r = (1/2) δ log r² = (1/2) δ(r²)/r², so setting δ(r²) = √(2N) σ², r has standard deviation δr = (1/2) r δ(r²)/r² = σ/√2.

The probability density of the Gaussian at a point x_shell where r = √N σ is

  P(x_shell) = (1/(2πσ²)^{N/2}) exp( - Nσ²/(2σ²) ) = (1/(2πσ²)^{N/2}) exp( - N/2 ).   (6.27)

Whereas the probability density at the origin is

  P(x = 0) = 1/(2πσ²)^{N/2}.   (6.28)

Thus P(x_shell)/P(x = 0) = exp( - N/2 ). The probability density at the typical radius is e^{N/2} times smaller than the density at the origin. If N = 1000, then the probability density at the origin is e^{500} times greater.
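The moments just derived are easy to verify by simulation. A minimal sketch, assuming unit variance (σ = 1); the seed and sample counts are arbitrary choices, and the tolerances are loose statistical bounds rather than exact values:

```python
import random
from math import sqrt

random.seed(1)
N, trials = 1000, 500            # dimension and number of sample points

# Squared radius r^2 = sum of x_n^2 for each sampled point.
r2 = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(N)) for _ in range(trials)]

mean_r2 = sum(r2) / trials
var_r2 = sum((v - mean_r2) ** 2 for v in r2) / trials

print(mean_r2)          # close to N = 1000   (equation 6.25)
print(var_r2)           # close to 2N = 2000  (variance of r^2)
print(sqrt(mean_r2))    # typical radius, close to sqrt(N), about 31.6
```

Every sampled point lands near the shell of radius √N ≃ 31.6, even though the density is maximal at the origin: the shell's volume overwhelms the density ratio e^{N/2}.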

7  Codes for Integers   [p.132]

This chapter is an aside, which may safely be skipped.

Solution to exercise 6.22 (p.127)

To discuss the coding of integers we need some definitions.

  The standard binary representation of a positive integer n will be denoted by c_b(n), e.g., c_b(5) = 101, c_b(45) = 101101.

  The standard binary length of a positive integer n, l_b(n), is the length of the string c_b(n). For example, l_b(5) = 3, l_b(45) = 6.

The standard binary representation c_b(n) is not a uniquely decodeable code for integers since there is no way of knowing when an integer has ended. For example, c_b(5)c_b(5) is identical to c_b(45). It would be uniquely decodeable if we knew the standard binary length of each integer before it was received.

Noticing that all positive integers have a standard binary representation that starts with a 1, we might define another representation:

  The headless binary representation of a positive integer n will be denoted by c_B(n), e.g., c_B(5) = 01, c_B(45) = 01101 and c_B(1) = λ (where λ denotes the null string).

This representation would be uniquely decodeable if we knew the length l_b(n) of the integer.

So, how can we make a uniquely decodeable code for integers? Two strategies can be distinguished.

1. Self-delimiting codes. We first communicate somehow the length of the integer, l_b(n), which is also a positive integer; then communicate the original integer n itself using c_B(n).

2. Codes with 'end of file' characters. We code the integer into blocks of length b bits, and reserve one of the 2^b symbols to have the special meaning 'end of file'.
The coding of integers into blo cks is arranged so that reserv ed sym bol is not needed for any other purp ose. this The simplest uniquely deco deable code for integers is the unary code, whic h can be view ed as a code with an end of le character. 132

Unary code. An integer n is encoded by sending a string of n-1 0s followed by a 1.

      n    c_U(n)
      1    1
      2    01
      3    001
      4    0001
      5    00001
      ...
      45   000000000000000000000000000000000000000000001

The unary code has length l_U(n) = n.

The unary code is the optimal code for integers if the probability distribution over n is p_U(n) = 2^{-n}.

Self-delimiting codes

We can use the unary code to encode the length of the binary encoding of n and make a self-delimiting code:

Code C_α. We send the unary code for l_b(n), followed by the headless binary representation of n:

    c_α(n) = c_U[l_b(n)] c_B(n).    (7.1)

Table 7.1 shows the codes for some integers. The overlining indicates the division of each string into the parts c_U[l_b(n)] and c_B(n). We might equivalently view c_α(n) as consisting of a string of (l_b(n) - 1) zeroes followed by the standard binary representation of n, c_b(n).

    Table 7.1. C_α.
      n    c_b(n)    l_b(n)   c_α(n)
      1    1         1        1
      2    10        2        010
      3    11        2        011
      4    100       3        00100
      5    101       3        00101
      6    110       3        00110
      ...
      45   101101    6        00000101101

The codeword c_α(n) has length l_α(n) = 2 l_b(n) - 1.

The implicit probability distribution over n for the code C_α is separable into the product of a probability distribution over the length l,

    P(l) = 2^{-l},    (7.2)

and a uniform distribution over integers having that length,

    P(n | l) = { 2^{-(l-1)}   if l_b(n) = l
               { 0            otherwise.    (7.3)

Now, for the above code, the header that communicates the length always occupies the same number of bits as the standard binary representation of the integer (give or take one). If we are expecting to encounter large integers (large files) then this representation seems suboptimal, since it leads to all files occupying a size that is double their original uncoded size. Instead of using the unary code to encode the length l_b(n), we could use C_α.

Code C_β. We send the length l_b(n) using C_α, followed by the headless binary representation of n:

    c_β(n) = c_α[l_b(n)] c_B(n).    (7.4)

Iterating this procedure, we can define a sequence of codes.

Code C_γ:

    c_γ(n) = c_β[l_b(n)] c_B(n).    (7.5)

Code C_δ:

    c_δ(n) = c_γ[l_b(n)] c_B(n).    (7.6)

    Table 7.2. C_β and C_γ.
      n    c_β(n)        c_γ(n)
      1    1             1
      2    0100          01000
      3    0101          01001
      4    01100         010100
      5    01101         010101
      6    01110         010110
      ...
      45   0011001101    0111001101
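The representations defined so far translate directly into code. The following is a minimal sketch (the function names are mine, not the book's) of c_b, c_B, c_U, C_α and C_β:

```python
def c_b(n):
    """Standard binary representation of a positive integer n."""
    return bin(n)[2:]

def l_b(n):
    """Standard binary length of n."""
    return len(c_b(n))

def c_B(n):
    """Headless binary representation: c_b(n) with its leading 1 removed."""
    return c_b(n)[1:]

def c_U(n):
    """Unary code: n-1 zeroes followed by a one."""
    return '0' * (n - 1) + '1'

def c_alpha(n):
    """Code C_alpha: unary code for l_b(n), then the headless bits of n."""
    return c_U(l_b(n)) + c_B(n)

def c_beta(n):
    """Code C_beta: C_alpha code for l_b(n), then the headless bits of n."""
    return c_alpha(l_b(n)) + c_B(n)
```

Running these reproduces the entries in tables 7.1 and 7.2, e.g. c_alpha(45) gives 00000101101 and c_beta(45) gives 0011001101.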

Codes with end-of-file symbols

We can also make byte-based representations. (Let's use the term 'byte' flexibly here, to denote any fixed-length string of bits, not just a string of 8 bits.) If we encode the number in some base, for example decimal, then we can represent each digit in a byte. In order to represent a digit from 0 to 9 in a byte we need four bits. Because 2^4 = 16, this leaves 6 extra four-bit symbols, {1010, 1011, 1100, 1101, 1110, 1111}, that correspond to no decimal digit. We can use these symbols to indicate the end of our positive integer.

Clearly it is redundant to have more than one end-of-file symbol, so a more efficient code would encode the integer into base fifteen, and use just the sixteenth symbol, 1111, as the punctuation character. Generalizing this idea, we can make similar byte-based codes for integers in bases 3 and 7, and in any base of the form 2^n - 1.

    Table 7.3. Two codes with end-of-file symbols, C_3 and C_7. Spaces
    have been included to show the byte boundaries.
      n    c_3(n)            c_7(n)
      1    01 11             001 111
      2    10 11             010 111
      3    01 00 11          011 111
      ...
      45   01 10 00 00 11    110 011 111

These codes are almost complete. (Recall that a code is 'complete' if it satisfies the Kraft inequality with equality.) The codes' remaining inefficiency is that they provide the ability to encode the integer zero and the empty string, neither of which was required.

Exercise 7.1. [2, p.136] Consider the implicit probability distribution over integers corresponding to the code with an end-of-file character.

(a) If the code has eight-bit blocks (i.e., the integer is coded in base 255), what is the mean length in bits of the integer, under the implicit distribution?

(b) If one wishes to encode binary files of expected size about one hundred kilobytes using a code with an end-of-file character, what is the optimal block size?

Encoding a tiny file

To illustrate the codes we have discussed, we now use each code to encode a small file consisting of just 14 characters, "Claude Shannon".

If we map the ASCII characters onto seven-bit symbols (e.g., in decimal, C = 67, l = 108, etc.), this 14-character file corresponds to the integer

    n = 167 987 786 364 950 891 085 602 469 870 (decimal).

The unary code for n consists of this many (less one) zeroes, followed by a one. If all the oceans were turned into ink, and if we wrote a hundred bits with every cubic millimeter, there might be enough ink to write c_U(n).

The standard binary representation of n is this length-98 sequence of bits:

    c_b(n) = 10000111101100110000111 10101110010011001010100000
             10100111101000110000111 01110110111011011111101110.

Exercise 7.2. [2] Write down or describe the following self-delimiting representations of the above number n: c_α(n), c_β(n), c_γ(n), c_δ(n), c_3(n), c_7(n), and c_15(n). Which of these encodings is the shortest? [Answer: c_15.]
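The mapping from the 14-character file to the integer n can be checked mechanically. A minimal sketch, assuming the seven-bit ASCII mapping described above (the function name is my own choice):

```python
def file_to_integer(s):
    """Concatenate the 7-bit ASCII codes of the characters of s and
    read the resulting bit string as a binary integer."""
    bits = ''.join(format(ord(ch), '07b') for ch in s)
    return int(bits, 2)

n = file_to_integer('Claude Shannon')
```

Since the first character 'C' = 1000011 starts with a 1, the concatenation is exactly the standard binary representation c_b(n), 98 bits long.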

Comparing the codes

One could answer the question 'which of two codes is superior?' by a sentence of the form 'For n > k, code 1 is superior; for n < k, code 2 is superior', but I contend that such an answer misses the point: any complete code corresponds to a prior for which it is optimal; you should not say that any other code is superior to it. Other codes are optimal for other priors. These implicit priors should be thought about so as to achieve the best code for one's application.

Notice that one cannot, for free, switch from one code to another, choosing whichever is shorter. If one were to do this, then it would be necessary to lengthen the message in some way that indicates which of the two codes is being used. If this is done by a single leading bit, it will be found that the resulting code is suboptimal because it fails the Kraft equality, as was discussed in exercise 5.33 (p.104).

Another way to compare codes for integers is to consider a sequence of probability distributions, such as monotonic probability distributions over n ≥ 1, and rank the codes as to how well they encode any of these distributions. A code is called a 'universal' code if for any distribution in a given class, it encodes into an average length that is within some factor of the ideal average length.

Let me say this again. We are meeting an alternative world view: rather than figuring out a good prior over integers, as advocated above, many theorists have studied the problem of creating codes that are reasonably good codes for any priors in a broad class. Here the class of priors conventionally considered is the set of priors that (a) assign a monotonically decreasing probability over integers and (b) have finite entropy.

Several of the codes we have discussed above are universal. Another code which elegantly transcends the sequence of self-delimiting codes is Elias's 'universal code for integers' (Elias, 1975), which effectively chooses from all the codes C_α, C_β, .... It works by sending a sequence of messages each of which encodes the length of the next message, and indicates by a single bit whether or not that message is the final integer (in its standard binary representation). Because a length is a positive integer and all positive integers begin with '1', all the leading 1s can be omitted.

    Algorithm 7.4. Elias's encoder for an integer n.
      Write '0'.
      Loop {
        If floor(log n) = 0, halt.
        Prepend c_b(n) to the written string.
        n := floor(log n)
      }

The encoding of C_ω is generated as shown in algorithm 7.4. The encoding is generated from right to left. Table 7.5 shows the resulting codewords.

Exercise 7.3. [2] Show that the Elias code is not actually the best code for a prior distribution that expects very large integers. (Do this by constructing another code and specifying how large n must be for your code to give a shorter length than Elias's.)
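Algorithm 7.4 translates almost line for line into code. A minimal sketch (here `n.bit_length() - 1` computes floor(log2 n) for a positive integer; the function name is mine):

```python
def elias_omega(n):
    """Elias's universal code (algorithm 7.4): start with '0', then
    repeatedly prepend c_b(n) and replace n by floor(log2 n), halting
    when that floor reaches 0 (i.e. when n has become 1)."""
    code = '0'
    while n.bit_length() > 1:      # floor(log2 n) > 0
        code = bin(n)[2:] + code   # prepend c_b(n)
        n = n.bit_length() - 1     # n := floor(log2 n)
    return code
```

As the book notes, the codeword is built up from right to left: for n = 45 the successive prepends give 0, then 1011010, then 1011011010, then 101011011010.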

    Table 7.5. Elias's 'universal' code for integers. Examples from 1 to 1025.
      n   c_ω(n)        n   c_ω(n)        n    c_ω(n)           n     c_ω(n)
      1   0             9   1110010       31   10100111110      256   1110001000000000
      2   100           10  1110100       32   101011000000     365   1110001011011010
      3   110           11  1110110       45   101011011010     511   1110001111111110
      4   101000        12  1111000       63   101011111110     512   11100110000000000
      5   101010        13  1111010       64   1011010000000    719   11100110110011110
      6   101100        14  1111100       127  1011011111110    1023  11100111111111110
      7   101110        15  1111110       128  10111100000000   1024  111010100000000000
      8   1110000       16  10100100000   255  10111111111110   1025  111010100000000010

Solutions

Solution to exercise 7.1 (p.134). The use of the end-of-file symbol in a code that represents the integer in some base q corresponds to a belief that there is a probability of 1/(q+1) that the current character is the last character of the number. Thus the prior to which this code is matched puts an exponential prior distribution over the length of the integer.

(a) The expected number of characters is q + 1 = 256, so the expected length of the integer is 256 × 8 ≃ 2000 bits.

(b) We wish to find q such that q log q ≃ 800 000 bits. A value of q between 2^15 and 2^16 satisfies this constraint, so 16-bit blocks are roughly the optimal size, assuming there is one end-of-file character.

Part II

Noisy-Channel Coding

8  Dependent Random Variables

In the last three chapters on data compression we concentrated on random vectors x coming from an extremely simple probability distribution, namely the separable distribution in which each component x_n is independent of the others.

In this chapter, we consider joint ensembles in which the random variables are dependent. This material has two motivations. First, data from the real world have interesting correlations, so to do data compression well, we need to know how to work with models that include dependences. Second, a noisy channel with input x and output y defines a joint ensemble in which x and y are dependent (if they were independent, it would be impossible to communicate over the channel), so communication over noisy channels (the topic of chapters 9-11) is described in terms of the entropy of joint ensembles.

8.1 More about entropy

This section gives definitions and exercises to do with entropy, carrying on from section 2.4.

The joint entropy of X, Y is:

    H(X,Y) = \sum_{xy \in A_X A_Y} P(x,y) \log \frac{1}{P(x,y)}.    (8.1)

Entropy is additive for independent random variables:

    H(X,Y) = H(X) + H(Y)  iff  P(x,y) = P(x)P(y).    (8.2)

The conditional entropy of X given y = b_k is the entropy of the probability distribution P(x | y = b_k):

    H(X | y = b_k) ≡ \sum_{x \in A_X} P(x | y = b_k) \log \frac{1}{P(x | y = b_k)}.    (8.3)

The conditional entropy of X given Y is the average, over y, of the conditional entropy of X given y:

    H(X | Y) ≡ \sum_{y \in A_Y} P(y) \left[ \sum_{x \in A_X} P(x|y) \log \frac{1}{P(x|y)} \right]
             = \sum_{xy \in A_X A_Y} P(x,y) \log \frac{1}{P(x|y)}.    (8.4)

This measures the average uncertainty that remains about x when y is known.
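The definitions (8.1)-(8.4) are straightforward to evaluate numerically for any finite joint ensemble. A minimal sketch (representing the ensemble as a dictionary mapping outcome pairs (x, y) to probabilities; the function names are mine):

```python
from math import log2

def joint_entropy(P):
    """H(X,Y) = sum over (x,y) of P(x,y) log2 1/P(x,y), equation (8.1)."""
    return sum(p * log2(1 / p) for p in P.values() if p > 0)

def conditional_entropy(P):
    """H(X|Y) = sum over (x,y) of P(x,y) log2 1/P(x|y), equation (8.4),
    with P(x|y) = P(x,y)/P(y)."""
    Py = {}
    for (x, y), p in P.items():
        Py[y] = Py.get(y, 0.0) + p
    return sum(p * log2(Py[y] / p) for (x, y), p in P.items() if p > 0)

# Two independent fair bits: H(X,Y) = 2 bits, H(X|Y) = H(X) = 1 bit
fair = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
```

For the independent ensemble `fair`, conditioning on y changes nothing, in agreement with the additivity property (8.2).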

8.1: More about entropy

The marginal entropy of X is another name for the entropy of X, H(X), used to contrast it with the conditional entropies listed above.

Chain rule for information content. From the product rule for probabilities, equation (2.6), we obtain:

    \log \frac{1}{P(x,y)} = \log \frac{1}{P(x)} + \log \frac{1}{P(y|x)}    (8.5)

so

    h(x,y) = h(x) + h(y|x).    (8.6)

In words, this says that the information content of x and y is the information content of x plus the information content of y given x.

Chain rule for entropy. The joint entropy, conditional entropy and marginal entropy are related by:

    H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).    (8.7)

In words, this says that the uncertainty of X and Y is the uncertainty of X plus the uncertainty of Y given X.

The mutual information between X and Y is

    I(X;Y) ≡ H(X) - H(X|Y),    (8.8)

and satisfies I(X;Y) = I(Y;X), and I(X;Y) ≥ 0. It measures the average reduction in uncertainty about x that results from learning the value of y; or vice versa, the average amount of information that x conveys about y.

The conditional mutual information between X and Y given z = c_k is the mutual information between the random variables X and Y in the joint ensemble P(x,y | z = c_k),

    I(X;Y | z = c_k) = H(X | z = c_k) - H(X | Y, z = c_k).    (8.9)

The conditional mutual information between X and Y given Z is the average over z of the above conditional mutual information.

    I(X;Y | Z) = H(X|Z) - H(X | Y,Z).    (8.10)

No other 'three-term entropies' will be defined. For example, expressions such as I(X;Y;Z) and I(X|Y;Z) are illegal. But you may put conjunctions of arbitrary numbers of variables in each of the three spots in the expression I(X;Y|Z); for example, I(A,B; C,D | E,F) is fine: it measures how much information on average c and d convey about a and b, assuming e and f are known.

Figure 8.1 shows how the total entropy H(X,Y) of a joint ensemble can be broken down. This figure is important.

Figure 8.1. The relationship between joint information, marginal entropy, conditional entropy and mutual entropy. [Diagram: the bar H(X,Y) divided so that H(X) = H(X|Y) + I(X;Y) and H(Y) = I(X;Y) + H(Y|X).]

8.2 Exercises

Exercise 8.1. [1] Consider three independent random variables u, v, w with entropies H_u, H_v, H_w. Let X ≡ (U,V) and Y ≡ (V,W). What is H(X,Y)? What is H(X|Y)? What is I(X;Y)?

Exercise 8.2. [3, p.142] Referring to the definitions of conditional entropy (8.3-8.4), confirm (with an example) that it is possible for H(X | y = b_k) to exceed H(X), but that the average, H(X|Y), is less than H(X). So data are helpful: they do not increase uncertainty, on average.

Exercise 8.3. [2, p.143] Prove the chain rule for entropy, equation (8.7). [H(X,Y) = H(X) + H(Y|X)].

Exercise 8.4. [2, p.143] Prove that the mutual information I(X;Y) ≡ H(X) - H(X|Y) satisfies I(X;Y) = I(Y;X) and I(X;Y) ≥ 0.

[Hint: see exercise 2.26 (p.37) and note that

    I(X;Y) = D_KL( P(x,y) || P(x)P(y) ).]    (8.11)

Exercise 8.5. [4] The 'entropy distance' between two random variables can be defined to be the difference between their joint entropy and their mutual information:

    D_H(X,Y) ≡ H(X,Y) - I(X;Y).    (8.12)

Prove that the entropy distance satisfies the axioms for a distance: D_H(X,Y) ≥ 0, D_H(X,X) = 0, D_H(X,Y) = D_H(Y,X), and D_H(X,Z) ≤ D_H(X,Y) + D_H(Y,Z). [Incidentally, we are unlikely to see D_H again but it is a good function on which to practise inequality-proving.]

Exercise 8.6. [2, p.147] A joint ensemble XY has the following joint distribution.

    P(x,y)        x: 1      2      3      4
      y = 1        1/8    1/16   1/32   1/32
      y = 2        1/16   1/8    1/32   1/32
      y = 3        1/16   1/16   1/16   1/16
      y = 4        1/4    0      0      0

What is the joint entropy H(X,Y)? What are the marginal entropies H(X) and H(Y)? For each value of y, what is the conditional entropy H(X|y)? What is the conditional entropy H(X|Y)? What is the conditional entropy of Y given X? What is the mutual information between X and Y?
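The quantities asked for in exercise 8.6 can be checked by direct computation (the same joint distribution is worked through in section 9.2). A sketch using exact fractions for the probabilities; the variable names are mine:

```python
from fractions import Fraction as F
from math import log2

# Joint distribution of exercise 8.6: P[y-1][x-1]
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def H(probs):
    """Entropy in bits of a list of probabilities (zeros contribute 0)."""
    return sum(float(p) * log2(1 / float(p)) for p in probs if p > 0)

H_XY = H([p for row in P for p in row])
P_x = [sum(row[i] for row in P) for i in range(4)]   # marginal over x
P_y = [sum(row) for row in P]                        # marginal over y
H_X, H_Y = H(P_x), H(P_y)
H_X_given_Y = H_XY - H_Y      # chain rule, equation (8.7)
I_XY = H_X - H_X_given_Y      # definition (8.8)
```

This gives H(X,Y) = 27/8 bits, H(X) = 7/4 bits, H(Y) = 2 bits, H(X|Y) = 11/8 bits and I(X;Y) = 3/8 bits.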

Exercise 8.7. [2, p.143] Consider the ensemble XYZ in which A_X = A_Y = A_Z = {0, 1}, x and y are independent with P_X = {p, 1-p} and P_Y = {q, 1-q} and

    z = (x + y) mod 2.    (8.13)

(a) If q = 1/2, what is P_Z? What is I(Z;X)?

(b) For general p and q, what is P_Z? What is I(Z;X)? Notice that this ensemble is related to the binary symmetric channel, with x = input, y = noise, and z = output.

Figure 8.2. A misleading representation of entropies (contrast with figure 8.1). [Venn diagram: circles H(X) and H(Y) overlapping, with regions H(X|Y), I(X;Y), H(Y|X), and the union labelled H(X,Y).]

Three-term entropies

Exercise 8.8. [3, p.143] Many texts draw figure 8.1 in the form of a Venn diagram (figure 8.2). Discuss why this diagram is a misleading representation of entropies. Hint: consider the three-variable ensemble XYZ in which x ∈ {0,1} and y ∈ {0,1} are independent binary variables and z ∈ {0,1} is defined to be z = x + y mod 2.

8.3 Further exercises

The data-processing theorem

The data processing theorem states that data processing can only destroy information.

Exercise 8.9. [3, p.144] Prove this theorem by considering an ensemble WDR in which w is the state of the world, d is data gathered, and r is the processed data, so that these three variables form a Markov chain

    w → d → r,    (8.14)

that is, the probability P(w,d,r) can be written as

    P(w,d,r) = P(w) P(d|w) P(r|d).    (8.15)

Show that the average information that R conveys about W, I(W;R), is less than or equal to the average information that D conveys about W, I(W;D).

This theorem is as much a caution about our definition of 'information' as it is a caution about data processing!
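For exercise 8.7, the mutual information I(Z;X) can be computed by brute force from the joint distribution, which is a useful check on any algebraic answer. A minimal sketch (function names are mine):

```python
from math import log2

def H2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else p * log2(1/p) + (1-p) * log2(1/(1-p))

def I_ZX(p, q):
    """Mutual information I(Z;X) for z = (x + y) mod 2, with x, y
    independent, P(x=0) = p and P(y=0) = q."""
    Px = {0: p, 1: 1 - p}
    Py = {0: q, 1: 1 - q}
    Pz = {0: 0.0, 1: 0.0}
    Pxz = {}
    for x in (0, 1):
        for y in (0, 1):
            z = (x + y) % 2
            Pz[z] += Px[x] * Py[y]
            Pxz[x, z] = Pxz.get((x, z), 0.0) + Px[x] * Py[y]
    # KL form of mutual information, as in the hint (8.11)
    return sum(pr * log2(pr / (Px[x] * Pz[z]))
               for (x, z), pr in Pxz.items() if pr > 0)
```

When q = 1/2 the noise completely masks the input and I(Z;X) vanishes for any p; for q ≠ 1/2 it is positive.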

Inference and information measures

Exercise 8.10. [2] The three cards.

(a) One card is white on both faces; one is black on both faces; and one is white on one side and black on the other. The three cards are shuffled and their orientations randomized. One card is drawn and placed on the table. The upper face is black. What is the colour of its lower face? (Solve the inference problem.)

(b) Does seeing the top face convey information about the colour of the bottom face? Discuss the information contents and entropies in this situation. Let the value of the upper face's colour be u and the value of the lower face's colour be l. Imagine that we draw a random card and learn both u and l. What is the entropy of u, H(U)? What is the entropy of l, H(L)? What is the mutual information between U and L, I(U;L)?

Entropies of Markov processes

Exercise 8.11. [3] In the guessing game, we imagined predicting the next letter in a document starting from the beginning and working towards the end. Consider the task of predicting the reversed text, that is, predicting the letter that precedes those already known. Most people find this a harder task. Assuming that we model the language using an N-gram model (which says the probability of the next character depends only on the N-1 preceding characters), is there any difference between the average information contents of the reversed language and the forward language?

8.4 Solutions

Solution to exercise 8.2 (p.140). See exercise 8.6 (p.140) for an example where H(X|y) exceeds H(X) (set y = 3).

We can prove the inequality H(X|Y) ≤ H(X) by turning the expression into a relative entropy (using Bayes' theorem) and invoking Gibbs' inequality (exercise 2.26 (p.37)):

    H(X|Y) ≡ \sum_{y \in A_Y} P(y) \left[ \sum_{x \in A_X} P(x|y) \log \frac{1}{P(x|y)} \right]
           = \sum_{xy \in A_X A_Y} P(x,y) \log \frac{1}{P(x|y)}    (8.16)
           = \sum_{xy} P(x) P(y|x) \log \frac{P(y)}{P(y|x) P(x)}    (8.17)
           = \sum_x P(x) \log \frac{1}{P(x)} + \sum_x P(x) \sum_y P(y|x) \log \frac{P(y)}{P(y|x)}.    (8.18)

The last expression is a sum of relative entropies between the distributions P(y|x) and P(y). So

    H(X|Y) ≤ H(X) + 0,    (8.19)

with equality only if P(y|x) = P(y) for all x and y (that is, only if X and Y are independent).

8.4: Solutions

Solution to exercise 8.3 (p.140). The chain rule for entropy follows from the decomposition of a joint probability:

    H(X,Y) = \sum_{xy} P(x,y) \log \frac{1}{P(x,y)}    (8.20)
           = \sum_{xy} P(x,y) \left[ \log \frac{1}{P(x)} + \log \frac{1}{P(y|x)} \right]    (8.21)
           = \sum_x P(x) \log \frac{1}{P(x)} + \sum_x P(x) \sum_y P(y|x) \log \frac{1}{P(y|x)}    (8.22)
           = H(X) + H(Y|X).    (8.23)

Solution to exercise 8.4 (p.140). Symmetry of mutual information:

    I(X;Y) = H(X) - H(X|Y)    (8.24)
           = \sum_x P(x) \log \frac{1}{P(x)} - \sum_{xy} P(x,y) \log \frac{1}{P(x|y)}    (8.25)
           = \sum_{xy} P(x,y) \log \frac{P(x|y)}{P(x)}    (8.26)
           = \sum_{xy} P(x,y) \log \frac{P(x,y)}{P(x) P(y)}.    (8.27)

This expression is symmetric in x and y, so

    I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).    (8.28)

We can prove that mutual information is positive in two ways. One is to continue from

    I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x) P(y)},    (8.29)

which is a relative entropy, and use Gibbs' inequality (proved on p.44), which asserts that this relative entropy is ≥ 0, with equality only if P(x,y) = P(x)P(y), that is, if X and Y are independent.

The other is to use Jensen's inequality on

    - \sum_{x,y} P(x,y) \log \frac{P(x) P(y)}{P(x,y)} \geq - \log \sum_{x,y} P(x,y) \frac{P(x) P(y)}{P(x,y)} = - \log 1 = 0.    (8.30)

Solution to exercise 8.7 (p.141). z = x + y mod 2.

(a) If q = 1/2, P_Z = {1/2, 1/2} and I(Z;X) = H(Z) - H(Z|X) = 1 - 1 = 0.

(b) For general q and p, P_Z = {pq + (1-p)(1-q), p(1-q) + q(1-p)}. The mutual information is I(Z;X) = H(Z) - H(Z|X) = H_2(pq + (1-p)(1-q)) - H_2(q).

Three-term entropies

Solution to exercise 8.8 (p.141). The depiction of entropies in terms of Venn diagrams is misleading for at least two reasons.

First, one is used to thinking of Venn diagrams as depicting sets; but what are the 'sets' H(X) and H(Y) depicted in figure 8.2, and what are the objects that are members of those sets? I think this diagram encourages the novice student to make inappropriate analogies. For example, some students imagine that the random outcome (x,y) might correspond to a point in the diagram, and thus confuse entropies with probabilities.

Secondly, the depiction in terms of Venn diagrams encourages one to believe that all the areas correspond to positive quantities. In the special case of two random variables it is indeed true that H(X|Y), I(X;Y) and H(Y|X) are positive quantities. But as soon as we progress to three-variable ensembles, we obtain a diagram with positive-looking areas that may actually correspond to negative quantities.

Figure 8.3. A misleading representation of entropies, continued. [Three overlapping circles H(X), H(Y), H(Z); among the regions are I(X;Y|Z), the central region A, H(X|Y,Z), H(Y|X,Z), H(Z|X), H(Z|Y), H(Z|X,Y) and H(X,Y|Z).]

Figure 8.3 correctly shows relationships such as

    H(X) + H(Z|X) + H(Y|X,Z) = H(X,Y,Z).    (8.31)

But it gives the misleading impression that the conditional mutual information I(X;Y|Z) is less than the mutual information I(X;Y). In fact the area labelled A can correspond to a negative quantity. Consider the joint ensemble (X,Y,Z) in which x ∈ {0,1} and y ∈ {0,1} are independent binary variables and z ∈ {0,1} is defined to be z = x + y mod 2. Then clearly H(X) = H(Y) = 1 bit. Also H(Z) = 1 bit. And H(Y|X) = H(Y) = 1 since the two variables are independent. So the mutual information between X and Y is zero: I(X;Y) = 0. However, if z is observed, X and Y become dependent; knowing x, given z, tells you what y is: y = z - x mod 2. So I(X;Y|Z) = 1 bit. Thus the area labelled A must correspond to -1 bits for the figure to give correct answers.

The above example is not at all a capricious or exceptional illustration. The binary symmetric channel with input X, noise Y, and output Z is a situation in which I(X;Y) = 0 (input and noise are independent) but I(X;Y|Z) > 0 (once you see the output, the unknown input and the unknown noise are intimately related!).

The Venn diagram representation is therefore valid only if one is aware that positive areas may represent negative quantities. With this proviso kept in mind, the interpretation of entropies in terms of sets can be helpful (Yeung, 1991).

Solution to exercise 8.9 (p.141). For any joint ensemble XYZ, the following chain rule for mutual information holds.

    I(X; Y,Z) = I(X;Y) + I(X;Z|Y).    (8.32)

Now, in the case w → d → r, w and r are independent given d, so I(W;R|D) = 0. Using the chain rule twice, we have:

    I(W; D,R) = I(W;D)    (8.33)

and

    I(W; D,R) = I(W;R) + I(W;D|R),    (8.34)

so

    I(W;R) - I(W;D) ≤ 0.    (8.35)
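The parity example above is easy to verify numerically. A sketch (names mine) that computes I(X;Y) and I(X;Y|Z) for the ensemble with x, y independent fair bits and z = x + y mod 2, using only joint-entropy identities:

```python
from math import log2
from itertools import product

# Uniform ensemble: x, y independent fair bits, z = x + y mod 2
P = {(x, y, (x + y) % 2): 0.25 for x, y in product((0, 1), repeat=2)}

def marg(P, keep):
    """Marginalize the joint dict P onto the listed axes (0=x, 1=y, 2=z)."""
    out = {}
    for triple, p in P.items():
        key = tuple(triple[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def H(P):
    """Entropy in bits of a dict of probabilities."""
    return sum(p * log2(1 / p) for p in P.values() if p > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_XY = H(marg(P, (0,))) + H(marg(P, (1,))) - H(marg(P, (0, 1)))
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
I_XY_given_Z = (H(marg(P, (0, 2))) + H(marg(P, (1, 2)))
                - H(marg(P, (2,))) - H(P))
```

The computation confirms I(X;Y) = 0 while I(X;Y|Z) = 1 bit, so the area A would have to be -1 bits.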

About Chapter 9

Before reading Chapter 9, you should have read Chapter 1 and worked on exercise 2.26 (p.37), and exercises 8.2-8.7 (pp. 140-141).

158 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 9 Comm Channel unication over a Noisy The big 9.1 picture Source 6 ? Sour ce Compressor Decompressor coding 6 ? Channel der der Deco Enco coding 6 Noisy - channel we discussed source coding with blo ck codes, sym bol codes In Chapters 4{6, codes. and assumed that the channel from the compres- stream We implicitly decompressor noisy Real channels are to the . We will now sor was noise-free. hannel two chapters sub ject of noisy-c the coding { the fundamen- spend on possibilities and limitations tal comm unication through a noisy of error-free channel. aim of channel coding is to mak e the noisy channel beha ve like The channel. We will that the data to be transmitted has been a noiseless assume so the bit stream has no obvious redundancy . The through a good compressor, code, whic h mak es the transmission, will put bac k redundancy of a channel sort, special to mak e the noisy receiv ed signal deco deable. designed 1 / a over 2 p p ose = we transmit 1000 = Supp per second with bits 1 0 noisy channel that ips bits with probabilit y f = 0 : 1. What is the rate of transmission of information? t guess that the rate is 900 bits per We migh by subtracting the ected num ber of errors per second. But this is second exp w where correct, recipien t does not kno the the errors occurred. not because the case where the noise is so great that the receiv ed sym bols are Consider enden indep transmitted sym bols. This corresp onds to a noise level of t of the to chance = 0 5, since half of the receiv f sym bols are correct due : alone. ed But when f = 0 : 5, no information is transmitted at all. 
Giv en what we have learn t about entrop y, it seems reasonable that a mea- sure of the transmitted is given by the mutual information between information us the source the receiv ed signal, that is, the the y of the source min and entrop conditional entrop y of the source given the receiv ed signal. We will now review the de nition of conditional entrop y and mutual in- formation. we will examine whether it is possible to use suc h a noisy Then channel to comm unicate reliably . We will sho w that for any channel Q there Q is a non-zero the capacit y C ( rate, ), up to whic h information can be sen t 146

159 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. Review y and 9.2: of probabilit information 147 y of error. with arbitrarily small probabilit info 9.2 y and of probabilit Review rmation As join t distribution XY from exercise 8.6 (p.140) . an example, we tak e the P ( x ) and P ( y ) are sho wn in the margins. The marginal distributions ) P ( x;y x ) ( y P 4 1 3 2 1 1 1 1 1 / / / / / 1 8 4 16 32 32 1 1 1 1 1 / / / / / 4 16 8 32 2 y 32 1 1 1 1 1 / / / / / 3 16 4 16 16 16 1 1 / / 0 4 4 4 0 0 1 1 1 1 / / / / ) x ( P 2 4 8 8 The y is H ( X;Y ) = 27 = 8 bits. The marginal entropies are H ( X ) = join t entrop 4 bits Y H ( = ) = 2 bits. 7 and y the distribution of x compute eac h value of conditional , and We can for entrop y of eac h of those conditional distributions: the ( x j y ) x H ( X j y ) = bits P 2 4 1 3 1 7 1 1 1 / / / / / 1 2 4 4 8 8 7 1 1 1 1 / / / / / 4 4 2 y 2 8 8 1 1 1 1 / / / / 2 3 4 4 4 4 4 0 0 0 0 1 11 / 8 H X j Y ) = ( Note that whereas H ( X j y = 4) = 0 is less than H ( X ), H ( X j y = 3) is greater than H X ). So in some cases, learning y can incr ease our uncertain ty about ( . Note also although P ( x j y = 2) is a di eren t distribution from P ( x ), x that H entrop H ( X j y = 2) is equal to conditional ( X ). So learning that y the y our kno wledge about x but does not reduce the uncertain ty of is 2 changes , as measured though, entrop y. On average x learning y does con vey by the ) about H ( X j Y , since < H ( X ). information x may also evaluate H ( Y j X ) = 13 = 8 bits. One mutual information is The I X ; Y ) = H ( X ) ( H ( X j Y ) = 3 = 8 bits. 
9.3 Noisy channels

A discrete memoryless channel Q is characterized by an input alphabet A_X, an output alphabet A_Y, and a set of conditional probability distributions P(y|x), one for each x in A_X.

These transition probabilities may be written in a matrix

    Q_{j|i} = P(y = b_j | x = a_i).    (9.1)

I usually orient this matrix with the output variable j indexing the rows and the input variable i indexing the columns, so that each column of Q is a probability vector. With this convention, we can obtain the probability of the output, p_Y, from a probability distribution over the input, p_X, by right-multiplication:

    p_Y = Q p_X.    (9.2)
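As a concrete check of equation (9.2), here is the right-multiplication for a binary symmetric channel with f = 0.15 and input distribution {0.9, 0.1}; a minimal sketch in Python (NumPy assumed):

```python
import numpy as np

f = 0.15                      # flip probability of the binary symmetric channel
Q = np.array([[1-f, f],       # rows index the output y, columns index the input x,
              [f, 1-f]])      # so each column of Q is a probability vector
pX = np.array([0.9, 0.1])     # input distribution P_X = {0.9, 0.1}
pY = Q @ pX                   # right-multiplication, equation (9.2)
print(pY)                     # [0.78 0.22]
```

These output probabilities, P(y=0) = 0.78 and P(y=1) = 0.22, reappear in example 9.5 below.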

Some useful channel models are:

Binary symmetric channel. A_X = {0, 1}. A_Y = {0, 1}.

    P(y=0 | x=0) = 1 - f;    P(y=0 | x=1) = f;
    P(y=1 | x=0) = f;        P(y=1 | x=1) = 1 - f.

Binary erasure channel. A_X = {0, 1}. A_Y = {0, ?, 1}.

    P(y=0 | x=0) = 1 - f;    P(y=0 | x=1) = 0;
    P(y=? | x=0) = f;        P(y=? | x=1) = f;
    P(y=1 | x=0) = 0;        P(y=1 | x=1) = 1 - f.

Noisy typewriter. A_X = A_Y = the 27 letters {A, B, ..., Z, -}. The letters are arranged in a circle, and when the typist attempts to type B, what comes out is either A, B or C, with probability 1/3 each; when the input is C, the output is B, C or D; and so forth, with the final letter '-' adjacent to the first letter A. For example,

    P(y=F | x=G) = 1/3;
    P(y=G | x=G) = 1/3;
    P(y=H | x=G) = 1/3.

Z channel. A_X = {0, 1}. A_Y = {0, 1}.
    P(y=0 | x=0) = 1;    P(y=0 | x=1) = f;
    P(y=1 | x=0) = 0;    P(y=1 | x=1) = 1 - f.

9.4 Inferring the input given the output

If we assume that the input x to a channel comes from an ensemble X, then we obtain a joint ensemble XY in which the random variables x and y have the joint distribution:

    P(x,y) = P(y|x) P(x).    (9.3)

Now if we receive a particular symbol y, what was the input symbol x? We typically won't know for certain. We can write down the posterior distribution of the input using Bayes' theorem:

    P(x|y) = P(y|x) P(x) / P(y) = P(y|x) P(x) / Σ_{x'} P(y|x') P(x').    (9.4)

Example 9.1. Consider a binary symmetric channel with probability of error f = 0.15. Let the input ensemble be P_X: {p_0 = 0.9, p_1 = 0.1}. Assume we observe y = 1.

    P(x=1 | y=1) = P(y=1 | x=1) P(x=1) / Σ_{x'} P(y | x') P(x')
                 = (0.85 × 0.1) / (0.85 × 0.1 + 0.15 × 0.9)
                 = 0.085 / 0.22 = 0.39.    (9.5)
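The posterior computation in example 9.1 follows directly from Bayes' theorem (9.4); a minimal sketch in Python (NumPy assumed; the function name `posterior` is ours):

```python
import numpy as np

f = 0.15
Q = np.array([[1-f, f],        # columns: x = 0, 1; rows: y = 0, 1
              [f, 1-f]])
pX = np.array([0.9, 0.1])

def posterior(y):
    """P(x | y) for the binary symmetric channel, via Bayes' theorem (9.4)."""
    joint = Q[y] * pX          # P(y|x) P(x) for each x
    return joint / joint.sum()

print(posterior(1))            # [0.614 0.386]: P(x=1 | y=1) = 0.39 to 2 d.p.
```

Calling `posterior(0)` answers exercise 9.2 in the same way.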

Thus 'x = 1' is still less probable than 'x = 0', although it is not as improbable as it was before.

Exercise 9.2.[1, p.157] Now assume we observe y = 0. Compute the probability of x = 1 given y = 0.

Example 9.3. Consider a Z channel with probability of error f = 0.15. Let the input ensemble be P_X: {p_0 = 0.9, p_1 = 0.1}. Assume we observe y = 1.

    P(x=1 | y=1) = (0.85 × 0.1) / (0.85 × 0.1 + 0 × 0.9)
                 = 0.085 / 0.085 = 1.0.    (9.6)

So given the output y = 1 we become certain of the input.

Exercise 9.4.[1, p.157] Alternatively, assume we observe y = 0. Compute P(x=1 | y=0).

9.5 Information conveyed by a channel

We now consider how much information can be communicated through a channel. In operational terms, we are interested in finding ways of using the channel such that all the bits that are communicated are recovered with negligible probability of error. In mathematical terms, assuming a particular input ensemble X, we can measure how much information the output conveys about the input by the mutual information:

    I(X;Y) ≡ H(X) - H(X|Y) = H(Y) - H(Y|X).    (9.7)

Our aim is to establish the connection between these two ideas. Let us evaluate I(X;Y) for some of the channels above.

Hint for computing mutual information

We will tend to think of I(X;Y) as H(X) - H(X|Y), i.e., how much the uncertainty of the input X is reduced when we look at the output Y. But for computational purposes it is often handy to evaluate H(Y) - H(Y|X) instead.

Figure 9.1. The relationship between joint information H(X,Y), marginal entropy H(X), conditional entropy H(X|Y) and mutual information I(X;Y).
[Figure 9.1, repeated: the relationship between H(X,Y), H(X), H(Y), H(X|Y), H(Y|X) and I(X;Y). This figure is important, so I'm showing it twice.]

Example 9.5. Consider the binary symmetric channel again, with f = 0.15 and P_X: {p_0 = 0.9, p_1 = 0.1}. We already evaluated the marginal probabilities P(y) implicitly above: P(y=0) = 0.78; P(y=1) = 0.22. The mutual information is:

    I(X;Y) = H(Y) - H(Y|X).

What is H(Y|X)? It is defined to be the weighted sum over x of H(Y|x); but H(Y|x) is the same for each value of x: H(Y|x=0) is H_2(0.15), and H(Y|x=1) is H_2(0.15). So

    I(X;Y) = H(Y) - H(Y|X)
           = H_2(0.22) - H_2(0.15)
           = 0.76 - 0.61 = 0.15 bits.    (9.8)

This may be contrasted with the entropy of the source H(X) = H_2(0.1) = 0.47 bits.

Note: here we have used the binary entropy function H_2(p) ≡ H(p, 1-p) = p log 1/p + (1-p) log 1/(1-p). Throughout this book, log means log_2.

Example 9.6. And now the Z channel, with P_X as above. P(y=1) = 0.085.

    I(X;Y) = H(Y) - H(Y|X)
           = H_2(0.085) - [0.9 H_2(0) + 0.1 H_2(0.15)]
           = 0.42 - (0.1 × 0.61) = 0.36 bits.    (9.9)

The entropy of the source, as above, is H(X) = 0.47 bits. Notice that the mutual information I(X;Y) for the Z channel is bigger than the mutual information for the binary symmetric channel with the same f. The Z channel is a more reliable channel.

Exercise 9.7.[1, p.157] Compute the mutual information between X and Y for the binary symmetric channel with f = 0.15 when the input distribution is P_X = {p_0 = 0.5, p_1 = 0.5}.

Exercise 9.8.[2, p.157] Compute the mutual information between X and Y for the Z channel with f = 0.15 when the input distribution is P_X = {p_0 = 0.5, p_1 = 0.5}.

Maximizing the mutual information

We have observed in the above examples that the mutual information between the input and the output depends on the chosen input ensemble. Let us assume that we wish to maximize the mutual information conveyed by the channel by choosing the best possible input ensemble.
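Before maximizing, it is worth checking examples 9.5 and 9.6 numerically. Both follow the same recipe, I(X;Y) = H(Y) - H(Y|X), so a single helper covers any channel matrix; a Python sketch (NumPy assumed; the function names are ours):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability vector; 0 log 0 = 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(Q, pX):
    """I(X;Y) = H(Y) - H(Y|X), with Q holding one column per input symbol."""
    HY_given_X = sum(px * H(Q[:, i]) for i, px in enumerate(pX))
    return H(Q @ pX) - HY_given_X

f = 0.15
pX = np.array([0.9, 0.1])
bsc = np.array([[1-f, f], [f, 1-f]])
z   = np.array([[1.0, f], [0.0, 1-f]])
print(mutual_information(bsc, pX))   # ~0.15 bits, equation (9.8)
print(mutual_information(z, pX))     # ~0.36 bits, equation (9.9)
```

The same helper with pX = {0.5, 0.5} answers exercises 9.7 and 9.8.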
We define the capacity of the channel to be its maximum mutual information.

The capacity of a channel Q is:

    C(Q) = max_{P_X} I(X;Y).    (9.10)

The distribution P_X that achieves the maximum is called the optimal input distribution, denoted by P*_X. [There may be multiple optimal input distributions achieving the same value of I(X;Y).]

In Chapter 10 we will show that the capacity does indeed measure the maximum amount of error-free information that can be transmitted over the channel per unit time.

Example 9.9. Consider the binary symmetric channel with f = 0.15. Above, we considered P_X: {p_0 = 0.9, p_1 = 0.1}, and found I(X;Y) = 0.15 bits.

How much better can we do? By symmetry, the optimal input distribution is {0.5, 0.5} and the capacity is

    C(Q_BSC) = H_2(0.5) - H_2(0.15) = 1.0 - 0.61 = 0.39 bits.    (9.11)

We'll justify the symmetry argument later. If there's any doubt about the symmetry argument, we can always resort to explicit maximization of the mutual information,

    I(X;Y) = H_2((1-f) p_1 + (1-p_1) f) - H_2(f)    (figure 9.2).    (9.12)

Figure 9.2. The mutual information I(X;Y) for a binary symmetric channel with f = 0.15 as a function of the input distribution.

Example 9.10. The noisy typewriter. The optimal input distribution is a uniform distribution over x, and gives C = log_2 9 bits.

Example 9.11. Consider the Z channel with f = 0.15. Identifying the optimal input distribution is not so straightforward. We evaluate I(X;Y) explicitly for P_X = {p_0, p_1}. First, we need to compute P(y). The probability of y = 1 is easiest to write down:

    P(y=1) = p_1 (1-f).    (9.13)

Then the mutual information is:

    I(X;Y) = H(Y) - H(Y|X)
           = H_2(p_1 (1-f)) - (p_0 H_2(0) + p_1 H_2(f))
           = H_2(p_1 (1-f)) - p_1 H_2(f).    (9.14)

This is a non-trivial function of p_1, shown in figure 9.3. It is maximized, for f = 0.15, by p_1* = 0.445. We find C(Q_Z) = 0.685. Notice that the optimal input distribution is not {0.5, 0.5}. We can communicate slightly more information by using input symbol 0 more frequently than 1.

Figure 9.3. The mutual information I(X;Y) for a Z channel with f = 0.15 as a function of the input distribution.

Exercise 9.12.[1, p.158] What is the capacity of the binary symmetric channel for general f?

Exercise 9.13.[2, p.158] Show that the capacity of the binary erasure channel with f = 0.15 is C_BEC = 0.85. What is its capacity for general f? Comment.

9.6 The noisy-channel coding theorem

It seems plausible that the 'capacity' we have defined may be a measure of information conveyed by a channel; what is not obvious, and what we will prove in the next chapter, is that the capacity indeed measures the rate at which blocks of data can be communicated over the channel with arbitrarily small probability of error.

We make the following definitions.

An (N, K) block code for a channel Q is a list of S = 2^K codewords

    {x^(1), x^(2), ..., x^(2^K)},  x^(s) in A_X^N,

each of length N. Using this code we can encode a signal s in {1, 2, 3, ..., 2^K} as x^(s). [The number of codewords S is an integer, but the number of bits specified by choosing a codeword, K ≡ log_2 S, is not necessarily an integer.]
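Before proceeding, the maximizations behind equations (9.11) and example 9.11 can be confirmed by brute force over the input distribution; a Python sketch (NumPy assumed; a grid search, not the book's analytic method):

```python
import numpy as np

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

f = 0.15
p1 = np.linspace(0.0, 1.0, 100001)

# Equation (9.12): mutual information of the BSC as a function of p1.
I_bsc = np.array([H2((1-f)*p + (1-p)*f) - H2(f) for p in p1])
# Equation (9.14): mutual information of the Z channel as a function of p1.
I_z = np.array([H2(p*(1-f)) - p*H2(f) for p in p1])

print(p1[I_bsc.argmax()], I_bsc.max())  # 0.5, ~0.39: the BSC capacity (9.11)
print(p1[I_z.argmax()], I_z.max())      # ~0.445, ~0.685: example 9.11
```

The grid search confirms the symmetry argument for the BSC and the asymmetric optimum p_1* = 0.445 for the Z channel.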

The rate of the code is R = K/N bits per channel use.

[We will use this definition of the rate for any channel, not only channels with binary inputs; note however that it is sometimes conventional to define the rate of a code for a channel with q input symbols to be K/(N log q).]

A decoder for an (N, K) block code is a mapping from the set of length-N strings of channel outputs, A_Y^N, to a codeword label ŝ in {0, 1, 2, ..., 2^K}. The extra symbol ŝ = 0 can be used to indicate a 'failure'.

The probability of block error of a code and decoder, for a given channel, and for a given probability distribution over the encoded signal P(s_in), is:

    p_B = Σ_{s_in} P(s_in) P(s_out ≠ s_in | s_in).    (9.15)

The maximal probability of block error is

    p_BM = max_{s_in} P(s_out ≠ s_in | s_in).    (9.16)

The optimal decoder for a channel code is the one that minimizes the probability of block error. It decodes an output y as the input s that has maximum posterior probability P(s|y).

    P(s|y) = P(y|s) P(s) / Σ_{s'} P(y|s') P(s')    (9.17)

    ŝ_optimal = argmax_s P(s|y).    (9.18)

A uniform prior distribution on s is usually assumed, in which case the optimal decoder is also the maximum likelihood decoder, i.e., the decoder that maps an output y to the input s that has maximum likelihood P(y|s).

The probability of bit error p_b is defined assuming that the codeword number s is represented by a binary vector s of K bits; it is the average probability that a bit of s_out is not equal to the corresponding bit of s_in (averaging over all K bits).

Figure 9.4. Portion of the R, p_b plane
asserted to be achievable by the first part of Shannon's noisy-channel coding theorem.

Shannon's noisy-channel coding theorem (part one). Associated with each discrete memoryless channel, there is a non-negative number C (called the channel capacity) with the following property. For any ε > 0 and R < C, for large enough N, there exists a block code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is < ε.

Confirmation of the theorem for the noisy typewriter channel

In the case of the noisy typewriter, we can easily confirm the theorem, because we can create a completely error-free communication strategy using a block code of length N = 1: we use only the letters B, E, H, ..., Z, i.e., every third letter. These letters form a non-confusable subset of the input alphabet (see figure 9.5). Any output can be uniquely decoded. The number of inputs in the non-confusable subset is 9, so the error-free information rate of this system is log_2 9 bits, which is equal to the capacity C, which we evaluated in example 9.10 (p.151).
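This error-free strategy is easy to simulate. A Python sketch (the helper names `channel` and `decode` are ours): every third letter is a codeword, and nearest-letter decoding never errs, because each codeword 'owns' itself and its two circular neighbours.

```python
import random

letters = [chr(ord('A') + i) for i in range(26)] + ['-']   # the 27-letter circle

code = letters[1::3]       # B, E, H, ..., Z: the 9 non-confusable inputs

def channel(x):
    """Noisy typewriter: output the input letter or one of its circular neighbours."""
    i = letters.index(x)
    return letters[(i + random.choice([-1, 0, 1])) % 27]

def decode(y):
    """Map the received letter to the nearest codeword (codewords sit at
    circle positions 1, 4, 7, ..., 25), so decoding is unambiguous."""
    i = letters.index(y)
    return letters[(3 * round((i - 1) / 3) + 1) % 27]

print(all(decode(channel(x)) == x for x in code for _ in range(1000)))  # True
```

The rate is log_2 9 ≈ 3.17 bits per channel use, with zero probability of block error, exactly as the theorem promises.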

Figure 9.5. A non-confusable subset of inputs for the noisy typewriter.

Figure 9.6. Extended channels obtained from a binary symmetric channel with transition probability 0.15, for N = 1, 2, 4.

How does this translate into the terms of the theorem? The following table explains.

The theorem: Associated with each discrete memoryless channel, there is a non-negative number C.
How it applies to the noisy typewriter: The capacity C is log_2 9.

The theorem: For any ε > 0 and R < C, for large enough N,
How it applies: No matter what ε and R are, we set the blocklength N to 1.

The theorem: there exists a block code of length N and rate ≥ R
How it applies: The block code is {B, E, ..., Z}. The value of K is given by 2^K = 9, so K = log_2 9, and this code has rate log_2 9, which is greater than the requested value of R.

The theorem: and a decoding algorithm,
How it applies: The decoding algorithm maps the received letter to the nearest letter in the code;

The theorem: such that the maximal probability of block error is < ε.
How it applies: the maximal probability of block error is zero, which is less than the given ε.

9.7 Intuitive preview of proof

Extended channels

To prove the theorem for any given channel, we consider the extended channel corresponding to N uses of the channel. The extended channel has |A_X|^N possible inputs x and |A_Y|^N possible outputs.
Extended channels obtained from a binary symmetric channel and from a Z channel are shown in figures 9.6 and 9.7, with N = 2 and N = 4.

Figure 9.7. Extended channels obtained from a Z channel with transition probability 0.15. Each column corresponds to an input, and each row is a different output.

Figure 9.8. (a) Some typical outputs in A_Y^N corresponding to typical inputs x. (b) A subset of the typical sets shown in (a) that do not overlap each other. This picture can be compared with the solution to the noisy typewriter in figure 9.5.

Exercise 9.14.[2, p.159] Find the transition probability matrices Q for the extended channel, with N = 2, derived from the binary erasure channel having erasure probability 0.15.

By selecting two columns of this transition probability matrix, we can define a rate-1/2 code for this channel with blocklength N = 2. What is the best choice of two columns? What is the decoding algorithm?

To prove the noisy-channel coding theorem, we make use of large blocklengths N. The intuitive idea is that, if N is large, an extended channel looks a lot like the noisy typewriter. Any particular input x is very likely to produce an output in a small subspace of the output alphabet (the typical output set, given that input). So we can find a non-confusable subset of the inputs that produce essentially disjoint output sequences. For a given N, let us consider a way of generating such a non-confusable subset of the inputs, and count up
how many distinct inputs it contains.

Imagine making an input sequence x for the extended channel by drawing it from an ensemble X^N, where X is an arbitrary ensemble over the input alphabet. Recall the source coding theorem of Chapter 4, and consider the number of probable output sequences y. The total number of typical output sequences y is 2^{N H(Y)}, all having similar probability. For any particular typical input sequence x, there are about 2^{N H(Y|X)} probable output sequences. Some of these subsets of A_Y^N are depicted by circles in figure 9.8a.

We now imagine restricting ourselves to a subset of the typical inputs x such that the corresponding typical output sets do not overlap, as shown in figure 9.8b. We can then bound the number of non-confusable inputs by dividing the size of the typical y set, 2^{N H(Y)}, by the size of each typical-y-
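For the binary symmetric channel this counting argument gives a concrete exponent; a small Python sketch (our illustration, assuming a uniform input ensemble and f = 0.15, for which the output is uniform too):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0, 1) else -p*math.log2(p) - (1-p)*math.log2(1-p)

f = 0.15
HY = H2(0.5)            # with uniform inputs, the BSC output is also uniform
HY_given_X = H2(f)      # each input spreads over ~2^{N H2(f)} typical outputs
rate = HY - HY_given_X  # exponent of the bound 2^{N(H(Y) - H(Y|X))}
print(rate)             # ~0.39: for N = 100, about 2^39 non-confusable inputs
```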

given-typical-x set, 2^{N H(Y|X)}. So the number of non-confusable inputs, if they are selected from the set of typical inputs x ~ X^N, is ≤ 2^{N(H(Y) - H(Y|X))} = 2^{N I(X;Y)}.

The maximum value of this bound is achieved if X is the ensemble that maximizes I(X;Y), in which case the number of non-confusable inputs is ≤ 2^{NC}. Thus asymptotically up to C bits per cycle, and no more, can be communicated with vanishing error probability.

This sketch has not rigorously proved that reliable communication really is possible; that's our task for the next chapter.

9.8 Further exercises

Exercise 9.15.[3, p.159] Refer back to the computation of the capacity of the Z channel with f = 0.15.

(a) Why is p_1 less than 0.5? One could argue that it is good to favour the 0 input, since it is transmitted without error, and also argue that it is good to favour the 1 input, since it often gives rise to the highly prized 1 output, which allows certain identification of the input! Try to make a convincing argument.

(b) In the case of general f, show that the optimal input distribution is

    p_1* = (1/(1-f)) / (1 + 2^{H_2(f)/(1-f)}).    (9.19)

(c) What happens to p_1* if the noise level f is very close to 1?

Exercise 9.16.[2, p.159] Sketch graphs of the capacity of the Z channel, the binary symmetric channel and the binary erasure channel as a function of f.

Exercise 9.17.[2] What is the capacity of the five-input, ten-output channel whose transition probability matrix is

    0.25  0     0     0     0.25
    0.25  0     0     0     0.25
    0.25  0.25  0     0     0
    0.25  0.25  0     0     0
    0     0.25  0.25  0     0
    0     0.25  0.25  0     0        (9.20) ?
    0     0     0.25  0.25  0
    0     0     0.25  0.25  0
    0     0     0     0.25  0.25
    0     0     0     0.25  0.25

Exercise 9.18.[2, p.159] Consider a Gaussian channel with binary input x in {-1, +1} and real output alphabet A_Y, with transition probability density

    Q(y | x, α, σ) = (1/sqrt(2πσ²)) exp(-(y - xα)²/(2σ²)),    (9.21)

where α is the signal amplitude.

(a) Compute the posterior probability of x given y, assuming that the two inputs are equiprobable. Put your answer in the form

    P(x=1 | y, α, σ) = 1 / (1 + e^{-a(y)}).    (9.22)
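Part (a) has a closed form: the log posterior ratio works out to a(y) = 2αy/σ² (derived in solution 9.18 below), giving a sigmoid in y. A Python sketch comparing that sigmoid with a direct Bayes computation (function names are ours):

```python
import math

def P_direct(y, alpha=1.0, sigma=1.0):
    """P(x=1 | y) by direct Bayes' rule with equiprobable inputs."""
    like_plus = math.exp(-(y - alpha)**2 / (2*sigma**2))
    like_minus = math.exp(-(y + alpha)**2 / (2*sigma**2))
    return like_plus / (like_plus + like_minus)

def P_sigmoid(y, alpha=1.0, sigma=1.0):
    """The same posterior via 1/(1 + exp(-a(y))) with a(y) = 2*alpha*y/sigma^2."""
    return 1 / (1 + math.exp(-2*alpha*y / sigma**2))

print(P_direct(0.5), P_sigmoid(0.5))   # the two agree for every y
```

At y = 0 the posterior is exactly 0.5: an output of zero carries no information about the transmitted sign.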

Sketch the value of P(x=1 | y, α, σ) as a function of y.

(b) Assume that a single bit is to be transmitted. What is the optimal decoder, and what is its probability of error? Express your answer in terms of the signal-to-noise ratio α²/σ² and the error function (the cumulative probability function of the Gaussian distribution),

    Φ(z) ≡ ∫_{-∞}^{z} (1/sqrt(2π)) e^{-u²/2} du.    (9.23)

[Note that this definition of the error function Φ(z) may not correspond to other people's.]

Pattern recognition as a noisy channel

We may think of many pattern recognition problems in terms of communication channels. Consider the case of recognizing handwritten digits (such as postcodes on envelopes). The author of the digit wishes to communicate a message from the set A_X = {0, 1, 2, 3, ..., 9}; this selected message is the input to the channel. What comes out of the channel is a pattern of ink on paper. If the ink pattern is represented using 256 binary pixels, the channel Q has as its output a random variable y in A_Y = {0, 1}^256. An example of an element from this alphabet is shown in the margin.

Exercise 9.19.[2] Estimate how many patterns in A_Y are recognizable as the character '2'. [The aim of this problem is to try to demonstrate the existence of as many patterns as possible that are recognizable as 2s.]

Discuss how one might model the channel P(y | x = 2). Estimate the entropy of the probability distribution P(y | x = 2).

One strategy for doing pattern recognition is to create a model for P(y | x) for each value of the input x = {0, 1, 2, 3, ..., 9}, then use Bayes'
theorem to infer x given y.

    P(x|y) = P(y|x) P(x) / Σ_{x'} P(y|x') P(x').    (9.24)

This strategy is known as full probabilistic modelling or generative modelling. This is essentially how current speech recognition systems work. In addition to the channel model, P(y|x), one uses a prior probability distribution P(x), which in the case of both character recognition and speech recognition is a language model that specifies the probability of the next character/word given the context and the known grammar and statistics of the language.

Figure 9.9. Some more 2s.

Random coding

Exercise 9.20.[2, p.160] Given twenty-four people in a room, what is the probability that there are at least two people present who have the same birthday (i.e., day and month of birth)? What is the expected number of pairs of people with the same birthday? Which of these two questions is easiest to solve? Which answer gives most insight? You may find it helpful to solve these problems and those that follow using notation such as A = number of days in year = 365 and S = number of people = 24.

Exercise 9.21.[2] The birthday problem may be related to a coding scheme. Assume we wish to convey a message to an outsider identifying one of

the twenty-four people. We could simply communicate a number s from A_S = {1, 2, ..., 24}, having agreed a mapping of people onto numbers; alternatively, we could convey a number from A_X = {1, 2, ..., 365}, identifying the day of the year that is the selected person's birthday (with apologies to leapyearians). [The receiver is assumed to know all the people's birthdays.] What, roughly, is the probability of error of this communication scheme, assuming it is used for a single transmission? What is the capacity of the communication channel, and what is the rate of communication attempted by this scheme?

Exercise 9.22.[2] Now imagine that there are K rooms in a building, each containing q people. (You might think of K = 2 and q = 24 as an example.) The aim is to communicate a selection of one person from each room by transmitting an ordered list of K days (from A_X). Compare the probability of error of the following two schemes.

(a) As before, where each room transmits the birthday of the selected person.

(b) To each ordered K-tuple of people, one drawn from each room, an ordered K-tuple of randomly selected days from A_X is assigned (this K-tuple has nothing to do with their birthdays). This enormous list of S = q^K strings is known to the receiver. When a particular person from each room has been selected, the ordered string of days corresponding to that K-tuple of people is transmitted.

What is the probability of error when q = 364 and K = 1? What is the probability of error when q = 364 and K is large, e.g. K = 6000?

9.9 Solutions
Solution to exercise 9.2 (p.149). If we assume we observe y = 0,

    P(x=1 | y=0) = P(y=0 | x=1) P(x=1) / Σ_{x'} P(y=0 | x') P(x')    (9.25)
                 = (0.15 × 0.1) / (0.15 × 0.1 + 0.85 × 0.9)          (9.26)
                 = 0.015 / 0.78 = 0.019.                             (9.27)

Solution to exercise 9.4 (p.149). If we observe y = 0,

    P(x=1 | y=0) = (0.15 × 0.1) / (0.15 × 0.1 + 1.0 × 0.9)    (9.28)
                 = 0.015 / 0.915 = 0.016.                     (9.29)

Solution to exercise 9.7 (p.150). The probability that y = 1 is 0.5, so the mutual information is:

    I(X;Y) = H(Y) - H(Y|X)         (9.30)
           = H_2(0.5) - H_2(0.15)  (9.31)
           = 1 - 0.61 = 0.39 bits. (9.32)

Solution to exercise 9.8 (p.150). We again compute the mutual information using I(X;Y) = H(Y) - H(Y|X). The probability that y = 0 is 0.575, and

H(Y|X) = Σ_x P(x) H(Y|x) = P(x=1) H(Y|x=1) + P(x=0) H(Y|x=0), so the mutual information is:

    I(X;Y) = H(Y) - H(Y|X)                            (9.33)
           = H_2(0.575) - [0.5 × H_2(0.15) + 0.5 × 0] (9.34)
           = 0.98 - 0.30 = 0.679 bits.                (9.35)

Solution to exercise 9.12 (p.151). By symmetry, the optimal input distribution is {0.5, 0.5}. Then the capacity is

    C = I(X;Y) = H(Y) - H(Y|X)  (9.36)
      = H_2(0.5) - H_2(f)       (9.37)
      = 1 - H_2(f).             (9.38)

Would you like to find the optimal input distribution without invoking symmetry? We can do this by computing the mutual information in the general case where the input ensemble is {p_0, p_1}:

    I(X;Y) = H(Y) - H(Y|X)                 (9.39)
           = H_2(p_0 f + p_1 (1-f)) - H_2(f). (9.40)

The only p-dependence is in the first term H_2(p_0 f + p_1 (1-f)), which is maximized by setting the argument to 0.5. This value is given by setting p_0 = 1/2.

Solution to exercise 9.13 (p.151). Answer 1. By symmetry, the optimal input distribution is {0.5, 0.5}. The capacity is most easily evaluated by writing the mutual information as I(X;Y) = H(X) - H(X|Y). The conditional entropy H(X|Y) is Σ_y P(y) H(X|y); when y is known, x is uncertain only if y = ?, which occurs with probability f/2 + f/2 = f, so the conditional entropy H(X|Y) is f H_2(0.5).

    C = I(X;Y) = H(X) - H(X|Y)   (9.41)
      = H_2(0.5) - f H_2(0.5)    (9.42)
      = 1 - f.                   (9.43)

The binary erasure channel fails a fraction f of the time. Its capacity is precisely 1 - f, which is the fraction of the time that the channel is reliable.
This result seems very reasonable, but it is far from obvious how to encode information reliably so as to communicate over this channel.

Answer 2. Alternatively, without invoking the symmetry assumed above, we can start from the input ensemble {p_0, p_1}. The probability that y = ? is p_0 f + p_1 f = f, and when we receive y = ?, the posterior probability of x is the same as the prior probability, so:

    I(X;Y) = H(X) - H(X|Y)      (9.44)
           = H_2(p_1) - f H_2(p_1) (9.45)
           = (1-f) H_2(p_1).    (9.46)

This mutual information achieves its maximum value of (1-f) when p_1 = 1/2.
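Both answers can be cross-checked numerically by maximizing I(X;Y) over the input distribution for the three-output erasure channel; a Python sketch (NumPy assumed; helper names are ours):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability vector; 0 log 0 = 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(Q, pX):
    """I(X;Y) = H(Y) - H(Y|X) for channel matrix Q (one column per input)."""
    return H(Q @ pX) - sum(px * H(Q[:, i]) for i, px in enumerate(pX))

f = 0.15
bec = np.array([[1-f, 0.0],    # y = 0
                [f,   f  ],    # y = ?
                [0.0, 1-f]])   # y = 1
C = max(I(bec, np.array([1-p1, p1])) for p1 in np.linspace(0, 1, 1001))
print(C)    # ~0.85 = 1 - f, agreeing with (9.43) and (9.46)
```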

Figure 9.10. (a) The extended channel (N = 2) obtained from a binary erasure channel with erasure probability 0.15. (b) A block code consisting of the two codewords 00 and 11. (c) The optimal decoder for this code.

Solution to exercise 9.14 (p.153). The extended channel is shown in figure 9.10. The best code for this channel with N = 2 is obtained by choosing two columns that have minimal overlap, for example, columns 00 and 11. The decoding algorithm returns '00' if the extended channel output is among the top four and '11' if it's among the bottom four, and gives up if the output is '??'.

Solution to exercise 9.15 (p.155). In example 9.11 (p.151) we showed that the mutual information between input and output of the Z channel is

    I(X;Y) = H(Y) - H(Y|X) = H_2(p_1 (1-f)) - p_1 H_2(f).    (9.47)

We differentiate this expression with respect to p_1, taking care not to confuse log_2 with log_e:

    d/dp_1 I(X;Y) = (1-f) log_2 [(1 - p_1 (1-f)) / (p_1 (1-f))] - H_2(f).    (9.48)

Setting this derivative to zero and rearranging using skills developed in exercise 2.17 (p.36), we obtain:

    p_1* (1-f) = 1 / (1 + 2^{H_2(f)/(1-f)}),    (9.49)

so the optimal input distribution is

    p_1* = (1/(1-f)) / (1 + 2^{H_2(f)/(1-f)}).    (9.50)

As the noise level f tends to 1, this expression tends to 1/e (as you can prove using L'Hôpital's rule).
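Equation (9.50) is easy to sanity-check against brute-force maximization of (9.47), and against the f → 1 limit; a Python sketch (NumPy assumed; the function name `p1_star` is ours):

```python
import numpy as np

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def p1_star(f):
    """Optimal input distribution of the Z channel, equation (9.50)."""
    return (1/(1-f)) / (1 + 2**(H2(f)/(1-f)))

# Check against brute-force maximization of I = H2(p1(1-f)) - p1 H2(f):
f = 0.15
grid = np.linspace(0.0, 1.0, 200001)
numeric = grid[np.argmax([H2(p*(1-f)) - p*H2(f) for p in grid])]
print(p1_star(f), numeric)   # both ~0.445
```

Evaluating `p1_star` at f close to 1 (say f = 0.999) gives ≈ 0.368, consistent with the 1/e limit.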
using 0.3 0.2 p values For all f , of intuition for why input is smaller than 1 = 2. A rough 1 0.1 1 is used less than input 0 is that when input 1 is used, the noisy channel 0 0 0.2 0.3 0.4 0.5 0.1 0.7 0.8 0.9 1 0.6 the injects entrop y into the receiv ed string; whereas when input 0 is used, has y. noise zero entrop 9.11 Figure . Capacities of the Z channel, binary symmetric three are channels capacities The . (p.155) 9.16 to exercise Solution of the channel, and binary erasure channel. 5, the : 0 f < For any 9.11. in gure wn channel sho BEC with highest is the capacit y and the BSC the lowest. Solution to exercise 9.18 (p.155) . The logarithm of the posterior probabilit y ratio, given , is y ; ; ) Q ( y j y x = 1 P ) j = 1 x ( y; ; (9.51) = 2 : = ln ) = ln y ( a 2 x ( P ) Q ( y y; ; x = 1 ; ; ) j 1 = j

Using our skills picked up from exercise 2.17 (p.36), we rewrite this in the form

  P(x=1 | y, α, σ) = 1 / (1 + e^{-a(y)}).   (9.52)

The optimal decoder selects the most probable hypothesis; this can be done simply by looking at the sign of a(y). If a(y) > 0 then decode as x̂ = 1.

The probability of error is

  p_b = ∫_{-∞}^{0} dy Q(y | x=1, α, σ) = ∫_{-∞}^{-α/σ} dz (1/√(2π)) e^{-z²/2} = Φ(-α/σ).   (9.53)

Random coding

Solution to exercise 9.20 (p.156). The probability that S = 24 people whose birthdays are drawn at random from A = 365 days all have distinct birthdays is

  A(A-1)(A-2) ⋯ (A-S+1) / A^S.   (9.54)

The probability that two (or more) people share a birthday is one minus this quantity, which, for S = 24 and A = 365, is about 0.5. This exact way of answering the question is not very informative since it is not clear for what value of S the probability changes from being close to 0 to being close to 1.

The number of pairs is S(S-1)/2, and the probability that a particular pair shares a birthday is 1/A, so the expected number of collisions is

  S(S-1)/2 · 1/A.   (9.55)

This answer is more instructive. The expected number of collisions is tiny if S ≪ √A and big if S ≫ √A.

We can also approximate the probability that all S birthdays are distinct, for small S, thus:

  A(A-1)(A-2) ⋯ (A-S+1) / A^S = (1)(1 - 1/A)(1 - 2/A) ⋯ (1 - (S-1)/A)
    ≃ exp(0) exp(-1/A) exp(-2/A) ⋯ exp(-(S-1)/A)   (9.56)
    ≃ exp( - Σ_{i=1}^{S-1} i/A ) = exp( - S(S-1)/(2A) ).   (9.57)
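The exact product (9.54), the expected-collision count (9.55), and the approximation (9.57) can be compared directly; a short Python sketch (the function names are our own):

```python
import math

def p_distinct_exact(S, A=365):
    """Exact probability that S birthdays drawn from A days are all distinct, eq. (9.54)."""
    p = 1.0
    for i in range(S):
        p *= (A - i) / A
    return p

def p_distinct_approx(S, A=365):
    """Approximation exp(-S(S-1)/(2A)) from eq. (9.57)."""
    return math.exp(-S * (S - 1) / (2 * A))

def expected_collisions(S, A=365):
    """Expected number of colliding pairs, S(S-1)/(2A), eq. (9.55)."""
    return S * (S - 1) / (2 * A)

# For S = 24 and A = 365 the collision probability is about 0.5,
# and the approximation tracks the exact answer closely.
p_exact = 1 - p_distinct_exact(24)
p_approx = 1 - p_distinct_approx(24)
```

Running this gives p_exact ≈ 0.54 against p_approx ≈ 0.53, and an expected collision count of 24·23/730 ≈ 0.76, consistent with the S ≈ √A rule of thumb.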

About Chapter 10

Before reading Chapter 10, you should have read Chapters 4 and 9. Exercise 9.14 (p.153) is especially recommended.

Cast of characters

  Q          the noisy channel
  C          the capacity of the channel
  X^N        an ensemble used to create a random code
  C          a random code
  N          the length of the codewords
  x^(s)      a codeword, the s-th in the code
  s          the number of a chosen codeword (mnemonic: the source selects s)
  S = 2^K    the total number of codewords in the code
  K = log_2 S  the number of bits conveyed by the choice of one codeword from S, assuming it is chosen with uniform probability
  s          a binary representation of the number s
  R = K/N    the rate of the code, in bits per channel use (sometimes called R' instead)
  ŝ          the decoder's guess of s

10  The Noisy-Channel Coding Theorem

10.1 The theorem

The theorem has three parts, two positive and one negative. The main positive result is the first.

1. For every discrete memoryless channel, the channel capacity

     C = max_{P_X} I(X;Y)   (10.1)

   has the following property. For any ε > 0 and R < C, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is < ε.

2. If a probability of bit error p_b is acceptable, rates up to R(p_b) are achievable, where

     R(p_b) = C / (1 - H_2(p_b)).   (10.2)

3. For any p_b, rates greater than R(p_b) are not achievable.

[Figure 10.1. Portion of the R, p_b plane to be proved achievable (1, 2) and not achievable (3).]

10.2 Jointly-typical sequences

We formalize the intuitive preview of the last chapter.

We will define codewords x^(s) as coming from an ensemble X^N, and consider the random selection of one codeword and a corresponding channel output y, thus defining a joint ensemble (XY)^N. We will use a typical-set decoder, which decodes a received signal y as s if x^(s) and y are jointly typical, a term to be defined shortly.

The proof will then centre on determining the probabilities (a) that the true input codeword is not jointly typical with the output sequence; and (b) that a false input codeword is jointly typical with the output. We will show that, for large N, both probabilities go to zero as long as there are fewer than 2^{NC} codewords, and the ensemble X is the optimal input distribution.

Joint typicality. A pair of sequences x, y of length N are defined to be jointly typical (to tolerance β) with respect to the distribution P(x,y) if

  x is typical of P(x),     i.e., | (1/N) log 1/P(x) - H(X) | < β,
  y is typical of P(y),     i.e., | (1/N) log 1/P(y) - H(Y) | < β,
  and x,y is typical of P(x,y), i.e., | (1/N) log 1/P(x,y) - H(X,Y) | < β.

The jointly-typical set J_{Nβ} is the set of all jointly-typical sequence pairs of length N.

Example. Here is a jointly-typical pair of length N = 100 for the ensemble P(x,y) in which P(x) has (p_0, p_1) = (0.9, 0.1) and P(y|x) corresponds to a binary symmetric channel with noise level 0.2.

  x: 1111111111000000000000000000000...
  y: 0011111111000000000000000000000...

Notice that x has 10 1s, and so is typical of the probability P(x) (at any tolerance); and y has 26 1s, so it is typical of P(y) (because P(y=1) = 0.26); and x and y differ in 20 bits, which is the typical number of flips for this channel.

Joint typicality theorem. Let x, y be drawn from the ensemble (XY)^N defined by

  P(x,y) = Π_{n=1}^{N} P(x_n, y_n).

Then

1. the probability that x, y are jointly typical (to tolerance β) tends to 1 as N → ∞;

2. the number of jointly-typical sequences |J_{Nβ}| is close to 2^{N H(X,Y)}. To be precise,

     |J_{Nβ}| ≤ 2^{N(H(X,Y)+β)};   (10.3)

3. if x' ~ X^N and y' ~ Y^N, i.e., x' and y' are independent samples with the same marginal distribution as P(x,y), then the probability that (x', y') lands in the jointly-typical set is about 2^{-N I(X;Y)}. To be precise,

     P((x', y') ∈ J_{Nβ}) ≤ 2^{-N(I(X;Y) - 3β)}.   (10.4)

Proof. The proof of parts 1 and 2 by the law of large numbers follows that of the source coding theorem in Chapter 4. For part 2, let the pair x, y play the role of x in the source coding theorem, replacing P(x) there by the probability distribution P(x,y).

For the third part,

  P((x', y') ∈ J_{Nβ}) = Σ_{(x,y) ∈ J_{Nβ}} P(x) P(y)   (10.5)
    ≤ |J_{Nβ}| 2^{-N(H(X)-β)} 2^{-N(H(Y)-β)}   (10.6)
    ≤ 2^{N(H(X,Y)+β) - N(H(X)+H(Y)-2β)}   (10.7)
    = 2^{-N(I(X;Y) - 3β)}.  □   (10.8)

A cartoon of the jointly-typical set is shown in figure 10.2. Two independent typical vectors are jointly typical with probability

  P((x', y') ∈ J_{Nβ}) ≃ 2^{-N I(X;Y)}   (10.9)

because the total number of independent typical pairs is the area of the dashed rectangle, 2^{N H(X)} 2^{N H(Y)}, and the number of jointly-typical pairs is roughly 2^{N H(X,Y)}, so the probability of hitting a jointly-typical pair is roughly

  2^{N H(X,Y)} / 2^{N H(X) + N H(Y)} = 2^{-N I(X;Y)}.   (10.10)
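The entropy accounting behind (10.9)–(10.10) can be made concrete for the example ensemble above (p_1 = 0.1, binary symmetric channel with f = 0.2). A small Python sketch, using only the chain rule H(X,Y) = H(X) + H(Y|X):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    if p <= 0 or p >= 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Ensemble of the worked example: P(x=1) = 0.1, binary symmetric channel f = 0.2.
p1, f = 0.1, 0.2
HX = H2(p1)
p_y1 = p1 * (1 - f) + (1 - p1) * f      # P(y=1) = 0.26
HY = H2(p_y1)
HXY = HX + H2(f)                        # chain rule: H(X,Y) = H(X) + H(Y|X)
I = HX + HY - HXY                       # mutual information I(X;Y)

# Per-pair exponent of eq. (10.10): hitting probability ~ 2^(-N I(X;Y)).
N = 100
log2_p_hit = N * (HXY - HX - HY)        # equals -N I(X;Y)
```

For this ensemble I(X;Y) ≈ 0.10 bits, so at N = 100 an independent typical pair lands in the jointly-typical set with probability roughly 2^{-10}.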

[Figure 10.2. The jointly-typical set. The horizontal direction represents A_X^N, the set of all input strings of length N. The vertical direction represents A_Y^N, the set of all output strings of length N. The outer box contains all conceivable input-output pairs. Each dot represents a jointly-typical pair of sequences (x, y). The total number of jointly-typical sequences is about 2^{N H(X,Y)}.]

10.3 Proof of the noisy-channel coding theorem

Analogy

Imagine that we wish to prove that there is a baby in a class of one hundred babies who weighs less than 10 kg. Individual babies are difficult to catch and weigh. Shannon's method of solving the task is to scoop up all the babies and weigh them all at once on a big weighing machine. If we find that their average weight is smaller than 10 kg, there must exist at least one baby who weighs less than 10 kg; indeed there must be many! Shannon's method isn't guaranteed to reveal the existence of an underweight child, since it relies on there being a tiny number of elephants in the class. But if we use his method and get a total weight smaller than 1000 kg then our task is solved.

[Figure 10.3. Shannon's method for proving one baby weighs less than 10 kg.]

From skinny children to fantastic codes

We wish to show that there exists a code and a decoder having small probability of error. Evaluating the probability of error of any particular coding and decoding system is not easy. Shannon's innovation was this: instead of constructing a good coding and decoding system and evaluating its error probability, Shannon calculated the average probability of block error of all codes, and proved that this average is small. There must then exist individual codes that have small probability of block error.

Random coding and typical-set decoding

Consider the following encoding-decoding system, whose rate is R'.

1. We fix P(x) and generate the S = 2^{N R'} codewords of a (N, N R') = (N, K) code C at random according to

     P(x) = Π_{n=1}^{N} P(x_n).   (10.11)

   A random code is shown schematically in figure 10.4a.

2. The code is known to both sender and receiver.

3. A message s is chosen from {1, 2, ..., 2^{N R'}}, and x^(s) is transmitted. The received signal is y, with

     P(y | x^(s)) = Π_{n=1}^{N} P(y_n | x_n^(s)).   (10.12)

4. The signal is decoded by typical-set decoding.

   Typical-set decoding. Decode y as ŝ if (x^(ŝ), y) are jointly typical and there is no other s' such that (x^(s'), y) are jointly typical; otherwise declare a failure (ŝ = 0).

   This is not the optimal decoding algorithm, but it will be good enough, and easier to analyze. The typical-set decoder is illustrated in figure 10.4b.

5. A decoding error occurs if ŝ ≠ s.

[Figure 10.4. (a) A random code. (b) Example decodings by the typical-set decoder. A sequence that is not jointly typical with any of the codewords, such as y_a, is decoded as ŝ = 0. A sequence that is jointly typical with codeword x^(3) alone, y_b, is decoded as ŝ = 3. Similarly, y_c is decoded as ŝ = 4. A sequence that is jointly typical with more than one codeword, such as y_d, is decoded as ŝ = 0.]

There are three probabilities of error that we can distinguish. First, there is the probability of block error for a particular code C, that is,

  p_B(C) ≡ P(ŝ ≠ s | C).   (10.13)

This is a difficult quantity to evaluate for any given code.

Second, there is the average over all codes of this block error probability,

  ⟨p_B⟩ ≡ Σ_C P(ŝ ≠ s | C) P(C).   (10.14)

Fortunately, this quantity is much easier to evaluate than the first quantity: ⟨p_B⟩ is just the probability that there is a decoding error at step 5 of the five-step process on the previous page.

Third, there is the maximal block error probability of a code C,

  p_BM(C) ≡ max_s P(ŝ ≠ s | s, C).   (10.15)

This is the quantity we are most interested in: we wish to show that there exists a code C with the required rate whose maximal block error probability is small.

We will get to this result by first finding the average block error probability, ⟨p_B⟩. Once we have shown that this can be made smaller than a desired small number, we immediately deduce that there must exist at least one code C whose block error probability is also less than this small number. Finally, we show that this code, whose block error probability is satisfactorily small but whose maximal block error probability is unknown (and could conceivably be enormous), can be modified to make a code of slightly smaller rate whose maximal block error probability is also guaranteed to be small. We modify the code by throwing away the worst 50% of its codewords.

We therefore now embark on finding the average probability of block error.

Probability of error of typical-set decoder

There are two sources of error when we use typical-set decoding. Either (a) the output y is not jointly typical with the transmitted codeword x^(s), or (b) there is some other codeword in C that is jointly typical with y.

By the symmetry of the code construction, the average probability of error averaged over all codes does not depend on the selected value of s; we can assume without loss of generality that s = 1.

(a) The probability that the input x^(1) and the output y are not jointly typical vanishes, by the joint typicality theorem's first part (p.163). We give a name, δ, to the upper bound on this probability, satisfying δ → 0 as N → ∞; for any desired δ, we can find a blocklength N(δ) such that P((x^(1), y) ∉ J_{Nβ}) ≤ δ.

(b) The probability that x^(s') and y are jointly typical, for a given s' ≠ 1, is ≤ 2^{-N(I(X;Y)-3β)}, by part 3. And there are (2^{N R'} - 1) rival values of s' to worry about.

Thus the average probability of error ⟨p_B⟩ satisfies:

  ⟨p_B⟩ ≤ δ + Σ_{s'=2}^{2^{N R'}} 2^{-N(I(X;Y)-3β)}   (10.16)
       ≤ δ + 2^{-N(I(X;Y) - R' - 3β)}.   (10.17)

The inequality (10.16), which bounds a total probability of error P_TOT by the sum of the probabilities P_{s'} of all sorts of events s', each of which is sufficient to cause error,

  P_TOT ≤ P_1 + P_2 + ⋯,

is called a union bound. It is only an equality if the different events that cause error never occur at the same time as each other.

The average probability of error (10.17) can be made < 2δ by increasing N if

  R' < I(X;Y) - 3β.   (10.18)

We are almost there. We make three modifications:

1. We choose P(x) in the proof to be the optimal input distribution of the channel. Then the condition R' < I(X;Y) - 3β becomes R' < C - 3β.

[Figure 10.5. How expurgation works. (a) In a typical random code, a small fraction of the codewords are involved in collisions: pairs of codewords are sufficiently close to each other that the probability of error when either codeword is transmitted is not tiny. We obtain a new code from a random code by deleting all these confusable codewords. (b) The resulting code has slightly fewer codewords, so has a slightly lower rate, and its maximal probability of error is greatly reduced.]

2. Since the average probability of error over all codes is < 2δ, there must exist a code with mean probability of block error p_B(C) < 2δ.

3. To show that not only the average but also the maximal probability of error, p_BM, can be made small, we modify this code by throwing away the worst half of the codewords, the ones most likely to produce errors. Those that remain must all have conditional probability of error less than 4δ. We use these remaining codewords to define a new code. This new code has 2^{N R' - 1} codewords, i.e., we have reduced the rate from R' to R' - 1/N (a negligible reduction, if N is large), and achieved p_BM < 4δ.

This trick is called expurgation (figure 10.5). The resulting code may not be the best code of its rate and length, but it is still good enough to prove the noisy-channel coding theorem, which is what we are trying to do here.

In conclusion, we can 'construct' a code of rate R' - 1/N, where R' < C - 3β, with maximal probability of error < 4δ. We obtain the theorem as stated by setting R' = (R + C)/2, δ = ε/4, β < (C - R')/3, and N sufficiently large for the remaining conditions to hold. The theorem's first part is thus proved.

[Figure 10.6. Portion of the R, p_b plane proved achievable in the first part of the theorem.]

10.4 Communication (with errors) above capacity

We have proved, for any discrete memoryless channel, the achievability of a portion of the R, p_b plane shown in figure 10.6. We have shown that we can turn any noisy channel into an essentially noiseless binary channel with rate up to C bits per cycle. [We've proved that the maximal probability of block error p_BM can be made arbitrarily small, so the same goes for the bit error probability p_b, which must be smaller than p_BM.] We now extend the right-hand boundary of the region of achievability at non-zero error probabilities. [This is called rate-distortion theory.]

We do this with a new trick. Since we know we can make the noisy channel into a perfect channel with a smaller rate, it is sufficient to consider communication with errors over a noiseless channel. How fast can we communicate over a noiseless channel, if we are allowed to make errors?

Consider a noiseless binary channel, and assume that we force communication at a rate greater than its capacity of 1 bit. For example, if we require the sender to communicate at R = 2 bits per cycle then he must effectively throw away half of the information. What is the best way to do this if the aim is to achieve the smallest possible probability of bit error?

One simple strategy is to communicate a fraction 1/R of the source bits, and ignore the rest. The receiver guesses the missing fraction 1 - 1/R at random, and the average probability of bit error is

  p_b = (1/2)(1 - 1/R).   (10.19)

The curve corresponding to this strategy is shown by the dashed line in figure 10.7. We can do better than this (in terms of minimizing p_b) by spreading out the risk of corruption evenly among all the bits. In fact, we can achieve p_b = H_2^{-1}(1 - 1/R), which is shown by the solid curve in figure 10.7. So, how can this optimum be achieved?

[Figure 10.7. A simple bound on achievable points (R, p_b), and Shannon's bound.]

We reuse a tool that we just developed, namely the (N, K) code for a noisy channel, and we turn it on its head, using the decoder to define a lossy compressor. Specifically, we take an excellent (N, K) code for the binary symmetric channel. Assume that such a code has a rate R' = K/N, and that it is capable of correcting errors introduced by a binary symmetric channel whose transition probability is q. Asymptotically, rate-R' codes exist that have R' ≃ 1 - H_2(q). Recall that, if we attach one of these capacity-achieving codes of length N to a binary symmetric channel then (a) the probability distribution over the outputs is close to uniform, since the entropy of the output is equal to the entropy of the source (N R') plus the entropy of the noise (N H_2(q)), and (b) the optimal decoder of the code, in this situation, typically maps a received vector of length N to a transmitted vector differing in qN bits from the received vector.

We take the signal that we wish to send, and chop it into blocks of length N (yes, N, not K).
We pass each block through the decoder, and obtain a shorter signal of length K bits, which we communicate over the noiseless channel. To decode the transmission, we pass the K bit message to the encoder of the original code. The reconstituted message will now differ from the original message in some of its bits; typically qN of them. So the probability of bit error will be p_b = q. The rate of this lossy compressor is R = N/K = 1/R' = 1/(1 - H_2(p_b)).

Now, attaching this lossy compressor to our capacity-C error-free communicator, we have proved the achievability of communication up to the curve (p_b, R) defined by:

  R = C / (1 - H_2(p_b)).   (10.20)

For further reading about rate-distortion theory, see Gallager (1968), p. 451, or McEliece (2002), p. 75.

10.5 The non-achievable region (part 3 of the theorem)

The source, encoder, noisy channel and decoder define a Markov chain:

  s → x → y → ŝ,   P(s, x, y, ŝ) = P(s) P(x|s) P(y|x) P(ŝ|y).   (10.21)

The data processing inequality (exercise 8.9, p.141) must apply to this chain: I(s; ŝ) ≤ I(x; y). Furthermore, by the definition of channel capacity, I(x; y) ≤ NC, so I(s; ŝ) ≤ NC.

Assume that a system achieves a rate R and a bit error probability p_b; then the mutual information I(s; ŝ) is ≥ NR(1 - H_2(p_b)). But I(s; ŝ) > NC is not achievable, so R > C/(1 - H_2(p_b)) is not achievable.

Exercise 10.1.[3] Fill in the details in the preceding argument. If the bit errors between ŝ and s are independent then we have I(s; ŝ) = NR(1 - H_2(p_b)).

What if we have complex correlations among those bit errors? Why does the inequality I(s; ŝ) ≥ NR(1 - H_2(p_b)) hold?

10.6 Computing capacity

[Sections 10.6-10.8 contain advanced material. The first-time reader is encouraged to skip to section 10.9 (p.172).]

We have proved that the capacity of a channel is the maximum rate at which reliable communication can be achieved. How can we compute the capacity of a given discrete memoryless channel? We need to find its optimal input distribution. In general we can find the optimal input distribution by a computer search, making use of the derivative of the mutual information with respect to the input probabilities.

Exercise 10.2.[2] Find the derivative of I(X;Y) with respect to the input probability p_i, ∂I(X;Y)/∂p_i, for a channel with conditional probabilities Q_{j|i}.

Exercise 10.3.[2] Show that I(X;Y) is a concave _ function of the input probability vector p.

Since I(X;Y) is concave _ in the input distribution p, any probability distribution p at which I(X;Y) is stationary must be a global maximum of I(X;Y). So it is tempting to put the derivative of I(X;Y) into a routine that finds a local maximum of I(X;Y), that is, an input distribution P(x) such that

  ∂I(X;Y)/∂p_i = λ for all i,   (10.22)

where λ is a Lagrange multiplier associated with the constraint Σ_i p_i = 1. However, this approach may fail to find the right answer, because I(X;Y) might be maximized by a distribution that has p_i = 0 for some inputs. A simple example is given by the ternary confusion channel.

Ternary confusion channel. A_X = {0, ?, 1}. A_Y = {0, 1}.

  P(y=0 | x=0) = 1;   P(y=0 | x=?) = 1/2;   P(y=0 | x=1) = 0;
  P(y=1 | x=0) = 0;   P(y=1 | x=?) = 1/2;   P(y=1 | x=1) = 1.

Whenever the input ? is used, the output is random; the other inputs are reliable inputs. The maximum information rate of 1 bit is achieved by making no use of the input ?.

Exercise 10.4.[2, p.173] Sketch the mutual information for this channel as a function of the input distribution p. Pick a convenient two-dimensional representation of p.

The optimization routine must therefore take account of the possibility that, as we go uphill on I(X;Y), we may run into the inequality constraints p_i ≥ 0.

Exercise 10.5.[2, p.174] Describe the condition, similar to equation (10.22), that is satisfied at a point where I(X;Y) is maximized, and describe a computer program for finding the capacity of a channel.
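One standard answer to exercise 10.5, not named in this chapter, is the Blahut-Arimoto algorithm, which alternates between computing the output distribution induced by the current input distribution and reweighting the inputs; a minimal Python sketch (convergence details glossed over):

```python
import math

def blahut_arimoto(Q, iters=2000):
    """Capacity (in bits) of a discrete memoryless channel.

    Q[i][j] = P(y=j | x=i).  Alternate between the induced output
    distribution q and a multiplicative update of the input distribution p.
    """
    nx, ny = len(Q), len(Q[0])
    p = [1.0 / nx] * nx
    for _ in range(iters):
        # Output distribution induced by p: q[j] = sum_i p[i] Q[i][j].
        q = [sum(p[i] * Q[i][j] for i in range(nx)) for j in range(ny)]
        # w[i] = exp( D(Q[i] || q) ), the per-input relative entropy (nats).
        w = []
        for i in range(nx):
            d = sum(Q[i][j] * math.log(Q[i][j] / q[j])
                    for j in range(ny) if Q[i][j] > 0)
            w.append(math.exp(d))
        Z = sum(p[i] * w[i] for i in range(nx))
        p = [p[i] * w[i] / Z for i in range(nx)]
    return math.log2(Z), p  # log Z converges to the capacity

# Ternary confusion channel: inputs {0, ?, 1}, outputs {0, 1}.
Q_ternary = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
C_ternary, p_opt = blahut_arimoto(Q_ternary)

# Binary symmetric channel with f = 0.15: capacity 1 - H2(0.15).
C_bsc, _ = blahut_arimoto([[0.85, 0.15], [0.15, 0.85]])
```

On the ternary confusion channel the iteration drives the probability of the input ? to zero and returns a capacity of 1 bit, exactly as claimed above; the updates automatically respect the constraints p_i ≥ 0 because they are multiplicative.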

182 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 10 | The Coding Theorem 170 Noisy-Channel may that optimal input distribution help Results in nding the be used. must outputs 1. All parameters. Y ) is a con 2. ^ function of the channel I Reminder: The term `con vex ^ ' ( X ; vex and the term means `con vex', the look all they but distributions, input optimal may be several 3. There _ ve `conca `conca ve'; the ' means are bols sym frown and smile little same at the output. you included simply to remind ve mean. what vex and con conca ] 2 [ is unused ve that no output y 10.6. by an optimal input distri- Exercise Pro . it is unreac bution, that is, has Q ( y j x ) = 0 for all x . unless hable, 2 ] [ 10.7. Pro ve that I ( X ; Y ) is a con vex ^ function of Q ( y Exercise x ). j [ 2 ] Exercise Pro ve that all optimal input distributions of a channel have 10.8. P Q P ( x ) ). ( y j x same y ) = output the y distribution P ( probabilit x of with the fact These I ( X ; Y ) is a conca ve _ function along results, that input y vector p , pro ve the probabilit y of the symmetry argumen t the validit we have used when nding the capacit y of symmetric that If a channels. channel t under a group of symmetry operations { for example, is invarian the hanging sym bols and interc hanging input output sym bols { then, interc the input distribution that is not symmetric, i.e., is not invarian t given any optimal these under we can create another input distribution by averaging operations, its this together distribution and all optimal perm uted forms that we input can mak e by applying the symmetry operations to the original optimal input distribution. 
The permuted distributions must have the same I(X;Y) as the original, so the new input distribution created by averaging must have I(X;Y) bigger than or equal to that of the original distribution, because of the concavity of I.

Symmetric channels

In order to use symmetry arguments, it will help to have a definition of a symmetric channel. I like Gallager's (1968) definition.

  A discrete memoryless channel is a symmetric channel if the set of outputs can be partitioned into subsets in such a way that for each subset the matrix of transition probabilities has the property that each row (if more than 1) is a permutation of each other row and each column is a permutation of each other column.

Example 10.9. This channel

  P(y=0 | x=0) = 0.7;   P(y=0 | x=1) = 0.1;
  P(y=? | x=0) = 0.2;   P(y=? | x=1) = 0.2;   (10.23)
  P(y=1 | x=0) = 0.1;   P(y=1 | x=1) = 0.7;

is a symmetric channel because its outputs can be partitioned into (0, 1) and ?, so that the matrix can be rewritten:

  P(y=0 | x=0) = 0.7;   P(y=0 | x=1) = 0.1;
  P(y=1 | x=0) = 0.1;   P(y=1 | x=1) = 0.7;   (10.24)
  P(y=? | x=0) = 0.2;   P(y=? | x=1) = 0.2.
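For the channel of example 10.9 we can confirm numerically that the uniform input distribution is optimal, as exercise 10.10 asserts for symmetric channels; a quick Python grid search (variable names are our own):

```python
import math

# Transition matrix of the channel in example 10.9:
# rows are inputs 0 and 1, columns are outputs 0, ?, 1.
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.2, 0.7]]

def mutual_information(p0):
    """I(X;Y) in bits for the input distribution (p0, 1 - p0)."""
    p = [p0, 1 - p0]
    info = 0.0
    for j in range(3):
        qj = sum(p[i] * Q[i][j] for i in range(2))
        for i in range(2):
            if p[i] > 0 and Q[i][j] > 0:
                info += p[i] * Q[i][j] * math.log2(Q[i][j] / qj)
    return info

# Scan a fine grid: the maximum should sit at the uniform input p0 = 1/2.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=mutual_information)
capacity = mutual_information(0.5)
```

The search returns p_0 = 1/2 and a capacity of about 0.37 bits; note that the ? output contributes nothing to the mutual information, matching the partition used in (10.24).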

Symmetry is a useful property because, as we will see in a later chapter, communication at capacity can be achieved over symmetric channels by linear codes.

Exercise 10.10.[2] Prove that for a symmetric channel with any number of inputs, the uniform distribution over the inputs is an optimal input distribution.

Exercise 10.11.[2, p.174] Are there channels that are not symmetric whose optimal input distributions are uniform? Find one, or prove there are none.

10.7 Other coding theorems

The noisy-channel coding theorem that we proved in this chapter is quite general, applying to any discrete memoryless channel; but it is not very specific. The theorem only says that reliable communication with error probability ε and rate R can be achieved by using codes with sufficiently large blocklength N. The theorem does not say how large N needs to be to achieve given values of R and ε. Presumably, the smaller ε is and the closer R is to C, the larger N has to be.

Noisy-channel coding theorem (version with explicit N-dependence)

  For a discrete memoryless channel, a blocklength N and a rate R, there exist block codes of length N whose average probability of block error satisfies:

    p_B ≤ exp[-N E_r(R)]   (10.25)

  where E_r(R) is the random-coding exponent of the channel, a convex ^, decreasing, positive function of R for 0 ≤ R < C. The random-coding exponent is also known as the reliability function.

  [By an expurgation argument it can also be shown that there exist block codes for which the maximal probability of error p_BM is also exponentially small in N.]

[Figure 10.8. A typical random-coding exponent.]

The definition of E_r(R) is given in Gallager (1968), p. 139. E_r(R) approaches zero as R → C; the typical behaviour of this function is illustrated in figure 10.8. The computation of the random-coding exponent for interesting channels is a challenging task on which much effort has been expended. Even for simple channels like the binary symmetric channel, there is no simple expression for E_r(R).

Lower bounds on error probability as a function of blocklength

The theorem stated above asserts that there are codes with p_B smaller than exp[-N E_r(R)]. But how small can the error probability be? Could it be much smaller?

  For any code with blocklength N on a discrete memoryless channel, the probability of error assuming all source messages are used with equal probability satisfies

    p_B ≳ exp[-N E_sp(R)],   (10.26)

where the function E_sp(R), the sphere-packing exponent of the channel, is a convex, decreasing, positive function of R for 0 <= R < C.

For a precise statement of this result and further references, see Gallager (1968), p. 157.

10.8 Noisy-channel coding theorems and coding practice

Imagine a customer who wants to buy an error-correcting code and decoder for a noisy channel. The results described above allow us to offer the following service: if he tells us the properties of his channel, the desired rate R and the desired error probability p_B, we can, after working out the relevant functions C, E_r(R), and E_sp(R), advise him that there exists a solution to his problem using a particular blocklength N; indeed that almost any randomly chosen code with that blocklength should do the job. Unfortunately we have not found out how to implement these encoders and decoders in practice; the cost of implementing the encoder and decoder for a random code with large N would be exponentially large in N.

Furthermore, for practical purposes, the customer is unlikely to know exactly what channel he is dealing with. So Berlekamp (1980) suggests that the sensible way to approach error-correction is to design encoding-decoding systems and plot their performance on a variety of idealized channels as a function of the channel's noise level. These charts (one of which is illustrated on page 568) can then be shown to the customer, who can choose among the systems on offer without having to specify what he really thinks his channel is like. With this attitude to the practical problem, the importance of the functions E_r(R) and E_sp(R) is diminished.
10.9 Further exercises

Exercise 10.12.[2] A binary erasure channel with input x and output y has transition probability matrix:

        [ 1-q   0  ]
    Q = [  q    q  ]        [outputs 0, ?, 1: each input is erased to ? with probability q]
        [  0   1-q ]

Find the mutual information I(X;Y) between the input and output for general input distribution {p_0, p_1}, and show that the capacity of this channel is C = 1 - q bits.

A Z channel has transition probability matrix:

    Q = [ 1    q  ]
        [ 0   1-q ]

Show that, using a (2,1) code, two uses of a Z channel can be made to emulate one use of an erasure channel, and state the erasure probability of that erasure channel. Hence show that the capacity of the Z channel, C_Z, satisfies C_Z >= (1/2)(1 - q) bits. Explain why the result C_Z >= (1/2)(1 - q) is an inequality rather than an equality.
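The capacities in this exercise can be checked numerically. This editor-added sketch (not part of the original text) computes the capacity of each channel by scanning the input distribution; with q = 0.15 the erasure channel gives 1 - q = 0.85 bits, and the Z channel comfortably exceeds the (2,1)-code bound of (1 - q)/2.

```python
import numpy as np

def mutual_information(Q, p):
    """I(X;Y) in bits; Q[j, i] = P(y_j | x_i)."""
    py = Q @ p
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.nan_to_num(Q * np.log2(Q / py[:, None]))
    return float(p @ terms.sum(axis=0))

def capacity(Q, n=2001):
    """Capacity of a two-input channel by scanning the input prior."""
    return max(mutual_information(Q, np.array([1 - p1, p1]))
               for p1 in np.linspace(0, 1, n))

q = 0.15
bec = np.array([[1 - q, 0], [q, q], [0, 1 - q]])   # outputs 0, ?, 1
zch = np.array([[1, q], [0, 1 - q]])               # outputs 0, 1
print(capacity(bec))              # = 1 - q = 0.85 bits
print(capacity(zch), (1 - q)/2)   # C_Z ~ 0.685 > 0.425
```

The strict gap between C_Z and (1 - q)/2 is the numerical face of the "inequality rather than equality" question: the (2,1) emulation throws information away.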

Exercise 10.13.[3, p.174] A transatlantic cable contains N = 20 indistinguishable electrical wires. You have the job of figuring out which wire is which, that is, to create a consistent labelling of the wires at each end. Your only tools are the ability to connect wires to each other in groups of two or more, and to test for connectedness with a continuity tester. What is the smallest number of transatlantic trips you need to make, and how do you do it?

How would you solve the problem for larger N, such as N = 1000?

As an illustration, if N were 3 then the task can be solved in two steps: labelling one wire at one end a, connecting the other two together, crossing the Atlantic, measuring which two wires are connected, labelling them b and c and the unconnected one a, then connecting b to a and returning across the Atlantic, whereupon on disconnecting b from c, the identities of b and c can be deduced.

This problem can be solved by persistent search, but the reason it is posed in this chapter is that it can also be solved by a greedy approach based on maximizing the acquired information. Let the unknown permutation of wires be x. Having chosen a set of connections of wires C at one end, you can then make measurements at the other end, and these measurements y convey information about x. How much? And for what set of connections is the information that y conveys about x maximized?

10.10 Solutions

Solution to exercise 10.4 (p.169). If the input distribution is p = (p_0, p_?, p_1), the mutual information is

    I(X;Y) = H(Y) - H(Y|X) = H_2(p_0 + p_?/2) - p_?.        (10.27)
We can build a good sketch of this function in two ways: by careful inspection of the function, or by looking at special cases. [Margin: the ternary confusion channel: 0 -> 0; 1 -> 1; ? goes to 0 or 1 with probability 1/2 each.]

For the plots, I will use p_0 and p_1 as the two independent variables, so that p = (p_0, p_?, p_1) = (p_0, (1 - p_0 - p_1), p_1).

By inspection. If we use the two quantities p_* = p_0 + p_?/2 and p_? as our two degrees of freedom, the mutual information becomes very simple: I(X;Y) = H_2(p_*) - p_?. Converting back to p_0 = p_* - p_?/2 and p_1 = 1 - p_* - p_?/2, we obtain the sketch shown at the left below. This function is like a tunnel rising up in the direction of increasing p_0 and p_1. To obtain the required plot of I(X;Y) we have to strip away the parts of this tunnel that live outside the feasible simplex of probabilities; we do this by redrawing the surface, showing only the parts where p_0 > 0 and p_1 > 0. A full plot of the function is shown at the right.

[Two surface plots of I(X;Y) as a function of (p_0, p_1): the full 'tunnel' surface (left) and the same surface restricted to the feasible simplex (right).]
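The closed form I = H_2(p_0 + p_?/2) - p_? can be checked against a direct computation of H(Y) - H(Y|X) for the confusion channel. This is an editor-added sketch, not from the original text.

```python
import numpy as np

def H2(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p*np.log2(p) - (1 - p)*np.log2(1 - p)

# Ternary confusion channel: inputs (0, ?, 1), outputs (0, 1).
Q = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])

def I_direct(p0, pq, p1):
    p = np.array([p0, pq, p1])
    py = Q @ p
    HY = H2(py[0])
    HYgX = p @ np.array([0.0, 1.0, 0.0])  # only input ? contributes H_2(1/2) = 1
    return HY - HYgX

# Compare with the closed form I = H_2(p0 + pq/2) - pq.
p0, pq, p1 = 0.3, 0.2, 0.5
print(I_direct(p0, pq, p1), H2(p0 + pq/2) - pq)   # both ~0.771 bits
```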

Special cases. In the special case p_? = 0, the channel is a noiseless binary channel, and I(X;Y) = H_2(p_0).

In the special case p_0 = p_1, the term H_2(p_0 + p_?/2) is equal to 1, so I(X;Y) = 1 - p_?.

In the special case p_0 = 0, the channel is a Z channel with error probability 0.5. We know how to sketch that, from the previous chapter (figure 9.3).

These special cases allow us to construct the skeleton shown in figure 10.9. [Figure 10.9. Skeleton of the mutual information for the ternary confusion channel.]

Solution to exercise 10.5 (p.169). Necessary and sufficient conditions for p to maximize I(X;Y) are

    dI(X;Y)/dp_i = lambda  and  p_i > 0
    dI(X;Y)/dp_i <= lambda  and  p_i = 0        (10.28)

for all i, where lambda is a constant related to the capacity by C = lambda + log_2 e.

This result can be used in a computer program that evaluates the derivatives and increments and decrements the probabilities p_i in proportion to the differences between those derivatives.

This result is also useful for lazy human capacity-finders who are good guessers. Having guessed the optimal input distribution, one can simply confirm that equation (10.28) holds.

Solution to exercise 10.11 (p.171). We certainly expect nonsymmetric channels with uniform optimal input distributions to exist, since when inventing a channel we have I(J-1) degrees of freedom whereas the optimal input distribution is just (I-1)-dimensional; so in the I(J-1)-dimensional space of perturbations around a symmetric channel, we expect there to be a subspace of perturbations of dimension I(J-1) - (I-1) = I(J-2) + 1 that leave the optimal input distribution unchanged.

Here is an explicit example, a bit like a Z channel.

        [ 0.9585  0.0415  0.35   0    ]
    Q = [ 0.0415  0.9585  0      0.35 ]        (10.29)
        [ 0       0       0.65   0    ]
        [ 0       0       0      0.65 ]

Solution to exercise 10.13 (p.173). The labelling problem can be solved for any N > 2 with just two trips, one each way across the Atlantic.

The key step in the information-theoretic approach to this problem is to write down the information content of one partition, the combinatorial object that is the connecting together of subsets of wires. If N wires are grouped together into g_1 subsets of size 1, g_2 subsets of size 2, ..., then the number of such partitions is

    Omega = N! / ( prod_r (r!)^{g_r} g_r! )        (10.30)

and the information content of one such partition is the log of this quantity. In a greedy strategy we choose the first partition to maximize this information content.

One game we can play is to maximize this information content with respect to the quantities g_r, treated as real numbers, subject to the constraint sum_r g_r r = N. Introducing a Lagrange multiplier lambda for the constraint, the derivative is

    d/dg_r [ log Omega + lambda sum_r g_r r ] = - log r! - log g_r + lambda r,        (10.31)
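The claim that the nonsymmetric channel (10.29) has a uniform optimal input can be verified with condition (10.28): at the optimum, the relative entropies D_i = sum_j Q[j,i] log2( Q[j,i] / P(y_j) ) must be equal for all used inputs. This editor-added check (the matrix is as reconstructed above) shows they are.

```python
import numpy as np

# The nonsymmetric channel of equation (10.29); rows are outputs,
# columns are inputs.
Q = np.array([[0.9585, 0.0415, 0.35, 0.00],
              [0.0415, 0.9585, 0.00, 0.35],
              [0.0000, 0.0000, 0.65, 0.00],
              [0.0000, 0.0000, 0.00, 0.65]])

p = np.full(4, 0.25)                 # uniform input distribution
py = Q @ p
with np.errstate(divide='ignore', invalid='ignore'):
    terms = np.nan_to_num(Q * np.log2(Q / py[:, None]))
D = terms.sum(axis=0)                # one relative entropy per input
print(np.round(D, 3))                # all ~1.318 bits: uniform is optimal
```

Since all four derivatives agree (to the precision of the tabulated 0.9585), condition (10.28) holds at the uniform input, and the capacity is about 1.318 bits.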

which, when set to zero, leads to the rather nice expression

    g_r = e^{lambda r} / r! ;        (10.32)

the optimal g_r is proportional to a Poisson distribution! We can solve for the Lagrange multiplier by plugging g_r into the constraint sum_r g_r r = N, which gives the implicit equation

    N = mu e^mu,        (10.33)

where mu = e^lambda is a convenient reparameterization of the Lagrange multiplier. Figure 10.10a shows a graph of mu(N); figure 10.10b shows the deduced non-integer assignments g_r when mu = 2.2, and nearby integers g_r = {1, 2, 2, 1, 1} that motivate setting the first partition to (a)(bc)(de)(fgh)(ijk)(lmno)(pqrst).

[Figure 10.10. Approximate solution of the cable-labelling problem using Lagrange multipliers. (a) The parameter mu as a function of N; the value mu(20) = 2.2 is highlighted. (b) Non-integer values of the function g_r = mu^r/r! are shown by lines, and integer values of g_r motivated by those non-integer values are shown by crosses.]

This partition produces a random partition at the other end, which has an information content of log Omega = 40.4 bits, which is a lot more than half the total information content we need to acquire to infer the transatlantic permutation, log 20! ~ 61 bits. [In contrast, if all the wires are joined together in pairs, the information content generated is only about 29 bits.] How to choose the second partition is left to the reader. A Shannonesque approach is appropriate, picking a random partition at the other end, using the same {g_r}, to ensure that the two partitions are as unlike each other as possible.

If N <> 2, 5 or 9, then the labelling problem has solutions that are particularly simple to implement, called Knowlton-Graham partitions: partition {1, ..., N} into disjoint sets in two ways A and B, subject to the condition that at most one element appears both in an A set of cardinality j and in a B set of cardinality k, for each j and k (Graham, 1966; Graham and Knowlton, 1968).
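Equations (10.32)-(10.33) are easy to put to work. This editor-added sketch solves N = mu e^mu by bisection and prints the Poisson-shaped partition sizes, recovering mu(20) = 2.2 and the integers {1, 2, 2, 1, 1} used above.

```python
import math

# Solve N = mu * exp(mu) for mu by bisection (equation 10.33).
def solve_mu(N):
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * math.exp(mid) < N:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

N = 20
mu = solve_mu(N)                       # ~2.2 for N = 20
g = {r: mu**r / math.factorial(r) for r in range(1, 8)}
print(round(mu, 2), {r: round(v, 2) for r, v in g.items()})
# Nearby integers g = {1, 2, 2, 1, 1} give the partition
# (a)(bc)(de)(fgh)(ijk)(lmno)(pqrst), using 1+4+6+4+5 = 20 wires.
```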

About Chapter 11

Before reading Chapter 11, you should have read Chapters 9 and 10. You will also need to be familiar with the Gaussian distribution.

One-dimensional Gaussian distribution. If a random variable y is Gaussian and has mean mu and variance sigma^2, which we write:

    y ~ Normal(mu, sigma^2), or P(y) = Normal(y; mu, sigma^2),        (11.1)

then the distribution of y is:

    P(y | mu, sigma^2) = (1 / sqrt(2 pi sigma^2)) exp( -(y - mu)^2 / (2 sigma^2) ).        (11.2)

[I use the symbol P for both probability densities and probabilities.]

The inverse-variance 1/sigma^2 is sometimes called the precision of the Gaussian distribution.

Multi-dimensional Gaussian distribution. If y = (y_1, y_2, ..., y_N) has a multivariate Gaussian distribution, then

    P(y | x, A) = (1 / Z(A)) exp( -(1/2) (y - x)^T A (y - x) ),        (11.3)

where x is the mean of the distribution, A is the inverse of the variance-covariance matrix, and the normalizing constant is Z(A) = ( det(A / 2 pi) )^{-1/2}.

This distribution has the property that the variance Sigma_ii of y_i, and the covariance Sigma_ij of y_i and y_j, are given by

    Sigma_ij = E[ (y_i - ybar_i)(y_j - ybar_j) ] = A^{-1}_ij,        (11.4)

where A^{-1} is the inverse of the matrix A.

The marginal distribution P(y_i) of one component y_i is Gaussian; the joint marginal distribution of any subset of the components is multivariate-Gaussian; and the conditional density of any subset, given the values of another subset, for example, P(y_i | y_j), is also Gaussian.
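Property (11.4) can be checked empirically: sample from a Gaussian with precision matrix A and compare the sample covariance with A^{-1}. This editor-added sketch uses an arbitrary positive-definite A chosen for illustration.

```python
import numpy as np

# Empirical check of equation (11.4): for a Gaussian with inverse
# covariance A, the sample covariance approaches A^{-1}.
rng = np.random.default_rng(0)

A = np.array([[2.0, 0.6],
              [0.6, 1.0]])          # a positive-definite precision matrix
cov = np.linalg.inv(A)              # variance-covariance matrix A^{-1}

samples = rng.multivariate_normal(mean=[0, 0], cov=cov, size=200_000)
emp = np.cov(samples.T)
print(np.round(cov, 3))
print(np.round(emp, 3))             # close to A^{-1}
```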

11

Error-Correcting Codes and Real Channels

The noisy-channel coding theorem that we have proved shows that there exist reliable error-correcting codes for any noisy channel. In this chapter we address two questions.

First, many practical channels have real, rather than discrete, inputs and outputs. What can Shannon tell us about these continuous channels? And how should digital signals be mapped into analogue waveforms, and vice versa?

Second, how are practical error-correcting codes made, and what is achieved in practice, relative to the possibilities proved by Shannon?

11.1 The Gaussian channel

The most popular model of a real-input, real-output channel is the Gaussian channel.

The Gaussian channel has a real input x and a real output y. The conditional distribution of y given x is a Gaussian distribution:

    P(y | x) = (1 / sqrt(2 pi sigma^2)) exp( -(y - x)^2 / (2 sigma^2) ).        (11.5)

This channel has a continuous input and output but is discrete in time. We will show below that certain continuous-time channels are equivalent to the discrete-time Gaussian channel.

This channel is sometimes called the additive white Gaussian noise (AWGN) channel.

As with discrete channels, we will discuss what rate of error-free information communication can be achieved over this channel.

Motivation in terms of a continuous-time channel

Consider a physical (electrical, say) channel with inputs and outputs that are continuous in time. We put in x(t), and out comes y(t) = x(t) + n(t). Our transmission has a power cost. The average power of a transmission of length T may be constrained thus:

    integral_0^T [x(t)]^2 dt / T <= P.        (11.6)

The received signal is assumed to differ from x(t) by additive noise n(t) (for example Johnson noise), which we will model as white Gaussian noise. The magnitude of this noise is quantified by the noise spectral density, N_0.

How could such a channel be used to communicate information? Consider transmitting a set of N real numbers {x_n}_{n=1}^N in a signal of duration T made up of a weighted combination of orthonormal basis functions phi_n(t),

    x(t) = sum_{n=1}^N x_n phi_n(t),        (11.7)

where integral_0^T phi_n(t) phi_m(t) dt = delta_nm. The receiver can then compute the scalars:

    y_n = integral_0^T phi_n(t) y(t) dt = x_n + integral_0^T phi_n(t) n(t) dt        (11.8)
        = x_n + n_n        (11.9)

for n = 1 ... N. If there were no noise, then y_n would equal x_n. The white Gaussian noise n(t) adds scalar noise n_n to the estimate y_n. This noise is Gaussian:

    n_n ~ Normal(0, N_0 / 2),        (11.10)

where N_0 is the spectral density introduced above. Thus a continuous channel used in this way is equivalent to the Gaussian channel defined at equation (11.5). The power constraint integral_0^T [x(t)]^2 dt <= PT defines a constraint on the signal amplitudes x_n,

    sum_n x_n^2 <= PT, i.e., the average of x_n^2 satisfies xbar^2 <= PT / N.        (11.11)

[Figure 11.1. Three basis functions, and a weighted combination of them, x(t) = sum_{n=1}^N x_n phi_n(t), with x_1 = 0.4, x_2 = -0.2, and x_3 = 0.1.]

Before returning to the Gaussian channel, we define the bandwidth (measured in Hertz) of the continuous channel to be:

    W = N_max / (2T),        (11.12)

where N_max is the maximum number of orthonormal functions that can be produced in an interval of length T. This definition can be motivated by imagining creating a band-limited signal of duration T from orthonormal cosine and sine curves of maximum frequency W. The number of orthonormal functions is N_max = 2WT. This definition relates to the Nyquist sampling theorem: if the highest frequency present in a signal is W, then the signal can be fully determined from its values at a series of discrete sample points separated by the Nyquist interval delta_t = 1/(2W) seconds.

So the use of a real continuous channel with bandwidth W, noise spectral density N_0, and power P is equivalent to N/T = 2W uses per second of a Gaussian channel with noise level sigma^2 = N_0 / 2 and subject to the signal power constraint xbar^2 <= P / (2W).

Definition of E_b/N_0

Imagine that the Gaussian channel y_n = x_n + n_n is used with an encoding system to transmit binary source bits at a rate of R bits per channel use. How can we compare two encoding systems that have different rates of communication R and that use different powers xbar^2? Transmitting at a large rate R is good; using small power is good too.

It is conventional to measure the rate-compensated signal-to-noise ratio by the ratio of the power per source bit E_b = xbar^2 / R to the noise spectral density N_0:

    E_b / N_0 = xbar^2 / (2 sigma^2 R).        (11.13)

[Margin: E_b/N_0 is dimensionless, but it is usually reported in the units of decibels; the value given is 10 log_10 E_b/N_0.]

E_b/N_0 is one of the measures used to compare coding schemes for Gaussian channels.
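Equation (11.13) is a one-liner in code. This editor-added helper (names are illustrative) reports the rate-compensated signal-to-noise ratio in decibels.

```python
import math

# Eb/N0 in decibels for a scheme with average signal power xbar2,
# noise variance sigma2 per channel use, and rate R bits/use (eq. 11.13).
def ebn0_db(xbar2, sigma2, R):
    return 10 * math.log10(xbar2 / (2 * sigma2 * R))

# Example: unit signal power, sigma^2 = 0.5, rate 1 bit/use -> 0 dB.
print(ebn0_db(1.0, 0.5, 1.0))       # 0.0
# Halving the rate at the same power doubles Eb/N0 (+3 dB).
print(ebn0_db(1.0, 0.5, 0.5))       # ~3.01
```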

11.2 Inferring the input to a real channel

'The best detection of pulses'

In 1944 Shannon wrote a memorandum (Shannon, 1993) on the problem of best differentiating between two types of pulses of known shape, represented by vectors x_0 and x_1, given that one of them has been transmitted over a noisy channel. This is a pattern recognition problem. It is assumed that the noise is Gaussian with probability density

    P(n) = ( det(A / 2 pi) )^{1/2} exp( -(1/2) n^T A n ),        (11.14)

where A is the inverse of the variance-covariance matrix of the noise, a symmetric and positive-definite matrix. (If A is a multiple of the identity matrix, I/sigma^2, then the noise is 'white'. For more general A, the noise is 'coloured'.) The probability of the received vector y given that the source signal was s (either zero or one) is then

    P(y | s) = ( det(A / 2 pi) )^{1/2} exp( -(1/2) (y - x_s)^T A (y - x_s) ).        (11.15)

[Figure 11.2. Two pulses x_0 and x_1, represented as 31-dimensional vectors, and a noisy version of one of them, y.]

The optimal detector is based on the posterior probability ratio:

    P(s=1 | y) / P(s=0 | y) = [ P(y | s=1) / P(y | s=0) ] [ P(s=1) / P(s=0) ]        (11.16)
      = exp( -(1/2)(y - x_1)^T A (y - x_1) + (1/2)(y - x_0)^T A (y - x_0) + ln[ P(s=1)/P(s=0) ] )
      = exp( y^T A (x_1 - x_0) + theta ),        (11.17)

where theta is a constant independent of the received vector y,

    theta = -(1/2) x_1^T A x_1 + (1/2) x_0^T A x_0 + ln[ P(s=1) / P(s=0) ].        (11.18)

If the detector is forced to make a decision (i.e., guess either s = 1 or s = 0) then the decision that minimizes the probability of error is to guess the most probable hypothesis. We can write the optimal decision in terms of a discriminant function:

    a(y) = y^T A (x_1 - x_0) + theta        (11.19)

[Figure 11.3. The weight vector w, proportional to x_1 - x_0, that is used to discriminate between x_0 and x_1.]

with the decisions

    a(y) > 0 -> guess s = 1
    a(y) < 0 -> guess s = 0        (11.20)
    a(y) = 0 -> guess either.

Notice that a(y) is a linear function of the received vector,

    a(y) = w^T y + theta,        (11.21)

where w = A (x_1 - x_0).

11.3 Capacity of Gaussian channel

Until now we have measured the joint, marginal, and conditional entropy of discrete variables only. In order to define the information conveyed by continuous variables, there are two issues we must address: the infinite length of the real line, and the infinite precision of real numbers.
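The detector (11.19)-(11.21) can be simulated directly. This editor-added sketch assumes white noise (A = I/sigma^2) and equal priors, and uses two toy sinusoidal pulses as stand-ins for the pulses of figure 11.2; none of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two known pulse shapes (toy stand-ins for figure 11.2's pulses).
t = np.linspace(0, 1, 31)
x0 = np.sin(2 * np.pi * t)
x1 = np.sin(4 * np.pi * t)

sigma2 = 0.25                     # white noise: A = I / sigma2
prior_ratio = 1.0                 # equal priors P(s=1)/P(s=0)

w = (x1 - x0) / sigma2            # w = A (x1 - x0), equation (11.21)
theta = (x0 @ x0 - x1 @ x1) / (2 * sigma2) + np.log(prior_ratio)

def detect(y):
    """Guess s from the linear discriminant a(y) = w.y + theta."""
    return int(w @ y + theta > 0)

# Monte Carlo error rate when s = 0 is sent repeatedly.
trials = 10_000
errors = sum(detect(x0 + rng.normal(0, np.sqrt(sigma2), 31))
             for _ in range(trials))
print(errors / trials)            # very small error probability
```

Because a(y) is linear in y, the detector is just a matched filter followed by a threshold, which is the point of equation (11.21).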

Infinite inputs

How much information can we convey in one use of a Gaussian channel? If we are allowed to put any real number x into the Gaussian channel, we could communicate an enormous string of N digits d_1 d_2 d_3 ... d_N by setting x = d_1 d_2 d_3 ... d_N 000...000. The amount of error-free information conveyed in just a single transmission could be made arbitrarily large by increasing N, and the communication could be made arbitrarily reliable by increasing the number of zeroes at the end of x. There is usually some power cost associated with large inputs, however, not to mention practical limits in the dynamic range acceptable to a receiver. It is therefore conventional to introduce a cost function v(x) for every input x, and constrain codes to have an average cost vbar less than or equal to some maximum value. A generalized channel coding theorem, including a cost function for the inputs, can be proved; see McEliece (1977). The result is a channel capacity C(vbar) that is a function of the permitted cost. For the Gaussian channel we will assume a cost

    v(x) = x^2        (11.22)

such that the 'average power' xbar^2 of the input is constrained. We motivated this cost function above in the case of real electrical channels in which the physical power consumption is indeed quadratic in x. The constraint xbar^2 = v makes it impossible to communicate infinite information in one use of the Gaussian channel.

[Figure 11.4. (a) A probability density P(x). Question: can we define the 'entropy' of this density? (b) The density discretized into intervals of grain-size g.]

Infinite precision
It is tempting to define joint, marginal, and conditional entropies for real variables simply by replacing summations by integrals, but this is not a well-defined operation. As we discretize an interval into smaller and smaller divisions, the entropy of the discrete distribution diverges (as the logarithm of the granularity) (figure 11.4). Also, it is not permissible to take the logarithm of a dimensional quantity such as a probability density P(x) (whose dimensions are [x]^{-1}).

[Margin: We could evaluate the entropies of a sequence of probability distributions with decreasing grain-size g, but these entropies tend to integral P(x) log (1 / (P(x) g)) dx, which is not independent of g: the entropy goes up by one bit for every halving of g. The expression integral P(x) log (1 / P(x)) dx is an illegal integral.]

There is one information measure, however, that has a well-behaved limit, namely the mutual information, and this is the one that really matters, since it measures how much information one variable conveys about another. In the discrete case,

    I(X;Y) = sum_{x,y} P(x,y) log [ P(x,y) / (P(x) P(y)) ].        (11.23)

Now because the argument of the log is a ratio of two probabilities over the same space, it is OK to have P(x,y), P(x) and P(y) be probability densities and replace the sum by an integral:

    I(X;Y) = integral dx dy P(x,y) log [ P(x,y) / (P(x) P(y)) ]        (11.24)
           = integral dx dy P(x) P(y|x) log [ P(y|x) / P(y) ].        (11.25)

We can now ask these questions for the Gaussian channel: (a) what probability distribution P(x) maximizes the mutual information (subject to the constraint xbar^2 = v)? and (b) does the maximal mutual information still measure the maximum error-free communication rate of this real channel, as it did for the discrete channel?

Exercise 11.1.[3, p.189] Prove that the probability distribution P(x) that maximizes the mutual information (subject to the constraint xbar^2 = v) is a Gaussian distribution of mean zero and variance v.

Exercise 11.2.[2, p.189] Show that the mutual information I(X;Y), in the case of this optimized distribution, is

    C = (1/2) log ( 1 + v / sigma^2 ).        (11.26)

This is an important result. We see that the capacity of the Gaussian channel is a function of the signal-to-noise ratio v/sigma^2.

Inferences given a Gaussian input distribution

If P(x) = Normal(x; 0, v) and P(y|x) = Normal(y; x, sigma^2) then the marginal distribution of y is P(y) = Normal(y; 0, v + sigma^2) and the posterior distribution of the input, given that the output is y, is:

    P(x|y) proportional to P(y|x) P(x)        (11.27)
           proportional to exp( -(y - x)^2 / (2 sigma^2) ) exp( -x^2 / (2v) )        (11.28)
           = Normal( x ; (v / (v + sigma^2)) y , ( 1/v + 1/sigma^2 )^{-1} ).        (11.29)

[The step from (11.28) to (11.29) is made by completing the square in the exponent.] This formula deserves careful study. The mean of the posterior distribution, (v / (v + sigma^2)) y, can be viewed as a weighted combination of the value that best fits the output, x = y, and the value that best fits the prior, x = 0:

    (v / (v + sigma^2)) y = [ (1/sigma^2) y + (1/v) 0 ] / ( 1/v + 1/sigma^2 ).        (11.30)

The weights 1/sigma^2 and 1/v are the precisions of the two Gaussians that we multiplied together in equation (11.28): the prior and the likelihood.

The precision of the posterior distribution is the sum of these two precisions. This is a general property: whenever two independent sources contribute information, via Gaussian distributions, about an unknown variable, the precisions add.
[This is the dual to the better-known relationship 'when independent variables are added, their variances add'.]

Noisy-channel coding theorem for the Gaussian channel

We have evaluated a maximal mutual information. Does it correspond to a maximum possible rate of error-free information transmission? One way of proving that this is so is to define a sequence of discrete channels, all derived from the Gaussian channel, with increasing numbers of inputs and outputs, and prove that the maximum mutual information of these channels tends to the asserted C. The noisy-channel coding theorem for discrete channels applies to each of these derived channels, thus we obtain a coding theorem for the continuous channel. Alternatively, we can make an intuitive argument for the coding theorem specific for the Gaussian channel.
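The posterior formulas (11.29)-(11.30) can be confirmed by brute force: tabulate the unnormalized posterior on a fine grid and compare its mean and variance with the analytic shrinkage mean and the added precisions. This is an editor-added check, not part of the text.

```python
import numpy as np

# Brute-force check of (11.29)-(11.30):
# prior x ~ Normal(0, v), likelihood y | x ~ Normal(x, sigma2).
v, sigma2, y = 2.0, 0.5, 1.3

xs = np.linspace(-10, 10, 200_001)
dx = xs[1] - xs[0]
post = np.exp(-(y - xs)**2 / (2 * sigma2)) * np.exp(-xs**2 / (2 * v))
post /= post.sum() * dx                       # normalize the density

mean_numeric = (xs * post).sum() * dx
var_numeric = ((xs - mean_numeric)**2 * post).sum() * dx

mean_analytic = v / (v + sigma2) * y          # shrunk towards the prior
var_analytic = 1 / (1/v + 1/sigma2)           # precisions add
print(mean_numeric, mean_analytic)            # both ~1.04
print(var_numeric, var_analytic)              # both 0.4
```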

Geometrical view of the noisy-channel coding theorem: sphere packing

Consider a sequence x = (x_1, ..., x_N) of inputs, and the corresponding output y, as defining two points in an N-dimensional space. For large N, the noise power is very likely to be close (fractionally) to N sigma^2. The output y is therefore very likely to be close to the surface of a sphere of radius sqrt(N sigma^2) centred on x. Similarly, if the original signal x is generated at random subject to an average power constraint xbar^2 = v, then x is likely to lie close to a sphere, centred on the origin, of radius sqrt(Nv); and because the total average power of y is v + sigma^2, the received signal y is likely to lie on the surface of a sphere of radius sqrt(N(v + sigma^2)), centred on the origin.

The volume of an N-dimensional sphere of radius r is

    V(r, N) = [ pi^{N/2} / Gamma(N/2 + 1) ] r^N.        (11.31)

Now consider making a communication system based on non-confusable inputs x, that is, inputs whose spheres do not overlap significantly. The maximum number S of non-confusable inputs is given by dividing the volume of the sphere of probable ys by the volume of the sphere for y given x:

    S <= ( sqrt(N(v + sigma^2)) / sqrt(N sigma^2) )^N        (11.32)

Thus the capacity is bounded by:

    C = (1/N) log S <= (1/2) log ( 1 + v / sigma^2 ).        (11.33)

A more detailed argument like the one used in the previous chapter can establish equality.

Back to the continuous channel

Recall that the use of a real continuous channel with bandwidth W, noise spectral density N_0 and power P is equivalent to N/T = 2W uses per second of a Gaussian channel with sigma^2 = N_0 / 2 and subject to the constraint xbar^2 <= P / (2W).

Substituting the result for the capacity of the Gaussian channel, we find the capacity of the continuous channel to be:

    C = W log ( 1 + P / (N_0 W) ) bits per second.        (11.34)

This formula gives insight into the tradeoffs of practical communication. Imagine that we have a fixed power constraint. What is the best bandwidth to make use of that power? Introducing W_0 = P/N_0, i.e., the bandwidth for which the signal-to-noise ratio is 1, figure 11.5 shows C/W_0 = (W/W_0) log(1 + W_0/W) as a function of W/W_0. The capacity increases to an asymptote of W_0 log e. It is dramatically better (in terms of capacity for fixed power) to transmit at a low signal-to-noise ratio over a large bandwidth, than with high signal-to-noise in a narrow bandwidth; this is one motivation for wideband communication methods such as the 'direct sequence spread-spectrum' approach used in 3G mobile phones. Of course, you are not alone, and your electromagnetic neighbours may not be pleased if you use a large bandwidth, so for social reasons, engineers often have to make do with higher-power, narrow-bandwidth transmitters.

[Figure 11.5. Capacity versus bandwidth for a real channel: C/W_0 = (W/W_0) log(1 + W_0/W) as a function of W/W_0.]
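The bandwidth tradeoff of equation (11.34) is worth evaluating. This editor-added sketch, with P = N_0 = 1 so that W_0 = 1, shows the capacity rising monotonically towards the asymptote (P/N_0) log2(e), about 1.44 bits per second, the figure-11.5 behaviour.

```python
import math

# Capacity (bits/s) of the continuous channel, equation (11.34):
# C = W log2(1 + P/(N0 W)).  With W0 = P/N0, C/W0 tends to
# log2(e) ~ 1.44 as the bandwidth grows.
def capacity(W, P, N0):
    return W * math.log2(1 + P / (N0 * W))

P, N0 = 1.0, 1.0          # so W0 = P/N0 = 1
for W in [0.25, 1.0, 4.0, 100.0]:
    print(W, capacity(W, P, N0))
print(math.log2(math.e))  # asymptotic C/W0 ~ 1.4427
```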

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

11.4 What are the capabilities of practical error-correcting codes?

Nearly all codes are good, but nearly all codes require exponential look-up tables for practical implementation of the encoder and decoder: exponential in the blocklength N. And the coding theorem required N to be large.

By a practical error-correcting code, we mean one that can be encoded and decoded in a reasonable amount of time, for example, a time that scales as a polynomial function of the blocklength N, preferably linearly.

The Shannon limit is not achieved in practice

The non-constructive proof of the noisy-channel coding theorem showed that good block codes exist for any noisy channel, and indeed that nearly all block codes are good. But writing down an explicit and practical encoder and decoder that are as good as promised by Shannon is still an unsolved problem.

Very good codes. Given a channel, a family of block codes that achieve arbitrarily small probability of error at any communication rate up to the capacity of the channel are called 'very good' codes for that channel.

Good codes are code families that achieve arbitrarily small probability of error at non-zero communication rates up to some maximum rate that may be less than the capacity of the given channel.

Bad codes are code families that cannot achieve arbitrarily small probability of error, or that can achieve arbitrarily small probability of error only by decreasing the information rate to zero. Repetition codes are an example of a bad code family. (Bad codes are not necessarily useless for practical purposes.)

Practical codes are code families that can be encoded and decoded in time and space polynomial in the blocklength.

Most established codes are linear codes

Let us review the definition of a block code, and then add the definition of a linear block code.

An (N, K) block code for a channel Q is a list of S = 2^K codewords {x^(1), x^(2), ..., x^(2^K)}, x^(s) in A_X^N, each of length N. The signal to be encoded, s, which comes from an alphabet of size 2^K, is encoded as x^(s).

A linear (N, K) block code is a block code in which the codewords {x^(s)} make up a K-dimensional subspace of A_X^N. The encoding operation can be represented by an N x K binary matrix G^T such that if the signal to be encoded, in binary notation, is s (a vector of length K bits), then the encoded signal is t = G^T s modulo 2.

The codewords {t} can be defined as the set of vectors satisfying H t = 0 mod 2, where H is the parity-check matrix of the code.

For example the (7, 4) Hamming code of section 1.2 takes K = 4 signal bits, s, and transmits them followed by three parity-check bits. The N = 7 transmitted symbols are given by G^T s mod 2, where

          [ 1 0 0 0 ]
          [ 0 1 0 0 ]
          [ 0 0 1 0 ]
    G^T = [ 0 0 0 1 ]
          [ 1 1 1 0 ]
          [ 0 1 1 1 ]
          [ 1 0 1 1 ]

Coding theory was born with the work of Hamming, who invented a family of practical error-correcting codes, each able to correct one error in a block of length N, of which the repetition code R3 and the (7, 4) code are the simplest.
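The encoding t = G^T s mod 2 for this (7, 4) code can be sketched in a few lines (a minimal illustration; the matrices are those given above):

```python
# Generator matrix G^T for the (7,4) Hamming code: the four signal bits
# are copied through, followed by three parity-check bits.
GT = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]

def encode(s):
    """t = G^T s mod 2, for a signal vector s of K = 4 bits."""
    return [sum(g * b for g, b in zip(row, s)) % 2 for row in GT]

# Parity-check matrix H = [P | I3]; every codeword satisfies H t = 0 mod 2.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

t = encode([1, 0, 1, 1])
print(t)  # [1, 0, 1, 1, 0, 0, 1]
print(all(sum(h * b for h, b in zip(row, t)) % 2 == 0 for row in H))  # True
```

The systematic form makes the first four transmitted bits equal to the source bits, so the check H t = 0 is the only work the verifier needs to do.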

Since then most established codes have been generalizations of Hamming's codes: Bose-Chaudhury-Hocquenhem codes, Reed-Muller codes, Reed-Solomon codes, and Goppa codes, to name a few.

Convolutional codes

Another family of linear codes are convolutional codes, which do not divide the source stream into blocks, but instead read and transmit bits continuously. The transmitted bits are a linear function of the past source bits. Usually the rule for generating the transmitted bits involves feeding the present source bit into a linear-feedback shift-register of length k, and transmitting one or more linear functions of the state of the shift register at each iteration. The resulting transmitted bit stream is the convolution of the source stream with a linear filter. The impulse-response function of this filter may have finite or infinite duration, depending on the choice of feedback shift-register. We will discuss convolutional codes in Chapter 48.

Are linear codes 'good'?

One might ask, is the reason that the Shannon limit is not achieved in practice because linear codes are inherently not as good as random codes? The answer is no, the noisy-channel coding theorem can still be proved for linear codes, at least for some channels (see Chapter 14), though the proofs, like Shannon's proof for random codes, are non-constructive.

Linear codes are easy to implement at the encoding end. Is decoding a linear code also easy? Not necessarily. The general decoding problem (find the maximum likelihood s in the equation G^T s + n = r) is in fact NP-complete (Berlekamp et al., 1978). [NP-complete problems are computational problems that are all equally difficult and which are widely believed to require exponential computer time to solve in general.] So attention focuses on families of codes for which there is a fast decoding algorithm.

Concatenation

One trick for building codes with practical decoders is the idea of concatenation.

An encoder-channel-decoder system C -> Q -> D can be viewed as defining a super-channel Q' with a smaller probability of error, and with complex correlations among its errors:

    C' -> [ C -> Q -> D ] -> D'
            super-channel Q'

We can create an encoder C' and decoder D' for this super-channel Q'. The code consisting of the outer code C' followed by the inner code C is known as a concatenated code.

Some concatenated codes make use of the idea of interleaving. We read the data in blocks, the size of each block being larger than the blocklengths of the constituent codes C and C'. After encoding the data of one block using code C', the bits are reordered within the block in such a way that nearby bits are separated from each other once the block is fed to the second code, C. A simple example of an interleaver is a rectangular code or product code in which the data are arranged in a K2 x K1 block, and encoded horizontally using an (N1, K1) linear code, then vertically using a (N2, K2) linear code.

Exercise 11.3.[3] Show that either of the two codes can be viewed as the inner code or the outer code.

As an example, figure 11.6 shows a product code in which we encode first with the repetition code R3 (also known as the Hamming code H(3, 1)) horizontally,

then with the Hamming code H(7, 4) vertically. The blocklength of the concatenated code is 21. The number of source bits per codeword is four, shown by the small rectangle.

[Figure 11.6. A product code. (a) A string 1011 encoded using a concatenated code consisting of two Hamming codes, H(3, 1) and H(7, 4). (b) A noise pattern that flips 5 bits. (c) The received vector. (d) After decoding using the horizontal (3, 1) decoder, and (e) after subsequently using the vertical (7, 4) decoder. The decoded vector matches the original. (d')-(e') After decoding in the other order, three errors still remain.]

We can decode conveniently (though not optimally) by using the individual decoders for each of the subcodes in some sequence. It makes most sense to first decode the code which has the lowest rate and hence the greatest error-correcting ability.

Figure 11.6(c-e) shows what happens if we receive the codeword of figure 11.6a with some errors (five bits flipped, as shown) and apply the decoder for H(3, 1) first, and then the decoder for H(7, 4). The first decoder corrects three of the errors, but then erroneously modifies the third bit in the second row where there are two bit errors. The (7, 4) decoder can then correct all three of these errors.
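The encoder for this R3 x H(7, 4) product code can be sketched as follows (a minimal illustration, not the book's figure-generating code; by exercise 11.3 the two encoding orders give the same block, so each of the three columns simply carries the same Hamming codeword):

```python
# R3 x H(7,4) product-code encoder: 4 source bits are repeated three times
# horizontally (R3), then each of the 3 columns is encoded vertically with
# the (7,4) Hamming code, giving a 7 x 3 = 21-bit block.

GT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
      [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]]  # (7,4) generator, G^T

def hamming_encode(s):
    return [sum(g * b for g, b in zip(row, s)) % 2 for row in GT]

def product_encode(source4):
    # repeating horizontally then encoding vertically means every column
    # is the Hamming codeword of the same four source bits
    cols = [hamming_encode(source4) for _ in range(3)]
    return [[c[i] for c in cols] for i in range(7)]  # 7 rows x 3 columns

block = product_encode([1, 0, 1, 1])
print(len(block), len(block[0]))                 # 7 3
print(all(r[0] == r[1] == r[2] for r in block))  # True: each row is an R3 word
```

Every row of the resulting block is a valid R3 codeword and every column a valid H(7, 4) codeword, which is exactly what the two sequential decoders exploit.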
Figure 11.6(d')-(e') shows what happens if we decode the two codes in the other order. In columns one and two there are two errors, so the (7, 4) decoder introduces two extra errors. It corrects the one error in column 3. The (3, 1) decoder then cleans up four of the errors, but erroneously infers the second bit.

Interleaving

The motivation for interleaving is that by spreading out bits that are nearby in one code, we make it possible to ignore the complex correlations among the errors that are produced by the inner code. Maybe the inner code will mess up an entire codeword; but that codeword is spread out one bit at a time over several codewords of the outer code. So we can treat the errors introduced by the inner code as if they are independent.

Other channel models

In addition to the binary symmetric channel and the Gaussian channel, coding theorists keep more complex channels in mind also.

Burst-error channels are important models in practice. Reed-Solomon codes use Galois fields (see Appendix C.1) with large numbers of elements (e.g. 2^16) as their input alphabets, and thereby automatically achieve a degree of burst-error tolerance in that even if 17 successive bits are corrupted, only 2 successive symbols in the Galois field representation are corrupted. Concatenation and interleaving can give further protection against burst errors. The concatenated Reed-Solomon codes used on digital compact discs are able to correct bursts of errors of length 4000 bits.

Exercise 11.4.[2, p.189] The technique of interleaving, which allows bursts of errors to be treated as independent, is widely used, but is theoretically a poor way to protect data against burst errors, in terms of the amount of redundancy required. Explain why interleaving is a poor method, using the following burst-error channel as an example. Time is divided into chunks of length N = 100 clock cycles; during each chunk, there is a burst with probability b = 0.2; during a burst, the channel is a binary symmetric channel with f = 0.5. If there is no burst, the channel is an error-free binary channel. Compute the capacity of this channel and compare it with the maximum communication rate that could conceivably be achieved if one used interleaving and treated the errors as independent.

Fading channels are real channels like Gaussian channels except that the received power is assumed to vary with time. A moving mobile phone is an important example. The incoming radio signal is reflected off nearby objects so that there are interference patterns and the intensity of the signal received by the phone varies with its location. The received power can easily vary by 10 decibels (a factor of ten) as the phone's antenna moves through a distance similar to the wavelength of the radio signal (a few centimetres).

11.5 The state of the art

What are the best known codes for communicating over Gaussian channels? All practical codes are linear codes, and are either based on convolutional codes or block codes.

Convolutional codes, and codes based on them

Textbook convolutional codes. The 'de facto standard' error-correcting code for satellite communications is a convolutional code with constraint length 7. Convolutional codes are discussed in Chapter 48.

Concatenated convolutional codes. The above convolutional code can be used as the inner code of a concatenated code whose outer code is a Reed-Solomon code with eight-bit symbols. This code was used in deep space communication systems such as the Voyager spacecraft. For further reading about Reed-Solomon codes, see Lin and Costello (1983).

The code for Galileo. A code using the same format but using a longer constraint length, 15, for its convolutional code and a larger Reed-Solomon code was developed by the Jet Propulsion Laboratory (Swanson, 1988). The details of this code are unpublished outside JPL, and the decoding is only possible using a room full of special-purpose hardware. In 1992, this was the best code known of rate 1/4.

[Figure 11.7. The encoder of a turbo code. Each box C1, C2 contains a convolutional code. The source bits are reordered using a permutation before they are fed to C2. The transmitted codeword is obtained by concatenating or interleaving the outputs of the two convolutional codes.]

Turbo codes. In 1993, Berrou, Glavieux and Thitimajshima reported work on turbo codes. The encoder of a turbo code is based on the encoders of two convolutional codes. The source bits are fed into each encoder, the order of the source bits being permuted in a random way, and the resulting parity bits from each constituent code are transmitted. The random permutation is chosen when the code is designed, and fixed thereafter. The decoding algorithm involves iteratively decoding each constituent code using its standard decoding algorithm, then using the output of the decoder as the input to the other decoder. This decoding algorithm

is an instance of a message-passing algorithm called the sum-product algorithm. Turbo codes are discussed in Chapter 48, and message passing in Chapters 16, 17, 25, and 26.

Block codes

Gallager's low-density parity-check codes. The best known block codes for Gaussian channels were invented by Gallager in 1962 but were promptly forgotten by most of the coding theory community. They were rediscovered in 1995 and shown to have outstanding theoretical and practical properties. Like turbo codes, they are decoded by message-passing algorithms. We will discuss these beautifully simple codes in Chapter 47.

[Figure 11.8. A low-density parity-check matrix and the corresponding graph of a rate-1/4 low-density parity-check code with blocklength N = 16, and M = 12 constraints. Each white circle represents a transmitted bit. Each bit participates in j = 3 constraints, represented by squares. Each constraint forces the sum of the k = 4 bits to which it is connected to be even. This code is a (16, 4) code. Outstanding performance is obtained when the blocklength is increased to N of order 10 000.]

The performances of the above codes are compared for Gaussian channels in figure 47.17, p.568.

11.6 Summary

Random codes are good, but they require exponential resources to encode and decode them.

Non-random codes tend for the most part not to be as good as random codes. For a non-random code, encoding may be easy, but even for simply-defined linear codes, the decoding problem remains very difficult.

The best practical codes (a) employ very large block sizes; (b) are based on semi-random code constructions; and (c) make use of probability-based decoding algorithms.

11.7 Nonlinear codes

Most practically used codes are linear, but not all. Digital soundtracks are encoded onto cinema film as a binary pattern. The likely errors affecting the film involve dirt and scratches, which produce large numbers of 1s and 0s respectively. We want none of the codewords to look like all-1s or all-0s, so that it will be easy to detect errors caused by dirt and scratches. One of the codes used in digital cinema sound systems is a nonlinear (8, 6) code consisting of 64 of the binary patterns of weight 4.

11.8 Errors other than noise

Another source of uncertainty for the receiver is uncertainty about the timing of the transmitted signal x(t). In ordinary coding theory and information theory, the transmitter's time t and the receiver's time u are assumed to be perfectly synchronized. But if the receiver receives a signal y(u), where the receiver's time, u, is an imperfectly known function u(t) of the transmitter's time t, then the capacity of this channel for communication is reduced. The theory of such channels is incomplete, compared with the synchronized channels we have discussed thus far. Not even the capacity of channels with synchronization errors is known (Levenshtein, 1966; Ferreira et al., 1997); codes for reliable communication over channels with synchronization errors remain an active research area (Davey and MacKay, 2001).

Further reading

For a review of the history of spread-spectrum methods, see Scholtz (1982).

11.9 Exercises

The Gaussian channel

Exercise 11.5.[2, p.190] Consider a Gaussian channel with a real input x, and signal to noise ratio v/σ².

(a) What is its capacity C?

(b) If the input is constrained to be binary, x in {+sqrt(v), -sqrt(v)}, what is the capacity C' of this constrained channel?

(c) If in addition the output of the channel is thresholded using the mapping

  y -> y' = { 1  if y > 0 ; 0  if y <= 0 },   (11.35)

what is the capacity C'' of the resulting channel?

(d) Plot the three capacities above as a function of v/σ² from 0.1 to 2. [You'll need to do a numerical integral to evaluate C'.]

Exercise 11.6.[3] For large integers K and N, what fraction of all binary error-correcting codes of length N and rate R = K/N are linear codes? [The answer will depend on whether you choose to define the code to be an ordered list of 2^K codewords, that is, a mapping from s in {1, 2, ..., 2^K} to x^(s), or to define the code to be an unordered list, so that two codes consisting of the same codewords are identical. Use the latter definition: a code is a set of codewords; how the encoder operates is not part of the definition of the code.]

Erasure channels

Exercise 11.7.[4] Design a code for the binary erasure channel, and a decoding algorithm, and evaluate their probability of error. [The design of good codes for erasure channels is an active research area (Spielman, 1996; Byers et al., 1998); see also Chapter 50.]

Exercise 11.8.[5] Design a code for the q-ary erasure channel, whose input x is drawn from 0, 1, 2, 3, ..., (q-1), and whose output y is equal to x with probability (1-f) and equal to ? otherwise. [This erasure channel is a good model for packets transmitted over the internet, which are either received reliably or are lost.]

Exercise 11.9.[3, p.190] How do redundant arrays of independent disks (RAID) work? These are information storage systems consisting of about ten disk drives, of which any two or three can be disabled and the others are still able to reconstruct any requested file. What codes are used, and how far are these systems from the Shannon limit for the problem they are solving? How would you design a better RAID system? Some information is provided in the solution section. See http://www.acnc.com/raid2.html; see also Chapter 50.

[Some people say RAID stands for 'redundant array of inexpensive disks', but I think that's silly: RAID would still be a good idea even if the disks were expensive!]

11.10 Solutions

Solution to exercise 11.1 (p.181). Introduce a Lagrange multiplier λ for the power constraint and another, μ, for the constraint of normalization of P(x).

  F = I(X;Y) - λ ∫ dx P(x) x² - μ ∫ dx P(x)   (11.36)
    = ∫ dx P(x) [ ∫ dy P(y|x) ln ( P(y|x)/P(y) ) - λ x² - μ ].   (11.37)

Make the functional derivative with respect to P(x*).

  δF/δP(x*) = ∫ dy P(y|x*) ln ( P(y|x*)/P(y) ) - λ x*² - μ
              - ∫ dx P(x) ∫ dy P(y|x) (1/P(y)) δP(y)/δP(x*).   (11.38)

The final factor δP(y)/δP(x*) is found, using P(y) = ∫ dx P(x) P(y|x), to be P(y|x*), and the whole of the last term collapses in a puff of smoke to 1, which can be absorbed into the μ term.

Substitute P(y|x) = exp(-(y-x)²/2σ²)/sqrt(2πσ²) and set the derivative to zero:

  ∫ dy P(y|x) ln ( P(y|x)/P(y) ) - λ x² - μ' = 0   (11.39)

  => ∫ dy [ exp(-(y-x)²/2σ²)/sqrt(2πσ²) ] ln [ 1/P(y) ] = λ x² + μ'.   (11.40)

This condition must be satisfied by ln[1/P(y)] for all x.

Writing a Taylor expansion of ln[1/P(y)] = a + by + cy² + ..., only a quadratic function ln[1/P(y)] = a + cy² would satisfy the constraint (11.40). (Any higher order terms y^p, p > 2, would produce terms in x^p that are not present on the right-hand side.) Therefore P(y) is Gaussian. We can obtain this optimal output distribution by using a Gaussian input distribution P(x).

Solution to exercise 11.2 (p.181). Given a Gaussian input distribution of variance v, the output distribution is Normal(0, v + σ²), since x and the noise are independent random variables, and variances add for independent random variables. The mutual information is:

  I(X;Y) = ∫∫ dx dy P(x) P(y|x) log P(y|x) - ∫ dy P(y) log P(y)   (11.41)
         = (1/2) log (1/σ²) - (1/2) log (1/(v + σ²))   (11.42)
         = (1/2) log ( 1 + v/σ² ).   (11.43)

Solution to exercise 11.4 (p.186). The capacity of the channel is one minus the information content of the noise that it adds. That information content is, per chunk, the entropy of the selection of whether the chunk is bursty, H2(b), plus, with probability b, the entropy of the flipped bits, N, which adds up to H2(b) + Nb per chunk (roughly; accurate if N is large). So, per bit, the capacity is, for N = 100,

  C = 1 - (1/N) ( H2(b) + Nb ) = 1 - 0.207 = 0.793.   (11.44)

In contrast, interleaving, which treats bursts of errors as independent, causes the channel to be treated as a binary symmetric channel with f = 0.2 x 0.5 = 0.1, whose capacity is about 0.53.
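The two capacities in this solution are quick to check numerically (a minimal sketch; H2 is the binary entropy function, and the flipped-bit entropy per chunk is N b H2(0.5) = Nb):

```python
from math import log2

def H2(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

N, b, f = 100, 0.2, 0.5

# per-bit capacity of the bursty channel, equation (11.44)
C_burst = 1 - (H2(b) + N * b * H2(f)) / N

# treating the errors as independent: a BSC with f' = b*f = 0.1
C_interleaved = 1 - H2(b * f)

print(round(C_burst, 3), round(C_interleaved, 2))  # 0.793 0.53
```

The ratio of the two, about 1.5, is the factor lost by discarding the burst structure.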

Interleaving throws away the useful information about the correlatedness of the errors. Theoretically, we should be able to communicate about (0.79/0.53) or roughly 1.6 times faster using a code and decoder that explicitly treat bursts as bursts.

Solution to exercise 11.5 (p.188).

(a) Putting together the results of exercises 11.1 and 11.2, we deduce that a Gaussian channel with real input x, and signal to noise ratio v/σ² has capacity

  C = (1/2) log ( 1 + v/σ² ).   (11.45)

(b) If the input is constrained to be binary, x in {+sqrt(v), -sqrt(v)}, the capacity is achieved by using these two inputs with equal probability. The capacity is reduced to a somewhat messy integral,

  C' = ∫ dy N(y;0) log N(y;0) - ∫ dy P(y) log P(y),   (11.46)

where N(y;x) = (1/sqrt(2π)) exp[-(y-x)²/2], x = sqrt(v)/σ, and P(y) = [N(y;x) + N(y;-x)]/2. This capacity is smaller than the unconstrained capacity (11.45), but for small signal-to-noise ratio, the two capacities are close in value.

(c) If the output is thresholded, then the Gaussian channel is turned into a binary symmetric channel whose transition probability is given by the error function Φ defined on page 156. The capacity is

  C'' = 1 - H2(f), where f = Φ( sqrt(v)/σ ).   (11.47)

[Figure 11.9. Capacities (from top to bottom in each graph) C, C', and C'', versus the signal-to-noise ratio sqrt(v)/σ. The lower graph is a log-log plot.]
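The 'somewhat messy integral' (11.46) is straightforward to evaluate by simple quadrature. The sketch below (our variable names; trapezoid rule over a truncated range, which is an approximation) compares C and C':

```python
from math import sqrt, pi, exp, log2

def N(y, x):
    """Unit-variance Gaussian density centred at x."""
    return exp(-(y - x) ** 2 / 2) / sqrt(2 * pi)

def C_gauss(snr):
    """Equation (11.45): unconstrained capacity, snr = v/sigma^2."""
    return 0.5 * log2(1 + snr)

def C_binary(snr, lo=-12.0, hi=12.0, steps=4000):
    """Equation (11.46) by trapezoid rule, with x = sqrt(snr) and
    P(y) = [N(y;x) + N(y;-x)]/2."""
    x = sqrt(snr)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        n0 = N(y, 0)
        P = (N(y, x) + N(y, -x)) / 2
        term = 0.0
        if n0 > 0.0:
            term += n0 * log2(n0)
        if P > 0.0:
            term -= P * log2(P)
        total += w * term
    return total * h

for snr in [0.25, 1.0, 4.0]:
    print(snr, round(C_gauss(snr), 3), round(C_binary(snr), 3))
```

As the solution states, C' stays below C everywhere, approaches it at small signal-to-noise ratio, and saturates below 1 bit at large signal-to-noise ratio.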
Solution to exercise 11.9 (p.188). There are several RAID systems. One of the easiest to understand consists of 7 disk drives which store data at rate 4/7 using a (7, 4) Hamming code: each successive four bits are encoded with the code and the seven codeword bits are written one to each disk. Two or perhaps three disk drives can go down and the others can recover the data. The effective channel model here is a binary erasure channel, because it is assumed that we can tell when a disk is dead.

It is not possible to recover the data for some choices of the three dead disk drives; can you see why?

Exercise 11.10.[2, p.190] Give an example of three disk drives that, if lost, lead to failure of the above RAID system, and three that can be lost without failure.

Solution to exercise 11.10 (p.190). The (7, 4) Hamming code has codewords of weight 3. If any set of three disk drives corresponding to one of those codewords is lost, then the other four disks can recover only 3 bits of information about the four source bits; a fourth bit is lost. [cf. exercise 13.13 (p.220) with q = 2: there are no binary MDS codes. This deficit is discussed further in section 13.11.] Any other set of three disk drives can be lost without problems because the corresponding four by four submatrix of the generator matrix is invertible.

A better code would be a digital fountain; see Chapter 50.
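The claim that exactly the weight-3 codeword supports are the fatal triples can be checked by brute force (a sketch; rows of G^T are indexed by disk number 0 to 6, and a loss is fatal when the surviving 4 x 4 submatrix is singular over GF(2)):

```python
from itertools import combinations

GT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
      [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]]  # one row per disk

def invertible_mod2(M):
    """Gaussian elimination over GF(2) on a 4x4 matrix."""
    M = [row[:] for row in M]
    for col in range(4):
        pivot = next((r for r in range(col, 4) if M[r][col]), None)
        if pivot is None:
            return False
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(4):
            if r != col and M[r][col]:
                M[r] = [a ^ b for a, b in zip(M[r], M[col])]
    return True

bad = [t for t in combinations(range(7), 3)
       if not invertible_mod2([GT[i] for i in range(7) if i not in t])]
print(len(bad))  # 7: one fatal triple per weight-3 codeword
print((0, 4, 6) in bad)  # True: the support of the codeword for s = 1000
```

So 28 of the 35 possible triples of dead disks are survivable, and the 7 fatal ones are precisely the supports of the weight-3 codewords.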

Part III

Further Topics in Information Theory

About Chapter 12

In Chapters 1-11, we concentrated on two aspects of information theory and coding theory: source coding, the compression of information so as to make efficient use of data transmission and storage channels; and channel coding, the redundant encoding of information so as to be able to detect and correct communication errors.

In both areas we started by ignoring practical considerations, concentrating on the question of the theoretical limitations and possibilities of coding. We then discussed practical source-coding and channel-coding schemes, shifting the emphasis towards computational feasibility. But the prime criterion for comparing encoding schemes remained the efficiency of the code in terms of the channel resources it required: the best source codes were those that achieved the greatest compression; the best channel codes were those that communicated at the highest rate with a given probability of error.

In this chapter we now shift our viewpoint a little, thinking of ease of information retrieval as a primary goal. It turns out that the random codes which were theoretically useful in our study of channel coding are also useful for rapid information retrieval.

Efficient information retrieval is one of the problems that brains seem to solve effortlessly, and content-addressable memory is one of the topics we will study when we look at neural networks.

12

Hash Codes: Codes for Efficient Information Retrieval

12.1 The information-retrieval problem

A simple example of an information-retrieval problem is the task of implementing a phone directory service, which, in response to a person's name, (a) confirms that that person is listed in the directory; and (b) returns that person's phone number and other details. We could formalize this problem as follows, with S being the number of names that must be stored in the directory.

You are given a list of S binary strings of length N bits, {x^(1), ..., x^(S)}, where S is considerably smaller than the total number of possible strings, 2^N. We will call the superscript 's' in x^(s) the record number of the string. The idea is that s runs over customers in the order in which they are added to the directory and x^(s) is the name of customer s. We assume for simplicity that all people have names of the same length. The name length might be, say, N = 200 bits, and we might want to store the details of ten million customers, so S is about 10^7, roughly 2^23. We will ignore the possibility that two customers have identical names.

[Figure 12.1. Cast of characters: string length N about 200; number of strings S about 2^23; number of possible strings 2^N about 2^200.]

The task is to construct the inverse of the mapping from s to x^(s), i.e., to make a system that, given a string x, returns the value of s such that x = x^(s) if one exists, and otherwise reports that no such s exists. (Once we have the record number, we can go and look in memory location s in a separate memory full of phone numbers to find the required number.)

The aim, when solving this task, is to use minimal computational resources in terms of the amount of memory used to store the inverse mapping from x to s and the amount of time to compute the inverse mapping. And, preferably, the inverse mapping should be implemented in such a way that further new strings can be added to the directory in a small amount of computer time too.

Some standard solutions

The simplest and dumbest solutions to the information-retrieval problem are a look-up table and a raw list.

The look-up table is a piece of memory of size 2^N log2 S, log2 S being the amount of memory required to store an integer between 1 and S. In each of the 2^N locations, we put a zero, except for the locations x that correspond to strings x^(s), into which we write the value of s.

The look-up table is a simple and quick solution, but only if there is sufficient memory for the table, and if the cost of looking up entries in memory is independent of the memory size.

But in our definition of the task, we assumed that N is about 200 bits or more, so the memory required would be of size 2^200; this solution is completely out of the question. Bear in mind that the number of particles in the solar system is only about 2^190.

The raw list is a simple list of ordered pairs (s, x^(s)) ordered by the value of s. The mapping from x to s is achieved by searching through the list of strings, starting from the top, and comparing the incoming string x with each record x^(s), until a match is found. This system is very easy to maintain, and uses a small amount of memory, about SN bits, but is rather slow to use, since on average five million pairwise comparisons will be made.

Exercise 12.1.[2, p.202] Show that the average time taken to find the required string in a raw list, assuming that the original names were chosen at random, is about S + N binary comparisons. (Note that you don't have to compare the whole string of length N, since a comparison can be terminated as soon as a mismatch occurs; show that you need on average two binary comparisons per incorrect string match.) Compare this with the worst-case search time, assuming that the devil chooses the set of strings and the search key.

The standard way in which phone directories are made improves on the look-up table and the raw list by using an alphabetically-ordered list.

Alphabetical list. The strings {x^(s)} are sorted into alphabetical order.
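Searching such a sorted list by repeated halving takes about ceil(log2 S) string comparisons; a minimal sketch, with short hypothetical names standing in for the 200-bit strings:

```python
import bisect

# toy directory: a sorted list of names, with record numbers held separately
# (the names here are hypothetical illustrations)
names = sorted(["alice", "bob", "carol", "dave", "eve", "fred", "grace"])
records = {name: s for s, name in enumerate(names)}

def lookup(x):
    """Binary search: about ceil(log2 S) string comparisons."""
    i = bisect.bisect_left(names, x)
    if i < len(names) and names[i] == x:
        return records[x]
    return None  # report that no such string is listed

print(lookup("eve"))      # 4
print(lookup("mallory"))  # None
```

The `bisect` call does exactly the splitting-in-the-middle procedure described below, comparing the target against the middle element and recursing into one half.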
Searching for an entry now usually takes less time than was needed for the raw list because we can take advantage of the sortedness; for example, we can open the phonebook at its middle page, and compare the name there with the target string; if the target is 'greater' than the middle string then we know that the required string, if it exists, will be found in the second half of the alphabetical directory. Otherwise, we look in the first half. By iterating this splitting-in-the-middle procedure, we can identify the target string, or establish that the string is not listed, in ceil(log2 S) string comparisons. The expected number of binary comparisons per string comparison will tend to increase as the search progresses, but the total number of binary comparisons required will be no greater than ceil(log2 S) N.

The amount of memory required is the same as that required for the raw list. Adding new strings to the database requires that we insert them in the correct location in the list. To find that location takes about ceil(log2 S) binary comparisons.

Can we improve on the well-established alphabetized list? Let us consider our task from some new viewpoints.

Our task is to construct a mapping x -> s from N bits to log2 S bits. This is a pseudo-invertible mapping, since for any x that maps to a non-zero s, the customer database contains the pair (s, x^(s)) that takes us back. Where have we come across the idea of mapping from N bits to M bits before?

We encountered this idea twice: first, in source coding, we studied block codes which were mappings from strings of N symbols to a selection of one label in a list. The task of information retrieval is similar to the task (which

we never actually solved) of making an encoder for a typical-set compression code.

The second time that we mapped bit strings to bit strings of another dimensionality was when we studied channel codes. There, we considered codes that mapped from K bits to N bits, with N greater than K, and we made theoretical progress using random codes.

In hash codes, we put together these two notions. We will study random codes that map from N bits to M bits where M is smaller than N. The idea is that we will map the original high-dimensional space down into a lower-dimensional space, one in which it is feasible to implement the dumb look-up table method which we rejected a moment ago.

12.2 Hash codes

First we will describe how a hash code works, then we will study the properties of idealized hash codes. A hash code implements a solution to the information-retrieval problem, that is, a mapping from x to s, with the help of a pseudo-random function called a hash function, which maps the N-bit string x to an M-bit string h(x), where M is smaller than N. M is typically chosen such that the 'table size' T = 2^M is a little bigger than S, say, ten times bigger. For example, if we were expecting S to be about a million, we might map x into a 30-bit hash h (regardless of the size N of each item x). The hash function is some fixed deterministic function which should ideally be indistinguishable from a fixed random code. For practical purposes, the hash function must be quick to compute.

[Figure 12.2. Revised cast of characters: string length N ≃ 200; number of strings S ≃ 2^23; size of hash function M ≃ 30 bits; size of hash table T = 2^M ≃ 2^30.]
Two simple examples of hash functions are:

Division method. The table size T is a prime number, preferably one that is not close to a power of 2. The hash value is the remainder when the integer x is divided by T.

Variable string addition method. This method assumes that x is a string of bytes and that the table size T is 256. The characters of x are added, modulo 256. This hash function has the defect that it maps strings that are anagrams of each other onto the same hash. It may be improved by putting the running total through a fixed pseudorandom permutation after each character is added. In the variable string exclusive-or method with table size 65 536, the string is hashed twice in this way, with the initial running total being set to 0 and 1 respectively (algorithm 12.3). The result is a 16-bit hash.

Having picked a hash function h(x), we implement an information retriever as follows. (See figure 12.4.)

Encoding. A piece of memory called the hash table is created of size 2^M b memory units, where b is the amount of memory needed to represent an integer between 0 and S. This table is initially set to zero throughout. Each memory x^(s) is put through the hash function, and at the location in the hash table corresponding to the resulting vector h^(s) = h(x^(s)), the integer s is written, unless that entry in the hash table is already occupied, in which case we have a collision between x^(s) and some earlier x^(s') which both happen to have the same hash code. Collisions can be handled in various ways (we will discuss some in a moment) but first let us complete the basic picture.
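The division method above can be sketched in C as follows (an illustrative sketch, not from the book): the string x is read as a base-256 integer by Horner's rule, reducing modulo the prime table size T at each step so the arithmetic never overflows.

```c
/* Sketch of the division method: interpret the byte string x as a
   large integer and return its remainder modulo the prime table
   size T.  Reducing modulo T after each byte keeps the value small. */
unsigned division_hash(const char *x, unsigned T)
{
    unsigned r = 0;
    for (; *x; x++)
        r = (r * 256u + (unsigned char) *x) % T;  /* Horner's rule, mod T */
    return r;  /* hash value in 0 .. T-1 */
}
```

Unlike the plain addition method, this hash is sensitive to character order, so anagrams generally hash differently.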

Algorithm 12.3. C code implementing the variable string exclusive-or method to create a hash in the range 0 ... 65 535 from a string x. Author: Thomas Niemann.

unsigned char Rand8[256];    // This array contains a random
                             //   permutation from 0..255 to 0..255

int Hash(char *x) {          // x is a pointer to the first char;
    int h;                   //   *x is the first character
    unsigned char h1, h2;

    if (*x == 0) return 0;   // Special handling of empty string
    h1 = *x; h2 = *x + 1;    // Initialize two hashes
    x++;                     // Proceed to the next character
    while (*x) {
        h1 = Rand8[h1 ^ *x]; // Exclusive-or the two hashes with *x
        h2 = Rand8[h2 ^ *x]; //   and put through the randomizer
        x++;
    }                        // End of string is reached when *x=0
    h = ((int)(h1)<<8) | (int) h2;  // Shift h1 left 8 bits and add h2
    return h;                // Hash is concatenation of h1 and h2
}

[Figure 12.4. Use of hash functions for information retrieval. For each string x^(s), the hash h^(s) = h(x^(s)) is computed, and the value of s is written into the h^(s)th row of the hash table. Blank rows in the hash table contain the value zero. The table size is T = 2^M.]
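The encoding step just described, together with the decoding step and the 'appending in the table' collision rule discussed in section 12.3 below, can be sketched as follows. This is an illustrative toy, not the book's code: the table size, the database entries, and the stand-in hash function are all invented, and the table must be substantially bigger than S for the appending rule to work.

```c
#include <string.h>

#define T 8                  /* toy table size 2^M */
static const char *db[] = {"ape", "bee", "cat", "dog"};  /* x^(s), s = 1..4 */
static int table[T];         /* hash table; 0 means an empty row */

static unsigned toy_hash(const char *x)   /* stand-in for h(x) */
{
    unsigned h = 0;
    while (*x) h = (h * 31u + (unsigned char) *x++) % T;
    return h;
}

void encode(void)            /* write s into row h(x^(s)) */
{
    for (int s = 0; s < 4; s++) {
        unsigned h = toy_hash(db[s]);
        while (table[h] != 0) h = (h + 1) % T;  /* collision: append in table */
        table[h] = s + 1;    /* store the 1-based record number */
    }
}

int decode(const char *x)    /* return s, or -1 if x is not in the database */
{
    unsigned h = toy_hash(x);
    while (table[h] != 0) {
        int s = table[h];
        if (strcmp(db[s - 1], x) == 0) return s;  /* confirm bit by bit */
        h = (h + 1) % T;     /* mismatch: keep scanning down the table */
    }
    return -1;               /* hit an empty row: x is absent */
}
```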

Decoding. To retrieve a piece of information corresponding to a target vector x, we compute the hash h of x and look at the corresponding location in the hash table. If there is a zero, then we know immediately that the string x is not in the database. The cost of this answer is the cost of one hash-function evaluation and one look-up in the table of size 2^M. If, on the other hand, there is a non-zero entry s in the table, there are two possibilities: either the vector x is indeed equal to x^(s); or the vector x^(s) is another vector that happens to have the same hash code as the target x. (A third possibility is that this non-zero entry might have something to do with our yet-to-be-discussed collision-resolution system.)

To check whether x is indeed equal to x^(s), we take the tentative answer s, look up x^(s) in the original forward database, and compare it bit by bit with x; if it matches then we report s as the desired answer. This successful retrieval has an overall cost of one hash-function evaluation, one look-up in the table of size 2^M, another look-up in a table of size S, and N binary comparisons, which may be much cheaper than the simple solutions presented in section 12.1.

Exercise 12.2. [2, p.202] If we have checked the first few bits of x^(s) with x and found them to be equal, what is the probability that the correct entry has been retrieved, if the alternative hypothesis is that x is actually not in the database? Assume that the original source strings are random, and the hash function is a random hash function. How many binary evaluations are needed to be sure with odds of a billion to one that the correct entry has been retrieved?

The hashing method of information retrieval can be used for strings x of arbitrary length, if the hash function h(x) can be applied to strings of any length.

12.3 Collision resolution

We will study two ways of resolving collisions: appending in the table, and storing elsewhere.

Appending in the table

When encoding, if a collision occurs, we continue down the hash table and write the value of s into the next available location in memory that currently contains a zero. If we reach the bottom of the table before encountering a zero, we continue from the top.

When decoding, if we compute the hash code for x and find that the s contained in the table doesn't point to an x^(s) that matches the cue x, we continue down the hash table until we either find an s whose x^(s) does match the cue x, in which case we are done, or else encounter a zero, in which case we know that the cue x is not in the database.

For this method, it is essential that the table be substantially bigger in size than S. If 2^M < S then the encoding rule will become stuck with nowhere to put the last strings.

Storing elsewhere

A more robust and flexible method is to use pointers to additional pieces of memory in which collided strings are stored. There are many ways of doing this. As an example, we could store in location h in the hash table a pointer (which must be distinguishable from a valid record number s) to a 'bucket' where all the strings that have hash code h are stored in a sorted list. The encoder sorts the strings in each bucket alphabetically as the hash table and buckets are created.

The decoder simply has to go and look in the relevant bucket and then check the short list of strings that are there by a brief alphabetical search.

This method of storing the strings in buckets allows the option of making the hash table quite small, which may have practical benefits. We may make it so small that almost all strings are involved in collisions, so all buckets contain a small number of strings. It only takes a small number of binary comparisons to identify which of the strings in the bucket matches the cue x.

12.4 Planning for collisions: a birthday problem

Exercise 12.3. [2, p.202] If we wish to store S entries using a hash function whose output has M bits, how many collisions should we expect to happen, assuming that our hash function is an ideal random function? What size M of hash table is needed if we would like the expected number of collisions to be smaller than 1?

What size M of hash table is needed if we would like the expected number of collisions to be a small fraction, say 1%, of S?

[Notice the similarity of this problem to exercise 9.20 (p.156).]

12.5 Other roles for hash codes

Checking arithmetic

If you wish to check an addition that was done by hand, you may find useful the method of casting out nines. In casting out nines, one finds the sum, modulo nine, of all the digits of the numbers to be summed and compares it with the sum, modulo nine, of the digits of the putative answer. [With a little practice, these sums can be computed much more rapidly than the full original addition.]

Example 12.4. In the calculation shown in the margin, the sum, modulo nine, of the digits in 189+1254+238 is 7, and the sum, modulo nine, of 1+6+8+1 is 7. The calculation thus passes the casting-out-nines test.

    [margin:   189 + 1254 + 238 = 1681]

Casting out nines gives a simple example of a hash function. For any addition expression of the form a + b + c + ..., where a, b, c, ... are decimal numbers, we define h ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8} by

    h(a + b + c + ...) = sum modulo nine of all digits in a, b, c, ...;    (12.1)

then it is a nice property of decimal arithmetic that if

    a + b + c + ... = m + n + o + ...    (12.2)

then the hashes h(a + b + c + ...) and h(m + n + o + ...) are equal.

Exercise 12.5. [1, p.203] What evidence does a correct casting-out-nines match give in favour of the hypothesis that the addition has been done correctly?
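The casting-out-nines hash of equation (12.1) can be sketched in C (an illustrative sketch, not from the book), and checked against the book's worked example 189 + 1254 + 238 = 1681:

```c
/* Casting out nines as a hash function (equation 12.1): the sum,
   modulo nine, of the decimal digits of a non-negative integer. */
int cast_out_nines(long a)
{
    int h = 0;
    for (; a > 0; a /= 10)
        h = (h + (int)(a % 10)) % 9;  /* add each decimal digit mod nine */
    return h;
}
```

The defining property (12.2) then reads: the mod-9 sum of the operands' hashes equals the hash of the true answer.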

Error detection among friends

Are two files the same? If the files are on the same computer, we could just compare them bit by bit. But if the two files are on separate machines, it would be nice to have a way of confirming that two files are identical without having to transfer one of the files from A to B. [And even if we did transfer one of the files, we would still like a way to confirm whether it has been received without modifications!]

This problem can be solved using hash codes. Let Alice and Bob be the holders of the two files; Alice sent the file to Bob, and they wish to confirm it has been received without error. If Alice computes the hash of her file and sends it to Bob, and Bob computes the hash of his file, using the same M-bit hash function, and the two hashes match, then Bob can deduce that the two files are almost surely the same.

Example 12.6. What is the probability of a false negative, i.e., the probability, given that the two files do differ, that the two hashes are nevertheless identical?

If we assume that the hash function is random and that the process that causes the files to differ knows nothing about the hash function, then the probability of a false negative is 2^-M.

A 32-bit hash gives a probability of false negative of about 10^-10. It is common practice to use a linear hash function called a 32-bit cyclic redundancy check to detect errors in files. (A cyclic redundancy check is a set of 32 parity-check bits similar to the 3 parity-check bits of the (7,4) Hamming code.)

To have a false-negative rate smaller than one in a billion, M = 32 bits is plenty, if the errors are produced by noise.

Exercise 12.7. [2, p.203] Such a simple parity-check code only detects errors; it doesn't help correct them. Since error-correcting codes exist, why not use one of them to get some error-correcting capability too?

Tamper detection

What if the differences between the two files are not simply 'noise', but are introduced by an adversary, a clever forger called Fiona, who modifies the original file to make a forgery that purports to be Alice's file? How can Alice make a digital signature for the file so that Bob can confirm that no-one has tampered with the file? And how can we prevent Fiona from listening in on Alice's signature and attaching it to other files?

Let's assume that Alice computes a hash function for the file and sends it securely to Bob. If Alice computes a simple hash function for the file like the linear cyclic redundancy check, and Fiona knows that this is the method of verifying the file's integrity, Fiona can make her chosen modifications to the file and then easily identify (by linear algebra) a further 32-or-so single bits that, when flipped, restore the hash function of the file to its original value. Linear hash functions give no security against forgers.

We must therefore require that the hash function be hard to invert so that no-one can construct a tampering that leaves the hash function unaffected. We would still like the hash function to be easy to compute, however, so that Bob doesn't have to do hours of work to verify every file he received. Such a hash function, easy to compute but hard to invert, is called a one-way hash function. Finding such functions is one of the active research areas of cryptography.

A hash function that is widely used in the free software community to confirm that two files do not differ is MD5, which produces a 128-bit hash. The details of how it works are quite complicated, involving convoluted exclusive-or-ing and if-ing and and-ing.^1

Even with a good one-way hash function, the digital signatures described above are still vulnerable to attack, if Fiona has access to the hash function. Fiona could take the tampered file and hunt for a further tiny modification to it such that its hash matches the original hash of Alice's file. This would take some time (on average, about 2^32 attempts, if the hash function has 32 bits) but eventually Fiona would find a tampered file that matches the given hash. To be secure against forgery, digital signatures must either have enough bits for such a random search to take too long, or the hash function itself must be kept secret.

[margin: Fiona has to hash 2^M files to cheat. 2^32 file modifications is not very many, so a 32-bit hash function is not large enough for forgery prevention.]

Another person who might have a motivation for forgery is Alice herself. For example, she might be making a bet on the outcome of a race, without wishing to broadcast her prediction publicly; a method for placing bets would be for her to send to Bob the bookie the hash of her bet. Later on, she could send Bob the details of her bet. Everyone can confirm that her bet is consistent with the previously publicized hash.
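The earlier claim that linear hash functions give no security against forgers can be demonstrated with a deliberately trivial linear hash (an invented toy, far simpler than a real CRC, but it shares the linearity that matters): because the hash is linear over GF(2), any pair of modifications whose exclusive-or is zero leaves the hash unchanged, so Fiona can tamper undetected.

```c
/* Toy linear hash: exclusive-or of all bytes.  Linear in GF(2), so
   flipping the same bit in two different bytes cancels out and the
   hash is unchanged -- exactly the weakness exploited by a forger. */
unsigned char xor_hash(const unsigned char *x, int n)
{
    unsigned char h = 0;
    for (int i = 0; i < n; i++)
        h ^= x[i];          /* each byte enters the hash linearly */
    return h;
}
```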
[This method of secret publication was used by Isaac Newton and Robert Hooke when they wished to establish priority for scientific ideas without revealing them. Hooke's hash function was alphabetization, as illustrated by the conversion of UT TENSIO, SIC VIS into the anagram CEIIINOSSSTTUV.] Such a protocol relies on the assumption that Alice cannot change her bet after the event without the hash coming out wrong.

How big a hash function do we need to use to ensure that Alice cannot cheat? The answer is different from the size of the hash we needed in order to defeat Fiona above, because Alice is the author of both files. Alice could cheat by searching for two files that have identical hashes to each other. For example, if she'd like to cheat by placing two bets for the price of one, she could make a large number N_1 of versions of bet one (differing from each other in minor details only), and a large number N_2 of versions of bet two, and hash them all. If there's a collision between the hashes of two bets of different types, then she can submit the common hash and thus buy herself the option of placing either bet.

Example 12.8. If the hash has M bits, how big do N_1 and N_2 need to be for Alice to have a good chance of finding two different bets with the same hash?

This is a birthday problem like exercise 9.20 (p.156). If there are N_1 Montagues and N_2 Capulets at a party, and each is assigned a 'birthday' of M bits, the expected number of collisions between a Montague and a Capulet is

    N_1 N_2 2^-M;    (12.3)

^1 http://www.freesoft.org/CIE/RFC/1321/3.htm

so to minimize the number of files hashed, N_1 + N_2, Alice should make N_1 and N_2 equal, and will need to hash about 2^(M/2) files until she finds two that match.

[margin: Alice has to hash 2^(M/2) files to cheat. This is the square root of the number of hashes Fiona had to make.]

If Alice has the use of C = 10^6 computers for T = 10 years, each computer taking t = 1 ns to evaluate a hash, the bet-communication system is secure against Alice's dishonesty only if M >> 2 log2 CT/t ≃ 160 bits.

Further reading

The Bible for hash codes is volume 3 of Knuth (1968). I highly recommend the story of Doug McIlroy's spell program, as told in section 13.8 of Programming Pearls (Bentley, 2000). This astonishing piece of software makes use of a 64-kilobyte data structure to store the spellings of all the words of a 75 000-word dictionary.

12.6 Further exercises

Exercise 12.9. [1] What is the shortest the address on a typical international letter could be, if it is to get to a unique human recipient? (Assume the permitted characters are [A-Z,0-9].) How long are typical email addresses?

Exercise 12.10. [2, p.203] How long does a piece of text need to be for you to be pretty sure that no human has written that string of characters before? How many notes are there in a new melody that has not been composed before?

Exercise 12.11. [3, p.204] Pattern recognition by molecules.

Some proteins produced in a cell have a regulatory role. A regulatory protein controls the transcription of specific genes in the genome. This control often involves the protein's binding to a particular DNA sequence in the vicinity of the regulated gene. The presence of the bound protein either promotes or inhibits transcription of the gene.

(a) Use information-theoretic arguments to obtain a lower bound on the size of a typical protein that acts as a regulator specific to one gene in the whole human genome. Assume that the genome is a sequence of 3 × 10^9 nucleotides drawn from a four-letter alphabet {A, C, G, T}; a protein is a sequence of amino acids drawn from a twenty-letter alphabet. [Hint: establish how long the recognized DNA sequence has to be in order for that sequence to be unique to the vicinity of one gene, treating the rest of the genome as a random sequence. Then discuss how big the protein must be to recognize a sequence of that length uniquely.]

(b) Some of the sequences recognized by DNA-binding regulatory proteins consist of a subsequence that is repeated twice or more, for example the sequence

    GCCCCCCACCCCTGCCCCC    (12.4)

is a binding site found upstream of the alpha-actin gene in humans. Does the fact that some binding sites consist of a repeated subsequence influence your answer to part (a)?

12.7 Solutions

Solution to exercise 12.1 (p.194). First imagine comparing the string x with another random string x^(s). The probability that the first bits of the two strings match is 1/2. The probability that the second bits match is 1/2. Assuming we stop comparing once we hit the first mismatch, the expected number of matches is 1, so the expected number of comparisons is 2 (exercise 2.34, p.38).

Assuming the correct string is located at random in the raw list, we will have to compare with an average of S/2 strings before we find it, which costs 2 × S/2 binary comparisons; and comparing the correct strings takes N binary comparisons, giving a total expectation of S + N binary comparisons, if the strings are chosen at random.

In the worst case (which may indeed happen in practice), the other strings are very similar to the search key, so that a lengthy sequence of comparisons is needed to find each mismatch. The worst case is when the correct string is last in the list, and all the other strings differ in the last bit only, giving a requirement of SN binary comparisons.

Solution to exercise 12.2 (p.197). The likelihood ratio for the two hypotheses, H_0: x^(s) = x, and H_1: x^(s) ≠ x, contributed by the datum 'the first bits of x^(s) and x are equal' is

    P(Datum | H_0) / P(Datum | H_1) = 1 / (1/2) = 2.    (12.5)

If the first r bits all match, the likelihood ratio is 2^r to one. On finding that 30 bits match, the odds are a billion to one in favour of H_0, assuming we start from even odds. [For a complete answer, we should compute the evidence given by the prior information that the hash entry s has been found in the table at h(x). This fact gives further evidence in favour of H_0.]

Solution to exercise 12.3 (p.198). Let the hash function have an output alphabet of size T = 2^M. If M were equal to log2 S then we would have exactly enough bits for each entry to have its own unique hash. The probability that one particular pair of entries collide under a random hash function is 1/T. The number of pairs is S(S−1)/2. So the expected number of collisions between pairs is exactly

    S(S−1) / (2T).    (12.6)

If we would like this to be smaller than 1, then we need T > S(S−1)/2, so

    M > 2 log2 S.    (12.7)

We need twice as many bits as the number of bits, log2 S, that would be sufficient to give each entry a unique name.

If we are happy to have occasional collisions, involving a fraction f of the names S, then we need T > S/f (since the probability that one particular name is collided-with is f ≃ S/T), so

    M > log2 S + log2 [1/f],    (12.8)

which means for f ≃ 0.01 that we need an extra 7 bits above log2 S.

The important point to note is the scaling of T with S in the two cases (12.7, 12.8). If we want the hash function to be collision-free, then we must have T greater than S^2. If we are happy to have a small frequency of collisions, then T needs to be of order S only.

Solution to exercise 12.5 (p.198). The posterior probability ratio for the two hypotheses, H_+ = 'calculation correct' and H_− = 'calculation incorrect', is the product of the prior probability ratio P(H_+)/P(H_−) and the likelihood ratio, P(match | H_+)/P(match | H_−). This second factor is the answer to the question. The numerator P(match | H_+) is equal to 1. The denominator's value depends on our model of errors. If we know that the human calculator is prone to errors involving multiplication of the answer by 10, or to transposition of adjacent digits, neither of which affects the hash value, then P(match | H_−) could be equal to 1 also, so that the correct match gives no evidence in favour of H_+. But if we assume that errors are 'random from the point of view of the hash function' then the probability of a false positive is P(match | H_−) = 1/9, and the correct match gives evidence 9:1 in favour of H_+.

Solution to exercise 12.7 (p.199). If you add a tiny M = 32 extra bits of hash to a huge N-bit file you get pretty good error detection - the probability that an error is undetected is 2^-M, less than one in a billion. To do error correction requires far more check bits, the number depending on the expected types of corruption, and on the file size. For example, if just eight random bits in a megabyte file are corrupted, it would take about log2 (2^23 choose 8) ≃ 23 × 8 ≃ 180 bits to specify which are the corrupted bits, and the number of parity-check bits used by a successful error-correcting code would have to be at least this number, by the counting argument of exercise 1.10 (solution, p.20).

Solution to exercise 12.10 (p.201). We want to know the length L of a string such that it is very improbable that that string matches any part of the entire writings of humanity. Let's estimate that these writings total about one book for each person living, and that each book contains two million characters (200 pages with 10 000 characters per page) - that's 10^16 characters, drawn from an alphabet of, say, 37 characters.

The probability that a randomly chosen string of length L matches at one point in the collected works of humanity is 1/37^L. So the expected number of matches is 10^16/37^L, which is vanishingly small if L ≥ 16/log10 37 ≃ 10. Because of the redundancy and repetition of humanity's writings, it is possible that L ≃ 10 is an overestimate.

So, if you want to write something unique, sit down and compose a string of ten characters. But don't write gidnebinzz, because I already thought of that string.

As for a new melody, if we focus on the sequence of notes, ignoring duration and stress, and allow leaps of up to an octave at each note, then the number of choices per note is 23. The pitch of the first note is arbitrary. The number of melodies of length r notes in this rather ugly ensemble of Schoenbergian tunes is 23^(r−1); for example, there are 250 000 of length r = 5. Restricting the permitted intervals will reduce this figure; including duration and stress will increase it again. [If we restrict the permitted intervals to repetitions and tones or semitones, the reduction is particularly severe; is this why the melody of 'Ode to Joy' sounds so boring?] The number of recorded compositions is probably less than a million. If you learn 100 new melodies per week for every week of your life then you will have learned 250 000 melodies at age 50. Based on empirical experience of playing the game 'guess that tune', in which one player chooses a melody and sings a gradually-increasing number of its notes while the other participants try to guess the whole melody, it seems to me that whereas many four-note sequences are shared in common between melodies, the number of collisions between five-note sequences is rather smaller - most famous five-note sequences are unique.

[margin: The Parsons code is a related hash function for melodies: each pair of consecutive notes is coded as U ('up') if the second note is higher than the first, R ('repeat') if the pitches are equal, and D ('down') otherwise. You can find out how well this hash function works at http://musipedia.org/.]

Solution to exercise 12.11 (p.201). (a) Let the DNA-binding protein recognize a sequence of L nucleotides. That is, it binds preferentially to that DNA sequence, and not to any other pieces of DNA in the whole genome. (In reality, the recognized sequence may contain some wildcard characters, e.g., the * in TATAA*A, which denotes 'any of A, C, G and T'; so, to be precise, we are assuming that the recognized sequence contains L non-wildcard characters.)

Assuming that the rest of the genome is 'random', i.e., that the sequence consists of random nucleotides A, C, G and T with equal probability - which is obviously untrue, but it shouldn't make too much difference to our calculation - the chance that there is no other occurrence of the target sequence in the whole genome, of length N nucleotides, is roughly

    (1 − (1/4)^L)^N ≃ exp(−N (1/4)^L),    (12.9)

which is close to one only if

    N 4^−L ≪ 1,    (12.10)

that is,

    L > log N / log 4.    (12.11)

Using N = 3 × 10^9, we require the recognized sequence to be longer than L_min = 16 nucleotides.

What size of protein does this imply? A weak lower bound can be obtained by assuming that the information content of the protein sequence itself is greater than the information content of the nucleotide sequence the protein prefers to bind to (which we have argued above must be at least 32 bits). This gives a minimum protein length of 32/log2(20) ≃ 7 amino acids.

Thinking realistically, the recognition of the DNA sequence by the protein presumably involves the protein coming into contact with all sixteen nucleotides in the target sequence. If the protein is a monomer, it must be big enough that it can simultaneously make contact with sixteen nucleotides of DNA. One helical turn of DNA containing ten nucleotides has a length of 3.4 nm, so a contiguous sequence of sixteen nucleotides has a length of 5.4 nm. The diameter of the protein must therefore be about 5.4 nm or greater. Egg-white lysozyme is a small globular protein with a length of 129 amino acids and a diameter of about 4 nm. Assuming that volume is proportional to sequence length and that volume scales as the cube of the diameter, a protein of diameter 5.4 nm must have a sequence of length 2.5 × 129 ≃ 324 amino acids.

(b) If, however, a target sequence consists of a twice-repeated sub-sequence, we can get by with a much smaller protein that recognizes only the sub-sequence, and that binds to the DNA strongly only if it can form a dimer, both halves of which are bound to the recognized sequence. Halving the diameter of the protein, we now only need a protein whose length is greater than 324/8 = 40 amino acids. A protein of length smaller than this cannot by itself serve as a regulatory protein specific to one gene, because it's simply too small to be able to make a sufficiently specific match - its available surface does not have enough information content.

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

About Chapter 13

In Chapters 8-11, we established Shannon's noisy-channel coding theorem for a general channel with any input and output alphabets. A great deal of attention in coding theory focuses on the special case of channels with binary inputs. Constraining ourselves to these channels simplifies matters, and leads us into an exceptionally rich world, which we will only taste in this book.

One of the aims of this chapter is to point out a contrast between Shannon's aim of achieving reliable communication over a noisy channel and the apparent aim of many in the world of coding theory. Many coding theorists take as their fundamental problem the task of packing as many spheres as possible, with radius as large as possible, into an N-dimensional space, with no spheres overlapping. Prizes are awarded to people who find packings that squeeze in an extra few spheres. While this is a fascinating mathematical topic, we shall see that the aim of maximizing the distance between codewords in a code has only a tenuous relationship to Shannon's aim of reliable communication.

13  Binary Codes

We've established Shannon's noisy-channel coding theorem for a general channel with any input and output alphabets. A great deal of attention in coding theory focuses on the special case of channels with binary inputs, the first implicit choice being the binary symmetric channel.

The optimal decoder for a code, given the binary symmetric channel, finds the codeword that is closest to the received vector, closest in Hamming distance. The Hamming distance between two binary vectors is the number of coordinates in which the two vectors differ. (Example: the Hamming distance between 00001111 and 11001101 is 3.) Decoding errors will occur if the noise takes us from the transmitted codeword t to a received vector r that is closer to some other codeword. The distances between codewords are thus relevant to the probability of a decoding error.

13.1  Distance properties of a code

The distance of a code is the smallest separation between two of its codewords.

Example 13.1. The (7,4) Hamming code (p.8) has distance d = 3. All pairs of its codewords differ in at least 3 bits. The maximum number of errors it can correct is t = 1; in general a code with distance d is \lfloor (d-1)/2 \rfloor-error-correcting.

A more precise term for distance is the minimum distance of the code. The distance of a code is often denoted by d or d_min.

We'll now constrain our attention to linear codes. In a linear code, all codewords have identical distance properties, so we can summarize all the distances between the code's codewords by counting the distances from the all-zero codeword.

The weight enumerator function of a code, A(w), is defined to be the number of codewords in the code that have weight w. The weight enumerator function is also known as the distance distribution of the code.

Example 13.2. The weight enumerator functions of the (7,4) Hamming code and the dodecahedron code are shown in figures 13.1 and 13.2.

[Figure 13.1. The graph of the (7,4) Hamming code, and its weight enumerator function:
    w    A(w)
    0      1
    3      7
    4      7
    7      1
  Total   16 ]

13.2  Obsession with distance

Since the maximum number of errors that a code can guarantee to correct, t, is related to its distance d by t = \lfloor (d-1)/2 \rfloor (d = 2t + 1 if d is odd, and d = 2t + 2 if d is even), many coding theorists focus on the distance of a code, searching for codes of a given size that have the biggest possible distance. Much of practical coding theory has focused on decoders that give the optimal decoding for all error patterns of weight up to the half-distance t of their codes.
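The weight enumerator of the (7,4) Hamming code is small enough to verify by brute force. This sketch (mine, not the book's) enumerates all 16 codewords from the code's standard generator matrix, which appears later in the chapter as (13.26):

```python
from itertools import product

# Generator matrix of the (7,4) Hamming code, rows as bit-tuples.
G = [(1, 0, 0, 0, 1, 0, 1),
     (0, 1, 0, 0, 1, 1, 0),
     (0, 0, 1, 0, 1, 1, 1),
     (0, 0, 0, 1, 0, 1, 1)]

def codeword(s):
    """Encode source bits s by summing the selected rows of G mod 2."""
    return tuple(sum(si * gi for si, gi in zip(s, col)) % 2
                 for col in zip(*G))

A = {}  # weight enumerator: A[w] = number of codewords of weight w
for s in product([0, 1], repeat=4):
    w = sum(codeword(s))
    A[w] = A.get(w, 0) + 1

print(sorted(A.items()))   # [(0, 1), (3, 7), (4, 7), (7, 1)]
```

The smallest non-zero weight is 3, confirming the minimum distance d = 3 quoted in Example 13.1.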

[Figure 13.2. The graph defining the (30,11) dodecahedron code (the 30 circles are the transmitted bits and the 20 triangles are the parity checks, one of which is redundant) and the weight enumerator function of the code (solid lines). The dotted lines show the average weight enumerator function of all random linear codes with the same size of generator matrix, which will be computed shortly. The lower figure shows the same functions on a log scale. The weight enumerator:
    w:   0  5   8   9  10   11   12   13   14   15   16   17   18   19  20
  A(w):  1 12  30  20  72  120  100  180  240  272  345  300  200  120  36
  Total: 2048 ]

A bounded-distance decoder is a decoder that returns the closest codeword to a received binary vector r if the distance from r to that codeword is less than or equal to t; otherwise it returns a failure message.

The rationale for not trying to decode when more than t errors have occurred might be 'we can't guarantee that we can correct more than t errors, so we won't bother trying; who would be interested in a decoder that corrects some error patterns of weight greater than t, but not others?' This defeatist attitude is an example of worst-case-ism, a widespread mental ailment which this book is intended to cure.

The fact is that bounded-distance decoders cannot reach the Shannon limit of the binary symmetric channel; only a decoder that often corrects more than t errors can do this. The state of the art in error-correcting codes have decoders that
work way beyond the minimum distance of the code.

Definitions of good and bad distance properties of codes

Given a family of codes of increasing blocklength N, with rates approaching a limit R > 0, we may be able to put that family into one of the following categories, which have some similarities to the categories of 'good' and 'bad' codes defined earlier (p.183):

  A sequence of codes has 'good' distance if d/N tends to a constant greater than zero.
  A sequence of codes has 'bad' distance if d/N tends to zero.
  A sequence of codes has 'very bad' distance if d tends to a constant.

Example 13.3. A low-density generator-matrix code is a linear code whose K \times N generator matrix G has a small number d_0 of 1s per row, regardless of how big N is. The minimum distance of such a code is at most d_0, so low-density generator-matrix codes have 'very bad' distance.

[Figure 13.3. The graph of a rate-1/2 low-density generator-matrix code. The rightmost M of the transmitted bits are each connected to a single distinct parity constraint. The leftmost K transmitted bits are each connected to a small number of parity constraints.]

While having large distance is no bad thing, we'll see, later on, why an emphasis on distance can be unhealthy.
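The bounded-distance rule described above is simple to state in code. The following sketch is my illustration, using toy codeword lists (not codes from the book) to show both a success and a decoding failure:

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit-tuples."""
    return sum(x != y for x, y in zip(a, b))

def bounded_distance_decode(r, codewords, t):
    """Return the closest codeword if it lies within distance t of r,
    otherwise None (a failure message) -- the 'worst-case-ist' rule."""
    best = min(codewords, key=lambda c: hamming(r, c))
    return best if hamming(r, best) <= t else None

# The (3,1) repetition code has d = 3, so t = 1.
code = [(0, 0, 0), (1, 1, 1)]
print(bounded_distance_decode((0, 1, 0), code, t=1))  # (0, 0, 0)
print(bounded_distance_decode((0, 1, 1), code, t=1))  # (1, 1, 1)

# A toy two-codeword code of length 4 has d = 4, t = 1; a weight-2
# error lands outside every t-sphere, and the decoder gives up:
print(bounded_distance_decode((0, 1, 1, 0),
                              [(0, 0, 0, 0), (1, 1, 1, 1)], t=1))  # None
```

The last call illustrates the defeatism the text complains about: the received vector is no closer to the wrong codeword than to the right one, yet the decoder refuses to guess.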

13.3  Perfect codes

A t-sphere (or a sphere of radius t) in Hamming space, centred on a point x, is the set of points whose Hamming distance from x is less than or equal to t.

The (7,4) Hamming code has the beautiful property that if we place 1-spheres about each of its 16 codewords, those spheres perfectly fill Hamming space without overlapping. As we saw in Chapter 1, every binary vector of length 7 is within a distance of t = 1 of exactly one codeword of the Hamming code.

A code is a perfect t-error-correcting code if the set of t-spheres centred on the codewords of the code fill the Hamming space without overlapping. (See figure 13.4.)

[Figure 13.4. Schematic picture of part of Hamming space perfectly filled by t-spheres centred on the codewords of a perfect code.]

Let's recap our cast of characters. The number of codewords is S = 2^K. The number of points in the entire Hamming space is 2^N. The number of points in a Hamming sphere of radius t is

  \sum_{w=0}^{t} \binom{N}{w}.   (13.1)

For a code to be perfect with these parameters, we require S times the number of points in the t-sphere to equal 2^N:

  2^K \sum_{w=0}^{t} \binom{N}{w} = 2^N  for a perfect code,   (13.2)

or, equivalently,

  \sum_{w=0}^{t} \binom{N}{w} = 2^{N-K}.   (13.3)

For a perfect code, the number of noise vectors in one sphere must equal the number of possible syndromes. The (7,4) Hamming code satisfies this numerological condition because

  1 + \binom{7}{1} = 2^3.   (13.4)
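The numerological condition (13.3) is mechanical to test. This sketch (mine, not from the book) checks it for the codes that appear in this section:

```python
from math import comb

def sphere_volume(N, t):
    """Number of points in a Hamming t-sphere in {0,1}^N  (13.1)."""
    return sum(comb(N, w) for w in range(t + 1))

def is_perfect(N, K, t):
    """Perfect-code condition (13.3): sphere volume equals 2^(N-K)."""
    return sphere_volume(N, t) == 2 ** (N - K)

print(is_perfect(7, 4, 1))    # True:  the (7,4) Hamming code, 1 + 7 = 2^3
print(is_perfect(5, 1, 2))    # True:  the repetition code R_5
print(is_perfect(23, 12, 3))  # True:  the binary Golay code
print(is_perfect(15, 10, 1))  # False: wrong K for N = 15 (K = 11 works)
```

A loop over (N, K, t) triples of this kind is one way to convince yourself how rarely the condition holds, anticipating the shortage of perfect codes discussed next.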

[Figure 13.5. Schematic picture of Hamming space not perfectly filled by t-spheres centred on the codewords of a code. The grey regions show points that are at a Hamming distance of more than t from any codeword. This is a misleading picture, as, for any code with large t in high dimensions, the grey space between the spheres takes up almost all of Hamming space.]

How happy we would be to use perfect codes

If there were large numbers of perfect codes to choose from, with a wide range of blocklengths and rates, then these would be the perfect solution to Shannon's problem. We could communicate over a binary symmetric channel with noise level f, for example, by picking a perfect t-error-correcting code with blocklength N and t = f*N, where f* = f + \delta, with N and \delta chosen such that the probability that the noise flips more than t bits is satisfactorily small.

However, there are almost no perfect codes. The only nontrivial perfect binary codes are:

1. the Hamming codes, which are perfect codes with t = 1 and blocklength N = 2^M - 1, defined below; the rate of a Hamming code approaches 1 as its blocklength N increases;

2. the repetition codes of odd blocklength N, which are perfect codes with t = (N-1)/2; the rate of repetition codes goes to zero as 1/N; and

3. one remarkable 3-error-correcting code with 2^12 codewords of blocklength N = 23, known as the binary Golay code. [A second 2-error-correcting Golay code of length N = 11 over a ternary alphabet was discovered by a Finnish football-pool enthusiast called Juhani Virtakallio in 1947.]

There are no other binary perfect codes. Why this shortage of perfect codes? Is it because precise numerological coincidences like those satisfied by the parameters of the Hamming code (13.4) and the Golay code,

  1 + \binom{23}{1} + \binom{23}{2} + \binom{23}{3} = 2^{11},   (13.5)

are rare? Are there plenty of 'almost-perfect' codes for which the t-spheres fill almost the whole space? No. In fact, the picture of Hamming spheres centred on the codewords almost filling Hamming space (figure 13.5) is a misleading one: for most codes, whether they are good codes or bad codes, almost all of Hamming space is taken up by the space between t-spheres (which is shown in grey in figure 13.5).

Having established this gloomy picture, we spend a moment filling in the properties of the perfect codes mentioned above.

[Figure 13.6. Three codewords: fractions u, v, w mark where they differ, and a fraction x of the coordinates are zero in all three.]

The Hamming codes

The (7,4) Hamming code can be defined as the linear code whose 3 \times 7 parity-check matrix contains, as its columns, all the 7 (= 2^3 - 1) non-zero vectors of length 3. Since these 7 vectors are all different, any single bit-flip produces a distinct syndrome, so all single-bit errors can be detected and corrected.

We can generalize this code, with M = 3 parity constraints, as follows. The Hamming codes are single-error-correcting codes defined by picking a number of parity-check constraints, M; the blocklength N is N = 2^M - 1; the parity-check matrix contains, as its columns, all the N non-zero vectors of length M bits.

The first few Hamming codes have the following rates:

  Checks, M   (N, K)     R = K/N
      2       (3, 1)     1/3       repetition code R_3
      3       (7, 4)     4/7       (7,4) Hamming code
      4       (15, 11)   11/15
      5       (31, 26)   26/31
      6       (63, 57)   57/63

Exercise 13.4.[2, p.223] What is the probability of block error of the (N, K) Hamming code, to leading order, when the code is used for a binary symmetric channel with noise density f?

13.4  Perfectness is unattainable - first proof

We will show in several ways that useful perfect codes do not exist (here, 'useful' means 'having large blocklength N, and rate close neither to 0 nor 1').
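The table of Hamming-code rates above follows directly from the construction (all non-zero columns of length M pin down N, K, and R). A quick sketch of mine to regenerate it:

```python
def hamming_params(M):
    """Blocklength, dimension and rate of the Hamming code with M checks."""
    N = 2 ** M - 1     # one column per non-zero length-M vector
    K = N - M          # M parity bits
    return N, K, K / N

for M in range(2, 7):
    N, K, R = hamming_params(M)
    print(f"M={M}  (N,K)=({N},{K})  R={K}/{N}")
```

As M grows, K/N = 1 - M/(2^M - 1) tends to 1, which is the statement above that the rate of a Hamming code approaches 1 with increasing blocklength.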
Shannon proved that, given a binary symmetric channel with any noise level f, there exist codes with large blocklength N and rate as close as you like to C(f) = 1 - H_2(f) that enable communication with arbitrarily small error probability. For large N, the number of errors per block will typically be about fN, so these codes of Shannon are 'almost-certainly-fN-error-correcting' codes.

Let's pick the special case of a noisy channel with f \in (1/3, 1/2). Can we find a large perfect code that is fN-error-correcting? Well, let's suppose that such a code has been found, and examine just three of its codewords. (Remember that the code ought to have rate R \simeq 1 - H_2(f), so it should have an enormous number (2^{NR}) of codewords.) Without loss of generality, we choose one of the codewords to be the all-zero codeword and define the other two to have overlaps with it as shown in figure 13.6. The second codeword differs from the first in a fraction u + v of its coordinates. The third codeword differs from the first in a fraction v + w, and from the second in a fraction u + w. A fraction x of the coordinates have value zero in all three codewords. Now, if the code is fN-error-correcting, its minimum distance must be greater

than 2fN, so

  u + v > 2f,   v + w > 2f,   and   u + w > 2f.   (13.6)

Summing these three inequalities and dividing by two, we have

  u + v + w > 3f.   (13.7)

So if f > 1/3, we can deduce u + v + w > 1, so that x < 0, which is impossible. Such a code cannot exist. So the code cannot have three codewords, let alone 2^{NR}.

We conclude that, whereas Shannon proved there are plenty of codes for communicating over a binary symmetric channel with f > 1/3, there are no perfect codes that can do this.

We now study a more general argument that indicates that there are no large perfect linear codes for general rates (other than 0 and 1). We do this by finding the typical distance of a random linear code.

13.5  Weight enumerator function of random linear codes

Imagine making a code by picking the binary entries in the M \times N parity-check matrix H at random. What weight enumerator function should we expect?

[Figure 13.7. A random binary parity-check matrix.]
The weight enumerator of one particular code with parity-check matrix H, A(w)_H, is the number of codewords of weight w, which can be written

  A(w)_H = \sum_{x: |x| = w} 1[Hx = 0],   (13.8)

where the sum is over all vectors x whose weight is w, and the truth function 1[Hx = 0] equals one if Hx = 0 and zero otherwise.

We can find the expected value of A(w),

  \langle A(w) \rangle = \sum_H P(H) A(w)_H   (13.9)
                       = \sum_{x: |x| = w} \sum_H P(H) 1[Hx = 0],   (13.10)

by evaluating the probability that a particular word of weight w > 0 is a codeword of the code (averaging over all binary linear codes in our ensemble). By symmetry, this probability depends only on the weight w of the word, not on the details of the word. The probability that the entire syndrome Hx is zero can be found by multiplying together the probabilities that each of the M bits in the syndrome is zero. Each bit z_m of the syndrome is a sum (mod 2) of w random bits, so the probability that z_m = 0 is 1/2. The probability that Hx = 0 is thus

  \sum_H P(H) 1[Hx = 0] = (1/2)^M = 2^{-M},   (13.11)

independent of w.

The expected number of words of weight w (13.10) is given by summing, over all words of weight w, the probability that each word is a codeword. The number of words of weight w is \binom{N}{w}, so

  \langle A(w) \rangle = \binom{N}{w} 2^{-M}  for any w > 0.   (13.12)
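The ensemble average (13.12) can be checked by direct sampling. This Monte Carlo sketch (my code, with N kept small so that enumerating weight-w words is cheap) draws random parity-check matrices and counts weight-w codewords:

```python
import random
from math import comb
from itertools import combinations

def avg_weight_enumerator(N, M, w, trials=2000, seed=1):
    """Monte Carlo estimate of <A(w)> over random M x N parity-check
    matrices, to compare with C(N,w) 2^(-M)  (13.12)."""
    rng = random.Random(seed)
    supports = list(combinations(range(N), w))  # weight-w words
    total = 0
    for _ in range(trials):
        H = [[rng.randint(0, 1) for _ in range(N)] for _ in range(M)]
        for s in supports:
            # x is a codeword iff every parity check sums to 0 mod 2
            if all(sum(row[i] for i in s) % 2 == 0 for row in H):
                total += 1
    return total / trials

N, M, w = 10, 4, 3
print(avg_weight_enumerator(N, M, w))   # fluctuates around the exact value
print(comb(N, w) * 2 ** (-M))           # exact: 120/16 = 7.5
```

The agreement improves as `trials` grows; the point of the calculation is only that each weight-w word passes all M random checks with probability 2^{-M}, exactly as the derivation above asserts.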

For large N, we can use \log_2 \binom{N}{w} \simeq N H_2(w/N) and R \simeq 1 - M/N to write

  \log_2 \langle A(w) \rangle \simeq N H_2(w/N) - M   (13.13)
                              \simeq N [H_2(w/N) - (1 - R)]  for any w > 0.   (13.14)

As a concrete example, figure 13.8 shows the expected weight enumerator function of a rate-1/3 random linear code with N = 540 and M = 360.

[Figure 13.8. The expected weight enumerator function \langle A(w) \rangle of a random linear code with N = 540 and M = 360. The lower figure shows \langle A(w) \rangle on a logarithmic scale.]

Gilbert-Varshamov distance

For weights w such that H_2(w/N) < (1 - R), the expectation of A(w) is smaller than 1; for weights such that H_2(w/N) > (1 - R), the expectation is greater than 1. We thus expect, for large N, that the minimum distance of a random linear code will be close to the distance d_GV defined by

  H_2(d_GV / N) = (1 - R).   (13.15)

Definition. This distance, d_GV \equiv N H_2^{-1}(1 - R), is the Gilbert-Varshamov distance for rate R and blocklength N.

The Gilbert-Varshamov conjecture, widely believed, asserts that (for large N) it is not possible to create binary codes with minimum distance significantly greater than d_GV.

Definition. The Gilbert-Varshamov rate R_GV is the maximum rate at which you can reliably communicate with a bounded-distance decoder (as defined on p.207), assuming that the Gilbert-Varshamov conjecture is true.
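Since H_2 is increasing on [0, 1/2], d_GV can be found numerically by bisection. This sketch (mine, not the book's) computes it for the N = 540, R = 1/3 example of figure 13.8:

```python
from math import log2

def H2(x):
    """Binary entropy function, in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def gv_distance(N, R, tol=1e-12):
    """Gilbert-Varshamov distance: solve H2(d/N) = 1 - R on [0, 1/2]
    by bisection, as in (13.15)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H2(mid) < 1 - R:
            lo = mid
        else:
            hi = mid
    return N * lo

d = gv_distance(540, 1/3)
print(round(d))   # about 94: a random rate-1/3 code of length 540
                  # typically has minimum distance near 94
```

This matches the crossover visible in figure 13.8, where the expected weight enumerator climbs through 1.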
Why sphere-packing is a bad perspective, and an obsession with distance is inappropriate

If one uses a bounded-distance decoder, the maximum tolerable noise level will flip a fraction f_bd = (1/2) d_min / N of the bits. So, assuming d_min is equal to the Gilbert distance d_GV (13.15), we have:

  H_2(2 f_bd) = (1 - R_GV).   (13.16)

  R_GV = 1 - H_2(2 f_bd).   (13.17)

Now, here's the crunch: what did Shannon say is achievable? He said the maximum possible rate of communication is the capacity,

  C = 1 - H_2(f).   (13.18)

So for a given rate R, the maximum tolerable noise level, according to Shannon, is given by

  H_2(f) = (1 - R).   (13.19)

Our conclusion: imagine a good code of rate R has been chosen; equations (13.16) and (13.19) respectively define the maximum noise levels tolerable by a bounded-distance decoder, f_bd, and by Shannon's decoder, f.

  f_bd = f / 2.   (13.20)

Bounded-distance decoders can only ever cope with half the noise-level that Shannon proved is tolerable!

[Figure 13.9. Contrast between Shannon's channel capacity C and the Gilbert-Varshamov rate R_GV, the maximum communication rate achievable using a bounded-distance decoder, as a function of noise level f. For any given rate, the maximum tolerable noise level for Shannon is twice as big as the maximum tolerable noise level for a 'worst-case-ist' who uses a bounded-distance decoder.]

How does this relate to perfect codes? A code is perfect if there are t-spheres around its codewords that fill Hamming space without overlapping.
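The factor of two in (13.20) shows up directly if the two rate curves of figure 13.9 are tabulated side by side. A small sketch of mine:

```python
from math import log2

def H2(x):
    """Binary entropy function, in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def capacity(f):
    """Shannon capacity of the binary symmetric channel (13.18)."""
    return 1 - H2(f)

def gv_rate(f_bd):
    """Rate achievable with a bounded-distance decoder at noise f_bd,
    assuming the GV conjecture: R_GV = 1 - H2(2 f_bd)  (13.17)."""
    return 1 - H2(2 * f_bd) if f_bd <= 0.25 else 0.0

for f in (0.01, 0.05, 0.10, 0.20):
    print(f, round(capacity(f), 3), round(gv_rate(f), 3))
```

By construction gv_rate(f) equals capacity(2f): at every rate, the bounded-distance decoder tolerates exactly half the noise that Shannon's decoder does.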

But when a typical random linear code is used to communicate over a binary symmetric channel near to the Shannon limit, the typical number of bits flipped per block is fN, and the minimum distance between codewords is also fN, or a little bigger, if we are a little below the Shannon limit. So the fN-spheres around the codewords overlap with each other sufficiently that each sphere almost contains the centre of its nearest neighbour! The reason why this overlap is not disastrous is because, in high dimensions, the volume associated with the overlap, shown shaded in figure 13.10, is a tiny fraction of either sphere, so the probability of landing in it is extremely small.

[Figure 13.10. Two overlapping spheres whose radius is almost as big as the distance between their centres.]

The moral of the story is that worst-case-ism can be bad for you, halving your ability to tolerate noise. You have to be able to decode way beyond the minimum distance of a code to get to the Shannon limit!

Nevertheless, the minimum distance of a code is of interest in practice, because, under some conditions, the minimum distance dominates the errors made by a code.

13.6  Berlekamp's bats

A blind bat lives in a cave. It flies about the centre of the cave, which corresponds to one codeword, with its typical distance from the centre controlled by a friskiness parameter f. (The displacement of the bat from the centre corresponds to the noise vector.) The boundaries of the cave are made up of stalactites that point in towards the centre of the cave (figure 13.11). Each stalactite is analogous to the boundary between the home codeword and another codeword.
The stalactite is like the shaded region in figure 13.10, but reshaped to convey the idea that it is a region of very small volume. Decoding errors correspond to the bat's trajectory passing inside a stalactite. Collisions with stalactites at various distances from the centre are possible.

If the friskiness is very small, the bat is usually very close to the centre of the cave; collisions will be rare, and when they do occur, they will usually involve the stalactites whose tips are closest to the centre point. Similarly, under low-noise conditions, decoding errors will be rare, and they will typically involve low-weight codewords. Under low-noise conditions, the minimum distance of a code is relevant to the (very small) probability of error.

[Figure 13.11. Berlekamp's schematic picture of Hamming space in the vicinity of a codeword. The jagged solid line encloses all points to which this codeword is the closest. The t-sphere around the codeword takes up a small fraction of this space.]

If the friskiness is higher, the bat may often make excursions beyond the safe distance t where the longest stalactites start, but it will collide most frequently with more distant stalactites, owing to their greater number. There's only a tiny number of stalactites at the minimum distance, so they are relatively unlikely to cause errors. Similarly, errors in a real error-correcting code depend on the properties of the weight enumerator function. At very high friskiness, the bat is almost always a long way from the centre of the cave, and almost all its collisions involve contact with distant stalactites. Under these conditions, the bat's collision frequency has nothing to do with the distance from the centre to the closest stalactite.

13.7  Concatenation of Hamming codes

It is instructive to play some more with the concatenation of Hamming codes, a concept we first visited in figure 11.6, because we will get insights into the notion of good codes and the relevance or otherwise of the minimum distance of a code.

We can create a concatenated code for a binary symmetric channel with noise density f by encoding with several Hamming codes in succession.

The table recaps the key properties of the Hamming codes, indexed by number of constraints, M. All the Hamming codes have minimum distance d = 3 and can correct one error in N.

  N = 2^M - 1                        blocklength
  K = N - M                          number of source bits
  p_B = (3/N) \binom{N}{2} f^2       probability of block error, to leading order

[Figure 13.12. The rate R of the concatenated Hamming code as a function of the number of concatenations, C.]
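The rate curve of figure 13.12 can be reproduced by multiplying the rates (N_c - M_c)/N_c of successive Hamming codes. This sketch (mine) uses the choice M_1 = 2, M_2 = 3, M_3 = 4, ... discussed below:

```python
def concatenated_rate(C):
    """Rate of C concatenated Hamming codes with M_c = c + 1:
    the product of the constituent rates (N_c - M_c) / N_c."""
    R = 1.0
    for c in range(1, C + 1):
        M = c + 1            # M_1 = 2, M_2 = 3, ...
        N = 2 ** M - 1
        R *= (N - M) / N
    return R

for C in (1, 2, 4, 8, 16, 24):
    print(C, round(concatenated_rate(C), 4))
# The product decreases but converges to a non-zero limit near 0.093.
```

The factors (N - M)/N approach 1 so quickly (M/(2^M - 1) is summable) that the infinite product converges, which is why the rate does not collapse to zero.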
If we make a product code by concatenating a sequence of C Hamming codes with increasing M_c, we can choose those parameters {M_c}_{c=1}^{C} in such a way that the rate of the product code,

  R_C = \prod_{c=1}^{C} (N_c - M_c) / N_c,   (13.21)

tends to a non-zero limit as C increases. For example, if we set M_1 = 2, M_2 = 3, M_3 = 4, etc., the asymptotic rate is 0.093 (figure 13.12).

The blocklengths N_C are a rapidly-growing function of C, so these codes are somewhat impractical. A further weakness of these codes is that their minimum distance is not very good (figure 13.13). Every constituent Hamming code has minimum distance 3, so the minimum distance of the Cth product is 3^C. The blocklength N_C grows faster than 3^C, so the ratio d/N tends to zero as C increases. In contrast, for typical random codes, the ratio d/N tends to a constant such that H_2(d/N) = 1 - R. Concatenated Hamming codes thus have 'bad' distance.

[Figure 13.13. The blocklength N_C (upper curve) and minimum distance d_C (lower curve) of the concatenated Hamming code as a function of the number of concatenations C.]

Nevertheless, it turns out that this simple sequence of codes yields good codes for some channels - but not very good codes (see section 11.4 to recall the definitions of the terms 'good' and 'very good'). Rather than prove this result, we will simply explore it numerically.

Figure 13.14 shows the bit error probability p_b of the concatenated codes, assuming that the constituent codes are decoded in sequence, as described in section 11.4. [This one-code-at-a-time decoding is suboptimal, as we saw there.] The horizontal axis shows the rates of the codes. As the number of concatenations increases, the rate drops to 0.093 and the error probability drops towards zero. The channel assumed in the figure is the binary symmetric

channel with f = 0.0588. This is the highest noise level that can be tolerated using this concatenated code.

[Figure 13.14. The bit error probabilities versus the rates R of the concatenated Hamming codes, for the binary symmetric channel with f = 0.0588. Labels alongside the points show the blocklengths, N. The solid line shows the Shannon limit for this channel. The bit error probability drops to zero while the rate tends to 0.093, so the concatenated Hamming codes are a 'good' code family.]

The take-home message from this story is distance isn't everything. The minimum distance of a code, although widely worshipped by coding theorists, is not of fundamental importance to Shannon's mission of achieving reliable communication over noisy channels.

Exercise 13.5.[3] Prove that there exist families of codes with 'bad' distance that are 'very good' codes.

13.8  Distance isn't everything

Let's get a quantitative feeling for the effect of the minimum distance of a code, for the special case of a binary symmetric channel.

The error probability associated with one low-weight codeword

Let a binary code have blocklength N and just two codewords, which differ in d places. For simplicity, let's assume d is even. What is the error probability if this code is used on a binary symmetric channel with noise level f?

Bit flips matter only in the d places where the two codewords differ. The error probability is dominated by the probability that d/2 of these bits are flipped. What happens to the other N - d bits is irrelevant, since the optimal decoder ignores them.
  P(block error) \simeq \binom{d}{d/2} f^{d/2} (1-f)^{d/2}.   (13.22)

This error probability associated with a single codeword of weight d is plotted in figure 13.15. Using the approximation for the binomial coefficient (1.16), we can further approximate

  P(block error) \simeq [2 f^{1/2} (1-f)^{1/2}]^d   (13.23)
                 \equiv [\beta(f)]^d,   (13.24)

where \beta(f) = 2 f^{1/2} (1-f)^{1/2} is called the Bhattacharyya parameter of the channel.

[Figure 13.15. The error probability associated with a single codeword of weight d, \binom{d}{d/2} f^{d/2} (1-f)^{d/2}, as a function of f, for d = 10, 20, 30, 40, 50, 60.]

Now, consider a general linear code with distance d. Its block error probability must be at least \binom{d}{d/2} f^{d/2} (1-f)^{d/2}, independent of the blocklength N of the code. For this reason, a sequence of codes of increasing blocklength N and constant distance d (i.e., 'very bad' distance) cannot have a block error probability that tends to zero, on any binary symmetric channel. If we are interested in making superb error-correcting codes with tiny, tiny error probability, we might therefore shun codes with bad distance. However, being pragmatic, we should look more carefully at figure 13.15. In Chapter 1 we argued that codes for disk drives need an error probability smaller than about 10^{-18}. If the raw error probability in the disk drive is about 0.001, the error probability associated with one codeword at distance d = 20 is smaller than 10^{-24}. If the raw error probability in the disk drive is about 0.01, the error probability associated with one codeword at distance d = 30 is smaller than 10^{-20}. For practical purposes, therefore, it is not essential for a code to have good distance. For example, codes of blocklength 10 000, known to have many codewords of weight 32, can nevertheless correct errors of weight 320 with tiny error probability.
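The disk-drive numbers quoted above follow directly from (13.22). This sketch (mine, not from the book) evaluates the dominant term and the looser Bhattacharyya form:

```python
from math import comb

def single_codeword_error(d, f):
    """Dominant block-error term for one codeword at distance d (13.22):
    C(d, d/2) f^(d/2) (1-f)^(d/2), with d even."""
    return comb(d, d // 2) * f ** (d // 2) * (1 - f) ** (d // 2)

def bhattacharyya_bound(d, f):
    """The looser bound [beta(f)]^d from (13.23)-(13.24)."""
    return (2 * (f * (1 - f)) ** 0.5) ** d

print(single_codeword_error(20, 0.001))  # about 1.8e-25, below 1e-24
print(single_codeword_error(30, 0.01))   # about 1.3e-22, below 1e-20
print(bhattacharyya_bound(20, 0.001))    # slightly larger, near 1e-24
```

Since C(d, d/2) <= 2^d, the Bhattacharyya form always upper-bounds the exact dominant term; the two agree on the exponential scale that matters here.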

I wouldn't want you to think I am recommending the use of codes with bad distance; in Chapter 47 we will discuss low-density parity-check codes, my favourite codes, which have both excellent performance and good distance.

13.9  The union bound

The error probability of a code on the binary symmetric channel can be bounded in terms of its weight enumerator function by adding up appropriate multiples of the error probability associated with a single codeword (13.24):

  P(block error) \le \sum_{w > 0} A(w) [\beta(f)]^w.   (13.25)

This inequality, which is an example of a union bound, is accurate for low noise levels f, but inaccurate for high noise levels, because it overcounts the contribution of errors that cause confusion with more than one codeword at a time.

Exercise 13.6.[3] Poor man's noisy-channel coding theorem. Pretending that the union bound (13.25) is accurate, and using the average weight enumerator function of a random linear code (13.14) (section 13.5) as A(w), estimate the maximum rate R_UB(f) at which one can communicate over a binary symmetric channel. Or, to look at it more positively, using the union bound (13.25) as an inequality, show that communication at rates up to R_UB(f) is possible over the binary symmetric channel.

In the following chapter, by analysing the probability of error of syndrome decoding for a binary linear code, and using a union bound, we will prove Shannon's noisy-channel coding theorem (for symmetric binary channels), and
13.10 Dual codes

A concept that has some importance in coding theory, though we will have no immediate use for it in this book, is the idea of the dual of a linear error-correcting code.

An (N,K) linear error-correcting code can be thought of as a set of 2^K codewords generated by adding together all combinations of K independent basis codewords. The generator matrix of the code consists of those K basis codewords, conventionally written as row vectors. For example, the (7,4) Hamming code's generator matrix (from p.10) is

G = [ 1 0 0 0 1 0 1 ]
    [ 0 1 0 0 1 1 0 ]
    [ 0 0 1 0 1 1 1 ]
    [ 0 0 0 1 0 1 1 ]    (13.26)

and its sixteen codewords were displayed in table 1.14 (p.9). The codewords of this code are linear combinations of the four vectors [1 0 0 0 1 0 1], [0 1 0 0 1 1 0], [0 0 1 0 1 1 1], and [0 0 0 1 0 1 1].

An (N,K) code may also be described in terms of an M x N parity-check matrix (where M = N - K) as the set of vectors {t} that satisfy

H t = 0.    (13.27)
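The claim that the 2^K codewords are exactly the mod-2 combinations of the rows of G is easy to check by enumeration; a brief sketch, not part of the text:

```python
import itertools

# Generator matrix of the (7,4) Hamming code, as in (13.26)
G = [(1,0,0,0,1,0,1),
     (0,1,0,0,1,1,0),
     (0,0,1,0,1,1,1),
     (0,0,0,1,0,1,1)]

def codewords(G):
    """All mod-2 linear combinations of the rows of G."""
    K, N = len(G), len(G[0])
    return {tuple(sum(s[k] * G[k][n] for k in range(K)) % 2 for n in range(N))
            for s in itertools.product((0, 1), repeat=K)}

C = codewords(G)
assert len(C) == 16                              # 2^K distinct codewords
assert min(sum(c) for c in C if any(c)) == 3     # minimum distance 3
```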

13.10: Dual codes   217

One way of thinking of this equation is that each row of H specifies a vector to which t must be orthogonal if it is a codeword.

The generator matrix specifies K vectors from which all codewords can be built, and the parity-check matrix specifies a set of M vectors to which all codewords are orthogonal.

The dual of a code is obtained by exchanging the generator matrix and the parity-check matrix.

Definition. The set of all vectors of length N that are orthogonal to all codewords in a code, C, is called the dual of the code, C^⊥.

If t is orthogonal to h_1 and h_2, then it is also orthogonal to h_3 ≡ h_1 + h_2; so all codewords are orthogonal to any linear combination of the M rows of H. So the set of all linear combinations of the rows of the parity-check matrix is the dual code.

For our Hamming (7,4) code, the parity-check matrix is (from p.12):

H = [ P I_3 ] = [ 1 1 1 0 1 0 0 ]
                [ 0 1 1 1 0 1 0 ]
                [ 1 0 1 1 0 0 1 ].    (13.28)

The dual of the (7,4) Hamming code H_(7,4) is the code shown in table 13.16.

Table 13.16. The eight codewords of the dual of the (7,4) Hamming code. [Compare with table 1.14, p.9.]

0000000  0101101  1001110  1100011
0010111  0111010  1011001  1110100

A possibly unexpected property of this pair of codes is that the dual, H^⊥_(7,4), is contained within the code H_(7,4) itself: every word in the dual code is a codeword of the original (7,4) Hamming code. This relationship can be written using set notation:

H^⊥_(7,4) ⊂ H_(7,4).    (13.29)

The possibility that the set of dual vectors can overlap the set of codeword vectors is counterintuitive if we think of the vectors as real vectors: how can a vector be orthogonal to itself?
But when we work in modulo-two arithmetic, many non-zero vectors are indeed orthogonal to themselves!

Exercise 13.7.^[1, p.223] Give a simple rule that distinguishes whether a binary vector is orthogonal to itself, as is each of the three vectors [1 1 1 0 1 0 0], [0 1 1 1 0 1 0], and [1 0 1 1 0 0 1].

Some more duals

In general, if a code has a systematic generator matrix

G = [ I_K | P^T ],    (13.30)

where P is an M x K matrix, then its parity-check matrix is

H = [ P | I_M ].    (13.31)
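The recipe (13.30, 13.31) can be checked mechanically. A sketch (not from the text), using the (7,4) Hamming matrices of this section:

```python
def parity_check_from_systematic(G):
    """Given a systematic generator matrix G = [I_K | P^T] as lists of
    bits, return H = [P | I_M] as in (13.31)."""
    K, N = len(G), len(G[0])
    M = N - K
    # P^T is the right K x M block of G; transposing it gives P (M x K)
    P = [[G[k][K + m] for k in range(K)] for m in range(M)]
    return [P[m] + [1 if j == m else 0 for j in range(M)] for m in range(M)]

G = [[1,0,0,0,1,0,1],
     [0,1,0,0,1,1,0],
     [0,0,1,0,1,1,1],
     [0,0,0,1,0,1,1]]
H = parity_check_from_systematic(G)
assert H == [[1,1,1,0,1,0,0],
             [0,1,1,1,0,1,0],
             [1,0,1,1,0,0,1]]          # matches (13.28)
# every row of G is orthogonal (mod 2) to every row of H
assert all(sum(g*h for g, h in zip(gr, hr)) % 2 == 0 for gr in G for hr in H)
```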

218   13 | Binary Codes

Example 13.8. The repetition code R_3 has generator matrix

G = [ 1 1 1 ];    (13.32)

its parity-check matrix is

H = [ 1 1 0 ]
    [ 1 0 1 ].    (13.33)

The two codewords are [1 1 1] and [0 0 0].

The dual code has generator matrix

G^⊥ = H = [ 1 1 0 ]
          [ 1 0 1 ]    (13.34)

or equivalently, modifying G^⊥ into systematic form by row additions,

G^⊥ = [ 1 0 1 ]
      [ 0 1 1 ].    (13.35)

We call this dual code the simple parity code P_3; it is the code with one parity-check bit, which is equal to the sum of the two source bits. The dual code's four codewords are [1 1 0], [1 0 1], [0 0 0], and [0 1 1].

In this case, the only vector common to the code and the dual is the all-zero codeword.

Goodness of duals

If a sequence of codes is 'good', are their duals good too? Examples can be constructed of all cases: good codes with good duals (random linear codes); bad codes with bad duals; and good codes with bad duals. The last category is especially important: many state-of-the-art codes have the property that their duals are bad. The classic example is the low-density parity-check code, whose dual is a low-density generator-matrix code.

Exercise 13.9.^[3] Show that low-density generator-matrix codes are bad. A family of low-density generator-matrix codes is defined by two parameters j, k, which are the column weight and row weight respectively of all columns and rows of G. These weights are fixed, independent of N; for example, (j,k) = (3,6). [Hint: show that the code has low-weight codewords, then use the argument from p.215.]

Exercise 13.10.^[5] Show that low-density parity-check codes are good, and have good distance. (For solutions, see Gallager (1963) and MacKay (1999b).)
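Example 13.8's claim that the dual of R_3 is the parity code P_3 can be confirmed by brute-force search over all length-3 vectors (an illustrative sketch, not part of the text):

```python
import itertools

def dual_code(C, N):
    """All length-N binary vectors orthogonal (mod 2) to every codeword in C."""
    return {v for v in itertools.product((0, 1), repeat=N)
            if all(sum(a * b for a, b in zip(v, c)) % 2 == 0 for c in C)}

R3 = {(0, 0, 0), (1, 1, 1)}
P3 = dual_code(R3, 3)
assert P3 == {(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)}   # the parity code
assert R3 & P3 == {(0, 0, 0)}   # only the all-zero word is shared
```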
Self-dual codes

The (7,4) Hamming code had the property that the dual was contained in the code itself. A code is self-orthogonal if it is contained in its dual. For example, the dual of the (7,4) Hamming code is a self-orthogonal code. One way of seeing this is that the overlap between any pair of rows of H is even. Codes that contain their duals are important in quantum error-correction (Calderbank and Shor, 1996).

It is intriguing, though not necessarily useful, to look at codes that are self-dual. A code C is self-dual if the dual of the code is identical to the code,

C^⊥ = C.    (13.36)

Some properties of self-dual codes can be deduced:

13.11: Generalizing perfectness to other channels   219

1. If a code is self-dual, then its generator matrix is also a parity-check matrix for the code.

2. Self-dual codes have rate 1/2, i.e., M = K = N/2.

3. All codewords have even weight.

Exercise 13.11.^[2, p.223] What property must the matrix P satisfy, if the code with generator matrix G = [I_K | P^T] is self-dual?

Examples of self-dual codes

1. The repetition code R_2 is a simple example of a self-dual code:

G = H = [ 1 1 ].    (13.37)

2. The smallest non-trivial self-dual code is the following (8,4) code:

G = [ I_4 | P^T ] = [ 1 0 0 0 0 1 1 1 ]
                    [ 0 1 0 0 1 0 1 1 ]
                    [ 0 0 1 0 1 1 0 1 ]
                    [ 0 0 0 1 1 1 1 0 ].    (13.38)

Exercise 13.12.^[2, p.223] Find the relationship of the above (8,4) code to the (7,4) Hamming code.

Duals and graphs

Let a code be represented by a graph in which there are nodes of two types, parity-check constraints and equality constraints, joined by edges which represent the bits of the code (not all of which need be transmitted).

The dual code's graph is obtained by replacing all parity-check nodes by equality nodes and vice versa. This type of graph is called a normal graph by Forney (2001).

Further reading

Duals are important in coding theory because functions involving a code (such as the posterior distribution over codewords) can be transformed by a Fourier transform into functions over the dual code. For an accessible introduction to Fourier analysis on finite groups, see Terras (1999). See also MacWilliams and Sloane (1977).

13.11 Generalizing perfectness to other channels

Having given up the search for perfect codes for the binary symmetric channel, we could console ourselves by changing channel.
We could call a code 'a perfect u-error-correcting code for the binary erasure channel' if it can restore any u erased bits, and never more than u. Rather than using the word perfect, however, the conventional term for such a code is a 'maximum distance separable code', or MDS code. In a perfect u-error-correcting code for the binary erasure channel, the number of redundant bits must be N - K = u.

As we already noted in exercise 11.10 (p.190), the (7,4) Hamming code is not an MDS code. It can recover some sets of 3 erased bits, but not all. If any 3 bits corresponding to a codeword of weight 3 are erased, then one bit of information is unrecoverable. This is why the (7,4) code is a poor choice for a RAID system.

220   13 | Binary Codes

A tiny example of a maximum distance separable code is the simple parity-check code P_3, whose parity-check matrix is H = [1 1 1]. This code has 4 codewords, all of which have even parity. All codewords are separated by a distance of 2. Any single erased bit can be restored by setting it to the parity of the other two bits. The repetition codes are also maximum distance separable codes.

Exercise 13.13.^[5, p.224] Can you make an (N,K) code, with M = N - K parity symbols, for a q-ary erasure channel, such that the decoder can recover the codeword when any M symbols are erased in a block of N? [Example: for the channel with q = 4 symbols there is an (N,K) = (5,2) code which can correct any M = 3 erasures.]

For the q-ary erasure channel with q > 2, there are large numbers of MDS codes, of which the Reed-Solomon codes are the most famous and most widely used. As long as the field size q is bigger than the blocklength N, MDS block codes of any rate can be found. (For further reading, see Lin and Costello (1983).)

13.12 Summary

Shannon's codes for the binary symmetric channel can almost always correct fN errors, but they are not fN-error-correcting codes.

Reasons why the distance of a code has little relevance

1. The Shannon limit shows that the best codes must be able to cope with a noise level twice as big as the maximum noise level for a bounded-distance decoder.

2. When the binary symmetric channel has f > 1/4, no code with a bounded-distance decoder can communicate at all; but Shannon says good codes exist for such channels.

3. Concatenation shows that we can get good performance even if the distance is bad.
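Returning to the tiny MDS example above: the single-erasure repair rule for the simple parity code P_3 takes one line of arithmetic. A sketch (not part of the text):

```python
def recover_single_erasure(word):
    """word: bits of a parity-code codeword with exactly one erased
    position marked None. Since every codeword has even weight, the
    erased bit equals the mod-2 sum of the remaining bits."""
    i = word.index(None)
    word[i] = sum(b for b in word if b is not None) % 2
    return word

assert recover_single_erasure([1, None, 1]) == [1, 0, 1]
assert recover_single_erasure([None, 1, 1]) == [0, 1, 1]
```

The same rule works for the simple parity code of any length, which is why these codes are MDS for the single-erasure case M = 1.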
The whole weight enumerator function is relevant to the question of whether a code is a good code.

The relationship between good codes and distance properties is discussed further in exercise 13.14 (p.220).

13.13 Further exercises

Exercise 13.14.^[3, p.224] A codeword t is selected from a linear (N,K) code C, and it is transmitted over a noisy channel; the received signal is y. We assume that the channel is a memoryless channel such as a Gaussian channel. Given an assumed channel model P(y|t), there are two decoding problems.

The codeword decoding problem is the task of inferring which codeword t was transmitted given the received signal.

The bitwise decoding problem is the task of inferring for each transmitted bit t_n how likely it is that that bit was a one rather than a zero.

13.13: Further exercises   221

Consider optimal decoders for these two decoding problems. Prove that the probability of error of the optimal bitwise-decoder is closely related to the probability of error of the optimal codeword-decoder, by proving the following theorem.

Theorem 13.1 If a binary linear code has minimum distance d_min, then, for any given channel, the bit error probability of the optimal bitwise decoder, p_b, and the block error probability of the maximum likelihood decoder, p_B, are related by:

p_B ≥ p_b ≥ (1/2) (d_min / N) p_B.    (13.39)

Exercise 13.15.^[1] What are the minimum distances of the (15,11) Hamming code and the (31,26) Hamming code?

Exercise 13.16.^[2] Let A(w) be the average weight enumerator function of a rate-1/3 random linear code with N = 540 and M = 360. Estimate, from first principles, the value of A(w) at w = 1.

Exercise 13.17.^[3C] A code with minimum distance greater than d_GV. A rather nice (15,5) code is generated by this generator matrix, which is based on measuring the parities of all \binom{5}{3} = 10 triplets of source bits:

G = [a 5 x 15 binary matrix; its entries are not legible in this copy].    (13.40)

Find the minimum distance and weight enumerator function of this code.

Exercise 13.18.^[3C] Find the minimum distance of the 'pentagonful' low-density parity-check code whose parity-check matrix is

H = [a 10 x 15 binary matrix with three 1s per row; its entries are not legible in this copy].    (13.41)

Show that nine of the ten rows are independent, so the code has parameters N = 15, K = 6. Using a computer, find its weight enumerator function.

[Figure 13.17. The graph of the pentagonful low-density parity-check code, with 15 bit nodes (circles) and 10 parity-check nodes (triangles). This graph is known as the Petersen graph.]

Exercise 13.19.^[3C] Replicate the calculations used to produce figure 13.12. Check the assertion that the highest noise level that's correctable is 0.0588. Explore alternative concatenated sequences of codes. Can you find a better sequence of concatenated codes, better in the sense that it either has higher asymptotic rate R or can tolerate a higher noise level f?
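Exercise 13.18 explicitly invites a computer calculation. A generic brute-force weight-enumerator routine like the following sketch would do the job; since the pentagonful matrix is not reproduced legibly above, it is demonstrated here on the (7,4) Hamming code instead:

```python
import itertools

def weight_enumerator(H):
    """Brute-force weight enumerator A(w): counts, for each weight w, the
    vectors t with H t = 0 (mod 2). Exponential in N; fine for small codes."""
    N = len(H[0])
    A = {}
    for t in itertools.product((0, 1), repeat=N):
        if all(sum(h * b for h, b in zip(row, t)) % 2 == 0 for row in H):
            A[sum(t)] = A.get(sum(t), 0) + 1
    return A

H_hamming = [[1,1,1,0,1,0,0],
             [0,1,1,1,0,1,0],
             [1,0,1,1,0,0,1]]
assert weight_enumerator(H_hamming) == {0: 1, 3: 7, 4: 7, 7: 1}
```

For the pentagonful code, N = 15 means only 2^15 = 32 768 vectors to test, well within reach of this naive loop.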

222   13 | Binary Codes

Exercise 13.20.^[3, p.226] Investigate the possibility of achieving the Shannon limit with linear block codes, using the following counting argument. Assume a linear code of large blocklength N and rate R = K/N. The code's parity-check matrix H has M = N - K rows. Assume that the code's optimal decoder, which solves the syndrome decoding problem Hn = z, allows reliable communication over a binary symmetric channel with flip probability f.

How many 'typical' noise vectors n are there? Roughly how many distinct syndromes z are there? Since n is reliably deduced from z by the optimal decoder, the number of syndromes must be greater than or equal to the number of typical noise vectors. What does this tell you about the largest possible value of the rate R for a given f?

Exercise 13.21.^[2] Linear binary codes use the input symbols 0 and 1 with equal probability, implicitly treating the channel as a symmetric channel. Investigate how much loss in communication rate is caused by this assumption, if in fact the channel is a highly asymmetric channel. Take as an example a Z-channel. How much smaller is the maximum possible rate of communication using symmetric inputs than the capacity of the channel? [Answer: about 6%.]

Exercise 13.22.^[2] Show that codes with 'very bad' distance are 'bad' codes, as defined in section 11.4 (p.183).

Exercise 13.23.^[3] One linear code can be obtained from another by puncturing. Puncturing means taking each codeword and deleting a defined set of bits. Puncturing turns an (N,K) code into an (N',K) code, where N' < N.

Another way to make new linear codes from old is shortening.
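A numerical sketch of the counting argument (the full solution appears in section 13.14): the 2^M syndromes must cover roughly 2^{N H_2(f)} typical noise vectors, which caps the rate at 1 - H_2(f), the capacity of the binary symmetric channel.

```python
from math import log2

def H2(f):
    """Binary entropy function H_2(f) in bits."""
    return 0.0 if f in (0.0, 1.0) else -f * log2(f) - (1 - f) * log2(1 - f)

def max_rate(f):
    """Largest rate allowed by the syndrome-counting argument:
    2^M >= 2^{N H2(f)}  =>  M/N >= H2(f)  =>  R = 1 - M/N <= 1 - H2(f)."""
    return 1.0 - H2(f)

assert abs(H2(0.5) - 1.0) < 1e-12          # a fully noisy channel forces R <= 0
assert 0.53 < max_rate(0.1) < 0.532        # f = 0.1: R <= 1 - H2(0.1), about 0.531
```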
Shortening means constraining a defined set of bits to be zero, and then deleting them from the codewords. Typically if we shorten by one bit, half of the code's codewords are lost. Shortening typically turns an (N,K) code into an (N',K') code, where N - N' = K - K'.

Another way to make a new code from two old ones is to make the intersection of the two codes: a codeword is only retained in the new code if it is present in both of the two old codes.

Discuss the effect on a code's distance properties of puncturing, shortening, and intersection. Is it possible to turn a code family with bad distance into a code family with good distance, or vice versa, by each of these three manipulations?

Exercise 13.24.^[3, p.226] Todd Ebert's 'hat puzzle'.

Three players enter a room and a red or blue hat is placed on each person's head. The colour of each hat is determined by a coin toss, with the outcome of one coin toss having no effect on the others. Each person can see the other players' hats but not his own.

No communication of any sort is allowed, except for an initial strategy session before the group enters the room. Once they have had a chance to look at the other hats, the players must simultaneously guess their

13.14: Solutions   223

own hat's colour or pass. The group shares a $3 million prize if at least one player guesses correctly and no players guess incorrectly.

The same game can be played with any number of players. The general version of the problem is to find a strategy for the group that maximizes its chances of winning the prize. Find the best strategies for groups of size three and seven. [Hint: when you've done three and seven, you might be able to solve fifteen.]

[Margin note: If you already know the hat puzzle, you could try the 'Scottish version' of the rules, in which the prize is only awarded to the group if they all guess correctly. In the 'Reformed Scottish version', all the players must guess correctly, and there are two rounds of guessing. Those players who guess during round one leave the room. The remaining players must guess in round two. What strategy should the team adopt to maximize their chance of winning?]

Exercise 13.25.^[5] Estimate how many binary low-density parity-check codes have self-orthogonal duals. [Note that we don't expect a huge number, since almost all low-density parity-check codes are 'good', but a low-density parity-check code that contains its dual must be 'bad'.]

Exercise 13.26.^[2C] In figure 13.15 we plotted the error probability associated with a single codeword of weight d as a function of the noise level f of a binary symmetric channel. Make an equivalent plot for the case of the Gaussian channel, showing the error probability associated with a single codeword of weight d as a function of the rate-compensated signal-to-noise ratio E_b/N_0. Because E_b/N_0 depends on the rate, you have to choose a code rate. Choose R = 1/2, 2/3, 3/4, or 5/6.

13.14 Solutions

Solution to exercise 13.4 (p.210). The probability of block error to leading order is p_B = (3/N) \binom{N}{2} f^2.

Solution to exercise 13.7 (p.217). A binary vector is perpendicular to itself if it has even weight, i.e., an even number of 1s.

Solution to exercise 13.11 (p.219). The self-dual code has two equivalent parity-check matrices, H_1 = G = [I_K | P^T] and H_2 = [P | I_K]; these must be equivalent to each other through row additions, that is, there is a matrix U such that U H_2 = H_1, so

[ U P | U I_K ] = [ I_K | P^T ].    (13.42)

From the right-hand sides of this equation, we have U = P^T, so the left-hand sides become:

P^T P = I_K.    (13.43)

Thus if a code with generator matrix G = [I_K | P^T] is self-dual then P is an orthogonal matrix, modulo 2, and vice versa.

Solution to exercise 13.12 (p.219). The (8,4) and (7,4) codes are intimately related. The (8,4) code, whose parity-check matrix is

H = [ P | I_4 ] = [ 0 1 1 1 1 0 0 0 ]
                  [ 1 0 1 1 0 1 0 0 ]
                  [ 1 1 0 1 0 0 1 0 ]
                  [ 1 1 1 0 0 0 0 1 ],    (13.44)

is obtained by (a) appending an extra parity-check bit which can be thought of as the parity of all seven bits of the (7,4) Hamming code; and (b) reordering the first four bits.
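Solution 13.11's condition can be verified numerically for the (8,4) code of solution 13.12, whose P block happens to be symmetric; a sketch, not from the text:

```python
# Check the condition P^T P = I (mod 2) from (13.43) for the P block
# of the (8,4) self-dual code
P = [[0,1,1,1],
     [1,0,1,1],
     [1,1,0,1],
     [1,1,1,0]]

def matmul_mod2(A, B):
    """Matrix product modulo 2."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) % 2
             for j in range(len(B[0]))] for i in range(len(A))]

I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
# P is symmetric here, so P^T = P and the condition reads P P = I (mod 2)
assert matmul_mod2(P, P) == I4
```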

224   13 | Binary Codes

Solution to exercise 13.13 (p.220). If an (N,K) code, with M = N - K parity symbols, has the property that the decoder can recover the codeword when any M symbols are erased in a block of N, then the code is said to be maximum distance separable (MDS).

No MDS binary codes exist, apart from the repetition codes and simple parity codes. For q > 2, some MDS codes can be found.

As a simple example, here is a (9,2) code for the 8-ary erasure channel. The code is defined in terms of the multiplication and addition rules of GF(8), which are given in Appendix C.1. The elements of the input alphabet are {0, 1, A, B, C, D, E, F} and the generator matrix of the code is

G = [ 1 0 1 A B C D E F ]
    [ 0 1 1 1 1 1 1 1 1 ].    (13.45)

The resulting 64 codewords are:

000000000 011111111 0AAAAAAAA 0BBBBBBBB 0CCCCCCCC 0DDDDDDDD 0EEEEEEEE 0FFFFFFFF
101ABCDEF 110BADCFE 1AB01EFCD 1BA10FEDC 1CDEF01AB 1DCFE10BA 1EFCDAB01 1FEDCBA10
A0ACEB1FD A1BDFA0EC AA0EC1BDF AB1FD0ACE ACE0AFDB1 ADF1BECA0 AECA0DF1B AFDB1CE0A
B0BEDFC1A B1AFCED0B BA1CFDEB0 BB0DECFA1 BCFA1B0DE BDEB0A1CF BED0B1AFC BFC1A0BED
C0CBFEAD1 C1DAEFBC0 CAE1DC0FB CBF0CD1EA CC0FBAE1D CD1EABF0C CEAD10CBF CFBC01DAE
D0D1CAFBE D1C0DBEAF DAFBE0D1C DBEAF1C0D DC1D0EBFA DD0C1FAEB DEBFAC1D0 DFAEBD0C1
E0EF1DBAC E1FE0CABD EACDBF10E EBDCAE01F ECABD1FE0 EDBAC0EF1 EE01FBDCA EF10EACDB
F0FDA1ECB F1ECB0FDA FADF0BCE1 FBCE1ADF0 FCB1EDA0F FDA0FCB1E FE1BCF0AD FF0ADE1BC

Solution to exercise 13.14 (p.220). Quick, rough proof of the theorem. Let x denote the difference between the reconstructed codeword and the transmitted codeword. For any given channel output r, there is a posterior distribution over x.
This distribution is positive only on vectors x belonging to the code; the sums that follow are over codewords x. The block error probability is:

p_B = \sum_{x ≠ 0} P(x | r).    (13.46)

The average bit error probability, averaging over all bits in the codeword, is:

p_b = \sum_{x ≠ 0} P(x | r) w(x)/N,    (13.47)

where w(x) is the weight of codeword x. Now the weights of the non-zero codewords satisfy

1 ≥ w(x)/N ≥ d_min/N.    (13.48)

Substituting the inequalities (13.48) into the definitions (13.46, 13.47), we obtain:

p_B ≥ p_b ≥ (d_min/N) p_B,    (13.49)

which is a factor of two stronger, on the right, than the stated result (13.39). In making the proof watertight, I have weakened the result a little.

Careful proof. The theorem relates the performance of the optimal block decoding algorithm and the optimal bitwise decoding algorithm.

We introduce another pair of decoding algorithms, called the block-guessing decoder and the bit-guessing decoder. The idea is that these two decoding algorithms are similar to the optimal block decoder and the optimal bitwise decoder, but lend themselves more easily to analysis.

We now define these decoders. Let x denote the inferred codeword. For any given code:

13.14: Solutions   225

The optimal block decoder returns the codeword x that maximizes the posterior probability P(x | r), which is proportional to the likelihood P(r | x). The probability of error of this decoder is called p_B.

The optimal bit decoder returns for each of the N bits, x_n, the value of a that maximizes the posterior probability P(x_n = a | r) = \sum_x P(x | r) 1[x_n = a]. The probability of error of this decoder is called p_b.

The block-guessing decoder returns a random codeword x with probability distribution given by the posterior probability P(x | r). The probability of error of this decoder is called p_B^G.

The bit-guessing decoder returns for each of the N bits, x_n, a random bit from the probability distribution P(x_n = a | r). The probability of error of this decoder is called p_b^G.

The theorem states that the optimal bit error probability p_b is bounded above and below by given multiples of p_B (13.39).

The left-hand inequality in (13.39) is trivially true: if a block is correct, all its constituent bits are correct; so if the optimal block decoder outperformed the optimal bit decoder, we could make a better bit decoder from the block decoder. We prove the right-hand inequality by establishing that:

(a) the bit-guessing decoder is nearly as good as the optimal bit decoder:

p_b^G ≤ 2 p_b;    (13.50)

(b) the bit-guessing decoder's error probability is related to the block-guessing decoder's by

p_b^G ≥ (d_min/N) p_B^G.    (13.51)

Then, since p_B^G ≥ p_B, we have

p_b > (1/2) p_b^G ≥ (1/2)(d_min/N) p_B^G ≥ (1/2)(d_min/N) p_B.    (13.52)

We now prove the two lemmas.

Near-optimality of guessing: Consider first the case of a single bit, with posterior probability {p_0, p_1}. The optimal bit decoder has probability of error

P^optimal = min(p_0, p_1).    (13.53)

The guessing decoder picks from 0 and 1. The truth is also distributed with the same probability. The probability that the guesser and the truth match is p_0^2 + p_1^2; the probability that they mismatch is the guessing error probability,

P^guess = 2 p_0 p_1 ≤ 2 min(p_0, p_1) = 2 P^optimal.    (13.54)

Since p_b^G is the average of many such error probabilities, P^guess, and p_b is the average of the corresponding optimal error probabilities, P^optimal, we obtain the desired relationship (13.50) between p_b^G and p_b.

226   13 | Binary Codes

Relationship between bit error probability and block error probability: The bit-guessing and block-guessing decoders can be combined in a single system: we can draw a sample x_n from the marginal distribution P(x_n | r) by drawing a sample (x_n, x) from the joint distribution P(x_n, x | r), then discarding the value of x.

We can distinguish between two cases: the discarded value of x is the correct codeword, or not. The probability of bit error for the bit-guessing decoder can then be written as a sum of two terms:

p_b^G = P(x correct) P(bit error | x correct)
        + P(x incorrect) P(bit error | x incorrect)
      = 0 + p_B^G P(bit error | x incorrect).

Now, whenever the guessed x is incorrect, the true x must differ from it in at least d bits, so the probability of bit error in these cases is at least d/N. So

p_b^G ≥ (d/N) p_B^G.

QED.

Solution to exercise 13.20 (p.222). The number of 'typical' noise vectors n is roughly 2^{N H_2(f)}. The number of distinct syndromes z is 2^M. So reliable communication implies

M ≥ N H_2(f),    (13.55)

or, in terms of the rate R = 1 - M/N,

R ≤ 1 - H_2(f),    (13.56)

a bound which agrees precisely with the capacity of the channel. This argument is turned into a proof in the following chapter.

Solution to exercise 13.24 (p.222). In the three-player case, it is possible for the group to win three-quarters of the time.

Three-quarters of the time, two of the players will have hats of the same colour and the third player's hat will be the opposite colour. The group can win every time this happens by using the following strategy. Each player looks at the other two players' hats. If the two hats are different colours, he passes.
If they are the same colour, the player guesses that his own hat is the opposite colour. This way, every time the hat colours are distributed two and one, one player will guess correctly and the others will pass, and the group will win the game. When all the hats are the same colour, however, all three players will guess incorrectly and the group will lose.

When any particular player guesses a colour, it is true that there is only a 50:50 chance that their guess is right. The reason that the group wins 75% of the time is that their strategy ensures that when players are guessing wrong, a great many are guessing wrong.

For larger numbers of players, the aim is to ensure that most of the time no one is wrong and occasionally everyone is wrong at once. In the game with 7 players, there is a strategy for which the group wins 7 out of every 8 times they play. In the game with 15 players, the group can win 15 out of 16 times.

If you have not figured out these winning strategies for teams of 7 and 15, I recommend thinking about the solution to the three-player game in terms

13.14: Solutions   227

of the locations of the winning and losing states on the three-dimensional hypercube, and then thinking laterally.

If the number of players, N, is 2^r - 1, the optimal strategy can be defined using a Hamming code of length N, and the probability of winning the prize is N/(N + 1).

Each player is identified with a number n ∈ 1...N. The two colours are mapped onto 0 and 1. Any state of their hats can be viewed as a received vector out of a binary channel. A random binary vector of length N is either a codeword of the Hamming code, with probability 1/(N + 1), or it differs in exactly one bit from a codeword. Each player looks at all the other bits and considers whether his bit can be set to a colour such that the state is a codeword (which can be deduced using the decoder of the Hamming code). If it can, then the player guesses that his hat is the other colour. If the state is actually a codeword, all players will guess, and all their guesses will be wrong. If the state is a non-codeword, only one player will guess, and his guess will be correct.

It's quite easy to train seven players to follow the optimal strategy if the cyclic representation of the (7,4) Hamming code is used (p.19).
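The three-player strategy can be checked exhaustively over all 2^3 equally likely hat configurations; a small sketch, not part of the text:

```python
import itertools

def group_wins(hats):
    """Three-player strategy: seeing two matching hats, guess the opposite
    colour; seeing two different hats, pass. The group wins if at least one
    guess is made and every guess made is correct."""
    guesses = []
    for i in range(3):
        a, b = (hats[j] for j in range(3) if j != i)
        guesses.append(1 - a if a == b else None)
    made = [g for g in guesses if g is not None]
    correct = all(g == h for g, h in zip(guesses, hats) if g is not None)
    return bool(made) and correct

wins = sum(group_wins(h) for h in itertools.product((0, 1), repeat=3))
assert wins == 6   # the group wins 6 of the 8 configurations: probability 3/4
```

The two losing configurations are exactly the all-same-colour states, matching the analysis above.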

About Chapter 14

In this chapter we will draw together several ideas that we've encountered so far in one nice short proof. We will simultaneously prove both Shannon's noisy-channel coding theorem (for symmetric binary channels) and his source coding theorem (for binary sources). While this proof has connections to many preceding chapters in the book, it's not essential to have read them all.

On the noisy-channel coding side, our proof will be more constructive than the proof given in Chapter 10; there, we proved that almost any random code is 'very good'. Here we will show that almost any linear code is very good. We will make use of the idea of typical sets (Chapters 4 and 10), and we'll borrow from the previous chapter's calculation of the weight enumerator function of random linear codes (section 13.5).

On the source coding side, our proof will show that random linear hash functions can be used for compression of compressible binary sources, thus giving a link to Chapter 12.

14  Very Good Linear Codes Exist

In this chapter we'll use a single calculation to prove simultaneously the source coding theorem and the noisy-channel coding theorem for the binary symmetric channel.

Incidentally, this proof works for much more general channel models, not only the binary symmetric channel. For example, the proof can be reworked for channels with non-binary outputs, for time-varying channels and for channels with memory, as long as they have binary inputs satisfying a symmetry property, cf. section 10.6.

14.1  A simultaneous proof of the source coding and noisy-channel coding theorems

We consider a linear error-correcting code with binary parity-check matrix H. The matrix has M rows and N columns. Later in the proof we will increase N and M, keeping M proportional to N. The rate of the code satisfies

   R >= 1 - M/N.   (14.1)

If all the rows of H are independent then this is an equality, R = 1 - M/N. In what follows, we'll assume the equality holds. Eager readers may work out the expected rank of a random binary matrix H (it's very close to M) and pursue the effect that the difference (M - rank) has on the rest of this proof (it's negligible).

A codeword t is selected, satisfying

   Ht = 0 mod 2,   (14.2)

and a binary symmetric channel adds noise x, giving the received signal

   r = t + x mod 2.   (14.3)

[In this chapter x denotes the noise added by the channel, not the input to the channel.]

The receiver aims to infer both t and x from r using a syndrome-decoding approach. Syndrome decoding was first introduced in section 1.2 (p.10 and 11).
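The remark above about the expected rank of a random binary H is easy to check empirically. A minimal sketch (the sizes M = 50, N = 100 and the trial count are illustrative choices of mine, not from the text), using the standard xor-basis method for rank over GF(2):

```python
import random

def gf2_rank(rows):
    """Rank over GF(2) of a binary matrix whose rows are integer bitmasks."""
    pivots = {}                       # leading-bit position -> reduced row
    for row in rows:
        cur = row
        while cur:
            top = cur.bit_length() - 1
            if top in pivots:
                cur ^= pivots[top]    # eliminate the leading bit
            else:
                pivots[top] = cur     # a new independent row
                break
    return len(pivots)

random.seed(0)
M, N, trials = 50, 100, 200
ranks = [gf2_rank([random.getrandbits(N) for _ in range(M)])
         for _ in range(trials)]
print(sum(ranks) / trials)            # very close to M = 50
```

With N much larger than M the probability of a rank deficiency is of order 2^-(N-M), so essentially every sampled matrix has full row rank.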
The receiver computes the syndrome

   z = Hr mod 2 = Ht + Hx mod 2 = Hx mod 2.   (14.4)

The syndrome depends only on the noise x, and the decoding problem is to find the most probable x that satisfies

   Hx = z mod 2.   (14.5)
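Equations (14.2)-(14.4) can be traced through on a toy example. The tiny parity-check matrix below is a hand-picked illustration (my choice, not from the text); the run confirms that the syndrome of the received vector equals the syndrome of the noise alone:

```python
from itertools import product

# A small binary parity-check matrix (M = 3, N = 6), chosen for illustration.
H = [[1, 0, 1, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 1, 0, 0, 0, 1]]
N = 6

def mat_vec(H, v):
    """Hv mod 2."""
    return [sum(h * x for h, x in zip(row, v)) % 2 for row in H]

# Find a non-zero codeword t with Ht = 0 mod 2 (14.2) by brute force.
t = next(list(v) for v in product((0, 1), repeat=N)
         if any(v) and not any(mat_vec(H, v)))

x = [0, 0, 0, 1, 0, 0]                          # noise added by the channel
r = [(ti + xi) % 2 for ti, xi in zip(t, x)]     # received signal (14.3)

# The syndrome of r equals the syndrome of the noise alone (14.4):
assert mat_vec(H, r) == mat_vec(H, x)
print(mat_vec(H, r))                            # -> [1, 0, 0]
```

The codeword's contribution Ht vanishes, so z points only at the noise, which is what makes syndrome decoding possible.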

This best guess for the noise vector, x-hat, is then subtracted from r to give the best guess for t. Our aim is to show that, as long as R < 1 - H(X) = 1 - H_2(f), where f is the flip probability of the binary symmetric channel, the optimal decoder for this syndrome-decoding problem has vanishing probability of error, as N increases, for random H.

We prove this result by studying a sub-optimal strategy for solving the decoding problem. Neither the optimal decoder nor this typical-set decoder would be easy to implement, but the typical-set decoder is easier to analyze. The typical-set decoder examines the typical set T of noise vectors, the set of noise vectors x' that satisfy log 1/P(x') ~ NH(X), checking to see if any of those typical vectors satisfies the observed syndrome,

   Hx' = z.   (14.6)

[We'll leave out the epsilons and betas that make a typical-set definition rigorous. Enthusiasts are encouraged to revisit section 4.4 and put these details into this proof.]

If exactly one typical vector x' does so, the typical set decoder reports that vector as the hypothesized noise vector. If no typical vector matches the observed syndrome, or more than one does, then the typical-set decoder reports an error.

The probability of error of the typical-set decoder, for a given matrix H, can be written as a sum of two terms,

   P_TS|H = P^(I) + P^(II)_TS|H,   (14.7)

where P^(I) is the probability that the true noise vector x is itself not typical, and P^(II)_TS|H is the probability that the true x is typical and at least one other typical vector clashes with it.
The first probability vanishes as N increases, as we proved when we first studied typical sets (Chapter 4). We concentrate on the second probability. To recap, we're imagining a true noise vector, x; and if any of the typical noise vectors x', different from x, satisfies H(x' - x) = 0, then we have an error. We use the truth function

   1[H(x' - x) = 0],   (14.8)

whose value is one if the statement H(x' - x) = 0 is true and zero otherwise.

We can bound the number of type II errors made when the noise is x thus:

   [Number of errors given x and H] <= sum_{x': x' in T, x' != x} 1[H(x' - x) = 0].   (14.9)

The number of errors is either zero or one; the sum on the right-hand side may exceed one, in cases where several typical noise vectors have the same syndrome. [Equation (14.9) is a union bound.]

We can now write down the probability of a type-II error by averaging over x:

   P^(II)_TS|H <= sum_{x in T} P(x) sum_{x': x' in T, x' != x} 1[H(x' - x) = 0].   (14.10)

Now, we will find the average of this probability of type-II error over all linear codes by averaging over H. By showing that the average probability of type-II error vanishes, we will thus show that there exist linear codes with vanishing error probability; indeed, that almost all linear codes are very good.

We denote averaging over all binary matrices H by < ... >_H. The average probability of type-II error is

   P-bar^(II)_TS = sum_H P(H) P^(II)_TS|H = < P^(II)_TS|H >_H   (14.11)

      = < sum_{x in T} P(x) sum_{x': x' in T, x' != x} 1[H(x' - x) = 0] >_H   (14.12)

      = sum_{x in T} P(x) sum_{x': x' in T, x' != x} < 1[H(x' - x) = 0] >_H.   (14.13)

Now, the quantity < 1[H(x' - x) = 0] >_H already cropped up when we were calculating the expected weight enumerator function of random linear codes (section 13.5): for any non-zero binary vector v, the probability that Hv = 0, averaging over all matrices H, is 2^(-M). So

   P-bar^(II)_TS = sum_{x in T} P(x) (|T| - 1) 2^(-M)   (14.14)

      <= |T| 2^(-M),   (14.15)

where |T| denotes the size of the typical set. As you will recall from Chapter 4, there are roughly 2^(NH(X)) noise vectors in the typical set. So

   P-bar^(II)_TS <= 2^(NH(X)) 2^(-M).   (14.16)

This bound on the probability of error either vanishes or grows exponentially as N increases (remembering that we are keeping M proportional to N as N increases). It vanishes if

   H(X) < M/N.   (14.17)

Substituting R = 1 - M/N, we have thus established the noisy-channel coding theorem for the binary symmetric channel: very good linear codes exist for any rate R satisfying

   R < 1 - H(X),   (14.18)

where H(X) is the entropy of the channel noise, per bit.

Exercise 14.1.[3] Redo the proof for a more general channel.

14.2  Data compression by linear hash codes

The decoding game we have just played can also be viewed as an uncompression game. The world produces a binary noise vector x from a source P(x). The noise has redundancy (if the flip probability is not 0.5). We compress it with a linear compressor that maps the N-bit input x (the noise) to the M-bit output z (the syndrome). Our uncompression task is to recover the input x from the output z.
The rate of the compressor is

   R_compressor = M/N.   (14.19)

[We don't care about the possibility of linear redundancies in our definition of the rate, here.] The result that we just found, that the decoding problem can be solved, for almost any H, with vanishing error probability, as long as H(X) < M/N, thus instantly proves a source coding theorem:

   Given a binary source X of entropy H(X), and a required compressed rate R > H(X), there exists a linear compressor x -> z = Hx mod 2 having rate M/N equal to that required rate R, and an associated uncompressor, that is virtually lossless.

This theorem is true not only for a source of independent identically distributed symbols but also for any source for which a typical set can be defined: sources with memory, and time-varying sources, for example; all that's required is that the source be ergodic.
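A toy version of this compressor can be run directly. As an illustrative stand-in (my choice, not the text's), the (7,4) Hamming parity-check matrix plays the role of a random H, and the vectors of weight at most one play the role of the typical set of a very sparse source; every such 7-bit vector is then recovered exactly from its 3-bit syndrome:

```python
# Columns of H are the binary expansions of 1..7, so the eight vectors of
# weight <= 1 all have distinct syndromes, making this toy compressor lossless.
N, M = 7, 3
H = [[(n >> k) & 1 for n in range(1, 8)] for k in range(3)]

def compress(x):
    """z = Hx mod 2: 7 bits in, 3 bits out."""
    return [sum(h * xi for h, xi in zip(row, x)) % 2 for row in H]

def uncompress(z):
    """Search the 'typical set' (here: vectors of weight <= 1) for Hx = z."""
    candidates = [[0] * N] + [[int(i == j) for i in range(N)] for j in range(N)]
    matches = [x for x in candidates if compress(x) == z]
    return matches[0] if len(matches) == 1 else None   # None = decoder failure

for j in range(N):
    x = [int(i == j) for i in range(N)]
    assert uncompress(compress(x)) == x
print("all weight-<=1 vectors recovered from their 3-bit syndromes")
```

Here the compression rate is M/N = 3/7, and decoding succeeds deterministically because the assumed 'typical set' is small enough that no two of its members share a syndrome; for a genuinely random H and a true typical set, the theorem says the same holds with high probability once M/N > H(X).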

Notes

This method for proving that codes are good can be applied to other linear codes, such as low-density parity-check codes (MacKay, 1999b; Aji et al., 2000). For each code we need an approximation of its expected weight enumerator function.

15  Further Exercises on Information Theory

The most exciting exercises, which will introduce you to further ideas in information theory, are towards the end of this chapter.

Refresher exercises on source coding and noisy channels

Exercise 15.1.[2] Let X be an ensemble with A_X = {0, 1} and P_X = {0.995, 0.005}. Consider source coding using the block coding of X^100 where every x in X^100 containing 3 or fewer 1s is assigned a distinct codeword, while the other xs are ignored.

   (a) If the assigned codewords are all of the same length, find the minimum length required to provide the above set with distinct codewords.

   (b) Calculate the probability of getting an x that will be ignored.

Exercise 15.2.[2] Let X be an ensemble with P_X = {0.1, 0.2, 0.3, 0.4}. The ensemble is encoded using the symbol code C = {0001, 001, 01, 1}. Consider the codeword corresponding to x in X^N, where N is large.

   (a) Compute the entropy of the fourth bit of transmission.

   (b) Compute the conditional entropy of the fourth bit given the third bit.

   (c) Estimate the entropy of the hundredth bit.

   (d) Estimate the conditional entropy of the hundredth bit given the ninety-ninth bit.

Exercise 15.3.[2] Two fair dice are rolled by Alice and the sum is recorded. Bob's task is to ask a sequence of questions with yes/no answers to find out this number. Devise in detail a strategy that achieves the minimum possible average number of questions.

Exercise 15.4.[2] How can you use a coin to draw straws among 3 people?

Exercise 15.5.[2] In a magic trick, there are three participants: the magician, an assistant, and a volunteer. The assistant, who claims to have paranormal abilities, is in a soundproof room. The magician gives the volunteer six blank cards, five white and one blue. The volunteer writes a different integer from 1 to 100 on each card, as the magician is watching. The volunteer keeps the blue card. The magician arranges the five white cards in some order and passes them to the assistant. The assistant then announces the number on the blue card.

How does the trick work?

Exercise 15.6.[3] How does this trick work?

   'Here's an ordinary pack of cards, shuffled into random order. Please choose five cards from the pack, any that you wish. Don't let me see their faces. No, don't give them to me: pass them to my assistant Esmerelda. She can look at them.

   'Now, Esmerelda, show me four of the cards. Hmm... nine of spades, six of hearts, four of clubs, ten of diamonds. The hidden card, then, must be the queen of spades!'

The trick can be performed as described above for a pack of 52 cards. Use information theory to give an upper bound on the number of cards for which the trick can be performed.

Exercise 15.7.[2] Find a probability sequence p = (p_1, p_2, ...) such that H(p) is infinite.

Exercise 15.8.[2] Consider a discrete memoryless source with A_X = {a, b, c, d} and P_X = {1/2, 1/4, 1/8, 1/8}. There are 4^8 = 65 536 eight-letter words that can be formed from the four letters. Find the total number of such words that are in the typical set T_{N beta} (equation 4.29) where N = 8 and beta = 0.1.

Exercise 15.9.[2] Consider the source A_S = {a, b, c, d, e}, P_S = {1/3, 1/3, 1/9, 1/9, 1/9} and the channel whose transition probability matrix is

   Q = [ 1   0   0   0 ]
       [ 0  2/3  0   0 ]
       [ 0  1/3  1   0 ]
       [ 0   0   0   1 ]   (15.1)

Note that the source alphabet has five symbols, but the channel alphabet A_X = A_Y = {0, 1, 2, 3} has only four. Assume that the source produces symbols at exactly 3/4 the rate that the channel accepts channel symbols.
For a given (tiny) epsilon > 0, explain how you would design a system for communicating the source's output over the channel with an average error probability per source symbol less than epsilon. Be as explicit as possible. In particular, do not invoke Shannon's noisy-channel coding theorem.

Exercise 15.10.[2] Consider a binary symmetric channel and a code C = {0000, 0011, 1100, 1111}; assume that the four codewords are used with probabilities {1/2, 1/8, 1/8, 1/4}.

What is the decoding rule that minimizes the probability of decoding error? [The optimal decoding rule depends on the noise level f of the binary symmetric channel. Give the decoding rule for each range of values of f, for f between 0 and 1/2.]

Exercise 15.11.[2] Find the capacity and optimal input distribution for the three-input, three-output channel whose transition probabilities are:

   Q = [ 1   0    0  ]
       [ 0  2/3  1/3 ]
       [ 0  1/3  2/3 ]   (15.2)

Exercise 15.12.[3, p.239] The input to a channel Q is a word of 8 bits. The output is also a word of 8 bits. Each time it is used, the channel flips exactly one of the transmitted bits, but the receiver does not know which one. The other seven bits are received without error. All 8 bits are equally likely to be the one that is flipped. Derive the capacity of this channel.

Show, by describing an explicit encoder and decoder, that it is possible reliably (that is, with zero error probability) to communicate 5 bits per cycle over this channel.

Exercise 15.13.[2] A channel with input x in {a, b, c} and output y in {r, s, t, u} has conditional probability matrix:

   Q = [ 1/2  0    0  ]      a -> r, s
       [ 1/2  1/2  0  ]      b -> s, t
       [  0   1/2  1/2]      c -> t, u
       [  0   0    1/2]

What is its capacity?

Exercise 15.14.[3] The ten-digit number on the cover of a book known as the ISBN incorporates an error-detecting code. The number consists of nine source digits x_1, x_2, ..., x_9, satisfying x_n in {0, 1, ..., 9}, and a tenth check digit whose value is given by

   x_10 = ( sum_{n=1}^{9} n x_n ) mod 11.

Here x_10 in {0, 1, ..., 9, 10}. If x_10 = 10 then the tenth digit is shown using the roman numeral X.

[Table 15.1. Some valid ISBNs: 0-521-64298-1 and 1-010-00000-4. The hyphens are included for legibility.]

Show that a valid ISBN satisfies:

   ( sum_{n=1}^{10} n x_n ) mod 11 = 0.

Imagine that an ISBN is communicated over an unreliable human channel which sometimes modifies digits and sometimes reorders digits.

Show that this code can be used to detect (but not correct) all errors in which any one of the ten digits is modified (for example, 1-010-00000-4 -> 1-010-00080-4).

Show that this code can be used to detect all errors in which any two adjacent digits are transposed (for example, 1-010-00000-4 -> 1-100-00000-4).

What other transpositions of pairs of non-adjacent digits can be detected?

If the tenth digit were defined to be

   x_10 = ( sum_{n=1}^{9} n x_n ) mod 10,

why would this code not work so well? (Discuss the detection of both modifications of single digits and transpositions of digits.)
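The check-digit arithmetic above is easy to play with numerically. The following sketch (helper names are mine) checks the two detection claims on one example ISBN, the book's own; it is a numerical spot-check of particular corruptions, not a proof of the exercise's general claims:

```python
def isbn_check_digit(digits9):
    """Tenth digit: x10 = (sum of n * x_n for n = 1..9) mod 11."""
    return sum(n * x for n, x in enumerate(digits9, start=1)) % 11

def is_valid_isbn(digits10):
    """A valid ISBN satisfies (sum of n * x_n for n = 1..10) mod 11 == 0."""
    return sum(n * x for n, x in enumerate(digits10, start=1)) % 11 == 0

isbn = [0, 5, 2, 1, 6, 4, 2, 9, 8, 1]        # 0-521-64298-1 from table 15.1
assert isbn_check_digit(isbn[:9]) == isbn[9]
assert is_valid_isbn(isbn)

# A single modified digit is detected ...
corrupted = isbn.copy()
corrupted[4] = (corrupted[4] + 3) % 10
assert not is_valid_isbn(corrupted)

# ... and so is a transposition of two adjacent digits:
swapped = isbn.copy()
swapped[1], swapped[2] = swapped[2], swapped[1]
assert not is_valid_isbn(swapped)
print("check digit:", isbn_check_digit(isbn[:9]))   # -> check digit: 1
```

The detection properties hinge on 11 being prime: a single change of digit n by delta alters the weighted sum by n*delta, never a multiple of 11, and an adjacent swap alters it by the difference of the two digits.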

Exercise 15.15.[3] A channel with input x and output y has transition probability matrix:

   Q = [ 1-f   f    0    0  ]      a -> a, b
       [  f   1-f   0    0  ]      b -> a, b
       [  0    0   1-g   g  ]      c -> c, d
       [  0    0    g   1-g ]      d -> c, d

Assuming an input distribution of the form

   P_X = ( p/2, p/2, (1-p)/2, (1-p)/2 ),

write down the entropy of the output, H(Y), and the conditional entropy of the output given the input, H(Y|X).

Show that the optimal input distribution is given by

   p = 1 / ( 1 + 2^( H_2(g) - H_2(f) ) ),

where H_2(f) = f log_2(1/f) + (1-f) log_2(1/(1-f)). [Remember d H_2(p)/dp = log_2((1-p)/p).]

Write down the optimal input distribution and the capacity of the channel in the case f = 1/2, g = 0, and comment on your answer.

Exercise 15.16.[2] What are the differences in the redundancies needed in an error-detecting code (which can reliably detect that a block of data has been corrupted) and an error-correcting code (which can detect and correct errors)?

Further tales from information theory

The following exercises give you the chance to discover for yourself the answers to some more surprising results of information theory.

Exercise 15.17.[3] Communication of information from correlated sources. Imagine that we want to communicate data from two data sources X^(A) and X^(B) to a central location C via noise-free one-way communication channels (figure 15.2a). The signals x^(A) and x^(B) are strongly dependent, so their joint information content is only a little greater than the marginal information content of either of them. For example, C is a weather collator who wishes to receive a string of reports saying whether it is raining in Allerton (x^(A)) and whether it is raining in Bognor (x^(B)). The joint probability of x^(A) and x^(B) might be

   P(x^(A), x^(B)):           x^(A)
                            0      1
      x^(B)     0         0.49   0.01
                1         0.01   0.49     (15.3)

The weather collator would like to know N successive values of x^(A) and x^(B) exactly, but, since he has to pay for every bit of information he receives, he is interested in the possibility of avoiding buying N bits from source A and N bits from source B. Assuming that variables x^(A) and x^(B) are generated repeatedly from this distribution, can they be encoded at rates R_A and R_B in such a way that C can reconstruct all the variables, with the sum of information transmission rates on the two lines being less than two bits per cycle?
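The entropies that govern the answer to this question can be evaluated directly from the joint distribution (15.3); this small computation is an aid I've added, not part of the exercise:

```python
from math import log2

# Joint distribution of (x_A, x_B) from equation (15.3).
P = {(0, 0): 0.49, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.49}

H_joint = -sum(p * log2(p) for p in P.values())     # H(X_A, X_B)
H_marginal = 1.0                                    # each source is a fair bit
H_conditional = H_joint - H_marginal                # H(X_A | X_B) = H(X_B | X_A)

print(round(H_joint, 2), round(H_conditional, 2))   # -> 1.14 0.14
```

So knowing one source, the other carries only about 0.14 bits per cycle of extra information, even though each source alone carries a full bit.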

[Figure 15.2. Communication of information from dependent sources. (a) x^(A) and x^(B) are dependent sources (the dependence is represented by the dotted arrow). Strings of values of each variable are encoded using codes of rate R_A and R_B into transmissions t^(A) and t^(B), which are communicated over noise-free channels to a receiver C. (b) The achievable rate region. Both strings can be conveyed without error even though R_A < H(X^(A)) and R_B < H(X^(B)).]

The answer, which you should demonstrate, is indicated in figure 15.2. In the general case of two dependent sources X^(A) and X^(B), there exist codes for the two transmitters that can achieve reliable communication of both X^(A) and X^(B) to C, as long as: the information rate from X^(A), R_A, exceeds H(X^(A) | X^(B)); the information rate from X^(B), R_B, exceeds H(X^(B) | X^(A)); and the total information rate R_A + R_B exceeds the joint entropy H(X^(A), X^(B)) (Slepian and Wolf, 1973).

So in the case of x^(A) and x^(B) above, each transmitter must transmit at a rate greater than H_2(0.02) = 0.14 bits, and the total rate R_A + R_B must be greater than 1.14 bits, for example R_A = 0.6, R_B = 0.6. There exist codes that can achieve these rates. Your task is to figure out why this is so.

Try to find an explicit solution in which one of the sources is sent as plain text, t^(B) = x^(B), and the other is encoded.

Exercise 15.18.[3] Multiple access channels. Consider a channel with two sets of inputs and one output (for example, a shared telephone line; figure 15.3a). A simple model system has two binary inputs x^(A) and x^(B) and a ternary output y equal to the arithmetic sum of the two inputs, that's 0, 1 or 2. There is no noise. Users A and B cannot communicate with each other, and they cannot hear the output of the channel. If the output is a 0, the receiver can be certain that both inputs were set to 0; and if the output is a 2, the receiver can be certain that both inputs were set to 1. But if the output is 1, then it could be that the input state was (0, 1) or (1, 0). How should users A and B use this channel so that their messages can be deduced from the received signals? How fast can A and B communicate?

Clearly the total information rate from A and B to the receiver cannot be two bits. On the other hand it is easy to achieve a total information rate R_A + R_B of one bit. Can reliable communication be achieved at rates (R_A, R_B) such that R_A + R_B > 1?

The answer is indicated in figure 15.3.

Some practical codes for multi-user channels are presented in Ratzer and MacKay (2003).

Exercise 15.19.[3] Broadcast channels. A broadcast channel consists of a single transmitter and two or more receivers. The properties of the channel are defined by a conditional distribution Q(y^(A), y^(B) | x). (We'll assume the channel is memoryless.) The task is to add an encoder and two decoders to enable reliable communication of a common message at rate R_0 to both receivers, an individual message at rate R_A to receiver A, and an individual message at rate R_B to receiver B. The capacity region of the broadcast channel is the convex hull of the set of achievable rate triplets (R_0, R_A, R_B).

[Figure 15.4. The broadcast channel. x is the channel input; y^(A) and y^(B) are the outputs.]

A simple benchmark for such a channel is given by time-sharing (time-division signaling). If the capacities of the two channels, considered separately,

are C^(A) and C^(B), then by devoting a fraction phi_A of the transmission time to channel A and phi_B = 1 - phi_A to channel B, we can achieve (R_0, R_A, R_B) = (0, phi_A C^(A), phi_B C^(B)).

[Figure 15.3. Multiple access channels. (a) A general multiple access channel with two transmitters and one receiver: P(y | x^(A), x^(B)). (b) A binary multiple access channel with output equal to the sum of the two inputs. (c) The achievable region.]

[Figure 15.5. Rates achievable by simple timesharing.]

We can do better than this, however. As an analogy, imagine speaking simultaneously to an American and a Belarusian; you are fluent in American and in Belarusian, but neither of your two receivers understands the other's language. If each receiver can distinguish whether a word is in their own language or not, then an extra binary file can be conveyed to both recipients by using its bits to decide whether the next transmitted word should be from the American source text or from the Belarusian source text. Each recipient can concatenate the words that they understand in order to receive their personal message, and can also recover the binary string.

An example of a broadcast channel consists of two binary symmetric channels with a common input. The two halves of the channel have flip probabilities f_A and f_B. We'll assume that A has the better half-channel, i.e., f_A < f_B < 1/2. [A closely related channel is a 'degraded' broadcast channel, in which the conditional probabilities are such that the random variables have the structure of a Markov chain,

   x -> y^(A) -> y^(B),   (15.4)

i.e., y^(B) is a further degraded version of y^(A).] In this special case, it turns out that whatever information is getting through to receiver B can also be recovered by receiver A. So there is no point distinguishing between R_0 and R_B: the task is to find the capacity region for the rate pair (R_0, R_A), where R_0 is the rate of information reaching both A and B, and R_A is the rate of the extra information reaching A.

[Figure 15.6. Rate of reliable communication R, as a function of noise level f, for Shannonesque codes designed to operate at noise levels f_A (solid line) and f_B (dashed line).]

The following exercise is equivalent to this one, and a solution to it is illustrated in figure 15.8.

Exercise 15.20.[3] Variable-rate error-correcting codes for channels with unknown noise level. In real life, channels may sometimes not be well characterized before the encoder is installed. As a model of this situation, imagine that a channel is known to be a binary symmetric channel with noise level either f_A or f_B. Let f_B > f_A, and let the two capacities be C_A and C_B.

Those who like to live dangerously might install a system designed for noise level f_A with rate R_A ~ C_A; in the event that the noise level turns out to be f_B, our experience of Shannon's theories would lead us to expect that there

would be a catastrophic failure to communicate information reliably (solid line in figure 15.6).

A conservative approach would design the encoding system for the worst-case scenario, installing a code with rate R_B ~ C_B (dashed line in figure 15.6). In the event that the lower noise level, f_A, holds true, the managers would have a feeling of regret because of the wasted capacity difference C_A - R_B.

Is it possible to create a system that not only transmits reliably at some rate R_0 whatever the noise level, but also communicates some extra, 'lower-priority' bits if the noise level is low, as shown in figure 15.7? This code communicates the high-priority bits reliably at all noise levels between f_A and f_B, and communicates the low-priority bits also if the noise level is f_A or below.

[Figure 15.7. Rate of reliable communication R, as a function of noise level f, for a desired variable-rate code.]

This problem is mathematically equivalent to the previous problem, the degraded broadcast channel. The lower rate of communication was there called R_0, and the rate at which the low-priority bits are communicated if the noise level is low was called R_A.

An illustrative answer is shown in figure 15.8, for the case f_A = 0.01 and f_B = 0.1. (This figure also shows the achievable region for a broadcast channel whose two half-channels have noise levels f_A = 0.01 and f_B = 0.1.) I admit I find the gap between the simple time-sharing solution and the cunning solution disappointingly small.

[Figure 15.8. An achievable region for the channel with unknown noise level. Assuming the two possible noise levels are f_A = 0.01 and f_B = 0.1, the dashed lines show the rates R_A, R_B that are achievable using a simple time-sharing approach, and the solid line shows rates achievable using a more cunning approach.]

In Chapter 50 we will discuss codes for a special class of broadcast channels, namely erasure channels, where every symbol is either received without error or erased. These codes have the nice property that they are rateless: the number of symbols transmitted is determined on the fly such that reliable communication is achieved, whatever the erasure statistics of the channel.

Exercise 15.21.[3] Multiterminal information networks are both important practically and intriguing theoretically. Consider the following example of a two-way binary channel (figure 15.9a,b): two people both wish to talk over the channel, and they both want to hear what the other person is saying; but you can only hear the signal transmitted by the other person if you are transmitting a zero. What simultaneous information rates from A to B and from B to A can be achieved, and how? Everyday examples of such networks include the VHF channels used by ships, and computer ethernet networks (in which all the devices are unable to hear anything if two or more devices are broadcasting simultaneously).

Obviously, we can achieve rates of 1/2 in both directions by simple time-sharing. But can the two information rates be made larger? Finding the capacity of a general two-way channel is still an open problem. However, we can obtain interesting results concerning achievable points for the simple binary channel discussed above, as indicated in figure 15.9c. There exist codes that can achieve rates up to the boundary shown. There may exist better codes too.

Solutions

Solution to exercise 15.12 (p.235). C(Q) = 5 bits. Hint for the last part: a solution exists that involves a simple (8, 5) code.

[Figure 15.9. (a) A general two-way channel: the outputs y^(A) and y^(B) depend on both inputs, P(y^(A), y^(B) | x^(A), x^(B)). (b) The rules for a binary two-way channel. The two tables show the outputs y^(A) and y^(B) that result for each state of the inputs. (c) Achievable region for the two-way binary channel. Rates below the solid line are achievable. The dotted line shows the 'obviously achievable' region which can be attained by simple time-sharing.]

16  Message Passing

One of the themes of this book is the idea of doing complicated calculations using simple distributed hardware. It turns out that quite a few interesting problems can be solved by message-passing algorithms, in which simple messages are passed locally among simple processors whose operations lead, after some time, to the solution of a global problem.

16.1  Counting

As an example, consider a line of soldiers walking in the mist. The commander wishes to perform the complex calculation of counting the number of soldiers in the line. This problem could be solved in two ways.

First there is a solution that uses expensive hardware: the loud booming voices of the commander and his men. The commander could shout 'all soldiers report back to me within one minute!', then he could listen carefully as the men respond 'Molesworth here sir!', 'Fotherington-Thomas here sir!', and so on. This solution relies on several expensive pieces of hardware: there must be a reliable communication channel to and from every soldier; the commander must be able to listen to all the incoming messages (even when there are hundreds of soldiers) and must be able to count; and all the soldiers must be well-fed if they are to be able to shout back across the possibly-large distance separating them from the commander.

The second way of finding this global function, the number of soldiers, does not require global communication hardware, high IQ, or good food; we simply require that each soldier can communicate single integers with the two adjacent soldiers in the line, and that the soldiers are capable of adding one to a number. Each soldier follows these rules:
Algorithm 16.1. Message-passing rule-set A.

1. If you are the front soldier in the line, say the number 'one' to the soldier behind you.
2. If you are the rearmost soldier in the line, say the number 'one' to the soldier in front of you.
3. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

If the clever commander can not only add one to a number, but also add two numbers together, then he can find the global number of soldiers by simply adding together:

the number said to him by the soldier in front of him (which equals the total number of soldiers in front), plus the number said to the commander by the soldier behind him (which is the number behind), plus one to count the commander himself.

This solution requires only local communication hardware and simple computations (storage and addition of integers).

Figure 16.2. A line of soldiers counting themselves using message-passing rule-set A. The commander can add '3' from the soldier in front, '1' from the soldier behind, and '1' for himself, and deduce that there are 5 soldiers in total.

Separation

This clever trick makes use of a profound property of the total number of soldiers: that it can be written as the sum of the number of soldiers in front of a point and the number behind that point, two quantities which can be computed separately, because the two groups are separated by the commander.

If the soldiers were not arranged in a line but were travelling in a swarm, then it would not be easy to separate them into two groups in this way. The guerillas in figure 16.3 could not be counted using the above message-passing rule-set A, because, while the guerillas do have neighbours (shown by lines), it is not clear who is 'in front' and who is 'behind'; furthermore, since the graph of connections between the guerillas contains cycles, it is not possible for a guerilla in a cycle (such as 'Jim') to separate the group into two groups, 'those in front' and 'those behind'.

Figure 16.3. A swarm of guerillas.

A swarm of guerillas can be counted by a modified message-passing algorithm if they are arranged in a graph that contains no cycles. Rule-set B is a message-passing algorithm for counting a swarm of guerillas whose connections form a cycle-free graph, also known as a tree, as illustrated in figure 16.4. Any guerilla can deduce the total in the tree from the messages that they receive.
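Before turning to trees, the line-counting scheme of rule-set A can be sketched in a few lines of code. This is a minimal sketch, not from the book; the function name and the forward/backward arrays are my own framing of the two streams of messages travelling along the line.

```python
# A minimal sketch (not from the book) of rule-set A: each soldier only
# passes integers to adjacent soldiers, yet everyone can learn the total.
def count_line(n_soldiers):
    # forward[i]: number said to soldier i by the soldier in front of him
    # backward[i]: number said to soldier i by the soldier behind him
    forward = [0] * n_soldiers
    backward = [0] * n_soldiers
    for i in range(1, n_soldiers):           # messages travelling rearward
        forward[i] = forward[i - 1] + 1      # "add one and pass it on"
    for i in range(n_soldiers - 2, -1, -1):  # messages travelling frontward
        backward[i] = backward[i + 1] + 1
    # any soldier i can deduce the total: in front + behind + himself
    return [forward[i] + backward[i] + 1 for i in range(n_soldiers)]

print(count_line(5))  # every soldier deduces 5: [5, 5, 5, 5, 5]
```

Note that every soldier, not just the commander, ends up knowing the total; the commander is simply one point at which the two counts can be added.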

Figure 16.4. A swarm of guerillas whose connections form a tree.

Algorithm 16.5. Message-passing rule-set B.

1. Count your number of neighbours, N.
2. Keep count of the number of messages you have received from your neighbours, m, and of the values v_1, v_2, ..., v_N of each of those messages. Let V be the running total of the messages you have received.
3. If the number of messages you have received, m, is equal to N - 1, then identify the neighbour who has not sent you a message and tell them the number V + 1.
4. If the number of messages you have received is equal to N, then:
   (a) the number V + 1 is the required total.
   (b) for each neighbour n { say to neighbour n the number V + 1 - v_n. }

Figure 16.6. A triangular 41 x 41 grid. How many paths are there from A to B? One path is shown.
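Rule-set B can be sketched as follows. This is a minimal illustration, not the book's own code; the seven-node example tree is hypothetical, and the scheduling loop simply sweeps until no node can send a new message.

```python
# A minimal sketch (not from the book) of rule-set B on a tree: a node sends
# a message to a neighbour once it has heard from all its other neighbours.
from collections import defaultdict

def count_tree(edges):
    nbrs = defaultdict(set)
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    msg = {}  # msg[(a, b)]: count of nodes on a's side of the edge, told to b
    pending = True
    while pending:
        pending = False
        for a in nbrs:
            for b in nbrs[a]:
                if (a, b) in msg:
                    continue
                others = nbrs[a] - {b}
                if all((o, a) in msg for o in others):  # heard from the rest?
                    msg[(a, b)] = 1 + sum(msg[(o, a)] for o in others)
                    pending = True
    # any node can deduce the total from its incoming messages
    some_node = next(iter(nbrs))
    return 1 + sum(msg[(o, some_node)] for o in nbrs[some_node])

# a hypothetical 7-node tree
edges = [('A','B'), ('B','C'), ('B','D'), ('A','E'), ('E','F'), ('E','G')]
print(count_tree(edges))  # 7
```

The leaves send first (they have no 'other' neighbours to wait for), and messages then propagate inward and back out, exactly as in rules 3 and 4.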

16.2  Path-counting

A more profound task than counting squaddies is the task of counting the number of paths through a grid, and finding how many paths pass through any given point in the grid. Figure 16.6 shows a rectangular grid, and a path through the grid, connecting points A and B. A valid path is one that starts from A and proceeds to B by rightward and downward moves. Our questions are:

1. How many such paths are there from A to B?
2. If a random path from A to B is selected, what is the probability that it passes through a particular node in the grid? [When we say 'random', we mean that all paths have exactly the same probability of being selected.]
3. How can a random path from A to B be selected?

Counting all the paths from A to B doesn't seem straightforward. The number of paths is expected to be pretty big: even if the permitted grid were a diagonal strip only three nodes wide, there would still be about 2^(N/2) possible paths.

The computational breakthrough is to realize that to find the number of paths, we do not have to enumerate all the paths explicitly. Pick a point P in the grid and consider the number of paths from A to P. Every path from A to P must come in to P through one of its upstream neighbours ('upstream' meaning above or to the left). So the number of paths from A to P can be found by adding up the number of paths to each of those upstream neighbours.

Figure 16.7. Every path from A to P enters P through an upstream neighbour of P, either M or N; so we can find the number of paths from A to P by adding the number of paths from A to M to the number of paths from A to N.

This message-passing algorithm is illustrated in figure 16.8 for a simple grid with ten vertices connected by twelve directed edges. We start by sending the '1' message from A. When any node has received messages from all its upstream neighbours, it sends the sum of them on to its downstream neighbours. At B, the number 5 emerges: we have counted the number of paths from A to B without enumerating them all. As a sanity-check, figure 16.9 shows the five distinct paths from A to B.

Figure 16.8. Messages sent in the forward pass.

Figure 16.9. The five paths.

Having counted all paths, we can now move on to more challenging problems: computing the probability that a random path goes through a given vertex, and creating a random path.

Probability of passing through a node

By making a backward pass as well as the forward pass, we can deduce how many of the paths go through each node; and if we divide that by the total number of paths, we obtain the probability that a randomly selected path passes through that node. Figure 16.10 shows the backward-passing messages in the lower-right corners of the tables, and the original forward-passing messages in the upper-left corners. By multiplying these two numbers at a given vertex, we find the total number of paths passing through that vertex. For example, four paths pass through the central vertex.

Figure 16.10. Messages sent in the forward and backward passes.

Figure 16.11 shows the result of this computation for the triangular 41 x 41 grid. The area of each blob is proportional to the probability of passing through the corresponding node.

Random path sampling

Exercise 16.1.[1, p.247] If one creates a 'random' path from A to B by flipping a fair coin at every junction where there is a choice of two directions, is

the resulting path a uniform sample from the set of all paths? [Hint: imagine trying it for the grid of figure 16.8.] There is a neat insight to be had here, and I'd like you to have the satisfaction of figuring it out.

Exercise 16.2.[2, p.247] Having run the forward and backward algorithms between points A and B on a grid, how can one draw one path from A to B uniformly at random? (Figure 16.11.)

Figure 16.11. (a) The probability of passing through each node, and (b) a randomly chosen path.

The message-passing algorithm we used to count the paths to B is an example of the sum-product algorithm. The 'sum' takes place at each node when it adds together the messages coming from its predecessors; the 'product' was not mentioned, but you can think of the sum as a weighted sum in which all the summed terms happened to have weight 1.

16.3  Finding the lowest-cost path

Imagine you wish to travel as quickly as possible from Ambridge (A) to Bognor (B). The various possible routes are shown in figure 16.12, along with the cost in hours of traversing each edge in the graph. For example, the route A-I-L-N-B has a cost of 8 hours. We would like to find the lowest-cost path without explicitly evaluating the cost of all paths. We can do this efficiently by finding for each node what the cost of the lowest-cost path from A to that node is. These quantities can be computed by message-passing, starting from node A. The message-passing algorithm is called the min-sum algorithm or Viterbi
algorithm.

For brevity, we'll call the cost of the lowest-cost path from node A to node x 'the cost of x'. Each node can broadcast its cost to its descendants once it knows the costs of all its possible predecessors.

Figure 16.12. Route diagram from Ambridge to Bognor, showing the costs associated with the edges.

Let's step through the algorithm by hand. The cost of A is zero. We pass this news on to H and I. As the message passes along each edge in the graph, the cost of that edge is added. We find that the costs of H and I are 4 and 1 respectively (figure 16.13a). Similarly, the costs of J and L are found to be 6 and 2 respectively, but what about K? Out of the edge H-K comes the message that a path of cost 5 exists from A to K via H; and from edge I-K we learn of an alternative path of cost 3 (figure 16.13b). The min-sum algorithm sets the cost of K equal to the minimum of these (the 'min'), and records which was the smallest-cost route into K by retaining only the edge I-K and pruning away the other edges leading to K (figure 16.13c). Figures 16.13d and e show the remaining two iterations of the algorithm, which reveal that there is a path from A to B with cost 6. [If the min-sum algorithm encounters a tie, where the minimum-cost path to a node is achieved by more than one route, it can pick any of those routes at random.]
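The forward sweep just described can be sketched in code. A caveat: figure 16.12 is not legible in this text, so the edge costs below are assumptions, chosen only to reproduce the worked costs in the prose (H = 4, I = 1, K = 3, B = 6, and route A-I-L-N-B costing 8); they are not a transcription of the book's figure.

```python
# A sketch of the min-sum (Viterbi) pass on a small DAG. The edge costs are
# assumptions consistent with the worked example in the text, not an exact
# copy of figure 16.12.
edges = {                       # node -> {successor: edge cost in hours}
    'A': {'H': 4, 'I': 1},
    'H': {'J': 2, 'K': 1},
    'I': {'K': 2, 'L': 1},
    'J': {'M': 1},
    'K': {'M': 1},
    'L': {'M': 3, 'N': 3},
    'M': {'B': 2},
    'N': {'B': 3},
    'B': {},
}

def min_sum(edges, source, target):
    order = ['A', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'B']  # topological order
    cost = {n: float('inf') for n in order}
    best_pred = {}              # surviving edge into each node
    cost[source] = 0
    for n in order:             # broadcast each node's cost to its successors
        for succ, c in edges[n].items():
            if cost[n] + c < cost[succ]:    # keep only the cheapest route in
                cost[succ] = cost[n] + c
                best_pred[succ] = n
    path = [target]             # backtrack along the surviving edges
    while path[-1] != source:
        path.append(best_pred[path[-1]])
    return cost[target], path[::-1]

print(min_sum(edges, 'A', 'B'))  # (6, ['A', 'I', 'K', 'M', 'B'])
```

The 'min' is the comparison at each node; the 'sum' is the addition of the edge cost as each message passes along an edge.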

We can recover this lowest-cost path by backtracking from B, following the trail of surviving edges back to A. We deduce that the lowest-cost path is A-I-K-M-B.

Figure 16.13. Min-sum message-passing algorithm to find the cost of getting to each node, and thence the lowest-cost route from A to B.

Other applications of the min-sum algorithm

Imagine that you manage the production of a product from raw materials via a large set of operations. You wish to identify the critical path in your process, that is, the subset of operations that are holding up production. If any operations on the critical path were carried out a little faster then the time to get from raw materials to product would be reduced.

The critical path of a set of operations can be found using the min-sum algorithm.

In Chapter 25 the min-sum algorithm will be used in the decoding of error-correcting codes.

16.4  Summary and related ideas

Some global functions have a separability property. For example, the number of paths from A to P separates into the sum of the number of paths from A to M (the point to P's left) and the number of paths from A to N (the point above P). Such functions can be computed efficiently by message-passing. Other functions do not have such separability properties, for example

1. the number of pairs of soldiers in a troop who share the same birthday;

2. the size of the largest group of soldiers who share a common height (rounded to the nearest centimetre);

3. the length of the shortest tour that a travelling salesman could take that visits every soldier in a troop.

One of the challenges of machine learning is to find low-cost solutions to problems like these. The problem of finding a large subset of variables that are approximately equal can be solved with a neural network approach (Hopfield and Brody, 2000; Hopfield and Brody, 2001). A neural approach to the travelling salesman problem will be discussed in section 42.9.

16.5  Further exercises

Exercise 16.3.[2] Describe the asymptotic properties of the probabilities depicted in figure 16.11a, for a grid in a triangle of width and height N.

Exercise 16.4.[2] In image processing, the integral image I(x, y) obtained from an image f(x, y) (where x and y are pixel coordinates) is defined by

    I(x, y) = sum_{u=0}^{x} sum_{v=0}^{y} f(u, v).    (16.1)

Show that the integral image I(x, y) can be efficiently computed by message-passing.

Show that, from the integral image, some simple functions of the image can be obtained. For example, give an expression for the sum of the image intensities f(x, y) for all (x, y) in a rectangular region extending from (x1, y1) to (x2, y2).
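The message-passing computation behind exercise 16.4 can be sketched as follows; this is one standard solution (a single raster sweep plus inclusion-exclusion), offered here as an illustration rather than the book's own answer, and the tiny 3 x 3 image is hypothetical.

```python
# A sketch of exercise 16.4: the integral image built by one left-to-right,
# top-to-bottom message-passing sweep, and a rectangle sum recovered from
# four corner lookups.
def integral_image(f):
    h, w = len(f), len(f[0])
    I = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            I[y][x] = (f[y][x]
                       + (I[y][x - 1] if x > 0 else 0)   # message from left
                       + (I[y - 1][x] if y > 0 else 0)   # message from above
                       - (I[y - 1][x - 1] if x > 0 and y > 0 else 0))
    return I

def rect_sum(I, x1, y1, x2, y2):
    # inclusion-exclusion on the four corners: O(1) work per query
    total = I[y2][x2]
    if x1 > 0: total -= I[y2][x1 - 1]
    if y1 > 0: total -= I[y1 - 1][x2]
    if x1 > 0 and y1 > 0: total += I[y1 - 1][x1 - 1]
    return total

f = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
I = integral_image(f)
print(rect_sum(I, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Each pixel's message is computed from its left and upper neighbours, so the whole integral image costs one pass over the image.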

16.6  Solutions

Solution to exercise 16.1 (p.244). Since there are five paths through the grid of figure 16.8, they must all have probability 1/5. But a strategy based on fair coin-flips will produce paths whose probabilities are powers of 1/2.

Solution to exercise 16.2 (p.245). To make a uniform random walk, each forward step of the walk should be chosen using a different biased coin at each junction, with biases chosen in proportion to the backward messages emanating from the two options. For example, at the first choice after leaving A, there is a '3' message coming from the East, and a '2' coming from South, so one should go East with probability 3/5 and South with probability 2/5. This is how the path in figure 16.11b was generated.
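The forward pass, backward pass, and the biased-coin sampler of solution 16.2 can be sketched together. The exact layout of figure 16.8's grid is not recoverable from this text, so the DAG below is an assumption: a 3 x 3 grid of nodes with the top-right corner removed, which also happens to have exactly five A-to-B paths.

```python
# A sketch of forward/backward path counting and uniform path sampling.
# The grid is a stand-in for figure 16.8 (an assumption): 3x3 nodes minus
# the top-right corner; moves go right or down.
import random

nodes = [(x, y) for y in range(3) for x in range(3) if (x, y) != (2, 0)]

def successors(n):
    x, y = n
    return [m for m in [(x + 1, y), (x, y + 1)] if m in nodes]

A, B = (0, 0), (2, 2)

fwd = {n: 0 for n in nodes}      # forward messages: number of paths A -> n
fwd[A] = 1
for n in nodes:                  # the nodes list is already in a valid order
    for m in successors(n):
        fwd[m] += fwd[n]

bwd = {n: 0 for n in nodes}      # backward messages: number of paths n -> B
bwd[B] = 1
for n in reversed(nodes):
    for m in successors(n):
        bwd[n] += bwd[m]

print(fwd[B])                    # 5 paths in total

# probability that a uniform random path visits node n: fwd * bwd / total
p_central = fwd[(1, 1)] * bwd[(1, 1)] / fwd[B]
print(p_central)                 # 0.8: four of the five paths pass through

def sample_path():
    # bias each junction choice by the backward messages (solution 16.2)
    path, n = [A], A
    while n != B:
        succ = successors(n)
        n = random.choices(succ, weights=[bwd[m] for m in succ])[0]
        path.append(n)
    return path

print(sample_path())
```

At A the two backward messages are 2 (toward the removed corner's side) and 3, so the first step is taken with probabilities 2/5 and 3/5, mirroring the worked example in the solution.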

17  Communication over Constrained Noiseless Channels

In this chapter we study the task of communicating efficiently over a constrained noiseless channel, a channel over which not all strings from the input alphabet may be transmitted.

We make use of the idea introduced in Chapter 16, that global properties of graphs can be computed by a local message-passing algorithm.

17.1  Three examples of constrained binary channels

A constrained channel can be defined by rules that define which strings are permitted.

Example 17.1. Channel A: every 1 must be followed by at least one 0.

    Channel A: the substring 11 is forbidden.

A valid string for this channel is

    00100101001010100010.    (17.1)

As a motivation for this model, consider a channel in which 1s are represented by pulses of electromagnetic energy, and the device that produces those pulses requires a recovery time of one clock cycle after generating a pulse before it can generate another.

Example 17.2. Channel B has the rule that all 1s must come in groups of two or more, and all 0s must come in groups of two or more.

    Channel B: 101 and 010 are forbidden.

A valid string for this channel is

    00111001110011000011.    (17.2)

As a motivation for this model, consider a disk drive in which successive bits are written onto neighbouring points in a track along the disk surface; 0 and 1 are represented by two opposite magnetic orientations. The strings 101 and 010 are forbidden because a single isolated magnetic domain surrounded by domains having the opposite orientation is unstable, so that 101 might turn into 111, for example.
Example 17.3. Channel C has the rule that the largest permitted runlength is two, that is, each symbol can be repeated at most once.

    Channel C: 111 and 000 are forbidden.

A valid string for this channel is

    10010011011001101001.    (17.3)
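All three constraints are forbidden-substring rules, so a validity check is a one-liner. This small sketch is not from the book; it simply confirms that the three example strings above obey their channels' rules.

```python
# A small sketch (not from the book): checking strings against the three
# runlength-limited channel rules by forbidden-substring matching.
def valid(channel, s):
    forbidden = {'A': ['11'], 'B': ['101', '010'], 'C': ['111', '000']}
    return not any(f in s for f in forbidden[channel])

assert valid('A', '00100101001010100010')   # string (17.1)
assert valid('B', '00111001110011000011')   # string (17.2)
assert valid('C', '10010011011001101001')   # string (17.3)
assert not valid('A', '0110')               # contains the forbidden 11
print('all channel rules check out')
```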

A physical motivation for this model is a disk drive in which the rate of rotation of the disk is not known accurately, so it is difficult to distinguish between a string of two 1s and a string of three 1s, which are represented by oriented magnetizations of duration 2τ and 3τ respectively, where τ is the (poorly known) time taken for one bit to pass by; to avoid the possibility of confusion, and the resulting loss of synchronization of sender and receiver, we forbid the string of three 1s and the string of three 0s.

All three of these channels are examples of runlength-limited channels. The rules constrain the minimum and maximum numbers of successive 1s and 0s.

                    Runlength of 1s        Runlength of 0s
    Channel         minimum  maximum       minimum  maximum
    unconstrained      1     infinity         1     infinity
    A                  1     1                1     infinity
    B                  2     infinity         2     infinity
    C                  1     2                1     2

In channel A, runs of 0s may be of any length but runs of 1s are restricted to length one. In channel B all runs must be of length two or more. In channel C, all runs must be of length one or two.

The capacity of the unconstrained binary channel is one bit per channel use. What are the capacities of the three constrained channels? [To be fair, we haven't defined the 'capacity' of such channels yet; please understand 'capacity' as meaning how many bits can be conveyed reliably per channel-use.]

Some codes for a constrained channel

Let us concentrate for a moment on channel A, in which runs of 0s may be of any length but runs of 1s are restricted to length one. We would like to communicate a random binary file over this channel as efficiently as possible.
Code C1

A simple starting point is a (2, 1) code that maps each source bit into two transmitted bits, C1. This is a rate-1/2 code, and it respects the constraints of channel A, so the capacity of channel A is at least 0.5.

    C1:  s  t
         0  00
         1  10

Can we do better? C1 is redundant because if the first of two received bits is a zero, we know that the second bit will also be a zero. We can achieve a smaller average transmitted length using a code that omits the redundant zeroes in C1.

Code C2

C2 is such a variable-length code.

    C2:  s  t
         0  0
         1  10

If the source symbols are used with equal frequency then the average transmitted length per source bit is

    L = (1/2) 1 + (1/2) 2 = 3/2,    (17.4)

so the average communication rate is

    R = 2/3,    (17.5)

and the capacity of channel A must be at least 2/3.

Can we do better than C2? There are two ways to argue that the information rate could be increased above R = 2/3.

The first argument assumes we are comfortable with the entropy as a measure of information content. The idea is that, starting from code C2, we can reduce the average message length, without greatly reducing the entropy

of the message we send, by decreasing the fraction of 1s that we transmit. Imagine feeding into C2 a stream of bits in which the frequency of 1s is f. [Such a stream could be obtained from an arbitrary binary file by passing the source file into the decoder of an arithmetic code that is optimal for compressing binary strings of density f.] The information rate R achieved is the entropy of the source, H_2(f), divided by the mean transmitted length,

    L(f) = (1 - f) 1 + f 2 = 1 + f.    (17.6)

Thus

    R(f) = H_2(f) / L(f) = H_2(f) / (1 + f).    (17.7)

The original code C2, without preprocessor, corresponds to f = 1/2. What happens if we perturb f a little towards smaller f, setting

    f = 1/2 + δ,    (17.8)

for small negative δ? In the vicinity of f = 1/2, the denominator L(f) varies linearly with δ. In contrast, the numerator H_2(f) has only a second-order dependence on δ.

Exercise 17.4.[1] Find, to order δ^2, the Taylor expansion of H_2(f) as a function of δ.

To first order, R(f) increases linearly with decreasing δ. It must be possible to increase R by decreasing f. Figure 17.1 shows these functions; R(f) does indeed increase as f decreases, and has a maximum of about 0.69 bits per channel use at f ~ 0.38.

Figure 17.1. Top: The information content per source symbol, H_2(f), and the mean transmitted length per source symbol, 1 + f, as a function of the source density f. Bottom: The information content per transmitted symbol, R(f) = H_2(f)/(1 + f), in bits, as a function of f.

By this argument we have shown that the capacity of channel A is at least max_f R(f) = 0.69.
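The maximization of equation (17.7) is easy to check numerically; the brute-force grid search below is my own quick sketch, not the book's method.

```python
# A quick numerical check (not from the book) of equation (17.7): maximize
# R(f) = H_2(f) / (1 + f) over a fine grid of source densities f.
from math import log2

def H2(f):
    return -f * log2(f) - (1 - f) * log2(1 - f)

best_f, best_R = max(
    ((f / 10000, H2(f / 10000) / (1 + f / 10000)) for f in range(1, 10000)),
    key=lambda pair: pair[1],
)
print(round(best_f, 2), round(best_R, 2))  # 0.38 0.69
```

The maximum sits at f of roughly 0.38 with R of roughly 0.69 bits per channel use, matching the figures quoted in the text.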
Exercise 17.5.[2, p.257] If a file containing a fraction f = 0.5 1s is transmitted by C2, what fraction of the transmitted stream is 1s?

What fraction of the transmitted bits is 1s if we drive code C2 with a sparse source of density f = 0.38?

A second, more fundamental approach counts how many valid sequences of length N there are, S_N. We can communicate log S_N bits in N channel cycles by giving one name to each of these valid sequences.

17.2  The capacity of a constrained noiseless channel

We defined the capacity of a noisy channel in terms of the mutual information between its input and its output, then we proved that this number, the capacity, was related to the number of distinguishable messages S(N) that could be reliably conveyed over the channel in N uses of the channel by

    C = lim_{N -> infinity} (1/N) log S(N).    (17.9)

In the case of the constrained noiseless channel, we can adopt this identity as our definition of the channel's capacity. However, the name s, which, when we were making codes for noisy channels (section 9.6), ran over messages s = 1, ..., S, is about to take on a new role: labelling the states of our channel;

so in this chapter we will denote the number of distinguishable messages of length N by M_N, and define the capacity to be:

    C = lim_{N -> infinity} (1/N) log M_N.    (17.10)

Once we have figured out the capacity of a channel we will return to the task of making a practical code for that channel.

17.3  Counting the number of possible messages

First let us introduce some representations of constrained channels. In a state diagram, states of the transmitter are represented by circles labelled with the name of the state. Directed edges from one state to another indicate that the transmitter is permitted to move from the first state to the second, and a label on the edge indicates the symbol emitted when that transition is made.

Figure 17.2. (a) State diagram for channel A. (b) Trellis section. (c) Trellis. (d) Connection matrix, A = [1 1; 1 0].

Figure 17.3. State diagrams, trellis sections and connection matrices for channels B and C.

Figure 17.2a shows the state diagram for channel A. It has two states, 0 and 1.
When transitions to state 0 are made, a 0 is transmitted; when transitions to state 1 are made, a 1 is transmitted; transitions from state 1 to state 1 are not possible.

We can also represent the state diagram by a trellis section, which shows two successive states in time at two successive horizontal locations (figure 17.2b). The state of the transmitter at time n is called s_n. The set of possible state sequences can be represented by a trellis as shown in figure 17.2c. A valid sequence corresponds to a path through the trellis, and the number of valid sequences is the number of paths.

Figure 17.4. Counting the number of paths in the trellis of channel A. The counts next to the nodes are accumulated by passing from left to right across the trellises.

Figure 17.5. Counting the number of paths in the trellises of channels A, B, and C. We assume that the first bit is preceded by 00, so that for channels A and B, any initial character is permitted, but for channel C, the first character must be a 1.

Figure 17.6. Counting the number of paths in the trellis of channel A.

    n     M_n      M_n/M_{n-1}   log2 M_n   (1/n) log2 M_n
    1     2                        1.0         1.00
    2     3        1.500           1.6         0.79
    3     5        1.667           2.3         0.77
    4     8        1.600           3.0         0.75
    5     13       1.625           3.7         0.74
    6     21       1.615           4.4         0.73
    7     34       1.619           5.1         0.73
    8     55       1.618           5.8         0.72
    9     89       1.618           6.5         0.72
    10    144      1.618           7.2         0.72
    11    233      1.618           7.9         0.71
    12    377      1.618           8.6         0.71
    100   9e20     1.618          69.7         0.70
    200   7e41     1.618         139.1         0.70
    300   6e62     1.618         208.5         0.70
    400   5e83     1.618         277.9         0.69

For the purpose of counting how many valid sequences there are, we can ignore the labels on the edges of the trellis and summarize each trellis section by the connection matrix A, in which A_{ss'} = 1 if there is an edge from state s to s', and A_{ss'} = 0 otherwise (figure 17.2d). Figure 17.3 shows the state diagrams, trellis sections and connection matrices for channels B and C.

Let's count the number of paths for channel A by message-passing in its trellis. Figure 17.4 shows the first few steps of this counting process, and figure 17.5a shows the number of paths ending in each state after n steps for n = 1, ..., 8. The total number of paths of length n, M_n, is shown along the top. We recognize M_n as the Fibonacci series.

Exercise 17.6.[1] Show that the ratio of successive terms in the Fibonacci series tends to the golden ratio,

    γ = (1 + sqrt(5)) / 2 = 1.618.    (17.11)

Thus, to within a constant factor, M_N scales as M_N ~ γ^N as N -> infinity, so the capacity of channel A is

    C = lim (1/N) log2 (constant x γ^N) = log2 γ = log2 1.618 = 0.694.    (17.12)
How can we describe what we just did? The count of the number of paths is a vector c^(n); we can obtain c^(n+1) from c^(n) using:

    c^(n+1) = A c^(n).    (17.13)

So

    c^(N) = A^N c^(0),    (17.14)

where c^(0) is the state count before any symbols are transmitted. In figure 17.5 we assumed c^(0) = [0, 1]^T, i.e., that either of the two symbols is permitted at the outset. The total number of paths is M_n = sum_s c_s^(n). In the limit, c^(N) becomes dominated by the principal right-eigenvector of A,

    c^(N) -> constant x λ_1^N e^(R).    (17.15)
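The recursion (17.13) and the growth rate in (17.15) can be checked in a few lines. This sketch is mine, not the book's; it iterates c <- A c for channel A starting from state 0 and compares the growth ratio of M_n with the golden ratio.

```python
# A sketch of equations (17.13)-(17.15) for channel A: iterate c <- A c to
# count paths, and compare the growth rate with the principal eigenvalue.
from math import log2, sqrt

A = [[1, 1],   # connection matrix of channel A: the 1 -> 1 transition is absent
     [1, 0]]

c = [1, 0]                      # count vector before any symbol is sent
counts = []
for _ in range(20):
    c = [A[0][0] * c[0] + A[0][1] * c[1],
         A[1][0] * c[0] + A[1][1] * c[1]]
    counts.append(c[0] + c[1])  # M_n, the total number of valid strings

print(counts[:8])               # [2, 3, 5, 8, 13, 21, 34, 55]: Fibonacci
ratio = counts[-1] / counts[-2]
gamma = (1 + sqrt(5)) / 2       # principal eigenvalue of A, the golden ratio
print(round(ratio, 3), round(log2(gamma), 3))  # 1.618 0.694
```

After twenty iterations the ratio of successive counts already agrees with γ to many decimal places, and log2 γ is the 0.694 bits per channel use quoted in (17.12).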

266 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. 17 | Comm Noiseless Channels 254 unication over Constrained is the . eigen value of A Here, principal 1 to do is nd y of any constrained channel, all we need to nd So the capacit - t the , of its connection matrix. Then principal eigen value, 1 z z d h 0 1 = log : (17.16) C 1 2 6 - s to our Back channels 17.4 model z z h d 1 0 t c it looks and as if channels B and Comparing gure 17.5a and gures 17.5b same three eigen values of the A. The y as channel C have the capacit principal ? - - s for vectors eigen (the same the are trellises at the given B are A and channels of table bottom intimately are channels the indeed ). And p.608 related. C.4, . An accum ulator and 17.7 Figure a di eren tiator. e of channels A and B Equivalenc e any valid an s for channel A and pass it through If we tak accum ulator , string t by: de ned obtaining t = s 1 1 (17.17) = t t 2, + s n mo d 2 for n n n 1 the resulting string is a valid string for channel B, because there are then no 11 s , so there are no isolated digits s in t . The accum ulator is an invertible in operator, so, similarly , any valid string t for channel B can be mapp ed onto a valid string channel A through the binary di eren tiator , s for = s t 1 1 (17.18) t s t 2. = mo d 2 for n 1 n n n are equiv alen t in mo dulo 2 arithmetic, the di eren tiator Because + and is a blurrer, volving the source stream with con lter (1 ; 1). also the C is also intimately related to channels A and B. Channel s c ( s ) 1, p.257 ] [ Exercise C to channels A 17.7. . What is the relationship of channel 1 00000 B? and 2 10000 3 01000 constrained 17.5 Practical communication over channels 00100 4 00010 5 all it in practice? 
OK, so how to do it in practice? Since all three channels are equivalent, we can concentrate on channel A.

Fixed-length solutions

We start with explicitly-enumerated codes. The code in table 17.8 achieves a rate of 3/5 = 0.6.

Table 17.8. A runlength-limited code for channel A.

  s   c(s)
  1   00000
  2   10000
  3   01000
  4   00100
  5   00010
  6   10100
  7   01010
  8   10010

Exercise 17.8.[1, p.257] Similarly, enumerate all strings of length 8 that end in the zero state. (There are 34 of them.) Hence show that we can map 5 bits (32 source strings) to 8 transmitted bits and achieve rate 5/8 = 0.625.

What rate can be achieved by mapping an integer number of source bits to N = 16 transmitted bits?

Optimal variable-length solution

The optimal way to convey information over the constrained channel is to find the optimal transition probabilities for all points in the trellis, Q_{s'|s}, and make transitions with these probabilities.

When discussing channel A, we showed that a sparse source with density f = 0.38, driving code C₂, would achieve capacity. And we know how to make sparsifiers (Chapter 6): we design an arithmetic code that is optimal for compressing a sparse source; then its associated decoder gives an optimal mapping from dense (i.e., random binary) strings to sparse strings.

The task of finding the optimal probabilities is given as an exercise.

Exercise 17.9.[3] Show that the optimal transition probabilities Q can be found as follows.

Find the principal right- and left-eigenvectors of A, that is, the solutions of A e^(R) = λ e^(R) and e^(L)T A = λ e^(L)T with largest eigenvalue λ. Then construct a matrix Q whose invariant distribution is proportional to e_i^(R) e_i^(L), namely

  Q_{s'|s} = e_{s'}^(L) A_{s's} / (λ e_s^(L)).   (17.19)

[Hint: exercise 16.2 (p.245) might give helpful cross-fertilization here.]

Exercise 17.10.[3, p.258] Show that when sequences are generated using the optimal transition probability matrix (17.19), the entropy of the resulting sequence is asymptotically log₂ λ per symbol. [Hint: consider the conditional entropy of just one symbol given the previous one, assuming the previous one's distribution is the invariant distribution.]

In practice, we would probably use finite-precision approximations to the optimal variable-length solution.
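For channel A, equation (17.19) can be evaluated in closed form, since the connection matrix is symmetric and its principal eigenvector is (γ, 1) with γ the golden ratio. A minimal sketch (my own check, not from the book) confirms that the optimal probability of a 1 following a 0 is 1/γ² ≈ 0.382, reproducing the density f ≈ 0.38 mentioned above:

```python
# Channel A (no 11s): A[next][cur]; state 0 = just sent 0, state 1 = just sent 1.
A = [[1, 1],
     [1, 0]]
lam = (1 + 5 ** 0.5) / 2     # principal eigenvalue of A (the golden ratio)
e = [lam, 1.0]               # A is symmetric, so left = right eigenvector

# Equation (17.19): Q[s'|s] = e_L[s'] * A[s'][s] / (lam * e_L[s])
Q = [[e[i] * A[i][j] / (lam * e[j]) for j in range(2)] for i in range(2)]

print(Q[1][0])   # optimal density of 1s after a 0: 1/lam^2, about 0.382
```

Each column of Q sums to one, as a transition matrix must; that is guaranteed by the left-eigenvector relation.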
One might dislike variable-length solutions because of the resulting unpredictability of the actual encoded length in any particular case. Perhaps in some applications we would like a guarantee that the encoded length of a source file of size N bits will be less than a given length such as N/(C + ε). For example, a disk drive is easier to control if all blocks of 512 bytes are known to take exactly the same amount of disk real-estate. For some constrained channels we can make a simple modification to our variable-length encoding and offer such a guarantee, as follows. We find two codes, two mappings of binary strings to variable-length encodings, having the property that for any source string x, if the encoding of x under the first code is shorter than average, then the encoding of x under the second code is longer than average, and vice versa. Then to transmit a string x we encode the whole string with both codes and send whichever encoding has the shortest length, prepended by a suitably encoded single bit to convey which of the two codes is being used.

[Figure 17.9. State diagrams and connection matrices for channels with maximum runlengths for 1s equal to 2 and 3.]

Exercise 17.11.[3C, p.258] How many valid sequences of length 8 starting with a 0 are there for the run-length-limited channels shown in figure 17.9? What are the capacities of these channels?

Using a computer, find the matrices Q for generating a random path through the trellises of channel A and of the two run-length-limited channels shown in figure 17.9.

Exercise 17.12.[3, p.258] Consider the run-length-limited channel in which any length of run of 0s is permitted, and the maximum run length of 1s is a large number L such as nine or ninety.

Estimate the capacity of this channel. (Give the first two terms in a series expansion involving L.)

What, roughly, is the form of the optimal matrix Q for generating a random path through the trellis of this channel? Focus on the values of the elements Q_{1|0}, the probability of generating a 1 given a preceding 0, and the probability of generating a 1 given a preceding run of L − 1 1s. Check your answer by explicit computation for the channel in which the maximum runlength of 1s is nine.

17.6 Variable symbol durations

We can add a further frill to the task of communicating over constrained channels by assuming that the symbols we send have different durations, and that our aim is to communicate at the maximum possible rate per unit time. Such channels come in two flavours: unconstrained, and constrained.

Unconstrained channels with variable symbol durations

We encountered an unconstrained noiseless channel with variable symbol durations in exercise 6.18 (p.125). Solve that problem, and you've done this topic. The task is to determine the optimal frequencies with which the symbols should be used, given their durations.

There is a nice analogy between this task and the task of designing an optimal binary symbol code (Chapter 4).
When we make an optimal symbol code for a source with unequal probabilities p_i, the optimal message lengths are l_i* = log₂ (1/p_i), so

  p_i = 2^(−l_i*).   (17.20)

Similarly, when we have a channel whose symbols have durations l_i (in some units of time), the optimal probability with which those symbols should be used is

  p_i* = 2^(−β l_i),   (17.21)

where β is the capacity of the channel in bits per unit time.

Constrained channels with variable symbol durations

Once you have grasped the preceding topics in this chapter, you should be able to figure out how to define and find the capacity of these, the trickiest constrained channels.

Exercise 17.13.[3] A classic example of a constrained channel with variable symbol durations is the 'Morse' channel, whose symbols are the dot d, the dash D, the short space (used between letters in morse code) s, and the long space (used between words) S; the constraints are that spaces may only be followed by dots and dashes.

Find the capacity of this channel in bits per unit time assuming (a) that all four symbols have equal durations; or (b) that the four symbol durations are 2, 4, 3 and 6 time units respectively.
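For the unconstrained case, equation (17.21) turns directly into a numerical recipe: the capacity β is the root of Σ_i 2^(−β l_i) = 1, which can be found by bisection. A sketch (my own illustration; the two-symbol durations below are a made-up example, not the Morse exercise) shows that durations (1, 2) give the same capacity as channel A, log₂ of the golden ratio:

```python
import math

def channel_capacity(durations):
    """Solve sum_i 2^(-beta * l_i) = 1 for beta by bisection (cf. eq. 17.21)."""
    lo, hi = 0.0, 16.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(2.0 ** (-mid * l) for l in durations) > 1:
            lo = mid          # the sum decreases as beta grows
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical two-symbol channel with durations 1 and 2 time units:
# 2^-b + 2^-2b = 1 gives 2^-b = 1/phi, i.e. b = log2(phi) = 0.694 bits/unit time.
beta = channel_capacity([1, 2])
p = [2 ** (-beta * l) for l in [1, 2]]    # optimal frequencies, eq. (17.21)
print(beta, p)
```

The same function applies unchanged to any list of symbol durations.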

Exercise 17.14.[4] How well-designed is Morse code for English (with, say, the probability distribution of figure 2.1)?

Exercise 17.15.[3C] How difficult is it to get DNA into a narrow tube?

To an information theorist, the entropy associated with a constrained channel reveals how much information can be conveyed over it. In statistical physics, the same calculations are done for a different reason: to predict the thermodynamics of polymers, for example.

As a toy example, consider a polymer of length N that can either sit in a constraining tube, of width L, or in the open where there are no constraints. In the open, the polymer adopts a state drawn at random from the set of one-dimensional random walks, with, say, 3 possible directions per step. The entropy of this walk is log 3 per step, i.e., a total of N log 3. [The free energy of the polymer is defined to be kT times this, where T is the temperature.] In the tube, the polymer's one-dimensional walk can go in 3 directions unless the wall is in the way, so the connection matrix is, for example (if L = 10), the tridiagonal matrix

  ⎡ 1 1 0 0 ⋯ 0 0 ⎤
  ⎢ 1 1 1 0 ⋯ 0 0 ⎥
  ⎢ 0 1 1 1 ⋯ 0 0 ⎥
  ⎢ ⋮       ⋱     ⎥
  ⎢ 0 0 ⋯ 1 1 1 0 ⎥
  ⎢ 0 0 ⋯ 0 1 1 1 ⎥
  ⎣ 0 0 ⋯ 0 0 1 1 ⎦

Now, what is the entropy of the polymer? What is the change in entropy associated with the polymer entering the tube? If possible, obtain an expression as a function of L. Use a computer to find the entropy of the walk for a particular value of L, e.g. 20, and plot the probability density of the polymer's transverse location in the tube.

[Figure 17.10. Model of DNA squashed in a narrow tube.]
Notice that the difference in capacity between the two channels, one constrained and one unconstrained, is directly proportional to the force required to pull the DNA into the tube. [The DNA will have a tendency to pop out of the tube, because, outside the tube, its random walk has greater entropy.]

17.7 Solutions

Solution to exercise 17.5 (p.250). A file transmitted by C₂ contains, on average, one-third 1s and two-thirds 0s. If f = 0.38, the fraction of 1s is f/(1 + f) = (γ − 1.0)/(2γ − 1.0) = 0.2764, where γ is the golden ratio.

Solution to exercise 17.7 (p.254). A valid string for channel C can be obtained from a valid string for channel A by first inverting it [1 → 0; 0 → 1], then passing it through an accumulator. These operations are invertible, so any valid string for C can also be mapped onto a valid string for A. The only proviso here comes from the edge effects. If we assume that the first character transmitted over channel C is preceded by a string of zeroes, so that the first character is forced to be a 1 (figure 17.5c), then the two channels are exactly equivalent only if we assume that channel A's first character must be a zero.

Solution to exercise 17.8 (p.254). With N = 16 transmitted bits, the largest integer number of source bits that can be encoded is 10, so the maximum rate of a fixed length code with N = 16 is 0.625.

Solution to exercise 17.10 (p.255). Let the invariant distribution be

  P(s) = α e_s^(L) e_s^(R),   (17.22)

where α is a normalization constant. Here, as in Chapter 4, S denotes the ensemble whose random variable is the state s_t. The entropy of S_t given S_{t−1}, assuming S_{t−1} comes from the invariant distribution, is

  H(S_t | S_{t−1}) = −Σ_{s,s'} P(s) P(s'|s) log P(s'|s)   (17.23)

  = −Σ_{s,s'} α e_s^(L) e_s^(R) · [e_{s'}^(L) A_{s's} / (λ e_s^(L))] · log [e_{s'}^(L) A_{s's} / (λ e_s^(L))]   (17.24)

  = −Σ_{s,s'} (α e_s^(R) e_{s'}^(L) A_{s's} / λ) [ log e_{s'}^(L) + log A_{s's} − log λ − log e_s^(L) ].   (17.25)

Now, A_{s's} is either 0 or 1, so the contributions from the terms proportional to A_{s's} log A_{s's} are all zero. So

  H(S_t | S_{t−1}) = log λ − (α/λ) Σ_{s'} ( e_{s'}^(L) log e_{s'}^(L) ) Σ_s A_{s's} e_s^(R) + (α/λ) Σ_s ( e_s^(R) log e_s^(L) ) Σ_{s'} e_{s'}^(L) A_{s's}   (17.26)

  = log λ − α Σ_{s'} e_{s'}^(R) e_{s'}^(L) log e_{s'}^(L) + α Σ_s e_s^(R) e_s^(L) log e_s^(L)   (17.27)

  = log λ.   (17.28)

Solution to exercise 17.11 (p.255). The principal eigenvalues of the connection matrices of the two channels are 1.839 and 1.928. The capacities (log λ) are 0.879 and 0.947 bits.

Solution to exercise 17.12 (p.256). The channel is similar to the unconstrained binary channel; runs of length greater than L are rare if L is large, so we only expect weak differences from this channel; these differences will show up in contexts where the run length is close to L. The capacity of the channel is very close to one bit.
A lower bound on the capacity is obtained by considering the simple variable-length code for this channel which replaces occurrences of the maximum runlength string 111…1 by 111…10, and otherwise leaves the source file unchanged. The average rate of this code is 1/(1 + 2^(−L)) because the invariant distribution will hit the 'add an extra zero' state a fraction 2^(−L) of the time.

We can reuse the solution for the variable-length channel in exercise 6.18 (p.125). The capacity is the value of β such that the equation

  Z(β) = Σ_{l=1}^{L+1} 2^(−β l) = 1   (17.29)

is satisfied. The L + 1 terms in the sum correspond to the L + 1 possible strings that can be emitted, 0, 10, 110, …, 11…10. The sum is exactly given by:

  Z(β) = 2^(−β) · (2^(−β(L+1)) − 1) / (2^(−β) − 1).   (17.30)
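Equation (17.29) can be solved numerically by bisection. A quick check (my own sketch, not from the book) reproduces the true capacities quoted above for L = 2 and L = 3 (0.879 and 0.947 bits) and also evaluates the series approximation to the capacity, 1 − 2^(−(L+2))/ln 2:

```python
import math

def Z(beta, L):
    # The L+1 strings emitted between returns to the zero state: 0, 10, ..., 1^L 0
    return sum(2.0 ** (-beta * l) for l in range(1, L + 2))   # eq. (17.29)

def capacity(L):
    lo, hi = 0.0, 2.0
    for _ in range(200):            # bisection: Z decreases as beta grows
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Z(mid, L) > 1 else (lo, mid)
    return lo

def approx(L):
    return 1 - 2.0 ** (-(L + 2)) / math.log(2)   # series approximation

for L in (2, 3, 9):
    print(L, round(capacity(L), 4), round(approx(L), 4))
```

As L grows, the exact and approximate capacities converge, both approaching the unconstrained value of one bit.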

Here we used Σ_{n=0}^{N} a rⁿ = a (r^(N+1) − 1)/(r − 1).

We anticipate that β should be a little less than 1 in order for Z(β) to equal 1. Rearranging and solving approximately for β, using ln(1 + x) ≃ x,

  Z(β) = 1   (17.31)
  ⇒ β ≃ 1 − 2^(−(L+2)) / ln 2.   (17.32)

We evaluated the true capacities for L = 2 and L = 3 in an earlier exercise. The table compares the approximate capacity β with the true capacity for a selection of values of L.

  L   True capacity   Approximate capacity
  2      0.879            0.910
  3      0.947            0.955
  4      0.975            0.977
  5      0.9881           0.9887
  6      0.9942           0.9944
  9      0.9993           0.9993

The element Q_{0|1} will be close to 1/2 (just a tiny bit larger), since in the unconstrained binary channel Q_{0|1} = 1/2. When a run of length L − 1 has occurred, we effectively have a choice of printing 10 or 0. Let the probability of selecting 10 be f. Let us estimate the entropy of the remaining N characters in the stream as a function of f, assuming the rest of the matrix Q to have been set to its optimal value. The entropy of the next N characters in the stream is the entropy of the first bit, H₂(f), plus the entropy of the remaining characters, which is roughly (N − 1) bits if we select 0 as the first bit and (N − 2) bits if 1 is selected. More precisely, if C is the capacity of the channel (which is roughly 1),

  H(the next N chars) ≃ H₂(f) + [(N − 1)(1 − f) + (N − 2) f] C
                      = H₂(f) + NC − C − fC ≃ H₂(f) + N − f.   (17.33)

Differentiating and setting to zero to find the optimal f, we obtain:

  log₂ [(1 − f)/f] ≃ 1  ⇒  (1 − f)/f ≃ 2  ⇒  f ≃ 1/3.   (17.34)

The probability of emitting a 1 thus decreases from about 0.5 to about 1/3 as the number of emitted 1s increases.

Here is the optimal matrix, with columns indexed by the current state (a run of 1s of length 0, 1, …, 9) and rows by the next state. The first row gives the probability of emitting a 0 (returning to the zero-run state); the subdiagonal gives the probability of emitting a 1 (extending the run):

  ⎡ 0.5002 0.5007 0.5017 0.5037 0.5077 0.5159 0.5331 0.5713 0.6666 1 ⎤
  ⎢ 0.4998   0      0      0      0      0      0      0      0    0 ⎥
  ⎢   0    0.4993   0      0      0      0      0      0      0    0 ⎥
  ⎢   0      0    0.4983   0      0      0      0      0      0    0 ⎥
  ⎢   0      0      0    0.4963   0      0      0      0      0    0 ⎥
  ⎢   0      0      0      0    0.4923   0      0      0      0    0 ⎥   (17.35)
  ⎢   0      0      0      0      0    0.4841   0      0      0    0 ⎥
  ⎢   0      0      0      0      0      0    0.4669   0      0    0 ⎥
  ⎢   0      0      0      0      0      0      0    0.4287   0    0 ⎥
  ⎣   0      0      0      0      0      0      0      0    0.3334 0 ⎦

Our rough theory works.

18  Crosswords and Codebreaking

In this chapter we make a random walk through a few topics related to language modelling.

18.1 Crosswords

The rules of crossword-making may be thought of as defining a constrained channel. The fact that many valid crosswords can be made demonstrates that this constrained channel has a capacity greater than zero.

There are two archetypal crossword formats. In a 'type A' (or American) crossword, every row and column consists of a succession of words of length 2 or more separated by one or more spaces. In a 'type B' (or British) crossword, each row and column consists of a mixture of words and single characters, separated by one or more spaces, and every character lies in at least one word (horizontal or vertical). Whereas in a type A crossword every letter lies in a horizontal word and a vertical word, in a typical type B crossword only about half of the letters do so; the other half lie in one word only.

Type A crosswords are harder to create than type B because of the constraint that no single characters are permitted. Type B crosswords are generally harder to solve because there are fewer constraints per character.

[Figure 18.1. Crosswords of types A (American) and B (British).]
Why are crosswords possible?

If a language has no redundancy, then any letters written on a grid form a valid crossword. In a language with high redundancy, on the other hand, it is hard to make crosswords (except perhaps a small number of trivial ones). The possibility of making crosswords in a language thus demonstrates a bound on the redundancy of that language. Crosswords are not normally written in genuine English. They are written in 'word-English', the language consisting of strings of words from a dictionary, separated by spaces.

Exercise 18.1.[2] Estimate the capacity of word-English, in bits per character. [Hint: think of word-English as defining a constrained channel (Chapter 17) and see exercise 6.18 (p.125).]

The fact that many crosswords can be made leads to a lower bound on the entropy of word-English.

For simplicity, we now model word-English by Wenglish, the language introduced in section 4.1 which consists of W words all of length L. The entropy of such a language, per character, including inter-word spaces, is:

  H_W ≡ log₂ W / (L + 1).   (18.1)

We'll find that the conclusions we come to depend on the value of H_W and are not terribly sensitive to the value of L. Consider a large crossword of size S squares in area. Let the number of words be f_w S and let the number of letter-occupied squares be f₁ S. For typical crosswords of types A and B made of words of length L, the two fractions f_w and f₁ have roughly the values in table 18.2.

Table 18.2. Factors f_w and f₁ by which the number of words and the number of letter-squares respectively are smaller than the total number of squares.

           A              B
  f_w   2/(L+1)       1/(L+1)
  f₁    L/(L+1)    (3/4) L/(L+1)

We now estimate how many crosswords there are of size S using our simple model of Wenglish. We assume that Wenglish is created at random by generating W strings from a monogram (i.e., memoryless) source with entropy H₀. If, for example, the source used all A = 26 characters with equal probability then H₀ = log₂ A = 4.7 bits. If instead we use Chapter 2's distribution then the entropy is 4.2. The redundancy of Wenglish stems from two sources: it tends to use some letters more than others; and there are only W words in the dictionary.

Let's now count how many crosswords there are by imagining filling in the squares of a crossword at random using the same distribution that produced the Wenglish dictionary, and evaluating the probability that this random scribbling produces valid words in all rows and columns.
The total number of typical fillings-in of the f₁ S squares in the crossword that can be made is

  |T| = 2^(S f₁ H₀).   (18.2)

The probability that one word of length L is validly filled-in is

  β = W / 2^(L H₀),   (18.3)

and the probability that the whole crossword, made of f_w S words, is validly filled-in by a single typical filling-in is approximately

  β^(f_w S).   (18.4)

So the log of the number of valid crosswords of size S is estimated to be

  log [ β^(f_w S) |T| ] = S [ (f₁ − f_w L) H₀ + f_w log W ]   (18.5)
                        = S [ (f₁ − f_w L) H₀ + f_w (L + 1) H_W ],   (18.6)

which is an increasing function of S only if

  (f₁ − f_w L) H₀ + f_w (L + 1) H_W > 0.   (18.7)

So arbitrarily many crosswords can be made only if there's enough words in the Wenglish dictionary that

  H_W > [ (f_w L − f₁) / (f_w (L + 1)) ] H₀.   (18.8)

[This calculation underestimates the number of valid Wenglish crosswords by counting only crosswords filled with 'typical' strings. If the monogram distribution is non-uniform then the true count is dominated by 'atypical' fillings-in, in which crossword-friendly words appear more often.]

Plugging in the values of f₁ and f_w from table 18.2, we find the following.

  Crossword type               A                        B
  Condition for crosswords     H_W > (1/2) [L/(L+1)] H₀     H_W > (1/4) [L/(L+1)] H₀

If we set H₀ = 4.2 bits and assume there are W = 4000 words in a normal English-speaker's dictionary, all with length L = 5, then we find that the condition for crosswords of type B is satisfied, but the condition for crosswords of type A is only just satisfied. This fits with my experience that crosswords of type A usually contain more obscure words.
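The closing numerical claim is easy to verify. A few lines of arithmetic (my own check of the numbers just quoted):

```python
import math

H0 = 4.2            # bits per character of the monogram source
W, L = 4000, 5      # dictionary size and word length

HW = math.log2(W) / (L + 1)               # Wenglish entropy, eq. (18.1)
need_A = 0.5 * (L / (L + 1)) * H0         # condition for type A crosswords
need_B = 0.25 * (L / (L + 1)) * H0        # condition for type B crosswords

print(round(HW, 2), round(need_A, 2), round(need_B, 3))
# HW is about 1.99 bits: comfortably above the type B threshold of 0.875,
# and only just above the type A threshold of 1.75.
```

The margin for type A is under a quarter of a bit, matching the remark that type A crosswords are only just possible.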

Further reading

These observations about crosswords were first made by Shannon (1948); I learned about them from Wolf and Siegel (1998). The topic is closely related to the capacity of two-dimensional constrained channels. An example of a two-dimensional constrained channel is a two-dimensional bar-code, as seen on parcels.

Exercise 18.2.[3] A two-dimensional channel is defined by the constraint that, of the eight neighbours of every interior pixel in an N × N rectangular grid, four must be black and four white. (The counts of black and white pixels around boundary pixels are not constrained.) A binary pattern satisfying this constraint is shown in figure 18.3. What is the capacity of this channel, in bits per pixel, for large N?

[Figure 18.3. A binary pattern in which every pixel is adjacent to four black and four white pixels.]

18.2 Simple language models

The Zipf–Mandelbrot distribution

The crudest model for a language is the monogram model, which asserts that each successive word is drawn independently from a distribution over words. What is the nature of this distribution over words?

Zipf's law (Zipf, 1949) asserts that the probability of the rth most probable word in a language is approximately

  P(r) = κ / r^α,   (18.9)

where the exponent α has a value close to 1, and κ is a constant. According to Zipf, a log–log plot of frequency versus word-rank should show a straight line with slope −α.

Mandelbrot's (1982) modification of Zipf's law introduces a third parameter v, asserting that the probabilities are given by

  P(r) = κ / (r + v)^α.   (18.10)

For some documents, such as Jane Austen's Emma, the Zipf–Mandelbrot distribution fits well – figure 18.4.
Other documents give distributions that are not so well fitted by a Zipf–Mandelbrot distribution. Figure 18.5 shows a plot of frequency versus rank for the LaTeX source of this book. Qualitatively, the graph is similar to a straight line, but a curve is noticeable. To be fair, this source file is not written in pure English – it is a mix of English, maths symbols such as 'x', and LaTeX commands.

[Figure 18.4. Fit of the Zipf–Mandelbrot distribution (18.10) (curve) to the empirical frequencies of words in Jane Austen's Emma (dots). The fitted parameters are κ = 0.56; v = 8.0; α = 1.26.]

[Figure 18.5. Log–log plot of frequency versus rank for the words in the LaTeX file of this book.]

[Figure 18.6. Zipf plots for four 'languages' randomly generated from Dirichlet processes with parameter α ranging from 1 to 1000. Also shown is the Zipf plot for this book.]

The Dirichlet process

Assuming we are interested in monogram models for languages, what model should we use? One difficulty in modelling a language is the unboundedness of vocabulary. The greater the sample of language, the greater the number of words encountered. A generative model for a language should emulate this property. If asked 'what is the next word in a newly-discovered work of Shakespeare?' our probability distribution over words must surely include some non-zero probability for words that Shakespeare never used before. Our generative monogram model for language should also satisfy a consistency rule called exchangeability. If we imagine generating a new language from our generative model, producing an ever-growing corpus of text, all statistical properties of the text should be homogeneous: the probability of finding a particular word at a given location in the stream of text should be the same everywhere in the stream.

The Dirichlet process model is a model for a stream of symbols (which we think of as 'words') that satisfies the exchangeability rule and that allows the vocabulary of symbols to grow without limit. The model has one parameter α.
As the stream of symbols is produced, we identify each new symbol by a unique integer w. When we have seen a stream of length F symbols, we define the probability of the next symbol in terms of the counts {F_w} of the symbols seen so far thus: the probability that the next symbol is a new symbol, never seen before, is

  α / (F + α).   (18.11)

The probability that the next symbol is symbol w is

  F_w / (F + α).   (18.12)

Figure 18.6 shows Zipf plots (i.e., plots of symbol frequency versus rank) for million-symbol 'documents' generated by Dirichlet process priors with values of α ranging from 1 to 1000.

It is evident that a Dirichlet process is not an adequate model for observed distributions that roughly obey Zipf's law.
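Equations (18.11) and (18.12) translate directly into a sampler. A minimal sketch (my own illustration; the stream length and α = 10 below are arbitrary choices, not values from the book):

```python
import random

def dirichlet_stream(n, alpha, rng):
    """Draw n symbols from a Dirichlet process (equations 18.11 and 18.12)."""
    counts = []                  # counts[w] = F_w, occurrences of symbol w so far
    F = 0
    for _ in range(n):
        if rng.random() < alpha / (F + alpha):
            w = len(counts)      # a brand-new symbol, probability alpha/(F+alpha)
            counts.append(0)
        else:                    # an old symbol w, probability F_w/(F+alpha)
            r = rng.random() * F
            w = 0
            while r >= counts[w]:
                r -= counts[w]
                w += 1
        counts[w] += 1
        F += 1
    return counts

counts = dirichlet_stream(100000, alpha=10.0, rng=random.Random(0))
print(len(counts))   # vocabulary size; it grows roughly like alpha * ln(n)
```

Sorting `counts` in decreasing order and plotting frequency against rank on log–log axes gives the Zipf plots of figure 18.6.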

[Figure 18.7. Zipf plots for the words of two 'languages' generated by creating successive characters from a Dirichlet process with α = 2, and declaring one character to be the space character. The two curves result from two different choices of the space character.]

With a small tweak, however, Dirichlet processes can produce rather nice Zipf plots. Imagine generating a language composed of elementary symbols using a Dirichlet process with a rather small value of the parameter α, so that the number of reasonably frequent symbols is about 27. If we then declare one of those symbols (now called 'characters' rather than words) to be a space character, then we can identify the strings between the space characters as 'words'. If we generate a language in this way then the frequencies of words often come out as very nice Zipf plots, as shown in figure 18.7. Which character is selected as the space character determines the slope of the Zipf plot – a less probable space character gives rise to a richer language with a shallower slope.
18.3 Units of information content

The information content of an outcome, x, whose probability is P(x), is defined to be

  h(x) = log 1/P(x).   (18.13)

The entropy of an ensemble is an average information content,

  H(X) = Σ_x P(x) log 1/P(x).   (18.14)

When we compare hypotheses with each other in the light of data, it is often convenient to compare the log of the probability of the data under the alternative hypotheses,

  'log evidence for H_i' = log P(D|H_i),   (18.15)

or, in the case where just two hypotheses are being compared, we evaluate the 'log odds',

  log [ P(D|H₁) / P(D|H₂) ],   (18.16)

which has also been called the 'weight of evidence in favour of H₁'. The log evidence for a hypothesis, log P(D|H_i), is the negative of the information content of the data D: if the data have large information content, given a hypothesis, then they are surprising to that hypothesis; if some other hypothesis is not so surprised by the data, then that hypothesis becomes more probable. 'Information content', 'surprise value', log likelihood and log evidence are the same thing.

All these quantities are logarithms of probabilities, or weighted sums of logarithms of probabilities, so they can all be measured in the same units. The units depend on the choice of the base of the logarithm.

The names that have been given to these units are shown in table 18.8.

Table 18.8. Units of measurement of information content.

  Expression     Unit
  log₂ p         bit
  log_e p        nat
  log₁₀ p        ban
  10 log₁₀ p     deciban (db)

The bit is the unit that we use most in this book. Because the word 'bit' has other meanings, a backup name for this unit is the shannon. A byte is 8 bits. A megabyte is 2²⁰ ≃ 10⁶ bytes. If one works in natural logarithms, information contents and weights of evidence are measured in nats. The most interesting units are the ban and the deciban.

The history of the ban

Let me tell you why a factor of ten in probability is called a ban. When Alan Turing and the other codebreakers at Bletchley Park were breaking each new day's Enigma code, their task was a huge inference problem: to infer, given the day's cyphertext, which three wheels were in the Enigma machines that day; what their starting positions were; what further letter substitutions were in use on the steckerboard; and, not least, what the original German messages were. These inferences were conducted using Bayesian methods (of course!), and the chosen units were decibans or half-decibans, the deciban being judged the smallest weight of evidence discernible to a human. The evidence in favour of particular hypotheses was tallied using sheets of paper that were specially printed in Banbury, a town about 30 miles from Bletchley. The inference task was known as Banburismus, and the units in which Banburismus was played were called bans, after that town.
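Since all four units in table 18.8 are logarithms of the same probability, converting between them is just a change of base. A small sketch (my own, for illustration):

```python
import math

# Conversions between the units of table 18.8.
def bits_to_nats(h):
    return h * math.log(2)

def bits_to_bans(h):
    return h * math.log10(2)

def bits_to_decibans(h):
    return 10 * h * math.log10(2)

h = math.log2(10)        # information content of a probability-1/10 outcome, in bits
print(round(h, 2), round(bits_to_bans(h), 6), round(bits_to_decibans(h), 6))
# a factor of ten in probability: 3.32 bits = exactly 1 ban = 10 decibans
```

So one ban is log₂ 10 ≃ 3.32 bits, and a deciban is about a third of a bit.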
for a long time, but some aspects of Banburismus can be pieced together. I hope the following description of a small part of Banburismus is not too inaccurate. [Footnote: I've been most helped by descriptions given by Tony Sale (http://www.codesandciphers.org.uk/lectures/) and by Jack Good (1979), who worked with Turing at Bletchley.]

How much information was needed? The number of possible settings of the Enigma machine was about 8 × 10^12. To deduce the state of the machine, 'it was therefore necessary to find about 129 decibans from somewhere', as Good puts it. Banburismus was aimed not at deducing the entire state of the machine, but only at figuring out which wheels were in use; the logic-based bombes, fed with guesses of the plain text (cribs), were then used to crack what the settings of the wheels were.

The Enigma machine, once its wheels and plugs were put in place, implemented a continually-changing permutation cypher that wandered deterministically through a state space of 26^3 permutations. Because an enormous number of messages were sent each day, there was a good chance that whatever state one machine was in when sending one character of a message, there would be another machine in the same state while sending a particular character in another message. Because the evolution of the machine's state was deterministic, the two machines would remain in the same state as each other

for the rest of the transmission. The resulting correlations between the outputs of such pairs of machines provided a dribble of information-content from which Turing and his co-workers extracted their daily 129 decibans.

How to detect that two messages came from machines with a common state sequence

The hypotheses are the null hypothesis, H_0, which states that the machines are in different states, and that the two plain messages are unrelated; and the 'match' hypothesis, H_1, which says that the machines are in the same state, and that the two plain messages are unrelated. No attempt is being made here to infer what the state of either machine is. The data provided are the two cyphertexts x and y; let's assume they both have length T and that the alphabet size is A (26 in Enigma). What is the probability of the data, given the two hypotheses?

First, the null hypothesis. This hypothesis asserts that the two cyphertexts are given by

   x = x_1 x_2 x_3 ... = c_1(u_1) c_2(u_2) c_3(u_3) ...             (18.17)

and

   y = y_1 y_2 y_3 ... = c'_1(v_1) c'_2(v_2) c'_3(v_3) ...,         (18.18)

where the codes c_t and c'_t are two unrelated time-varying permutations of the alphabet, and u_1 u_2 u_3 ... and v_1 v_2 v_3 ... are the plain text messages. An exact computation of the probability of the data (x, y) would depend on a language model of the plain text, and a model of the Enigma machine's guts, but if we assume that each Enigma machine is an ideal random time-varying permutation, then the probability distribution of the two cyphertexts is uniform. All cyphertexts are equally likely.

   P(x, y | H_0) = (1/A)^{2T}   for all x, y of length T.           (18.19)

What about H_1?
This hypothesis asserts that a single time-varying permutation c_t underlies both

   x = x_1 x_2 x_3 ... = c_1(u_1) c_2(u_2) c_3(u_3) ...             (18.20)

and

   y = y_1 y_2 y_3 ... = c_1(v_1) c_2(v_2) c_3(v_3) ....            (18.21)

What is the probability of the data (x, y)? We have to make some assumptions about the plain text language. If it were the case that the plain text language was completely random, then the probability of u_1 u_2 u_3 ... and v_1 v_2 v_3 ... would be uniform, and so would that of x and y, so the probability P(x, y | H_1) would be equal to P(x, y | H_0), and the two hypotheses H_0 and H_1 would be indistinguishable.

We make progress by assuming that the plain text is not completely random. Both plain texts are written in a language, and that language has redundancies. Assume for example that particular plain text letters are used more often than others. So, even though the two plain text messages are unrelated, they are slightly more likely to use the same letters as each other; if H_1 is true, two synchronized letters from the two cyphertexts are slightly more likely to be identical. Similarly, if a language uses particular bigrams and trigrams frequently, then the two plain text messages will occasionally contain the same bigrams and trigrams at the same time as each other, giving rise, if H_1 is true,

to a little burst of 2 or 3 identical letters.

Table 18.9. Two aligned pieces of English plain text, u and v, with matches marked by *. Notice that there are twelve matches, including a run of six, whereas the expected number of matches in two completely random strings of length T = 74 would be about 3. The two corresponding cyphertexts from two machines in identical states would also have twelve matches.

   u: LITTLE-JACK-HORNER-SAT-IN-THE-CORNER-EATING-A-CHRISTMAS-PIE--HE-PUT-IN-H
   v: RIDE-A-COCK-HORSE-TO-BANBURY-CROSS-TO-SEE-A-FINE-LADY-UPON-A-WHITE-HORSE
   matches: ...*...*...*... .*...*..******.*

The codebreakers hunted among pairs of messages for pairs that were suspiciously similar to each other, counting up the numbers of matching monograms, bigrams, trigrams, etc. This method was first used by the Polish codebreaker Rejewski.

Let's look at the simple case of a monogram language model and estimate how long a message is needed to be able to decide whether two machines are in the same state. I'll assume the source language is monogram-English, the language in which successive letters are drawn i.i.d. from the probability distribution {p_i} of figure 2.1.
The probability distribution of x and y is nonuniform: consider two single characters, x_t = c_t(u_t) and y_t = c_t(v_t); the probability that they are identical is

   Σ_{u_t, v_t} P(u_t) P(v_t) [u_t = v_t] = Σ_i p_i^2 ≡ m.          (18.22)

We give this quantity the name m, for 'match probability'; both for English and for German, m is about 2/26 rather than 1/26 (the value that would hold for a completely random language). Assuming that c_t is an ideal random permutation, the probability of x_t and y_t is, by symmetry,

   P(x_t, y_t | H_1) = { m/A                if x_t = y_t,
                       { (1−m) / (A(A−1))   for x_t ≠ y_t.          (18.23)

Given a pair of cyphertexts x and y of length T that match in M places and do not match in N places, the log evidence in favour of H_1 is then

   log [ P(x, y | H_1) / P(x, y | H_0) ]
      = M log [ (m/A) / (1/A^2) ] + N log [ ((1−m)/(A(A−1))) / (1/A^2) ]   (18.24)
      = M log mA + N log [ (1−m)A / (A−1) ].                               (18.25)

Every match contributes log mA in favour of H_1; every non-match contributes log [ (A−1) / ((1−m)A) ] in favour of H_0.

   Match probability for monogram-English   m     0.076
   Coincidental match probability           1/A   0.037
   Log-evidence for H_1 per match           10 log_10 mA                3.1 db
   Log-evidence for H_1 per non-match       10 log_10 [(1−m)A/(A−1)]   −0.18 db

If there were M = 4 matches and N = 47 non-matches in a pair of length T = 51, for example, the weight of evidence in favour of H_1 would be +4 decibans, or a likelihood ratio of 2.5 to 1 in favour.

The expected weight of evidence from a line of text of length T = 20 characters is the expectation of (18.25), which depends on whether H_1 or H_0 is true. If H_1 is true then matches are expected to turn up at rate m, and the expected weight of evidence is 1.4 decibans per 20 characters. If H_0 is true
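The numbers in the table above follow directly from equation (18.25). A minimal sketch, taking m = 2/26 and A = 26 as in the text (the text quotes figures rounded to +3.1 db per match and +4 db for the worked example):

```python
import math

A = 26       # alphabet size
m = 2 / 26   # match probability for monogram-English (as in the text)

def decibans(p_ratio):
    return 10 * math.log10(p_ratio)

# Per-character scores from equation (18.25)
per_match    = decibans(m * A)                  # ~ +3.0 db per match
per_nonmatch = decibans((1 - m) * A / (A - 1))  # ~ -0.18 db per non-match

def weight_of_evidence(M, N):
    """Log evidence (in decibans) for H1 from M matches and N non-matches."""
    return M * per_match + N * per_nonmatch

# The worked example: M = 4 matches, N = 47 non-matches, T = 51
w = weight_of_evidence(4, 47)       # ~ +4 decibans
likelihood_ratio = 10 ** (w / 10)   # ~ 2.4; the text's rounded +4 db gives 2.5

# Expected evidence per 20 characters under each hypothesis
e_H1 = 20 * (m * per_match + (1 - m) * per_nonmatch)        # ~ +1.4 db
e_H0 = 20 * ((1/A) * per_match + (1 - 1/A) * per_nonmatch)  # ~ -1.1 db
print(per_match, per_nonmatch, w, likelihood_ratio, e_H1, e_H0)
```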

then spurious matches are expected to turn up at rate 1/A, and the expected weight of evidence is −1.1 decibans per 20 characters. Typically, roughly 400 characters need to be inspected in order to have a weight of evidence greater than a hundred to one (20 decibans) in favour of one hypothesis or the other.

So, two English plain texts have more matches than two random strings. Furthermore, because consecutive characters in English are not independent, the bigram and trigram statistics of English are nonuniform and the matches tend to occur in bursts of consecutive matches. [The same observations also apply to German.] Using better language models, the evidence contributed by runs of matches was more accurately computed. Such a scoring system was worked out by Turing and refined by Good. Positive results were passed on to automated and human-powered codebreakers. According to Good, the longest false-positive that arose in this work was a string of 8 consecutive matches between two machines that were actually in unrelated states.

Further reading

For further reading about Turing and Bletchley Park, see Hodges (1983) and Good (1979). For an in-depth read about cryptography, Schneier's (1996) book is highly recommended. It is readable, clear, and entertaining.

18.5 Exercises

Exercise 18.3.[2] Another weakness in the design of the Enigma machine, which was intended to emulate a perfectly random time-varying permutation, is that it never mapped a letter to itself. When you press Q, what comes out is always a different letter from Q. How much information per character is leaked by this design flaw? How long a crib would be needed to be confident that the crib is correctly aligned with the cyphertext? And how long a crib would be needed to be able confidently to identify the correct key?

[A crib is a guess for what the plain text was. Imagine that the Brits know that a very important German is travelling from Berlin to Aachen, and they intercept Enigma-encoded messages sent to Aachen. It is a good bet that one or more of the original plain text messages contains the string OBERSTURMBANNFUEHRERXGRAFXHEINRICHXVONXWEIZSAECKER, the name of the important chap. A crib could be used in a brute-force approach to find the correct Enigma key (feed the received messages through all possible Enigma machines and see if any of the putative decoded texts match the above plain text). This question centres on the idea that the crib can also be used in a much less expensive manner: slide the plain text crib along all the encoded messages until a perfect mismatch of the crib and the encoded message is found; if correct, this alignment then tells you a lot about the key.]
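The sliding-crib idea can be illustrated with a toy cipher that shares the relevant property of Enigma: it never maps a letter to itself (each position uses an independent random derangement). This is not a real Enigma model; the names and the message are invented for illustration.

```python
import random
import string

random.seed(0)
ALPHABET = string.ascii_uppercase

def random_derangement(symbols):
    """A random permutation with no fixed points (no letter maps to itself)."""
    while True:
        perm = list(symbols)
        random.shuffle(perm)
        if all(a != b for a, b in zip(symbols, perm)):
            return dict(zip(symbols, perm))

def encipher(plaintext):
    """Toy Enigma-like cipher: an independent random derangement per position."""
    return "".join(random_derangement(ALPHABET)[ch] for ch in plaintext)

# A message containing the crib somewhere inside it (true offset is 5)
crib = "OBERSTURMBANNFUEHRER"
plaintext = "XQZJW" + crib + "KPLMD"
ciphertext = encipher(plaintext)

# Slide the crib along the ciphertext; any position where a crib letter
# equals the cipher letter rules that alignment out, because the machine
# never maps a letter to itself.  The true alignment always survives.
viable = [s for s in range(len(ciphertext) - len(crib) + 1)
          if all(c != y for c, y in zip(crib, ciphertext[s:]))]
print(viable)
```

Each wrong alignment survives the test only with probability (25/26)^20 ≈ 0.46 here, so longer cribs and longer messages rapidly winnow the candidates down.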

19  Why have Sex? Information Acquisition and Evolution

Evolution has been happening on earth for about the last 10^9 years. Undeniably, information has been acquired during this process. Thanks to the tireless work of the Blind Watchmaker, some cells now carry within them all the information required to be outstanding spiders; other cells carry all the information required to make excellent octopuses. Where did this information come from?

The blueprint of all organisms on the entire planet has emerged in a teaching process in which the teacher is natural selection: fitter individuals have more progeny, the fitness being defined by the local environment (including the other organisms). The teaching signal is only a few bits per individual: an individual simply has a smaller or larger number of grandchildren, depending on the individual's fitness. 'Fitness' is a broad term that could cover

   the ability of an antelope to run faster than other antelopes and hence avoid being eaten by a lion;
   the ability of a lion to be well-enough camouflaged and run fast enough to catch one antelope per day;
   the ability of a peacock to attract a peahen to mate with it;
   the ability of a peahen to rear many young simultaneously.

The fitness of an organism is largely determined by its DNA – both the coding regions, or genes, and the non-coding regions (which play an important role in regulating the transcription of genes). We'll think of fitness as a function of the DNA sequence and the environment.

How does the DNA determine fitness, and how does information get from natural selection into the genome?
Well, if the gene that codes for one of an antelope's proteins is defective, that antelope might get eaten by a lion early in life and have only two grandchildren rather than forty. The information content of natural selection is fully contained in a specification of which offspring survived to have children – an information content of at most one bit per offspring. The teaching signal does not communicate to the ecosystem any description of the imperfections in the organism that caused it to have fewer children. The bits of the teaching signal are highly redundant, because, throughout a species, unfit individuals who are similar to each other will be failing to have offspring for similar reasons.

So, how many bits are acquired by the species as a whole per generation by natural selection? How many bits has natural selection succeeded in conveying to the human branch of the tree of life, since the divergence between

Australopithecines and apes 4 000 000 years ago? Assuming a generation time of 10 years for reproduction, there have been about 400 000 generations of human precursors since the divergence from apes. Assuming a population of 10^9 individuals, each receiving a couple of bits of information from natural selection, the total number of bits of information responsible for modifying the genomes of 4 million B.C. into today's human genome is about 8 × 10^14. However, as we noted, natural selection is not smart at collating the information that it dishes out to the population, and there is a great deal of redundancy in that information. If the population size were twice as great, would it evolve twice as fast? No, because natural selection will simply be correcting the same defects twice as often.

John Maynard Smith has suggested that the rate of information acquisition by a species is independent of the population size, and is of order 1 bit per generation. This figure would allow for only 400 000 bits of difference between apes and humans, a number that is much smaller than the total size of the human genome – 6 × 10^9 bits. [One human genome contains about 3 × 10^9 nucleotides.] It is certainly the case that the genomic overlap between apes and humans is huge, but is the difference that small?

In this chapter, we'll develop a crude model of the process of information acquisition through evolution, based on the assumption that a gene with two defects is typically likely to be more defective than a gene with one defect, and an organism with two defective genes is likely to be less fit than an organism with one defective gene.
Undeniably, this is a crude model, since real biological systems are baroque constructions with complex interactions. Nevertheless, we persist with this simple model because it readily yields striking results.

What we find from this simple model is that

1. John Maynard Smith's figure of 1 bit per generation is correct for an asexually-reproducing population;

2. in contrast, if the species reproduces sexually, the rate of information acquisition can be as large as √G bits per generation, where G is the size of the genome.

We'll also find interesting results concerning the maximum mutation rate that a species can withstand.

19.1 The model

We study a simple model of a reproducing population of N individuals with a genome of size G bits: variation is produced by mutation or by recombination (i.e., sex) and truncation selection selects the N fittest children at each generation to be the parents of the next. We find striking differences between populations that have recombination and populations that do not.

The genotype of each individual is a vector x of G bits, each having a good state x_g = 1 and a bad state x_g = 0. The fitness F(x) of an individual is simply the sum of her bits:

   F(x) = Σ_{g=1}^{G} x_g.                                          (19.1)

The bits in the genome could be considered to correspond either to genes that have good alleles (x_g = 1) and bad alleles (x_g = 0), or to the nucleotides of a genome. We will concentrate on the latter interpretation. The essential property of fitness that we are assuming is that it is locally a roughly linear function of the genome, that is, that there are many possible changes one

could make to the genome, each of which has a small effect on fitness, and that these effects combine approximately linearly.

We define the normalized fitness f(x) ≡ F(x)/G.

We consider evolution by natural selection under two models of variation.

Variation by mutation. The model assumes discrete generations. At each generation, t, every individual produces two children. The children's genotypes differ from the parent's by random mutations. Natural selection selects the fittest N progeny in the child population to reproduce, and a new generation starts. [The selection of the fittest N individuals at each generation is known as truncation selection.] The simplest model of mutations is that the child's bits {x_g} are independent. Each bit has a small probability of being flipped, which, thinking of the bits as corresponding roughly to nucleotides, is taken to be a constant m, independent of x_g. [If alternatively we thought of the bits as corresponding to genes, then we would model the probability of the discovery of a good gene, P(x_g = 0 → x_g = 1), as being a smaller number than the probability of a deleterious mutation in a good gene, P(x_g = 1 → x_g = 0).]

Variation by recombination (or crossover, or sex). Our organisms are haploid, not diploid. They enjoy sex by recombination. The N individuals in the population are married into M = N/2 couples, at random, and each couple has C children – with C = 4 children being our standard assumption, so as to have the population double and halve every generation, as before. The children's genotypes are independent given the parents'.
Each child obtains its genotype z by random crossover of its parents' genotypes, x and y. The simplest model of recombination has no linkage, so that

   z_g = { x_g with probability 1/2,
         { y_g with probability 1/2.                                (19.2)

Once the MC progeny have been born, the parents pass away, the fittest N progeny are selected by natural selection, and a new generation starts.

We now study these two models of variation in detail.

19.2 Rate of increase of fitness

Theory of mutations

We assume that the genotype of an individual with normalized fitness f = F/G is subjected to mutations that flip bits with probability m. We first show that if the normalized fitness f of the population is greater than 1/2, then the optimal mutation rate is small, and the rate of acquisition of information is at most of order one bit per generation.

Since it is easy to achieve a normalized fitness of f = 1/2 by simple mutation, we'll assume f > 1/2 and work in terms of the excess normalized fitness δf ≡ f − 1/2. If an individual with excess normalized fitness δf has a child and the mutation rate m is small, the probability distribution of the excess normalized fitness of the child has mean

   δf_child = (1 − 2m) δf                                           (19.3)

and variance

   m(1 − m)/G ≈ m/G.                                                (19.4)

If the population of parents has mean δf(t) and variance σ²(t) ≡ β m/G, then the child population, before selection, will have mean (1 − 2m) δf(t) and variance (1 + β) m/G. Natural selection chooses the upper half of this distribution, so the mean fitness and variance of fitness at the next generation are given by

   δf(t+1) = (1 − 2m) δf(t) + α √(1 + β) √(m/G),                    (19.5)

   σ²(t+1) = γ (1 + β) m/G,                                         (19.6)

where α is the mean deviation from the mean, measured in standard deviations, and γ is the factor by which the child distribution's variance is reduced by selection. The numbers α and γ are of order 1. For the case of a Gaussian distribution, α = √(2/π) ≈ 0.8 and γ = (1 − 2/π) ≈ 0.36. If we assume that the variance is in dynamic equilibrium, i.e., σ²(t+1) ≈ σ²(t), then

   γ (1 + β) = β,   so   (1 + β) = 1/(1 − γ),                       (19.7)

and the factor α √(1 + β) in equation (19.5) is equal to 1, if we take the results for the Gaussian distribution – an approximation that becomes poorest when the discreteness of fitness becomes important, i.e., for small m. The rate of increase of normalized fitness is thus

   df/dt ≈ −2m δf + √(m/G),                                         (19.8)

which, assuming G (δf)² ≫ 1, is maximized for

   m_opt = 1/(16 G (δf)²),                                          (19.9)

at which point,

   (df/dt)_opt = 1/(8 G δf).                                        (19.10)

So the rate of increase of fitness F = fG is at most

   dF/dt = 1/(8 δf)   per generation.                               (19.11)

For a population with low fitness (δf < 0.125), the rate of increase of fitness may exceed 1 unit per generation. Indeed, if δf ≲ 1/√G, the rate of increase, if m = 1/2, is of order √G; this initial spurt can last only of order √G generations.
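Equations (19.8)–(19.11) can be checked numerically. A small sketch with a hypothetical genome size G = 1000 and excess fitness δf = 0.25:

```python
import math

def rate_of_increase(m, df, G):
    """df/dt from equation (19.8): -2*m*df + sqrt(m/G)."""
    return -2 * m * df + math.sqrt(m / G)

def m_opt(df, G):
    """Optimal mutation rate, equation (19.9): 1/(16*G*df^2)."""
    return 1 / (16 * G * df**2)

G, df = 1000, 0.25            # hypothetical values for illustration
m = m_opt(df, G)              # 1/(16*1000*0.0625) = 0.001
best = rate_of_increase(m, df, G)

# Equation (19.10): the maximal rate is 1/(8*G*df)
assert abs(best - 1 / (8 * G * df)) < 1e-12
# So dF/dt = G * df/dt = 1/(8*df) = 0.5 fitness units per generation here,
# consistent with the bound of equation (19.11).
print(m, best * G)
```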
For δf > 0.125, the rate of increase of fitness is smaller than one unit per generation. As the fitness approaches G, the optimal mutation rate tends to m = 1/(4G), so that an average of 1/4 of a bit is flipped per genotype, and the rate of increase of fitness is also equal to 1/4; information is gained at a rate of about 0.5 bits per generation. It takes about 2G generations for the genotypes of all individuals in the population to attain perfection.

For fixed m, the fitness is given by

   δf(t) = (1/2) √(1/(mG)) (1 − c e^{−2mt}),                        (19.12)

subject to the constraint δf(t) ≤ 1/2, where c is a constant of integration, equal to 1 if f(0) = 1/2. If the mean number of bits flipped per genotype, mG, exceeds 1, then the fitness F approaches an equilibrium value F_eqm = (1/2 + 1/(2√(mG))) G.

This theory is somewhat inaccurate in that the true probability distribution of fitness is non-Gaussian, asymmetrical, and quantized to integer values. All the same, the predictions of the theory are not grossly at variance with the results of simulations described below.

Figure 19.1. Why sex is better than sex-free reproduction. If mutations are used to create variation among children, then it is unavoidable that the average fitness of the children is lower than the parents' fitness; the greater the variation, the greater the average deficit. Selection bumps up the mean fitness again. In contrast, recombination produces variation without a decrease in average fitness. The typical amount of variation scales as √G, where G is the genome size, so after selection, the average fitness rises by O(√G). [The figure shows, with and without sex: the histogram of the parents' fitness, the histogram of the children's fitness, and the selected children's fitness.]

The theory of sex

The analysis of the sexual population becomes tractable with two approximations: first, we assume that the gene-pool mixes sufficiently rapidly that correlations between genes can be neglected; second, we assume homogeneity, i.e., that the fraction f_g of bits g that are in the good state is the same, f(t), for all g.
Given these assumptions, if two parents of fitness F = fG mate, the probability distribution of their children's fitness has mean equal to the parents' fitness, F; the variation produced by sex does not reduce the average fitness. The standard deviation of the fitness of the children scales as √(G f(1 − f)). Since, after selection, the increase in fitness is proportional to this standard deviation, the fitness increase per generation scales as the square root of the size of the genome, √G. As shown in box 19.2, the mean fitness F̄ = fG evolves in accordance with the differential equation

   dF̄/dt ≈ η √( f(t)(1 − f(t)) G ),                                (19.13)

where η ≡ √(2/(π + 2)). The solution of this equation is

   f(t) = (1/2) [ 1 + sin( (η/√G)(t + c) ) ],   for (t + c) ∈ ( −(π/2)√G/η , (π/2)√G/η ),   (19.14)

where c is a constant of integration, c = (√G/η) sin^{−1}(2f(0) − 1). So this idealized system reaches a state of eugenic perfection (f = 1) within a finite time: (π/(2η)) √G generations.

Simulations

Figure 19.3a shows the fitness of a sexual population of N = 1000 individuals with a genome size of G = 1000, starting from a random initial state with normalized fitness 0.5. It also shows the theoretical curve f(t)G from equation (19.14), which fits remarkably well. In contrast, figures 19.3(b) and (c) show the evolving fitness when variation is produced by mutation at rates m = 0.25/G and m = 6/G respectively. Note the difference in the horizontal scales from panel (a).
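A minimal simulation of the recombination model of section 19.1 (random marriage, C = 4 children per couple, crossover as in equation (19.2), truncation selection) can be sketched at a smaller scale than the text's N = 1000, G = 1000 so that it runs quickly; with this small N, genetic drift can leave the final fitness slightly short of the perfection predicted by equation (19.14):

```python
import random

random.seed(1)
N, G = 100, 100   # small population and genome, for speed only

def child(x, y):
    """Equation (19.2): each bit taken from either parent with prob. 1/2."""
    return [xg if random.random() < 0.5 else yg for xg, yg in zip(x, y)]

# Random initial population: normalized fitness f ~ 0.5
pop = [[random.randint(0, 1) for _ in range(G)] for _ in range(N)]

# Theory (19.14) predicts perfection after ~ (pi/2)*sqrt(G)/eta ~ 25
# generations here; we run 60 to let the finite population settle.
for t in range(60):
    random.shuffle(pop)                 # marry into M = N/2 random couples
    kids = []
    for i in range(0, N, 2):
        x, y = pop[i], pop[i + 1]
        kids.extend(child(x, y) for _ in range(4))   # C = 4 children
    kids.sort(key=sum, reverse=True)    # truncation selection:
    pop = kids[:N]                      # keep the N fittest children

mean_f = sum(map(sum, pop)) / (N * G)
print(mean_f)   # close to 1, well above the initial 0.5
```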

Box 19.2. Details of the theory of sex.

How does f(t+1) depend on f(t)? Let's first assume the two parents of a child both have exactly f(t)G good bits, and, by our homogeneity assumption, that those bits are independent random subsets of the G bits. The number of bits that are good in both parents is roughly f(t)²G, and the number that are good in one parent only is roughly 2f(t)(1 − f(t))G, so the fitness of the child will be f(t)²G plus the sum of 2f(t)(1 − f(t))G fair coin flips, which has a binomial distribution of mean f(t)(1 − f(t))G and variance (1/2) f(t)(1 − f(t))G. The fitness of a child is thus roughly distributed as

   F_child ~ Normal( mean = f(t)G, variance = (1/2) f(t)(1 − f(t))G ).

The important property of this distribution, contrasted with the distribution under mutation, is that the mean fitness is equal to the parents' fitness; the variation produced by sex does not reduce the average fitness.

If we include the parental population's variance, which we will write as σ²(t) = β(t) (1/2) f(t)(1 − f(t))G, the children's fitnesses are distributed as

   F_child ~ Normal( mean = f(t)G, variance = (1 + β/2) (1/2) f(t)(1 − f(t))G ).

Natural selection selects the children on the upper side of this distribution. The mean increase in fitness will be

   F̄(t+1) − F̄(t) = [ α √(1 + β/2) / √2 ] √( f(t)(1 − f(t))G ),

and the variance of the surviving children will be

   σ²(t+1) = γ (1 + β/2) (1/2) f(t)(1 − f(t))G,

where α = √(2/π) and γ = (1 − 2/π).
If there is dynamic equilibrium [σ²(t+1) = σ²(t)] then (1 + β/2) = 2/(2 − γ), and the factor in the mean-increase equation is

   α √(1 + β/2) / √2 = √(2/(π + 2)) ≈ 0.62.

Defining this constant to be η ≡ √(2/(π + 2)), we conclude that, under sex and natural selection, the mean fitness of the population increases at a rate proportional to the square root of the size of the genome,

   dF̄/dt ≈ η √( f(t)(1 − f(t)) G )   bits per generation.
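The Gaussian selection constants used above, α = √(2/π) ≈ 0.8 and γ = 1 − 2/π ≈ 0.36, and the resulting η ≈ 0.62 can be checked directly by Monte Carlo; a sketch:

```python
import math
import random

random.seed(2)
# "Truncation selection": keep the upper half of a standard Gaussian sample
sample = sorted(random.gauss(0, 1) for _ in range(200000))
upper = sample[len(sample) // 2:]

mean_upper = sum(upper) / len(upper)
var_upper = sum((z - mean_upper) ** 2 for z in upper) / len(upper)

alpha = math.sqrt(2 / math.pi)       # theoretical mean of the kept half, ~0.80
gamma = 1 - 2 / math.pi              # theoretical variance after selection, ~0.36
eta = math.sqrt(2 / (math.pi + 2))   # ~0.62, the constant of equation (19.13)

print(mean_upper, var_upper)         # ~0.80, ~0.36

# Check eta = alpha*sqrt((1 + beta/2)/2), with the equilibrium beta
# solving gamma*(1 + beta/2) = beta, i.e. beta = gamma/(1 - gamma/2):
beta = gamma / (1 - gamma / 2)
assert abs(alpha * math.sqrt((1 + beta / 2) / 2) - eta) < 1e-12
```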

Figure 19.3. Fitness as a function of time. The genome size is G = 1000. The dots show the fitness of six randomly selected individuals from the birth population at each generation. The initial population of N = 1000 had randomly generated genomes with f(0) = 0.5 (exactly). (a) Variation produced by sex alone. Line shows theoretical curve (19.14) for an infinite homogeneous population. (b, c) Variation produced by mutation, with and without sex, when bits are flipped at rate mG = 0.25 (b) or mG = 6 (c) per genome. The dashed line shows the curve (19.12).

Figure 19.4. Maximal tolerable mutation rate, shown as number of errors per genome (mG), versus normalized fitness f = F/G. Left panel: genome size G = 1000; right: G = 100 000. Independent of genome size, a parthenogenetic species (no sex) can tolerate only of order 1 error per genome per generation; a species that uses recombination (sex) can tolerate far greater mutation rates.

Exercise 19.1.[3, p.280] Dependence on population size. How do the results for a sexual population depend on the population size? We anticipate that there is a minimum population size above which the theory of sex is accurate. How is that minimum population size related to G?

Exercise 19.2.[3] Dependence on crossover mechanism.
In the simple model of sex, each bit is taken at random from one of the two parents, that is, we allow crossovers to occur with probability 50% between any two adjacent nucleotides. How is the model affected (a) if the crossover probability is smaller? (b) if crossovers occur exclusively at hot-spots located every d bits along the genome?

19.3 The maximal tolerable mutation rate

What if we combine the two models of variation? What is the maximum mutation rate that can be tolerated by a species that has sex?

The rate of increase of fitness is given by

   df/dt ≈ −2m δf + η √( (m + f(1 − f)/2) / G ),                    (19.15)

which is positive if the mutation rate satisfies

   m < √( f(1 − f)/G ).                                             (19.16)

Let us compare this rate with the result in the absence of sex, which, from equation (19.8), is that the maximum tolerable mutation rate is

   m < (1/G) (1/(2 δf))².                                           (19.17)

The tolerable mutation rate with sex is of order √G times greater than that without sex!

A parthenogenetic (non-sexual) species could try to wriggle out of this bound on its mutation rate by increasing its litter sizes. But if mutation flips on average mG bits, the probability that no bits are flipped in one genome is roughly e^{−mG}, so a mother needs to have roughly e^{mG} offspring in order to have a good chance of having one child with the same fitness as her. The litter size of a non-sexual species thus has to be exponential in mG (if mG is bigger than 1), if the species is to persist.

So the maximum tolerable mutation rate is pinned close to 1/G, for a non-sexual species, whereas it is a larger number, of order 1/√G, for a species with recombination.

Turning these results around, we can predict the largest possible genome size for a given fixed mutation rate, m. For a parthenogenetic species, the largest genome size is of order 1/m, and for a sexual species, 1/m². Taking the figure m = 10^{−8} as the mutation rate per nucleotide per generation (Eyre-Walker and Keightley, 1999), and allowing for a maximum brood size of 20 000 (that is, mG ≈ 10), we predict that all species with more than G = 10^9 coding nucleotides make at least occasional use of recombination.
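The two bounds and the brood-size argument are easy to evaluate numerically; a sketch with hypothetical values of f, δf, and G:

```python
import math

def max_m_sex(f, G):
    """Equation (19.16): maximal tolerable mutation rate with recombination."""
    return math.sqrt(f * (1 - f) / G)

def max_m_nosex(df, G):
    """Equation (19.17): maximal rate without sex, (1/G)*(1/(2*df))^2."""
    return 1 / (4 * G * df**2)

G = 10**8                      # hypothetical genome size
print(max_m_sex(0.9, G))       # ~3e-5: of order 1/sqrt(G)
print(max_m_nosex(0.4, G))     # ~1.6e-8: of order 1/G

# Brood-size argument: a parthenogenetic mother needs ~ e^(mG) offspring
# to expect one mutation-free child.
brood = 20000
mG = math.log(brood)           # ~10 errors per genome tolerated at most
m = 1e-8                       # per-nucleotide rate (Eyre-Walker & Keightley, 1999)
G_max = mG / m                 # ~1e9 nucleotides for a non-sexual species
print(G_max)
```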
ber falls to G then : 5 10 this num 12, = 2 19.4 Fitness increase and info rmation acquisition simple mo del it is possible to relate increasing tness to information For this acquisition. If the are set at random, the tness is roughly F = G= 2. If evolution bits F to a population have the maxim um tness h all individuals = G , leads in whic G bits of information have been acquired by the then namely for eac h species, bit better. , the species has gured x whic h of the two states is the out g We de ne the information acquired at an intermediate tness to be the amoun (measured required to select the perfect state t of selection in bits) population gene the a fraction f from of the pool. have x = 1. Because Let g g to nd (1 =f ) is the information required log a blac k ball in an urn con taining 2 information white balls in the ratio f : 1 f , we de ne the k and acquired blac to be X f g (19.18) : bits I = log 2 = 2 1 g If all fractions f the are equal to F=G , then g 2 F log = G ; (19.19) I 2 G whic h is well appro ximated by ~ 2( F G= 2) : (19.20) I The rate of information acquisition is thus roughly two times the rate of in- crease of tness in the population.
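The definitions (19.18)–(19.20) are easy to check numerically. The following sketch (variable names and the values G = 1000, F = 700 are my own illustrative choices) evaluates the exact information (19.18), its closed form (19.19) for equal fractions, and the linear approximation (19.20):

```python
import math

def information_acquired(fractions):
    # Equation (19.18): I = sum_g log2( f_g / (1/2) ) bits.
    return sum(math.log2(fg / 0.5) for fg in fractions)

G, F = 1000, 700                      # fitness F means f_g = F/G for every bit
fractions = [F / G] * G

I_exact = information_acquired(fractions)
I_closed = G * math.log2(2 * F / G)   # (19.19): identical when all f_g are equal
I_approx = 2 * (F - G / 2)            # (19.20): rough linearization about F = G/2

print(round(I_exact, 1), round(I_closed, 1), I_approx)
```

At f = 0.7 the exact and closed forms agree exactly, while the linear approximation undershoots by roughly 20%, consistent with the text's "roughly two times" claim; the approximation is tightest for F close to G/2.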

19.5 Discussion

These results quantify the well known argument for why species reproduce by sex with recombination, namely that recombination allows useful mutations to spread more rapidly through the species and allows deleterious mutations to be more rapidly cleared from the population (Maynard Smith, 1978; Maynard Smith, 1988; Maynard Smith and Szathmáry, 1995; Felsenstein, 1985). A population that reproduces by recombination can acquire information from natural selection at a rate of order √G times faster than a parthenogenetic population, and it can tolerate a mutation rate that is of order √G times greater. For genomes of size G ≃ 10⁸ coding nucleotides, this factor of √G is substantial.

This enormous advantage conferred by sex has been noted before by Kondrashov (1988), but this meme, which Kondrashov calls 'the deterministic mutation hypothesis', does not seem to have diffused throughout the evolutionary research community, as there are still numerous papers in which the prevalence of sex is viewed as a mystery to be explained by elaborate mechanisms.

'The cost of males' – stability of a gene for sex or parthenogenesis

Why do people declare sex to be a mystery? The main motivation for being mystified is an idea called the 'cost of males'. Sexual reproduction is disadvantageous compared with asexual reproduction, it's argued, because of every two offspring produced by sex, one (on average) is a useless male, incapable of child-bearing, and only one is a productive female. In the same time, a parthenogenetic mother could give birth to two female clones.
To put it another way, the big advantage of parthenogenesis, from the point of view of the individual, is that one is able to pass on 100% of one's genome to one's children, instead of only 50%. Thus if there were two versions of a species, one reproducing with and one without sex, the single mothers would be expected to outstrip their sexual cousins. The simple model presented thus far did not include either genders or the ability to convert from sexual reproduction to asexual, but we can easily modify the model.

We modify the model so that one of the G bits in the genome determines whether an individual prefers to reproduce parthenogenetically (x = 1) or sexually (x = 0). The results depend on the number of children had by a single parthenogenetic mother, K_p, and the number of children born by a sexual couple, K_s. Both (K_p = 2, K_s = 4) and (K_p = 4, K_s = 4) are reasonable models. The former (K_p = 2, K_s = 4) would seem most appropriate in the case of unicellular organisms, where the cytoplasm of both parents goes into the children. The latter (K_p = 4, K_s = 4) is appropriate if the children are solely nurtured by one of the parents, so single mothers have just as many offspring as a sexual pair. I concentrate on the latter model, since it gives the greatest advantage to the parthenogens, who are supposedly expected to outbreed the sexual community. Because parthenogens have four children per generation, the maximum tolerable mutation rate for them is twice the expression (19.17) derived before for K_p = 2. If the fitness is large, the maximum tolerable rate is mG ≃ 2.

Initially the genomes are set randomly with F = G/2, with half of the population having the gene for parthenogenesis. Figure 19.5 shows the outcome.
During the 'learning' phase of evolution, in which the fitness is increasing rapidly, pockets of parthenogens appear briefly, but then disappear within a couple of generations as their sexual cousins overtake them in fitness and

Figure 19.5. Results when there is a gene for parthenogenesis, and no interbreeding, and single mothers produce as many children as sexual couples. G = 1000, N = 1000. (a) mG = 4; (b) mG = 1. Vertical axes show the fitnesses of the two sub-populations, and the percentage of the population that is parthenogenetic.

leave them behind. Once the population reaches its top fitness, however, the parthenogens can take over, if the mutation rate is sufficiently low (mG = 1). In the presence of a higher mutation rate (mG = 4), however, the parthenogens never take over. The breadth of the sexual population's fitness is of order √G, so a mutant parthenogenetic colony arising with slightly above-average fitness will last for about √G/(mG) = 1/(m√G) generations before its fitness falls below that of its sexual cousins. As long as the population size is sufficiently large for some sexual individuals to survive for this time, sex will not die out.

In a sufficiently unstable environment, where the fitness function is continually changing, the parthenogens will always lag behind the sexual community. These results are consistent with the argument of Haldane and Hamilton (2002) that sex is helpful in an arms race with parasites. The parasites define an effective fitness function which changes with time, and a sexual population will always ascend the current fitness function more rapidly.
Additive fitness function

Of course, our results depend on the fitness function that we assume, and on our model of selection. Is it reasonable to model fitness, to first order, as a sum of independent terms? Maynard Smith (1968) argues that it is: the more good genes you have, the higher you come in the pecking order, for example. The directional selection model has been used extensively in theoretical population genetics studies (Bulmer, 1985). We might expect real fitness functions to involve interactions, in which case crossover might reduce the average fitness. However, since recombination gives the biggest advantage to species whose fitness functions are additive, we might predict that evolution will have favoured species that used a representation of the genome that corresponds to a fitness function that has only weak interactions. And even if there are interactions, it seems plausible that the fitness would still involve a sum of such interacting terms, with the number of terms being some fraction of the genome size G.

Exercise 19.3.[3C] Investigate how fast sexual and asexual species evolve if they have a fitness function with interactions. For example, let the fitness be a sum of exclusive-ors of pairs of bits; compare the evolving fitnesses with those of sexual and asexual species with a simple additive fitness function.

Furthermore, if the fitness function were a highly nonlinear function of the genotype, it could be made more smooth and locally linear by the Baldwin effect. The Baldwin effect (Baldwin, 1896; Hinton and Nowlan, 1987) has been widely studied as a mechanism whereby learning guides evolution, and it could act also at the level of transcription and translation. Consider the evolution of a peptide sequence for a new purpose. Assume the effectiveness of the peptide is a highly nonlinear function of the sequence, perhaps having a small island of good sequences surrounded by an ocean of equally bad sequences. In an organism whose transcription and translation machinery is flawless, the fitness will be an equally nonlinear function of the DNA sequence, and evolution will wander around the ocean making progress towards the island only by a random walk. In contrast, an organism having the same DNA sequence, but whose DNA-to-RNA transcription or RNA-to-protein translation is 'faulty', will occasionally, by mistranslation or mistranscription, accidentally produce a working enzyme; and it will do so with greater probability if its DNA sequence is close to a good sequence. One cell might produce 1000 proteins from the one mRNA sequence, of which 999 have no enzymatic effect, and one does.
The one working catalyst will be enough for that cell to have an increased fitness relative to rivals whose DNA sequence is further from the island of good sequences. For this reason I conjecture that, at least early in evolution, and perhaps still now, the genetic code was not implemented perfectly but was implemented noisily, with some codons coding for a distribution of possible amino acids. This noisy code could even be switched on and off from cell to cell in an organism by having multiple aminoacyl-tRNA synthetases, some more reliable than others.

Whilst our model assumed that the bits of the genome do not interact, ignored the fact that the information is represented redundantly, assumed that there is a direct relationship between phenotypic fitness and the genotype, and assumed that the crossover probability in recombination is high, I believe these qualitative results would still hold if more complex models of fitness and crossover were used: the relative benefit of sex will still scale as √G. Only in small, in-bred populations are the benefits of sex expected to be diminished.

In summary: Why have sex? Because sex is good for your bits!

Further reading

How did a high-information-content self-replicating system ever emerge in the first place? In the general area of the origins of life and other tricky questions about evolution, I highly recommend Maynard Smith and Szathmáry (1995), Maynard Smith and Szathmáry (1999), Kondrashov (1988), Maynard Smith (1988), Ridley (2000), Dyson (1985), Cairns-Smith (1985), and Hopfield (1978).

19.6 Further exercises

Exercise 19.4.[3] How good must the error-correcting machinery in DNA replication be, given that mammals have not all died out long ago? Estimate the probability of nucleotide substitution, per cell division. [See Appendix C.4.]

Exercise 19.5.[4] Given that DNA replication is achieved by bumbling Brownian motion and ordinary thermodynamics in a biochemical porridge at a temperature of 35°C, it's astonishing that the error-rate of DNA replication is about 10⁻⁹ per replicated nucleotide. How can this reliability be achieved, given that the energetic difference between a correct base-pairing and an incorrect one is only one or two hydrogen bonds, and the thermal energy kT is only about a factor of four smaller than the free energy associated with a hydrogen bond? If ordinary thermodynamics is what favours correct base-pairing, surely the frequency of incorrect base-pairing should be about

  f = exp( −ΔE/kT ),   (19.21)

where ΔE is the free energy difference, i.e., an error frequency of f ≃ 10⁻⁴? How has DNA replication cheated thermodynamics?

The situation is equally perplexing in the case of protein synthesis, which translates an mRNA sequence into a polypeptide in accordance with the genetic code. Two specific chemical reactions are protected against errors: the binding of tRNA molecules to amino acids, and the production of the polypeptide in the ribosome, which, like DNA replication, involves base-pairing. Again, the fidelity is high (an error rate of about 10⁻⁴), and this fidelity can't be caused by the energy of the 'correct' final state being especially low – the correct polypeptide sequence is not expected to be significantly lower in energy than any other sequence. How do cells perform error correction? (See Hopfield (1974), Hopfield (1980).)

Exercise 19.6.[2] While the genome acquires information through natural selection at a rate of a few bits per generation, your brain acquires information at a greater rate. Estimate at what rate new information can be stored in long term memory by your brain. Think of learning the words of a new language, for example.

19.7 Solutions

Solution to exercise 19.1 (p.275). For small enough N, whilst the average fitness of the population increases, some unlucky bits become frozen into the bad state. (These bad genes are sometimes known as hitchhikers.) The homogeneity assumption breaks down. Eventually, all individuals have identical genotypes that are mainly 1-bits, but contain some 0-bits too. The smaller the population, the greater the number of frozen 0-bits is expected to be.

How small can the population size N be if the theory of sex is accurate? We find experimentally that the theory based on assuming homogeneity fits poorly only if the population size N is significantly smaller than √G. If N is significantly smaller than √G, information cannot possibly be acquired at a rate as big as √G, since the information content of the Blind Watchmaker's decisions cannot be any greater than 2N bits per generation, this being the number of bits required to specify which of the 2N children get to reproduce. Baum et al. (1995), analyzing a similar model, show that the population size N should be about √G (log G)² to make hitchhikers unlikely to arise.

Part IV

Probabilities and Inference

About Part IV

The number of inference problems that can (and perhaps should) be tackled by Bayesian inference methods is enormous. In this book, for example, we discuss the decoding problem for error-correcting codes, the task of inferring clusters from data, the task of interpolation through noisy data, and the task of classifying patterns given labelled examples. Most techniques for solving these problems can be categorized as follows.

Exact methods compute the required quantities directly. Only a few interesting problems have a direct solution, but exact methods are important as tools for solving subtasks within larger problems. Methods for the exact solution of inference problems are the subject of Chapters 21, 24, 25, and 26.

Approximate methods can be subdivided into

1. deterministic approximations, which include maximum likelihood (Chapter 22), Laplace's method (Chapters 27 and 28) and variational methods (Chapter 33); and

2. Monte Carlo methods – techniques in which random numbers play an integral part – which will be discussed in Chapters 29, 30, and 32.

This part of the book does not form a one-dimensional story. Rather, the ideas make up a web of interrelated threads which will recombine in subsequent chapters.

Chapter 3, which is an honorary member of this part, discussed a range of simple examples of inference problems and their Bayesian solutions.

To give further motivation for the toolbox of inference methods discussed in this part, Chapter 20 discusses the problem of clustering and the probabilistic interpretation of clustering as mixture modelling.
Chapter 21 discusses the option of dealing with probability distributions by completely enumerating all hypotheses. Chapter 22 introduces the idea of maximization methods as a way of avoiding the large cost associated with complete enumeration, and points out reasons why maximum likelihood is not good enough. Chapter 23 reviews the probability distributions that arise most often in Bayesian inference. Chapters 24, 25, and 26 discuss another way of avoiding the cost of complete enumeration: marginalization. Chapter 25 discusses message-passing methods appropriate for graphical models, using the decoding of error-correcting codes as an example. Chapter 26 combines these ideas with message-passing concepts from Chapters 16 and 17. These chapters are a prerequisite for the understanding of advanced error-correcting codes.

Chapter 27 discusses deterministic approximations including Laplace's method. This chapter is a prerequisite for understanding the topic of complexity control in learning algorithms, an idea that is discussed in general terms in Chapter 28.

Chapter 29 discusses Monte Carlo methods. Chapter 30 gives details of state-of-the-art Monte Carlo techniques.

Chapter 31 introduces the Ising model as a test-bed for probabilistic methods. An exact message-passing method and a Monte Carlo method are demonstrated. A motivation for studying the Ising model is that it is intimately related to several neural network models. Chapter 32 describes 'exact' Monte Carlo methods and demonstrates their application to the Ising model.

Chapter 33 discusses variational methods and their application to Ising models and to simple statistical inference problems including clustering. This chapter will help the reader understand the Hopfield network (Chapter 42) and the EM algorithm, which is an important method in latent-variable modelling. Chapter 34 discusses a particularly simple latent variable model called independent component analysis.

Chapter 35 discusses a ragbag of assorted inference topics. Chapter 36 discusses a simple example of decision theory. Chapter 37 discusses differences between sampling theory and Bayesian methods.

A theme: what inference is about

A widespread misconception is that the aim of inference is to find the most probable explanation for some data. While this most probable hypothesis may be of interest, and some inference methods do locate it, this hypothesis is just the peak of a probability distribution, and it is the whole distribution that is of interest. As we saw in Chapter 4, the most probable outcome from a source is often not a typical outcome from that source. Similarly, the most probable hypothesis given some data may be atypical of the whole set of reasonably-plausible hypotheses.
About Chapter 20

Before reading the next chapter, exercise 2.17 (p.36) and section 11.2 (inferring the input to a Gaussian channel) are recommended reading.

20

An Example Inference Task: Clustering

Human brains are good at finding regularities in data. One way of expressing regularity is to put a set of objects into groups that are similar to each other. For example, biologists have found that most objects in the natural world fall into one of two categories: things that are brown and run away, and things that are green and don't run away. The first group they call animals, and the second, plants. We'll call this operation of grouping things together clustering. If the biologist further sub-divides the cluster of plants into sub-clusters, we would call this 'hierarchical clustering'; but we won't be talking about hierarchical clustering yet. In this chapter we'll just discuss ways to take a set of N objects and group them into K clusters.

There are several motivations for clustering. First, a good clustering has predictive power. When an early biologist encounters a new green thing he has not seen before, his internal model of plants and animals fills in predictions for attributes of the green thing: it's unlikely to jump on him and eat him; if he touches it, he might get grazed or stung; if he eats it, he might feel sick. All of these predictions, while uncertain, are useful, because they help the biologist invest his resources (for example, the time spent watching for predators) well. Thus, we perform clustering because we believe the underlying cluster labels are meaningful, will lead to a more efficient description of our data, and will help us choose better actions.
This type of clustering is sometimes called 'mixture density modelling', and the objective function that measures how well the predictive model is working is the information content of the data, log 1/P({x}).

Second, clusters can be a useful aid to communication because they allow lossy compression. The biologist can give directions to a friend such as 'go to the third tree on the right then take a right turn' (rather than 'go past the large green thing with red berries, then past the large green thing with thorns, then ...'). The brief category name 'tree' is helpful because it is sufficient to identify an object. Similarly, in lossy image compression, the aim is to convey in as few bits as possible a reasonable reproduction of a picture; one way to do this is to divide the image into N small patches, and find a close match to each patch in an alphabet of K image-templates; then we send a close fit to the image by sending the list of labels k₁, k₂, ..., k_N of the matching templates. The task of creating a good library of image-templates is equivalent to finding a set of cluster centres. This type of clustering is sometimes called 'vector quantization'.

We can formalize a vector quantizer in terms of an assignment rule x → k(x) for assigning datapoints x to one of K codenames, and a reconstruction rule k → m^(k), the aim being to choose the functions k(x) and m^(k) so as to

minimize the expected distortion, which might be defined to be

  D = Σ_x P(x) ½ [ m^(k(x)) − x ]².   (20.1)

[The ideal objective function would be to minimize the psychologically perceived distortion of the image. Since it is hard to quantify the distortion perceived by a human, vector quantization and lossy compression are not so crisply defined problems as data modelling and lossless compression.] In vector quantization, we don't necessarily believe that the templates {m^(k)} have any natural meaning; they are simply tools to do a job. We note in passing the similarity of the assignment rule (i.e., the encoder) of vector quantization to the decoding problem when decoding an error-correcting code.

A third reason for making a cluster model is that failures of the cluster model may highlight interesting objects that deserve special attention. If we have trained a vector quantizer to do a good job of compressing satellite pictures of ocean surfaces, then maybe patches of image that are not well compressed by the vector quantizer are the patches that contain ships! If the biologist encounters a green thing and sees it run (or slither) away, this misfit with his cluster model (which says green things don't run away) cues him to pay special attention. One can't spend all one's time being fascinated by things; the cluster model can help sift out from the multitude of objects in one's world the ones that really deserve attention.

A fourth reason for liking clustering algorithms is that they may serve as models of learning processes in neural systems.
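The vector-quantization view above (an assignment rule k(x), a reconstruction rule k → m^(k), and the expected distortion (20.1)) can be sketched in a few lines. This is an illustrative fragment, with the expectation over P(x) replaced by an average over a small made-up set of points:

```python
def encode(x, templates):
    # Assignment rule x -> k(x): index of the nearest template,
    # using the half-squared-distance of (20.2).
    return min(range(len(templates)),
               key=lambda k: 0.5 * sum((xi - mi) ** 2
                                       for xi, mi in zip(x, templates[k])))

def distortion(points, templates):
    # Empirical version of (20.1): average of (1/2)|m^(k(x)) - x|^2.
    total = 0.0
    for x in points:
        m = templates[encode(x, templates)]
        total += 0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return total / len(points)

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
templates = [(0.05, 0.0), (5.0, 5.1)]
print(distortion(points, templates))
```

Choosing the templates to minimize this distortion is exactly the clustering problem that the K-means algorithm of the next section addresses.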
The K-means algorithm, which we now discuss, is an example of a competitive learning algorithm. The algorithm works by having the K clusters compete with each other for the right to own the data points.

Figure 20.1. N = 40 data points.

20.1 K-means clustering

The K-means algorithm is an algorithm for putting N data points in an I-dimensional space into K clusters. Each cluster is parameterized by a vector m^(k) called its mean.

[About the name... As far as I know, the 'K' in K-means clustering simply refers to the chosen number of clusters. If Newton had followed the same naming policy, maybe we would learn at school about 'calculus of the variable x'. It's a silly name, but we are stuck with it.]

The data points will be denoted by {x^(n)} where the superscript n runs from 1 to the number of data points N. Each vector x has I components x_i. We will assume that the space that x lives in is a real space and that we have a metric that defines distances between points, for example,

  d(x, y) = ½ Σ_i (x_i − y_i)².   (20.2)

To start the K-means algorithm (algorithm 20.2), the K means {m^(k)} are initialized in some way, for example to random values. K-means is then an iterative two-step algorithm. In the assignment step, each data point is assigned to the nearest mean. In the update step, the means are adjusted to match the sample means of the data points that they are responsible for.

The K-means algorithm is demonstrated for a toy two-dimensional data set in figure 20.3, where 2 means are used. The assignments of the points to the two clusters are indicated by two point styles, and the two means are shown by the circles. The algorithm converges after three iterations, at which point the assignments are unchanged so the means remain unmoved when updated. The K-means algorithm always converges to a fixed point.

Algorithm 20.2. The K-means clustering algorithm.

Initialization. Set K means {m^(k)} to random values.

Assignment step. Each data point n is assigned to the nearest mean. We denote our guess for the cluster k^(n) that the point x^(n) belongs to by k̂^(n):

  k̂^(n) = argmin_k { d(m^(k), x^(n)) }.   (20.3)

An alternative, equivalent representation of this assignment of points to clusters is given by 'responsibilities', which are indicator variables r_k^(n). In the assignment step, we set r_k^(n) to one if mean k is the closest mean to datapoint x^(n); otherwise r_k^(n) is zero:

  r_k^(n) = 1 if k̂^(n) = k;  0 if k̂^(n) ≠ k.   (20.4)

What about ties? – We don't expect two means to be exactly the same distance from a data point, but if a tie does happen, k̂^(n) is set to the smallest of the winning {k}.

Update step. The model parameters, the means, are adjusted to match the sample means of the data points that they are responsible for:

  m^(k) = ( Σ_n r_k^(n) x^(n) ) / R^(k),   (20.5)

where R^(k) is the total responsibility of mean k,

  R^(k) = Σ_n r_k^(n).   (20.6)

What about means with no responsibilities? – If R^(k) = 0, then we leave the mean m^(k) where it is.

Repeat the assignment step and the update step until the assignments do not change.
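Algorithm 20.2 translates directly into code. The sketch below uses plain Python with no external libraries; the synthetic two-blob data set and the choice to initialize the means at the first K data points (rather than at random values) are my own illustrative simplifications. It implements the assignment rule (20.3), the tie-breaking convention, and the update (20.5)-(20.6), including the rule that a mean with zero total responsibility stays where it is:

```python
import random

def d(x, y):
    # Distance (20.2): half the squared Euclidean distance.
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def k_means(points, K, n_iters=50):
    means = [list(p) for p in points[:K]]   # simple deterministic initialization
    assign = []
    for _ in range(n_iters):
        # Assignment step (20.3)-(20.4); min() breaks ties toward the smallest k.
        assign = [min(range(K), key=lambda k: d(x, means[k])) for x in points]
        # Update step (20.5)-(20.6); a mean with R^(k) = 0 is left where it is.
        new_means = []
        for k in range(K):
            members = [x for x, a in zip(points, assign) if a == k]
            if members:
                new_means.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_means.append(means[k])
        if new_means == means:              # fixed point: assignments are stable
            break
        means = new_means
    return means, assign

random.seed(0)
data = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(20)] + \
       [(random.gauss(4, 0.3), random.gauss(4, 0.3)) for _ in range(20)]
means, assign = k_means(data, K=2)
print(sorted(round(m[0]) for m in means))   # the two means settle near 0 and 4
```

On this well-separated data the algorithm reaches its fixed point in a handful of iterations, with one mean per blob.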

Figure 20.3. K-means algorithm applied to a data set of 40 points. K = 2 means evolve to stable locations after three iterations. (Panels: the data, then alternating assignment and update steps.)

Figure 20.4. K-means algorithm applied to a data set of 40 points. Two separate runs, both with K = 4 means, reach different solutions. Each frame shows a successive assignment step.

Exercise 20.1.[4, p.291] See if you can prove that K-means always converges. [Hint: find a physical analogy and an associated Lyapunov function.] [A Lyapunov function is a function of the state of the algorithm that decreases whenever the state changes and that is bounded below. If a system has a Lyapunov function then its dynamics converge.]

The K-means algorithm with a larger number of means, 4, is demonstrated in figure 20.4. The outcome of the algorithm depends on the initial condition. In the first case, after five iterations, a steady state is found in which the data points are fairly evenly split between the four clusters. In the second case, after six iterations, half the data points are in one cluster, and the others are shared among the other three clusters.

Questions about this algorithm

The K-means algorithm has several ad hoc features. Why does the update step set the 'mean' to the mean of the assigned points? Where did the distance d come from? What if we used a different measure of distance between x and m? How can we choose the 'best' distance? [In vector quantization, the distance

Figure 20.5. K-means algorithm for a case with two dissimilar clusters. (a) The 'little 'n' large' data. (b) A stable set of assignments and means. Note that four points belonging to the broad cluster have been incorrectly assigned to the narrower cluster. (Points assigned to the right-hand cluster are shown by plus signs.)

Figure 20.6. Two elongated clusters, and the stable solution found by the K-means algorithm.

function is provided as part of the problem definition; but I'm assuming we are interested in data-modelling rather than vector quantization.] How do we choose K? Having found multiple alternative clusterings for a given K, how can we choose among them?

Cases where K-means might be viewed as failing

Further questions arise when we look for cases where the algorithm behaves badly (compared with what the man in the street would call 'clustering'). Figure 20.5a shows a set of 75 data points generated from a mixture of two Gaussians. The right-hand Gaussian has less weight (only one fifth of the data points), and it is a less broad cluster. Figure 20.5b shows the outcome of using K-means clustering with K = 2 means. Four of the big cluster's data points have been assigned to the small cluster, and both means end up displaced to the left of the true centres of the clusters. The K-means algorithm takes account only of the distance between the means and the data points; it has no representation of the weight or breadth of each cluster. Consequently, data points that actually belong to the broad cluster are incorrectly assigned to the narrow cluster.
narro 20.6 sho ws another case of K-means beha ving badly . The data Figure the tly into two elongated clusters. But fall only stable state of the eviden K-means algorithm is that sho wn in gure 20.6b: the two clusters have been sliced in half two examples sho w that there is something wrong with ! These distance has in the K-means algorithm. The K-means algorithm the no way d ting the size or shap e of a cluster. of represen criticism A nal is that it is a `hard' rather than a `soft' of K-means algorithm: points are assigned to exactly one cluster and all points assigned to a cluster are in that cluster. Points located near the border between equals clusters should, arguably , play a partial role in determining the two or more locations of all the clusters that they could plausibly be assigned to. But in the K-means algorithm, eac h borderline point is dump ed in one cluster, and

20.2 Soft K-means clustering

These criticisms of K-means motivate the 'soft K-means algorithm', algorithm 20.7. The algorithm has one parameter, β, which we could term the stiffness.

Algorithm 20.7. Soft K-means algorithm, version 1.

1. Assignment step. Each data point x^(n) is given a soft 'degree of assignment' to each of the means. We call the degree to which x^(n) is assigned to cluster k the responsibility r_k^(n) (the responsibility of cluster k for point n):

    r_k^(n) = exp(−β d(m^(k), x^(n))) / Σ_{k′} exp(−β d(m^(k′), x^(n))).        (20.7)

The sum of the K responsibilities for the nth point is 1.

2. Update step. The model parameters, the means, are adjusted to match the sample means of the data points that they are responsible for:

    m^(k) = Σ_n r_k^(n) x^(n) / R^(k),        (20.8)

where R^(k) is the total responsibility of mean k,

    R^(k) = Σ_n r_k^(n).        (20.9)

Notice the similarity of this soft K-means algorithm to the hard K-means algorithm 20.2. The update step is identical; the only difference is that the responsibilities r_k^(n) can take on values between 0 and 1. Whereas the assignment k̂^(n) in the K-means algorithm involved a 'min' over the distances, the rule for assigning the responsibilities is a 'soft-min' (20.7).

Exercise 20.2. [2] Show that as the stiffness β goes to ∞, the soft K-means algorithm becomes identical to the original hard K-means algorithm, except for the way in which means with no assigned points behave. Describe what those means do instead of sitting still.
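Equations (20.7)-(20.9) translate directly into code. The sketch below is one-dimensional and takes d(m, x) = (m − x)²; both choices are assumptions made for illustration, not part of the algorithm's definition.

```python
import math

def soft_kmeans(points, means, beta, n_iters=50):
    """Soft K-means, version 1, in one dimension."""
    means = list(means)
    for _ in range(n_iters):
        # Assignment step (20.7): responsibilities are a soft-min
        # over the distances d(m, x) = (m - x)^2.
        resp = []
        for x in points:
            ws = [math.exp(-beta * (m - x) ** 2) for m in means]
            Z = sum(ws)
            resp.append([w / Z for w in ws])
        # Update step (20.8)-(20.9): each mean matches the
        # responsibility-weighted sample mean of the data.
        for k in range(len(means)):
            R_k = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, points)) / R_k
    return means
```

For a stiff β the responsibilities are nearly 0/1 and the algorithm behaves like hard K-means; as β → 0, every responsibility tends to 1/K and all the means collapse towards the overall data mean.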
Dimensionally, the stiffness β is an inverse-length-squared, so we can associate a lengthscale, σ ≡ 1/√β, with it. The soft K-means algorithm is demonstrated in figure 20.8. The lengthscale is shown by the radius of the circles surrounding the four means. Each panel shows the final fixed point reached for a different value of the lengthscale σ.

20.3 Conclusion

At this point, we may have fixed some of the problems with the original K-means algorithm by introducing an extra complexity-control parameter β. But how should we set β? And what about the problem of the elongated clusters, and the clusters of unequal weight and width?

[Figure 20.8. Soft K-means algorithm, version 1, applied to a data set of 40 points, with K = 4. The implicit lengthscale parameter σ = 1/β^(1/2) is varied from a large to a small value. Each picture shows the state of all four means, with the implicit lengthscale shown by the radius of the four circles, after running the algorithm for several tens of iterations. At the largest lengthscale, all four means converge exactly to the data mean. Then the four means separate into two groups of two. At shorter lengthscales, each of these pairs itself bifurcates into subgroups.]

Adding one stiffness parameter β is not going to make all these problems go away. We'll come back to these questions in a later chapter, as we develop the mixture-density-modelling view of clustering.

Further reading

For a vector-quantization approach to clustering see (Luttrell, 1989; Luttrell, 1990).

20.4 Exercises

Exercise 20.3. [3, p.291] Explore the properties of the soft K-means algorithm, version 1, assuming that the datapoints {x} come from a single separable two-dimensional Gaussian distribution with mean zero and variances (var(x₁), var(x₂)) = (σ₁², σ₂²), with σ₁² > σ₂². Set K = 2, assume N is large, and investigate the fixed points of the algorithm as β is varied. [Hint: assume that m^(1) = (m, 0) and m^(2) = (−m, 0).]

Exercise 20.4. [3] Consider the soft K-means algorithm applied to a large amount of one-dimensional data that comes from a mixture of two equal-weight Gaussians with true means μ = ±1 and standard deviation σ_P, for example σ_P = 1. Show that the hard K-means algorithm with K = 2 leads to a solution in which the two means are further apart than the two true means. Discuss what happens for other values of β, and find the value of β such that the soft algorithm puts the two means in the correct places.

20.5 Solutions

Solution to exercise 20.1 (p.287). We can associate an 'energy' with the state of the K-means algorithm by connecting a spring between each data point x^(n) and the mean that is responsible for it. The energy of one spring is proportional to its squared length, namely β d(x^(n), m^(k)), where β is the stiffness of the spring. The total energy of all the springs is a Lyapunov function for the algorithm, because (a) the assignment step can only decrease the energy (a point only changes its allegiance if the length of its spring would be reduced); (b) the update step can only decrease the energy (moving m^(k) to the mean of its assigned points is the way to minimize the energy of its springs); and (c) the energy is bounded below, which is the second condition for a Lyapunov function. Since the algorithm has a Lyapunov function, it converges.

Solution to exercise 20.3 (p.290). If the means are initialized to m^(1) = (m, 0) and m^(2) = (−m, 0), the assignment step for a point at location x = (x₁, x₂) gives

    r₁(x) = exp(−β(x₁ − m)²/2) / [exp(−β(x₁ − m)²/2) + exp(−β(x₁ + m)²/2)]        (20.10)
          = 1 / (1 + exp(−2βmx₁)),        (20.11)

and the updated m is

    m′ = ∫ dx₁ P(x₁) x₁ r₁(x) / ∫ dx₁ P(x₁) r₁(x)        (20.12)
       = 2 ∫ dx₁ P(x₁) x₁ / (1 + exp(−2βmx₁)).        (20.13)

[Figure 20.9. Schematic diagram of the bifurcation as the largest data variance σ₁ increases from below 1/β^(1/2) to above 1/β^(1/2). The data variance is indicated by the ellipse.]

Now, m = 0 is a fixed point, but the question is, is it stable or unstable?
For tiny m (that is, βmσ₁ ≪ 1), we can Taylor-expand

    1 / (1 + exp(−2βmx₁)) ≃ (1/2)(1 + βmx₁) + ⋯,        (20.14)

so

    m′ ≃ 2 ∫ dx₁ P(x₁) x₁ (1/2)(1 + βmx₁)        (20.15)
       = σ₁² β m.        (20.16)

For small m, m either grows or decays exponentially under this mapping, depending on whether σ₁²β is greater or less than 1. The fixed point m = 0 is stable if

    σ₁² ≤ 1/β        (20.17)

and unstable otherwise. [Incidentally, this derivation shows that this result is general, holding for any true probability distribution P(x₁) having variance σ₁², not just the Gaussian.]

If σ₁² > 1/β then there is a bifurcation and there are two stable fixed points surrounding the unstable fixed point at m = 0. To illustrate this bifurcation, figure 20.10 shows the outcome of running the soft K-means algorithm with β = 1 on one-dimensional data of standard deviation σ₁, for various values of σ₁. Figure 20.11 shows this pitchfork bifurcation from the other point of view, where the data's standard deviation σ₁ is fixed and the algorithm's lengthscale 1/β^(1/2) is varied on the horizontal axis.

[Figure 20.10. The stable mean locations as a function of σ₁, for constant β, found numerically (thick lines), and the approximation (20.22) (thin lines).]

[Figure 20.11. The stable mean locations as a function of 1/β^(1/2), for constant σ₁.]
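The stability claim can be checked by iterating the map (20.13) numerically. In the sketch below the integral is approximated by a midpoint Riemann sum over a Gaussian P(x₁); the grid size and the truncation at ±5σ₁ are assumptions.

```python
import math

def update_m(m, sigma1, beta, n=4001):
    """One step of the map (20.13):
    m -> 2 * integral of P(x1) x1 / (1 + exp(-2 beta m x1)) dx1."""
    lo, hi = -5 * sigma1, 5 * sigma1
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = math.exp(-x * x / (2 * sigma1 ** 2)) / (math.sqrt(2 * math.pi) * sigma1)
        total += p * x / (1 + math.exp(-2 * beta * m * x)) * dx
    return 2 * total

def iterate(m0, sigma1, beta, n_iters=60):
    m = m0
    for _ in range(n_iters):
        m = update_m(m, sigma1, beta)
    return m

# Below the bifurcation (sigma1^2 * beta < 1), m decays to zero;
# above it (sigma1^2 * beta > 1), m runs away to a nonzero fixed point.
m_stable = iterate(0.1, sigma1=0.8, beta=1.0)
m_bifurc = iterate(0.1, sigma1=1.5, beta=1.0)
```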

Here is a cheap theory to model how the fitted parameters ±m behave beyond the bifurcation, based on continuing the series expansion. This continuation of the series is rather suspect, since the series isn't necessarily expected to converge beyond the bifurcation point, but the theory fits well anyway.

We take our analytic approach one term further in the expansion

    1 / (1 + exp(−2βmx₁)) ≃ (1/2)(1 + βmx₁ − (1/3)(βmx₁)³) + ⋯ ;        (20.18)

then we can solve for the shape of the bifurcation to leading order, which depends on the fourth moment of the distribution:

    m′ ≃ 2 ∫ dx₁ P(x₁) x₁ (1/2)(1 + βmx₁ − (1/3)(βmx₁)³)        (20.19)
       = σ₁² β m − (1/3)(βm)³ (3σ₁⁴).        (20.20)

[At (20.20) we use the fact that P(x₁) is Gaussian to find the fourth moment.]

This map has a fixed point at m such that

    σ₁²β (1 − (βm)² σ₁²) = 1,        (20.21)

i.e.,

    m = ± (σ₁²β − 1)^(1/2) / (β^(3/2) σ₁²).        (20.22)

The thin line in figure 20.10 shows this theoretical approximation. Figure 20.10 shows the bifurcation as a function of σ₁ for fixed β; figure 20.11 shows the bifurcation as a function of 1/β^(1/2) for fixed σ₁.

Exercise 20.5. [2, p.292] Why does the pitchfork in figure 20.11 tend to the values ±0.8 as 1/β^(1/2) → 0? Give an analytic expression for this asymptote.

Solution to exercise 20.5 (p.292). The asymptote is the mean of the rectified Gaussian,

    ∫₀^∞ Normal(x; 0, 1) x dx / (1/2) = √(2/π) ≃ 0.798.        (20.23)
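The value in (20.23) is easy to confirm numerically; here a simple midpoint Riemann sum stands in for the integral (the step size and truncation point are assumptions):

```python
import math

# Mean of the rectified unit Gaussian: E[x | x > 0]
# = (integral of x * Normal(x; 0, 1) over x > 0) / (1/2).
dx = 1e-4
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
xs = [(i + 0.5) * dx for i in range(200000)]   # covers (0, 20)
numer = sum(x * phi(x) * dx for x in xs)
denom = sum(phi(x) * dx for x in xs)           # approximately 1/2
asymptote = numer / denom                      # close to sqrt(2/pi)
```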

21 Exact Inference by Complete Enumeration

We open our toolbox of methods for handling probabilities by discussing a brute-force inference method: complete enumeration of all hypotheses, and evaluation of their probabilities. This approach is an exact method, and the difficulty of carrying it out will motivate the smarter exact and approximate methods introduced in the following chapters.

21.1 The burglar alarm

Bayesian probability theory is sometimes called 'common sense, amplified'. When thinking about the following questions, please ask your common sense what it thinks the answers are; we will then see how Bayesian methods confirm your everyday intuition.

Example 21.1. Fred lives in Los Angeles and commutes 60 miles to work. Whilst at work, he receives a phone-call from his neighbour saying that Fred's burglar alarm is ringing. What is the probability that there was a burglar in his house today? While driving home to investigate, Fred hears on the radio that there was a small earthquake that day near his home. 'Oh', he says, feeling relieved, 'it was probably the earthquake that set off the alarm'. What is the probability that there was a burglar in his house? (After Pearl, 1988.)

[Figure 21.1. Belief network for the burglar alarm problem: Burglar → Alarm ← Earthquake; Alarm → Phonecall; Earthquake → Radio.]

Let's introduce variables b (a burglar was present in Fred's house today), a (the alarm is ringing), p (Fred receives a phonecall from the neighbour reporting the alarm), e (a small earthquake took place today near Fred's house), and r (the radio report of the earthquake is heard by Fred). The probability of all these variables might factorize as follows:

    P(b, e, a, p, r) = P(b) P(e) P(a | b, e) P(p | a) P(r | e),        (21.1)

and plausible values for the probabilities are:

1. Burglar probability:

    P(b = 1) = β,   P(b = 0) = 1 − β,        (21.2)

e.g., β = 0.001 gives a mean burglary rate of once every three years.

2. Earthquake probability:

    P(e = 1) = ε,   P(e = 0) = 1 − ε,        (21.3)

with, e.g., ε = 0.001; our assertion that the earthquakes are independent of burglars, i.e., that the prior probability of b and e is P(b, e) = P(b)P(e), seems reasonable unless we take into account opportunistic burglars who strike immediately after earthquakes.

3. Alarm ringing probability: we assume the alarm will ring if any of the following three events happens: (a) a burglar enters the house, and triggers the alarm (let's assume the alarm has a reliability of α_b = 0.99, i.e., 99% of burglars trigger the alarm); (b) an earthquake takes place, and triggers the alarm (perhaps α_e = 1% of alarms are triggered by earthquakes?); or (c) some other event causes a false alarm; let's assume the false alarm rate f is 0.001, so Fred has false alarms from non-earthquake causes once every three years. [This type of dependence of a on b and e is known as a 'noisy-or'.] The probabilities of a given b and e are then:

    P(a=0 | b=0, e=0) = (1−f),                 P(a=1 | b=0, e=0) = f
    P(a=0 | b=1, e=0) = (1−f)(1−α_b),          P(a=1 | b=1, e=0) = 1 − (1−f)(1−α_b)
    P(a=0 | b=0, e=1) = (1−f)(1−α_e),          P(a=1 | b=0, e=1) = 1 − (1−f)(1−α_e)
    P(a=0 | b=1, e=1) = (1−f)(1−α_b)(1−α_e),   P(a=1 | b=1, e=1) = 1 − (1−f)(1−α_b)(1−α_e)

or, in numbers,

    P(a=0 | b=0, e=0) = 0.999,        P(a=1 | b=0, e=0) = 0.001
    P(a=0 | b=1, e=0) = 0.009 99,     P(a=1 | b=1, e=0) = 0.990 01
    P(a=0 | b=0, e=1) = 0.989 01,     P(a=1 | b=0, e=1) = 0.010 99
    P(a=0 | b=1, e=1) = 0.009 890 1,  P(a=1 | b=1, e=1) = 0.990 109 9.

We assume the neighbour would never phone if the alarm is not ringing [P(p=1 | a=0) = 0], and that the radio is a trustworthy reporter too [P(r=1 | e=0) = 0]; we won't need to specify the probabilities P(p=1 | a=1) or P(r=1 | e=1) in order to answer the questions above, since the outcomes p = 1 and r = 1 give us certainty, respectively, that a = 1 and e = 1.

We can answer the two questions about the burglar by computing the posterior probabilities of all hypotheses given the available information. Let's start by reminding ourselves that before either p or r is observed, the probability that there is a burglar is P(b = 1) = β = 0.001, the probability that an earthquake took place is P(e = 1) = ε = 0.001, and these two propositions are independent.

First, when p = 1, we know that the alarm is ringing: a = 1. The posterior probability of b and e becomes:

    P(b, e | a=1) = P(a=1 | b, e) P(b) P(e) / P(a=1).        (21.4)

The numerator's four possible values are

    P(a=1 | b=0, e=0) × P(b=0) × P(e=0) = 0.001 × 0.999 × 0.999 = 0.000 998 0
    P(a=1 | b=1, e=0) × P(b=1) × P(e=0) = 0.990 01 × 0.001 × 0.999 = 0.000 989
    P(a=1 | b=0, e=1) × P(b=0) × P(e=1) = 0.010 99 × 0.999 × 0.001 = 0.000 010 979
    P(a=1 | b=1, e=1) × P(b=1) × P(e=1) = 0.990 109 9 × 0.001 × 0.001 = 9.9 × 10⁻⁷.

The normalizing constant is the sum of these four numbers, P(a=1) = 0.002, and the posterior probabilities are

    P(b=0, e=0 | a=1) = 0.4993
    P(b=1, e=0 | a=1) = 0.4947
    P(b=0, e=1 | a=1) = 0.0055
    P(b=1, e=1 | a=1) = 0.0005.        (21.5)
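The enumeration (21.4)-(21.5) is small enough to reproduce directly. A sketch, with the noisy-or of item 3 written out as a function (the variable names are mine):

```python
beta, eps, f, a_b, a_e = 0.001, 0.001, 0.001, 0.99, 0.01

def p_alarm(b, e):
    """Noisy-or: P(a = 1 | b, e)."""
    return 1 - (1 - f) * (1 - a_b) ** b * (1 - a_e) ** e

# Numerator of (21.4) for each of the four hypotheses (b, e).
joint = {(b, e): p_alarm(b, e) * (beta if b else 1 - beta)
                               * (eps if e else 1 - eps)
         for b in (0, 1) for e in (0, 1)}
Z = sum(joint.values())                          # normalizer P(a = 1)
post = {be: x / Z for be, x in joint.items()}    # equation (21.5)

# Marginalizing over e gives P(b = 1 | a = 1), about 0.495.
p_burglar = post[(1, 0)] + post[(1, 1)]
```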

To answer the question 'what's the probability a burglar was there?' we marginalize over the earthquake variable e:

    P(b=0 | a=1) = P(b=0, e=0 | a=1) + P(b=0, e=1 | a=1) = 0.505
    P(b=1 | a=1) = P(b=1, e=0 | a=1) + P(b=1, e=1 | a=1) = 0.495.        (21.6)

So there is nearly a 50% chance that there was a burglar present. It is important to note that the variables b and e, which were independent a priori, are now dependent. The posterior distribution (21.5) is not a separable function of b and e. This fact is illustrated most simply by studying the effect of learning that e = 1.

When we learn e = 1, the posterior probability of b is given by P(b | e=1, a=1) = P(b, e=1 | a=1) / P(e=1 | a=1), i.e., by dividing the bottom two rows of (21.5) by their sum, P(e=1 | a=1) = 0.0060. The posterior probability of b is:

    P(b=0 | e=1, a=1) = 0.92
    P(b=1 | e=1, a=1) = 0.08.        (21.7)

There is thus now an 8% chance that a burglar was in Fred's house. It is in accordance with everyday intuition that the probability that b = 1 (a possible cause of the alarm) reduces when Fred learns that an earthquake, an alternative explanation of the alarm, has happened.

Explaining away

This phenomenon, that one of the possible causes (b = 1) of some data (the data in this case being a = 1) becomes less probable when another of the causes (e = 1) becomes more probable, even though those two causes were independent variables a priori, is known as explaining away. Explaining away is an important feature of correct inferences, and one that any artificial intelligence should replicate.

If we believe that the neighbour and the radio service are unreliable or capricious, so that we are not certain that the alarm really is ringing or that an earthquake really has happened, the calculations become more complex, but the explaining-away effect persists: the arrival of the earthquake report r simultaneously makes it more probable that the alarm truly is ringing, and less probable that the burglar was present.

In summary, we solved the inference questions about the burglar by enumerating all four hypotheses about the variables (b, e), finding their posterior probabilities, and marginalizing to obtain the required inferences about b.

Exercise 21.2. [2] After Fred receives the phone-call about the burglar alarm, but before he hears the radio report, what, from his point of view, is the probability that there was a small earthquake today?

21.2 Exact inference for continuous hypothesis spaces

Many of the hypothesis spaces we will consider are naturally thought of as continuous. For example, the unknown decay length λ of section 3.1 (p.48) lives in a continuous one-dimensional space; and the unknown mean and standard deviation of a Gaussian, μ and σ, live in a continuous two-dimensional space. In any practical computer implementation, such continuous spaces will necessarily be discretized, however, and so can, in principle, be enumerated, at a grid of parameter values, for example. In figure 3.2 we plotted the likelihood function for the decay length as a function of λ by evaluating the likelihood at a finely-spaced series of points.

[Figure 21.2. Enumeration of an entire (discretized) hypothesis space for one Gaussian with parameters μ (horizontal axis) and σ (vertical).]

A two-parameter model

Let's look at the Gaussian distribution as an example of a model with a two-dimensional hypothesis space. The one-dimensional Gaussian distribution is parameterized by a mean μ and a standard deviation σ:

    P(x | μ, σ) = (1/(√(2π)σ)) exp(−(x − μ)²/(2σ²)) ≡ Normal(x; μ, σ²).        (21.8)

[Figure 21.3. Five datapoints {x_n}, n = 1, …, 5. The horizontal coordinate is the value of the datum, x_n; the vertical coordinate has no meaning.]

Figure 21.2 shows an enumeration of one hundred hypotheses about the mean and standard deviation of a one-dimensional Gaussian distribution. These hypotheses are evenly spaced in a ten by ten square grid covering ten values of μ and ten values of σ. Each hypothesis is represented by a picture showing the probability density that it puts on x. We now examine the inference of μ and σ given data points x_n, n = 1, …, N, assumed to be drawn independently from this density.

Imagine that we acquire data, for example the five points shown in figure 21.3. We can now evaluate the posterior probability of each of the one hundred subhypotheses by evaluating the likelihood of each, that is, the value of P({x_n}, n = 1, …, 5 | μ, σ). The likelihood values are shown diagrammatically in figure 21.4, using line thickness to encode the value of the likelihood. Subhypotheses with likelihood smaller than e⁻⁸ times the maximum likelihood have been deleted.
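The grid evaluation is straightforward to carry out. In this sketch the five datapoints are made up (figure 21.3's actual values are not given in the text), and the grid spacing is an assumption; the e⁻⁸ pruning rule is the one used in figure 21.4.

```python
import math

def log_likelihood(xs, mu, sigma):
    """ln P({x_n} | mu, sigma) for the Gaussian (21.8)."""
    return sum(-math.log(math.sqrt(2 * math.pi) * sigma)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

xs = [0.1, 0.6, 1.1, 1.4, 1.8]                      # stand-in datapoints
grid = [(mu, sigma, log_likelihood(xs, mu, sigma))
        for mu in [0.2 * i for i in range(11)]      # mu in [0, 2]
        for sigma in [0.2 * j for j in range(1, 11)]]  # sigma in (0, 2]
best = max(grid, key=lambda t: t[2])
# Keep only subhypotheses within a factor e^8 of the maximum likelihood.
kept = [t for t in grid if t[2] > best[2] - 8]
```

The maximizing grid point lands at the values nearest the sample mean and the maximum-likelihood standard deviation (here μ = 1.0, σ = 0.6).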
Using a finer grid, we can represent the same information by plotting the likelihood as a surface plot or contour plot as a function of μ and σ (figure 21.5).

A five-parameter mixture model

Eyeballing the data (figure 21.3), you might agree that it seems more plausible that they come not from a single Gaussian but from a mixture of two Gaussians, defined by two means, two standard deviations, and two mixing coefficients.

[Figure 21.4. Likelihood function, given the data of figure 21.3, represented by line thickness. Subhypotheses having likelihood smaller than e⁻⁸ times the maximum likelihood are not shown.]

[Figure 21.5. The likelihood function for the parameters of a Gaussian distribution. Surface plot and contour plot of the log likelihood as a function of μ and σ. The data set of N = 5 points had mean x̄ = 1.0 and S = Σ(x − x̄)² = 1.0.]

[Figure 21.6. Enumeration of the entire (discretized) hypothesis space for a mixture of two Gaussians. The weights of the mixture components are π₁, π₂ = 0.6, 0.4 in the top half and 0.8, 0.2 in the bottom half. Means μ₁ and μ₂ vary horizontally, and standard deviations σ₁ and σ₂ vary vertically.]

The mixing coefficients π₁ and π₂ satisfy π₁ + π₂ = 1 and πᵢ ≥ 0:

    P(x | μ₁, σ₁, π₁, μ₂, σ₂, π₂) = (π₁/(√(2π)σ₁)) exp(−(x − μ₁)²/(2σ₁²)) + (π₂/(√(2π)σ₂)) exp(−(x − μ₂)²/(2σ₂²)).

Let's enumerate the subhypotheses for this alternative model. The parameter space is five-dimensional, so it becomes challenging to represent it on a single page. Figure 21.6 enumerates 800 subhypotheses with different values of the five parameters μ₁, μ₂, σ₁, σ₂, π₁. The means are varied between five values each in the horizontal directions. The standard deviations take on four values each vertically. And π₁ takes on two values vertically. We can represent the inference about these five parameters in the light of the five datapoints as shown in figure 21.7.

If we wish to compare the one-Gaussian model with the mixture-of-two model, we can find the models' posterior probabilities by evaluating the marginal likelihood or evidence for each model H, P({x} | H). The evidence is given by integrating over the parameters, θ; the integration can be implemented numerically by summing over the alternative enumerated values of θ,

    P({x} | H) = Σ_θ P(θ) P({x} | θ, H),        (21.9)

where P(θ) is the prior distribution over the grid of parameter values, which I take to be uniform.

For the mixture of two Gaussians this integral is a five-dimensional integral; if it is to be performed at all, the grid of points will need to be much finer than the grids shown in the figures. If the uncertainty about each of K parameters has been reduced by, say, a factor of ten by observing the data, then brute-force integration requires a grid of at least 10^K points. This exponential growth of computation with model size is the reason why complete enumeration is rarely a feasible computational strategy.
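For the one-Gaussian model the sum (21.9) over a uniform grid is tiny. This sketch uses made-up datapoints and an arbitrary grid (both assumptions), just to show the mechanics:

```python
import math

def likelihood(xs, mu, sigma):
    """P({x_n} | mu, sigma) for a one-dimensional Gaussian."""
    return math.prod(math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                     / (math.sqrt(2 * math.pi) * sigma) for x in xs)

xs = [0.1, 0.6, 1.1, 1.4, 1.8]              # stand-in datapoints
mus = [0.2 * i for i in range(11)]          # grid over mu
sigmas = [0.2 * j for j in range(1, 11)]    # grid over sigma
prior = 1.0 / (len(mus) * len(sigmas))      # uniform P(theta) on the grid

# Equation (21.9): the evidence as a prior-weighted sum over the grid.
evidence = sum(prior * likelihood(xs, mu, s) for mu in mus for s in sigmas)
```

The evidence is an average of likelihoods over the prior, so it is always smaller than the maximum likelihood; that gap is what penalizes over-flexible models.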

[Figure 21.7. Inferring a mixture of two Gaussians. Likelihood function, given the data of figure 21.3, represented by line thickness. The hypothesis space is identical to that shown in figure 21.6. Subhypotheses having likelihood smaller than e⁻⁸ times the maximum likelihood are not shown, hence the blank regions, which correspond to hypotheses that the data have ruled out.]

Exercise 21.3. [1] Imagine fitting a mixture of ten Gaussians to data in a twenty-dimensional space. Estimate the computational cost of implementing inferences for this model by enumeration of a grid of parameter values.

22 Maximum Likelihood and Clustering

Rather than enumerate all hypotheses (which may be exponential in number), we can save a lot of time by homing in on one good hypothesis that fits the data well. This is the philosophy behind the maximum likelihood method, which identifies the setting of the parameter vector θ that maximizes the likelihood, P(Data | θ, H).

For some models the maximum likelihood parameters can be identified instantly from the data; for more complex models, finding the maximum likelihood parameters may require an iterative algorithm.

For any model, it is usually easiest to work with the logarithm of the likelihood rather than the likelihood, since likelihoods, being products of the probabilities of many data points, tend to be very small. Likelihoods multiply; log likelihoods add.

22.1 Maximum likelihood for one Gaussian

We return to the Gaussian for our first examples. Assume we have data {x_n}, n = 1, …, N. The log likelihood is:

    ln P({x_n} | μ, σ) = −N ln(√(2π)σ) − Σ_n (x_n − μ)²/(2σ²).        (22.1)

The likelihood can be expressed in terms of two functions of the data, the sample mean

    x̄ ≡ Σ_{n=1}^N x_n / N,        (22.2)

and the sum of square deviations

    S ≡ Σ_n (x_n − x̄)²:        (22.3)

    ln P({x_n} | μ, σ) = −N ln(√(2π)σ) − [N(μ − x̄)² + S]/(2σ²).        (22.4)

Because the likelihood depends on the data only through x̄ and S, these two quantities are known as sufficient statistics.

Example 22.1. Differentiate the log likelihood with respect to μ and show that, if the standard deviation is known to be σ, the maximum likelihood mean μ of a Gaussian is equal to the sample mean x̄, for any value of σ.

Solution.

    ∂ ln P / ∂μ = −N(μ − x̄)/σ²        (22.5)
                = 0 when μ = x̄.        (22.6)
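The claim that x̄ and S are sufficient can be checked numerically: (22.1), computed point by point, agrees with (22.4), computed from the two statistics alone. A sketch with arbitrary numbers:

```python
import math

def sufficient_stats(xs):
    """Sample mean (22.2) and sum of square deviations (22.3)."""
    xbar = sum(xs) / len(xs)
    return xbar, sum((x - xbar) ** 2 for x in xs)

def loglik_direct(xs, mu, sigma):
    """Equation (22.1): a sum over the individual datapoints."""
    return sum(-math.log(math.sqrt(2 * math.pi) * sigma)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def loglik_from_stats(xs, mu, sigma):
    """Equation (22.4): the same value via the sufficient statistics."""
    xbar, S = sufficient_stats(xs)
    N = len(xs)
    return (-N * math.log(math.sqrt(2 * math.pi) * sigma)
            - (N * (mu - xbar) ** 2 + S) / (2 * sigma ** 2))
```

The two functions agree for any (μ, σ), which is the algebraic identity Σ(x_n − μ)² = N(μ − x̄)² + S in action.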

[Figure 22.1. The likelihood function for the parameters of a Gaussian distribution. (a1, a2) Surface plot and contour plot of the log likelihood as a function of μ and σ. The data set of N = 5 points had mean x̄ = 1.0 and S = Σ(x − x̄)² = 1.0. (b) The posterior probability of μ for various values of σ. (c) The posterior probability of σ for various fixed values of μ (shown as a density over ln σ).]

If we Taylor-expand the log likelihood about the maximum, we can define approximate error bars on the maximum likelihood parameter: we use a quadratic approximation to estimate how far from the maximum-likelihood parameter setting we can go before the likelihood falls by some standard factor, for example e^(1/2), or e^(4/2). In the special case of a likelihood that is a Gaussian function of the parameters, the quadratic approximation is exact.

Example 22.2. Find the second derivative of the log likelihood with respect to μ, and find the error bars on μ, given the data and σ.

Solution.
    ∂² ln P / ∂μ² = −N/σ².        (22.7)

Comparing this curvature with the curvature of the log of a Gaussian distribution over μ of standard deviation σ_μ, exp(−μ²/(2σ_μ²)), which is −1/σ_μ², we deduce that the error bars on μ (derived from the likelihood function) are

    σ_μ = σ/√N.        (22.8)

The error bars have this property: at the two points μ = x̄ ± σ_μ, the likelihood is smaller than its maximum value by a factor of e^(1/2).

Example 22.3. Find the maximum likelihood standard deviation σ of a Gaussian, whose mean is known to be μ, in the light of data {x_n}, n = 1, …, N. Find the second derivative of the log likelihood with respect to ln σ, and the error bars on ln σ.

Solution. The likelihood's dependence on σ is

    ln P({x_n} | μ, σ) = −N ln(√(2π)σ) − S_tot/(2σ²),        (22.9)

where S_tot = Σ_n (x_n − μ)². To find the maximum of the likelihood, we can differentiate with respect to ln σ. [It's often most hygienic to differentiate with respect to ln u rather than u, when u is a scale variable; we use d(uⁿ)/d(ln u) = n uⁿ.]

    ∂ ln P({x_n} | μ, σ) / ∂ ln σ = −N + S_tot/σ².        (22.10)

This derivative is zero when

    σ² = S_tot/N,        (22.11)

i.e.,

    σ = √( Σ_{n=1}^N (x_n − μ)² / N ).        (22.12)

The second derivative is

    ∂² ln P({x_n} | μ, σ) / ∂(ln σ)² = −2 S_tot/σ²,        (22.13)

and at the maximum-likelihood value of σ, this equals −2N. So error bars on ln σ are

    σ_{ln σ} = 1/√(2N).        (22.14)

Exercise 22.4. [1] Show that the values of μ and ln σ that jointly maximize the likelihood are {μ, σ}_ML = {x̄, σ_N = √(S/N)}, where

    σ_N ≡ √( Σ_{n=1}^N (x_n − x̄)² / N ).        (22.15)

22.2 Maximum likelihood for a mixture of Gaussians

We now derive an algorithm for fitting a mixture of Gaussians to one-dimensional data. In fact, this algorithm is so important to understand that you, gentle reader, get to derive the algorithm. Please work through the following exercise.

Exercise 22.5. [2, p.310] A random variable x is assumed to have a probability distribution that is a mixture of two Gaussians,

    P(x | μ₁, μ₂, σ) = Σ_{k=1}^2 p_k (1/√(2πσ²)) exp(−(x − μ_k)²/(2σ²)),        (22.16)

where the two Gaussians are given the labels k = 1 and k = 2; the prior probability of the class label k is {p₁ = 1/2, p₂ = 1/2}; {μ_k} are the means of the two Gaussians; and both have standard deviation σ. For brevity, we denote these parameters by θ ≡ {{μ_k}, σ}.

A data set of N points {x_n}, n = 1, …, N, is assumed to consist of independent samples from this distribution. Let k_n denote the unknown class label of the nth point.
Assuming that {μ_k} and σ are known, show that the posterior probability of the class label k_n of the nth point can be written as

P(k_n = 1 \mid x_n, \theta) = \frac{1}{1 + \exp[-(w_1 x_n + w_0)]},
P(k_n = 2 \mid x_n, \theta) = \frac{1}{1 + \exp[+(w_1 x_n + w_0)]},  (22.17)

and give expressions for w_0 and w_1.

Assume now that the means {μ_k} are not known, and that we wish to infer them from the data {x_n}_{n=1}^N. (The standard deviation σ is known.) In the remainder of this question we will derive an iterative algorithm for finding values for {μ_k} that maximize the likelihood,

P(\{x_n\} \mid \{\mu_k\}, \sigma) = \prod_{n=1}^{N} P(x_n \mid \{\mu_k\}, \sigma).  (22.18)

Let L denote the natural log of the likelihood. Show that the derivative of the log likelihood with respect to μ_k is given by

\frac{\partial L}{\partial \mu_k} = \sum_n p_{k \mid n} \frac{(x_n - \mu_k)}{\sigma^2},  (22.19)

where p_{k|n} ≡ P(k_n = k | x_n, θ) appeared above at equation (22.17).

Show, neglecting terms in ∂P(k_n = k | x_n, θ)/∂μ_k, that the second derivative is approximately given by

\frac{\partial^2 L}{\partial \mu_k^2} = -\sum_n p_{k \mid n} \frac{1}{\sigma^2}.  (22.20)

Hence show that from an initial state μ1, μ2, an approximate Newton–Raphson step updates these parameters to μ1′, μ2′, where

\mu_k' = \frac{\sum_n p_{k \mid n} x_n}{\sum_n p_{k \mid n}}.  (22.21)

[The Newton–Raphson method for maximizing L(μ) updates μ to μ′ = μ − [∂L/∂μ] / [∂²L/∂μ²].]

[Figure: a one-dimensional data set of 32 points, lying between 0 and 6.]

Assuming that σ = 1, sketch a contour plot of the likelihood function as a function of μ1 and μ2 for the data set shown above. The data set consists of 32 points. Describe the peaks in your sketch and indicate their widths.

Notice that the algorithm you have derived for maximizing the likelihood is identical to the soft K-means algorithm of section 20.4. Now that it is clear that clustering can be viewed as mixture-density-modelling, we are able to derive enhancements to the K-means algorithm, which rectify the problems we noted earlier.
22.3 Enhancements to soft K-means

Algorithm 22.2 shows a version of the soft-K-means algorithm corresponding to a modelling assumption that each cluster is a spherical Gaussian having its own width (each cluster has its own β^(k) = 1/σ_k²). The algorithm updates the lengthscales σ_k for itself. The algorithm also includes cluster weight parameters π1, π2, …, πK which also update themselves, allowing accurate modelling of data from clusters of unequal weights. This algorithm is demonstrated in figure 22.3 for two data sets that we've seen before. The second example shows that convergence can take a long time, but eventually the algorithm identifies the small cluster and the large cluster.

[Algorithm 22.2. The soft K-means algorithm, version 2.]

Assignment step. The responsibilities are

r_k^{(n)} = \frac{ \pi_k \frac{1}{(\sqrt{2\pi}\,\sigma_k)^I} \exp\bigl( -d(m^{(k)}, x^{(n)}) / (2\sigma_k^2) \bigr) }{ \sum_{k'} \pi_{k'} \frac{1}{(\sqrt{2\pi}\,\sigma_{k'})^I} \exp\bigl( -d(m^{(k')}, x^{(n)}) / (2\sigma_{k'}^2) \bigr) },  (22.22)

where I is the dimensionality of x.

Update step. Each cluster's parameters, m^{(k)}, π_k, and σ_k², are adjusted to match the data points that it is responsible for:

m^{(k)} = \frac{\sum_n r_k^{(n)} x^{(n)}}{R^{(k)}}  (22.23)

\sigma_k^2 = \frac{\sum_n r_k^{(n)} (x^{(n)} - m^{(k)})^2}{I\, R^{(k)}}  (22.24)

\pi_k = \frac{R^{(k)}}{\sum_k R^{(k)}}  (22.25)

where R^{(k)} is the total responsibility of mean k,

R^{(k)} = \sum_n r_k^{(n)}.  (22.26)

[Figure 22.3. Soft K-means algorithm, version 2, with K = 2, applied to (a) the 40-point data set of figure 20.3 (t = 0, 1, 2, 3, 9); (b) the little 'n' large data set of figure 20.5 (t = 0, 10, 20, 30, 35).]

[Algorithm 22.4. The soft K-means algorithm, version 3, which corresponds to a model of axis-aligned Gaussians.]

r_k^{(n)} = \frac{ \pi_k \prod_{i=1}^{I} \frac{1}{\sqrt{2\pi}\,\sigma_i^{(k)}} \exp\bigl( -(x_i^{(n)} - m_i^{(k)})^2 / (2 (\sigma_i^{(k)})^2) \bigr) }{ \sum_{k'} (\text{same expression, with } k' \text{ in place of } k) }  (22.27)

(\sigma_i^{(k)})^2 = \frac{\sum_n r_k^{(n)} (x_i^{(n)} - m_i^{(k)})^2}{R^{(k)}}  (22.28)
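Algorithm 22.2 can be turned into runnable form in a few lines. This is an illustrative sketch, not the book's code: the Gaussian exponent is written out explicitly as ||x − m||²/(2σ²), the means are initialized at the first K data points for simplicity, and a tiny floor on σ² guards against the collapse discussed in section 22.4:

```python
import math

def soft_kmeans_v2(data, K, iters=50):
    """Illustrative sketch of algorithm 22.2: spherical Gaussian clusters,
    each with its own mean m[k], variance sigma2[k], and weight pi[k]."""
    N, I = len(data), len(data[0])
    m = [list(data[k]) for k in range(K)]   # init at the first K points
    sigma2 = [1.0] * K
    pi = [1.0 / K] * K
    for _ in range(iters):
        # Assignment step (22.22): responsibilities, via log-weights
        # for numerical stability.
        r = []
        for x in data:
            logw = []
            for k in range(K):
                d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, m[k]))
                logw.append(math.log(pi[k])
                            - 0.5 * I * math.log(2 * math.pi * sigma2[k])
                            - d2 / (2 * sigma2[k]))
            top = max(logw)
            w = [math.exp(lw - top) for lw in logw]
            s = sum(w)
            r.append([wk / s for wk in w])
        # Update step (22.23)-(22.26).
        for k in range(K):
            Rk = sum(r[n][k] for n in range(N))
            m[k] = [sum(r[n][k] * data[n][i] for n in range(N)) / Rk
                    for i in range(I)]
            d2sum = sum(r[n][k] * sum((xi - mi) ** 2
                                      for xi, mi in zip(data[n], m[k]))
                        for n in range(N))
            sigma2[k] = d2sum / (I * Rk) + 1e-9   # floor: see 'KABOOM!' below
            pi[k] = Rk / N
    return m, sigma2, pi

data = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.0), (0.0, 0.2),
        (10.2, 10.0), (10.0, 10.2)]
means, widths, weights = soft_kmeans_v2(data, K=2)
```

On this toy data the two recovered means sit on the two blobs, and the weights sum to one.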

[Figure 22.5. Soft K-means algorithm, version 3, applied to the data consisting of two cigar-shaped clusters. K = 2 (cf. figure 20.6). (t = 0, 10, 20, 30)]

[Figure 22.6. Soft K-means algorithm, version 3, applied to the little 'n' large data set. K = 2. (t = 0, 10, 20, 26, 32)]

Soft K-means, version 2, is a maximum-likelihood algorithm for fitting a mixture of spherical Gaussians to data ('spherical' meaning that the variance of the Gaussian is the same in all directions). This algorithm is still no good at modelling the cigar-shaped clusters of figure 20.6. [A proof that this algorithm does indeed maximize the likelihood is deferred to section 33.7.] If we wish to model the clusters by axis-aligned Gaussians with possibly-unequal variances, we replace the assignment rule (22.22) and the variance update rule (22.24) by the rules (22.27) and (22.28) displayed in algorithm 22.4.

This third version of soft K-means is demonstrated in figure 22.5 on the 'two cigars' data set of figure 20.6. After 30 iterations, the algorithm correctly locates the two clusters. Figure 22.6 shows the same algorithm applied to the little 'n' large data set; again, the correct cluster locations are found.

22.4 A fatal flaw of maximum likelihood

Finally, figure 22.7 sounds a cautionary note: when we fit K = 4 means to our first toy data set, we sometimes find that very small clusters form, covering just one or two data points. This is a pathological property of soft K-means clustering, versions 2 and 3.

Exercise 22.6.^[2] Investigate what happens if one mean m^(k)
sits exactly on top of one data point; show that if the variance σ_k² is sufficiently small, then no return is possible: σ_k² becomes ever smaller.

[Figure 22.7. Soft K-means algorithm applied to a data set of 40 points. K = 4. (t = 0, 5, 10, 20.) Notice that at convergence, one very small cluster has formed between two data points.]
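The pathology of exercise 22.6 can be seen numerically: pin one component of a two-Gaussian mixture on a data point and shrink its width, and the likelihood grows without bound. The data values and the fixed second component below are illustrative, not from the text:

```python
import math

def mixture_loglik(data, means, sigmas, weights):
    """Log-likelihood of 1-D data under a mixture of Gaussians."""
    ll = 0.0
    for x in data:
        p = sum(w / (math.sqrt(2 * math.pi) * s)
                * math.exp(-(x - m) ** 2 / (2 * s * s))
                for m, s, w in zip(means, sigmas, weights))
        ll += math.log(p)
    return ll

data = [0.0, 1.0, 2.0, 3.0, 4.0]
# One component sits exactly on the data point x = 0; shrink its width:
lls = [mixture_loglik(data, means=[0.0, 2.0], sigmas=[s, 1.0],
                      weights=[0.5, 0.5])
       for s in (1.0, 1e-2, 1e-4, 1e-6)]
# The log-likelihood increases without bound as the width goes to zero.
```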

KABOOM!

Soft K-means can blow up. Put one cluster exactly on one data point and let its variance go to zero: you can obtain an arbitrarily large likelihood! Maximum likelihood methods can break down by finding highly tuned models that fit part of the data perfectly. This phenomenon is known as overfitting. The reason we are not interested in these solutions with enormous likelihood is this: sure, these parameter-settings may have enormous posterior probability density, but the density is large over only a very small volume of parameter space. So the probability mass associated with these likelihood spikes is usually tiny.

We conclude that maximum likelihood methods are not a satisfactory general solution to data-modelling problems: the likelihood may be infinitely large at certain parameter settings. Even if the likelihood does not have infinitely-large spikes, the maximum of the likelihood is often unrepresentative, in high-dimensional problems.

Even in low-dimensional problems, maximum likelihood solutions can be unrepresentative. As you may know from basic statistics, the maximum likelihood estimator (22.15) for a Gaussian's standard deviation, σ_N, is a biased estimator, a topic that we'll take up in Chapter 24.

The maximum a posteriori (MAP) method

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.
However, multiplying the likelihood by a prior and maximizing the posterior does not make the above problems go away; the posterior density often also has infinitely-large spikes, and the maximum of the posterior probability density is often unrepresentative of the whole posterior distribution. Think back to the concept of typicality, which we encountered in Chapter 4: in high dimensions, most of the probability mass is in a typical set whose properties are quite different from those of the points that have the maximum probability density. Maxima are atypical.

A further reason for disliking the maximum a posteriori is that it is basis-dependent. If we make a nonlinear change of basis from the parameter θ to the parameter u = f(θ) then the probability density of θ is transformed to

P(u) = P(\theta) \left| \frac{\partial \theta}{\partial u} \right|.  (22.29)

The maximum of the density P(u) will usually not coincide with the maximum of the density P(θ). (For figures illustrating such nonlinear changes of basis, see the next chapter.) It seems undesirable to use a method whose answers change when we change representation.

Further reading

The soft K-means algorithm is at the heart of the automatic classification package, AutoClass (Hanson et al., 1991a; Hanson et al., 1991b).

22.5 Further exercises

Exercises where maximum likelihood may be useful

Exercise 22.7.^[3] Make a version of the K-means algorithm that models the data as a mixture of K arbitrary Gaussians, i.e., Gaussians that are not constrained to be axis-aligned.
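The basis-dependence expressed by equation (22.29) is easy to exhibit numerically. In this sketch the density over θ (a Gaussian of mean 1 and standard deviation 0.3) and the change of basis u = θ² are illustrative choices, not from the text; the grid maximum of P(θ) sits near θ = 1, while the grid maximum of P(u) sits near u = 0.81, not at 1² = 1:

```python
import math

def p_theta(t):
    # Illustrative density over theta: Gaussian, mean 1, std 0.3.
    return math.exp(-(t - 1.0) ** 2 / (2 * 0.09)) / math.sqrt(2 * math.pi * 0.09)

def p_u(u):
    # Density of u = theta**2 (theta > 0), via equation (22.29):
    # P(u) = P(theta(u)) |d theta / d u|, with theta = sqrt(u).
    t = math.sqrt(u)
    return p_theta(t) / (2 * t)

grid = [0.2 + 0.001 * i for i in range(1800)]   # values in (0.2, 2.0)
t_star = max(grid, key=p_theta)    # maximum of P(theta): near 1.0
u_star = max(grid, key=p_u)        # maximum of P(u): near 0.81
# t_star**2 = 1.0, yet u_star is about 0.81: the maxima do not correspond.
```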

Exercise 22.8.^[2] (a) A photon counter is pointed at a remote star for one minute, in order to infer the brightness, i.e., the rate of photons arriving at the counter per minute, λ. Assuming the number of photons collected r has a Poisson distribution with mean λ,

P(r \mid \lambda) = \exp(-\lambda) \frac{\lambda^r}{r!},  (22.30)

what is the maximum likelihood estimate for λ, given r = 9? Find error bars on ln λ.

(b) Same situation, but now we assume that the counter detects not only photons from the star but also 'background' photons. The background rate of photons is known to be b = 13 photons per minute. We assume the number of photons collected, r, has a Poisson distribution with mean λ + b. Now, given r = 9 detected photons, what is the maximum likelihood estimate for λ? Comment on this answer, discussing also the Bayesian posterior distribution, and the 'unbiased estimator' of sampling theory, λ̂ ≡ r − b.

Exercise 22.9.^[2] A bent coin is tossed N times, giving N_a heads and N_b tails. Assume a beta distribution prior for the probability of heads, p, for example the uniform distribution. Find the maximum likelihood and maximum a posteriori values of p, then find the maximum likelihood and maximum a posteriori values of the logit a ≡ ln[p/(1−p)]. Compare with the predictive distribution, i.e., the probability that the next toss will come up heads.

Exercise 22.10.^[2] Two men looked through prison bars; one saw stars, the other tried to infer where the window frame was.

From the other side of a room, you look through a window and see stars at locations {(x_n, y_n)}. You can't see the window edges because it is totally dark apart from the stars.

[Margin figure: a rectangular window frame from (x_min, y_min) to (x_max, y_max), containing scattered stars.]
Assuming that the window is rectangular and that the visible stars' locations are independently randomly distributed, what are the inferred values of (x_min, y_min), (x_max, y_max), according to maximum likelihood? Sketch the likelihood as a function of x_max, for fixed x_min, y_min, and y_max.

Exercise 22.11.^[3] A sailor infers his location (x, y) by measuring the bearings of three buoys whose locations (x_n, y_n) are given on his chart. Let the true bearings of the buoys be θ_n. Assuming that his measurement of each bearing is subject to Gaussian noise of small standard deviation σ, what is his inferred location, by maximum likelihood?

The sailor's rule of thumb says that the boat's position can be taken to be the centre of the cocked hat, the triangle produced by the intersection of the three measured bearings (figure 22.8). Can you persuade him that the maximum likelihood answer is better?

[Figure 22.8. Drawing three slightly inconsistent bearings on a chart, from buoys at (x1, y1), (x2, y2), (x3, y3), produces a triangle called a cocked hat. Where is the sailor?]

Exercise 22.12.^[3, p.310] Maximum likelihood fitting of an exponential-family model. Assume that a variable x comes from a probability distribution of the form

P(x \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\Bigl( \sum_k w_k f_k(x) \Bigr),  (22.31)

where the functions f_k(x) are given, and the parameters w = {w_k} are not known. A data set {x^(n)} of N points is supplied. Show by differentiating the log likelihood that the maximum-likelihood parameters w_ML satisfy

\sum_x P(x \mid \mathbf{w}_{\rm ML}) f_k(x) = \frac{1}{N} \sum_n f_k(x^{(n)}),  (22.32)

where the left-hand sum is over all x, and the right-hand sum is over the data points. A shorthand for this result is that each function-average under the fitted model must equal the function-average found in the data:

\langle f_k \rangle_{P(x \mid \mathbf{w}_{\rm ML})} = \langle f_k \rangle_{\rm Data}.  (22.33)

Exercise 22.13.^[3] 'Maximum entropy' fitting of models to constraints. When confronted by a probability distribution P(x) about which only a few facts are known, the maximum entropy principle (maxent) offers a rule for choosing a distribution that satisfies those constraints. According to maxent, you should select the P(x) that maximizes the entropy

H = \sum_x P(x) \log 1/P(x),  (22.34)

subject to the constraints. Assuming the constraints assert that the averages of certain functions f_k(x) are known, i.e.,

\langle f_k \rangle_{P(x)} = F_k,  (22.35)

show, by introducing Lagrange multipliers (one for each constraint, including normalization), that the maximum-entropy distribution has the form

P_{\rm Maxent}(x) = \frac{1}{Z} \exp\Bigl( \sum_k w_k f_k(x) \Bigr),  (22.36)

where the parameters Z and {w_k} are set such that the constraints (22.35) are satisfied.

And hence the maximum entropy method gives identical results to maximum likelihood fitting of an exponential-family model (previous exercise).
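The form (22.36) can be made concrete with a single constraint f1(x) = x on a small discrete set. In this sketch (the set {0, …, 9} and the target average 6.0 are illustrative, not from the text) the weight w is found by bisection, after which the constraint (22.35) holds:

```python
import math

def expfam(w, xs):
    """P(x) proportional to exp(w*x) on a finite set:
    the maxent form (22.36) with one constraint."""
    ws = [math.exp(w * x) for x in xs]
    Z = sum(ws)
    return [p / Z for p in ws]

def mean(w, xs):
    return sum(x * p for x, p in zip(xs, expfam(w, xs)))

def solve_w(xs, F, lo=-10.0, hi=10.0):
    """Bisect for the w that makes the model mean equal the target F;
    mean(w, xs) is increasing in w, so bisection works."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if mean(mid, xs) < F:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

xs = list(range(10))
w = solve_w(xs, F=6.0)
# The fitted distribution now satisfies <x> = 6.0, as (22.35) demands.
```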
The maximum entropy method has sometimes been recommended as a method for assigning prior distributions in Bayesian modelling. While the outcomes of the maximum entropy method are sometimes interesting and thought-provoking, I do not advocate maxent as the approach to assigning priors.

Maximum entropy is also sometimes proposed as a method for solving inference problems: for example, 'given that the mean score of this unfair six-sided die is 2.5, what is its probability distribution (p1, p2, p3, p4, p5, p6)?' I think it is a bad idea to use maximum entropy in this way; it can give silly answers. The correct way to solve inference problems is to use Bayes' theorem.

Exercises where maximum likelihood and MAP have difficulties

Exercise 22.14.^[2] This exercise explores the idea that maximizing a probability density is a poor way to find a point that is representative of the density. Consider a Gaussian distribution in a k-dimensional space,

P(\mathbf{w}) = \frac{1}{(\sqrt{2\pi}\,\sigma_W)^k} \exp\Bigl( -\sum_{i=1}^{k} w_i^2 / 2\sigma_W^2 \Bigr).

Show that nearly all of the probability mass of a Gaussian is in a thin shell of radius r = \sqrt{k}\,\sigma_W and of thickness proportional to r/\sqrt{k}. For example, in 1000 dimensions, 90% of the mass of a Gaussian with σ_W = 1 is in a shell of radius 31.6 and thickness 2.8. However, the probability density at the origin is e^{k/2} ≈ 10^{217} times bigger than the density at this shell where most of the probability mass is.

Now consider two Gaussian densities in 1000 dimensions that differ in radius σ_W by just 1%, and that contain equal total probability mass. Show that the maximum probability density is greater at the centre of the Gaussian with smaller σ_W by a factor of ∼ exp(0.01 k) ≈ 20 000.

In ill-posed problems, a typical posterior distribution is often a weighted superposition of Gaussians with varying means and standard deviations, so the true posterior has a skew peak, with the maximum of the probability density located near the mean of the Gaussian distribution that has the smallest standard deviation, not the Gaussian with the greatest weight.

[Margin, figure 22.9 data. Scientist: x_n. A: −27.020; B: 3.570; C: 8.191; D: 9.898; E: 9.603; F: 9.945; G: 10.056.]

Exercise 22.15.^[3] The seven scientists. N datapoints {x_n} are drawn from
[Figure 22.9. Seven measurements {x_n} of a parameter μ by seven scientists, each having his own noise-level σ_n.]

N distributions, all of which are Gaussian with a common mean μ but with different unknown standard deviations σ_n. What are the maximum likelihood parameters μ, {σ_n} given the data? For example, seven scientists (A, B, C, D, E, F, G) with wildly-differing experimental skills measure μ. You expect some of them to do accurate work (i.e., to have small σ_n), and some of them to turn in wildly inaccurate answers (i.e., to have enormous σ_n). Figure 22.9 shows their seven results. What is μ, and how reliable is each scientist?

I hope you agree that, intuitively, it looks pretty certain that A and B are both inept measurers, that D–G are better, and that the true value of μ is somewhere close to 10. But what does maximizing the likelihood tell you?

Exercise 22.16.^[3] Problems with MAP method. A collection of widgets i = 1, …, k have a property called 'wodge', w_i, which we measure, widget by widget, in noisy experiments with a known noise level σ_ν = 1.0. Our model for these quantities is that they come from a Gaussian prior P(w_i | α) = Normal(0, 1/α), where α = 1/σ_W² is not known. Our prior for this variance is flat over log σ_W from σ_W = 0.1 to σ_W = 10.

Scenario 1. Suppose four widgets have been measured and give the following data: {d1, d2, d3, d4} = {2.2, −2.2, 2.8, −2.8}. We are interested in inferring the wodges of these four widgets.

(a) Find the values of w and α that maximize the posterior probability P(w, log α | d).

(b) Marginalize over α and find the posterior probability density of w given the data. [Integration skills required. See MacKay (1999a) for solution.] Find maxima of P(w | d). [Answer: two maxima; one at w_MP = {1.8, −1.8, 2.2, −2.2}, with error bars on all four parameters

(obtained from Gaussian approximation to the posterior) ±0.9; and one at w′_MP = {0.03, −0.03, 0.04, −0.04} with error bars ±0.1.]

Scenario 2. Suppose that in addition to the four measurements above we are now informed that there are four more widgets that have been measured with a much less accurate instrument, having σ′_ν = 100.0. Thus we now have both well-determined and ill-determined parameters, as in a typical ill-posed problem. The data from these measurements were a string of uninformative values, {d5, d6, d7, d8} = {100, −100, 100, −100}.

We are again asked to infer the wodges of the widgets. Intuitively, our inferences about the well-measured widgets should be negligibly affected by this vacuous information about the poorly-measured widgets. But what happens to the MAP method?

(a) Find the values of w and α that maximize the posterior probability P(w, log α | d).

(b) Find maxima of P(w | d). [Answer: only one maximum, w_MP = {0.03, −0.03, 0.03, −0.03, 0.0001, −0.0001, 0.0001, −0.0001}, with error bars on all eight parameters ±0.11.]

22.6 Solutions

Solution to exercise 22.5 (p.302). Figure 22.10 shows a contour plot of the likelihood function for the 32 data points. The peaks are pretty-near centred on the points (μ1, μ2) = (1, 5) and (μ1, μ2) = (5, 1), and are pretty-near circular in their contours. The width of each of the peaks is a standard deviation of σ/√16 = 1/4. The peaks are roughly Gaussian in shape.

[Figure 22.10. The likelihood as a function of μ1 and μ2.]
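The structure of figure 22.10 can be reproduced with a quick grid search over (μ1, μ2). This sketch uses a small synthetic data set (eight illustrative points standing in for the book's 32) and σ = 1; the grid maximum lands near (1, 5), or, by the symmetry of the mixture, equivalently near (5, 1):

```python
import math

def loglik(mu1, mu2, xs, sigma=1.0):
    """Log likelihood of the two-Gaussian mixture (22.16), p1 = p2 = 1/2."""
    ll = 0.0
    for x in xs:
        g1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
        g2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
        ll += math.log(0.5 * (g1 + g2) / math.sqrt(2 * math.pi * sigma ** 2))
    return ll

# Two tight clusters centred on 1 and 5 (illustrative stand-in data).
xs = [mu + d for mu in (1.0, 5.0) for d in (-0.3, -0.1, 0.1, 0.3)]
grid = [0.1 * i for i in range(61)]              # mu values 0.0 .. 6.0
best = max(((m1, m2) for m1 in grid for m2 in grid),
           key=lambda p: loglik(p[0], p[1], xs))
```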
Solution to exercise 22.12 (p.307). The log likelihood is:

\ln P(\{x^{(n)}\} \mid \mathbf{w}) = -N \ln Z(\mathbf{w}) + \sum_n \sum_k w_k f_k(x^{(n)}).  (22.37)

\frac{\partial}{\partial w_k} \ln P(\{x^{(n)}\} \mid \mathbf{w}) = -N \frac{\partial}{\partial w_k} \ln Z(\mathbf{w}) + \sum_n f_k(x^{(n)}).  (22.38)

Now, the fun part is what happens when we differentiate the log of the normalizing constant:

\frac{\partial}{\partial w_k} \ln Z(\mathbf{w}) = \frac{1}{Z(\mathbf{w})} \sum_x \frac{\partial}{\partial w_k} \exp\Bigl( \sum_{k'} w_{k'} f_{k'}(x) \Bigr)
= \frac{1}{Z(\mathbf{w})} \sum_x \exp\Bigl( \sum_{k'} w_{k'} f_{k'}(x) \Bigr) f_k(x) = \sum_x P(x \mid \mathbf{w}) f_k(x),  (22.39)

so

\frac{\partial}{\partial w_k} \ln P(\{x^{(n)}\} \mid \mathbf{w}) = -N \sum_x P(x \mid \mathbf{w}) f_k(x) + \sum_n f_k(x^{(n)}),  (22.40)

and at the maximum of the likelihood,

\sum_x P(x \mid \mathbf{w}_{\rm ML}) f_k(x) = \frac{1}{N} \sum_n f_k(x^{(n)}).  (22.41)

23 Useful Probability Distributions

In Bayesian data modelling, there's a small collection of probability distributions that come up again and again. The purpose of this chapter is to introduce these distributions so that they won't be intimidating when encountered in combat situations.

There is no need to memorize any of them, except perhaps the Gaussian; if a distribution is important enough, it will memorize itself, and otherwise, it can easily be looked up.

23.1 Distributions over integers

Binomial, Poisson, exponential.

We already encountered the binomial distribution and the Poisson distribution on page 2.

The binomial distribution for an integer r with parameters f (the bias, f ∈ [0, 1]) and N (the number of trials) is:

P(r \mid f, N) = \binom{N}{r} f^r (1-f)^{N-r}, \quad r \in \{0, 1, 2, \ldots, N\}.  (23.1)

[Figure 23.1. The binomial distribution P(r | f = 0.3, N = 10), on a linear scale (top) and a logarithmic scale (bottom).]

The binomial distribution arises, for example, when we flip a bent coin, with bias f, N times, and observe the number of heads, r.

The Poisson distribution with parameter λ > 0 is:

P(r \mid \lambda) = e^{-\lambda} \frac{\lambda^r}{r!}, \quad r \in \{0, 1, 2, \ldots\}.  (23.2)

The Poisson distribution arises, for example, when we count the number of photons r that arrive in a pixel during a fixed interval, given that the mean intensity on the pixel corresponds to an average number of photons λ.

The exponential distribution on integers,

P(r \mid f) = f^r (1-f), \quad r \in (0, 1, 2, \ldots, \infty),  (23.3)
[Figure 23.2. The Poisson distribution P(r | λ = 2.7), on a linear scale (top) and a logarithmic scale (bottom).]

arises in waiting problems. How long will you have to wait until a six is rolled, if a fair six-sided die is rolled? Answer: the probability distribution of the number of rolls, r, is exponential over integers with parameter f = 5/6. The distribution may also be written

P(r \mid f) = (1-f)\, e^{-\lambda r}, \quad r \in (0, 1, 2, \ldots, \infty),  (23.4)

where λ = ln(1/f).
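The die example can be checked by simulation. In this sketch (the seed and sample size are illustrative) r counts the non-six rolls preceding the first six, which is distributed as P(r | f) = f^r (1 − f) with f = 5/6 and therefore has mean f/(1 − f) = 5:

```python
import random

def failures_before_six(rng):
    """Roll a fair die until a six appears; return the number of
    non-six rolls, distributed as f**r * (1 - f) with f = 5/6."""
    r = 0
    while rng.randrange(1, 7) != 6:
        r += 1
    return r

rng = random.Random(2)
samples = [failures_before_six(rng) for _ in range(20000)]
mean_r = sum(samples) / len(samples)   # expect f/(1-f) = 5
```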

23.2 Distributions over unbounded real numbers

Gaussian, Student, Cauchy, biexponential, inverse-cosh.

The Gaussian distribution or normal distribution with mean μ and standard deviation σ is

P(x \mid \mu, \sigma) = \frac{1}{Z} \exp\Bigl( -\frac{(x-\mu)^2}{2\sigma^2} \Bigr), \quad x \in (-\infty, \infty),  (23.5)

where

Z = \sqrt{2\pi\sigma^2}.  (23.6)

It is sometimes useful to work with the quantity τ ≡ 1/σ², which is called the precision parameter of the Gaussian.

A sample z from a standard univariate Gaussian can be generated by computing

z = \cos(2\pi u_1) \sqrt{2 \ln(1/u_2)},  (23.7)

where u1 and u2 are uniformly distributed in (0, 1). A second sample z2 = sin(2π u1) √(2 ln(1/u2)), independent of the first, can then be obtained for free.

The Gaussian distribution is widely used and often asserted to be a very common distribution in the real world, but I am sceptical about this assertion. Yes, unimodal distributions may be common; but a Gaussian is a special, rather extreme, unimodal distribution. It has very light tails: the log-probability-density decreases quadratically. The typical deviation of x from μ is σ, but the respective probabilities that x deviates from μ by more than 2σ, 3σ, 4σ, and 5σ are 0.046, 0.003, 6 × 10⁻⁵, and 6 × 10⁻⁷. In my experience, deviations from a mean four or five times greater than the typical deviation may be rare, but not as rare as 6 × 10⁻⁵! I therefore urge caution in the use of Gaussian distributions: if a variable that is modelled with a Gaussian actually has a heavier-tailed distribution, the rest of the model will contort itself to reduce the deviations of the outliers, like a sheet of paper being crushed by a rubber band.

Exercise 23.1.^[1] Pick a variable that is supposedly
bell-shaped in probability distribution, gather data, and make a plot of the variable's empirical distribution. Show the distribution as a histogram on a log scale and investigate whether the tails are well-modelled by a Gaussian distribution. [One example of a variable to study is the amplitude of an audio signal.]

One distribution with heavier tails than a Gaussian is a mixture of Gaussians. A mixture of two Gaussians, for example, is defined by two means, two standard deviations, and two mixing coefficients π1 and π2, satisfying π1 + π2 = 1, π_i ≥ 0:

P(x \mid \mu_1, \sigma_1, \pi_1, \mu_2, \sigma_2, \pi_2) = \frac{\pi_1}{\sqrt{2\pi}\,\sigma_1} \exp\Bigl( -\frac{(x-\mu_1)^2}{2\sigma_1^2} \Bigr) + \frac{\pi_2}{\sqrt{2\pi}\,\sigma_2} \exp\Bigl( -\frac{(x-\mu_2)^2}{2\sigma_2^2} \Bigr).

If we take an appropriately weighted mixture of an infinite number of Gaussians, all having mean μ, we obtain a Student-t distribution,

P(x \mid \mu, s, n) = \frac{1}{Z} \frac{1}{\bigl( 1 + (x-\mu)^2/(n s^2) \bigr)^{(n+1)/2}},  (23.8)

where

Z = \sqrt{\pi n s^2}\, \frac{\Gamma(n/2)}{\Gamma((n+1)/2)}  (23.9)

[Figure 23.3. Three unimodal distributions. Two Student distributions, with parameters (m, s) = (1, 1) (heavy line; a Cauchy distribution) and (2, 4) (light line), and a Gaussian distribution with mean μ = 3 and standard deviation σ = 3 (dashed line), shown on linear vertical scales (top) and logarithmic vertical scales (bottom). Notice that the heavy tails of the Cauchy distribution are scarcely evident in the upper 'bell-shaped curve'.]

and n is called the number of degrees of freedom and Γ is the gamma function.

If n > 1 then the Student distribution (23.8) has a mean and that mean is μ. If n > 2 the distribution also has a finite variance, σ² = ns²/(n−2). As n → ∞, the Student distribution approaches the normal distribution with mean μ and standard deviation s. The Student distribution arises both in classical statistics (as the sampling-theoretic distribution of certain statistics) and in Bayesian inference (as the probability distribution of a variable coming from a Gaussian distribution whose standard deviation we aren't sure of).

In the special case n = 1, the Student distribution is called the Cauchy distribution.

A distribution whose tails are intermediate in heaviness between Student and Gaussian is the biexponential distribution,

P(x \mid \mu, s) = \frac{1}{Z} \exp\Bigl( -\frac{|x-\mu|}{s} \Bigr), \quad x \in (-\infty, \infty),  (23.10)

where

Z = 2s.  (23.11)

The inverse-cosh distribution

P(x \mid \beta) \propto \frac{1}{[\cosh(\beta x)]^{1/\beta}}  (23.12)

is a popular model in independent component analysis. In the limit of large β, the probability distribution P(x | β) becomes a biexponential distribution. In the limit β → 0, P(x | β) approaches a Gaussian with mean zero and variance 1/β.

23.3 Distributions over positive real numbers

Exponential, gamma, inverse-gamma, and log-normal.

The exponential distribution,

P(x \mid s) = \frac{1}{Z} \exp\Bigl( -\frac{x}{s} \Bigr), \quad x \in (0, \infty),  (23.13)

where

Z = s,  (23.14)

arises in waiting problems. How long will you have to wait for a bus in Poissonville, given that buses arrive independently at random with one every s minutes on average? Answer: the probability distribution of your wait, x, is exponential with mean s.
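The Poissonville answer can be verified by building the bus process from exponential inter-arrival gaps and measuring the wait from a fixed arrival time T (the values s = 7 and T = 100 are illustrative). By memorylessness the mean wait is the full s, not s/2:

```python
import random

def wait_for_bus(s, T, rng):
    """Buses arrive at random, one every s minutes on average (a Poisson
    process: exponential gaps). You show up at time T; return your wait."""
    t = 0.0
    while True:
        t += rng.expovariate(1.0 / s)
        if t >= T:
            return t - T

rng = random.Random(4)
waits = [wait_for_bus(s=7.0, T=100.0, rng=rng) for _ in range(20000)]
mean_wait = sum(waits) / len(waits)   # expect s = 7, not s/2
```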
The gamma distribution is like a Gaussian distribution, except whereas the Gaussian goes from −∞ to ∞, gamma distributions go from 0 to ∞. Just as the Gaussian distribution has two parameters μ and σ which control the mean and width of the distribution, the gamma distribution has two parameters. It is the product of the one-parameter exponential distribution (23.13) with a polynomial, x^{c−1}. The exponent c in the polynomial is the second parameter.

P(x \mid s, c) = \Gamma(x;\, s, c) = \frac{1}{Z} \Bigl( \frac{x}{s} \Bigr)^{c-1} \exp\Bigl( -\frac{x}{s} \Bigr), \quad 0 \le x < \infty,  (23.15)

where

Z = \Gamma(c)\, s.  (23.16)
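For integer c, a Γ(x; s, c) variable is a sum of c independent exponentials of mean s (compare the arrival-time density, equation (23.22)). A quick simulation with illustrative parameter values confirms the mean sc and variance s²c quoted in the text:

```python
import random

def gamma_sample(s, c, rng):
    """For integer c, draw from Gamma(s, c) as the sum of c independent
    exponentials of mean s."""
    return sum(rng.expovariate(1.0 / s) for _ in range(c))

rng = random.Random(1)
xs = [gamma_sample(1.0, 3, rng) for _ in range(20000)]
m = sum(xs) / len(xs)                          # expect s*c = 3
v = sum((x - m) ** 2 for x in xs) / len(xs)    # expect s*s*c = 3
```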

[Figure 23.4. Two gamma distributions, with parameters (s, c) = (1, 3) (heavy lines) and (s, c) = (10, 0.3) (light lines), shown on linear vertical scales (top) and logarithmic vertical scales (bottom); shown as a function of x on the left (23.15) and l = ln x on the right (23.18).]

This is a simple peaked distribution with mean sc and variance s²c.

It is often natural to represent a positive real variable x in terms of its natural logarithm l = ln x. The probability density of l is

P(l) = P(x(l)) \Bigl| \frac{\partial x}{\partial l} \Bigr| = P(x(l))\, x(l)  (23.17)

= \frac{1}{Z_l} \Bigl( \frac{x(l)}{s} \Bigr)^{c} \exp\Bigl( -\frac{x(l)}{s} \Bigr),  (23.18)

where

Z_l = \Gamma(c).  (23.19)

[The gamma distribution is named after its normalizing constant, an odd convention, it seems to me!]

Figure 23.4 shows a couple of gamma distributions as a function of x and of l. Notice that where the original gamma distribution (23.15) may have a 'spike' at x = 0, the distribution over l never has such a spike. The spike is an artefact of a bad choice of basis.

In the limit sc = 1, c → 0, we obtain the noninformative prior for a scale parameter, the 1/x prior. This improper prior is called noninformative because it has no associated length scale, no characteristic value of x, so it prefers all values of x equally. It is invariant under the reparameterization x = mx. If we transform the 1/x probability density into a density over l = ln x we find the latter density is uniform.

Exercise 23.2.^[1] Imagine
we reparameterize a positiv e variable x in terms that 1 = 3 u = x of its cub e root, . If the probabilit y densit y of x is the improp er distribution 1 , what is the probabilit y densit y of u ? =x l gamma ays a unimo dal densit y over is alw = ln x , and, The distribution be seen in the gures, it is asymmetric. If x as can a gamma distribution, has and to work in terms of the inverse of we decide , v = 1 =x , we obtain a new x distribution, in whic h the densit y over l is ipp ed left-for-righ t: the probabilit y , densit is called an inverse-gamma distribution v y of +1 c 1 1 1 1 ; (23.20) 0 v < exp ) = s;c j v P ( Z sv sv v where Z (23.21) = ( c ) =s: v
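The normalizing constant Z_v = \Gamma(c)/s in (23.21) can be confirmed numerically by integrating the density in (23.20). A small Python sketch (not from the book; the integration range and step count are arbitrary):

```python
import math

def inverse_gamma_density(v, s, c):
    """Inverse-gamma density of equation (23.20), with Z_v = Gamma(c) / s (equation 23.21)."""
    Z_v = math.gamma(c) / s
    return (1.0 / (s * v)) ** (c + 1) * math.exp(-1.0 / (s * v)) / Z_v

# Midpoint-rule check that the density integrates to 1 for (s, c) = (1, 3).
s, c = 1.0, 3.0
a, b, n = 1e-4, 50.0, 200000
h = (b - a) / n
mass = sum(inverse_gamma_density(a + (i + 0.5) * h, s, c) for i in range(n)) * h
print(mass)  # ≈ 1
```

The lower limit is taken slightly above zero because the density vanishes extremely fast as v -> 0 (the exp(-1/(sv)) factor underflows there), and the tail beyond v = 50 is negligible for these parameters.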

[Figure 23.5: Two inverse gamma distributions, with parameters (s, c) = (1, 3) (heavy lines) and (s, c) = (10, 0.3) (light lines), shown on linear vertical scales (top) and logarithmic vertical scales (bottom), as a function of v on the left and ln v on the right.]

Gamma and inverse gamma distributions crop up in many inference problems in which a positive quantity is inferred from data. Examples include inferring the variance of Gaussian noise from some noise samples, and inferring the rate parameter of a Poisson distribution from the count.

Gamma distributions also arise naturally in the distributions of waiting times between Poisson-distributed events. Given a Poisson process with rate \lambda, the probability density of the arrival time x of the mth event is

    \frac{\lambda (\lambda x)^{m-1}}{(m-1)!} \, e^{-\lambda x}.     (23.22)

Log-normal distribution

Another distribution over a positive real number x is the log-normal distribution, which is the distribution that results when l = ln x has a normal distribution. We define m to be the median value of x, and s to be the standard deviation of ln x.

    P(l | m, s) = \frac{1}{Z} \exp\left( -\frac{(l - \ln m)^2}{2 s^2} \right), \quad l \in (-\infty, \infty),     (23.23)

where

    Z = \sqrt{2 \pi s^2},     (23.24)

implies

    P(x | m, s) = \frac{1}{x} \, \frac{1}{Z} \exp\left( -\frac{(\ln x - \ln m)^2}{2 s^2} \right), \quad x \in (0, \infty).     (23.25)

[Figure 23.6: Two log-normal distributions, with parameters (m, s) = (3, 1.8) (heavy line) and (3, 0.7) (light line), shown on linear vertical scales (top) and logarithmic vertical scales (bottom). Yes, they really do have the same value of the median, m = 3.]

23.4 Distributions over periodic variables

A periodic variable \theta is a real number \in [0, 2\pi] having the property that \theta = 0 and \theta = 2\pi are equivalent.

A distribution that plays for periodic variables the role played by the Gaussian distribution for real variables is the Von Mises distribution:

    P(\theta | \mu, \beta) = \frac{1}{Z} \exp\left( \beta \cos(\theta - \mu) \right), \quad \theta \in (0, 2\pi].     (23.26)

The normalizing constant is Z = 2\pi I_0(\beta), where I_0(x) is a modified Bessel function.
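The claim Z = 2\pi I_0(\beta) can be checked by comparing a direct quadrature of the unnormalized density in (23.26) with the standard power series I_0(x) = \sum_k (x/2)^{2k}/(k!)^2. A Python sketch (not from the book; \beta and \mu are arbitrary test values):

```python
import math

def bessel_i0(x, terms=40):
    """Modified Bessel function I_0 via its power series sum_k (x/2)^(2k) / (k!)^2."""
    return sum((x / 2.0) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

beta, mu = 2.5, 1.0
n = 20000
h = 2.0 * math.pi / n
# Midpoint-rule integral of exp(beta * cos(theta - mu)) over one period.
Z_quadrature = sum(math.exp(beta * math.cos((i + 0.5) * h - mu)) for i in range(n)) * h
Z_formula = 2.0 * math.pi * bessel_i0(beta)
print(Z_quadrature, Z_formula)  # both ≈ 20.67
```

Because the integrand is smooth and periodic, the midpoint rule converges very quickly here, so the two numbers agree essentially to machine precision.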

A distribution that arises from Brownian diffusion around the circle is the wrapped Gaussian distribution,

    P(\theta | \mu, \sigma) = \sum_{n=-\infty}^{\infty} \text{Normal}(\theta; (\mu + 2\pi n), \sigma^2), \quad \theta \in (0, 2\pi].     (23.27)

23.5 Distributions over probabilities

Beta distribution, Dirichlet distribution, entropic distribution

The beta distribution is a probability density over a variable p that is a probability, p \in (0, 1):

    P(p | u_1, u_2) = \frac{1}{Z(u_1, u_2)} \, p^{u_1 - 1} (1 - p)^{u_2 - 1}.     (23.28)

The parameters u_1, u_2 may take any positive value. The normalizing constant is the beta function,

    Z(u_1, u_2) = \frac{\Gamma(u_1)\,\Gamma(u_2)}{\Gamma(u_1 + u_2)}.     (23.29)

Special cases include the uniform distribution (u_1 = 1, u_2 = 1); the Jeffreys prior (u_1 = 0.5, u_2 = 0.5); and the improper Laplace prior (u_1 = 0, u_2 = 0). If we transform the beta distribution to the corresponding density over the logit, l \equiv \ln \frac{p}{1-p}, we find it is always a pleasant bell-shaped density over l, while the density over p may have singularities at p = 0 and p = 1 (figure 23.7).

[Figure 23.7: Three beta distributions, with (u_1, u_2) = (0.3, 1), (1.3, 1), and (12, 2). The upper figure shows P(p | u_1, u_2) as a function of p; the lower shows the corresponding density over the logit, ln p/(1-p). Notice how well-behaved the densities are as a function of the logit.]

More dimensions

The Dirichlet distribution is a density over an I-dimensional vector p whose I components are positive and sum to 1. The beta distribution is a special case of a Dirichlet distribution with I = 2. The Dirichlet distribution is parameterized by a measure u (a vector with all coefficients u_i > 0) which I will write here as u = \alpha m, where m is a normalized measure over the I components (\sum m_i = 1), and \alpha is positive:

    P(p | \alpha m) = \frac{1}{Z(\alpha m)} \prod_{i=1}^{I} p_i^{\alpha m_i - 1} \, \delta\!\left( \sum_i p_i - 1 \right) \equiv \text{Dirichlet}^{(I)}(p | \alpha m).     (23.30)

The function \delta(x) is the Dirac delta function, which restricts the distribution to the simplex such that p is normalized, i.e., \sum_i p_i = 1. The normalizing constant of the Dirichlet distribution is:

    Z(\alpha m) = \prod_i \Gamma(\alpha m_i) \,/\, \Gamma(\alpha).     (23.31)

The vector m is the mean of the probability distribution:

    \int \text{Dirichlet}^{(I)}(p | \alpha m) \, p \, d^{I}p = m.     (23.32)

When working with a probability vector p, it is often helpful to work in the 'softmax basis', in which, for example, a three-dimensional probability p = (p_1, p_2, p_3) is represented by three numbers a_1, a_2, a_3 satisfying a_1 + a_2 + a_3 = 0 and

    p_i = \frac{1}{Z} e^{a_i}, \quad \text{where } Z = \sum_i e^{a_i}.     (23.33)

This nonlinear transformation is analogous to the \sigma \to \ln \sigma transformation for a scale variable and the logit transformation for a single probability, p \to \ln \frac{p}{1-p}.
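Equation (23.32), that m is the mean, can be checked by Monte Carlo. A Dirichlet sample can be drawn by normalizing independent gamma draws (a standard construction, not described in this chapter); the values of \alpha and m below are arbitrary test choices. A Python sketch:

```python
import random

random.seed(0)

def dirichlet_sample(u):
    """Draw p ~ Dirichlet(u) by normalizing independent Gamma(u_i, 1) draws."""
    g = [random.gammavariate(u_i, 1.0) for u_i in u]
    total = sum(g)
    return [g_i / total for g_i in g]

alpha, m = 10.0, [0.2, 0.3, 0.5]
u = [alpha * m_i for m_i in m]
n_samples = 20000
sums = [0.0, 0.0, 0.0]
for _ in range(n_samples):
    p = dirichlet_sample(u)
    for i in range(3):
        sums[i] += p[i]
means = [s_i / n_samples for s_i in sums]
print(means)  # ≈ m = [0.2, 0.3, 0.5], the Dirichlet mean of equation (23.32)
```

With 20000 samples the Monte Carlo error in each component is well below 0.01, so the agreement with m is easy to see.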

[Figure 23.8: Three Dirichlet distributions over a three-dimensional probability vector (p_1, p_2, p_3), with parameters u = (0.2, 1, 2), u = (20, 10, 7), and u = (0.2, 0.3, 0.15). The upper figures show 1000 random draws from each distribution, showing the values of p_1 and p_2 on the two axes; p_3 = 1 - (p_1 + p_2). The triangle in the first figure is the simplex of legal probability distributions. The lower figures show the same points in the 'softmax' basis (equation (23.33)). The two axes show a_1 and a_2.]

In the softmax basis, the ugly minus-ones in the exponents in the Dirichlet distribution (23.30) disappear, and the density is given by:

    P(a | \alpha m) \propto \frac{1}{Z(\alpha m)} \prod_{i=1}^{I} p_i^{\alpha m_i} \, \delta\!\left( \sum_i a_i \right).     (23.34)

The role of the parameter \alpha can be characterized in two ways. First, \alpha measures the sharpness of the distribution (figure 23.8); it measures how different we expect typical samples p from the distribution to be from the mean m, just as the precision \tau = 1/\sigma^2 of a Gaussian measures how far samples stray from its mean. A large value of \alpha produces a distribution over p that is sharply peaked around m. The effect of \alpha in higher-dimensional situations can be visualized by drawing a typical sample p from the distribution Dirichlet^{(I)}(p | \alpha m), with m set to the uniform vector m_i = 1/I, and making a Zipf plot, that is, a ranked plot of the values of the components p_i. It is traditional to plot both p_i (vertical axis) and the rank (horizontal axis) on logarithmic scales so that power law relationships appear as straight lines. Figure 23.9 shows these plots for a single sample from ensembles with I = 100 and I = 1000 and with \alpha from 0.1 to 1000. For large \alpha, the plot is shallow with many components having similar values. For small \alpha, typically one component p_i receives an overwhelming share of the probability, and of the small probability that remains to be shared among the other components, another component p_i receives a similarly large share. In the limit as \alpha goes to zero, the plot tends to an increasingly steep power law.

[Figure 23.9: Zipf plots for random samples from Dirichlet distributions with various values of \alpha = 0.1 ... 1000. For each value of I = 100 or 1000 and each \alpha, one sample p from the Dirichlet distribution was generated. The Zipf plot shows the probabilities p_i, ranked by magnitude, versus their rank.]

Second, we can characterize the role of \alpha in terms of the predictive distribution that results when we observe samples from p and obtain counts F = (F_1, F_2, ..., F_I) of the possible outcomes. The value of \alpha defines the number of samples from p that are required in order that the data dominate over the prior in predictions.

Exercise 23.3.[3] The Dirichlet distribution satisfies a nice additivity property. Imagine that a biased six-sided die has two red faces and four blue faces. The die is rolled N times and two Bayesians examine the outcomes in order to infer the bias of the die and make predictions. One Bayesian has access only to the red/blue colour outcomes, and he infers a two-component probability vector (p_R, p_B). The other Bayesian has access to each full outcome: he can see which of the six faces came up, and he infers a six-component probability vector (p_1, p_2, p_3, p_4, p_5, p_6), where p_R = p_1 + p_2 and p_B = p_3 + p_4 + p_5 + p_6. Assuming that the second Bayesian assigns to (p_1, p_2, p_3, p_4, p_5, p_6) a Dirichlet distribution with hyperparameters (u_1, u_2, u_3, u_4, u_5, u_6), show that, in order for the first Bayesian's inferences to be consistent with those of the second Bayesian, the first Bayesian's prior should be a Dirichlet distribution with hyperparameters ((u_1 + u_2), (u_3 + u_4 + u_5 + u_6)).

Hint: a brute-force approach is to compute the integral P(p_R, p_B) = \int d^6 p \, P(p | u) \, \delta(p_R - (p_1 + p_2)) \, \delta(p_B - (p_3 + p_4 + p_5 + p_6)). A cheaper approach is to compute the predictive distributions, given arbitrary data (F_1, F_2, F_3, F_4, F_5, F_6), and find the condition for the two predictive distributions to match for all data.

The entropic distribution for a probability vector p is sometimes used in the 'maximum entropy' image reconstruction community.

    P(p | \alpha, m) = \frac{1}{Z(\alpha, m)} \exp\left[ -\alpha D_{KL}(p \| m) \right] \, \delta\!\left( \sum_i p_i - 1 \right),     (23.35)

where m, the measure, is a positive vector, and D_{KL}(p \| m) = \sum_i p_i \log p_i / m_i.

Further reading

See (MacKay and Peto, 1995) for fun with Dirichlets.

23.6 Further exercises

Exercise 23.4.[2] N datapoints \{x_n\} are drawn from a gamma distribution P(x | s, c) = \Gamma(x; s, c) with unknown parameters s and c. What are the maximum likelihood parameters s and c?

24

Exact Marginalization

How can we avoid the exponentially large cost of complete enumeration of all hypotheses? Before we stoop to approximate methods, we explore two approaches to exact marginalization: first, marginalization over continuous variables (sometimes known as nuisance parameters) by doing integrals; and second, summation over discrete variables by message-passing.

Exact marginalization over continuous parameters is a macho activity enjoyed by those who are fluent in definite integration. This chapter uses gamma distributions; as was explained in the previous chapter, gamma distributions are a lot like Gaussian distributions, except that whereas the Gaussian goes from -\infty to \infty, gamma distributions go from 0 to \infty.

24.1 Inferring the mean and variance of a Gaussian distribution

We discuss again the one-dimensional Gaussian distribution, parameterized by a mean \mu and a standard deviation \sigma:

    P(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \equiv \text{Normal}(x; \mu, \sigma^2).     (24.1)

When inferring these parameters, we must specify their prior distribution. The prior gives us the opportunity to include specific knowledge that we have about \mu and \sigma (from independent experiments, or on theoretical grounds, for example). If we have no such knowledge, then we can construct an appropriate prior that embodies our supposed ignorance. In section 21.2, we assumed a uniform prior over the range of parameters plotted. If we wish to be able to perform exact marginalizations, it may be useful to consider conjugate priors; these are priors whose functional form combines naturally with the likelihood such that the inferences have a convenient form.

Conjugate priors for \mu and \sigma

The conjugate prior for a mean \mu is a Gaussian: we introduce two 'hyperparameters', \mu_0 and \sigma_\mu, which parameterize the prior on \mu, and write P(\mu | \mu_0, \sigma_\mu) = Normal(\mu; \mu_0, \sigma_\mu^2). In the limit \mu_0 = 0, \sigma_\mu \to \infty, we obtain the noninformative prior for a location parameter, the flat prior. This is noninformative because it is invariant under the natural reparameterization \mu' = \mu + c. The prior P(\mu) = const. is also an improper prior, that is, it is not normalizable.

The conjugate prior for a standard deviation \sigma is a gamma distribution, which has two parameters b_\beta and c_\beta. It is most convenient to define the prior density of the inverse variance (the precision parameter) \beta = 1/\sigma^2:

    P(\beta) = \Gamma(\beta; b_\beta, c_\beta) = \frac{1}{\Gamma(c_\beta)} \frac{\beta^{c_\beta - 1}}{b_\beta^{c_\beta}} \exp\left( -\frac{\beta}{b_\beta} \right), \quad 0 \le \beta < \infty.     (24.2)

This is a simple peaked distribution with mean b_\beta c_\beta and variance b_\beta^2 c_\beta. In the limit b_\beta c_\beta = 1, c_\beta \to 0, we obtain the noninformative prior for a scale parameter, the 1/\sigma prior. This is 'noninformative' because it is invariant under the reparameterization \sigma' = c\sigma. The 1/\sigma prior is less strange-looking if we examine the resulting density over ln \sigma, or ln \beta, which is flat. This is the prior that expresses ignorance about \sigma by saying 'well, it could be 10, or it could be 1, or it could be 0.1, . . . ' Scale variables such as \sigma are usually best represented in terms of their logarithm. Again, this noninformative 1/\sigma prior is improper.

[Reminder: when we change from variables \sigma to l(\sigma), a one-to-one function of \sigma, the probability density transforms from P_\sigma(\sigma) to P_l(l) = P_\sigma(\sigma) |\partial \sigma / \partial l|. Here, the Jacobian is |\partial \sigma / \partial \ln \sigma| = \sigma.]

In the following examples, I will use the improper noninformative priors for \mu and \sigma. Using improper priors is viewed as distasteful in some circles, so let me excuse myself by saying it's for the sake of readability; if I included proper priors, the calculations could still be done, but the key points would be obscured by the flood of extra parameters.

Maximum likelihood and marginalization: \sigma_N and \sigma_{N-1}

The task of inferring the mean and standard deviation of a Gaussian distribution from N samples is a familiar one, though maybe not everyone understands the difference between the \sigma_N and \sigma_{N-1} buttons on their calculator. Let us recap the formulae, then derive them. Given data D = \{x_n\}_{n=1}^{N}, an 'estimator' of \mu is

    \bar{x} \equiv \sum_{n=1}^{N} x_n / N,     (24.3)

and two estimators of \sigma are:

    \sigma_N \equiv \sqrt{ \frac{\sum_{n=1}^{N} (x_n - \bar{x})^2}{N} } \quad \text{and} \quad \sigma_{N-1} \equiv \sqrt{ \frac{\sum_{n=1}^{N} (x_n - \bar{x})^2}{N-1} }.     (24.4)

There are two principal paradigms for statistics: sampling theory and Bayesian inference. In sampling theory (also known as 'frequentist' or orthodox statistics), one invents estimators of quantities of interest and then chooses between those estimators using some criterion measuring their sampling properties; there is no clear principle for deciding which criterion to use to measure the performance of an estimator; nor, for most criteria, is there any systematic procedure for the construction of optimal estimators. In Bayesian inference, in contrast, once we have made explicit all our assumptions about the model and the data, our inferences are mechanical. Whatever question we wish to pose, the rules of probability theory give a unique answer which consistently takes into account all the given information. Human-designed estimators and confidence intervals have no role in Bayesian inference; human input only enters into the important tasks of designing the hypothesis space (that is, the specification of the model and all its probability distributions), and figuring out how to do the computations that implement inference in that space. The answers to our questions are probability distributions over the quantities of interest. We often find that the estimators of sampling theory emerge automatically as modes or means of these posterior distributions when we choose a simple hypothesis space and turn the handle of Bayesian inference.
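The difference between the two calculator buttons can be seen in simulation: averaged over many synthetic data sets, \sigma_N^2 underestimates the true variance by a factor (N-1)/N, while \sigma_{N-1}^2 is unbiased. A Python sketch (not from the book; the sample size, trial count and seed are arbitrary):

```python
import random

random.seed(2)

def sum_squared_deviation(xs):
    """S = sum_n (x_n - xbar)^2, the sufficient statistic used below."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)

N, sigma_true, trials = 5, 1.0, 20000
avg_var_N = avg_var_Nm1 = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma_true) for _ in range(N)]
    S = sum_squared_deviation(xs)
    avg_var_N += (S / N) / trials          # average of sigma_N^2
    avg_var_Nm1 += (S / (N - 1)) / trials  # average of sigma_{N-1}^2
print(avg_var_N, avg_var_Nm1)  # ≈ (N-1)/N = 0.8 and ≈ 1.0
```

Note that this shows unbiasedness of \sigma_{N-1}^2 as an estimator of \sigma^2; as the text remarks below, \sigma_{N-1} itself is still a biased estimator of \sigma.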

[Figure 24.1: The likelihood function for the parameters of a Gaussian distribution, repeated from figure 21.5. (a1, a2) Surface plot and contour plot of the log likelihood as a function of \mu and \sigma. The data set of N = 5 points had mean \bar{x} = 1.0 and S = \sum (x - \bar{x})^2 = 1.0. Notice that the maximum is skew in \sigma. The two estimators of standard deviation have values \sigma_N = 0.45 and \sigma_{N-1} = 0.50. (c) The posterior probability of \sigma for various fixed values of \mu (shown as a density over ln \sigma). (d) The posterior probability of \sigma, P(\sigma | D), assuming a flat prior on \mu, obtained by projecting the probability mass in (a) onto the \sigma axis. The maximum of P(\sigma | D) is at \sigma_{N-1}. By contrast, the maximum of P(\sigma | D, \mu = \bar{x}) is at \sigma_N. (Both probabilities are shown as densities over ln \sigma.)]

In sampling theory, the estimators above can be motivated as follows. \bar{x} is an unbiased estimator of \mu which, out of all possible unbiased estimators of \mu, has smallest variance (where this variance is computed by averaging over an ensemble of imaginary experiments in which the data samples are assumed to come from an unknown Gaussian distribution). The estimator (\bar{x}, \sigma_N) is the maximum likelihood estimator for (\mu, \sigma).
The estimator \sigma_N is biased, however: the expectation of \sigma_N, given \sigma, averaging over many imagined experiments, is not \sigma.

Exercise 24.1.[2, p.323] Give an intuitive explanation why the estimator \sigma_N is biased.

This bias motivates the invention, in sampling theory, of \sigma_{N-1}, which can be shown to be an unbiased estimator. Or to be precise, it is \sigma_{N-1}^2 that is an unbiased estimator of \sigma^2.

We now look at some Bayesian inferences for this problem, assuming noninformative priors for \mu and \sigma. The emphasis is thus not on the priors, but rather on (a) the likelihood function, and (b) the concept of marginalization. The joint posterior probability of \mu and \sigma is proportional to the likelihood function illustrated by a contour plot in figure 24.1a. The log likelihood is:

    \ln P(\{x_n\}_{n=1}^{N} | \mu, \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - \sum_n (x_n - \mu)^2 / (2\sigma^2)     (24.5)

        = -N \ln(\sqrt{2\pi}\,\sigma) - [N(\mu - \bar{x})^2 + S] / (2\sigma^2),     (24.6)

where S \equiv \sum_n (x_n - \bar{x})^2. Given the Gaussian model, the likelihood can be expressed in terms of the two functions of the data \bar{x} and S, so these two quantities are known as 'sufficient statistics'. The posterior probability of \mu and \sigma is, using the improper priors:

    P(\mu, \sigma | \{x_n\}_{n=1}^{N}) = \frac{ P(\{x_n\}_{n=1}^{N} | \mu, \sigma) \, P(\mu, \sigma) }{ P(\{x_n\}_{n=1}^{N}) }     (24.7)

        = \frac{ \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{N(\mu - \bar{x})^2 + S}{2\sigma^2} \right) \frac{1}{\sigma_\mu} \frac{1}{\sigma} }{ P(\{x_n\}_{n=1}^{N}) }.     (24.8)

This function describes the answer to the question, 'given the data, and the noninformative priors, what might \mu and \sigma be?' It may be of interest to find the parameter values that maximize the posterior probability, though it should be emphasized that posterior probability maxima have no fundamental status in Bayesian inference, since their location depends on the choice of basis. Here we choose the basis (\mu, \ln \sigma), in which our prior is flat, so that the posterior probability maximum coincides with the maximum of the likelihood. As we saw in exercise 22.4 (p.302), the maximum likelihood solution for \mu and \ln \sigma is \{\mu, \sigma\}_{ML} = \{\bar{x}, \sigma_N = \sqrt{S/N}\}.

There is more to the posterior distribution than just its mode. As can be seen in figure 24.1a, the likelihood has a skew peak. As we increase \sigma, the width of the conditional distribution of \mu increases (figure 22.1b). And if we fix \mu to a sequence of values moving away from the sample mean \bar{x}, we obtain a sequence of conditional distributions over \sigma whose maxima move to increasing values of \sigma (figure 24.1c).

The posterior probability of \mu given \sigma is

    P(\mu | \{x_n\}_{n=1}^{N}, \sigma) = \frac{ P(\{x_n\}_{n=1}^{N} | \mu, \sigma) \, P(\mu) }{ P(\{x_n\}_{n=1}^{N} | \sigma) }     (24.9)

        \propto \exp\left( -N(\mu - \bar{x})^2 / (2\sigma^2) \right)     (24.10)

        = \text{Normal}(\mu; \bar{x}, \sigma^2/N).     (24.11)

We note the familiar \sigma/\sqrt{N} scaling of the error bars on \mu.

Let us now ask the question 'given the data, and the noninformative priors, what might \sigma be?' This question differs from the first one we asked in that we are now not interested in \mu. This parameter must therefore be marginalized over. The posterior probability of \sigma is:
    P(\sigma | \{x_n\}_{n=1}^{N}) = \frac{ P(\{x_n\}_{n=1}^{N} | \sigma) \, P(\sigma) }{ P(\{x_n\}_{n=1}^{N}) }.     (24.12)

The data-dependent term P(\{x_n\}_{n=1}^{N} | \sigma) appeared earlier as the normalizing constant in equation (24.9); one name for this quantity is the 'evidence', or marginal likelihood, for \sigma. We obtain the evidence for \sigma by integrating out \mu; a noninformative prior P(\mu) = constant is assumed; we call this constant 1/\sigma_\mu, so that we can think of the prior as a top-hat prior of width \sigma_\mu. The Gaussian integral, P(\{x_n\}_{n=1}^{N} | \sigma) = \int P(\{x_n\}_{n=1}^{N} | \mu, \sigma) \, P(\mu) \, d\mu, yields:

    \ln P(\{x_n\}_{n=1}^{N} | \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - \frac{S}{2\sigma^2} + \ln \frac{\sqrt{2\pi}\,\sigma}{\sqrt{N}\,\sigma_\mu}.     (24.13)

The first two terms are the best-fit log likelihood (i.e., the log likelihood with \mu = \bar{x}). The last term is the log of the Occam factor which penalizes smaller values of \sigma. (We will discuss Occam factors more in Chapter 28.) When we differentiate the log evidence with respect to \ln \sigma, to find the most probable \sigma, the additional volume factor (\sigma/\sqrt{N}) shifts the maximum from \sigma_N to

    \sigma_{N-1} = \sqrt{ S/(N-1) }.     (24.14)

Intuitively, the denominator (N-1) counts the number of noise measurements contained in the quantity S = \sum_n (x_n - \bar{x})^2. The sum contains N residuals squared, but there are only (N-1) effective noise measurements because the determination of one parameter \mu from the data causes one dimension of noise to be gobbled up in unavoidable overfitting. In the terminology of classical statistics, the Bayesian's best guess for \sigma sets \chi^2 (the measure of deviance defined by \chi^2 \equiv \sum_n (x_n - \hat{\mu})^2 / \hat{\sigma}^2) equal to the number of degrees of freedom, N - 1.

Figure 24.1d shows the posterior probability of \sigma, which is proportional to the marginal likelihood. This may be contrasted with the posterior probability of \sigma with \mu fixed to its most probable value, \bar{x} = 1, which is shown in figure 24.1c and d.

The final inference we might wish to make is 'given the data, what is \mu?'

Exercise 24.2.[3] Marginalize over \sigma and obtain the posterior marginal distribution of \mu, which is a Student-t distribution:

    P(\mu | D) \propto 1 \,/\, \left( N(\mu - \bar{x})^2 + S \right)^{N/2}.     (24.15)

Further reading

A bible of exact marginalization is Bretthorst's (1988) book on Bayesian spectrum analysis and parameter estimation.

24.2 Exercises

Exercise 24.3.[3] [This exercise requires macho integration capabilities.] Give a Bayesian solution to exercise 22.15 (p.309), where seven scientists of varying capabilities have measured \mu with personal noise levels \sigma_n, and we are interested in inferring \mu. Let the prior on each \sigma_n be a broad prior, for example a gamma distribution with parameters (s, c) = (10, 0.1). Find the posterior distribution of \mu. Plot it, and explore its properties for a variety of data sets such as the one given, and the data set \{x_n\} = \{13.01, 7.39\}.

[Margin figure: the seven scientists' data points, labelled A, B, C and D-G, on an axis running from -30 to 20.]

[Hint: first find the posterior distribution of \sigma_n given \mu and x_n, P(\sigma_n | x_n, \mu). Note that the normalizing constant for this inference is P(x_n | \mu). Marginalize over \sigma_n to find this normalizing constant, then use Bayes' theorem a second time to find P(\mu | \{x_n\}).]

24.3 Solutions

Solution to exercise 24.1 (p.321).

1. The data points are distributed with mean squared deviation \sigma^2 about the true mean.
2. The sample mean is unlikely to exactly equal the true mean.
3. The sample mean is the value of \mu that minimizes the sum squared deviation of the data points from \mu. Any other value of \mu (in particular, the true value of \mu) will have a larger value of the sum-squared deviation than \mu = \bar{x}.

So the expected mean squared deviation from the sample mean is necessarily smaller than the mean squared deviation \sigma^2 about the true mean.
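The shift from \sigma_N to \sigma_{N-1} described at equation (24.14) can also be verified numerically: maximizing the log evidence over a grid of ln \sigma values recovers \sqrt{S/(N-1)}. A Python sketch using the same N = 5, S = 1.0 as figure 24.1 (the \sigma-independent constants, including the \sigma_\mu term, are dropped since they do not affect the maximum):

```python
import math

def log_evidence(sigma, N, S):
    """ln P({x} | sigma), up to sigma-independent constants:
    -N ln(sigma) - S/(2 sigma^2) from the best-fit likelihood,
    plus ln(sigma) from the Occam volume factor."""
    return -N * math.log(sigma) - S / (2.0 * sigma ** 2) + math.log(sigma)

N, S = 5, 1.0
# Grid over ln(sigma) in [-3, 3].
grid = [math.exp(-3.0 + 6.0 * i / 100000) for i in range(100001)]
sigma_best = max(grid, key=lambda s: log_evidence(s, N, S))
print(sigma_best, math.sqrt(S / (N - 1)))  # both ≈ 0.5, i.e. sigma_{N-1}
```

Dropping the "+ ln(sigma)" volume term from log_evidence moves the maximum back to \sqrt{S/N} = \sigma_N, which is the contrast between figures 24.1c and 24.1d.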

25

Exact Marginalization in Trellises

In this chapter we will discuss a few exact methods that are used in probabilistic modelling. As an example we will discuss the task of decoding a linear error-correcting code. We will see that inferences can be conducted most efficiently by message-passing algorithms, which take advantage of the graphical structure of the problem to avoid unnecessary duplication of computations (see Chapter 16).

25.1 Decoding problems

A codeword t is selected from a linear (N, K) code C, and it is transmitted over a noisy channel; the received signal is y. In this chapter we will assume that the channel is a memoryless channel such as a Gaussian channel. Given an assumed channel model P(y | t), there are two decoding problems.

The codeword decoding problem is the task of inferring which codeword t was transmitted given the received signal.

The bitwise decoding problem is the task of inferring for each transmitted bit t_n how likely it is that that bit was a one rather than a zero.

As a concrete example, take the (7, 4) Hamming code. In Chapter 1, we discussed the codeword decoding problem for that code, assuming a binary symmetric channel. We didn't discuss the bitwise decoding problem, and we didn't discuss how to handle more general channel models such as a Gaussian channel.

Solving the codeword decoding problem

By Bayes' theorem, the posterior probability of the codeword t is

    P(t | y) = \frac{ P(y | t) \, P(t) }{ P(y) }.     (25.1)

Likelihood function. The first factor in the numerator, P(y | t), is the likelihood of the codeword, which, for any memoryless channel, is a separable function,

    P(y | t) = \prod_{n=1}^{N} P(y_n | t_n).     (25.2)

For example, if the channel is a Gaussian channel with transmissions \pm x and additive noise of standard deviation \sigma, then the probability density of the received signal y_n in the two cases t_n = 0, 1 is

    P(y_n | t_n = 1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - x)^2}{2\sigma^2} \right)     (25.3)

    P(y_n | t_n = 0) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n + x)^2}{2\sigma^2} \right).     (25.4)

From the point of view of decoding, all that matters is the likelihood ratio, which for the case of the Gaussian channel is

    \frac{ P(y_n | t_n = 1) }{ P(y_n | t_n = 0) } = \exp\left( \frac{2 x y_n}{\sigma^2} \right).     (25.5)

Exercise 25.1.[2] Show that from the point of view of decoding, a Gaussian channel is equivalent to a time-varying binary symmetric channel with a known noise level f_n which depends on n.

Prior. The second factor in the numerator is the prior probability of the codeword, P(t), which is usually assumed to be uniform over all valid codewords. The denominator in (25.1) is the normalizing constant

    P(y) = \sum_{t} P(y | t) \, P(t).     (25.6)

The complete solution to the codeword decoding problem is a list of all codewords and their probabilities as given by equation (25.1). Since the number of codewords in a linear code, 2^K, is often very large, and since we are not interested in knowing the detailed probabilities of all the codewords, we often restrict attention to a simplified version of the codeword decoding problem.

The MAP codeword decoding problem is the task of identifying the most probable codeword t given the received signal.

If the prior probability over codewords is uniform then this task is identical to the problem of maximum likelihood decoding, that is, identifying the codeword that maximizes P(y | t).
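The identity in (25.5) can be confirmed by evaluating the two densities (25.3)-(25.4) directly. A Python sketch (the values of x, \sigma and y_n are arbitrary test choices):

```python
import math

def log_likelihood_ratio(y_n, x, sigma):
    """ln[ P(y_n | t_n=1) / P(y_n | t_n=0) ] = 2 x y_n / sigma^2, equation (25.5)."""
    return 2.0 * x * y_n / sigma ** 2

def gaussian_density(y, mean, sigma):
    """The Gaussian channel density of equations (25.3)-(25.4)."""
    return math.exp(-(y - mean) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

x, sigma, y_n = 1.0, 0.8, 0.3
direct = math.log(gaussian_density(y_n, x, sigma) / gaussian_density(y_n, -x, sigma))
print(direct, log_likelihood_ratio(y_n, x, sigma))  # both ≈ 0.9375
```

Working with the log of the ratio, as here, is the numerically sensible form: the (y_n \mp x)^2 quadratic terms cancel exactly, leaving only the linear term 2 x y_n / \sigma^2.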
Example. In Chapter 1, for the (7, 4) Hamming code and a binary symmetric channel we discussed a method for deducing the most probable codeword from the syndrome of the received signal, thus solving the MAP codeword decoding problem for that case. We would like a more general solution.

The MAP codeword decoding problem can be solved in exponential time (of order 2^K) by searching through all codewords for the one that maximizes P(y | t) P(t). But we are interested in methods that are more efficient than this. In section 25.3, we will discuss an exact method known as the min-sum algorithm which may be able to solve the codeword decoding problem more efficiently; how much more efficiently depends on the properties of the code.

It is worth emphasizing that MAP codeword decoding for a general linear code is known to be NP-complete (which means in layman's terms that MAP codeword decoding has a complexity that scales exponentially with the blocklength, unless there is a revolution in computer science). So restricting attention to the MAP decoding problem hasn't necessarily made the task much less challenging; it simply makes the answer briefer to report.
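The exponential-time search over all 2^K codewords, and the bitwise marginals defined at the start of this section, can both be sketched for the (7,4) Hamming code on a binary symmetric channel. This is a hypothetical Python sketch, not from the book: the parity rules follow the (7,4) Hamming encoding of Chapter 1 (t5 = s1+s2+s3, t6 = s2+s3+s4, t7 = s1+s3+s4, mod 2), and the flip probability f = 0.1 is an arbitrary choice. With a uniform prior and f < 0.5, maximizing P(y | t) is minimum-Hamming-distance decoding.

```python
from itertools import product

def hamming74_encode(s):
    """Encode four source bits with three mod-2 parity checks (Chapter 1's (7,4) code)."""
    s1, s2, s3, s4 = s
    return (s1, s2, s3, s4, s1 ^ s2 ^ s3, s2 ^ s3 ^ s4, s1 ^ s3 ^ s4)

CODEWORDS = [hamming74_encode(s) for s in product((0, 1), repeat=4)]

def likelihood(t, y, f):
    """P(y | t) for a binary symmetric channel with flip probability f."""
    p = 1.0
    for t_n, y_n in zip(t, y):
        p *= (1.0 - f) if t_n == y_n else f
    return p

def map_decode(y, f=0.1):
    """Exhaustive MAP codeword decoding: search all 2^K = 16 codewords."""
    return max(CODEWORDS, key=lambda t: likelihood(t, y, f))

def bitwise_posteriors(y, f=0.1):
    """Exact bitwise decoding by brute-force summation over all codewords:
    P(t_n = 1 | y) under a uniform prior over codewords."""
    evidence = sum(likelihood(t, y, f) for t in CODEWORDS)
    return [sum(likelihood(t, y, f) for t in CODEWORDS if t[n] == 1) / evidence
            for n in range(7)]

t_true = hamming74_encode((1, 0, 1, 1))
y = list(t_true)
y[2] ^= 1                                  # one channel flip
print(map_decode(y) == t_true)             # True: a single error is corrected
print([round(p, 2) for p in bitwise_posteriors(y)])
```

Since the code has minimum distance 3, the single flipped bit is corrected, and thresholding each bitwise posterior at 0.5 agrees with the MAP codeword here; the chapter's point is that for long codes this brute-force enumeration is infeasible, motivating the trellis-based algorithms that follow.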

326   25 | Exact Marginalization in Trellises

Solving the bitwise decoding problem

Formally, the exact solution of the bitwise decoding problem is obtained from equation (25.1) by marginalizing over the other bits,

    P(t_n | y) = \sum_{\{t_{n'} :\, n' \neq n\}} P(t | y).    (25.7)

We can also write this marginal with the aid of a truth function [S] whose value is one if the proposition S is true and zero otherwise:

    P(t_n = 1 | y) = \sum_t [t_n = 1]\, P(t | y)    (25.8)
    P(t_n = 0 | y) = \sum_t [t_n = 0]\, P(t | y).    (25.9)

Computing these marginal probabilities by an explicit sum over all codewords t takes exponential time. But, for certain codes, the bitwise decoding problem can be solved much more efficiently using the forward-backward algorithm. We will describe this algorithm, which is an example of the sum-product algorithm, in a moment. Both the min-sum algorithm and the sum-product algorithm have widespread importance, and have been invented many times in many fields.

25.2 Codes and trellises

In Chapters 1 and 11, we represented linear (N, K) codes in terms of their generator matrices and their parity-check matrices. In the case of a systematic block code, the first K transmitted bits in each block of size N are the source bits, and the remaining M = N - K bits are the parity-check bits. This means that the generator matrix of the code can be written

    G^T = \begin{bmatrix} I_K \\ P \end{bmatrix}    (25.10)

and the parity-check matrix can be written

    H = \begin{bmatrix} P & I_M \end{bmatrix},    (25.11)

where P is an M x K matrix.

In this section we will study another representation of a linear code, called a trellis.

[Figure 25.1. Examples of trellises: (a) the repetition code R_3; (b) the simple parity code P_3; (c) the (7,4) Hamming code. Each edge in a trellis is labelled by a zero (shown by a square) or a one (shown by a cross).]

The codes that these trellises represent will not in general be systematic codes, but they can be mapped onto systematic codes if desired by a reordering of the bits in a block.

Definition of a trellis

Our definition will be quite narrow. For a more comprehensive view of trellises, the reader should consult Kschischang and Sorokine (1995).

A trellis is a graph consisting of nodes (also known as states or vertices) and edges. The nodes are grouped into vertical slices called times, and the times are ordered such that each edge connects a node in one time to a node in a neighbouring time. Every edge is labelled with a symbol. The leftmost and rightmost times contain only one node. Apart from these two extreme nodes, all nodes in the trellis have at least one edge connecting leftwards and at least one connecting rightwards.

25.3: Solving the decoding problems on a trellis   327

A trellis with N + 1 times defines a code of blocklength N as follows: a codeword is obtained by taking a path that crosses the trellis from left to right and reading out the symbols on the edges that are traversed. Each valid path through the trellis defines a codeword. We will number the times from the leftmost time, 'time 0', to the rightmost, 'time N'. We will number the leftmost state 'state 0' and the rightmost 'state I', where I is the total number of states (vertices) in the trellis. The nth bit of the codeword is emitted as we move from time n-1 to time n.

The width of the trellis at a given time is the number of nodes in that time. The maximal width of a trellis is what it sounds like.

A trellis is called a linear trellis if the code it defines is a linear code. We will be solely concerned with linear trellises from now on, as nonlinear trellises are much more complex beasts. For brevity, we will only discuss binary trellises, that is, trellises whose edges are labelled with zeroes and ones. It is not hard to generalize the methods that follow to q-ary trellises.

Figures 25.1(a-c) show the trellises corresponding to the repetition code R_3, which has (N, K) = (3, 1); the parity code P_3, with (N, K) = (3, 2); and the (7,4) Hamming code.

Exercise 25.2.[2] Confirm that the sixteen codewords listed in table 1.14 are generated by the trellis shown in figure 25.1c.

Observations about linear trellises

For any linear code the minimal trellis is the one that has the smallest number of nodes. In a minimal trellis, each node has at most two edges entering it and at most two edges leaving it. All nodes in a time have the same left degree as each other, and they have the same right degree as each other. The width is always a power of two.

A minimal trellis for a linear (N, K) code cannot have a width greater than 2^K, since every node has at least one valid codeword through it, and there are only 2^K codewords. Furthermore, if we define M = N - K, the minimal trellis's width is everywhere less than 2^M. This will be proved in section 25.4.

Notice that for the linear trellises in figure 25.1, all of which are minimal trellises, K is the number of times a binary branch point is encountered as the trellis is traversed from left to right or from right to left.

We will discuss the construction of trellises further in section 25.4. But we now know enough to discuss the decoding problem.

25.3 Solving the decoding problems on a trellis

We can view the trellis of a linear code as giving a causal probabilistic description of the process that gives rise to a codeword, with time flowing from left to right. Each time a divergence is encountered, a random source (the source of information bits for the communication) determines which way we go.

At the receiving end, we receive a noisy version of the sequence of edge-labels, and wish to infer which path was taken, or to be precise: (a) we want to identify the most probable path, in order to solve the codeword decoding problem; and (b) we want to find the probability that the transmitted symbol at time n was a zero or a one, to solve the bitwise decoding problem.

Example 25.3. Consider the case of a single transmission from the (7,4) Hamming trellis shown in figure 25.1c.

328   25 | Exact Marginalization in Trellises

[Figure 25.2. Posterior probabilities over the sixteen codewords when the received vector y has normalized likelihoods (0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3).]

    t          Likelihood    Posterior probability
    0000000    0.0275562     0.25
    0001011    0.0001458     0.0013
    0010111    0.0013122     0.012
    0011100    0.0030618     0.027
    0100110    0.0002268     0.0020
    0101101    0.0000972     0.0009
    0110001    0.0708588     0.63
    0111010    0.0020412     0.018
    1000101    0.0001458     0.0013
    1001110    0.0000042     0.0000
    1010010    0.0030618     0.027
    1011001    0.0013122     0.012
    1100011    0.0000972     0.0009
    1101000    0.0002268     0.0020
    1110100    0.0020412     0.018
    1111111    0.0000108     0.0001

Let the normalized likelihoods be (0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3). That is, the ratios of the likelihoods are

    \frac{P(y_1 | x_1 = 1)}{P(y_1 | x_1 = 0)} = \frac{0.1}{0.9}, \quad \frac{P(y_2 | x_2 = 1)}{P(y_2 | x_2 = 0)} = \frac{0.4}{0.6}, \quad \text{etc.}    (25.12)

How should this received signal be decoded?

1. If we threshold the likelihoods at 0.5 to turn the signal into a binary received vector, we have r = (0, 0, 1, 0, 0, 0, 0), which decodes, using the decoder for the binary symmetric channel (Chapter 1), into \hat{t} = (0, 0, 0, 0, 0, 0, 0).

   This is not the optimal decoding procedure. Optimal inferences are always obtained by using Bayes' theorem.

2. We can find the posterior probability over codewords by explicit enumeration of all sixteen codewords. This posterior distribution is shown in figure 25.2. Of course, we aren't really interested in such brute-force solutions, and the aim of this chapter is to understand algorithms for getting the same information out in less than 2^K computer time.

Examining the posterior probabilities, we notice that the most probable codeword is actually the string t = 0110001. This is more than twice as probable as the answer found by thresholding, 0000000.

Using the posterior probabilities shown in figure 25.2, we can also compute the posterior marginal distributions of each of the bits. The result is shown in figure 25.3. Notice that bits 1, 4, 5 and 6 are all quite confidently inferred to be zero. The strengths of the posterior probabilities for bits 2, 3, and 7 are not so great.

In the above example, the MAP codeword is in agreement with the bitwise decoding that is obtained by selecting the most probable state for each bit using the posterior marginal distributions. But this is not always the case, as the following exercise shows.
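The brute-force enumeration in item 2 is easy to reproduce. A sketch, using the systematic generator matrix of Chapter 1 to build the sixteen codewords and the likelihood vector quoted above:

```python
from itertools import product

# the sixteen codewords of the (7,4) Hamming code, generated from the
# systematic generator matrix of Chapter 1 (table 1.14)
G = ["1000101", "0100110", "0010111", "0001011"]
codewords = set()
for bits in product((0, 1), repeat=4):
    w = [0] * 7
    for b, row in zip(bits, G):
        if b:
            w = [(a + int(c)) % 2 for a, c in zip(w, row)]
    codewords.add("".join(map(str, w)))

g = [0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3]   # normalized likelihoods P(y_n | t_n = 1)

def likelihood(t):
    L = 1.0
    for tn, gn in zip(t, g):
        L *= gn if tn == "1" else 1.0 - gn
    return L

L = {t: likelihood(t) for t in codewords}
Z = sum(L.values())
posterior = {t: Lt / Z for t, Lt in L.items()}

best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 2))    # 0110001 0.63

# posterior marginal for bit 2, as in figure 25.3
p2 = sum(p for t, p in posterior.items() if t[1] == "1")
print(round(p2, 3))                       # 0.674
```

This reproduces the figures above: the MAP codeword 0110001 with posterior about 0.63, and the bit-2 marginal 0.674.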

25.3: Solving the decoding problems on a trellis   329

[Figure 25.3. Marginal posterior probabilities for the 7 bits under the posterior distribution of figure 25.2.]

    n    Likelihood P(y_n | t_n = 1)    Posterior P(t_n = 1 | y)    P(t_n = 0 | y)
    1    0.1                            0.061                       0.939
    2    0.4                            0.674                       0.326
    3    0.9                            0.746                       0.254
    4    0.1                            0.061                       0.939
    5    0.1                            0.061                       0.939
    6    0.1                            0.061                       0.939
    7    0.3                            0.659                       0.341

Exercise 25.4.[2, p.333] Find the most probable codeword in the case where the normalized likelihood is (0.2, 0.2, 0.9, 0.2, 0.2, 0.2, 0.2). Also find or estimate the marginal posterior probability for each of the seven bits, and give the bit-by-bit decoding. [Hint: concentrate on the few codewords that have the largest probability.]

We now discuss how to use message passing on a code's trellis to solve the decoding problems.

The min-sum algorithm

The MAP codeword decoding problem can be solved using the min-sum algorithm that was introduced in section 16.3. Each codeword of the code corresponds to a path across the trellis. Just as the cost of a journey is the sum of the costs of its constituent steps, the log likelihood of a codeword is the sum of the bitwise log likelihoods. By convention, we flip the sign of the log likelihood (which we would like to maximize) and talk in terms of a cost, which we would like to minimize.

We associate with each edge a cost -log P(y_n | t_n), where t_n is the transmitted bit associated with that edge, and y_n is the received symbol. The min-sum algorithm presented in section 16.3 can then identify the most probable codeword in a number of computer operations equal to the number of edges in the trellis. This algorithm is also known as the Viterbi algorithm (Viterbi, 1967).

The sum-product algorithm

To solve the bitwise decoding problem, we can make a small modification to the min-sum algorithm, so that the messages passed through the trellis define 'the probability of the data up to the current point' instead of 'the cost of the best route to this point'. We replace the costs on the edges, -log P(y_n | t_n), by the likelihoods themselves, P(y_n | t_n). We replace the min and sum operations of the min-sum algorithm by a sum and product respectively.

Let i run over nodes/states, let i = 0 be the label for the start state, let P(i) denote the set of states that are parents of state i, and let w_{ij} be the likelihood associated with the edge from node j to node i. We define the forward-pass messages \alpha_i by

    \alpha_0 = 1

330   25 | Exact Marginalization in Trellises

    \alpha_i = \sum_{j \in P(i)} w_{ij}\, \alpha_j.    (25.13)

These messages can be computed sequentially from left to right.

Exercise 25.5.[2] Show that for a node i whose time-coordinate is n, \alpha_i is proportional to the joint probability that the codeword's path passed through node i and that the first n received symbols were y_1, ..., y_n.

The message \alpha_I computed at the end node of the trellis is proportional to the marginal probability of the data.

Exercise 25.6.[2] What is the constant of proportionality? [Answer: 2^K]

We define a second set of messages, the backward-pass messages \beta_i, in a similar manner. Let node I be the end node.

    \beta_I = 1
    \beta_j = \sum_{i :\, j \in P(i)} w_{ij}\, \beta_i.    (25.14)

These messages can be computed sequentially in a backward pass from right to left.

Exercise 25.7.[2] Show that for a node i whose time-coordinate is n, \beta_i is proportional to the conditional probability, given that the codeword's path passed through node i, that the subsequent received symbols were y_{n+1} ... y_N.

Finally, to find the probability that the nth bit was a 1 or 0, we do two summations of products of the forward and backward messages. Let i run over nodes at time n and j run over nodes at time n-1, and let t_{ij} be the value of t_n associated with the trellis edge from node j to node i. For each value of t = 0 or 1, we compute

    r_n^{(t)} = \sum_{i,j :\, j \in P(i),\, t_{ij} = t} \alpha_j\, w_{ij}\, \beta_i.    (25.15)

Then the posterior probability that t_n was t = 0 or 1 is

    P(t_n = t | y) = \frac{1}{Z}\, r_n^{(t)},    (25.16)

where the normalizing constant Z = r_n^{(0)} + r_n^{(1)} should be identical to the final forward message \alpha_I that was computed earlier.

Exercise 25.8.[2] Confirm that the above sum-product algorithm does compute P(t_n = t | y).

Other names for the sum-product algorithm presented here are 'the forward-backward algorithm', 'the BCJR algorithm', and 'belief propagation'.

Exercise 25.9.[2, p.333] A codeword of the simple parity code P_3 is transmitted, and the received signal y has associated likelihoods shown in table 25.4. Use the min-sum algorithm and the sum-product algorithm in the trellis (figure 25.1) to solve the MAP codeword decoding problem and the bitwise decoding problem. Confirm your answers by enumeration of all codewords (000, 011, 110, 101). [Hint: use logs to base 2 and do the min-sum computations by hand. When working the sum-product algorithm by hand, you may find it helpful to use three colours of pen, one for the \alpha s, one for the w s, and one for the \beta s.]

[Table 25.4. Bitwise likelihoods for a codeword of P_3.]

    n    P(y_n | t_n = 0)    P(y_n | t_n = 1)
    1    1/4                 1/2
    2    1/2                 1/4
    3    1/8                 1/2
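Both decoders for exercise 25.9 fit in a few lines. A sketch, assuming the table 25.4 likelihoods (P(y_n|t_n=0), P(y_n|t_n=1)) = (1/4, 1/2), (1/2, 1/4), (1/8, 1/2); the state encoding (states as partial parities) is an implementation choice, not from the text:

```python
import math
from fractions import Fraction as F

# Trellis for the parity code P3 (figure 25.1b).
# Each edge is (from_state, to_state, emitted_bit); states track partial parity.
edges = [
    [(0, 0, 0), (0, 1, 1)],                          # bit 1
    [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)],    # bit 2
    [(0, 0, 0), (1, 0, 1)],                          # bit 3 (parity closes to state 0)
]
like = [(F(1, 4), F(1, 2)), (F(1, 2), F(1, 4)), (F(1, 8), F(1, 2))]  # table 25.4

# min-sum (Viterbi): edge cost is -log2 P(y_n|t_n); minimize the total cost
best = {0: (0.0, [])}
for n, layer in enumerate(edges):
    new = {}
    for j, i, t in layer:
        if j in best:
            c = best[j][0] - math.log2(like[n][t])
            if i not in new or c < new[i][0]:
                new[i] = (c, best[j][1] + [t])
    best = new
print(best[0])   # (3.0, [1, 0, 1]) -- MAP codeword 101, likelihood 2**-3 = 1/8

# sum-product: forward messages (25.13)
alpha = [{0: F(1)}]
for n, layer in enumerate(edges):
    a = {}
    for j, i, t in layer:
        a[i] = a.get(i, F(0)) + like[n][t] * alpha[n].get(j, F(0))
    alpha.append(a)
Z = alpha[-1][0]

# backward messages (25.14)
beta = [{0: F(1)}]
for n in reversed(range(3)):
    b = {}
    for j, i, t in edges[n]:
        b[j] = b.get(j, F(0)) + like[n][t] * beta[0].get(i, F(0))
    beta.insert(0, b)

# bitwise posteriors (25.15, 25.16)
posteriors = []
for n, layer in enumerate(edges):
    r1 = sum(alpha[n][j] * like[n][t] * beta[n + 1][i]
             for j, i, t in layer if t == 1)
    posteriors.append(r1 / Z)
print(Z, posteriors)   # 3/16 [Fraction(3, 4), Fraction(1, 4), Fraction(5, 6)]
```

Exact rationals (`fractions.Fraction`) make the check against the by-hand answers unambiguous: Z = 3/16 and bitwise posteriors 3/4, 1/4, 5/6.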

25.4: More on trellises   331

25.4 More on trellises

We now discuss various ways of making the trellis of a code. You may safely jump over this section.

The span of a codeword is the set of bits contained between the first bit in the codeword that is non-zero and the last bit that is non-zero, inclusive. We can indicate the span of a codeword by a binary vector, as shown in table 25.5.

[Table 25.5. Some codewords and their spans.]

    Codeword    0000000    0001011    0100110    1100011    0101101
    Span        0000000    0001111    0111110    1111111    0111111

A generator matrix is in trellis-oriented form if the spans of the rows of the generator matrix all start in different columns and the spans all end in different columns.

How to make a trellis from a generator matrix

First, put the generator matrix into trellis-oriented form by row-manipulations similar to Gaussian elimination. For example, our (7,4) Hamming code can be generated by

    G = \begin{bmatrix}
    1 & 0 & 0 & 0 & 1 & 0 & 1 \\
    0 & 1 & 0 & 0 & 1 & 1 & 0 \\
    0 & 0 & 1 & 0 & 1 & 1 & 1 \\
    0 & 0 & 0 & 1 & 0 & 1 & 1
    \end{bmatrix}    (25.17)

but this matrix is not in trellis-oriented form: rows 1, 3 and 4, for example, all have spans that end in the same column. By subtracting lower rows from upper rows, we can obtain an equivalent generator matrix (that is, one that generates the same set of codewords) as follows:

    G = \begin{bmatrix}
    1 & 1 & 0 & 1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 & 1 & 1 & 0 \\
    0 & 0 & 1 & 1 & 1 & 0 & 0 \\
    0 & 0 & 0 & 1 & 0 & 1 & 1
    \end{bmatrix}.    (25.18)

Now, each row of the generator matrix can be thought of as defining an (N, 1) subcode of the (N, K) code, that is, in this case, a code with two codewords of length N = 7. For the first row, the subcode consists of the two codewords 1101000 and 0000000. The subcode defined by the second row consists of 0100110 and 0000000. It is easy to construct the minimal trellises of these subcodes; they are shown in the left column of figure 25.6.

We build the trellis incrementally as shown in figure 25.6. We start with the trellis corresponding to the subcode given by the first row of the generator matrix. Then we add in one subcode at a time. The vertices within the span of the new subcode are all duplicated. The edge symbols in the original trellis are left unchanged, and the edge symbols in the second part of the trellis are flipped wherever the new subcode has a 1 and otherwise left alone.

Another (7,4) Hamming code can be generated by

    G = \begin{bmatrix}
    1 & 1 & 1 & 0 & 0 & 0 & 0 \\
    0 & 1 & 1 & 1 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 & 1 & 1 & 0 \\
    0 & 0 & 0 & 1 & 1 & 1 & 1
    \end{bmatrix}.    (25.19)
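That the row operations leave the code unchanged, and that the result is trellis-oriented, can be verified mechanically. A sketch over GF(2); the specific matrix entries below are this sketch's reading of equations (25.17) and (25.18):

```python
from itertools import product

def codewords(G):
    # all 2^K GF(2) combinations of the rows of G
    cws = set()
    for bits in product((0, 1), repeat=len(G)):
        w = tuple(sum(b * g for b, g in zip(bits, col)) % 2 for col in zip(*G))
        cws.add(w)
    return cws

def span_ends(row):
    ones = [i for i, b in enumerate(row) if b]
    return ones[0], ones[-1]   # first and last non-zero column (0-indexed)

G_sys = [[1, 0, 0, 0, 1, 0, 1],
         [0, 1, 0, 0, 1, 1, 0],
         [0, 0, 1, 0, 1, 1, 1],
         [0, 0, 0, 1, 0, 1, 1]]    # reading of equation (25.17)
G_to  = [[1, 1, 0, 1, 0, 0, 0],
         [0, 1, 0, 0, 1, 1, 0],
         [0, 0, 1, 1, 1, 0, 0],
         [0, 0, 0, 1, 0, 1, 1]]    # reading of equation (25.18)

assert codewords(G_sys) == codewords(G_to)   # row operations preserve the code
starts, ends = zip(*(span_ends(r) for r in G_to))
# trellis-oriented: spans start and end in distinct columns
assert len(set(starts)) == 4 and len(set(ends)) == 4
print(sorted(ends))   # [3, 4, 5, 6]
```

The same check applied to the first matrix fails the span condition, since three of its rows end in the last column.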

332   25 | Exact Marginalization in Trellises

[Figure 25.6. Trellises for four subcodes of the (7,4) Hamming code (left column), and the sequence of trellises that are made when constructing the trellis for the (7,4) Hamming code (right column). Each edge in a trellis is labelled by a zero (shown by a square) or a one (shown by a cross).]

The (7,4) Hamming code generated by this matrix differs by a permutation of its bits from the code generated by the systematic matrix used in Chapter 1 and above. The parity-check matrix corresponding to this permutation is

    H = \begin{bmatrix}
    1 & 0 & 1 & 0 & 1 & 0 & 1 \\
    0 & 1 & 1 & 0 & 0 & 1 & 1 \\
    0 & 0 & 0 & 1 & 1 & 1 & 1
    \end{bmatrix}.    (25.20)

The trellis obtained from the permuted matrix G given in equation (25.19) is shown in figure 25.7a. Notice that the number of nodes in this trellis is smaller than the number of nodes in the previous trellis for the (7,4) Hamming code, shown in figure 25.1c. We thus observe that rearranging the order of the codeword bits can sometimes lead to smaller, simpler trellises.

[Figure 25.7. Trellises for the permuted (7,4) Hamming code generated (a) from the generator matrix by the method of figure 25.6; (b) from the parity-check matrix by the method described on this page. Each edge in a trellis is labelled by a zero (shown by a square) or a one (shown by a cross).]

Trellises from parity-check matrices

Another way of viewing the trellis is in terms of the syndrome. The syndrome of a vector r is defined to be Hr, where H is the parity-check matrix. A vector is only a codeword if its syndrome is zero. As we generate a codeword we can describe the current state by the partial syndrome, that is, the product of H with the codeword bits generated thus far. Each state in the trellis is a partial syndrome at one time coordinate. The starting and ending states are both constrained to be the zero syndrome. Each node in a given time represents a different possible value for the partial syndrome. Since H is an M x N matrix, where M = N - K, the syndrome is at most an M-bit vector, so we need at most 2^M nodes in each time. We can construct the trellis of a code from its parity-check matrix by walking from each end, generating two trees of possible syndrome sequences. The intersection of these two trees defines the trellis of the code.

In the pictures we obtain from this construction, we can let the vertical coordinate represent the syndrome. Then any horizontal edge is necessarily associated with a zero bit (since only a non-zero bit changes the syndrome)
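The two-tree construction can be sketched directly. The parity-check matrix below, H = [1010101; 0110011; 0001111], is this sketch's reading of equation (25.20), and the printed width list is an observation of the sketch, not a quoted figure:

```python
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]    # reading of equation (25.20); M = 3
N, M = 7, 3
cols = list(zip(*H))

def grow(front, col):
    # partial syndromes reachable by appending a 0 or a 1 bit
    out = set()
    for s in front:
        for bit in (0, 1):
            out.add(tuple((a + bit * h) % 2 for a, h in zip(s, col)))
    return out

zero = (0,) * M
forward = [{zero}]
for n in range(N):                    # tree grown from the left end
    forward.append(grow(forward[-1], cols[n]))

backward = [{zero}]
for n in reversed(range(N)):          # tree grown from the right end
    backward.insert(0, grow(backward[0], cols[n]))

# the syndrome trellis is the intersection of the two trees
widths = [len(f & b) for f, b in zip(forward, backward)]
print(widths)
assert max(widths) <= 2 ** M and widths[0] == widths[-1] == 1
```

For this H the widths come out as [1, 2, 4, 4, 8, 4, 2, 1]: never more than 2^M = 8 nodes in a time, as the text asserts, and with a total node count smaller than that of the systematic ordering.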

25.5: Solutions   333

25.5 Solutions

Solution to exercise 25.4 (p.329). The posterior probability over codewords is shown in table 25.8. The most probable codeword is 0000000.

[Table 25.8. The posterior probability over codewords for exercise 25.4.]

    t          Likelihood    Posterior probability
    0000000    0.026         0.3006
    0001011    0.00041       0.0047
    0010111    0.0037        0.0423
    0011100    0.015         0.1691
    0100110    0.00041       0.0047
    0101101    0.00010       0.0012
    0110001    0.015         0.1691
    0111010    0.0037        0.0423
    1000101    0.00041       0.0047
    1001110    0.00010       0.0012
    1010010    0.015         0.1691
    1011001    0.0037        0.0423
    1100011    0.00010       0.0012
    1101000    0.00041       0.0047
    1110100    0.0037        0.0423
    1111111    0.000058      0.0007

The marginal posterior probabilities of all seven bits are:

    n    P(y_n | t_n = 1)    P(y_n | t_n = 0)    P(t_n = 1 | y)    P(t_n = 0 | y)
    1    0.2                 0.8                 0.266             0.734
    2    0.2                 0.8                 0.266             0.734
    3    0.9                 0.1                 0.677             0.323
    4    0.2                 0.8                 0.266             0.734
    5    0.2                 0.8                 0.266             0.734
    6    0.2                 0.8                 0.266             0.734
    7    0.2                 0.8                 0.266             0.734

So the bitwise decoding is 0010000, which is not actually a codeword.

Solution to exercise 25.9 (p.330). The MAP codeword is 101, and its likelihood is 1/8. The normalizing constant of the sum-product algorithm is Z = \alpha_I = 3/16. The intermediate \alpha_i are (from left to right) 1/2, 1/4, 5/16, 4/16; the intermediate \beta_i are (from right to left) 1/2, 1/8, 9/32, 3/16. The bitwise decoding is: P(t_1 = 1 | y) = 3/4; P(t_2 = 1 | y) = 1/4; P(t_3 = 1 | y) = 5/6. The codewords' probabilities are 1/12, 2/12, 1/12, 8/12 for 000, 011, 110, 101 respectively.
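Solution 25.4 can be confirmed by the same brute-force enumeration used for figure 25.2. A sketch, with the likelihood vector (0.2, 0.2, 0.9, 0.2, 0.2, 0.2, 0.2):

```python
from itertools import product

# the sixteen (7,4) Hamming codewords from the systematic generator of Chapter 1
G = ["1000101", "0100110", "0010111", "0001011"]
codewords = set()
for bits in product((0, 1), repeat=4):
    w = [0] * 7
    for b, row in zip(bits, G):
        if b:
            w = [(a + int(c)) % 2 for a, c in zip(w, row)]
    codewords.add("".join(map(str, w)))

g = [0.2, 0.2, 0.9, 0.2, 0.2, 0.2, 0.2]   # normalized likelihoods P(y_n | t_n = 1)
L = {}
for t in codewords:
    v = 1.0
    for tn, gn in zip(t, g):
        v *= gn if tn == "1" else 1.0 - gn
    L[t] = v
Z = sum(L.values())

best = max(L, key=L.get)
marg = [sum(L[t] for t in codewords if t[n] == "1") / Z for n in range(7)]
bitwise = "".join("1" if p > 0.5 else "0" for p in marg)
print(best, round(L[best] / Z, 4))    # 0000000 0.3006
print(bitwise, bitwise in codewords)  # 0010000 False
```

This reproduces the table above, and makes concrete the closing remark of the solution: the bit-by-bit decoding, 0010000, is not itself a codeword.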

26   Exact Marginalization in Graphs

We now take a more general view of the tasks of inference and marginalization. Before reading this chapter, you should read about message passing in Chapter 16.

26.1 The general problem

Assume that a function P^* of a set of N variables x \equiv \{x_n\}_{n=1}^N is defined as a product of M factors as follows:

    P^*(x) = \prod_{m=1}^M f_m(x_m).    (26.1)

Each of the factors f_m(x_m) is a function of a subset x_m of the variables. If P^* is a positive function then we may be interested in a second, normalized function,

    P(x) \equiv \frac{1}{Z} P^*(x) = \frac{1}{Z} \prod_{m=1}^M f_m(x_m),    (26.2)

where the normalizing constant Z is defined by

    Z = \sum_x \prod_{m=1}^M f_m(x_m).    (26.3)

As an example of the notation we've just introduced, here's a function of three binary variables x_1, x_2, x_3 defined by the five factors:

    f_1(x_1) = 0.1 if x_1 = 0;  0.9 if x_1 = 1
    f_2(x_2) = 0.1 if x_2 = 0;  0.9 if x_2 = 1
    f_3(x_3) = 0.9 if x_3 = 0;  0.1 if x_3 = 1
    f_4(x_1, x_2) = 1 if (x_1, x_2) = (0,0) or (1,1);  0 if (x_1, x_2) = (1,0) or (0,1)
    f_5(x_2, x_3) = 1 if (x_2, x_3) = (0,0) or (1,1);  0 if (x_2, x_3) = (1,0) or (0,1)    (26.4)

    P^*(x) = f_1(x_1) f_2(x_2) f_3(x_3) f_4(x_1, x_2) f_5(x_2, x_3)
    P(x) = \frac{1}{Z} f_1(x_1) f_2(x_2) f_3(x_3) f_4(x_1, x_2) f_5(x_2, x_3).

The five subsets of \{x_1, x_2, x_3\} denoted by x_m in the general function (26.1) are here x_1 = \{x_1\}, x_2 = \{x_2\}, x_3 = \{x_3\}, x_4 = \{x_1, x_2\}, and x_5 = \{x_2, x_3\}.
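Every quantity defined so far can be computed by brute force for this three-variable example. A sketch:

```python
from itertools import product

# the five factors of the example function (26.4)
f1 = {0: 0.1, 1: 0.9}
f2 = {0: 0.1, 1: 0.9}
f3 = {0: 0.9, 1: 0.1}

def same(a, b):
    return 1.0 if a == b else 0.0   # f4 and f5: equality constraints

def P_star(x1, x2, x3):
    return f1[x1] * f2[x2] * f3[x3] * same(x1, x2) * same(x2, x3)

# normalizing constant (26.3): only x = 000 and x = 111 survive the constraints
Z = sum(P_star(*x) for x in product((0, 1), repeat=3))
print(round(Z, 3))     # 0.09

# normalized marginal of x1, by explicit summation over x2 and x3
P1 = sum(P_star(1, x2, x3) for x2, x3 in product((0, 1), repeat=2)) / Z
print(round(P1, 3))    # 0.9
```

The marginal P(x_1 = 1) = 0.9 will reappear below when P is recognized as the posterior of a repetition code.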

26.1: The general problem   335

The function P(x), by the way, may be recognized as the posterior probability distribution of the three transmitted bits in a repetition code (section 1.2) when the received signal is r = (1, 1, 0) and the channel is a binary symmetric channel with flip probability 0.1. The factors f_4 and f_5 respectively enforce the constraints that x_1 and x_2 must be identical and that x_2 and x_3 must be identical. The factors f_1, f_2, f_3 are the likelihood functions contributed by each component of r.

A factored function of the form (26.1) can be depicted by a factor graph, in which the variables are depicted by circular nodes and the factors are depicted by square nodes. An edge is put between variable node n and factor node m if the function f_m(x_m) has any dependence on the variable x_n. The factor graph for the example function (26.4) is shown in figure 26.1.

[Figure 26.1. The factor graph associated with the function P^*(x) (26.4): variable nodes x_1, x_2, x_3 and factor nodes f_1, ..., f_5.]

The normalization problem

The first task to be solved is to compute the normalizing constant Z.

The marginalization problems

The second task to be solved is to compute the marginal function of any variable x_n, defined by

    Z_n(x_n) = \sum_{\{x_{n'}\},\, n' \neq n} P^*(x).    (26.5)

For example, if f is a function of three variables then the marginal for n = 1 is defined by

    Z_1(x_1) = \sum_{x_2, x_3} f(x_1, x_2, x_3).    (26.6)

This type of summation, over 'all the x_{n'} except for x_n', is so important that it can be useful to have a special notation for it: the 'not-sum' or 'summary'.

The third task to be solved is to compute the normalized marginal of any variable x_n, defined by

    P_n(x_n) \equiv \sum_{\{x_{n'}\},\, n' \neq n} P(x).    (26.7)

[We include the suffix 'n' in P_n(x_n), departing from our normal practice in the rest of the book, where we would omit it.]

Exercise 26.1.[1] Show that the normalized marginal is related to the marginal Z_n(x_n) by

    P_n(x_n) = \frac{Z_n(x_n)}{Z}.    (26.8)

We might also be interested in marginals over a subset of the variables, such as

    Z_{12}(x_1, x_2) = \sum_{x_3} P^*(x_1, x_2, x_3).    (26.9)

All these tasks are intractable in general. Even if every factor is a function of only three variables, the cost of computing exact solutions for Z and for the marginals is believed in general to grow exponentially with the number of variables N.

For certain functions P^*, however, the marginals can be computed efficiently by exploiting the factorization of P^*. The idea of how this efficiency

336   26 | Exact Marginalization in Graphs

arises is well illustrated by the message-passing examples of Chapter 16. The sum-product algorithm that we now review is a generalization of message-passing rule-set B (p.242). As was the case there, the sum-product algorithm is only valid if the graph is tree-like.

26.2 The sum-product algorithm

Notation

We identify the set of variables that the mth factor depends on, x_m, by the set of their indices N(m). For our example function (26.4), the sets are N(1) = {1} (since f_1 is a function of x_1 alone), N(2) = {2}, N(3) = {3}, N(4) = {1, 2}, and N(5) = {2, 3}. Similarly we define the set of factors in which variable n participates, M(n). We denote a set N(m) with variable n excluded by N(m) \setminus n. We introduce the shorthand x_{m \setminus n} to denote the set of variables in x_m with x_n excluded, i.e.,

    x_{m \setminus n} \equiv \{ x_{n'} : n' \in N(m) \setminus n \}.    (26.10)

The sum-product algorithm will involve messages of two types passing along the edges in the factor graph: messages q_{n \to m} from variable nodes to factor nodes, and messages r_{m \to n} from factor nodes to variable nodes. A message (of either type, q or r) that is sent along an edge connecting factor f_m to variable x_n is always a function of the variable x_n.

Here are the two rules for the updating of the two sets of messages.

From variable to factor:

    q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n).    (26.11)

From factor to variable:

    r_{m \to n}(x_n) = \sum_{x_{m \setminus n}} \left( f_m(x_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'}) \right).    (26.12)

How these rules apply to leaves in the graph

A node that has only one edge connecting it to another node is called a leaf node.

Some factor nodes in the graph may be connected to only one variable node, in which case the set N(m) \setminus n of variables appearing in the factor message update (26.12) is an empty set, and the product of functions \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'}) is the empty product, whose value is 1. Such a factor node therefore always broadcasts to its one neighbour x_n the message r_{m \to n}(x_n) = f_m(x_n).

[Figure 26.2. A factor node that is a leaf node perpetually sends the message r_{m \to n}(x_n) = f_m(x_n) to its one neighbour x_n.]

Similarly, there may be variable nodes that are connected to only one factor node, so that the set M(n) \setminus m in (26.11) is empty. These nodes perpetually broadcast the message q_{n \to m}(x_n) = 1.

[Figure 26.3. A variable node that is a leaf node perpetually sends the message q_{n \to m}(x_n) = 1.]

Starting and finishing, method 1

The algorithm can be initialized in two ways. If the graph is tree-like then it must have nodes that are leaves. These leaf nodes can broadcast their

349 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. duct algorithm 337 The 26.2: sum{pro ectiv e neigh from the start. resp to their messages bours x n q nodes For all leaf ( : variable ) = 1 (26.13) n m ! n nodes m : r ) For all (26.14) ( x : ) = f leaf ( x factor n m n m ! n message-passing the used in Chapter 16's cedure rule- adopt pro We can then is created in accordance with the rules (26.11, set B (p.242 ): a message 26.12) the on whic h it dep ends are messages t. For example, in if all only presen x x x 1 2 3 g g g f gure be sen will message t only the from when 26.4, the x to message 1 1 @ @ , to f be , can f to from x q message the and ed; receiv been has x from 4 2 2 1 ! 2 2 @ @ sen when messages and r r t only the have both been receiv ed. 2 5 2 ! 4 ! f f f f f 1 2 3 4 5 in eac one tree, thus ow through will Messages the every along h direction Figure . Our mo del factor 26.4 every graph, of the diameter equal ber of steps a num after edge, to the and P function ) the for ( graph x message will have been created. (26.4). The answ can then be read out. we require marginal function of The ers is obtained x by multiplying all the incoming messages at that node. n Y Z ( x ) = (26.15) : ) x ( r n n n n m ! ) ( 2M n m constan normalizing can be obtained by summing any marginal t The Z P function, = Z ), and ( x Z from the normalized marginals obtained n n x n x ) ( Z n n (26.16) : ) = P ( x n n Z [ 2 ] 26.2. Apply the sum{pro duct algorithm to the function . in Exercise de ned (26.4) gure 26.1. Chec and the normalized marginals equation k that consisten t with what you kno w about the rep etition code R are . 3 3 ] [ Pro ve that the sum{pro duct algorithm correctly computes 26.3. Exercise graph marginal the ( x e. ) if the Z is tree-lik functions n n [ 3 ] 26.4. 
Describ e how to use the messages computed by the sum{ Exercise duct to obtain more complicated marginal functions in a algorithm pro e graph, for that Z tree-lik ( x ;x ), for two variables x and x example ; 1 1 2 2 2 1 are connected common factor node. to one and nishing, d 2 Starting metho mes- algorithm be initialized by setting ely, the the initial can Alternativ all from variables to 1: sages all n , m : q (26.17) x ; ( for ) = 1 n ! n m pro alternating with the factor message update rule (26.12), then with ceeding Compared variable update rule (26.11). message with metho d 1, this lazy the initialization metho d leads to a load of wasted computations, whose results are gradually out by the correct answ ers computed by metho d 1. ushed a num ber of iterations equal to the diameter of the factor graph, After the algorithm will con verge to a set of messages satisfying the sum{pro duct relationships 26.12). (26.11, [ 2 ] Exercise 26.5. Apply this second version of the sum{pro duct algorithm to the function de ned in equation (26.4) and gure 26.1.
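The lazy initialization just described can be sketched in code. The following is an illustrative implementation on an assumed toy tree (the factor tables, variable names, and number of sweeps are all invented for the example, not taken from the text): all q messages start at 1, the updates (26.12) and (26.11) alternate for a few sweeps, and the resulting marginals (26.15, 26.16) are checked against brute-force enumeration.

```python
import itertools
import numpy as np

# Toy tree-structured factor graph (tables assumed for illustration):
# binary x1, x2, x3 with factors f1(x1), f2(x1,x2), f3(x2,x3).
tables = {
    "f1": {(0,): 1.0, (1,): 3.0},
    "f2": {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0},
    "f3": {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0},
}
scope = {"f1": ["x1"], "f2": ["x1", "x2"], "f3": ["x2", "x3"]}
variables = ["x1", "x2", "x3"]
nbrs = {n: [m for m in scope if n in scope[m]] for n in variables}

# method 2 ("lazy") initialization (26.17): every q message is 1
q = {(n, m): np.ones(2) for n in variables for m in nbrs[n]}
r = {(m, n): np.ones(2) for m in scope for n in scope[m]}

for _ in range(6):  # more sweeps than the graph diameter, so messages settle
    # factor-to-variable update (26.12): sum out the other variables
    for m, vs in scope.items():
        for n in vs:
            msg = np.zeros(2)
            for assign in itertools.product((0, 1), repeat=len(vs)):
                a = dict(zip(vs, assign))
                term = tables[m][assign]
                for v in vs:
                    if v != n:
                        term *= q[(v, m)][a[v]]
                msg[a[n]] += term
            r[(m, n)] = msg
    # variable-to-factor update (26.11): product of the other incoming r's
    for n in variables:
        for m in nbrs[n]:
            others = [r[(m2, n)] for m2 in nbrs[n] if m2 != m]
            q[(n, m)] = np.prod(others, axis=0) if others else np.ones(2)

# marginals (26.15) and normalization (26.16)
Z_n = {n: np.prod([r[(m, n)] for m in nbrs[n]], axis=0) for n in variables}
Z = Z_n["x1"].sum()
marg = {n: Z_n[n] / Z for n in variables}

# brute-force check of the same marginals
brute = {n: np.zeros(2) for n in variables}
for assign in itertools.product((0, 1), repeat=3):
    a = dict(zip(variables, assign))
    p = np.prod([tables[m][tuple(a[v] for v in scope[m])] for m in scope])
    for n in variables:
        brute[n][a[n]] += p
for n in variables:
    assert np.allclose(marg[n], brute[n] / brute[n].sum())
```

Because the graph is a tree, the wasted early computations are flushed out and the fixed point gives exact marginals, as the final assertions confirm.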

The reason for introducing this lazy method is that (unlike method 1) it can be applied to graphs that are not tree-like. When the sum–product algorithm is run on a graph with cycles, the algorithm does not necessarily converge, and certainly does not compute the correct marginal functions; but it is nevertheless an algorithm of great practical importance, especially in the decoding of sparse-graph codes.

Sum–product algorithm with on-the-fly normalization

If we are interested only in the normalized marginals, then another version of the sum–product algorithm may be useful. The factor-to-variable messages r_{m\to n} are computed in just the same way (26.12), but the variable-to-factor messages are normalized thus:

    q_{n\to m}(x_n) = \alpha_{nm} \prod_{m' \in M(n)\setminus m} r_{m'\to n}(x_n),   (26.18)

where \alpha_{nm} is a scalar chosen such that

    \sum_{x_n} q_{n\to m}(x_n) = 1.   (26.19)

Exercise 26.6.[2] Apply this normalized version of the sum–product algorithm to the function defined in equation (26.4) and figure 26.1.

A factorization view of the sum–product algorithm

One way to view the sum–product algorithm is that it reexpresses the original factored function, the product of M factors P*(x) = \prod_{m=1}^{M} f_m(x_m), as another factored function which is the product of M + N factors,

    P*(x) = \prod_{m=1}^{M} \phi_m(x_m) \prod_{n=1}^{N} \psi_n(x_n).   (26.20)

Each factor \phi_m is associated with a factor node m, and each factor \psi_n(x_n) is associated with a variable node. Initially \phi_m(x_m) = f_m(x_m) and \psi_n(x_n) = 1.

Each time a factor-to-variable message r_{m\to n}(x_n) is sent, the factorization is updated thus:

    \psi_n(x_n) = \prod_{m \in M(n)} r_{m\to n}(x_n)   (26.21)

    \phi_m(x_m) = f_m(x_m) / \prod_{n \in N(m)} r_{m\to n}(x_n)   (26.22)

and each message can be computed in terms of \phi and \psi using

    r_{m\to n}(x_n) = \sum_{x_{N(m)\setminus n}} \phi_m(x_m) \prod_{n' \in N(m)} \psi_{n'}(x_{n'}),   (26.23)

which differs from the assignment (26.12) in that the product is over all n' \in N(m).

Exercise 26.7.[2] Confirm that the update rules (26.21–26.23) are equivalent to the sum–product rules (26.11–26.12). So \psi_n(x_n) eventually becomes the marginal Z_n(x_n).

This factorization viewpoint applies whether or not the graph is tree-like.
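One reassuring property of the factorization view can be checked directly: with the assignments (26.21, 26.22), every message r appears once in a \psi and once in the denominator of a \phi, so the product of all the \phi's and \psi's always multiplies out to P*, whatever the message values are. The following sketch verifies this on an assumed two-variable toy graph with random positive messages (all names and tables are invented for the illustration):

```python
import itertools
import numpy as np

# Toy graph (assumed): binary x1, x2 with factors f1(x1, x2) and f2(x2).
# For ANY positive message values r, prod(phi) * prod(psi) reproduces P*.
f = {"f1": {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 3.0, (1, 1): 4.0},
     "f2": {(0,): 5.0, (1,): 0.5}}
scope = {"f1": ["x1", "x2"], "f2": ["x2"]}
M = {"x1": ["f1"], "x2": ["f1", "f2"]}

rng = np.random.default_rng(1)
r = {(m, n): rng.uniform(0.1, 2.0, size=2) for m in scope for n in scope[m]}

for assign in itertools.product((0, 1), repeat=2):
    a = dict(zip(["x1", "x2"], assign))
    # psi_n: product of incoming messages (26.21)
    psi = np.prod([r[(m, n)][a[n]] for n in M for m in M[n]])
    # phi_m: f_m divided by its outgoing messages (26.22)
    phi = np.prod([
        f[m][tuple(a[v] for v in scope[m])]
        / np.prod([r[(m, n)][a[n]] for n in scope[m]])
        for m in scope
    ])
    p_star = f["f1"][assign] * f["f2"][(a["x2"],)]
    assert np.isclose(phi * psi, p_star)  # the r on each edge cancels
```

The cancellation on every edge is what makes the reexpression (26.20) valid at every stage of the algorithm, not just at convergence.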

Computational tricks

On-the-fly normalization is a good idea from a computational point of view because if P* is a product of many factors, its values are likely to be very large or very small.

Another useful computational trick involves passing the logarithms of the messages q and r instead of q and r themselves; the products in the algorithm (26.11, 26.12) are then replaced by simpler additions. The summations in (26.12) of course become more difficult: to carry them out and return the logarithm, we need to compute softmax functions like

    l = \ln(e^{l_1} + e^{l_2} + e^{l_3}).   (26.24)

But this computation can be done efficiently using look-up tables along with the observation that the value of the answer l is typically just a little larger than \max_i l_i. If we store in look-up tables values of the function

    \ln(1 + e^{\delta})   (26.25)

(for negative \delta) then l can be computed exactly in a number of look-ups, additions and sorting operations scaling as the number of terms in the sum. If look-ups and sorting are cheaper than exp() then this approach costs less than the direct evaluation (26.24). The number of operations can be further reduced by omitting negligible contributions from the smallest of the {l_i}.

A third computational trick applicable to certain error-correcting codes is to pass not the messages but the Fourier transforms of the messages. This again makes the computations of the factor-to-variable messages quicker. A simple example of this Fourier transform trick is given in Chapter 47 at equation (47.9).

26.3 The min–sum algorithm

The sum–product algorithm solves the problem of finding the marginal function of a given product P*(x). This is analogous to solving the bitwise decoding problem of section 25.1. And just as there were other decoding problems (for example, the codeword decoding problem), we can define other tasks involving P*(x) that can be solved by modifications of the sum–product algorithm. For example, consider this task, analogous to the codeword decoding problem:

The maximization problem. Find the setting of x that maximizes the product P*(x).

This problem can be solved by replacing the two operations add and multiply everywhere they appear in the sum–product algorithm by another pair of operations that satisfy the distributive law, namely max and multiply. If we replace summation (+, \sum) by maximization, we notice that the quantity formerly known as the normalizing constant,

    Z = \sum_x P*(x),   (26.26)

becomes \max_x P*(x).

Thus the sum–product algorithm can be turned into a max–product algorithm that computes \max_x P*(x), and from which the solution of the maximization problem can be deduced. Each 'marginal' Z_n(x_n) then lists the maximum value that P*(x) can attain for each value of x_n.
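The sum-to-max substitution can be sketched on a minimal example (the two-variable chain and its tables below are invented for illustration, not taken from the text): replacing the sum in the factor update (26.12) by a max turns each marginal into a max-marginal.

```python
import numpy as np

# Max-product sketch on a tiny chain (toy tables, assumed):
# P*(x1, x2) = f1(x1) f2(x1, x2).
f1 = np.array([1.0, 3.0])                    # f1[x1]
f2 = np.array([[2.0, 1.0], [1.0, 4.0]])      # f2[x1, x2]

r_f2_to_x1 = f2.max(axis=1)                  # max over x2 instead of sum
r_f2_to_x2 = (f1[:, None] * f2).max(axis=0)  # incoming q from x1 is f1

maxmarg_x1 = f1 * r_f2_to_x1                 # 'marginal' Z_1(x1)
maxmarg_x2 = r_f2_to_x2                      # 'marginal' Z_2(x2)

# each 'marginal' lists the maximum value P* can attain for each x_n
brute = f1[:, None] * f2                     # full P*(x1, x2) table
assert np.allclose(maxmarg_x1, brute.max(axis=1))
assert np.allclose(maxmarg_x2, brute.max(axis=0))
assert maxmarg_x1.max() == brute.max()       # both equal max_x P*(x)
```

The maximizing setting of each variable can then be read off as the argmax of its max-marginal.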

In practice, the max–product algorithm is most often carried out in the negative log likelihood domain, where max and product become min and sum. The min–sum algorithm is also known as the Viterbi algorithm.

26.4 The junction tree algorithm

What should one do when the factor graph one is interested in is not a tree? There are several options, and they divide into exact methods and approximate methods. The most widely used exact method for handling marginalization on graphs with cycles is called the junction tree algorithm. This algorithm works by agglomerating variables together until the agglomerated graph has no cycles. You can probably figure out the details for yourself; the complexity of the marginalization grows exponentially with the number of agglomerated variables. Read more about the junction tree algorithm in (Lauritzen, 1996; Jordan, 1998).

There are many approximate methods, and we'll visit some of them over the next few chapters – Monte Carlo methods and variational methods, to name a couple. However, the most amusing way of handling factor graphs to which the sum–product algorithm may not be applied is, as we already mentioned, to apply the sum–product algorithm! We simply compute the messages for each node in the graph, as if the graph were a tree, iterate, and cross our fingers. This so-called 'loopy' message passing has great importance in the decoding of error-correcting codes, and we'll come back to it in section 33.8 and Part VI.

Further reading

For further reading about factor graphs and the sum–product algorithm, see Kschischang et al. (2001), Yedidia et al. (2000), Yedidia et al. (2001a), Yedidia et al. (2002), Wainwright et al. (2003), and Forney (2001).

See also Pearl (1988). A good reference for the fundamental theory of graphical models is Lauritzen (1996). A readable introduction to Bayesian networks is given by Jensen (1996).

Interesting message-passing algorithms that have different capabilities from the sum–product algorithm include expectation propagation (Minka, 2001) and survey propagation (Braunstein et al., 2003). See also section 33.8.

26.5 Exercises

Exercise 26.8.[2] Express the joint probability distribution from the burglar alarm and earthquake problem (example 21.1 (p.293)) as a factor graph, and find the marginal probabilities of all the variables as each piece of information comes to Fred's attention, using the sum–product algorithm with on-the-fly normalization.
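The look-up-table softmax of the computational tricks section (26.24–26.25) can be sketched as follows. The identity \ln(e^a + e^b) = \max(a, b) + \ln(1 + e^{\delta}), with \delta = -|a - b| \le 0, is exactly the quantity one would tabulate; applied pairwise it evaluates (26.24) without ever exponentiating large numbers. (The sample values below are arbitrary.)

```python
import math
from functools import reduce

# Pairwise log-domain addition: ln(e^a + e^b) computed as a max plus a
# ln(1 + e^delta) correction with delta <= 0, per (26.25).
def jacobian_log(a, b):
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

ls = [-1.0, 2.5, 0.3]
stable = reduce(jacobian_log, ls)                # pairwise evaluation of (26.24)
direct = math.log(sum(math.exp(l) for l in ls))  # direct evaluation
assert abs(stable - direct) < 1e-12
```

In a fixed-point decoder the `math.log1p(math.exp(...))` call would be replaced by the stored table, which is where the saving comes from.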

27  Laplace's Method

The idea behind the Laplace approximation is simple. We assume that an unnormalized probability density P*(x), whose normalizing constant

    Z_P \equiv \int P*(x) \, dx   (27.1)

is of interest, has a peak at a point x_0. We Taylor-expand the logarithm of P*(x) around this peak:

    \ln P*(x) \simeq \ln P*(x_0) - \frac{c}{2}(x - x_0)^2 + \cdots,   (27.2)

where

    c = -\left.\frac{\partial^2}{\partial x^2} \ln P*(x)\right|_{x = x_0}.   (27.3)

We then approximate P*(x) by an unnormalized Gaussian,

    Q*(x) \equiv P*(x_0) \exp\!\left[-\frac{c}{2}(x - x_0)^2\right],   (27.4)

and we approximate the normalizing constant Z_P by the normalizing constant of this Gaussian,

    Z_Q = P*(x_0) \sqrt{\frac{2\pi}{c}}.   (27.5)

[Margin figure: \ln P*(x) & \ln Q*(x); P*(x) & Q*(x).]

We can generalize this integral to approximate Z_P for a density P*(x) over a K-dimensional space x. If the matrix of second derivatives of -\ln P*(x) at the maximum x_0 is A, defined by:

    A_{ij} = -\left.\frac{\partial^2}{\partial x_i \partial x_j} \ln P*(x)\right|_{x = x_0},   (27.6)

so that the expansion (27.2) is generalized to

    \ln P*(x) \simeq \ln P*(x_0) - \frac{1}{2}(x - x_0)^T A (x - x_0),   (27.7)

then the normalizing constant can be approximated by:

    Z_P \simeq Z_Q = P*(x_0) \sqrt{\frac{(2\pi)^K}{\det A}} = P*(x_0) \frac{1}{\sqrt{\det \frac{1}{2\pi} A}}.   (27.8)

Predictions can be made using the approximation Q. Physicists also call this widely-used approximation the saddle-point approximation.
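As a concrete one-dimensional illustration (the density is chosen for the example, not taken from the text), consider P*(x) = x^2 e^{-x} on x > 0, whose true normalizer is \Gamma(3) = 2. The peak of \ln P* is at x_0 = 2 and the curvature (27.3) is c = 2/x_0^2, so (27.5) can be evaluated in a few lines:

```python
import math

# Laplace approximation (27.2-27.5) for P*(x) = x^2 exp(-x), x > 0.
x0 = 2.0                            # peak: d/dx (2 ln x - x) = 0 gives x = 2
c = 2.0 / x0**2                     # -(d^2/dx^2) ln P* at x0, per (27.3)
Z_Q = x0**2 * math.exp(-x0) * math.sqrt(2 * math.pi / c)   # (27.5)
Z_P = 2.0                           # exact normalizer: Gamma(3) = 2
assert abs(Z_Q - Z_P) / Z_P < 0.05  # off by only about 4% here
```

Even though this density is visibly skewed, the Gaussian fitted at the peak captures most of the mass, which is typical of how Laplace's method behaves on unimodal densities.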

The fact that the normalizing constant of a Gaussian is given by

    \int d^K x \, \exp\!\left[-\frac{1}{2} x^T A x\right] = \sqrt{\frac{(2\pi)^K}{\det A}}   (27.9)

can be proved by making an orthogonal transformation into the basis u in which A is transformed into a diagonal matrix. The integral then separates into a product of one-dimensional integrals, each of the form

    \int du_i \, \exp\!\left[-\frac{1}{2} \lambda_i u_i^2\right] = \sqrt{\frac{2\pi}{\lambda_i}}.   (27.10)

The product of the eigenvalues \lambda_i is the determinant of A.

The Laplace approximation is basis-dependent: if x is transformed to a nonlinear function u(x) and the density is transformed to P(u) = P(x) |dx/du| then in general the approximate normalizing constants Z_Q will be different. This can be viewed as a defect – since the true value Z_P is basis-independent – or an opportunity – because we can hunt for a choice of basis in which the Laplace approximation is most accurate.

27.1 Exercises

Exercise 27.1.[2] (See also exercise 22.8 (p.307).) A photon counter is pointed at a remote star for one minute, in order to infer the rate of photons arriving at the counter per minute, \lambda. Assuming the number of photons collected r has a Poisson distribution with mean \lambda,

    P(r \mid \lambda) = \exp(-\lambda) \frac{\lambda^r}{r!},   (27.11)

and assuming the improper prior P(\lambda) = 1/\lambda, make Laplace approximations to the posterior distribution

(a) over \lambda
(b) over \log \lambda. [Note the improper prior transforms to P(\log \lambda) = constant.]

Exercise 27.2.[2] Use Laplace's method to approximate the integral

    Z(u_1, u_2) = \int_{-\infty}^{\infty} da \, f(a)^{u_1} (1 - f(a))^{u_2},   (27.12)

where f(a) = 1/(1 + e^{-a}) and u_1, u_2 are positive. Check the accuracy of the approximation against the exact answer (23.29, p.316) for (u_1, u_2) = (1/2, 1/2) and (u_1, u_2) = (1, 1). Measure the error (\log Z_P - \log Z_Q) in bits.

Exercise 27.3.[3] Linear regression. N datapoints {(x^{(n)}, t^{(n)})} are generated by the experimenter choosing each x^{(n)}, then the world delivering a noisy version of the linear function

    y(x) = w_0 + w_1 x,   (27.13)

    t^{(n)} \sim \mathrm{Normal}(y(x^{(n)}), \sigma_\nu^2).   (27.14)

Assuming Gaussian priors on w_0 and w_1, make the Laplace approximation to the posterior distribution of w_0 and w_1 (which is exact, in fact) and obtain the predictive distribution for the next datapoint t^{(N+1)}, given x^{(N+1)}.

(See MacKay (1992a) for further reading.)
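The Gaussian-integral identity (27.9) can be checked numerically in K = 2 dimensions. The matrix A, the grid bounds, and the spacing below are arbitrary illustrative choices; a plain Riemann sum over a wide grid is compared against the closed form.

```python
import numpy as np

# Numerical check of (27.9) for K = 2 with an assumed positive-definite A.
A = np.array([[2.0, 0.6], [0.6, 1.0]])
xs = np.linspace(-8.0, 8.0, 401)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")
pts = np.stack([X, Y], axis=-1)                       # grid of x vectors
quad = np.einsum("...i,ij,...j->...", pts, A, pts)    # x^T A x at each point
numeric = np.exp(-0.5 * quad).sum() * dx * dx         # Riemann sum
exact = np.sqrt((2 * np.pi) ** 2 / np.linalg.det(A))  # right-hand side of (27.9)
assert abs(numeric - exact) / exact < 1e-3
```

Because the integrand is smooth and decays rapidly, even this crude quadrature agrees with \sqrt{(2\pi)^K / \det A} to high accuracy.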

28  Model Comparison and Occam's Razor

Figure 28.1. A picture to be interpreted. It contains a tree and some boxes.

28.1 Occam's razor

How many boxes are in the picture (figure 28.1)? In particular, how many boxes are in the vicinity of the tree? If we looked with x-ray spectacles, would we see one box or two boxes behind the trunk (figure 28.2)? (Or even more?)

Figure 28.2. How many boxes are behind the tree? 1 or 2?

Occam's razor is the principle that states a preference for simple theories. 'Accept the simplest explanation that fits the data.' Thus according to Occam's razor, we should deduce that there is only one box behind the tree. Is this an ad hoc rule of thumb? Or is there a convincing reason for believing there is most likely one box? Perhaps your intuition likes the argument 'well, it would be a remarkable coincidence for the two boxes to be just the same height and colour as each other'. If we wish to make artificial intelligences that interpret data correctly, we must translate this intuitive feeling into a concrete theory.

Motivations for Occam's razor

If several explanations are compatible with a set of observations, Occam's razor advises us to buy the simplest. This principle is often advocated for one of two reasons: the first is aesthetic ('A theory with mathematical beauty is more likely to be correct than an ugly one that fits some experimental data' (Paul Dirac)); the second reason is the past empirical success of Occam's razor. However there is a different justification for Occam's razor, namely:

    Coherent inference (as embodied by Bayesian probability) automatically embodies Occam's razor, quantitatively.

It is indeed more probable that there's one box behind the tree, and we can compute how much more probable one is than two.

Figure 28.3. Why Bayesian inference embodies Occam's razor. This figure gives the basic intuition for why complex models can turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes' theorem rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D. This probability of the data given model H_i, P(D | H_i), is called the evidence for H_i. A simple model H_1 makes only a limited range of predictions, shown by P(D | H_1); a more powerful model H_2, that has, for example, more free parameters than H_1, is able to predict a greater variety of data sets. This means, however, that H_2 does not predict the data sets in region C_1 as strongly as H_1. Suppose that equal prior probabilities have been assigned to the two models. Then, if the data set falls in region C_1, the less powerful model H_1 will be the more probable model.

Model comparison and Occam's razor

We evaluate the plausibility of two alternative theories H_1 and H_2 in the light of data D as follows: using Bayes' theorem, we relate the plausibility of model H_1 given the data, P(H_1 | D), to the predictions made by the model about the data, P(D | H_1), and the prior plausibility of H_1, P(H_1). This gives the following probability ratio between theory H_1 and theory H_2:

    \frac{P(H_1 | D)}{P(H_2 | D)} = \frac{P(H_1) P(D | H_1)}{P(H_2) P(D | H_2)}.   (28.1)

The first ratio (P(H_1)/P(H_2)) on the right-hand side measures how much our initial beliefs favoured H_1 over H_2. The second ratio expresses how well the observed data were predicted by H_1, compared to H_2.

How does this relate to Occam's razor, when H_1 is a simpler model than H_2? The first ratio (P(H_1)/P(H_2)) gives us the opportunity, if we wish, to insert a prior bias in favour of H_1 on aesthetic grounds, or on the basis of experience. This would correspond to the aesthetic and empirical motivations for Occam's razor mentioned earlier. But such a prior bias is not necessary: the second ratio, the data-dependent factor, embodies Occam's razor automatically. Simple models tend to make precise predictions. Complex models, by their nature, are capable of making a greater variety of predictions (figure 28.3). So if H_2 is a more complex model, it must spread its predictive probability P(D | H_2) more thinly over the data space than H_1. Thus, in the case where the data are compatible with both theories, the simpler H_1 will turn out more probable than H_2, without our having to express any subjective dislike for complex models. Our subjective prior just needs to assign equal prior probabilities to the possibilities of simplicity and complexity. Probability theory then allows the observed data to express their opinion.

Let us turn to a simple example. Here is a sequence of numbers:

    -1, 3, 7, 11.

The task is to predict the next two numbers, and infer the underlying process that gave rise to this sequence.
A popular answ er to this question is the `add prediction with the explanation 19', 4 to the previous num ber'. `15, What about the alternativ e answ er ` 19 : 9 ; 1043 : 8' with the underlying ber from rule `get the next num being: the previous num ber, x , by evaluating

357 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. Occam's razor 28.1: 345 3 2 11 x + 23 = 11' ? I assume that this prediction seems rather less x = = 11 + 9 just rule the data ( 1, 3, 7, 11) ts as well as the second the plausible. But So why should we nd it less plausible? Let us give lab els to the rule `add 4'. two general theories: sequence is an arithmetic progression, `add the ', where n is an integer. H n { a 3 the sequence is generated by a cubic function of the form H ! x { + cx c 2 e , where c , d and e are fractions. dx + , less for second explanation, H One the plausible, migh t be reason nding c progressions are more frequen tly encoun tered than cubic func- that arithmetic H would This in the prior probabilit y ratio P ( H tions. ) =P ( put ) in a bias c a But let us give the equation equal prior probabilities, and (28.1). two theories trate on concen the data have to say. How well did eac h theory predict what the data? P ( jH To obtain ) we must specify the probabilit y distribution that eac h D a del parameters. First, H assigns dep ends to its the added integer n , mo on a rst num ber in the sequence. Let and say that these num bers could the us h have been between 50 and 50. anywhere since only the pair of eac Then f n = 4, rst num ber = 1 g give rise to the observ values D = ( 1, 3, ed data 7, 11), probabilit y of the data, given H the , is: a 1 1 ( ) = P D jH = 0 : 00010 : (28.2) a 101 101 P ( D jH say what ), we must similarly To evaluate values the fractions c;d c num e e on. [I choose to represen t these t tak bers as fractions rather and migh real than bers because if we used real num bers, the mo del would assign, num relativ e to H the , an in nitesimal probabilit y to D . Real parameters are a norm are assumed in the rest of this chapter.] 
A reasonable however, and could t state eac h fraction the numerator for be any num ber migh prior that 50 and 50, and the denominator is any num ber between 1 and 50. between the value for in the sequence, let us leave its probabilit y distribution As initial fraction as in the same are four ways of expressing the H c = 1 = 11 = . There a 2 = 22 = 3 = 33 = 4 = 44 under this prior, and similarly there are four and two solutions possible and e , resp ectiv ely. So the probabilit y of the observ ed for d H , is found to be: given data, c 1 1 4 1 2 4 1 jH D ) = P ( c 101 101 101 101 50 50 50 12 10 : 5 000000000002 0 5 = 2 : (28.3) : = P ( prob- jH Thus comparing ) with P ( D jH ) = 0 : 00010, even if our prior D a c abilities for H and H ), in favour are equal, the odds, P ( D jH jH ) : P ( D a c a c H y million over H fort , given the sequence D = ( 1, 3, 7, 11), are of about c a 2 to one. answ er dep ends on This sub jectiv e assumptions; in particular, the several probabilit y assigned to the free parameters n , c , d , e of the theories. Bayesians mak e no for this: there is no suc h thing as inference or prediction apologies assumptions. e details the quan titativ without of the prior proba- However, e Occam's bilities on the qualitativ e ect razor e ect; the complex have no theory H parameters, alw ays su ers an `Occam factor' because it has more c was only so can predict a greater variet y of data sets ( gure 28.3). This and a small example, and there were only four data points; as we move to larger

358 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. Comparison 28 | Mo Razor and 346 del Occam's Create Bayesian . Where 28.4 Figure Gather alternativ e into the ts inference data DATA MODELS cess. delling mo pro an This illustrates gure @ @ abstraction of the part of the ti c pro cess h data scien in whic - Fit h MODEL eac delled. and In mo are collected to the DATA particular, to applies gure this learning, pattern classi cation, two The etc. interp olation, @ @ denote boxes double-framed the Gather new Create . infer enc two steps e whic h involve data more dels mo Assign preferences to the in those two steps It is only that e MODELS alternativ be used. can theorem Bayes' 6 6 you how to tell Bayes does not @ example. for dels, invent mo @ box, ` tting The rst del h mo eac @ R Cho ose what Decide whether is the data', of task to the ? data to to create new mo the del what inferring gather next dels mo migh t be given the parameters ose Cho future and the data. Bayesian mo del actions metho the ds may be used to nd probable parameter values, most and error bars on those sophisticated problems the magnitude of the Occam factors typi- and more result of The parameters. the increases, and in uenced degree to whic cally h our inferences by the are metho ds to this applying Bayesian quan titativ of our sub jectiv e assumptions becomes smaller. e details is often t problem little di eren given by ers answ the from dox statistics. ortho analysis data ds and metho Bayesian del The second inference task, mo us now relate analysis. above to real in data problems discussion the Let t of the comparison in the ligh metho Bayesian is where data, ds in science, and statistics technology problems tless h coun are There whic are in a class own. 
This of their require set, preferences be assigned to alternativ a limited given data e that, problem inference second requires complexities. of di ering dels mo For e hypotheses two alternativ example, e Occam's razor to a quan titativ for mo tric cen based accoun ting del planetary motion are Mr. Inquisition's geo dels. mo over-complex penalize solar mo del and of the with system Mr. Cop ernicus's simpler on `epicycles', Bayesian ds can metho assign mo motion planetary on data ts del at sun at the cen tre. The epicyclic the e preferences objectiv to the alternativ dels in a way that e mo ernican parameters. more least as well as the Cop does so using mo del, but Occam's automatically embodies tally for Mr. Coinciden two of the extra epicyclic parameters for Inquisition, razor. perio planet found to be iden tical to the are d and radius of the sun's every `cycle around the earth'. Intuitiv ely we nd Mr. Cop ernicus's theory more probable. The of the Bayesian razor: the evidenc e and the Occam factor mechanism o levels of inference often be distinguished in the pro cess of data mo d- Tw can rst level of inference, we assume that a particular mo del is true, elling. At the we t that mo del to the data, i.e., we infer and values its free param- what eters plausibly tak e, given the data. The results of this inference are should values, summarized most probable parameter by the and error bars on often those parameters. This analysis is rep eated for eac h mo del. The second level of inference is the of mo del comparison. Here we wish to compare the task some dels mo t of the data, and assign in the sort of preference or ranking ligh to the alternativ es. Note that both levels of infer enc e are distinct from decision theory . The goal and of inference a de ned hypothesis space given a particular data set, to is, assign probabilities to hypotheses. Decision theory typically chooses between of these alternativ actions on the basis e probabilities so as to minimize the

359 Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links. Occam's razor 28.1: 347 of a `loss exp chapter concerns inference alone and no function'. ectation This this involved. mo del comparison, we discuss should are loss When functions mo del choic e . Ideal Bayesian predictions do not not be construed as implying dels; between predictions are made by summing over involve choice mo rather, e mo dels, weigh ted by their probabilities. the all alternativ ds are able consisten tly and quan titativ ely to solv e both Bayesian metho tasks. the is a popular myth that states that Bayesian meth- inference There ortho ods di er ds only by the inclusion of sub jectiv e dox statistical from metho don't h are and whic h usually to assign, mak e much dif- whic priors, dicult conclusions. It is true that, ference rst level of inference, a to the at the results often di er little from the outcome of an ortho will Bayesian's dox at- is not widely appreciated is how a Bayesian performs the second tack. What this chapter will therefore focus on Bayesian mo del level of inference; compar- ison. del comparison task because it is not possible simply to Mo is a dicult mo mo ts the data best: more complex that dels can alw ays the del choose data better, so the maxim um likeliho od mo del choice would t us the lead to implausible, mo dels, whic h generalize poorly . inevitably over-parameterized is needed. razor Occam's down Bayes' theorem for Let two levels of inference describ ed us write the explicitly how Bayesian mo del comparison works. above, so as to see h Eac mo H del is assumed to have a vector of parameters w . 
A mo del is de ned i w a `prior' distribution P ( y distributions: jH by a collection ), of probabilit i h states what values the mo del's parameters migh t be exp ected to tak e; whic a set one distributions, and for eac h value of w , de ning the of conditional the P D predictions w ; H . ) that ( mo del mak es about the data D j i 1. Mo del tting. At the rst level of inference, we assume that one mo del, the i say, is true, and we infer what the mo del's parameters w migh t th, the be, given . Using Bayes' theorem, the posterior probabilit y data D w of the is: parameters jH w ( P j H P ( D ) w ; ) i i H D; j w ) = ( P ; (28.4) i P ) D jH ( i that is, Prior Lik eliho od : = Posterior Evidence normalizing constan t P ( D jH The ) is commonly ignored since it is irrel- i of i.e., the inference level of inference, w ; but it becomes evant to the rst ortan t in the second level of inference, and we name it the evidence imp H t-based . It is common practice to use gradien for metho ds to nd the i the of the posterior, whic h de nes maxim most probable value for um the parameters, w ; it is then usual to summarize the posterior distribution MP by the value of w best- , and error bars or con dence interv als on these MP t parameters. can be obtained from the curv ature of the pos- Error bars , w at w terior; , evaluating = rr ln P ( Hessian j D; H the ) j A i MP w MP Taylor-expanding the log posterior probabilit y with w = w w and : MP T 1 / w (28.5) 2 A w ; H ( w j D; H ) ' P ( w P j D; ) exp i MP i we see that posterior can be locally appro ximated as a Gaussian the 1 A alen t to error bars) with covariance (equiv . [Whether this matrix appro ximation is good or not will dep end on the problem we are solv- and ing. the maxim um Indeed, mean of the posterior distribution have

[Whether this approximation is good or not will depend on the problem we are solving. Indeed, the maximum and mean of the posterior distribution have no fundamental status in Bayesian inference: they both change under nonlinear reparameterizations. Maximization of a posterior probability is useful only if an approximation like equation (28.5) gives a good summary of the distribution.]

Figure 28.5. The Occam factor. This figure shows the quantities that determine the Occam factor for a hypothesis H_i having a single parameter w. The prior distribution (solid line) for the parameter has width σ_w. The posterior distribution (dashed line) has a single peak at w_MP with characteristic width σ_{w|D}. The Occam factor is σ_{w|D} P(w_MP | H_i) = σ_{w|D} / σ_w.

2. Model comparison. At the second level of inference, we wish to infer which model is most plausible given the data. The posterior probability of each model is:

   P(H_i | D) ∝ P(D | H_i) P(H_i).   (28.6)

Notice that the data-dependent term P(D | H_i) is the evidence for H_i, which appeared as the normalizing constant in (28.4). The second term, P(H_i), is the subjective prior over our hypothesis space, which expresses how plausible we thought the alternative models were before the data arrived. Assuming that we choose to assign equal priors P(H_i) to the alternative models, models H_i are ranked by evaluating the evidence. The normalizing constant P(D) = Σ_i P(D | H_i) P(H_i) has been omitted from equation (28.6) because in the data-modelling process we may develop new models after the data have arrived, when an inadequacy of the first models is detected, for example. Inference is open ended: we continually seek more probable models to account for the data we gather.

To repeat the key idea: to rank alternative models H_i, a Bayesian evaluates the evidence P(D | H_i). This concept is very general: the evidence can be evaluated for parametric and `non-parametric' models alike; whatever our data-modelling task, a regression problem, a classification problem, or a density estimation problem, the evidence is a transportable quantity for comparing alternative models. In all these cases the evidence naturally embodies Occam's razor.

Evaluating the evidence

Let us now study the evidence more closely to gain insight into how the Bayesian Occam's razor works. The evidence is the normalizing constant for equation (28.4):

   P(D | H_i) = ∫ P(D | w, H_i) P(w | H_i) dw.   (28.7)

For many problems the posterior P(w | D, H_i) ∝ P(D | w, H_i) P(w | H_i) has a strong peak at the most probable parameters w_MP (figure 28.5).

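For a model simple enough to handle on a grid, the integral in (28.7) can be evaluated directly, and it already exhibits the Occam effect: of two models that fit the data equally well, the one with the broader prior has the smaller evidence. The following sketch uses a Gaussian likelihood and uniform priors of two illustrative widths; none of these numbers come from the book.

```python
import numpy as np

def evidence(data, prior_width, sigma=1.0, n_grid=100001):
    # P(D|H) = integral of P(D|w,H) P(w|H) dw  -- equation (28.7),
    # computed as a Riemann sum.  The prior is uniform on
    # [-prior_width/2, prior_width/2], so P(w|H) = 1/prior_width.
    # (The likelihood's (2*pi*sigma^2)^(-n/2) normalization is omitted;
    # it is common to both models and cancels in the ratio.)
    w = np.linspace(-prior_width / 2, prior_width / 2, n_grid)
    log_lik = np.array([-0.5 * np.sum((data - wi) ** 2) for wi in w]) / sigma**2
    dw = w[1] - w[0]
    return np.sum(np.exp(log_lik)) * dw / prior_width

data = np.random.default_rng(1).normal(0.3, 1.0, size=10)

ev_narrow = evidence(data, prior_width=4.0)    # prior just wide enough
ev_broad = evidence(data, prior_width=100.0)   # 'flexible' model, same best fit
print(ev_narrow / ev_broad)   # close to 100/4 = 25: the Occam penalty
```

Both priors contain essentially all of the likelihood mass, so the best-fit likelihoods are the same; the evidence ratio is therefore set almost entirely by the ratio of prior widths, which is the Occam factor at work.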
Then, taking for simplicity the one-dimensional case, the evidence can be approximated, using Laplace's method, by the height of the peak of the integrand P(D | w, H_i) P(w | H_i) times its width, σ_{w|D}:

   P(D | H_i) ≃ P(D | w_MP, H_i) x P(w_MP | H_i) σ_{w|D}.   (28.8)
      (Evidence ≃ Best-fit likelihood x Occam factor)

Thus the evidence is found by taking the best-fit likelihood that the model can achieve and multiplying it by an `Occam factor', which is a term with magnitude less than one that penalizes H_i for having the parameter w.

Interpretation of the Occam factor

The quantity σ_{w|D} is the posterior uncertainty in w. Suppose for simplicity that the prior P(w | H_i) is uniform on some large interval σ_w, representing the range of values of w that were possible a priori, according to H_i (figure 28.5). Then P(w_MP | H_i) = 1/σ_w, and

   Occam factor = σ_{w|D} / σ_w,   (28.9)

i.e., the Occam factor is equal to the ratio of the posterior accessible volume of H_i's parameter space to the prior accessible volume, or the factor by which H_i's hypothesis space collapses when the data arrive. The model H_i can be viewed as consisting of a certain number of exclusive submodels, of which only one survives when the data arrive. The Occam factor is the inverse of that number. The logarithm of the Occam factor is a measure of the amount of information we gain about the model's parameters when the data arrive.

A complex model having many parameters, each of which is free to vary over a large range σ_w, will typically be penalized by a stronger Occam factor than a simpler model. The Occam factor also penalizes models that have to be finely tuned to fit the data, favouring models for which the required precision of the parameters, σ_{w|D}, is coarse. The magnitude of the Occam factor is thus a measure of complexity of the model; it relates to the complexity of the predictions that the model makes in data space. This depends not only on the number of parameters in the model, but also on the prior probability that the model assigns to them. Which model achieves the greatest evidence is determined by a trade-off between minimizing this natural complexity measure and minimizing the data misfit. In contrast to alternative measures of model complexity, the Occam factor for a model is straightforward to evaluate: it simply depends on the error bars on the parameters, which we already evaluated when fitting the model to the data.

Figure 28.6 displays an entire hypothesis space so as to illustrate the various probabilities in the analysis. There are three models, H_1, H_2, H_3, which have equal prior probabilities. Each model has one parameter w (each shown on a horizontal axis), but assigns a different prior range σ_W to that parameter. H_3 is the most `flexible' or `complex' model, assigning the broadest prior range. A one-dimensional data space is shown by the vertical axis. Each model assigns a joint probability distribution P(D, w | H_i) to the data and the parameters, illustrated by a cloud of dots. These dots represent random samples from the full probability distribution. The total number of dots in each of the three model subspaces is the same, because we assigned equal prior probabilities to the models.

When a particular data set D is received (horizontal line), we infer the posterior distribution of w for a model (H_3, say) by reading out the density along that horizontal line, and normalizing. The posterior probability P(w | D, H_3) is shown by the dotted curve at the bottom. Also shown is the prior distribution P(w | H_3) (cf. figure 28.5).
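The decomposition of the evidence into best-fit likelihood times Occam factor, equations (28.8)-(28.9), can be checked against a brute-force integral. The toy setup below is my own, not the book's; I take the `width' σ_{w|D} to be sqrt(2*pi) times the posterior standard deviation, the convention under which the Laplace approximation is exact for a Gaussian peak.

```python
import numpy as np

# Numerical check of equation (28.8): evidence ≈ best-fit likelihood
# times Occam factor.  Gaussian likelihood with n data points,
# uniform prior of width sigma_w on the single parameter w.
sigma, n, sigma_w = 1.0, 10, 50.0
data = np.random.default_rng(2).normal(0.0, sigma, size=n)

w = np.linspace(-sigma_w / 2, sigma_w / 2, 200001)
dw = w[1] - w[0]
lik = np.exp([-0.5 * np.sum((data - wi) ** 2) / sigma**2 for wi in w])

direct = np.sum(lik) * dw / sigma_w        # direct integral, eq. (28.7)

# Laplace version, eqs. (28.8)-(28.9): the posterior sd here is
# sigma/sqrt(n); sigma_{w|D} = sqrt(2*pi) * sd makes the Gaussian
# integral exact, so the two estimates should agree closely.
best_fit = lik.max()                       # P(D | w_MP, H)
sigma_w_given_D = np.sqrt(2 * np.pi) * sigma / np.sqrt(n)
occam_factor = sigma_w_given_D / sigma_w   # equation (28.9), < 1
print(direct, best_fit * occam_factor)
```

The Occam factor here is far below one: the data have collapsed the accessible parameter range from σ_w = 50 down to a posterior width of order σ/sqrt(n).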

[In the case of model H_1, which is very poorly matched to the data, the shape of the posterior distribution will depend on the details of the tails of the prior P(w | H_1) and the likelihood P(D | w, H_1); the curve shown is for the case where the prior falls off more strongly.]

Figure 28.6. A hypothesis space consisting of three exclusive models, each having one parameter w, and a one-dimensional data set D. The `data set' is a single measured value which differs from the parameter w by a small amount of additive noise. Typical samples from the joint distribution P(D, w, H) are shown by dots. (N.B., these are not data points.) The observed `data set' is a single particular value for D, shown by the dashed horizontal line. The dashed curves below show the posterior probability of w for each model given this data set (cf. figure 28.3). The evidence for the different models is obtained by marginalizing onto the D axis at the left-hand side (cf. figure 28.5).

We obtain figure 28.3 by marginalizing the joint distributions P(D, w | H_i) onto the D axis at the left-hand side.
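The construction of figure 28.6 can be mimicked in the sampling direction: draw (w, D) pairs from each model's joint distribution and measure the density of samples landing near the observed D, which is a Monte Carlo estimate of the evidence. In this sketch the three prior widths, the observed value D_obs, and the tolerance eps are all illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_noise = 1.0
widths = {"H1": 2.0, "H2": 10.0, "H3": 50.0}   # prior ranges; H3 broadest
D_obs, eps, n_samples = 3.0, 0.1, 2_000_000

evidences = {}
for name, width in widths.items():
    w = rng.uniform(-width / 2, width / 2, size=n_samples)  # prior draws
    D = w + rng.normal(0.0, sigma_noise, size=n_samples)    # joint samples
    # Fraction of joint samples landing within eps of D_obs, per unit
    # of D: a Monte Carlo estimate of the evidence P(D_obs | H).
    evidences[name] = np.mean(np.abs(D - D_obs) < eps) / (2 * eps)

print(evidences)  # H2 wins: H1 cannot reach D_obs, H3 spreads itself thin
```

This reproduces the moral of the figure: the too-simple model H_1 barely reaches the observed datum, the too-flexible model H_3 dilutes its predictions over a huge range, and the intermediate model H_2 obtains the largest evidence.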