ADAfaEPoV

Transcript

Advanced Data Analysis from an Elementary Point of View

Cosma Rohilla Shalizi


For my parents and in memory of my grandparents

Contents

Introduction
    To the Reader
    Concepts You Should Know

Part I  Regression and Its Generalizations

1 Regression Basics
    1.1 Statistics, Data Analysis, Regression
    1.2 Guessing the Value of a Random Variable
    1.3 The Regression Function
    1.4 Estimating the Regression Function
    1.5 Linear Smoothers
    1.6 Further Reading
    Exercises

2 The Truth about Linear Regression
    2.1 Optimal Linear Prediction: Multiple Variables
    2.2 Shifting Distributions, Omitted Variables, and Transformations
    2.3 Adding Probabilistic Assumptions
    2.4 Linear Regression Is Not the Philosopher's Stone
    2.5 Further Reading
    Exercises

3 Model Evaluation
    3.1 What Are Statistical Models For?
    3.2 Errors, In and Out of Sample
    3.3 Over-Fitting and Model Selection
    3.4 Cross-Validation
    3.5 Warnings
    3.6 Further Reading
    Exercises

4 Smoothing in Regression
    4.1 How Much Should We Smooth?
    4.2 Adapting to Unknown Roughness
    4.3 Kernel Regression with Multiple Inputs
    4.4 Interpreting Smoothers: Plots
    4.5 Average Predictive Comparisons
    4.6 Computational Advice: npreg
    4.7 Further Reading
    Exercises

5 Simulation
    5.1 What Is a Simulation?
    5.2 How Do We Simulate Stochastic Models?
    5.3 Repeating Simulations
    5.4 Why Simulate?
    5.5 Further Reading
    Exercises

6 The Bootstrap
    6.1 Stochastic Models, Uncertainty, Sampling Distributions
    6.2 The Bootstrap Principle
    6.3 Resampling
    6.4 Bootstrapping Regression Models
    6.5 Bootstrap with Dependent Data
    6.6 Confidence Bands for Nonparametric Regression
    6.7 Things Bootstrapping Does Poorly
    6.8 Which Bootstrap When?
    6.9 Further Reading
    Exercises

7 Splines
    7.1 Smoothing by Penalizing Curve Flexibility
    7.2 Computational Example: Splines for Stock Returns
    7.3 Basis Functions and Degrees of Freedom
    7.4 Splines in Multiple Dimensions
    7.5 Smoothing Splines versus Kernel Regression
    7.6 Some of the Math Behind Splines
    7.7 Further Reading
    Exercises

8 Additive Models
    8.1 Additive Models
    8.2 Partial Residuals and Back-fitting
    8.3 The Curse of Dimensionality
    8.4 Example: California House Prices Revisited
    8.5 Interaction Terms and Expansions
    8.6 Closing Modeling Advice
    8.7 Further Reading
    Exercises

9 Testing Regression Specifications
    9.1 Testing Functional Forms
    9.2 Why Use Parametric Models At All?
    9.3 Further Reading

10 Weighting and Variance
    10.1 Weighted Least Squares
    10.2 Heteroskedasticity
    10.3 Estimating Conditional Variance Functions
    10.4 Re-sampling Residuals with Heteroskedasticity
    10.5 Local Linear Regression
    10.6 Further Reading
    Exercises

11 Logistic Regression
    11.1 Modeling Conditional Probabilities
    11.2 Logistic Regression
    11.3 Numerical Optimization of the Likelihood
    11.4 Generalized Linear and Additive Models
    11.5 Model Checking
    11.6 A Toy Example
    11.7 Weather Forecasting in Snoqualmie Falls
    11.8 Logistic Regression with More Than Two Classes
    Exercises

12 GLMs and GAMs
    12.1 Generalized Linear Models and Iterative Least Squares
    12.2 Generalized Additive Models
    12.3 Further Reading
    Exercises

13 Trees
    13.1 Prediction Trees
    13.2 Regression Trees
    13.3 Classification Trees
    13.4 Further Reading
    Exercises

Part II  Distributions and Latent Structure

14 Density Estimation
    14.1 Histograms Revisited
    14.2 "The Fundamental Theorem of Statistics"
    14.3 Error for Density Estimates
    14.4 Kernel Density Estimates
    14.5 Conditional Density Estimation
    14.6 More on the Expected Log-Likelihood Ratio
    14.7 Simulating from Density Estimates
    14.8 Further Reading
    Exercises

15 Relative Distributions and Smooth Tests
    15.1 Smooth Tests of Goodness of Fit
    15.2 Relative Distributions
    15.3 Further Reading
    Exercises

16 Principal Components Analysis
    16.1 Mathematics of Principal Components
    16.2 Example 1: Cars
    16.3 Example 2: The United States circa 1977
    16.4 Latent Semantic Analysis
    16.5 PCA for Visualization
    16.6 PCA Cautions
    16.7 Random Projections
    16.8 Further Reading
    Exercises

17 Factor Models
    17.1 From PCA to Factor Analysis
    17.2 The Graphical Model
    17.3 Roots of Factor Analysis in Causal Discovery
    17.4 Estimation
    17.5 Maximum Likelihood Estimation
    17.6 The Rotation Problem
    17.7 Factor Analysis as a Predictive Model
    17.8 Factor Models versus PCA Once More
    17.9 Examples in R
    17.10 Reification, and Alternatives to Factor Models
    17.11 Further Reading
    Exercises

18 Nonlinear Dimensionality Reduction
    18.1 Why We Need Nonlinear Dimensionality Reduction
    18.2 Local Linearity and Manifolds
    18.3 Locally Linear Embedding (LLE)
    18.4 More Fun with Eigenvalues and Eigenvectors
    18.5 Calculation
    18.6 Example
    18.7 Further Reading
    Exercises

19 Mixture Models
    19.1 Two Routes to Mixture Models
    19.2 Estimating Parametric Mixture Models
    19.3 Non-parametric Mixture Modeling
    19.4 Worked Computational Example
    19.5 Further Reading
    Exercises

20 Graphical Models
    20.1 Conditional Independence and Factor Models
    20.2 Directed Acyclic Graph (DAG) Models
    20.3 Conditional Independence and d-Separation
    20.4 Independence and Information
    20.5 Examples of DAG Models and Their Uses
    20.6 Non-DAG Graphical Models
    20.7 Further Reading
    Exercises

Part III  Causal Inference

21 Graphical Causal Models
    21.1 Causation and Counterfactuals
    21.2 Causal Graphical Models
    21.3 Conditional Independence and d-Separation Revisited
    21.4 Further Reading
    Exercises

22 Identifying Causal Effects
    22.1 Causal Effects, Interventions and Experiments
    22.2 Identification and Confounding
    22.3 Identification Strategies
    22.4 Summary
    Exercises

23 Estimating Causal Effects
    23.1 Estimators in the Back- and Front-Door Criteria
    23.2 Instrumental-Variables Estimates
    23.3 Uncertainty and Inference
    23.4 Recommendations
    23.5 Further Reading
    Exercises

24 Discovering Causal Structure
    24.1 Testing DAGs
    24.2 Testing Conditional Independence
    24.3 Faithfulness and Equivalence
    24.4 Causal Discovery with Known Variables
    24.5 Software and Examples
    24.6 Limitations on Consistency of Causal Discovery
    24.7 Pseudo-code for the SGS Algorithm
    24.8 Further Reading
    Exercises

Part IV  Dependent Data

25 Time Series
    25.1 What Time Series Are
    25.2 Stationarity
    25.3 Markov Models
    25.4 Autoregressive Models
    25.5 Bootstrapping Time Series
    25.6 Cross-Validation
    25.7 Trends and De-Trending
    25.8 Breaks in Time Series
    25.9 Time Series with Latent Variables
    25.10 Longitudinal Data
    25.11 Multivariate Time Series
    25.12 Further Reading
    Exercises

26 Simulation-Based Inference
    26.1 The Method of Simulated Moments
    26.2 Indirect Inference
    26.3 Further Reading
    Exercises

Appendices
Appendix A  Data-Analysis Problem Sets
Bibliography
    References
Acknowledgments

Part V  Online Appendices
Appendix B  Linear Algebra Reminders
Appendix C  Big O and Little o Notation
Appendix D  Taylor Expansions
Appendix E  Multivariate Distributions
Appendix F  Algebra with Expectations and Variances
Appendix G  Propagation of Error
Appendix H  Optimization
Appendix I  χ² and Likelihood Ratios
Appendix J  Rudimentary Graph Theory
Appendix K  Missing Data
Appendix L  Programming
Appendix M  Generating Random Variables

Introduction

To the Reader

This book began as the notes for 36-402, Advanced Data Analysis, at Carnegie Mellon University. This is the methodological capstone of the core statistics sequence taken by our undergraduate majors (usually in their third year), and by undergraduate and graduate students from a range of other departments. The pre-requisite for that course is our class in modern linear regression, which in turn requires students to have taken classes in introductory statistics and data analysis, probability theory, mathematical statistics, linear algebra, and multivariable calculus. This book does not presume that you once learned but have forgotten that material; it presumes that you know those subjects and are ready to go further (see p. 14, at the end of this introduction). The book also presumes that you can read and write simple functions in R. If you are lacking in any of these areas, this book is not really for you, at least not now.

ADA is a class in statistical methodology: its aim is to get students to understand something of the range of modern[1] methods of data analysis, and of the considerations which go into choosing the right method for the job at hand (rather than distorting the problem to fit the methods you happen to know). Statistical theory is kept to a minimum, and largely introduced as needed. Since ADA is also a class in data analysis, there are a lot of assignments in which large, real data sets are analyzed with the new methods.

There is no way to cover every important topic for data analysis in just a semester. Much of what's not here — sampling theory and survey methods, experimental design, advanced multivariate methods, hierarchical models, the intricacies of categorical data, graphics, data mining, spatial and spatio-temporal statistics — gets covered by our other undergraduate classes. Other important areas, like networks, inverse problems, advanced model selection or robust estimation, have to wait for graduate school.[2]

The mathematical level of these notes is deliberately low; nothing should be beyond a competent third-year undergraduate.

[1] Just as an undergraduate "modern physics" course aims to bring the student up to about 1930 (more specifically, to 1926), this class aims to bring the student up to about 1990–1995, maybe 2000.
[2] Early drafts of this book, circulated online, included sketches of chapters covering spatial statistics, networks, and experiments. These were all sacrificed to length, and to actually finishing.

But every subject covered here can be profitably studied using vastly more sophisticated techniques; that's why this is advanced data analysis from an elementary point of view. If reading these pages inspires anyone to study the same material from an advanced point of view, I will consider my troubles to have been amply repaid.

A final word. At this stage in your statistical education, you have gained two kinds of knowledge — a few general statistical principles, and many more specific procedures, tests, recipes, etc. Typical students are much more comfortable with the specifics than the generalities. But the truth is that while none of your recipes are wrong, they are tied to assumptions which hardly ever hold.[3] Learning more flexible and powerful methods, which have a much better hope of being reliable, will demand a lot of hard thinking and hard work. Those of you who succeed, however, will have done something you can be proud of.

Organization of the Book

Part I is about regression and its generalizations. The focus is on nonparametric regression, especially smoothing methods. (Chapter 2 motivates this by dispelling some myths and misconceptions about linear regression.) The ideas of cross-validation, of simulation, and of the bootstrap all arise naturally in trying to come to grips with regression. This part also covers classification and specification-testing.

Part II is about learning distributions, especially multivariate distributions, rather than doing regression. It is possible to learn essentially arbitrary distributions from data, including conditional distributions, but the number of observations needed is often prohibitive when the data is high-dimensional. This motivates looking for models of special, simple structure lurking behind the high-dimensional chaos, including various forms of linear and non-linear dimension reduction, and mixture or cluster models. All this builds towards the general idea of using graphical models to represent dependencies between variables.

Part III is about causal inference. This is done entirely within the graphical-model formalism, which makes it easy to understand the difference between causal prediction and the more ordinary "actuarial" prediction we are used to as statisticians. It also greatly simplifies figuring out when causal effects are, or are not, identifiable from our data. (Among other things, this gives us a sound way to decide what we ought to control for.) Actual estimation of causal effects is done as far as possible non-parametrically. This part ends by considering procedures for discovering causal structure from observational data.

Part IV moves away from independent observations, more or less tacitly assumed earlier, to dependent data. It specifically considers models of time series, and time series data analysis, and simulation-based inference for complex or analytically-intractable models. Parts III and IV are mostly independent of each other, but both rely on Parts I and II.

[3] "Econometric theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle's eggs and, for Chalifougnac vintage 1883, a can of turpentine." — Stefan Valavanis, quoted in Roger Koenker, "Dictionary of Received Ideas of Statistics" (http://www.econ.uiuc.edu/~roger/dict.html), s.v. "Econometrics".

The appendices contain data-analysis problem sets; mathematical reminders; statistical-theory reminders; some notes on optimization, information theory, and missing data; and advice on writing R code for data analysis.

R Examples

The book is full of worked computational examples in R. In most cases, the code used to make figures, tables, etc., is given in full in the text. (The code is deliberately omitted for a few examples for pedagogical reasons.) To save space, comments are generally omitted from the text, but comments are vital to good programming (§L.9.1), so fully-commented versions of the code for each chapter are available from the book's website.

Exercises and Problem Sets

There are two kinds of assignments included here. Mathematical and computational exercises go at the end of chapters, since they are mostly connected to those pieces of content. (Many of them are complements to, or fill in details of, material in the chapters.) There are also data-centric problem sets, in Appendix A; most of these draw on material from multiple chapters, and many of them are based on specific papers. Solutions will be available to teachers from the publisher; giving them out to those using the book for self-study is, sadly, not feasible.

To Teachers

The usual one-semester course for this class has contained Chapters 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 17, 19, 20, 21, 22, 23, 24 and 25, and Appendices E and L (the latter quite early on). Other chapters have rotated in and out from year to year. One of the problem sets from Appendix A (or a similar one) was due every week, either as homework or as a take-home exam.

Corrections and Updates

The page for this book is http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/. The latest version will live there. The book will eventually be published by Cambridge University Press, at which point there will still be a free next-to-final draft at that URL, and errata. While the book is still in a draft, the PDF contains notes to myself for revisions, [[like so]]; you can ignore them. [[Also marginal notes-to-self]]

Concepts You Should Know

If more than a few of these are unfamiliar, it's unlikely you're ready for this book.

Linear algebra: Vectors; arithmetic with vectors; inner or dot product of vectors, orthogonality; linear independence; basis vectors. Linear subspaces. Matrices, matrix arithmetic, multiplying vectors and matrices; geometric meaning of matrix multiplication. Eigenvalues and eigenvectors of matrices. Projection.

Calculus: Derivative, integral; fundamental theorem of calculus. Multivariable extensions: gradient, Hessian matrix, multidimensional integrals. Finding minima and maxima with derivatives. Taylor approximations (App. D).

Probability: Random variable; distribution, population, sample. Cumulative distribution function, probability mass function, probability density function. Specific distributions: Bernoulli, binomial, Poisson, geometric, Gaussian, exponential, t, Gamma. Expectation value. Variance, standard deviation. Joint distribution functions. Conditional distributions; conditional expectations and variances. Statistical independence and dependence. Covariance and correlation; why dependence is not the same thing as correlation. Rules for arithmetic with expectations, variances and covariances. Laws of total probability, total expectation, total variation. Sequences of random variables. Stochastic process. Law of large numbers. Central limit theorem.

Statistics: Sample mean, sample variance. Median, mode. Quartile, percentile, quantile. Inter-quartile range. Histograms. Contingency tables; odds ratio, log odds ratio. Parameters; estimator functions and point estimates. Sampling distribution. Bias of an estimator. Standard error of an estimate; standard error of the mean; how and why the standard error of the mean differs from the standard deviation. Consistency of estimators. Confidence intervals and interval estimates. Hypothesis tests. Tests for differences in means and in proportions; Z and t tests; degrees of freedom. Size, significance, power. Relation between hypothesis tests and confidence intervals. χ² test of independence for contingency tables; degrees of freedom. KS test for goodness-of-fit to distributions.

Likelihood. Likelihood functions. Maximum likelihood estimates. Relation between confidence intervals and the likelihood function. Likelihood ratio test.

Regression: What a linear model is; distinction between the regressors and the regressand. Predictions/fitted values and residuals of a regression. Interpretation of regression coefficients. Least-squares estimate of coefficients. Relation between maximum likelihood, least squares, and Gaussian distributions. Matrix formula for estimating the coefficients; the hat matrix for finding fitted values. R²; why adding more predictor variables never reduces R². The t-test for the significance of individual coefficients given other coefficients. The F-test and partial F-test for the significance of groups of coefficients. Degrees of freedom for residuals. Diagnostic examination of residuals. Confidence intervals for parameters. Confidence intervals for fitted values. Prediction intervals. (Most of this material is reviewed at http://www.stat.cmu.edu/~cshalizi/TALR/.)

Part I  Regression and Its Generalizations


1 Regression: Predicting and Relating Quantitative Features

1.1 Statistics, Data Analysis, Regression

Statistics is the branch of mathematical engineering which designs and analyses methods for drawing reliable inferences from imperfect data.

The subject of most sciences is some aspect of the world around us, or within us. Psychology studies minds; geology studies the Earth's composition and form; economics studies production, distribution and exchange; mycology studies mushrooms. Statistics does not study the world, but some of the ways we try to understand the world — some of the intellectual tools of the other sciences. Its utility comes indirectly, through helping those other sciences.

This utility is very great, because all the sciences have to deal with imperfect data. Data may be imperfect because we can only observe and record a small fraction of what is relevant; or because we can only observe indirect signs of what is truly relevant; or because, no matter how carefully we try, our data always contain an element of noise. Over the last two centuries, statistics has come to handle all such imperfections by modeling them as random processes, and probability has become so central to statistics that we introduce random events deliberately (as in sample surveys).[1]

Statistics, then, uses probability to model inference from data. We try to mathematically understand the properties of different procedures for drawing inferences: Under what conditions are they reliable? What sorts of errors do they make, and how often? What can they tell us when they work? What are signs that something has gone wrong? Like other branches of engineering, statistics aims not just at understanding but also at improvement: we want to analyze data better: more reliably, with fewer and smaller errors, under broader conditions, faster, and with less mental effort. Sometimes some of these goals conflict — a fast, simple method might be very error-prone, or only reliable under a narrow range of circumstances.

One of the things that people most often want to know about the world is how different variables are related to each other, and one of the central tools statistics has for learning about relationships is regression.[2] In your linear regression class,

[1] Two excellent, but very different, histories of how statistics came to this understanding are Hacking (1990) and Porter (1986).
[2] The origin of the name is instructive (Stigler, 1986). It comes from 19th century investigations into the relationship between the attributes of parents and their children. People who are taller (heavier, faster, ...) than average tend to have children who are also taller than average, but not quite as tall. Likewise, the children of unusually short parents also tend to be closer to the average, and similarly for other traits. This came to be called "regression towards the mean," or even "regression towards mediocrity"; hence the line relating the average height (or whatever) of children to that of their parents was "the regression line," and the word stuck.

you learned about how it could be used in data analysis, and learned about its properties. In this book, we will build on that foundation, extending beyond basic linear regression in many directions, to answer many questions about how variables are related to each other.

This is intimately related to prediction. Being able to make predictions isn't the only reason we want to understand relations between variables — we also want to answer "what if?" questions — but prediction tests our knowledge of relations. (If we misunderstand, we might still be able to predict, but it's hard to see how we could understand and not be able to predict.) So before we go beyond linear regression, we will first look at prediction, and how to predict one variable from nothing at all. Then we will look at predictive relationships between variables, and see how linear regression is just one member of a big family of smoothing methods, all of which are available to us.

1.2 Guessing the Value of a Random Variable

We have a quantitative, numerical variable, which we'll imaginatively call Y. We'll suppose that it's a random variable, and try to predict it by guessing a single value for it. (Other kinds of predictions are possible — we might guess whether Y will fall within certain limits, or the probability that it does so, or even the whole probability distribution of Y. But some lessons we'll learn here will apply to these other kinds of predictions as well.) What is the best value to guess? More formally, what is the optimal point forecast for Y?

To answer this question, we need to pick a function to be optimized, which should measure how good our guesses are — or equivalently how bad they are, i.e., how big an error we're making. A reasonable, traditional starting point is the mean squared error:

MSE(m) ≡ E[(Y − m)²]    (1.1)

So we'd like to find the value μ where MSE(m) is smallest. Start by re-writing the MSE as a (squared) bias plus a variance:

MSE(m) = E[(Y − m)²]    (1.2)
       = (E[Y − m])² + V[Y − m]    (1.3)
       = (E[Y − m])² + V[Y]    (1.4)
       = (E[Y] − m)² + V[Y]    (1.5)

Notice that only the first, bias-squared term depends on our prediction m.

We want to find the derivative of the MSE with respect to our prediction m, and then set that to zero at the optimal prediction μ:

dMSE/dm = −2(E[Y] − m) + 0    (1.6)
dMSE/dm |_{m=μ} = 0    (1.7)
2(E[Y] − μ) = 0    (1.8)
μ = E[Y]    (1.9)

So, if we gauge the quality of our prediction by mean-squared error, the best prediction to make is the expected value.

1.2.1 Estimating the Expected Value

Of course, to make the prediction E[Y] we would have to know the expected value of Y. Typically, we do not. However, if we have sampled values y_1, y_2, ..., y_n, we can estimate the expectation from the sample mean:

μ̂ ≡ (1/n) ∑_{i=1}^{n} y_i    (1.10)

If the samples are independent and identically distributed (IID), then the law of large numbers tells us that

μ̂ → E[Y] = μ    (1.11)

and algebra with variances (Exercise 1.1) tells us something about how fast the convergence is, namely that the squared error will typically be V[Y]/n.

Of course the assumption that the y_i come from IID samples is a strong one, but we can assert pretty much the same thing if they're just uncorrelated with a common expected value. Even if they are correlated, but the correlations decay fast enough, all that changes is the rate of convergence (§25.2.2.1). So "sit, wait, and average" is a pretty reliable way of estimating the expectation value.
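To make §1.2.1 concrete, here is a small R check. It is not from the book; the simulated data and the names (y.sim, guesses, and so on) are made up for illustration. It verifies numerically that, among constant guesses, the empirical mean squared error is minimized at the sample mean, and that the sample mean's squared error shrinks at roughly the V[Y]/n rate.

# A minimal sketch (not from the book): the sample mean as the MSE-optimal
# constant guess, checked by brute force on simulated data.
set.seed(42)
y.sim <- rgamma(1000, shape = 2, scale = 3)    # any distribution will do; E[Y] = 6
empirical.mse <- function(m, y) { mean((y - m)^2) }
guesses <- seq(from = 0, to = 12, length.out = 500)
mses <- sapply(guesses, empirical.mse, y = y.sim)
guesses[which.min(mses)]    # should be very close to...
mean(y.sim)                 # ...the sample mean
# The squared error of the sample mean shrinks roughly like V[Y]/n = 18/n here
n.seq <- c(10, 100, 1000, 10000)
sq.err <- sapply(n.seq, function(n) {
    mean(replicate(500, (mean(rgamma(n, shape = 2, scale = 3)) - 6)^2))
})
rbind(observed = sq.err, theory = 18/n.seq)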

1.3 The Regression Function

Of course, it's not very useful to predict just one number for a variable. Typically, we have lots of variables in our data, and we believe they are related somehow. For example, suppose that we have data on two variables, X and Y, which might look like Figure 1.1.[3] The feature Y is what we are trying to predict, a.k.a. the dependent variable or output or response or regressand, and X is the predictor or independent variable or covariate or input or regressor. Y might be something like the profitability of a customer and X their credit rating, or, if you want a less mercenary example, Y could be some measure of improvement in blood cholesterol and X the dose taken of a drug. Typically we won't have just one input feature X but rather many of them, but that gets harder to draw and doesn't change the points of principle.

Figure 1.2 shows the same data as Figure 1.1, only with the sample mean added on. This clearly tells us something about the data, but also it seems like we should be able to do better — to reduce the average error — by using X, rather than by ignoring it.

Let's say that we want our prediction to be a function of X, namely f(X). What should that function be, if we still use mean squared error? We can work this out by using the law of total expectation, i.e., the fact that E[U] = E[E[U|V]] for any random variables U and V.

MSE(f) = E[(Y − f(X))²]    (1.12)
       = E[ E[(Y − f(X))² | X] ]    (1.13)
       = E[ V[Y − f(X) | X] + (E[Y − f(X) | X])² ]    (1.14)
       = E[ V[Y | X] + (E[Y − f(X) | X])² ]    (1.15)

When we want to minimize this, the first term inside the expectation doesn't depend on our prediction, and the second term looks just like our previous optimization only with all expectations conditional on X, so for our optimal function μ(x) we get

μ(x) = E[Y | X = x]    (1.16)

In other words, the (mean-squared) optimal conditional prediction is just the conditional expected value. The function μ(x) is called the regression function. This is what we would like to know when we want to predict Y.

Some Disclaimers

It's important to be clear on what is and is not being assumed here. Talking about X as the "independent variable" and Y as the "dependent" one suggests a causal model, which we might write

Y ← μ(X) + ε    (1.17)

where the direction of the arrow, ←, indicates the flow from causes to effects, and ε is some noise variable. If the gods of inference are very kind, then ε would have a fixed distribution, independent of X, and we could without loss of generality take it to have mean zero. ("Without loss of generality" because if it has a non-zero mean, we can incorporate that into μ(X) as an additive constant.) However, no such assumption is required to get Eq. 1.16. It works when predicting effects from causes, or the other way around when predicting (or "retrodicting") causes from effects, or indeed when there is no causal relationship whatsoever between X and

[3] Problem set A.30 features data that looks rather like these made-up values.

[Figure 1.1 appears here: a scatterplot of y against x for the (made up) running example data, with rug ticks along both axes.]

plot(all.x, all.y, xlab = "x", ylab = "y")
rug(all.x, side = 1, col = "grey")
rug(all.y, side = 2, col = "grey")

Figure 1.1 Scatterplot of the (made up) running example data. rug() adds horizontal and vertical ticks to the axes to mark the location of the data; this isn't necessary but is often helpful. The data are in the basics-examples.Rda file.

Y.[4] It is always true that

Y | X = μ(X) + ε(X)    (1.18)

[4] We will cover causal inference in detail in Part III.

[Figure 1.2 appears here: the same scatterplot, with a horizontal dotted line at the sample mean of y.]

plot(all.x, all.y, xlab = "x", ylab = "y")
rug(all.x, side = 1, col = "grey")
rug(all.y, side = 2, col = "grey")
abline(h = mean(all.y), lty = "dotted")

Figure 1.2 Data from Figure 1.1, with a horizontal line at ȳ.

where ε(X) is a random variable with expected value 0, E[ε | X = x] = 0, but as the notation indicates the distribution of this variable generally depends on X.

It's also important to be clear that if we find the regression function is a constant, μ(x) = μ_0 for all x, that this does not mean that X and Y are statistically

independent. If they are independent, then the regression function is a constant, but turning this around is the logical fallacy of "affirming the consequent".[5]

1.4 Estimating the Regression Function

We want the regression function μ(x) = E[Y | X = x], but what we have is a pile of training examples, of pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). What should we do?

If X takes on only a finite set of values, then a simple strategy is to use the conditional sample means:

μ̂(x) = (1 / #{i : x_i = x}) ∑_{i : x_i = x} y_i    (1.19)

Reasoning with the law of large numbers as before, we can be confident that μ̂(x) → E[Y | X = x].

Unfortunately, this only works when X takes values in a finite set. If X is continuous, then in general the probability of our getting a sample at any particular value is zero, as is the probability of getting multiple samples at exactly the same value of x. This is a basic issue with estimating any kind of function from data — the function will always be undersampled, and we need to fill in between the values we see. We also need to somehow take into account the fact that each y_i is a sample from the conditional distribution of Y | X = x_i, and is generally not equal to E[Y | X = x_i]. So any kind of function estimation is going to involve interpolation, extrapolation, and de-noising or smoothing.

Different methods of estimating the regression function — different regression methods, for short — involve different choices about how we interpolate, extrapolate and smooth. These are choices about how to approximate μ(x) with a limited class of functions which we know (or at least hope) we can estimate. There is no guarantee that our choice leads to a good approximation in the case at hand, though it is sometimes possible to say that the approximation error will shrink as we get more and more data. This is an extremely important topic and deserves an extended discussion, coming next.

[5] As in combining the fact that all human beings are featherless bipeds, and the observation that a cooked turkey is a featherless biped, to conclude that cooked turkeys are human beings.
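As a small illustration of Eq. 1.19 (not from the book; the simulated data and the names x.disc, y.disc and mu.hat are made up), here is how the conditional sample means can be computed in R when X takes only a few values, using tapply():

# A minimal sketch of Eq. 1.19 for a discrete predictor, on simulated data.
set.seed(7)
x.disc <- sample(1:4, size = 200, replace = TRUE)      # X takes the values 1, 2, 3, 4
y.disc <- 2 * x.disc + rnorm(200, mean = 0, sd = 1)    # so E[Y|X=x] = 2x
mu.hat <- tapply(y.disc, x.disc, mean)                 # conditional sample means, Eq. 1.19
mu.hat                              # should be close to 2, 4, 6, 8
mu.hat[as.character(c(3, 1, 4))]    # predictions for new points are table look-ups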

1.4.1 The Bias-Variance Tradeoff

Suppose that the true regression function is μ(x), but we use the function μ̂ to make our predictions. Let's look at the mean squared error at X = x in a slightly different way than before, which will make it clearer what happens when we can't use μ to make predictions. We'll begin by expanding (Y − μ̂(x))², since the MSE at x is just the expectation of this.

(Y − μ̂(x))²    (1.20)
= (Y − μ(x) + μ(x) − μ̂(x))²
= (Y − μ(x))² + 2(Y − μ(x))(μ(x) − μ̂(x)) + (μ(x) − μ̂(x))²    (1.21)

Eq. 1.18 tells us that Y − μ(X) = ε, a random variable which has expectation zero (and is uncorrelated with X). Taking the expectation of Eq. 1.21, nothing happens to the last term (since it doesn't involve any random quantities); the middle term goes to zero (because E[Y − μ(X)] = E[ε] = 0), and the first term becomes the variance of ε, call it σ²(x):

MSE(μ̂(x)) = σ²(x) + (μ(x) − μ̂(x))²    (1.22)

The σ²(x) term doesn't depend on our prediction function, just on how hard it is, intrinsically, to predict Y at X = x. The second term, though, is the extra error we get from not knowing μ. (Unsurprisingly, ignorance of μ cannot improve our predictions.) This is our first bias-variance decomposition: the total MSE at x is decomposed into a (squared) bias μ(x) − μ̂(x), the amount by which our predictions are systematically off, and a variance σ²(x), the unpredictable, "statistical" fluctuation around even the best prediction.

All this presumes that μ̂ is a single fixed function. Really, of course, μ̂ is something we estimate from earlier data. But if those data are random, the regression function we get is random too; let's call this random function M̂_n, where the subscript reminds us of the finite amount of data we used to estimate it. What we have analyzed is really MSE(M̂_n(x) | M̂_n = μ̂), the mean squared error conditional on a particular estimated regression function. What can we say about the prediction error of the method, averaging over all the possible training data sets?

MSE(M̂_n(x)) = E[(Y − M̂_n(X))² | X = x]    (1.23)
= E[ E[(Y − M̂_n(X))² | X = x, M̂_n] | X = x ]    (1.24)
= E[ σ²(x) + (μ(x) − M̂_n(x))² | X = x ]    (1.25)
= σ²(x) + E[(μ(x) − M̂_n(x))² | X = x]    (1.26)
= σ²(x) + E[ (μ(x) − E[M̂_n(x)] + E[M̂_n(x)] − M̂_n(x))² ]    (1.27)
= σ²(x) + (μ(x) − E[M̂_n(x)])² + V[M̂_n(x)]    (1.28)

This is our second bias-variance decomposition — I pulled the same trick as before, adding and subtracting a mean inside the square. The first term is just the variance of the process; we've seen that before and it isn't, for the moment, of any concern. The second term is the bias in using M̂_n to estimate μ — the approximation bias or approximation error. The third term, though, is the variance in our estimate of the regression function. Even if we have an unbiased method (E[M̂_n(x)] = μ(x)), if there is a lot of variance in our estimates, we can expect to make large errors.

The approximation bias depends on the true regression function. For example, if E[M̂_n(x)] = 42 + 37x, the error of approximation will be zero at all x if μ(x) = 42 + 37x, but it will be larger and x-dependent if μ(x) = 0. However, there are flexible methods of estimation which will have small approximation biases for

all μ in a broad range of regression functions. The catch is that, at least past a certain point, decreasing the approximation bias can only come through increasing the estimation variance. This is the bias-variance trade-off. However, nothing says that the trade-off has to be one-for-one. Sometimes we can lower the total error by introducing some bias, since it gets rid of more variance than it adds approximation error. The next section gives an example.

In general, both the approximation bias and the estimation variance depend on n. A method is consistent[6] when both of these go to zero as n → ∞ — that is, if we recover the true regression function as we get more and more data.[7] Again, consistency depends not just on the method, but also on how well the method matches the data-generating process, and, again, there is a bias-variance trade-off. There can be multiple consistent methods for the same problem, and their biases and variances don't have to go to zero at the same rates.

1.4.2 The Bias-Variance Trade-Off in Action

Let's take an extreme example: we could decide to approximate μ(x) by a constant μ_0. The implicit smoothing here is very strong, but sometimes appropriate. For instance, it's appropriate when μ(x) really is a constant! Then trying to estimate any additional structure in the regression function is just wasted effort. Alternately, if μ(x) is nearly constant, we may still be better off approximating it as one. For instance, suppose the true μ(x) = μ_0 + a sin(νx), where a ≪ 1 and ν ≫ 1 (Figure 1.3 shows an example). With limited data, we can actually get better predictions by estimating a constant regression function than one with the correct functional form.

[6] To be precise, consistent for μ, or consistent for conditional expectations. More generally, an estimator of any property of the data, or of the whole distribution, is consistent if it converges on the truth.
[7] You might worry about this claim, especially if you've taken more probability theory — aren't we just saying something about average performance of the M̂_n, rather than any particular estimated regression function? But notice that if the estimation variance goes to zero, then by Chebyshev's inequality, Pr(|X − E[X]| ≥ a) ≤ V[X]/a², each M̂_n(x) comes arbitrarily close to E[M̂_n(x)] with arbitrarily high probability. If the approximation bias goes to zero, therefore, the estimated regression functions converge in probability on the true regression function, not just in mean.
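To see the §1.4.2 trade-off on average, rather than for the single draw shown in Figure 1.3, here is a small Monte Carlo sketch. It is not from the book; the function one.run and its settings are made up, though the true curve and noise level mimic those used for Figure 1.3. Over many simulated training sets it compares the out-of-sample MSE of the constant predictor (the sample mean) with that of a model of the correct functional form.

# A minimal Monte Carlo sketch of the bias-variance trade-off, under assumptions
# matching the set-up of Figure 1.3 (not the book's own code).
ugly.func <- function(x) { 1 + 0.01 * sin(100 * x) }
one.run <- function(n.train = 20, n.test = 1000, noise.sd = 0.5) {
    x <- runif(n.train); y <- ugly.func(x) + rnorm(n.train, 0, noise.sd)
    x.new <- runif(n.test); y.new <- ugly.func(x.new) + rnorm(n.test, 0, noise.sd)
    sine.fit <- lm(y ~ sin(100 * x))
    pred.sine <- predict(sine.fit, newdata = data.frame(x = x.new))
    c(constant = mean((y.new - mean(y))^2), sine = mean((y.new - pred.sine)^2))
}
set.seed(1)
rowMeans(replicate(200, one.run()))
# With only 20 training points, the constant typically has the smaller average
# out-of-sample MSE, even though its functional form is "wrong".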

[Figure 1.3 appears here: the simulated data, the true curve 1 + 0.01 sin(100x) (solid), the constant fit at the sample mean (red dashed), and the fitted sine curve (blue dotted).]

ugly.func <- function(x) { 1 + 0.01 * sin(100 * x) }
x <- runif(20)
y <- ugly.func(x) + rnorm(length(x), 0, 0.5)
plot(x, y, xlab = "x", ylab = "y")
curve(ugly.func, add = TRUE)
abline(h = mean(y), col = "red", lty = "dashed")
sine.fit = lm(y ~ 1 + sin(100 * x))
curve(sine.fit$coefficients[1] + sine.fit$coefficients[2] * sin(100 * x), col = "blue",
    add = TRUE, lty = "dotted")
legend("topright", legend = c(expression(1 + 0.1 * sin(100 * x)), expression(bar(y)),
    expression(hat(a) + hat(b) * sin(100 * x))), lty = c("solid", "dashed", "dotted"),
    col = c("black", "red", "blue"))

Figure 1.3 When we try to estimate a rapidly-varying but small-amplitude regression function (solid black line, μ = 1 + 0.01 sin(100x) + ε, with mean-zero Gaussian noise of standard deviation 0.5), we can do better to use a constant function (red dashed line at the sample mean) than to estimate a more complicated model of the correct functional form â + b̂ sin(100x) (dotted blue line). With just 20 observations, the mean predicts slightly better on new data (square-root MSE, RMSE, of 0.52) than does the estimated sine function (RMSE of 0.55). The bias of using the wrong functional form is less than the extra variance of estimation, so using the true model form hurts us.

1.4.3 Ordinary Least Squares Linear Regression as Smoothing

Let's revisit ordinary least-squares linear regression from this point of view. We'll assume that the predictor variable X is one-dimensional, just to simplify the book-keeping. We choose to approximate μ(x) by b_0 + b_1 x, and ask for the best values β_0, β_1 of those constants. These will be the ones which minimize the mean-squared error.

MSE(b_0, b_1) = E[(Y − b_0 − b_1 X)²]    (1.29)
             = E[ E[(Y − b_0 − b_1 X)² | X] ]    (1.30)
             = E[ V[Y | X] + (E[Y − b_0 − b_1 X | X])² ]    (1.31)
             = E[V[Y | X]] + E[(E[Y − b_0 − b_1 X | X])²]    (1.32)

The first term doesn't depend on b_0 or b_1, so we can drop it for purposes of optimization. Taking derivatives, and then bringing them inside the expectations,

∂MSE/∂b_0 = E[2(Y − b_0 − b_1 X)(−1)]    (1.33)
0 = E[Y − β_0 − β_1 X]    (1.34)
β_0 = E[Y] − β_1 E[X]    (1.35)

So we need to get β_1:

∂MSE/∂b_1 = E[2(Y − b_0 − b_1 X)(−X)]    (1.36)
0 = E[XY] − β_1 E[X²] − (E[Y] − β_1 E[X]) E[X]    (1.37)
  = E[XY] − E[X]E[Y] − β_1 (E[X²] − E[X]²)    (1.38)
β_1 = Cov[X, Y] / V[X]    (1.39)

using our equation for β_0. That is, the mean-squared optimal linear prediction is

μ(x) = E[Y] + (Cov[X, Y] / V[X]) (x − E[X])    (1.40)

Now, if we try to estimate this from data, there are (at least) two approaches. One is to replace the true, population values of the covariance and the variance with their sample values, respectively

(1/n) ∑_i (y_i − ȳ)(x_i − x̄)    (1.41)

and

V̂[X] ≡ (1/n) ∑_i (x_i − x̄)²    (1.42)

The other is to minimize the in-sample or empirical mean squared error,

(1/n) ∑_i (y_i − b_0 − b_1 x_i)²    (1.43)

You may or may not find it surprising that both approaches lead to the same answer:

β̂_1 = [ (1/n) ∑_i (y_i − ȳ)(x_i − x̄) ] / V̂[X]    (1.44)
β̂_0 = ȳ − β̂_1 x̄    (1.45)

Provided that V[X] > 0, these will converge with IID samples, so we have a consistent estimator.
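Eqs. 1.44-1.45 are easy to check numerically. The following sketch is not from the book; the simulated data are made up for illustration. It compares the plug-in estimates to the coefficients returned by lm(). (R's cov() and var() use an n − 1 denominator rather than the 1/n of Eqs. 1.41-1.42, but that factor cancels in the ratio.)

# A minimal sketch checking Eqs. 1.44-1.45 against lm(), on simulated data.
set.seed(3)
x <- runif(100); y <- 5 - 2 * x + rnorm(100, 0, 0.3)
beta1.hat <- cov(x, y)/var(x)               # the n-1 factors cancel in the ratio
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))                             # the same two numbers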

We are now in a position to see how the least-squares linear regression model is really a weighted averaging of the data. Let's write the estimated regression function explicitly in terms of the training data points.

μ̂(x) = β̂_0 + β̂_1 x    (1.47)
= ȳ + β̂_1 (x − x̄)    (1.48)
= (1/n) ∑_{i=1}^{n} y_i + (x − x̄) [ (1/n) ∑_{i=1}^{n} (y_i − ȳ)(x_i − x̄) ] / [ (1/n) ∑_{i=1}^{n} (x_i − x̄)² ]    (1.49)
= (1/n) ∑_{i=1}^{n} y_i + ((x − x̄)/(n σ̂²_X)) ∑_{i=1}^{n} (y_i − ȳ)(x_i − x̄)    (1.50)
= (1/n) ∑_{i=1}^{n} y_i + ((x − x̄)/(n σ̂²_X)) ( ∑_{i=1}^{n} y_i (x_i − x̄) − ȳ (n x̄ − n x̄) )    (1.51)
= ∑_{i=1}^{n} (1/n) ( 1 + (x − x̄)(x_i − x̄)/σ̂²_X ) y_i    (1.52)

In words, our prediction is a weighted average of the observed values y_i of the regressand, where the weights are proportional to how far x_i and x both are from the center of the data (relative to the variance of X). If x_i is on the same side of the center as x, it gets a positive weight, and if it's on the opposite side it gets a negative weight.

Figure 1.4 adds the least-squares regression line to Figure 1.1. As you can see, this is only barely different from the constant regression function (the slope on X is 0.014). Visually, the problem is that there should be a positive slope in the left-hand half of the data, and a negative slope in the right, but the slopes and the densities are balanced so that the best single slope is near zero.[8]

Mathematically, the problem arises from the peculiar way in which least-squares linear regression smoothes the data. As I said, the weight of a data point depends on how far it is from the center of the data, not how far it is from the point at which we are trying to predict. This works when μ(x) really is a straight line, but otherwise — e.g., here — it's a recipe for trouble. However, it does suggest that if we could somehow just tweak the way we smooth the data, we could do better than linear regression.

1.5 Linear Smoothers

The sample mean and the least-squares line are both special cases of linear smoothers, which estimate the regression function with a weighted average:

μ̂(x) = ∑_i y_i ŵ(x_i, x)    (1.53)

These are called linear smoothers because the predictions are linear in the responses y_i; as functions of x they can be and generally are nonlinear.

[8] The standard test of whether this coefficient is zero is about as far from rejecting the null hypothesis as you will ever see, p = 0.89. Remember this the next time you look at linear regression output.
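Eq. 1.52 can be checked directly in R. The sketch below is not from the book; the simulated data and the name ols.weights are made up. It computes the weights w(x_i, x) at a prediction point and confirms that the weighted average of the y_i reproduces the prediction from lm().

# A minimal sketch of Eq. 1.52: the OLS prediction as a weighted average of the y_i.
set.seed(11)
x <- runif(50); y <- sin(2 * pi * x) + rnorm(50, 0, 0.2)
ols.weights <- function(x.train, x0) {
    sigma2.x <- mean((x.train - mean(x.train))^2)    # the 1/n variance, as in Eq. 1.42
    (1/length(x.train)) * (1 + (x0 - mean(x.train)) * (x.train - mean(x.train))/sigma2.x)
}
x0 <- 0.3
w <- ols.weights(x, x0)
sum(w)          # the weights always sum to 1
sum(w * y)      # the weighted average of the responses...
predict(lm(y ~ x), newdata = data.frame(x = x0))    # ...equals the OLS prediction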

[Figure 1.4 appears here: the running example data with the sample mean (dotted horizontal line) and the least-squares regression line (solid).]

plot(all.x, all.y, xlab = "x", ylab = "y")
rug(all.x, side = 1, col = "grey")
rug(all.y, side = 2, col = "grey")
abline(h = mean(all.y), lty = "dotted")
fit.all = lm(all.y ~ all.x)
abline(fit.all)

Figure 1.4 Data from Figure 1.1, with a horizontal line at the mean (dotted) and the ordinary least squares regression line (solid).

As I just said, the sample mean is a special case; see Exercise 1.7. Ordinary linear regression is another special case, where ŵ(x_i, x) is given by Eq. 1.52. Both of these, as remarked earlier, ignore how far x_i is from x. Let us look at some linear smoothers which are not so silly.

1.5.1 k-Nearest-Neighbor Regression

At the other extreme from ignoring the distance between x_i and x, we could do nearest-neighbor regression:

ŵ(x_i, x) = 1 if x_i is the nearest neighbor of x, and 0 otherwise    (1.54)

This is very sensitive to the distance between x_i and x. If μ(x) does not change too rapidly, and X is pretty thoroughly sampled, then the nearest neighbor of x among the x_i is probably close to x, so that μ(x_i) is probably close to μ(x). However, y_i = μ(x_i) + noise, so nearest-neighbor regression will include the noise into its prediction. We might instead do k-nearest neighbor regression,

ŵ(x_i, x) = 1/k if x_i is one of the k nearest neighbors of x, and 0 otherwise    (1.55)

Again, with enough samples all the k nearest neighbors of x are probably close to x, so their regression functions there are going to be close to the regression function at x. But because we average their values of y_i, the noise terms should tend to cancel each other out. As we increase k, we get smoother functions — in the limit k = n and we just get back the constant. Figure 1.5 illustrates this for our running example data.[9] To use k-nearest-neighbors regression, we need to pick k somehow. This means we need to decide how much smoothing to do, and this is not trivial. We will return to this point in Chapter 3.

Because k-nearest-neighbors averages over only a fixed number of neighbors, each of which is a noisy sample, it always has some noise in its prediction, and is generally not consistent. This may not matter very much with moderately-large data (especially once we have a good way of picking k). If we want consistency, we need to let k grow with n, but not too fast; it's enough that as n → ∞, k → ∞ and k/n → 0 (Györfi et al., 2002, Thm. 6.1, p. 88).

[9] The code uses the k-nearest neighbor function provided by the package FNN (Beygelzimer et al., 2013). This requires one to give both a set of training points (used to learn the model) and a set of test points (at which the model is to make predictions), and returns a list where the actual predictions are in the pred element — see help(knn.reg) for more, including examples.
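To make the weights of Eq. 1.55 concrete, here is a hand-rolled k-nearest-neighbor predictor. This is a sketch, not the book's code (the book's figures use FNN::knn.reg); the name knn.predict and the simulated data are made up for illustration.

# A minimal hand-rolled version of Eq. 1.55, for a one-dimensional predictor.
knn.predict <- function(x.train, y.train, x0, k) {
    neighbors <- order(abs(x.train - x0))[1:k]    # indices of the k nearest x's
    mean(y.train[neighbors])                      # each gets weight 1/k, the rest get 0
}
set.seed(5)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, 0, 0.3)
knn.predict(x, y, x0 = 0.5, k = 5)
# This should agree (up to tie-breaking) with the FNN package used for Figure 1.5:
# FNN::knn.reg(train = matrix(x), test = matrix(0.5), y = y, k = 5)$pred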

[Figure 1.5 appears here: the running example data with a dashed horizontal line at the mean and k-nearest-neighbor regression curves for k = 1, 3, 5, and 20.]

library(FNN)
plot.seq <- matrix(seq(from = 0, to = 1, length.out = 100), byrow = TRUE)
lines(plot.seq, knn.reg(train = all.x, test = plot.seq, y = all.y, k = 1)$pred, col = "red")
lines(plot.seq, knn.reg(train = all.x, test = plot.seq, y = all.y, k = 3)$pred, col = "green")
lines(plot.seq, knn.reg(train = all.x, test = plot.seq, y = all.y, k = 5)$pred, col = "blue")
lines(plot.seq, knn.reg(train = all.x, test = plot.seq, y = all.y, k = 20)$pred, col = "purple")
legend("center", legend = c("mean", expression(k == 1), expression(k == 3),
    expression(k == 5), expression(k == 20)), lty = c("dashed", rep("solid", 4)),
    col = c("black", "red", "green", "blue", "purple"))

Figure 1.5 Points from Figure 1.1 with horizontal dashed line at the mean and the k-nearest-neighbor regression curves for various k. Increasing k smooths out the regression curve, pulling it towards the mean. — The code is repetitive; can you write a function to simplify it?

1.5.2 Kernel Smoothers

Changing k in a k-nearest-neighbors regression lets us change how much smoothing we're doing on our data, but it's a bit awkward to express this in terms of a number of data points. It feels like it would be more natural to talk about a range in the independent variable over which we smooth or average. Another problem with k-NN regression is that each testing point is predicted using information from only a few of the training data points, unlike linear regression or the sample mean, which always uses all the training data. It'd be nice if we could somehow use all the training data, but in a location-sensitive way.

There are several ways to do this, as we'll see, but a particularly useful one is kernel smoothing, a.k.a. kernel regression or Nadaraya-Watson regression.[10] To begin with, we need to pick a kernel function K(x_i, x) which satisfies the following properties:

1. K(x_i, x) ≥ 0;
2. K(x_i, x) depends only on the distance x_i − x, not the individual arguments;
3. ∫ x K(0, x) dx = 0; and
4. 0 < ∫ x² K(0, x) dx < ∞.

These conditions together (especially the last one) imply that K(x_i, x) → 0 as |x_i − x| → ∞. Two examples of such functions are the density of the Unif(−h/2, h/2) distribution, and the density of the standard Gaussian N(0, √h) distribution. Here h can be any positive number, and is called the bandwidth. Because K(x_i, x) = K(0, x_i − x), we will often write K as a one-argument function, K(x_i − x). Because we often want to consider similar kernels which differ only by bandwidth, we'll either write K_h(x_i − x), or K((x_i − x)/h).

The Nadaraya-Watson estimate of the regression function is

μ̂(x) = ∑_i y_i [ K(x_i, x) / ∑_j K(x_j, x) ]    (1.56)

i.e., in terms of Eq. 1.53,

ŵ(x_i, x) = K(x_i, x) / ∑_j K(x_j, x)    (1.57)

(Notice that here, as in k-NN regression, the sum of the weights is always 1.[11] Why?)

What does this achieve? Well, K(x_i, x) is large if x_i is close to x, so this will place a lot of weight on the training data points close to the point where we are trying to predict. More distant training points will have smaller weights, falling off towards zero. If we try to predict at a point x which is very far from any of the training data points, the value of K(x_i, x) will be small for all x_i, but it will typically be much, much smaller for all the x_i which are not the nearest neighbor of x, so ŵ(x_i, x) ≈ 1 for the nearest neighbor and ≈ 0 for all the others.[12] That is, far from the training data, our predictions will tend towards nearest neighbors, rather than going off to ±∞, as linear regression's predictions do. Whether this

[10] There are many other mathematical objects which are also called "kernels". Some of these meanings are related, but not all of them. (Cf. "normal".)
[11] What do we do if K(x_i, x) is zero for some x_i? Nothing; they just get zero weight in the average. What do we do if all the K(x_i, x) are zero? Different people adopt different conventions; popular ones are to return the global, unweighted mean of the y_i, to do some sort of interpolation from regions where the weights are defined, and to throw up our hands and refuse to make any predictions (computationally, return NA).
[12] Take a Gaussian kernel in one dimension, for instance, so K(x_i, x) ∝ e^{−(x_i − x)²/2h²}. Say x_i is the nearest neighbor, and |x_i − x| = L, with L ≫ h. So K(x_i, x) ∝ e^{−L²/2h²}, a small number. But now for any other x_j, K(x_j, x) ∝ e^{−L²/2h²} e^{−(x_j − x_i)L/h²} e^{−(x_j − x_i)²/2h²} ≪ e^{−L²/2h²}. — This assumes that we're using a kernel like the Gaussian, which never quite goes to zero, unlike the box kernel.

is good or bad of course depends on the true μ(x) — and how often we have to predict what will happen very far from the training data.

Figure 1.6 shows our running example data, together with kernel regression estimates formed by combining the uniform-density, or box, and Gaussian kernels with different bandwidths. The box kernel simply takes a region of width h around the point x and averages the training data points it finds there. The Gaussian kernel gives reasonably large weights to points within h of x, smaller ones to points within 2h, tiny ones to points within 3h, and so on, shrinking like e^{−(x − x_i)²/2h²}. As promised, the bandwidth h controls the degree of smoothing. As h → ∞, we revert to taking the global mean. As h → 0, we tend to get spikier functions — with the Gaussian kernel at least it tends towards the nearest-neighbor regression.

If we want to use kernel regression, we need to choose both which kernel to use, and the bandwidth to use with it. Experience, like Figure 1.6, suggests that the bandwidth usually matters a lot more than the kernel. This puts us back to roughly where we were with k-NN regression, needing to control the degree of smoothing, without knowing how smooth μ(x) really is. Similarly again, with a fixed bandwidth h, kernel regression is generally not consistent. However, if h → 0 as n → ∞, but doesn't shrink too fast, then we can get consistency.
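Here is a hand-rolled Nadaraya-Watson estimator implementing Eq. 1.56 with a Gaussian kernel. This is a sketch, not the book's code (the book's Figure 1.6 uses ksmooth()); the name nw.predict and the simulated data are made up. Note that ksmooth() rescales its bandwidth argument internally, so its output will not match this sketch exactly for the same nominal h.

# A minimal sketch of the Nadaraya-Watson estimator (Eq. 1.56) with a Gaussian kernel.
nw.predict <- function(x.train, y.train, x0, h) {
    k <- dnorm(x0 - x.train, sd = h)     # kernel weights K(x_i, x0), up to a constant factor
    if (all(k == 0)) { return(NA) }      # one convention for when every weight vanishes
    sum(y.train * k)/sum(k)              # weights K/sum(K) sum to 1, as in Eq. 1.57
}
set.seed(9)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, 0, 0.3)
x.grid <- seq(from = 0, to = 1, length.out = 5)
sapply(x.grid, function(x0) { nw.predict(x, y, x0, h = 0.05) })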

[Figure 1.6 appears here: the running example data with box-kernel (solid) and Gaussian-kernel (dashed) regression lines for bandwidths h = 2, 1, and 0.1.]

lines(ksmooth(all.x, all.y, "box", bandwidth = 2), col = "red")
lines(ksmooth(all.x, all.y, "box", bandwidth = 1), col = "green")
lines(ksmooth(all.x, all.y, "box", bandwidth = 0.1), col = "blue")
lines(ksmooth(all.x, all.y, "normal", bandwidth = 2), col = "red", lty = "dashed")
lines(ksmooth(all.x, all.y, "normal", bandwidth = 1), col = "green", lty = "dashed")
lines(ksmooth(all.x, all.y, "normal", bandwidth = 0.1), col = "blue", lty = "dashed")
legend("bottom", ncol = 3, legend = c("", expression(h == 2), expression(h == 1),
    expression(h == 0.1), "Box", "", "", "", "Gaussian", "", "", ""),
    lty = c("blank", "blank", "blank", "blank", "blank", "solid", "solid", "solid",
        "blank", "dashed", "dashed", "dashed"),
    col = c("black", "black", "black", "black", "black", "red", "green", "blue",
        "black", "red", "green", "blue"), pch = NA)

Figure 1.6 Data from Figure 1.1 together with kernel regression lines, for various combinations of kernel (box/uniform or Gaussian) and bandwidth. Note the abrupt jump around x = 0.75 in the h = 0.1 box-kernel (solid blue) line — with a small bandwidth the box kernel is unable to interpolate smoothly across the break in the training data, while the Gaussian kernel (dashed blue) can.

1.5.3 Some General Theory for Linear Smoothers

Some key parts of the theory you are familiar with for linear regression models carry over more generally to linear smoothers. They are not quite so important any more, but they do have their uses, and they can serve as security objects during the transition to non-parametric regression.

Throughout this sub-section, we will temporarily assume that $Y = \mu(X) + \epsilon$, with the noise $\epsilon$ having constant variance $\sigma^2$, and no correlation with the noise terms at other observations. Also, we will define the smoothing, influence or hat matrix $\hat{w}$ by $\hat{w}_{ij} = \hat{w}(x_j, x_i)$. This records how much influence observation $y_j$ had on the smoother's fitted value for $\mu(x_i)$, which (remember) is $\hat{\mu}(x_i)$ or $\hat{\mu}_i$ for short^13 — hence the name "hat matrix" for $\hat{w}$.

1.5.3.1 Standard error of predicted mean values

It is easy to get the standard error of any predicted mean value $\hat{\mu}(x)$, by first working out its variance:

$$ V[\hat{\mu}(x)] = V\Big[ \sum_{j=1}^n \hat{w}(x_j, x) Y_j \Big] \qquad (1.58) $$
$$ = \sum_{j=1}^n V[\hat{w}(x_j, x) Y_j] \qquad (1.59) $$
$$ = \sum_{j=1}^n \hat{w}^2(x_j, x) V[Y_j] \qquad (1.60) $$
$$ = \sigma^2 \sum_{j=1}^n \hat{w}^2(x_j, x) \qquad (1.61) $$

The second line uses the assumption that the noise is uncorrelated, and the last the assumption that the noise variance is constant. In particular, for a point $x_i$ which appeared in the training data, $V[\hat{\mu}(x_i)] = \sigma^2 \sum_j \hat{w}_{ij}^2$.

Notice that this is the variance in the predicted mean value, $\hat{\mu}(x)$. It is not an estimate of $V[Y \mid X = x]$, though we will see how conditional variances can be estimated using nonparametric regression in Chapter 10. Notice also that we have not had to assume that the noise is Gaussian. If we did add that assumption, this formula would also give us a confidence interval for the fitted value (though we would still have to worry about estimating $\sigma$).

1.5.3.2 (Effective) Degrees of Freedom

For linear regression models, you will recall that the number of "degrees of freedom" was just the number of coefficients (including the intercept). While degrees of freedom are less important for other sorts of regression than for linear models, they're still worth knowing about, so I'll explain here how they are defined and calculated.

^13 This is often written as $\hat{y}_i$, but that's not very logical notation; the quantity is a function of $y_i$, not an estimate of it; it's an estimate of $\mu(x_i)$.

The first thing to realize is that we can't use the number of parameters to define degrees of freedom in general, since most linear smoothers don't have parameters. Instead, we have to go back to the reasons why the number of parameters matters in ordinary linear models.^14 We'll start with an $n \times p$ data matrix of predictor variables $\mathbf{x}$ (possibly including an all-1 column for an intercept), and an $n \times 1$ column matrix of response values $\mathbf{y}$. The ordinary least squares estimate of the $p$-dimensional coefficient vector $\beta$ is

$$ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \qquad (1.62) $$

This lets us write the fitted values in terms of $\mathbf{x}$ and $\mathbf{y}$:

$$ \hat{\mu} = \mathbf{x} \hat{\beta} \qquad (1.63) $$
$$ = \mathbf{x} (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \qquad (1.64) $$
$$ = \mathbf{w} \mathbf{y} \qquad (1.65) $$

where $\mathbf{w}$ is the $n \times n$ matrix, with $w_{ij}$ saying how much of each observed $y_j$ contributes to each fitted $\hat{\mu}_i$. This is what, a little while ago, I called the influence or hat matrix, in the special case of ordinary least squares.

Notice that $\mathbf{w}$ depends only on the predictor variables in $\mathbf{x}$; the observed response values in $\mathbf{y}$ don't matter. If we change around $\mathbf{y}$, the fitted values $\hat{\mu}$ will also change, but only within the limits allowed by $\mathbf{w}$. There are $n$ independent coordinates along which $\mathbf{y}$ can change, so we say the data have $n$ degrees of freedom. Once $\mathbf{x}$ (and thus $\mathbf{w}$) are fixed, however, $\hat{\mu}$ has to lie in a $p$-dimensional linear subspace of this $n$-dimensional space, and the residuals have to lie in the $(n - p)$-dimensional space orthogonal to it.

Geometrically, the dimension of the space in which $\hat{\mu} = \mathbf{w}\mathbf{y}$ is confined is the rank of the matrix $\mathbf{w}$. Since $\mathbf{w}$ is an idempotent matrix (Exercise 1.5), its rank equals its trace. And that trace is, exactly, $p$:

$$ \mathrm{tr}\, \mathbf{w} = \mathrm{tr}\left( \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1} \mathbf{x}^T \right) \qquad (1.66) $$
$$ = \mathrm{tr}\left( (\mathbf{x}^T\mathbf{x})^{-1} \mathbf{x}^T \mathbf{x} \right) \qquad (1.67) $$
$$ = \mathrm{tr}\, \mathbf{I}_p = p \qquad (1.68) $$

since for any matrices $a$, $b$, $\mathrm{tr}(ab) = \mathrm{tr}(ba)$, and $\mathbf{x}^T \mathbf{x}$ is a $p \times p$ matrix.^15

For more general linear smoothers, we can still write Eq. 1.53 in matrix form,

$$ \hat{\mu} = \mathbf{w} \mathbf{y} \qquad (1.69) $$

We now define^16 the degrees of freedom to be the trace of $\mathbf{w}$:

$$ \mathrm{df}(\hat{\mu}) \equiv \mathrm{tr}\, \mathbf{w} \qquad (1.70) $$

This may not be an integer.

^14 What follows uses some concepts and results from linear algebra; see Appendix B for reminders.
^15 This all assumes that $\mathbf{x}^T\mathbf{x}$ has an inverse. Can you work out what happens when it does not?
^16 Some authors prefer to say "effective degrees of freedom", to emphasize that we're not just counting parameters.
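As a concrete illustration of the last two sub-sections, here is a small R sketch of my own, not from the text: it builds the smoothing matrix w for a Gaussian-kernel smoother on simulated data (the data, bandwidth and variable names are all arbitrary choices), reads off the fitted values, the variances of Eq. 1.61, and the effective degrees of freedom of Eq. 1.70, and finally checks that the OLS hat matrix has trace p.

# Illustrative sketch (not from the text): smoothing matrix, standard errors,
# and effective degrees of freedom for a Gaussian-kernel smoother.
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, 0, 0.2)
h <- 0.05                                      # bandwidth (arbitrary choice)
K <- outer(x, x, function(a, b) dnorm(a - b, sd = h))
w <- K / rowSums(K)                            # w[i, j] = weight of y_j in muhat(x_i)
muhat <- w %*% y                               # fitted values (Eq. 1.69)
sigma2 <- 0.2^2                                # here we know the true noise variance
se.fit <- sqrt(sigma2 * rowSums(w^2))          # square root of Eq. 1.61 at the training points
df <- sum(diag(w))                             # Eq. 1.70; generally not an integer
# Check: for OLS with an intercept and one slope, tr(w) = p = 2
X <- cbind(1, x)
w.ols <- X %*% solve(t(X) %*% X) %*% t(X)
sum(diag(w.ols))                               # numerically 2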

Covariance of Observations and Fits Eq. 1.70 defines the number of degrees of freedom for linear smoothers. A yet more general definition includes nonlinear methods, assuming that $Y_i = \mu(x_i) + \epsilon_i$, and the $\epsilon_i$ consist of uncorrelated noise of constant variance $\sigma^2$.^17 This is

$$ \mathrm{df}(\hat{\mu}) \equiv \frac{1}{\sigma^2} \sum_{i=1}^n \mathrm{Cov}[Y_i, \hat{\mu}(x_i)] \qquad (1.71) $$

In words, this is the normalized covariance between each observed response $Y_i$ and the corresponding predicted value, $\hat{\mu}(x_i)$. This is a very natural way of measuring how flexible or stable the regression model is, by seeing how much it shifts with the data.

If we do have a linear smoother, Eq. 1.71 reduces to Eq. 1.70:

$$ \mathrm{Cov}[Y_i, \hat{\mu}(x_i)] = \mathrm{Cov}\Big[ Y_i, \sum_{j=1}^n w_{ij} Y_j \Big] \qquad (1.72) $$
$$ = \sum_{j=1}^n w_{ij} \mathrm{Cov}[Y_i, Y_j] \qquad (1.73) $$
$$ = w_{ii} V[Y_i] = \sigma^2 w_{ii} \qquad (1.74) $$

Here the first line uses the fact that we're dealing with a linear smoother, and the last line the assumption that $\epsilon_i$ is uncorrelated and has constant variance. Therefore

$$ \mathrm{df}(\hat{\mu}) = \frac{1}{\sigma^2} \sum_{i=1}^n \sigma^2 w_{ii} = \mathrm{tr}\, w \qquad (1.75) $$

as promised.

1.5.3.3 Prediction Errors

Bias Because linear smoothers are linear in the response variable, it's easy to work out (theoretically) the expected value of their fits:

$$ E[\hat{\mu}_i] = \sum_{j=1}^n w_{ij} E[Y_j] \qquad (1.76) $$

In matrix form,

$$ E[\hat{\mu}] = w E[Y] \qquad (1.77) $$

This means the smoother is unbiased if, and only if, $w E[Y] = E[Y]$, that is, if $E[Y]$ is an eigenvector of $w$. Turned around, the condition for the smoother to be unbiased is

$$ (I_n - w) E[Y] = 0 \qquad (1.78) $$

^17 But see Exercise 1.10.

In general, $(I_n - w) E[Y] \neq 0$, so linear smoothers are more or less biased. Different smoothers are, however, unbiased for different families of regression functions. Ordinary linear regression, for example, is unbiased if and only if the regression function really is linear.

In-sample mean squared error When you studied linear regression, you learned that the expected mean-squared error on the data used to fit the model is $\sigma^2 (n - p)/n$. This formula generalizes to other linear smoothers. Let's first write the residuals in matrix form.

$$ y - \hat{\mu} = y - w y \qquad (1.79) $$
$$ = I_n y - w y \qquad (1.80) $$
$$ = (I_n - w) y \qquad (1.81) $$

The in-sample mean squared error is $n^{-1} \| y - \hat{\mu} \|^2$, so

$$ \frac{1}{n} \| y - \hat{\mu} \|^2 = \frac{1}{n} \| (I_n - w) y \|^2 \qquad (1.82) $$
$$ = \frac{1}{n} y^T (I_n - w)^T (I_n - w) y \qquad (1.83) $$

Taking expectations,^18

$$ E\Big[ \frac{1}{n} \| y - \hat{\mu} \|^2 \Big] = \frac{\sigma^2}{n} \mathrm{tr}\left( (I_n - w)^T (I_n - w) \right) + \frac{1}{n} \| (I_n - w) E[y] \|^2 \qquad (1.84) $$
$$ = \frac{\sigma^2}{n} \left( \mathrm{tr}\, I_n - 2\, \mathrm{tr}\, w + \mathrm{tr}(w^T w) \right) + \frac{1}{n} \| (I_n - w) E[y] \|^2 \qquad (1.85) $$
$$ = \frac{\sigma^2}{n} \left( n - 2\, \mathrm{tr}\, w + \mathrm{tr}(w^T w) \right) + \frac{1}{n} \| (I_n - w) E[y] \|^2 \qquad (1.86) $$

The last term, $n^{-1} \| (I_n - w) E[y] \|^2$, comes from the bias: it indicates the distortion that the smoother would impose on the regression function, even without noise. The first term, proportional to $\sigma^2$, reflects the variance. Notice that it involves not only what we've called the degrees of freedom, $\mathrm{tr}\, w$, but also a second-order term, $\mathrm{tr}\, w^T w$. For ordinary linear regression, you can show (Exercise 1.9) that $\mathrm{tr}(w^T w) = p$, so $2\, \mathrm{tr}\, w - \mathrm{tr}(w^T w)$ would also equal $p$. For this reason, some people prefer either $\mathrm{tr}(w^T w)$ or $2\, \mathrm{tr}\, w - \mathrm{tr}(w^T w)$ as the definition of degrees of freedom for linear smoothers, so be careful.

1.5.3.4 Inferential Statistics

Many of the formulas underlying things like the $F$ test for whether a regression predicts significantly better than the global mean carry over from linear regression to linear smoothers, if one uses the right definitions of degrees of freedom, and one believes that the noise is always IID and Gaussian. However, we will see ways of doing inference on regression models which don't rely on Gaussian assumptions at all (Ch. 6), so I won't go over these results.

^18 See App. F.2 for how to find the expected value of quadratic forms like this.
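To tie the covariance definition of degrees of freedom (Eq. 1.71) to the traces just discussed, here is a small simulation sketch, again my own illustration rather than anything from the text. It estimates $\sum_i \mathrm{Cov}[Y_i, \hat{\mu}(x_i)]/\sigma^2$ by Monte Carlo for a Gaussian-kernel smoother, compares it with $\mathrm{tr}\, w$, and also reports $2\,\mathrm{tr}\, w - \mathrm{tr}(w^T w)$.

# Simulation sketch (not from the text): the covariance definition of degrees of
# freedom (Eq. 1.71) matches tr(w) for a linear smoother.
x <- runif(100)
mu <- sin(2 * pi * x)
h <- 0.05; sigma <- 0.2
K <- outer(x, x, function(a, b) dnorm(a - b, sd = h))
w <- K / rowSums(K)
fits <- replicate(2000, {
    y <- mu + rnorm(length(x), 0, sigma)     # fresh noise, same design points
    cbind(y, as.vector(w %*% y))             # record responses and fitted values
})
# Sum of Cov[Y_i, muhat(x_i)] over i, normalized by sigma^2
cov.df <- sum(sapply(1:length(x), function(i) cov(fits[i, 1, ], fits[i, 2, ]))) / sigma^2
c(cov.df = cov.df,
  trace.df = sum(diag(w)),
  alternative.df = 2 * sum(diag(w)) - sum(w^2))   # 2 tr(w) - tr(w^T w)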

1.6 Further Reading

In Chapter 2, we'll look more at the limits of linear regression and some extensions; Chapter 3 will cover some key aspects of evaluating statistical models, including regression models; and then Chapter 4 will come back to kernel regression, and more powerful tools than ksmooth. Chapters 8–10 and 13 all introduce further regression methods, while Chapters 11–12 pursue extensions.

Good treatments of regression, emphasizing linear smoothers but not limited to linear regression, can be found in Wasserman (2003, 2006), Simonoff (1996), Faraway (2006) and Győrfi et al. (2002). The last of these in particular provides a very thorough theoretical treatment of non-parametric regression methods. On generalizations of degrees of freedom to non-linear models, see Buja et al. (1989, §2.7.3), and Ye (1998).

Historical notes All the forms of nonparametric regression covered in this chapter are actually quite old. Kernel regression was introduced independently by Nadaraya (1964) and Watson (1964). The origin of nearest neighbor methods is less clear, and indeed they may have been independently invented multiple times — Cover and Hart (1967) collects some of the relevant early citations, as well as providing a pioneering theoretical analysis, extended to regression problems in Cover (1968a,b).

Exercises

1.1 Suppose $Y_1, Y_2, \ldots, Y_n$ are random variables with the same mean $\mu$ and standard deviation $\sigma$, and that they are all uncorrelated with each other,^19 but not necessarily independent or identically distributed. Show the following:
1. $V\left[\sum_{i=1}^n Y_i\right] = n\sigma^2$.
2. $V\left[n^{-1} \sum_{i=1}^n Y_i\right] = \sigma^2/n$.
3. The standard deviation of $n^{-1} \sum_{i=1}^n Y_i$ is $\sigma/\sqrt{n}$.
4. The standard deviation of $n^{-1} \sum_{i=1}^n Y_i - \mu$ is $\sigma/\sqrt{n}$.
Can you state the analogous results when the $Y_i$ share mean $\mu$ but each has its own standard deviation $\sigma_i$? When each $Y_i$ has a distinct mean $\mu_i$? (Assume in both cases that the $Y_i$ remain uncorrelated.)

1.2 Suppose we use the mean absolute error instead of the mean squared error:
$$ \mathrm{MAE}(m) = E[|Y - m|] \qquad (1.87) $$
Is this also minimized by taking $m = E[Y]$? If not, what value $\tilde{\mu}$ minimizes the MAE? Should we use MSE or MAE to measure error?

1.3 Derive Eqs. 1.45 and 1.44 by minimizing Eq. 1.43.

1.4 What does it mean to say that Gaussian kernel regression approaches nearest-neighbor regression as $h \to 0$? Why does it do so? Is this true for all kinds of kernel regression?

1.5 Prove that $w$ from Eq. 1.65 is idempotent, i.e., that $w^2 = w$.

1.6 Show that for ordinary linear regression, Eq. 1.61 gives the same variance for fitted values as the usual formula.

^19 See Appendix E.4 for a refresher on the difference between "uncorrelated" and "independent".

1.7 Consider the global mean as a linear smoother. Work out the influence matrix $w$, and show that it has one degree of freedom, using the definition in Eq. 1.70.

1.8 Consider $k$-nearest-neighbors regression as a linear smoother. Work out the influence matrix $w$, and find an expression for the number of degrees of freedom (in the sense of Eq. 1.70) in terms of $k$ and $n$. Hint: Your answers should reduce to those of the previous problem when $k = n$.

1.9 Suppose that $Y_i = \mu(x_i) + \epsilon_i$, where the $\epsilon_i$ are uncorrelated, have mean 0, with constant variance $\sigma^2$. Prove that, for a linear smoother, $n^{-1} \sum_{i=1}^n V[\hat{\mu}_i] = (\sigma^2/n)\, \mathrm{tr}(w w^T)$. Show that this reduces to $\sigma^2 p/n$ for ordinary linear regression.

1.10 Suppose that $Y_i = \mu(x_i) + \epsilon_i$, where the $\epsilon_i$ are uncorrelated and have mean 0, but each has its own variance $\sigma_i^2$. Consider modifying the definition of degrees of freedom to $\sum_{i=1}^n \mathrm{Cov}[Y_i, \hat{\mu}_i]/\sigma_i^2$ (which reduces to Eq. 1.71 if all the $\sigma_i^2 = \sigma^2$). Show that this still equals $\mathrm{tr}\, w$ for a linear smoother with influence matrix $w$.

2
The Truth about Linear Regression

We need to say some more about linear regression, and especially about how it really works and how it can fail. Linear regression is important because

1. it's a fairly straightforward technique which sometimes works tolerably for prediction;
2. it's a simple foundation for some more sophisticated techniques;
3. it's a standard method so people use it to communicate; and
4. it's a standard method so people have come to confuse it with prediction and even with causal inference as such.

We need to go over (1)–(3), and provide prophylaxis against (4).

[[TODO: Discuss the geometry: smoothing on to a linear surface; only projection along β matters; fitted values constrained to a linear subspace]]

2.1 Optimal Linear Prediction: Multiple Variables

We have a numerical variable $Y$ and a $p$-dimensional vector of predictor variables or features $\vec{X}$. We would like to predict $Y$ using $\vec{X}$. Chapter 1 taught us that the mean-squared optimal predictor is the conditional expectation,

$$ \mu(\vec{x}) = E\left[ Y \mid \vec{X} = \vec{x} \right] \qquad (2.1) $$

Instead of using the optimal predictor $\mu(\vec{x})$, let's try to predict as well as possible while using only a linear function of $\vec{x}$, say $\beta_0 + \vec{x} \cdot \beta$.^1 This is not an assumption about the world, but rather a decision on our part; a choice, not a hypothesis. This decision can be good — $\beta_0 + \vec{x} \cdot \beta$ could be a tolerable approximation to $\mu(\vec{x})$ — even if the linear hypothesis is strictly wrong. Even if no linear approximation to $\mu$ is much good mathematically, we might still want one for practical reasons, e.g., speed of computation.

(Perhaps the best reason to hope the choice to use a linear model isn't crazy is that we may hope $\mu$ is a smooth function.^2 If it is, then we can Taylor expand

^1 Pedants might quibble that this function is actually affine rather than linear. But the distinction is specious: we can always add an extra element to $\vec{x}$, which is always 1, getting the vector $\vec{x}'$, and then we have the linear function $\vec{x}' \cdot \beta'$.
^2 See Appendix D on Taylor approximations.

it about our favorite point, say $\vec{u}$:

$$ \mu(\vec{x}) = \mu(\vec{u}) + \sum_{i=1}^p (x_i - u_i) \left. \frac{\partial \mu}{\partial x_i} \right|_{\vec{u}} + O(\| \vec{x} - \vec{u} \|^2) \qquad (2.2) $$

or, in the more compact vector-calculus notation,

$$ \mu(\vec{x}) = \mu(\vec{u}) + (\vec{x} - \vec{u}) \cdot \nabla \mu(\vec{u}) + O(\| \vec{x} - \vec{u} \|^2) \qquad (2.3) $$

If we only look at points $\vec{x}$ which are close to $\vec{u}$, then the remainder terms $O(\| \vec{x} - \vec{u} \|^2)$ are small, and a linear approximation is a good one.^3 Here, "close to $\vec{u}$" really means "so close that all the non-linear terms in the Taylor series are comparatively negligible".)

Whatever the reason for wanting to use a linear function, there are many linear functions, and we need to pick just one of them. We may as well do that by minimizing mean-squared error again:

$$ \mathrm{MSE}(\beta_0, \beta) = E\left[ \left( Y - \beta_0 - \vec{X} \cdot \beta \right)^2 \right] \qquad (2.4) $$

Going through the optimization is parallel to the one-dimensional case we worked through in §1.4.3, with the conclusion that the optimal $\beta$ is

$$ \beta = v^{-1} \mathrm{Cov}\left[ \vec{X}, Y \right] \qquad (2.5) $$

where $v$ is the covariance matrix of $\vec{X}$, i.e., $v_{ij} = \mathrm{Cov}[X_i, X_j]$, and $\mathrm{Cov}[\vec{X}, Y]$ is the vector of covariances between the regressors and $Y$, i.e. $\mathrm{Cov}[\vec{X}, Y]_i = \mathrm{Cov}[X_i, Y]$. We also get

$$ \beta_0 = E[Y] - \beta \cdot E[\vec{X}] \qquad (2.6) $$

just as in the one-dimensional case (Exercise 2.1). These conclusions hold without assuming anything at all about the true regression function $\mu$; about the distribution of $\vec{X}$, of $Y$, of $Y \mid \vec{X}$, or of $Y - \mu(\vec{X})$ (in particular, nothing needs to be Gaussian); or whether data points are independent or not.

Multiple regression would be a lot simpler if we could just do a simple regression for each regressor, and add them up; but really, this is what multiple regression does, just in a disguised form. If the input variables are uncorrelated, $v$ is diagonal ($v_{ij} = 0$ unless $i = j$), and so is $v^{-1}$. Then doing multiple regression breaks up into a sum of separate simple regressions across each input variable. When the input variables are correlated and $v$ is not diagonal, we can think of the multiplication by $v^{-1}$ as de-correlating $\vec{X}$ — applying a linear transformation to come up with a new set of inputs which are uncorrelated with each other.^4

^3 If you are not familiar with the big-$O$ notation like $O(\| \vec{x} - \vec{u} \|^2)$, now would be a good time to read Appendix C.
^4 If $\vec{Z}$ is a random vector with covariance matrix $I$, then $w\vec{Z}$ is a random vector with covariance matrix $w w^T$. Conversely, if we start with a random vector $\vec{X}$ with covariance matrix $v$, the latter has a "square root" $v^{1/2}$ (i.e., $v^{1/2} v^{1/2} = v$), and $v^{-1/2}\vec{X}$ will be a random vector with covariance matrix $I$. When we write our predictions as $\vec{X} \cdot v^{-1} \mathrm{Cov}[\vec{X}, Y]$, we should think of this as $\left( v^{-1/2} \vec{X} \right) \cdot \left( v^{-1/2} \mathrm{Cov}[\vec{X}, Y] \right)$. We use one power of $v^{-1/2}$ to transform the input features into uncorrelated variables before taking their correlations with the response, and the other power to decorrelate $\vec{X}$. — For more on using covariance matrices to come up with new, decorrelated variables, see Chapter 16.
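As a sanity check on Eqs. 2.5 and 2.6, here is a brief R sketch, my own illustration rather than anything from the text: it plugs sample covariances into those formulas and confirms that the result matches coef(lm(...)), even though the true regression function here is deliberately nonlinear.

# Illustrative sketch (not from the text): optimal linear predictor from
# sample covariances (Eqs. 2.5-2.6), compared with lm(), for a nonlinear truth.
n <- 1000
X <- cbind(x1 = runif(n), x2 = rnorm(n))
y <- sqrt(X[, "x1"]) + sin(X[, "x2"]) + rnorm(n, 0, 0.1)
v <- cov(X)                                   # covariance matrix of the regressors
cov.Xy <- cov(X, y)                           # covariances with the response
beta <- solve(v, cov.Xy)                      # Eq. 2.5
beta0 <- mean(y) - sum(beta * colMeans(X))    # Eq. 2.6
rbind(by.hand = c(beta0, beta), lm = coef(lm(y ~ X)))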

Notice: $\beta$ depends on the marginal distribution of $\vec{X}$ (through the covariance matrix $v$). If that shifts, the optimal coefficients $\beta$ will shift, unless the real regression function is linear.

2.1.1 Collinearity

The formula $\beta = v^{-1} \mathrm{Cov}[\vec{X}, Y]$ makes no sense if $v$ has no inverse. This will happen if, and only if, the predictor variables are linearly dependent on each other — if one of the predictors is really a linear combination of the others. Then (as we learned in linear algebra) the covariance matrix is of less than "full rank" (i.e., "rank deficient") and it doesn't have an inverse. Equivalently, $v$ has at least one eigenvalue which is exactly zero.

So much for the algebra; what does that mean statistically? Let's take an easy case where one of the predictors is just a multiple of the others — say you've included people's weight in pounds ($X_1$) and mass in kilograms ($X_2$), so $X_1 = 2.2 X_2$. Then if we try to predict $Y$, we'd have

$$ \hat{\mu}(\vec{X}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \ldots + \beta_p X_p \qquad (2.7) $$
$$ = \beta_0 + (2.2\beta_1 + \beta_2) X_2 + 0 X_1 + \sum_{i=3}^p \beta_i X_i \qquad (2.8) $$
$$ = \beta_0 + (\beta_1 + \beta_2/2.2) X_1 + 0 X_2 + \sum_{i=3}^p \beta_i X_i \qquad (2.9) $$
$$ = \beta_0 + (\beta_1 + 1000) X_1 + (\beta_2 - 2200) X_2 + \sum_{i=3}^p \beta_i X_i \qquad (2.10) $$

In other words, because there's a linear relationship between $X_1$ and $X_2$, we can make the coefficient for $X_1$ whatever we like, provided we make a corresponding adjustment to the coefficient for $X_2$, and it has no effect at all on our prediction. So rather than having one optimal linear predictor, we have infinitely many of them.^5

There are three ways of dealing with collinearity. One is to get a different data set where the regressors are no longer collinear. A second is to identify one of the collinear variables (it usually doesn't matter which) and drop it from the data set. This can get complicated; principal components analysis (Chapter 16) can help here. Thirdly, since the issue is that there are infinitely many different coefficient vectors which all minimize the MSE, we could appeal to some extra principle,

^5 Algebraically, there is a linear combination of two (or more) of the regressors which is constant. The coefficients of this linear combination are given by one of the zero eigenvectors of $v$.

beyond prediction accuracy, to select just one of them. We might, for instance, prefer smaller coefficient vectors (all else being equal), or ones where more of the coefficients were exactly zero. Using some quality other than the squared error to pick out a unique solution is called "regularizing" the optimization problem, and a lot of attention has been given to regularized regression, especially in the "high dimensional" setting where the number of coefficients is comparable to, or even greater than, the number of data points. See Appendix H.3.5, and exercise 7.2 in Chapter 7.

2.1.2 The Prediction and Its Error

Once we have coefficients $\beta$, we can use them to make predictions for the expected value of $Y$ at arbitrary values of $\vec{X}$, whether we've an observation there before or not. How good are these? If we have the optimal coefficients, then the prediction error will be uncorrelated with the regressors:

$$ \mathrm{Cov}\left[ Y - \vec{X} \cdot \beta, \vec{X} \right] = \mathrm{Cov}\left[ Y, \vec{X} \right] - \mathrm{Cov}\left[ \vec{X} \cdot (v^{-1} \mathrm{Cov}[\vec{X}, Y]), \vec{X} \right] \qquad (2.11) $$
$$ = \mathrm{Cov}\left[ Y, \vec{X} \right] - v v^{-1} \mathrm{Cov}\left[ Y, \vec{X} \right] \qquad (2.12) $$
$$ = 0 \qquad (2.13) $$

Moreover, the expected prediction error, averaged over all $\vec{X}$, will be zero (Exercise 2.2). But the conditional expectation of the error is generally not zero,

$$ E\left[ Y - \vec{X} \cdot \beta \mid \vec{X} = \vec{x} \right] \neq 0 \qquad (2.14) $$

and the conditional variance is generally not constant,

$$ V\left[ Y - \vec{X} \cdot \beta \mid \vec{X} = \vec{x}_1 \right] \neq V\left[ Y - \vec{X} \cdot \beta \mid \vec{X} = \vec{x}_2 \right] \qquad (2.15) $$

The optimal linear predictor can be arbitrarily bad, and it can make arbitrarily big systematic mistakes. It is generally very biased.^6

2.1.3 Estimating the Optimal Linear Predictor

To actually estimate $\beta$ from data, we need to make some probabilistic assumptions about where the data comes from. A fairly weak but often sufficient assumption is that observations $(\vec{X}_i, Y_i)$ are independent for different values of $i$, with unchanging covariances. Then if we look at the sample covariances, they will, by the law of large numbers, converge on the true covariances:

$$ \frac{1}{n} \mathbf{X}^T \mathbf{Y} \to \mathrm{Cov}\left[ \vec{X}, Y \right] \qquad (2.16) $$
$$ \frac{1}{n} \mathbf{X}^T \mathbf{X} \to v \qquad (2.17) $$

^6 You were taught in your linear models course that linear regression makes unbiased predictions. This presumed that the linear model was true.

where as before $\mathbf{X}$ is the data-frame matrix with one row for each data point and one column for each variable, and similarly for $\mathbf{Y}$. So, by continuity,

$$ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \to \beta \qquad (2.18) $$

and we have a consistent estimator.

On the other hand, we could start with the empirical or in-sample mean squared error

$$ \mathrm{MSE}(\beta) \equiv \frac{1}{n} \sum_{i=1}^n (y_i - \vec{x}_i \cdot \beta)^2 \qquad (2.19) $$

and minimize it. The minimizer is the same $\hat{\beta}$ we got by plugging in the sample covariances. No probabilistic assumption is needed to minimize the in-sample MSE, but it doesn't let us say anything about the convergence of $\hat{\beta}$. For that, we do need some assumptions about $\vec{X}$ and $Y$ coming from distributions with unchanging covariances.

(One can also show that the least-squares estimate is the linear predictor with the minimax prediction risk. That is, its worst-case performance, when everything goes wrong and the data are horrible, will be better than any other linear method. This is some comfort, especially if you have a gloomy and pessimistic view of data, but other methods of estimation may work better in less-than-worst-case scenarios.)

2.1.3.1 Unbiasedness and Variance of Ordinary Least Squares Estimates

The very weak assumptions we have made still let us say a little bit more about the properties of the ordinary least squares estimate $\hat{\beta}$. To do so, we need to think about why $\hat{\beta}$ fluctuates. For the moment, let's fix $\mathbf{X}$ at a particular value $\mathbf{x}$, but allow $\mathbf{Y}$ to vary randomly (what's called "fixed design" regression).

The key fact is that $\hat{\beta}$ is linear in the observed responses $\mathbf{Y}$. We can use this by writing, as you're used to from your linear regression class,

$$ Y = \vec{X} \cdot \beta + \epsilon \qquad (2.20) $$

Here $\epsilon$ is the noise around the optimal linear predictor; we have to remember that while $E[\epsilon] = 0$ and $\mathrm{Cov}[\epsilon, \vec{X}] = 0$, it is not generally true that $E[\epsilon \mid \vec{X} = \vec{x}] = 0$ or that $V[\epsilon \mid \vec{X} = \vec{x}]$ is constant. Even with these limitations, we can still say that

$$ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{Y} \qquad (2.21) $$
$$ = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T (\mathbf{x}\beta + \epsilon) \qquad (2.22) $$
$$ = \beta + (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \epsilon \qquad (2.23) $$

This directly tells us that $\hat{\beta}$ is an unbiased estimate of $\beta$:

$$ E\left[ \hat{\beta} \mid \mathbf{X} = \mathbf{x} \right] = \beta + (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T E[\epsilon] \qquad (2.24) $$
$$ = \beta + 0 = \beta \qquad (2.25) $$
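A quick simulation sketch (mine, not the text's) of the point just made: with a fixed design, a linear truth, and noise that is skewed and heteroskedastic rather than Gaussian, the average of $\hat{\beta}$ over many replicates still sits essentially on top of the true coefficients.

# Simulation sketch (not from the text): OLS is unbiased for the linear coefficients
# without any Gaussian assumption; here the noise is skewed and heteroskedastic.
beta.true <- c(2, -1)                # intercept and slope
x <- runif(200, 0, 5)                # fixed design, reused in every replicate
X <- cbind(1, x)
one.fit <- function() {
    noise <- (rexp(length(x)) - 1) * (0.5 + x)   # mean zero, variance grows with x
    y <- drop(X %*% beta.true) + noise
    coef(lm(y ~ x))
}
betas <- replicate(5000, one.fit())
rowMeans(betas)                      # close to c(2, -1)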

We can also get the variance matrix of $\hat{\beta}$:

$$ V\left[ \hat{\beta} \mid \mathbf{X} = \mathbf{x} \right] = V\left[ \beta + (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \epsilon \mid \mathbf{X} = \mathbf{x} \right] \qquad (2.26) $$
$$ = V\left[ (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \epsilon \mid \mathbf{X} = \mathbf{x} \right] \qquad (2.27) $$
$$ = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T V[\epsilon \mid \mathbf{X} = \mathbf{x}]\, \mathbf{x} (\mathbf{x}^T \mathbf{x})^{-1} \qquad (2.28) $$

Let's write $V[\epsilon \mid \mathbf{X} = \mathbf{x}]$ as a single matrix $\Sigma(\mathbf{x})$. If the linear-prediction errors are uncorrelated with each other, then $\Sigma$ will be diagonal. If they're also of equal variance, then $\Sigma = \sigma^2 I$, and we have

$$ V\left[ \hat{\beta} \mid \mathbf{X} = \mathbf{x} \right] = \sigma^2 (\mathbf{x}^T \mathbf{x})^{-1} = \frac{\sigma^2}{n} \left( \frac{1}{n} \mathbf{x}^T \mathbf{x} \right)^{-1} \qquad (2.29) $$

Said in words, this means that the variance of our estimates of the linear-regression coefficient will (i) go down as the sample size $n$ grows, (ii) go up as the linear regression gets worse ($\sigma^2$ grows), and (iii) go down as the regressors, the components of $\vec{X}$, have more sample variance themselves, and are less correlated with each other.

If we allow $\mathbf{X}$ to vary, then by the law of total variance,

$$ V\left[ \hat{\beta} \right] = E\left[ V\left[ \hat{\beta} \mid \mathbf{X} \right] \right] + V\left[ E\left[ \hat{\beta} \mid \mathbf{X} \right] \right] = E\left[ \frac{\sigma^2}{n} \left( \frac{1}{n} \mathbf{X}^T \mathbf{X} \right)^{-1} \right] \qquad (2.30) $$

As $n \to \infty$, the sample variance matrix $n^{-1} \mathbf{X}^T \mathbf{X} \to v$. Since matrix inversion is continuous, $V[\hat{\beta}] \to n^{-1} \sigma^2 v^{-1}$, and points (i)–(iii) still hold.

2.2 Shifting Distributions, Omitted Variables, and Transformations

2.2.1 Changing Slopes

I said earlier that the best $\beta$ in linear regression will depend on the distribution of the regressors, unless the conditional mean is exactly linear. Here is an illustration. For simplicity, let's say that $p = 1$, so there's only one regressor. I generated data from $Y = \sqrt{X} + \epsilon$, with $\epsilon \sim N(0, 0.05^2)$ (i.e. the standard deviation of the noise was 0.05). Figure 2.1 shows the lines inferred from samples with three different distributions of $X$: $X \sim \mathrm{Unif}(0, 1)$, $X \sim N(0.5, 0.01)$, and $X \sim \mathrm{Unif}(2, 3)$. Some distributions of $X$ lead to similar (and similarly wrong) regression lines; doing one estimate from all three data sets gives yet another answer.

2.2.1.1 $R^2$: Distraction or Nuisance?

This little set-up, by the way, illustrates that $R^2$ is not a stable property of the distribution either. For the black points, $R^2 = 0.92$; for the blue, $R^2 = 0.70$; for the red, $R^2 = 0.77$; and for the complete data, $0.96$. Other sets of $x_i$ values would give other values for $R^2$. Note that while the global linear fit isn't even a good approximation anywhere in particular, it has the highest $R^2$.

This kind of perversity can happen even in a completely linear set-up. Suppose

[Figure 2.1 plot appears here; see caption.]

Figure 2.1 Behavior of the conditional distribution $Y \mid X \sim N(\sqrt{X}, 0.05^2)$ with different distributions of $X$. The dots (in different colors and shapes) show three different distributions of $X$ (with sample values indicated by colored "rug" ticks on the axes), plus the corresponding regression lines. The solid line is the regression using all three sets of points, and the grey curve is the true regression function. (See Code Example 1 for the code used to make this figure.) Notice how different distributions of $X$ give rise to different slopes, each of which may make sense as a local approximation to the truth.

that $Y = aX + \epsilon$, and we happen to know $a$ exactly. The variance of $Y$ will now be $a^2 V[X] + V[\epsilon]$. The amount of variance our regression "explains" — really, the variance of our predictions — will be $a^2 V[X]$. So $R^2 = \frac{a^2 V[X]}{a^2 V[X] + V[\epsilon]}$. This goes to zero as $V[X] \to 0$ and it goes to 1 as $V[X] \to \infty$. It thus has little to do with the quality of the fit, and a lot to do with how spread out the regressor is.
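To see this numerically, here is a tiny simulation sketch of my own (the particular numbers are arbitrary, and this is separate from Code Example 1 below): the same slope and the same noise level, but two different spreads for $X$, give wildly different values of $R^2$.

# Illustrative sketch (not from the text): R^2 depends on the spread of X,
# not on how well the model fits.
a <- 2; sigma <- 1; n <- 10000
x.narrow <- runif(n, 0, 0.5)
x.wide <- runif(n, 0, 50)
y.narrow <- a * x.narrow + rnorm(n, 0, sigma)
y.wide <- a * x.wide + rnorm(n, 0, sigma)
summary(lm(y.narrow ~ x.narrow))$r.squared   # small: noise swamps a^2 V[X]
summary(lm(y.wide ~ x.wide))$r.squared       # nearly 1: same model, wider X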

x1 <- runif(100)
x2 <- rnorm(100, 0.5, 0.1)
x3 <- runif(100, 2, 3)
y1 <- sqrt(x1) + rnorm(length(x1), 0, 0.05)
y2 <- sqrt(x2) + rnorm(length(x2), 0, 0.05)
y3 <- sqrt(x3) + rnorm(length(x3), 0, 0.05)
plot(x1, y1, xlim = c(0, 3), ylim = c(0, 3), xlab = "X", ylab = "Y", col = "darkgreen", pch = 15)
rug(x1, side = 1, col = "darkgreen")
rug(y1, side = 2, col = "darkgreen")
points(x2, y2, pch = 16, col = "blue")
rug(x2, side = 1, col = "blue")
rug(y2, side = 2, col = "blue")
points(x3, y3, pch = 17, col = "red")
rug(x3, side = 1, col = "red")
rug(y3, side = 2, col = "red")
lm1 <- lm(y1 ~ x1)
lm2 <- lm(y2 ~ x2)
lm3 <- lm(y3 ~ x3)
abline(lm1, col = "darkgreen", lty = "dotted")
abline(lm2, col = "blue", lty = "dashed")
abline(lm3, col = "red", lty = "dotdash")
x.all <- c(x1, x2, x3)
y.all <- c(y1, y2, y3)
lm.all <- lm(y.all ~ x.all)
abline(lm.all, lty = "solid")
curve(sqrt(x), col = "grey", add = TRUE)
legend("topleft", legend = c("Unif[0,1]", "N(0.5, 0.01)", "Unif[2,3]", "Union of above",
    "True regression line"), col = c("black", "blue", "red", "black", "grey"),
    pch = c(15, 16, 17, NA, NA), lty = c("dotted", "dashed", "dotdash", "solid", "solid"))

Code Example 1: Code used to make Figure 2.1.

Notice also how easy it is to get a very high $R^2$ even when the true model is not linear!

2.2.2 Omitted Variables and Shifting Distributions

That the optimal regression coefficients can change with the distribution of the predictor features is annoying, but one could after all notice that the distribution has shifted, and so be cautious about relying on the old regression. More subtle is that the regression coefficients can depend on variables which you do not measure, and those can shift without your noticing anything.

Mathematically, the issue is that

$$ E\left[ Y \mid \vec{X} \right] = E\left[ E\left[ Y \mid Z, \vec{X} \right] \mid \vec{X} \right] \qquad (2.31) $$

Now, if $Y$ is independent of $Z$ given $\vec{X}$, then the extra conditioning in the inner expectation does nothing and changing $Z$ doesn't alter our predictions. But in general there will be plenty of variables $Z$ which we don't measure (so they're

[Figure 2.2 plot appears here; see caption.]

library(lattice)
library(MASS)
x.z = mvrnorm(100, c(0, 0), matrix(c(1, 0.1, 0.1, 1), nrow = 2))
y = x.z[, 1] + x.z[, 2] + rnorm(100, 0, 0.1)
cloud(y ~ x.z[, 1] * x.z[, 2], xlab = "X", ylab = "Z", zlab = "Y", scales = list(arrows = FALSE), col.point = "black")

Figure 2.2 Scatter-plot of response variable $Y$ (vertical axis) and two variables which influence it (horizontal axes): $X$, which is included in the regression, and $Z$, which is omitted. $X$ and $Z$ have a correlation of $+0.1$.

not included in $\vec{X}$) but which have some non-redundant information about the response (so that $Y$ depends on $Z$ even conditional on $\vec{X}$). If the distribution of $Z$ given $\vec{X}$ changes, then the optimal regression of $Y$ on $\vec{X}$ should change too.

Here's an example. $X$ and $Z$ are both $N(0, 1)$, but with a positive correlation of 0.1. In reality, $Y \sim N(X + Z, 0.01)$. Figure 2.2 shows a scatterplot of all three variables together ($n = 100$).

Now I change the correlation between $X$ and $Z$ to $-0.1$. This leaves both marginal distributions alone, and is barely detectable by eye (Figure 2.3).

Figure 2.4 shows just the $X$ and $Y$ values from the two data sets, in black for the points with a positive correlation between $X$ and $Z$, and in blue when the correlation is negative. Looking by eye at the points and at the axis tick-marks, one sees that, as promised, there is very little change in the marginal distribution of either variable. Furthermore, the correlation between $X$ and $Y$ doesn't change much, going only from 0.75 to 0.55. On the other hand, the regression lines are noticeably different. When $\mathrm{Cov}[X, Z] = 0.1$, the slope of the regression line is 1.2 — high values for $X$ tend to indicate high values for $Z$, which also increases $Y$. When $\mathrm{Cov}[X, Z] = -0.1$, the slope of the regression line is 0.74, since extreme values of $X$ are now signs that $Z$ is at the opposite extreme, bringing $Y$ closer

[Figure 2.3 plot appears here; see caption.]

new.x.z = mvrnorm(100, c(0, 0), matrix(c(1, -0.1, -0.1, 1), nrow = 2))
new.y = new.x.z[, 1] + new.x.z[, 2] + rnorm(100, 0, 0.1)
cloud(new.y ~ new.x.z[, 1] * new.x.z[, 2], xlab = "X", ylab = "Z", zlab = "Y", scales = list(arrows = FALSE))

Figure 2.3 As in Figure 2.2, but shifting so that the correlation between $X$ and $Z$ is now $-0.1$, though the marginal distributions, and the distribution of $Y$ given $X$ and $Z$, are unchanged.

back to its mean. But, to repeat, the difference is due to changing the correlation between $X$ and $Z$, not how $X$ and $Z$ themselves relate to $Y$. If I regress $Y$ on $X$ and $Z$, I get $\hat{\beta} = (0.99, 1)$ in the first case and $\hat{\beta} = (0.98, 1)$ in the second.

We'll return to omitted variables when we look at causal inference in Part III.

2.2.3 Errors in Variables

Often, the predictor variables we can actually measure, $\vec{X}$, are distorted versions of some other variables $\vec{U}$ we wish we could measure, but can't:

$$ \vec{X} = \vec{U} + \vec{\eta} \qquad (2.32) $$

with $\vec{\eta}$ being some sort of noise. Regressing $Y$ on $\vec{X}$ then gives us what's called an errors-in-variables problem.

In one sense, the errors-in-variables problem is huge. We are often much more interested in the connections between actual variables in the real world, than with our imperfect, noisy measurements of them. Endless ink has been spilled, for instance, on what determines students' test scores. One thing commonly thrown into the regression — a feature included in $\vec{X}$ — is the income of children's

[Figure 2.4 plot appears here; see caption.]

Figure 2.4 Joint distribution of $X$ and $Y$ from Figure 2.2 (black, with a positive correlation between $X$ and $Z$) and from Figure 2.3 (blue, with a negative correlation between $X$ and $Z$). Tick-marks on the axes show the marginal distributions, which are manifestly little-changed. (See accompanying R file for commands.)

families.^7 But this is rarely measured precisely, so what we are really interested in — the relationship between actual income and school performance — is not what our regression estimates. Typically, adding noise to the input features makes them less predictive of the response — in linear regression, it tends to push $\hat{\beta}$ closer to zero than it would be if we could regress $Y$ on $\vec{U}$.

On account of the error-in-variables problem, some people get very upset when they see imprecisely-measured features as inputs to a regression. Some of them, in fact, demand that the input variables be measured exactly, with no noise whatsoever. This position, however, is crazy, and indeed there's a sense in which errors-in-variables isn't a problem at all. Our earlier reasoning about how to find the optimal linear predictor of $Y$ from $\vec{X}$ remains valid whether something like Eq. 2.32 is true or not. Similarly, the reasoning in Ch. 1 about the actual regression function being the over-all optimal predictor, etc., is unaffected. If we will continue to have $\vec{X}$ rather than $\vec{U}$ available to us for prediction, then Eq. 2.32 is irrelevant for prediction. Without better data, the relationship of $Y$ to $\vec{U}$ is just one of the unanswerable questions the world is full of, as much as "what song the sirens sang, or what name Achilles took when he hid among the women".

Now, if you are willing to assume that $\vec{\eta}$ is a very well-behaved Gaussian with known variance, then there are solutions to the error-in-variables problem for linear regression, i.e., ways of estimating the coefficients you'd get from regressing

^7 One common proxy is to ask the child what they think their family income is. (I didn't believe that either when I first read about it.)

$Y$ on $\vec{U}$. I'm not going to go over them, partly because they're in standard textbooks, but mostly because the assumptions are hopelessly demanding.^8

2.2.4 Transformation

Let's look at a simple non-linear example, $Y \mid X \sim N(\log X, 1)$. The problem with smoothing data like this on to a straight line is that the true regression curve isn't straight, $E[Y \mid X = x] = \log x$. (Figure 2.5.) This suggests replacing the variables we have with ones where the relationship is linear, and then undoing the transformation to get back to what we actually measure and care about.

We have two choices: we can transform the response $Y$, or the predictor $X$. Here transforming the response would mean regressing $\exp Y$ on $X$, and transforming the predictor would mean regressing $Y$ on $\log X$. Both kinds of transformations can be worth trying. The best reasons to use one kind rather than another are those that come from subject-matter knowledge: if we have good reason to think that $f(Y) = \beta X + \epsilon$, then it can make a lot of sense to transform $Y$. If genuine subject-matter considerations are not available, however, my experience is that transforming the predictors, rather than the response, is a better bet, for several reasons.

1. Mathematically, $E[f(Y)] \neq f(E[Y])$. A mean-squared optimal prediction of $f(Y)$ is not necessarily close to the transformation of an optimal prediction of $Y$. And $Y$ is, presumably, what we really want to predict.
2. Imagine that $Y = \sqrt{X} + \log Z$. There's not going to be any particularly nice transformation of $Y$ that makes everything linear, though there will be transformations of the features. This generalizes to more complicated models with features built from multiple covariates.
3. Suppose that we are in luck and $Y = \mu(X) + \epsilon$, with $\epsilon$ independent of $X$, and Gaussian, so all the usual default calculations about statistical inference apply. Then it will generally not be the case that $f(Y) = s(X) + \eta$, with $\eta$ a Gaussian random variable independent of $X$. In other words, transforming $Y$ completely messes up the noise model. (Consider the simple case where we take the logarithm of $Y$. Gaussian noise after the transformation implies log-normal noise before the transformation. Conversely, Gaussian noise before the transformation implies a very weird, nameless noise distribution after the transformation.)

Figure 2.6 shows the effect of these transformations. Here transforming the predictor does, indeed, work out more nicely; but of course I chose the example so that it does so.

To expand on that last point, imagine a model like so:

$$ \mu(\vec{x}) = \sum_{j=1}^q c_j f_j(\vec{x}) \qquad (2.33) $$

^8 Non-parametric error-in-variable methods are an active topic of research (Carroll et al., 2009).

[Figure 2.5 plot appears here; see caption.]

x <- runif(100)
y <- rnorm(100, mean = log(x), sd = 1)
plot(y ~ x)
curve(log(x), add = TRUE, col = "grey")
abline(lm(y ~ x))

Figure 2.5 Sample of data for $Y \mid X \sim N(\log X, 1)$. (Here $X \sim \mathrm{Unif}(0, 1)$, and all logs are natural logs.) The true, logarithmic regression curve is shown in grey (because it's not really observable), and the linear regression fit is shown in black.

If we know the functions $f_j$, we can estimate the optimal values of the coefficients $c_j$ by least squares — this is a regression of the response on new features, which happen to be defined in terms of the old ones. Because the parameters are outside the functions, that part of the estimation works just like linear regression.
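For instance, here is a short R sketch of my own of fitting a model of the form of Eq. 2.33 with a hand-picked pair of feature functions (the particular functions and data are arbitrary); since the unknown coefficients enter linearly, lm() handles it like any other linear regression.

# Illustrative sketch (not from the text): least squares on transformed features
# (Eq. 2.33), here with f_1(x) = sqrt(x) and f_2(x) = log(x) as the chosen basis.
x <- runif(300, min = 0.01, max = 5)
y <- 2 * sqrt(x) - 0.5 * log(x) + rnorm(300, 0, 0.1)
fit <- lm(y ~ sqrt(x) + log(x))   # sqrt() and log() can appear directly in the formula
coef(fit)                         # estimates of the intercept and of c_1, c_2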

[Figure 2.6 plots appear here; see caption.]

Figure 2.6 Transforming the predictor (left column) and the response (right) in the data from Figure 2.5, shown in both the transformed coordinates (top) and the original coordinates (middle). The bottom figure super-imposes the two estimated curves (transformed $X$ in black, transformed $Y$ in blue). The true regression curve is always in grey. (R code deliberately omitted; reproducing this is Exercise 2.4.)

Models embraced under the heading of Eq. 2.33 include linear regressions with interactions between the regressors (set $f_i = x_j x_k$, for various combinations of $j$ and $k$), and polynomial regression. There is however nothing magical about using products and powers of the regressors; we could regress $Y$ on $\sin x$, $\sin 2x$, $\sin 3x$, etc.

To apply models like Eq. 2.33, we can either (a) fix the functions $f_j$ in advance, based on guesses about what should be good features for this problem; (b) fix the

functions in advance by always using some "library" of mathematically convenient functions, like polynomials or trigonometric functions; or (c) try to find good functions from the data. Option (c) takes us beyond the realm of linear regression as such, into things like splines (Chapter 7) and additive models (Chapter 8). It is also possible to search for transformations of both sides of a regression model; see Breiman and Friedman (1985) and, for an R implementation, Spector et al. (2013).

2.3 Adding Probabilistic Assumptions

The usual treatment of linear regression adds many more probabilistic assumptions, namely that

$$ Y \mid \vec{X} \sim N(\vec{X} \cdot \beta, \sigma^2) \qquad (2.34) $$

and that $Y$ values are independent conditional on their $\vec{X}$ values. So now we are assuming that the regression function is exactly linear; we are assuming that the scatter of $Y$ around the regression function is Gaussian; we are assuming that the variance of this scatter is constant; and we are assuming that there is no dependence between this scatter and anything else. None of these assumptions was needed in deriving the optimal linear predictor. None of them is so mild that it should go without comment or without at least some attempt at testing.

Leaving that aside just for the moment, why make those assumptions? As you know from your earlier classes, they let us write down the likelihood of the observed responses $y_1, y_2, \ldots y_n$ (conditional on the covariates $\vec{x}_1, \ldots \vec{x}_n$), and then estimate $\beta$ and $\sigma^2$ by maximizing this likelihood. As you also know, the maximum likelihood estimate of $\beta$ is exactly the same as the $\beta$ obtained by minimizing the residual sum of squares. This coincidence would not hold in other models, with non-Gaussian noise.

We saw earlier that $\hat{\beta}$ is consistent under comparatively weak assumptions — that it converges to the optimal coefficients. But then there might, possibly, still be other estimators which are also consistent, but which converge faster. If we make the extra statistical assumptions, so that $\hat{\beta}$ is also the maximum likelihood estimate, we can lay that worry to rest. The MLE is generically (and certainly here!) asymptotically efficient, meaning that it converges as fast as any other consistent estimator, at least in the long run. So we are not, so to speak, wasting any of our data by using the MLE.

A further advantage of the MLE is that, as $n \to \infty$, its sampling distribution is itself a Gaussian, centered around the true parameter values. This lets us calculate standard errors and confidence intervals quite easily. Here, with the Gaussian assumptions, much more exact statements can be made about the distribution of $\hat{\beta}$ around $\beta$. You can find the formulas in any textbook on regression, so I won't get into that.

We can also use a general property of MLEs for model testing. Suppose we have two classes of models, $\Omega$ and $\omega$. $\Omega$ is the general case, with $p$ parameters, and

$\omega$ is a special case, where some of those parameters are constrained, but $q < p$ of them are left free to be estimated from the data. The constrained model class $\omega$ is then nested within $\Omega$. Say that the MLEs with and without the constraints are, respectively, $\hat{\Theta}$ and $\hat{\theta}$, so the maximum log-likelihoods are $L(\hat{\Theta})$ and $L(\hat{\theta})$. Because it's a maximum over a larger parameter space, $L(\hat{\Theta}) \geq L(\hat{\theta})$. On the other hand, if the true model really is in $\omega$, we'd expect the constrained and unconstrained estimates to be converging. It turns out that the difference in log-likelihoods has an asymptotic distribution which doesn't depend on any of the model details, namely

$$ 2\left[ L(\hat{\Theta}) - L(\hat{\theta}) \right] \rightsquigarrow \chi^2_{p-q} \qquad (2.35) $$

That is, a $\chi^2$ distribution with one degree of freedom for each extra parameter in $\Omega$ (that's why they're called "degrees of freedom").^9

This approach can be used to test particular restrictions on the model, and so it is sometimes used to assess whether certain variables influence the response. This, however, gets us into the concerns of the next section.

2.3.1 Examine the Residuals

By construction, the errors of the optimal linear predictor have expectation 0 and are uncorrelated with the regressors. Also by construction, the residuals of a fitted linear regression have sample mean 0, and are uncorrelated, in the sample, with the regressors.

If the usual probabilistic assumptions hold, however, the errors of the optimal linear predictor have many other properties as well.

1. The errors have a Gaussian distribution at each $\vec{x}$.
2. The errors have the same Gaussian distribution at each $\vec{x}$, i.e., they are independent of the regressors. In particular, they must have the same variance (i.e., they must be homoskedastic).
3. The errors are independent of each other. In particular, they must be uncorrelated with each other.

When these properties — Gaussianity, homoskedasticity, lack of correlation — hold, we say that the errors are white noise. They imply strongly related properties for the residuals: the residuals should be Gaussian, with variances and covariances given by the hat matrix, or more specifically by $I - \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T$ (§1.5.3.2). This means that the residuals will not be exactly white noise, but they should be close to white noise. You should check this! If you find residuals which are a long way from being white noise, you should be extremely suspicious of your model. These tests are much more important than checking whether the coefficients are significantly different from zero.

^9 If you assume the noise is Gaussian, the left-hand side of Eq. 2.35 can be written in terms of various residual sums of squares. However, the equation itself remains valid under other noise distributions, which just change the form of the likelihood function. See Appendix I.
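As a starting point, here is a small R sketch of my own showing basic graphical residual checks on a deliberately misspecified linear fit; the curvature in the residuals-versus-fitted plot is exactly the kind of departure from white noise that should make you suspicious. (The data, function choices and plot layout are arbitrary illustrations, not anything prescribed by the text.)

# Illustrative sketch (not from the text): basic residual checks for a fitted
# linear model. The data are deliberately nonlinear, so the residuals are far
# from white noise and the plots should make that obvious.
x <- runif(200, 0, 3)
y <- sin(2 * x) + rnorm(200, 0, 0.1)
fit <- lm(y ~ x)
par(mfrow = c(1, 2))
plot(fitted(fit), residuals(fit), xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = "dashed")                    # systematic curvature = misspecification
qqnorm(residuals(fit)); qqline(residuals(fit))   # check approximate Gaussianity
par(mfrow = c(1, 1))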

Every time someone uses linear regression with the standard assumptions for inference and does not test whether the residuals are white noise, an angel loses its wings.

2.3.2 On Significant Coefficients

If all the usual distributional assumptions hold, then $t$-tests can be used to decide whether particular coefficients are statistically-significantly different from zero. Pretty much any piece of statistical software, R very much included, reports the results of these tests automatically. It is far too common to seriously over-interpret those results, for a variety of reasons.

Begin with exactly what hypothesis is being tested when R (or whatever) runs those $t$-tests. Say, without loss of generality, that there are $p$ predictor variables, $\vec{X} = (X_1, \ldots X_p)$, and that we are testing the coefficient on $X_p$. Then the null hypothesis is not just "$\beta_p = 0$", but "$\beta_p = 0$ in a linear, Gaussian-noise model which also includes $X_1, \ldots X_{p-1}$, and nothing else". The alternative hypothesis is not just "$\beta_p \neq 0$", but "$\beta_p \neq 0$ in a linear, Gaussian-noise model which also includes $X_1, \ldots X_{p-1}$, but nothing else". The optimal linear coefficient on $X_p$ will depend not just on the relationship between $X_p$ and the response $Y$, but also on which other variables are included in the model. The test checks whether adding $X_p$ really improves predictions more than would be expected, under all these assumptions, if one is already using all the other variables, and only those other variables. It does not, cannot, test whether $X_p$ is important in any absolute sense.

Even if you are willing to say "Yes, all I really want to know about this variable is whether adding it to the model really helps me predict in a linear approximation", remember that the question which a $t$-test answers is whether adding that variable will help at all. Of course, as you know from your regression class, and as we'll see in more detail in Chapter 3, expanding the model never hurts its performance on the training data. The point of the $t$-test is to gauge whether the improvement in prediction is small enough to be due to chance, or so large, compared to what noise could produce, that one could confidently say the variable adds some predictive ability. This has several implications which are insufficiently appreciated among users.

In the first place, tests on individual coefficients can seem to contradict tests on groups of coefficients. Adding multiple variables to the model could significantly improve the fit (as checked by, say, a partial $F$ test), even if none of the coefficients is significant on its own. In fact, every single coefficient in the model could be insignificant, while the model as a whole is highly significant (i.e., better than a flat line).

In the second place, it's worth thinking about which variables will show up as statistically significant. Remember that the $t$-statistic is $\hat{\beta}_i / \mathrm{se}(\hat{\beta}_i)$, the ratio of the estimated coefficient to its standard error. We saw above that $V[\hat{\beta} \mid \mathbf{X} = \mathbf{x}] = \frac{\sigma^2}{n} \left( \frac{1}{n} \mathbf{x}^T \mathbf{x} \right)^{-1} \to n^{-1} \sigma^2 v^{-1}$. This means that the standard errors will shrink as the sample size grows, so more and more variables will become significant as we

get more data — but how much data we collect is irrelevant to how the process we're studying actually works. Moreover, at a fixed sample size, the coefficients with smaller standard errors will tend to be the ones whose variables have more variance, and whose variables are less correlated with the other predictors. High input variance and low correlation help us estimate the coefficient precisely, but, again, they have nothing to do with whether the input variable actually influences the response a lot.

To sum up, it is never the case that statistical significance is the same as scientific, real-world significance. The most important variables are not those with the largest-magnitude $t$ statistics or smallest $p$-values. Statistical significance is always about what "signals" can be picked out clearly from background noise.^10 In the case of linear regression coefficients, statistical significance runs together the size of the coefficients, how bad the linear regression model is, the sample size, the variance in the input variable, and the correlation of that variable with all the others.

Of course, even the limited "does it help linear predictions enough to bother with?" utility of the usual $t$-test (and $F$-test) calculations goes away if the standard distributional assumptions do not hold, so that the calculated $p$-values are just wrong. One can sometimes get away with using bootstrapping (Chapter 6) to get accurate $p$-values for standard tests under non-standard conditions.

2.4 Linear Regression Is Not the Philosopher's Stone

The philosopher's stone, remember, was supposed to be able to transmute base metals (e.g., lead) into the perfect metal, gold (Eliade, 1971). Many people treat linear regression as though it had a similar ability to transmute a correlation matrix into a scientific theory. In particular, people often argue that:

1. because a variable has a significant regression coefficient, it must influence the response;
2. because a variable has an insignificant regression coefficient, it must not influence the response;
3. if the input variables change, we can predict how much the response will change by plugging in to the regression.

All of this is wrong, or at best right only under very particular circumstances. We have already seen examples where influential variables have regression coefficients of zero. We have also seen examples of situations where a variable with no influence has a non-zero coefficient (e.g., because it is correlated with an omitted variable which does have influence). If there are no nonlinearities and if there are no omitted influential variables and if the noise terms are always independent of the predictor variables, are we good?

^10 In retrospect, it might have been clearer to say "statistically detectable" rather than "statistically significant".

No. Remember from Equation 2.5 that the optimal regression coefficients depend on both the marginal distribution of the predictors and the joint distribution (covariances) of the response and the predictors. There is no reason whatsoever to suppose that if we change the system, this will leave the conditional distribution of the response alone.

A simple example may drive the point home. Suppose we surveyed all the cars in Pittsburgh, recording the maximum speed they reach over a week, and how often they are waxed and polished. I don't think anyone doubts that there will be a positive correlation here, and in fact that there will be a positive regression coefficient, even if we add in many other variables as predictors. Let us even postulate that the relationship is linear (perhaps after a suitable transformation). Would anyone believe that polishing cars will make them go faster? Manifestly not. But this is exactly how people interpret regressions in all kinds of applied fields — instead of saying polishing makes cars go faster, it might be saying that receiving targeted ads makes customers buy more, or that consuming dairy foods makes diabetes progress faster, or . . . . Those claims might be true, but the regressions could easily come out the same way were the claims false. Hence, the regression results provide little or no evidence for the claims.

Similar remarks apply to the idea of using regression to "control for" extra variables. If we are interested in the relationship between one predictor, or a few predictors, and the response, it is common to add a bunch of other variables to the regression, to check both whether the apparent relationship might be due to correlations with something else, and to "control for" those other variables. The regression coefficient is interpreted as how much the response would change, on average, if the predictor variable were increased by one unit, "holding everything else constant". There is a very particular sense in which this is true: it's a prediction about the difference in expected responses (conditional on the given values for the other predictors), assuming that the form of the regression model is right, and that observations are randomly drawn from the same population we used to fit the regression.

In a word, what regression does is probabilistic prediction. It says what will happen if we keep drawing from the same population, but select a sub-set of the observations, namely those with given values of the regressors. A causal or counter-factual prediction would say what would happen if we (or Someone) made those variables take those values. Sometimes there's no difference between selection and intervention, in which case regression works as a tool for causal inference;^11 but in general there is. Probabilistic prediction is a worthwhile endeavor, but it's important to be clear that this is what regression does. There are techniques for doing causal prediction, which we will explore in Part III.

Every time someone thoughtlessly uses regression for causal inference, an angel not only loses its wings, but is cast out of Heaven and falls in extremest agony into the everlasting fire.

^11 In particular, if our model was estimated from data where Someone assigned values of the predictor variables in a way which breaks possible dependencies with omitted variables and noise — either by randomization or by experimental control — then regression can, in fact, work for causal inference.

2.5 Further Reading

If you would like to read a lot more — about 400 pages more — about linear regression from this perspective, see The Truth About Linear Regression, at http://www.stat.cmu.edu/~cshalizi/TALR/. That manuscript began as class notes for the class before this one, and has some overlap.

There are many excellent textbooks on linear regression. Among them, I would mention Weisberg (1985) for general statistical good sense, along with Faraway (2004) for R practicalities, and Hastie et al. (2009) for emphasizing connections to more advanced methods. Berk (2004) omits the details those books cover, but is superb on the big picture, and especially on what must be assumed in order to do certain things with linear regression and what cannot be done under any assumption.

For some of the story of how the usual probabilistic assumptions came to have that status, see, e.g., Lehmann (2008). On the severe issues which arise for the usual inferential formulas when the model is incorrect, see Buja et al. (2014).

Linear regression is a special case of both additive models (Chapter 8), and of locally linear models (§10.5). In most practical situations, additive models are a better idea than linear ones.

Historical notes  Because linear regression is such a big part of statistical practice, its history has been extensively treated in general histories of statistics, such as Stigler (1986) and Porter (1986). I would particularly recommend Klein (1997) for a careful account of how regression, on its face a method for doing comparisons at one time across a population, came to be used to study causality and dynamics. The paper by Lehmann (2008) mentioned earlier is also informative.

Exercises

2.1  1. Write the expected squared error of a linear predictor with slopes b and intercept b_0 as a function of those coefficients.
     2. Find the derivatives of the expected squared error with respect to all the coefficients.
     3. Show that when we set all the derivatives to zero, the solutions are Eq. 2.5 and 2.6.

2.2  Show that the expected error of the optimal linear predictor, E[Y − X · β], is zero.

2.3  Convince yourself that if the real regression function is linear, β does not depend on the marginal distribution of X. You may want to start with the case of one predictor variable.

2.4  Run the code from Figure 2.5. Then replicate the plots in Figure 2.6. Which kind of transformation is superior for the model where Y | X ∼ N(√X, 1)?

3 Evaluating Statistical Models: Error and Inference

3.1 What Are Statistical Models For? Summaries, Forecasts, Simulators

There are (at least) three ways we can use statistical models in data analysis: as summaries of the data, as predictors, and as simulators.

The least demanding use of a model is to summarize the data — to use it for data reduction, or compression. Just as the sample mean or sample quantiles can be descriptive statistics, recording some features of the data and saying nothing about a population or a generative process, we could use estimates of a model's parameters as descriptive summaries. Rather than remembering all the points on a scatter-plot, say, we'd just remember what the OLS regression surface was.

It's hard to be wrong about a summary, unless we just make a mistake. (It may not be helpful for us later, but that's different.) When we say "the slope which minimized the sum of squares was 4.02", we make no claims about anything but the training data. That statement relies on no assumptions, beyond our calculating correctly. But it also asserts nothing about the rest of the world. As soon as we try to connect our training data to anything else, we start relying on assumptions, and we run the risk of being wrong.

Probably the most common connection to want to make is to say what other data will look like — to make predictions. In a statistical model, with random variables, we do not anticipate that our predictions will ever be exactly right, but we also anticipate that our mistakes will show stable probabilistic patterns. We can evaluate predictions based on those patterns of error — how big is our typical mistake? are we biased in a particular direction? do we make a lot of little errors or a few huge ones?

Statistical inference about model parameters — estimation and hypothesis testing — can be seen as a kind of prediction, extrapolating from what we saw in a small piece of data to what we would see in the whole population, or whole process. When we estimate the regression coefficient b̂ = 4.02, that involves predicting new values of the dependent variable, but also predicting that if we repeated the experiment and re-estimated b̂, we'd get a value close to 4.02.

Using a model to summarize old data, or to predict new data, doesn't commit us to assuming that the model describes the process which generates the data. But we often want to do that, because we want to interpret parts of the model as aspects of the real world.

We think that in neighborhoods where people have more money, they spend more on houses — perhaps each extra $1000 in income translates into an extra $4020 in house prices. Used this way, statistical models become stories about how the data were generated. If they are accurate, we should be able to use them to simulate that process, to step through it and produce something that looks, probabilistically, just like the actual data. This is often what people have in mind when they talk about scientific models, rather than just statistical ones.

An example: if you want to predict where in the night sky the planets will be, you can actually do very well with a model where the Earth is at the center of the universe, and the Sun and everything else revolve around it. You can even estimate, from data, how fast Mars (for example) goes around the Earth, or where, in this model, it should be tonight. But, since the Earth is not at the center of the solar system, those parameters don't actually refer to anything in reality. They are just mathematical fictions. On the other hand, we can also predict where the planets will appear in the sky using models where all the planets orbit the Sun, and the parameters of the orbit of Mars in that model do refer to reality.[1]

This chapter focuses on evaluating predictions, for three reasons. First, often we just want prediction. Second, if a model can't even predict well, it's hard to see how it could be right scientifically. Third, often the best way of checking a scientific model is to turn some of its implications into statistical predictions.

3.2 Errors, In and Out of Sample

With any predictive model, we can gauge how well it works by looking at its errors. We want these to be small; if they can't be small all the time, we'd like them to be small on average. We may also want them to be patternless or unsystematic (because if there was a pattern to them, why not adjust for that, and make smaller mistakes). We'll come back to patterns in errors later, when we look at specification testing (Chapter 9). For now, we'll concentrate on the size of the errors.

To be a little more mathematical, we have a data set with points z_n = z_1, z_2, ..., z_n. (For regression problems, think of each data point as the pair of input and output values, so z_i = (x_i, y_i), with x_i possibly a vector.) We also have various possible models, each with different parameter settings, conventionally written θ. For regression, θ tells us which regression function to use, so m_θ(x) or m(x; θ) is the prediction we make at point x with parameters set to θ. Finally, we have a loss function L which tells us how big the error is when we use a certain θ on a certain data point, L(z, θ). For mean-squared error, this would just be

    L(z, θ) = (y − m_θ(x))²    (3.1)

[1] We can be pretty sure of this, because we use our parameter estimates to send our robots to Mars, and they get there.

But we could also use the mean absolute error

    L(z, θ) = |y − m_θ(x)|    (3.2)

or many other loss functions. Sometimes we will actually be able to measure how costly our mistakes are, in dollars or harm to patients. If we had a model which gave us a distribution for the data, then p_θ(z) would be a probability density at z, and a typical loss function would be the negative log-likelihood, −log p_θ(z). No matter what the loss function is, I'll abbreviate the sample average of the loss over the whole data set by L(z_n, θ).

What we would like, ideally, is a predictive model which has zero error on future data. We basically never achieve this:

• The world just really is a noisy and stochastic place, and this means even the true, ideal model has non-zero error.[2] This corresponds to the first, σ²_x, term in the bias-variance decomposition, Eq. 1.28 from Chapter 1.
• Our models are usually more or less mis-specified, or, in plain words, wrong. We hardly ever get the functional form of the regression, the distribution of the noise, the form of the causal dependence between two factors, etc., exactly right.[3] This is the origin of the bias term in the bias-variance decomposition. Of course we can get any of the details in the model specification more or less wrong, and we'd prefer to be less wrong.
• Our models are never perfectly estimated. Even if our data come from a perfect IID source, we only ever have a finite sample, and so our parameter estimates are (almost!) never quite the true, infinite-limit values. This is the origin of the variance term in the bias-variance decomposition. But as we get more and more data, the sample should become more and more representative of the whole process, and estimates should converge too.

So, because our models are flawed, we have limited data and the world is stochastic, we cannot expect even the best model to have zero error. Instead, we would like to minimize the expected error, or risk, or generalization error, on new data. What we would like to do is to minimize the risk or expected loss

    E[L(Z, θ)] = ∫ L(z, θ) p(z) dz    (3.3)

To do this, however, we'd have to be able to calculate that expectation. Doing that would mean knowing the distribution of Z — the joint distribution of X and Y, for the regression problem. Since we don't know the true joint distribution, we need to approximate it somehow.

A natural approximation is to use our training data z_n.

[2] This is so even if you believe in some kind of ultimate determinism, because the variables we plug in to our predictive models are not complete descriptions of the physical state of the universe, but rather immensely coarser, and this coarseness shows up as randomness.
[3] Except maybe in fundamental physics, and even there our predictions are about our fundamental theories in the context of experimental set-ups, which we never model in complete detail.

For each possible model θ, we could calculate the sample mean of the error on the data, L(z_n, θ), called the in-sample loss or the empirical risk. The simplest strategy for estimation is then to pick the model, the value of θ, which minimizes the in-sample loss. This strategy is imaginatively called empirical risk minimization. Formally,

    θ̂_n ≡ argmin_{θ ∈ Θ} L(z_n, θ)    (3.4)

This means picking the regression which minimizes the sum of squared errors, or the density with the highest likelihood.[4] This is what you've usually done in statistics courses so far, and it's very natural, but it does have some issues, notably optimism and over-fitting.

The problem of optimism comes from the fact that our training data isn't perfectly representative. The in-sample loss is a sample average. By the law of large numbers, then, we anticipate that, for each θ,

    L(z_n, θ) → E[L(Z, θ)]    (3.5)

as n → ∞. This means that, with enough data, the in-sample error is a good approximation to the generalization error of any given model θ. (Big samples are representative of the underlying population or process.) But this does not mean that the in-sample performance of θ̂ tells us how well it will generalize, because we purposely picked it to match the training data z_n. To see this, notice that the in-sample loss equals the risk plus sampling noise:

    L(z_n, θ) = E[L(Z, θ)] + η_n(θ)    (3.6)

Here η_n(θ) is a random term which has mean zero, and represents the effects of having only a finite quantity of data, of size n, rather than the complete probability distribution. (I write it η_n(θ) as a reminder that different values of θ are going to be affected differently by the same sampling fluctuations.) The problem, then, is that the model which minimizes the in-sample loss could be one with good generalization performance (E[L(Z, θ)] is small), or it could be one which got very lucky (η_n(θ) was large and negative):

    θ̂_n = argmin_{θ ∈ Θ} ( E[L(Z, θ)] + η_n(θ) )    (3.7)

We only want to minimize E[L(Z, θ)], but we can't separate it from η_n(θ), so we're almost surely going to end up picking a θ̂_n which was more or less lucky (η_n < 0) as well as good (E[L(Z, θ)] small). This is the reason why picking the model which best fits the data tends to exaggerate how well it will do in the future (Figure 3.1).

Again, by the law of large numbers η_n(θ) → 0 for each θ, but now we need to worry about how fast it's going to zero, and whether that rate depends on θ. Suppose we knew that min_θ η_n(θ) → 0, or max_θ |η_n(θ)| → 0. Then it would follow that η_n(θ̂_n) → 0, and the over-optimism in using the in-sample error to approximate the generalization error would at least be shrinking.

[4] Remember, maximizing the likelihood is the same as maximizing the log-likelihood, because log is an increasing function. Therefore maximizing the likelihood is the same as minimizing the negative log-likelihood.

[Plot for Figure 3.1: "MSE risk" (vertical axis) against "regression slope" (horizontal axis); produced by the following code.]

n <- 20
theta <- 5
x <- runif(n)
y <- x * theta + rnorm(n)
empirical.risk <- function(b) {
    mean((y - b * x)^2)
}
true.risk <- function(b) {
    1 + (theta - b)^2 * (0.5^2 + 1/12)
}
curve(Vectorize(empirical.risk)(x), from = 0, to = 2 * theta, xlab = "regression slope",
    ylab = "MSE risk")
curve(true.risk, add = TRUE, col = "grey")

Figure 3.1  Empirical and generalization risk for regression through the origin, Y = θX + ε, ε ∼ N(0, 1), with true θ = 5, and X ∼ Unif(0, 1). Black: MSE on a particular sample (n = 20) as a function of slope, minimized at θ̂ = 4.37. Grey: true or generalization risk (Exercise 3.2). The gap between the curves is the text's η_n(θ).

If we knew how fast max_θ |η_n(θ)| was going to zero, we could even say something about how much bigger the true risk was likely to be. A lot of more advanced statistics and machine learning theory is thus about uniform laws of large numbers (showing max_θ |η_n(θ)| → 0) and rates of convergence.

Learning theory is a beautiful, deep, and practically important subject, but also a subtle and involved one. (See §3.6 for references.) To stick closer to analyzing real data, and to not turn this into an advanced probability class, I will only talk about some more-or-less heuristic methods, which are good enough for many purposes.

3.3 Over-Fitting and Model Selection

The big problem with using the in-sample error is related to over-optimism, but at once trickier to grasp and more important. This is the problem of over-fitting. To illustrate it, let's start with Figure 3.2. This has twenty X values from a Gaussian distribution, and Y = 7X² − 0.5X + ε, ε ∼ N(0, 1). That is, the true regression curve is a parabola, with additive and independent Gaussian noise. Let's try fitting this — but pretend that we didn't know that the curve was a parabola. We'll try fitting polynomials of different degrees in x — degree 0 (a flat line), degree 1 (a linear regression), degree 2 (quadratic regression), up through degree 9.

Figure 3.3 shows the data with the polynomial curves, and Figure 3.4 shows the in-sample mean squared error as a function of the degree of the polynomial. Notice that the in-sample error goes down as the degree of the polynomial increases; it has to. Every polynomial of degree p can also be written as a polynomial of degree p + 1 (with a zero coefficient for x^(p+1)), so going to a higher-degree model can only reduce the in-sample error. Quite generally, in fact, as one uses more and more complex and flexible models, the in-sample error will get smaller and smaller.[5]

Things are quite different if we turn to the generalization error. In principle, I could calculate that for any of the models, since I know the true distribution, but it would involve calculating things like E[X^18], which won't be very illuminating. Instead, I will just draw a lot more data from the same source, twenty thousand data points in fact, and use the error of the old models on the new data as their generalization error.[6] The results are in Figure 3.5.

What is happening here is that the higher-degree polynomials — beyond degree 2 — are not just a little optimistic about how well they fit, they are wildly over-optimistic.

[5] In fact, since there are only 20 data points, they could all be fit exactly if the degree of the polynomials went up to 19. (Remember that any two points define a line, any three points a parabola, etc. — p + 1 points define a polynomial of degree p which passes through them.)
[6] This works, yet again, because of the law of large numbers. In Chapters 5 and especially 6, we will see much more about replacing complicated probabilistic calculations with simple simulations, an idea sometimes called the "Monte Carlo method".

[Plot for Figure 3.2: scatter-plot of y against x with the true regression parabola; produced by the following code.]

x = rnorm(20)
y = 7 * x^2 - 0.5 * x + rnorm(20)
plot(x, y)
curve(7 * x^2 - 0.5 * x, col = "grey", add = TRUE)

Figure 3.2  Scatter-plot showing sample data and the true, quadratic regression curve (grey parabola).

The models which seemed to do notably better than a quadratic actually do much, much worse. If we picked a polynomial regression model based on in-sample fit, we'd choose the highest-degree polynomial available, and suffer for it.

In this example, the more complicated models — the higher-degree polynomials, with more terms and parameters — were not actually fitting the generalizable features of the data.

[Plot for Figure 3.3: the training data with ten fitted polynomial regression lines; produced by the following code.]

plot(x, y)
poly.formulae <- c("y~1", paste("y ~ poly(x,", 1:9, ")", sep = ""))
poly.formulae <- sapply(poly.formulae, as.formula)
df.plot <- data.frame(x = seq(min(x), max(x), length.out = 200))
fitted.models <- list(length = length(poly.formulae))
for (model_index in 1:length(poly.formulae)) {
    fm <- lm(formula = poly.formulae[[model_index]])
    lines(df.plot$x, predict(fm, newdata = df.plot), lty = model_index)
    fitted.models[[model_index]] <- fm
}

Figure 3.3  Twenty training data points (dots), and ten different fitted regression lines (polynomials of degree 0 to 9, indicated by different line types). R notes: The poly command constructs orthogonal (uncorrelated) polynomials of the specified degree from its first argument; regressing on them is conceptually equivalent to regressing on 1, x, x², ..., x^degree, but more numerically stable. (See ?poly.) This builds a vector of model formulae and then fits each one in turn, storing the fitted models in a new list.

[Plot for Figure 3.4: in-sample mean squared error against polynomial degree; produced by the following code.]

mse.q <- sapply(fitted.models, function(mdl) {
    mean(residuals(mdl)^2)
})
plot(0:9, mse.q, type = "b", xlab = "polynomial degree", ylab = "mean squared error",
    log = "y")

Figure 3.4  Empirical MSE vs. degree of polynomial for the data from the previous figure. Note the logarithmic scale for the vertical axis.

Instead, they were fitting the sampling noise, the accidents which don't repeat. That is, the more complicated models over-fit the data. In terms of our earlier notation, η is bigger for the more flexible models. The model which does best here is the quadratic, because the true regression function happens to be of that form. The more powerful, more flexible, higher-degree polynomials were able to get closer to the training data, but that just meant matching the noise better.

In terms of the bias-variance decomposition, the bias shrinks with the model degree, but the variance of estimation grows.

Notice that the models of degrees 0 and 1 also do worse than the quadratic model — their problem is not over-fitting but under-fitting; they would do better if they were more flexible. Plots of generalization error like this usually have a minimum. If we have a choice of models — if we need to do model selection — we would like to find the minimum. Even if we do not have a choice of models, we might like to know how big the gap between our in-sample error and our generalization error is likely to be.

There is nothing special about polynomials here. All of the same lessons apply to variable selection in linear regression, to k-nearest neighbors (where we need to choose k), to kernel regression (where we need to choose the bandwidth), and to other methods we'll see later. In every case, there is going to be a minimum for the generalization error curve, which we'd like to find.

(A minimum with respect to what, though? In Figure 3.5, the horizontal axis is the model degree, which here is the number of parameters [minus one for the intercept]. More generally, however, what we care about is some measure of how complex the model space is, which is not necessarily the same thing as the number of parameters. What's more relevant is how flexible the class of models is, how many different functions it can approximate. Linear polynomials can approximate a smaller set of functions than quadratics can, so the latter are more complex, or have higher capacity. More advanced learning theory has a number of ways of quantifying this, but the details get pretty arcane, and we will just use the concept of complexity or capacity informally.)

3.4 Cross-Validation

The most straightforward way to find the generalization error would be to do what I did above, and to use fresh, independent data from the same source — a testing or validation data-set. Call this z'_m, as opposed to our training data z_n. We fit our model to z_n, and get θ̂_n. The loss of this on the validation data is

    E[L(Z, θ̂_n)] + η'_m(θ̂_n)    (3.8)

where now the sampling noise on the validation set, η'_m, is independent of θ̂_n. So this gives us an unbiased estimate of the generalization error, and, if m is large, a precise one. If we need to select one model from among many, we can pick the one which does best on the validation data, with confidence that we are not just over-fitting.

The problem with this approach is that we absolutely, positively, cannot use any of the validation data in estimating the model. Since collecting data is expensive — it takes time, effort, and usually money, organization, and skill — this means getting a validation data set is expensive, and we often won't have that luxury.

CAPA <- na.omit(read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/13/hw/01/calif_penn_2011.csv"))
half_A <- sample(1:nrow(CAPA), size = nrow(CAPA)/2, replace = FALSE)
half_B <- setdiff(1:nrow(CAPA), half_A)
small_formula = "Median_house_value ~ Median_household_income"
large_formula = "Median_house_value ~ Median_household_income + Median_rooms"
small_formula <- as.formula(small_formula)
large_formula <- as.formula(large_formula)
msmall <- lm(small_formula, data = CAPA, subset = half_A)
mlarge <- lm(large_formula, data = CAPA, subset = half_A)
in.sample.mse <- function(model) {
    mean(residuals(model)^2)
}
new.sample.mse <- function(model, half) {
    test <- CAPA[half, ]
    predictions <- predict(model, newdata = test)
    return(mean((test$Median_house_value - predictions)^2))
}

Code Example 2: Code used to generate the numbers in Figure 3.7.

3.4.1 Data Splitting

The next logical step, however, is to realize that we don't strictly need a separate validation set. We can just take our data and split it ourselves into training and testing sets. If we divide the data into two parts at random, we ensure that they have (as much as possible) the same distribution, and that they are independent of each other. Then we can act just as though we had a real validation set. Fitting to one part of the data, and evaluating on the other, gives us an unbiased estimate of generalization error. Of course it doesn't matter which half of the data is used to train and which half is used to test. Figure 3.7 illustrates the idea with a bit of the data and linear models from §A.13, and Code Example 2 shows the code used to make Figure 3.7.

3.4.2 k-Fold Cross-Validation (CV)

The problem with data splitting is that, while it's an unbiased estimate of the risk, it is often a very noisy one. If we split the data evenly, then the test set has n/2 data points — we've cut in half the number of sample points we're averaging over. It would be nice if we could reduce that noise somewhat, especially if we are going to use this for model selection.

One solution to this, which is pretty much the industry standard, is what's called k-fold cross-validation. Pick a small integer k, usually 5 or 10, and divide the data at random into k equally-sized subsets. (The subsets are often called "folds".) Take the first subset and make it the test set; fit the models to the rest of the data, and evaluate their predictions on the test set. Now make the second subset the test set and the rest the training set. Repeat until each subset has been the test set. At the end, average the performance across test sets. This is the cross-validated estimate of generalization error for each model.

cv.lm <- function(data, formulae, nfolds = 5) {
    data <- na.omit(data)
    formulae <- sapply(formulae, as.formula)
    n <- nrow(data)
    fold.labels <- sample(rep(1:nfolds, length.out = n))
    mses <- matrix(NA, nrow = nfolds, ncol = length(formulae))
    colnames <- as.character(formulae)
    for (fold in 1:nfolds) {
        test.rows <- which(fold.labels == fold)
        train <- data[-test.rows, ]
        test <- data[test.rows, ]
        for (form in 1:length(formulae)) {
            current.model <- lm(formula = formulae[[form]], data = train)
            predictions <- predict(current.model, newdata = test)
            test.responses <- eval(formulae[[form]][[2]], envir = test)
            test.errors <- test.responses - predictions
            mses[fold, form] <- mean(test.errors^2)
        }
    }
    return(colMeans(mses))
}

Code Example 3: Function to do k-fold cross-validation on linear models, given as a vector (or list) of model formulae. Note that this only returns the CV MSE, not the parameter estimates on each fold.

Model selection then picks the model with the smallest estimated risk.[7] Code Example 3 performs k-fold cross-validation for linear models specified by formulae.

The reason cross-validation works is that it uses the existing data to simulate the process of generalizing to new data. If the full sample is large, then even the smaller portion of it in the testing data is, with high probability, fairly representative of the data-generating process. Randomly dividing the data into training and test sets makes it very unlikely that the division is rigged to favor any one model class, over and above what it would do on real new data. Of course the original data set is never perfectly representative of the full data, and a smaller testing set is even less representative, so this isn't ideal, but the approximation is often quite good. k-fold CV is fairly good at getting the relative order of different models right, that is, at controlling over-fitting.[8] Figure 3.8 demonstrates these points for the polynomial fits we considered earlier (in Figures 3.3–3.5).

Cross-validation is probably the most widely-used method for model selection, and for picking control settings, in modern statistics. There are circumstances where it can fail — especially if you give it too many models to pick among — but it's the first thought of seasoned practitioners, and it should be your first thought, too. The assignments to come will make you very familiar with it.

[7] A closely related procedure, sometimes also called "k-fold CV", is to pick 1/k of the data points at random to be the test set (using the rest as a training set), and then pick an independent 1/k of the data points as the test set, etc., repeating k times and averaging. The differences are subtle, but what's described in the main text makes sure that each point is used in the test set just once.
[8] The cross-validation score for the selected model still tends to be somewhat over-optimistic, because it's still picking the luckiest model — though the influence of luck is much attenuated. Tibshirani and Tibshirani (2009) provides a simple correction.
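As a usage sketch, cv.lm can be applied directly to the two housing-price formulas defined in Code Example 2 (the exact MSEs will depend on the random fold assignment, so the numbers below are not reproduced here):

# Sketch: 5-fold CV comparison of the two housing-price models from Code Example 2
cv.lm(CAPA, list(small_formula, large_formula))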

3.4.3 Leave-one-out Cross-Validation

Suppose we did k-fold cross-validation, but with k = n. Our testing sets would then consist of single points, and each point would be used in testing once. This is called leave-one-out cross-validation. It actually came before k-fold cross-validation, and has three advantages. First, because it estimates the performance of a model trained with n − 1 data points, it's less biased as an estimator of the performance of a model trained with n data points than is k-fold cross-validation, which uses n(k−1)/k data points. Second, leave-one-out doesn't require any random number generation, or keeping track of which data point is in which subset. Third, and more importantly, because we are only testing on one data point, it's often possible to find what the prediction on the left-out point would be by doing calculations on a model fit to the whole data. (See below.) This means that we only have to fit each model once, rather than n times, which can be a big savings of computing time.

The drawback to leave-one-out CV is subtle but often decisive. Since each training set has n − 1 points, any two training sets must share n − 2 points. The models fit to those training sets tend to be strongly correlated with each other. Even though we are averaging n out-of-sample forecasts, those are correlated forecasts, so we are not really averaging away all that much noise. With k-fold CV, on the other hand, the fraction of data shared between any two training sets is just (k−2)/(k−1), not (n−2)/(n−1), so even though the number of terms being averaged is smaller, they are less correlated.

There are situations where this issue doesn't really matter, or where it's overwhelmed by leave-one-out's advantages in speed and simplicity, so there is certainly still a place for it, but one subordinate to k-fold CV.[9]

A Short-cut for Linear Smoothers

Suppose the model m is a linear smoother (§1.5). For each of the data points i, then, the predicted value is a linear combination of the observed values of y, m(x_i) = Σ_j ŵ(x_i, x_j) y_j (Eq. 1.53). As in §1.5.3, define the "influence", "smoothing" or "hat" matrix ŵ by ŵ_ij = ŵ(x_i, x_j). What happens when we hold back data point i, and then make a prediction at x_i?

[9] At this point, it may be appropriate to say a few words about the Akaike information criterion, or AIC. AIC also tries to estimate how well a model will generalize to new data. One can show that, under standard assumptions, as the sample size gets large, leave-one-out CV actually gives the same estimate as AIC (Claeskens and Hjort, 2008, §2.9). However, there do not seem to be any situations where AIC works where leave-one-out CV does not work at least as well. So AIC should really be understood as a very fast, but often very crude, approximation to the more accurate cross-validation.

Well, the observed response at i can't contribute to the prediction, but otherwise the linear smoother should work as before, so

    m^(−i)(x_i) = ( (ŵy)_i − ŵ_ii y_i ) / (1 − ŵ_ii)    (3.9)

The numerator just removes the contribution to m(x_i) that came from y_i, and the denominator just re-normalizes the weights in the smoother. Now a little algebra says that

    y_i − m^(−i)(x_i) = ( y_i − m(x_i) ) / (1 − ŵ_ii)    (3.10)

The quantity on the left of that equation is what we want to square and average to get the leave-one-out CV score, but everything on the right can be calculated from the fit we did to the whole data. The leave-one-out CV score is therefore

    (1/n) Σ_{i=1}^n ( (y_i − m(x_i)) / (1 − ŵ_ii) )²    (3.11)

Thus, if we restrict ourselves to leave-one-out and to linear smoothers, we can calculate the CV score with just one estimation on the whole data, rather than n re-estimates.

An even faster approximation than this is what's called "generalized" cross-validation, which is just the in-sample MSE divided by (1 − n^(−1) tr ŵ)². That is, rather than dividing each term in Eq. 3.11 by a unique factor that depends on its own diagonal entry in the hat matrix, we use the average of all the diagonal entries, n^(−1) tr ŵ. (Recall from §1.5.3.2 that tr ŵ is the number of effective degrees of freedom for a linear smoother.) In addition to speed, this tends to reduce the influence of points with high values of ŵ_ii, which may or may not be desirable.
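For a model fit with lm (or any linear smoother for which the hat values are available), Eq. 3.11 and the generalized-CV approximation take only a couple of lines; the helper names below are mine, a minimal sketch rather than a general-purpose implementation, and the commented-out call shows how they would apply to the polynomial fits of Figure 3.3.

# Sketch: leave-one-out CV (Eq. 3.11) and generalized CV for an lm() fit;
# hatvalues() extracts the diagonal entries of the hat matrix w-hat
loocv.lm <- function(mdl) {
    mean((residuals(mdl)/(1 - hatvalues(mdl)))^2)
}
gcv.lm <- function(mdl) {
    n <- length(residuals(mdl))
    mean(residuals(mdl)^2)/(1 - sum(hatvalues(mdl))/n)^2
}
# e.g., for the polynomial models of Figure 3.3:
# sapply(fitted.models, loocv.lm)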

3.5 Warnings

Some caveats are in order.

1. All of the model-selection methods I have described, and almost all others in the literature, aim at getting models which will generalize well to new data, if it follows the same distribution as old data. Generalizing well even when distributions change is a much harder and much less well-understood problem (Quiñonero-Candela et al., 2009). It is particularly troublesome for a lot of applications involving large numbers of human beings, because society keeps changing all the time — variables vary by definition, but the relationships between variables also change. (That's history.)

2. All of the standard theory of statistical inference you have learned so far presumes that you have a model which was fixed in advance of seeing the data. If you use the data to select the model, that theory becomes invalid, and it will no longer give you correct p-values for hypothesis tests, confidence sets for parameters, etc., etc. Typically, using the same data both to select a model and to do inference leads to too much confidence that the model is correct, significant, and estimated precisely.

3. All the model selection methods we have discussed aim at getting models which predict well. This is not necessarily the same as getting the true theory of the world. Presumably the true theory will also predict well, but the converse does not necessarily follow. We have seen (Fig. 1.3), and will see again (§9.2), examples of false but low-capacity models out-predicting correctly specified models at small n, because the former have such low variance of estimation.

The last two items — combining selection with inference, and parameter interpretation — deserve elaboration.

3.5.1 Inference after Selection

You have, by this point, learned a lot of inferential statistics — how to test various hypotheses, calculate p-values, find confidence regions, etc. Most likely, you have been taught procedures or calculations which all presume that the model you are working with is fixed in advance of seeing the data. But, of course, if you do model selection, the model you do inference within is not fixed in advance, but is actually a function of the data. What happens then?

This depends on whether you do inference with the same data used to select the model, or with another, independent data set. If it's the same data, then all of the inferential statistics become invalid — none of the calculations of probabilities on which they rest are right any more. Typically, if you select a model so that it fits the data well, what happens is that confidence regions become too small,[10] as do p-values for testing hypotheses about parameters. Nothing can be trusted as it stands.

The essential difficulty is this: Your data are random variables. Since you're doing model selection, making your model a function of the data, that means your model is random too. That means there is some extra randomness in your estimated parameters (and everything else), which isn't accounted for by formulas which assume a fixed model (Exercise 3.4). This is not just a problem with formal model-selection devices like cross-validation. If you do an initial, exploratory data analysis before deciding which model to use — and that's generally a good idea — you are, yourself, acting as a noisy, complicated model-selection device.

There are three main approaches to this issue of post-selection inference.

1. Ignore it. This can actually make sense if you don't really care about doing inference within your selected model, you just care about what model is selected. Otherwise, I can't recommend it.

2. Beat it with more statistical theory. There is, as I write, a lot of interest among statisticians in working out exactly what happens to sampling distributions under various combinations of models, model-selection methods, and assumptions about the true, data-generating process. Since this is an active area of research in statistical theory, I will pass it by, with some references in §3.6.

3. Evade it with an independent data set. Remember that if the events A and B are probabilistically independent, then Pr(A|B) = Pr(A). Now set A = "the confidence set we calculated from this new data covers the truth" and B = "the model selected from this old data was such-and-such". So long as the old and the new data are independent, it doesn't matter that the model was selected using data, rather than being fixed in advance.

[10] Or, if you prefer, the same confidence region really has a lower confidence level, a lower probability of containing or covering the truth, than you think it does.

The last approach is of course our old friend data splitting (§3.4.1). We divide the data into two parts, and we use one of them to select the model. We then re-estimate the selected model on the other part of the data, and only use that second part in calculating our inferential statistics. Experimentally, using part of the data to do selection, and then all of the data to do inference, does not work as well as a strict split (Faraway, 2016). Using equal amounts of data for selection and for inference is somewhat arbitrary, but, again, it's not clear that there's a much better division.

Of course, if you only use a portion of your data to calculate confidence regions, they will typically be larger than if you used all of the data. (Or, if you're running hypothesis tests, fewer coefficients will be significantly different from zero, etc.) This drawback is more apparent than real, since using all of your data to select a model and do inference gives you apparently-precise confidence regions which aren't actually valid.

The simple data-splitting approach to combining model selection and inference only works if the individual data points were independent to begin with. When we deal with dependent data, in Part IV, other approaches will be necessary.

3.5.2 Parameter Interpretation

In many situations, it is very natural to want to attach some substantive, real-world meaning to the parameters of our statistical model, or at least to some of them. I have mentioned examples above like astronomy, and it is easy to come up with many others from the natural sciences. This is also extremely common in the social sciences. It is fair to say that this is much less carefully attended to than it should be.

To take just one example, consider the paper "Luther and Suleyman" by Prof. Murat Iyigun (Iyigun, 2008). The major idea of the paper is to try to help explain why the Protestant Reformation was not wiped out during the European wars of religion (or alternately, why the Protestants did not crush all the Catholic powers), leading western Europe to have a mixture of religions, with profound consequences. Iyigun's contention is that the European Christians were so busy fighting the Ottoman Turks, or perhaps so afraid of what might happen if they did not, that conflicts among the Europeans were suppressed. To quote his abstract:

    at the turn of the sixteenth century, Ottoman conquests lowered the number of all newly initiated conflicts among the Europeans roughly by 25 percent, while they dampened all longer-running feuds by more than 15 percent. The Ottomans' military activities influenced the length of intra-European feuds too, with each Ottoman-European military engagement shortening the duration of intra-European conflicts by more than 50 percent.

To back this up, and provide those quantitative figures, Prof. Iyigun estimates linear regression models, of the form[11]

    Y_t = β_0 + β_1 X_t + β_2 Z_t + β_3 U_t + ε_t    (3.12)

where Y_t is "the number of violent conflicts initiated among or within continental European countries at time t",[12] X_t is "the number of conflicts in which the Ottoman Empire confronted European powers at time t", Z_t is "the count at time t of the newly initiated number of Ottoman conflicts with others and its own domestic civil discords", U_t is control variables reflecting things like the availability of harvests to feed armies, and ε_t is Gaussian noise.

The qualitative idea here, about the influence of the Ottoman Empire on the European wars of religion, has been suggested by quite a few historians before.[13] The point of this paper is to support this rigorously, and make it precise. That support and precision requires Eq. 3.12 to be an accurate depiction of at least part of the process which led European powers to fight wars of religion. Prof. Iyigun, after all, wants to be able to interpret a negative estimate of β_1 as saying that fighting off the Ottomans kept Christians from fighting each other. If Eq. 3.12 is inaccurate, if the model is badly mis-specified, however, β_1 becomes the best approximation to the truth within a systematically wrong model, and the support for claims like "Ottoman conquests lowered the number of all newly initiated conflicts among the Europeans roughly by 25 percent" drains away.

To back up the use of Eq. 3.12, Prof. Iyigun looks at a range of slightly different linear-model specifications (e.g., regress the number of intra-Christian conflicts in year t on the number of Ottoman attacks in year t − 1), and slightly different methods of estimating the parameters. What he does not do is look at the other implications of the model: that residuals should be (at least approximately) Gaussian, that they should be unpredictable from the regressor variables. He does not look at whether the relationships he thinks are linear really are linear (see Chapters 4, 8, and 9). He does not try to simulate his model and look at whether the patterns of European wars it produces resemble actual history (see Chapter 5). He does not try to check whether he has a model which really supports causal inference, though he has a causal question (see Part III).

I do not say any of this to denigrate Prof. Iyigun. His paper is actually much better than most quantitative work in the social sciences. This is reflected by the fact that it was published in the Quarterly Journal of Economics, one of the most prestigious, and rigorously-reviewed, journals in the field. The point is that by the end of this course, you will have the tools to do better.

[11] His Eq. 1 on pp. 1473; I have modified the notation to match mine.
[12] In one part of the paper; he uses other dependent variables elsewhere.
[13] See §1–2 of Iyigun (2008), and MacCulloch (2004, passim).
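The checks just described are not exotic. For any fitted linear model — call it fit, a hypothetical stand-in for an estimate of Eq. 3.12 — they can be sketched in a few lines of R:

# Sketch: basic specification checks for a fitted linear model `fit`
# (a hypothetical stand-in for an estimated version of Eq. 3.12)
# 1. Are the residuals approximately Gaussian?
qqnorm(residuals(fit)); qqline(residuals(fit))
# 2. Are the residuals predictable from the fitted values / regressors? (They shouldn't be.)
plot(fitted(fit), residuals(fit)); abline(h = 0, lty = 2)
# 3. Does the fitted model generate data that look like the real thing?
y.sim <- simulate(fit, nsim = 100)   # see Chapter 5 on simulation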

3.6 Further Reading

Data splitting and cross-validation go back in statistical practice for many decades, though often as a very informal tool. One of the first important papers on the subject was Stone (1974), which goes over the earlier history. Arlot and Celisse (2010) is a good recent review of cross-validation. Faraway (1992, 2016) reviews computational evidence that data splitting reduces the over-confidence that results from model selection even if one only wants to do prediction. Györfi et al. (2002, chs. 7–8) has important results on data splitting and cross-validation, though the proofs are much more advanced than this book.

Some comparatively easy starting points on statistical learning theory are Kearns and Vazirani (1994), Cristianini and Shawe-Taylor (2000) and Mohri et al. (2012). At a more advanced level, look at the tutorial papers by Bousquet et al. (2004); von Luxburg and Schölkopf (2008), or the textbooks by Vidyasagar (2003) and by Anthony and Bartlett (1999) (the latter is much more general than its title suggests), or read the book by Vapnik (2000) (one of the founders). Hastie et al. (2009), while invaluable, is much more oriented towards models and practical methods than towards learning theory.

On model selection in general, the best recent summary is the book by Claeskens and Hjort (2008); it is more theoretically demanding than this book, but includes many real-data examples. The literature on doing statistical inference after model selection by accounting for selection effects, rather than simple data splitting, is already large and rapidly growing. Taylor and Tibshirani (2015) is a comparatively readable introduction to the "selective inference" approach associated with those authors and their collaborators. Tibshirani et al. (2015) draws connections between this approach and the bootstrap (ch. 6). Berk et al. (2013) provides yet another approach to post-selection inference; nor is this an exhaustive list.

White (1994) is a thorough treatment of parameter estimation in models which may be mis-specified, and some general tests for mis-specification. It also briefly discusses the interpretation of parameters in mis-specified models. That topic deserves a more in-depth treatment, but I don't know of a really good one.

Exercises

3.1  Suppose that one of our model classes contains the true and correct model, but we also consider more complicated and flexible model classes. Does the bias-variance trade-off mean that we will over-shoot the true model, and always go for something more flexible, when we have enough data? (This would mean there was such a thing as too much data to be reliable.)

3.2  Derive the formula for the generalization risk in the situation depicted in Figure 3.1, as given by the true.risk function in the code for that figure. In particular, explain to yourself where the constants 0.5² and 1/12 come from.

3.3  "Optimism" and degrees of freedom. Suppose we get data of the form Y_i = μ(x_i) + ε_i, where the noise terms ε_i have mean zero, are uncorrelated, and all have variance σ². We use a linear smoother (§1.5) to estimate μ̂ from n such data points. The "optimism" of the estimate is

    E[ (1/n) Σ_{i=1}^n (μ̂(x_i) − Y'_i)² ] − E[ (1/n) Σ_{i=1}^n (μ̂(x_i) − Y_i)² ]    (3.13)

where Y'_i is an independent copy of Y_i. That is, the optimism is the difference between the in-sample MSE, and how well the model would predict on new data taken at exactly the same x_i values.

     1. Find a formula for the optimism in terms of σ², n, and the number of effective degrees of freedom (in the sense of §1.5.3).
     2. When (and why) does E[ (1/n) Σ_{i=1}^n (μ̂(x_i) − Y'_i)² ] differ from the risk?

3.4  The perils of post-selection inference, and data splitting to the rescue.[14] Generate a 1000 × 101 array, where all the entries are IID standard Gaussian variables. We'll call the first column the response variable Y, and the others the predictors X_1, ..., X_100. By design, there is no true relationship between the response and the predictors (but all the usual linear-Gaussian-modeling assumptions hold).
     1. Estimate the model Y = β_0 + β_1 X_1 + ··· + β_50 X_50 + ε. Extract the p-value for the F-test of the whole model. Repeat the simulation, estimation and testing 100 times, and plot the histogram of the p-values. What does it look like? What should it look like?
     2. Use the step function to select a linear model by forward stepwise selection. Extract the p-value for the F-test of the selected model. Repeat 100 times and plot the histogram of p-values. Explain what's going on.
     3. Again use step to select a model based on one random 1000 × 101 array. Now re-estimate the selected model on a new 1000 × 101 array, and extract the new p-value. Repeat 100 times, with new selection and inference sets each time, and plot the histogram of p-values.

[14] Inspired by Freedman (1983).

[Plot for Figure 3.5: mean squared error against polynomial degree, in-sample and on new data; produced by the following code.]

x.new = rnorm(20000)
y.new = 7 * x.new^2 - 0.5 * x.new + rnorm(20000)
gmse <- function(mdl) {
    mean((y.new - predict(mdl, data.frame(x = x.new)))^2)
}
gmse.q <- sapply(fitted.models, gmse)
plot(0:9, mse.q, type = "b", xlab = "polynomial degree", ylab = "mean squared error",
    log = "y", ylim = c(min(mse.q), max(gmse.q)))
lines(0:9, gmse.q, lty = 2, col = "blue")
points(0:9, gmse.q, pch = 24, col = "blue")

Figure 3.5  In-sample error (black dots) compared to generalization error (blue triangles). Note the logarithmic scale for the vertical axis.

[Plot for Figure 3.6: R² and adjusted R² against polynomial degree; produced by the following code.]

extract.rsqd <- function(mdl) {
    c(summary(mdl)$r.squared, summary(mdl)$adj.r.squared)
}
rsqd.q <- sapply(fitted.models, extract.rsqd)
plot(0:9, rsqd.q[1, ], type = "b", xlab = "polynomial degree", ylab = expression(R^2),
    ylim = c(0, 1))
lines(0:9, rsqd.q[2, ], type = "b", lty = "dashed")
legend("bottomright", legend = c(expression(R^2), expression(R[adj]^2)), lty = c("solid",
    "dashed"))

Figure 3.6  R² and adjusted R² for the polynomial fits, to reinforce §2.2.1.1's point that neither statistic is a useful measure of how well a model fits, or a good criterion for picking among models.

[Figure 3.7: the data-splitting example; the tables are reconstructed below.]

Selected rows of the full data:

        Median house value   Median household income   Median rooms
 2                  909600                    111667            6.0
 3                  748700                     66094            4.6
 4                  773600                     87306            5.0
 5                  579200                     62386            4.5
 11274              209500                     56667            6.0
 11275              253400                     71638            6.6

First half:

        Median house value   Median household income   Median rooms
 3                  748700                     66094            4.6
 4                  773600                     87306            5.0
 11275              253400                     71638            6.6

Second half:

        Median house value   Median household income   Median rooms
 2                  909600                    111667            6.0
 5                  579200                     62386            4.5
 11274              209500                     56667            6.0

                    RMSE(A → A)         RMSE(A → B)
 Income only        1.6078767 × 10^5    1.6215652 × 10^5
 Income + Rooms     1.2576588 × 10^5    1.2831218 × 10^5

Figure 3.7  Example of data splitting. The top table shows three columns and seven rows of the housing-price data used in §A.13. I then randomly split this into two equally-sized parts (next two tables). I estimate a linear model which predicts house value from income alone, and another model which predicts from income and the median number of rooms, on the first half. The last table shows the performance of each estimated model both on the first half of the data (left column) and on the second (right column). The latter is a valid estimate of generalization error. The larger model always has a lower in-sample error, whether or not it is really better, so the in-sample MSEs provide little evidence that we should use the larger model. Having a lower score under data splitting, however, is evidence that the larger model generalizes better. (For R commands used to get these numbers, see Code Example 2.)
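The RMSEs in the last table can be reproduced, up to the randomness of the split, by re-using the objects from Code Example 2; this is only a sketch of the calculation, not code from the original figure:

# Sketch: RMSEs of each model on the training half (A) and the held-out half (B),
# using the models and helper functions from Code Example 2
rmse.table <- rbind(
    "Income only" = c(sqrt(in.sample.mse(msmall)), sqrt(new.sample.mse(msmall, half_B))),
    "Income + Rooms" = c(sqrt(in.sample.mse(mlarge)), sqrt(new.sample.mse(mlarge, half_B))))
colnames(rmse.table) <- c("RMSE(A to A)", "RMSE(A to B)")
rmse.table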

[Plot for Figure 3.8: in-sample, generalization, and cross-validated MSE against polynomial degree; produced by the following code.]

little.df <- data.frame(x = x, y = y)
cv.q <- cv.lm(little.df, poly.formulae)
plot(0:9, mse.q, type = "b", xlab = "polynomial degree", ylab = "mean squared error",
    log = "y", ylim = c(min(mse.q), max(gmse.q)))
lines(0:9, gmse.q, lty = 2, col = "blue", type = "b", pch = 2)
lines(0:9, cv.q, lty = 3, col = "red", type = "b", pch = 3)
legend("topleft", legend = c("In-sample", "Generalization", "CV"), col = c("black",
    "blue", "red"), lty = 1:3, pch = 1:3)

Figure 3.8  In-sample, generalization, and cross-validated MSE for the polynomial fits of Figures 3.3, 3.4 and 3.5. Note that the cross-validation is done entirely within the initial set of only 20 data points.

4 Using Nonparametric Smoothing in Regression

Having spent long enough running down linear regression, and thought through evaluating predictive models, it is time to turn to constructive alternatives, which are (also) based on smoothing.

Recall the basic kind of smoothing we are interested in: we have a response variable Y, some input variables which we bind up into a vector X, and a collection of data values, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). By "smoothing", I mean that predictions are going to be weighted averages of the observed responses in the training data:

    μ̂(x) = Σ_{i=1}^n ŵ(x, x_i, h) y_i    (4.1)

Most smoothing methods have a control setting, here written h, that says how much to smooth. With nearest neighbors, for instance, the weights are 1/k if x_i is one of the k nearest points to x, and w_i = 0 otherwise, so large k means that each prediction is an average over many training points. Similarly with kernel regression, where the degree of smoothing is controlled by the bandwidth. Why do we want to do this? How do we pick how much smoothing to do?

4.1 How Much Should We Smooth?

When we smooth very little (h → 0), then we can match very small, fine-grained or sharp aspects of the true regression function, if there are such. That is, less smoothing leads to less bias. At the same time, less smoothing means that each of our predictions is going to be an average over (in effect) fewer observations, making the prediction noisier. Smoothing less increases the variance of our estimate. Since

    (total error) = (noise) + (bias)² + (variance)    (4.2)

(Eq. 1.28), if we plot the different components of error as a function of h, we typically get something that looks like Figure 4.1. Because changing the amount of smoothing has opposite effects on the bias and the variance, there is an optimal amount of smoothing, where we can't reduce one source of error without increasing the other. We therefore want to find that optimal amount of smoothing, which is where cross-validation comes in.

You should note, at this point, that the optimal amount of smoothing depends on the real regression curve, on our smoothing method, and on how much data we have.

[Plot for Figure 4.1: generalization error (vertical axis) against amount of smoothing (horizontal axis), decomposed into its components; produced by the following code.]

curve(2 * x^4, from = 0, to = 1, lty = 2, xlab = "Smoothing", ylab = "Generalization error")
curve(0.12 + x - x, lty = 3, add = TRUE)
curve(1/(10 * x), lty = 4, add = TRUE)
curve(0.12 + 2 * x^4 + 1/(10 * x), add = TRUE)

Figure 4.1  Decomposition of the generalization error of smoothing: the total error (solid) equals process noise (dotted) plus approximation error from smoothing (= squared bias, dashed) and estimation variance (dot-and-dash). The numerical values here are arbitrary, but the functional forms (squared bias ∝ h⁴, variance ∝ n⁻¹h⁻¹) are representative of kernel regression (Eq. 4.12).

This is because the variance contribution generally shrinks as we get more data.[1] If we get more data, we go from Figure 4.1 to Figure 4.2. The minimum of the over-all error curve has shifted to the left, and we should smooth less.

Strictly speaking, parameters are properties of the data-generating process alone, so the optimal amount of smoothing is not really a parameter. If you do think of it as a parameter, you have the problem of why the "true" value changes as you get more data. It's better thought of as a setting or control variable in the smoothing method, to be adjusted as convenient.

4.2 Adapting to Unknown Roughness

Consider Figure 4.3, which graphs two functions, r and s. Both are "smooth" functions in the mathematical sense.[2] We could Taylor-expand both functions to approximate their values anywhere, just from knowing enough derivatives at one point x_0.[3]

[1] Sometimes bias changes as well. Noise does not (why?).
[2] They are "C∞": continuous, with continuous derivatives to all orders.
[3] See App. D for a refresher on Taylor expansions.

curve(2 * x^4, from = 0, to = 1, lty = 2, xlab = "Smoothing", ylab = "Generalization error")
curve(0.12 + x - x, lty = 3, add = TRUE)
curve(1/(10 * x), lty = 4, add = TRUE, col = "grey")
curve(0.12 + 2 * x^4 + 1/(10 * x), add = TRUE, col = "grey")
curve(1/(30 * x), lty = 4, add = TRUE)
curve(0.12 + 2 * x^4 + 1/(30 * x), add = TRUE)

Figure 4.2 Consequences of adding more data to the components of error: the noise (dotted) and the bias (dashed) don't change, but the new variance curve (dot-and-dash, black) is to the left of the old (grey), so the new over-all error curve (solid black) is lower, and has its minimum at a smaller amount of smoothing than the old (solid grey).

If, instead of knowing the derivatives at x_0, we have the values of the functions at a sequence of points x_1, x_2, ..., x_n, we could use interpolation to fill out the rest of the curve. Quantitatively, however, r is less smooth than s — it changes much more rapidly, with many reversals of direction. For the same degree of accuracy, r needs more, and more closely spaced, training points x_i in the interpolation than does s.

Now suppose that we don't actually get to see r and s, but rather just r(x) + ε and s(x) + η, for various x, where ε and η are noise. (To keep things simple I'll assume they're constant-variance, IID Gaussian noises, say with σ = 0.15.) The data now look something like Figure 4.4. Can we recover the curves?

As remarked in Chapter 1, if we had many measurements at the same x, then we could find the expectation value by averaging: the regression function is μ(x) = E[Y | X = x], so with multiple observations at x_i = x, the mean of the corresponding y would (by the law of large numbers) converge on μ(x_i). Generally, however, we have at most one measurement per value of x, so simple averaging won't work.

Even if we just confine ourselves to the x_i where we have observations, the mean-squared error would always be σ^2, the noise variance; however, our estimate would be unbiased.

Smoothing methods try to use multiple measurements at points x_i which are near the point of interest x. If the regression function is smooth, as we're assuming it is, μ(x_i) will be close to μ(x). Remember that the mean-squared error is the sum of bias (squared) and variance. Averaging values at x_i ≠ x is going to introduce bias, but averaging independent terms together also reduces variance. If smoothing gets rid of more variance than it adds bias, we come out ahead.

Here's a little math to see it. Let's assume that we can do a first-order Taylor expansion (Figure D.1), so

\mu(x_i) \approx \mu(x) + (x_i - x)\mu'(x)    (4.3)

and

y_i \approx \mu(x) + (x_i - x)\mu'(x) + \epsilon_i    (4.4)

Now we average; to keep the notation simple, abbreviate the weight w(x_i, x, h) by just w_i.

\hat{\mu}(x) = \sum_{i=1}^{n} w_i y_i    (4.5)

= \sum_{i=1}^{n} w_i \left( \mu(x) + (x_i - x)\mu'(x) + \epsilon_i \right)    (4.6)

= \mu(x) + \sum_{i=1}^{n} w_i \epsilon_i + \mu'(x) \sum_{i=1}^{n} w_i (x_i - x)    (4.7)

\hat{\mu}(x) - \mu(x) = \sum_{i=1}^{n} w_i \epsilon_i + \mu'(x) \sum_{i=1}^{n} w_i (x_i - x)    (4.8)

E\left[ \left( \hat{\mu}(x) - \mu(x) \right)^2 \right] = \sigma^2 \sum_{i=1}^{n} w_i^2 + E\left[ \left( \mu'(x) \sum_{i=1}^{n} w_i (x_i - x) \right)^2 \right]    (4.9)

(Remember that: E[ε_i] = 0; ε_i is uncorrelated with everything; V[ε_i] = σ^2; and Σ_i w_i = 1.)

The first term on the final right-hand side is an estimation variance, which will tend to shrink as n grows. (If we just did a simple global mean, w_i = 1/n for all i, so we'd get σ^2/n, just like in baby stats.) The second term, an expectation, is bias, which grows as the x_i get further from x, and as the magnitudes of the derivatives grow; i.e., this term's growth varies with how smooth or wiggly the regression function is. For smoothing to work, w_i had better shrink as x_i − x and μ'(x) grow. [4] Finally, all else being equal, w_i should also shrink with n, so that the over-all size of the sum shrinks as we get more data.

[4] The higher derivatives of μ also matter, since we should really keep more than just the first term in the Taylor expansion. The details get messy, but Eq. 4.12 below gives the upshot for kernel smoothing.
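A short simulation can make the trade-off in Eq. 4.9 visible. The sketch below estimates μ(x_0) by averaging the y_i in a window of half-width h, and decomposes the error over many simulated data sets; the function name, the repetition count and the choice of the rough curve r from the running example are my own illustrative assumptions.

# Sketch of the bias-variance trade-off in Eq. 4.9 for a plain local
# average: estimate mu(x0) from the y_i with |x_i - x0| < h, repeated
# over many simulated data sets.  All specific settings here are
# illustrative choices, not prescriptions.
local.avg.error <- function(h, x0 = 1.6, n = 300, reps = 500, sigma = 0.15) {
    mu <- function(x) { sin(x) * cos(20 * x) }   # the rough curve r
    estimates <- replicate(reps, {
        x <- runif(n, 0, 3)
        y <- mu(x) + rnorm(n, 0, sigma)
        mean(y[abs(x - x0) < h])
    })
    c(bias2 = (mean(estimates) - mu(x0))^2, variance = var(estimates))
}

# Wider windows lower the variance but raise the squared bias
sapply(c(0.05, 0.1, 0.2, 0.4), local.avg.error)

The next paragraph does the same kind of calculation once, on the chapter's actual simulated data.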

To illustrate, let's try to estimate r(1.6) and s(1.6) from the noisy observations. We'll try a simple approach, just averaging all values of r(x_i) + ε_i and s(x_i) + η_i for 1.5 < x_i < 1.7 with equal weights. For r, this gives 0.71, while r(1.6) = 0.83. For s, this gives 1, with s(1.6) = 0.96. (See Figure 4.5.) The same window size creates a much larger bias with the rougher, more rapidly changing r than with the smoother, more slowly changing s. Varying the size of the averaging window will change the amount of error, and it will change it in different ways for the two functions.

If one does a more careful second-order Taylor expansion like that leading to Eq. 4.9, specifically for kernel regression, one can show that the bias at x is

E[\hat{\mu}(x) \mid X_1 = x_1, \ldots, X_n = x_n] - \mu(x) = h^2 \sigma_K^2 \left[ \frac{\mu''(x)}{2} + \frac{\mu'(x) f'(x)}{f(x)} \right] + o(h^2)    (4.10)

where f is the density of x, and \sigma_K^2 = \int u^2 K(u) du, the variance of the probability density corresponding to the kernel. [5] The μ'' term just comes from the second-order part of the Taylor expansion. To see where the μ'f' term comes from, imagine first that x is a mode of the distribution, so f'(x) = 0. As h shrinks, only training points where X_i is very close to x will have any weight in \hat{\mu}(x), and their distribution will be roughly symmetric around x (at least once h is sufficiently small). So, at a mode, E[w(X_i, x, h)(X_i − x)] ≈ 0. Away from a mode, there will tend to be more training points on one side or the other of x, depending on the sign of f'(x), and this induces a bias. The tricky part of the analysis is concluding that the bias has exactly the form given above. [6]

One can also work out the variance of the kernel regression estimate,

V[\hat{\mu}(x) \mid X_1 = x_1, \ldots, X_n = x_n] = \frac{\sigma^2(x) R(K)}{n h f(x)} + o((nh)^{-1})    (4.11)

where R(K) \equiv \int K^2(u) du. Roughly speaking, the width of the region where the kernel puts non-trivial weight is about h, so there will be about n h f(x) training points available to estimate \hat{\mu}(x). Each of these has a y value, equal to μ(x_i) plus noise of variance σ^2(x). The final factor of R(K) accounts for the average weight.

Putting the bias together with the variance, we get an expression for the mean squared error of the kernel regression at x:

MSE(x) = \sigma^2(x) + h^4 \sigma_K^4 \left[ \frac{\mu''(x)}{2} + \frac{\mu'(x) f'(x)}{f(x)} \right]^2 + \frac{\sigma^2(x) R(K)}{n h f(x)} + o(h^4) + o((nh)^{-1})    (4.12)

Eq. 4.12 tells us that, in principle, there is a single optimal choice of bandwidth h, an optimal degree of smoothing. We could find it by taking Eq. 4.12, differentiating with respect to the bandwidth, and setting everything to zero (neglecting the o() terms):

[5] If you are not familiar with the "order" symbols O and o, see Appendix C.
[6] Exercise 4.1 sketches the demonstration for the special case of the uniform ("boxcar") kernel.

0 = 4 h^3 \sigma_K^4 \left[ \frac{\mu''(x)}{2} + \frac{\mu'(x) f'(x)}{f(x)} \right]^2 - \frac{\sigma^2(x) R(K)}{n h^2 f(x)}    (4.13)

h = n^{-1/5} \left[ \frac{4 f(x) \sigma_K^4 \left( \frac{\mu''(x)}{2} + \frac{\mu'(x) f'(x)}{f(x)} \right)^2}{\sigma^2(x) R(K)} \right]^{-1/5}    (4.14)

Of course, this expression for the optimal h involves the unknown derivatives μ'(x) and μ''(x), plus the unknown density f(x) and its unknown derivative f'(x). But if we knew the derivative of the regression function, we would basically know the function itself (just integrate), so we seem to be in a vicious circle, where we need to know the function before we can learn it. [7]

One way of expressing this is to talk about how well a smoothing procedure would work if an Oracle were to tell us the derivatives, or (to cut to the chase) the optimal bandwidth h_opt. Since most of us do not have access to such oracles, we need to estimate h_opt. Once we have this estimate, \hat{h}, we get our weights and our predictions, and so a certain mean-squared error. Basically, our MSE will be the Oracle's MSE, plus an extra term which depends on how far \hat{h} is from h_opt, and on how sensitive the smoother is to the choice of bandwidth.

What would be really nice would be an adaptive procedure, one where our actual MSE, using \hat{h}, approaches the Oracle's MSE, which it gets from h_opt. This would mean that, in effect, we are figuring out how rough the underlying regression function is, and so how much smoothing to do, rather than having to guess or be told. [8] An adaptive procedure, if we can find one, is a partial substitute for prior knowledge.

4.2.1 Bandwidth Selection by Cross-Validation

The most straight-forward way to pick a bandwidth, and one which generally manages to be adaptive, is in fact cross-validation; k-fold CV is usually somewhat better than leave-one-out, but the latter often works acceptably too. The usual procedure is to come up with an initial grid of candidate bandwidths, and then use cross-validation to estimate how well each one of them would generalize. The one with the lowest error under cross-validation is then used to fit the regression curve to the whole data. [9]

[7] You may be wondering why I keep talking about the optimal bandwidth, when Eq. 4.14 makes it seem that the bandwidth should vary with x. One can go through pretty much the same sort of analysis in terms of the expected values of the derivatives, and the qualitative conclusions will be the same, but the notational overhead is even worse. Alternatively, there are techniques for variable-bandwidth smoothing.
[8] Only partial, because we'd always do better if the Oracle would just tell us h_opt.
[9] Since the optimal bandwidth is ∝ n^{-1/5}, and the training sets in cross-validation are smaller than the whole data set, one might adjust the bandwidth proportionally. However, if n is small enough that this makes a big difference, the sheer noise in bandwidth estimation usually overwhelms this.

cv_bws_npreg <- function(x, y, bandwidths = (1:50)/50, nfolds = 10) {
    require(np)
    n <- length(x)
    stopifnot(n > 1, length(y) == n)
    stopifnot(length(bandwidths) > 1)
    stopifnot(nfolds > 0, nfolds == trunc(nfolds))
    fold_MSEs <- matrix(0, nrow = nfolds, ncol = length(bandwidths))
    colnames(fold_MSEs) = bandwidths
    case.folds <- sample(rep(1:nfolds, length.out = n))
    for (fold in 1:nfolds) {
        train.rows = which(case.folds != fold)
        x.train = x[train.rows]
        y.train = y[train.rows]
        x.test = x[-train.rows]
        y.test = y[-train.rows]
        for (bw in bandwidths) {
            fit <- npreg(txdat = x.train, tydat = y.train, exdat = x.test,
                eydat = y.test, bws = bw)
            fold_MSEs[fold, paste(bw)] <- fit$MSE
        }
    }
    CV_MSEs = colMeans(fold_MSEs)
    best.bw = bandwidths[which.min(CV_MSEs)]
    return(list(best.bw = best.bw, CV_MSEs = CV_MSEs, fold_MSEs = fold_MSEs))
}

Code Example 4: Cross-validation for univariate kernel regression. The colnames trick: component names have to be character strings; other data types will be coerced into characters when we assign them to be names. Later, when we want to refer to a bandwidth column by its name, we wrap the name in another coercing function, such as paste. — This is just a demo of how cross-validation for bandwidth selection works in principle; don't use it blindly on data, or in assignments. (That goes double for the vector of default bandwidths.)

Code Example 4 shows how it would work in R, with one predictor variable, borrowing the npreg function from the np library (Hayfield and Racine, 2008). [10] The return value has three parts. The first is the actual best bandwidth. The second is a vector which gives the cross-validated mean-squared errors of all the different bandwidths in the vector bandwidths. The third component is an array which gives the MSE for each bandwidth on each fold. It can be useful to know things like whether the difference between the CV score of the best bandwidth and the runner-up is bigger than their fold-to-fold variability.

Figure 4.7 plots the CV estimate of the (root) mean-squared error versus bandwidth for our two curves. Figure 4.8 shows the data, the actual regression functions and the estimated curves with the CV-selected bandwidths. This illustrates why picking the bandwidth by cross-validation works: the curve of CV error against bandwidth is actually a pretty good approximation to the true curve of generalization error (which would look like Figure 4.1), so optimizing the CV error is close to optimizing the generalization error.

[10] The package has methods for automatically selecting bandwidth by cross-validation — see §4.6 below.
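As a usage sketch, assuming Code Example 4 has been run: the toy data, variable names and bandwidth grid below (centered on the 1.06 s_X n^{-1/5} rule of thumb mentioned just below) are illustrative choices of mine, not recommendations.

# Usage sketch for cv_bws_npreg (Code Example 4); all specific
# settings here are arbitrary illustrative choices.
library(np)
x.demo <- runif(300, 0, 3)
y.demo <- sin(x.demo) * cos(20 * x.demo) + rnorm(300, 0, 0.15)
bw.grid <- 1.06 * sd(x.demo) * length(x.demo)^(-1/5) * 2^seq(-3, 3, length.out = 20)
cv.out <- cv_bws_npreg(x.demo, y.demo, bandwidths = bw.grid)
cv.out$best.bw   # the bandwidth with the lowest cross-validated MSE
plot(bw.grid, sqrt(cv.out$CV_MSEs), log = "x", type = "l",
     xlab = "Bandwidth", ylab = "Root CV MSE")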

Notice, by the way, in Figure 4.7, that the rougher curve is more sensitive to the choice of bandwidth, and that the smoother curve always has a lower mean-squared error. Also notice that, at the minimum, one of the cross-validation estimates of generalization error is smaller than the true system noise level; this shows that cross-validation doesn't completely correct for optimism. [11]

We still need to come up with an initial set of candidate bandwidths. For reasons which will drop out of the math in Chapter 14, it's often reasonable to start around 1.06 s_X / n^{1/5}, where s_X is the sample standard deviation of X. However, it is hard to be very precise about this, and good results often require some honest trial and error.

4.2.2 Convergence of Kernel Smoothing and Bandwidth Scaling

Go back to Eq. 4.12 for the mean squared error of kernel regression. As we said, it involves some unknown constants, but we can bury them inside big-O order symbols, which also absorb the little-o remainder terms:

MSE(h) = \sigma^2(x) + O(h^4) + O((nh)^{-1})    (4.15)

The \sigma^2(x) term is going to be there no matter what, so let's look at the excess risk over and above the intrinsic noise:

MSE(h) - \sigma^2(x) = O(h^4) + O((nh)^{-1})    (4.16)

That is, the (squared) bias from the kernel's only approximately getting the curve is proportional to the fourth power of the bandwidth, but the variance is inversely proportional to the product of sample size and bandwidth. If we kept h constant and just let n → ∞, we'd get rid of the variance, but we'd be left with the bias. To get the MSE to go to zero, we need to let the bandwidth change with n — call it h_n. Specifically, suppose h_n → 0 as n → ∞, but n h_n → ∞. Then, by Eq. 4.16, the risk (generalization error) of kernel smoothing approaches that of the ideal predictor.

What is the best bandwidth? We saw in Eq. 4.14 that it is (up to constants)

h_{opt} = O(n^{-1/5})    (4.17)

If we put this bandwidth into Eq. 4.16, we get

MSE(h_{opt}) - \sigma^2(x) = O\left( (n^{-1/5})^4 \right) + O\left( (n \cdot n^{-1/5})^{-1} \right) = O(n^{-4/5}) + O(n^{-4/5}) = O(n^{-4/5})    (4.18)

That is, the excess prediction error of kernel smoothing over and above the system noise goes to zero as n^{-0.8}. Notice, by the way, that the contributions of bias and variance to the generalization error are both of the same order, n^{-0.8}.

[11] Tibshirani and Tibshirani (2009) gives a fairly straightforward way to adjust the estimate of the generalization error for the selected model or bandwidth, but that doesn't influence the choice of the best bandwidth.
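As a quick empirical check on the n^{-1/5} bandwidth scaling of Eq. 4.17, one can watch the cross-validated bandwidth shrink as the sample size grows. The sketch below assumes Code Example 4 has been run; the curve, noise level, grid and sample sizes are arbitrary illustrative choices, and the agreement will only be rough at these n.

# Sketch: the CV-selected bandwidth should shrink roughly like n^(-1/5)
# as n grows.  Uses cv_bws_npreg from Code Example 4; all settings are
# illustrative assumptions.
bw.for.n <- function(n) {
    x <- runif(n, 0, 3)
    y <- log(x + 1) + rnorm(n, 0, 0.15)
    cv_bws_npreg(x, y, bandwidths = (1:30)/60, nfolds = 5)$best.bw
}
ns <- c(100, 400, 1600)
sapply(ns, bw.for.n)            # selected bandwidths at each n
ns^(-1/5) / max(ns^(-1/5))      # the n^(-1/5) scaling, for comparison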

Is this fast or slow? We can compare it to what would happen with a parametric model, say with parameter θ. (For linear regression, θ would be the vector of slopes and the intercept.) The optimal value of the parameter, θ_0, minimizes the mean-squared error. At θ_0, the parametric model has MSE

MSE(x, \theta_0) = \sigma^2(x) + b^2(x, \theta_0)    (4.19)

where b is the bias of the parametric model; this is zero when the parametric model is true. [12] Since θ_0 is unknown and must be estimated, one typically has \hat{\theta} - \theta_0 = O(1/\sqrt{n}). Because the error is minimized at θ_0, the first derivatives of the MSE at θ_0 are 0. Doing a second-order Taylor expansion of the parametric model's MSE contributes an error O((\hat{\theta} - \theta_0)^2), so altogether

MSE(x, \hat{\theta}) - \sigma^2(x) = b^2(x, \theta_0) + O(1/n)    (4.20)

This means parametric models converge more quickly (n^{-1} goes to zero faster than n^{-0.8}), but they typically converge to the wrong answer (b^2 > 0). Kernel smoothing converges more slowly, but always converges to the right answer. [13]

This doesn't change much if we use cross-validation. Writing \hat{h}_{CV} for the bandwidth picked by cross-validation, it turns out (Simonoff, 1996, ch. 5) that

\frac{\hat{h}_{CV} - h_{opt}}{h_{opt}} = O(n^{-1/10})    (4.21)

Given this, one concludes (Exercise 4.2) that the MSE of using \hat{h}_{CV} is also O(n^{-4/5}).

4.2.3 Summary on Kernel Smoothing in 1D

Suppose that X and Y are both one-dimensional, and the true regression function μ(x) = E[Y | X = x] is continuous and has first and second derivatives. [14] Suppose that the noise around the true regression function is uncorrelated between different observations. Then the bias of kernel smoothing, when the kernel has bandwidth h, is O(h^2), and the variance, after n samples, is O((nh)^{-1}). The optimal bandwidth is O(n^{-1/5}), and the excess mean squared error of using this bandwidth is O(n^{-4/5}). If the bandwidth is selected by cross-validation, the excess risk is still O(n^{-4/5}).

4.3 Kernel Regression with Multiple Inputs

For the most part, when I've been writing out kernel regression I have been treating the input variable x as a scalar. There's no reason to insist on this, however; it could equally well be a vector.

[12] When the model is wrong, the optimal parameter value θ_0 is often called the pseudo-truth.
[13] It is natural to wonder if one couldn't do better than kernel smoothing's O(n^{-4/5}) while still having no asymptotic bias. Resolving this is very difficult, but the answer turns out to be "no" in the following sense (Wasserman, 2006). Any curve-fitting method which can learn arbitrary smooth regression functions will have some curves where it cannot converge any faster than O(n^{-4/5}). (In the jargon, that is the minimax rate.) Methods which converge faster than this for some kinds of curves have to converge more slowly for others. So this is the best rate we can hope for on truly unknown curves.
[14] Or can be approximated arbitrarily closely by such functions.

If we want to enforce the vector nature of the input in the notation, say by writing \vec{x} = (x^1, x^2, \ldots, x^d), then the kernel regression of y on \vec{x} would just be

\hat{\mu}(\vec{x}) = \sum_{i=1}^{n} y_i \frac{K(\vec{x}_i - \vec{x})}{\sum_{j=1}^{n} K(\vec{x}_j - \vec{x})}    (4.22)

In fact, if we want to predict a vector, we'd just substitute \vec{y}_i for y_i above.

To make this work, we need kernel functions for vectors. For scalars, I said that any probability density function would work, so long as it had mean zero, and a finite, strictly positive (not 0 or ∞) variance. The same conditions carry over: any distribution over vectors can be used as a multivariate kernel, provided it has mean zero, and the variance matrix is finite and "positive definite". [15] In practice, the overwhelmingly most common and practical choice is to use product kernels. [16] A product kernel simply uses a different kernel for each component, and then multiplies them together:

K(\vec{x}_i - \vec{x}) = K_1(x_i^1 - x^1) K_2(x_i^2 - x^2) \cdots K_d(x_i^d - x^d)    (4.23)

Now we just need to pick a bandwidth for each kernel, which in general should not be equal — say \vec{h} = (h_1, h_2, \ldots, h_d). Instead of having a one-dimensional error curve, as in Figure 4.1 or 4.2, we will have a d-dimensional error surface, but we can still use cross-validation to find the vector of bandwidths that generalizes best. We generally can't, unfortunately, break the problem up into somehow picking the best bandwidth for each variable without considering the others. This makes it slower to select good bandwidths in multivariate problems, but still often feasible. (We can actually turn the need to select bandwidths together to our advantage. If one or more of the variables are irrelevant to our prediction given the others, cross-validation will tend to give them the maximum possible bandwidth, and smooth away their influence. In Chapter 14, we'll look at formal tests based on this idea.)

Kernel regression will recover almost any regression function. This is true even when the true regression function involves lots of interactions among the input variables, perhaps in complicated forms that would be very hard to express in linear regression. For instance, Figure 4.9 shows a contour plot of a reasonably complicated regression surface, at least if one were to write it as polynomials in x^1 and x^2, which would be the usual approach. Figure 4.11 shows the estimate we get with a product of Gaussian kernels and only 1000 noisy data points. It's not perfect, of course (in particular the estimated contours aren't as perfectly smooth and round as the true ones), but the important thing is that we got this without having to know, and describe in Cartesian coordinates, the type of shape we were looking for. Kernel smoothing discovered the right general form.

[15] Remember that for a matrix v to be "positive definite", it must be the case that for any vector \vec{a} ≠ 0, \vec{a} \cdot v\vec{a} > 0. Covariance matrices are automatically non-negative, so we're just ruling out the case of some weird direction along which the distribution has zero variance.
[16] People do sometimes use multivariate Gaussians; we'll glance at this in Appendix E.
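For concreteness, here is a minimal sketch of the product-kernel estimator of Eqs. 4.22-4.23 with Gaussian kernels and one bandwidth per coordinate. The function name, the example surface and the bandwidths are my own illustrative assumptions; in practice npreg() does all of this (including the bandwidth selection) for us.

# Sketch of product-kernel regression (Eqs. 4.22-4.23) with Gaussian
# kernels; everything specific here is an illustrative assumption.
product.kernel.predict <- function(x0, X, y, h) {
    # X: n-by-d matrix of inputs; x0: length-d vector; h: length-d bandwidths
    # Sum the per-coordinate log kernel densities to get log product weights
    logw <- rowSums(sapply(1:ncol(X), function(j) {
        dnorm(X[, j] - x0[j], sd = h[j], log = TRUE)
    }))
    w <- exp(logw - max(logw))   # rescale for numerical stability
    sum(w * y) / sum(w)          # normalized weighted average, Eq. 4.22
}

# Tiny usage example on a made-up two-input surface
X.toy <- cbind(runif(500, -3, 3), runif(500, -3, 3))
y.toy <- sin(X.toy[, 1]) * cos(X.toy[, 2]) + rnorm(500, 0, 0.05)
product.kernel.predict(c(0, 1), X.toy, y.toy, h = c(0.3, 0.3))

Working on the log scale before multiplying the kernels is just a guard against underflow when d is large; it does not change the weights.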

There are limits to these abilities of kernel smoothers; the biggest one is that they require more and more data as the number of predictor variables increases. We will see later (Chapter 8) exactly how much data is required, generalizing the kind of analysis done in §4.2.2, and some of the compromises this can force us into.

4.4 Interpreting Smoothers: Plots

In a linear regression without interactions, it is fairly easy to interpret the coefficients. The expected response changes by β_i for a one-unit change in the i-th input variable. The coefficients are also the derivatives of the expected response with respect to the inputs. And it is easy to draw pictures of how the output changes as the inputs are varied, though the pictures are somewhat boring (straight lines or planes).

As soon as we introduce interactions, all this becomes harder, even for parametric regression. If there is an interaction between two components of the input, say x^1 and x^2, then we can't talk about the change in the expected response for a one-unit change in x^1 without saying what x^2 is. We might average over x^2 values, and in §4.5 below we'll see a reasonable way of doing this, but the flat statement "increasing x^1 by one unit increases the response by β_1" is just false, no matter what number we fill in for β_1. Likewise for derivatives; we'll come back to them in §4.5 as well.

What about pictures? With only two input variables, we can make wireframe plots like Figure 4.11, or contour or level plots, which will show the predictions for different combinations of the two variables. But what if we want to look at one variable at a time, or there are more than two input variables?

A reasonable way to produce a curve for each input variable is to set all the others to some "typical" value, like their means or medians, and then to plot the predicted response as a function of the one remaining variable of interest (Figure 4.12). Of course, when there are interactions, changing the values of the other inputs will change the response to the input of interest, so it's a good idea to produce a couple of curves, possibly super-imposed (Figure 4.12 again). If there are three or more input variables, we can look at the interactions of any two of them, taken together, by fixing the others and making three-dimensional or contour plots, along the same principles.

The fact that smoothers don't give us a simple story about how each input is associated with the response may seem like a disadvantage compared to using linear regression. Whether it really is a disadvantage depends on whether there really is a simple story to be told, and/or on how big a lie you are prepared to tell in order to keep your story simple.

4.5 Average Predictive Comparisons

Suppose we have a linear regression model

Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon    (4.24)

and we want to know how much Y changes, on average, for a one-unit increase in X_1. The answer, as you know very well, is just β_1:

[\beta_1 (X_1 + 1) + \beta_2 X_2] - [\beta_1 X_1 + \beta_2 X_2] = \beta_1    (4.25)

This is an interpretation of the regression coefficients which you are very used to giving. But it fails as soon as we have interactions:

Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon    (4.26)

Now the effect of increasing X_1 by 1 is

[\beta_1 (X_1 + 1) + \beta_2 X_2 + \beta_3 (X_1 + 1) X_2] - [\beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2] = \beta_1 + \beta_3 X_2    (4.27)

The right answer to "how much does the response change when X_1 is increased by one unit?" depends on the value of X_2; it's certainly not just "β_1".

We also can't give just a single answer if there are nonlinearities. Suppose that the true regression function is this:

Y = \frac{e^{\beta X}}{1 + e^{\beta X}} + \epsilon    (4.28)

which looks like Figure 4.13, setting β = 7 (for luck). Moving x from −4 to −3 increases the response by 7.57 × 10^{-10}, but the increase in the response from x = −1 to x = 0 is 0.499. Functions like this are very common in psychology, medicine (dose-response curves for drugs), biology, etc., and yet we cannot sensibly talk about the response to a one-unit increase in x. (We will come back to curves which look like this in Chapter 11.)

More generally, let's say we are regressing Y on a vector \vec{X}, and want to assess the impact of one component of the input on Y. To keep the use of subscripts and superscripts to a minimum, we'll write \vec{X} = (U, \vec{V}), where U is the coordinate we're really interested in. (It doesn't have to come first, of course.) We would like to know how much the prediction changes as we change u,

E[Y \mid \vec{X} = (u^{(2)}, \vec{v})] - E[Y \mid \vec{X} = (u^{(1)}, \vec{v})]    (4.29)

and the change in the response per unit change in u,

\frac{E[Y \mid \vec{X} = (u^{(2)}, \vec{v})] - E[Y \mid \vec{X} = (u^{(1)}, \vec{v})]}{u^{(2)} - u^{(1)}}    (4.30)

Both of these, but especially the latter, are called the predictive comparison. Note that both of them, as written, depend on u^{(1)} (the starting value for the variable of interest), on u^{(2)} (the ending value), and on \vec{v} (the other variables, held fixed during this comparison). We have just seen that in a linear model without interactions, u^{(1)}, u^{(2)} and \vec{v} all go away and leave us with the regression coefficient on u. In nonlinear or interacting models, we can't simplify so much.

Once we have estimated a regression model, we can choose our starting point, ending point and context, and just plug in to Eq. 4.29 or Eq. 4.30. (Or see problem 9 in problem set A.14.)
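As a sketch of that plugging-in, the snippet below computes Eqs. 4.29 and 4.30 from a fitted smoother. It assumes the model demo.np1 that will be fit in §4.6 below (a kernel regression of y on x and z); the particular starting point, ending point and context are arbitrary illustrative choices.

# Sketch of the predictive comparisons of Eqs. 4.29-4.30, using the
# demo.np1 model fit in Sec. 4.6; u1, u2 and v are arbitrary choices.
u1 <- -0.5; u2 <- 0.5; v <- 5
pred1 <- predict(demo.np1, newdata = data.frame(x = u1, z = v))
pred2 <- predict(demo.np1, newdata = data.frame(x = u2, z = v))
pred2 - pred1                # Eq. 4.29: change in the prediction
(pred2 - pred1)/(u2 - u1)    # Eq. 4.30: change per unit change in u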

But suppose we do want to boil this down into a single number for each input variable — how might we go about this?

One good answer, which comes from Gelman and Pardoe (2007), is just to average Eq. 4.30 over the data. [17] More specifically, we have as our average predictive comparison for u

\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \left( \hat{\mu}(u_j, \vec{v}_i) - \hat{\mu}(u_i, \vec{v}_i) \right) \mathrm{sign}(u_j - u_i)}{\sum_{i=1}^{n} \sum_{j=1}^{n} (u_j - u_i) \mathrm{sign}(u_j - u_i)}    (4.31)

where i and j run over data points, \hat{\mu} is our estimated regression function, and the sign function is defined by sign(x) = +1 if x > 0, = 0 if x = 0, and = −1 if x < 0. We use the sign function this way to make sure we are always looking at the consequences of increasing u.

The average predictive comparison is a reasonable summary of how rapidly we should expect the response to vary as u changes slightly. But we need to remember that once the model is nonlinear or has interactions, it's just not possible to boil down the whole predictive relationship between u and y into one number. In particular, the value of Eq. 4.31 is going to depend on the distribution of u (and possibly of v), even when the regression function is unchanged. (See Exercise 4.3.)

4.6 Computational Advice: npreg

The homework will call for you to do nonparametric regression with the np package — which we've already looked at a little. It's a powerful bit of software, but it can take a bit of getting used to. This section is not a substitute for reading Hayfield and Racine (2008), but should get you started.

We'll look at a synthetic-data example with four variables: a quantitative response Y, two quantitative predictors X and Z, and a categorical predictor W, which can be either "A" or "B". The true model is

Y = 20 X^2 + \begin{cases} Z & \text{if } W = A \\ 10 e^{Z}/(1 + e^{Z}) & \text{if } W = B \end{cases} + \epsilon    (4.32)

with ε ∼ N(0, 0.05). Code Example 5 generates some data from this model for us.

The basic function for fitting a kernel regression in np is npreg — conceptually, it's the equivalent of lm. Like lm, it takes a formula argument, which specifies the model, and a data argument, which is a data frame containing the variables included in the formula. The basic idea is to do something like this:

demo.np1 <- npreg(y ~ x + z, data = demo.df)

The variables on the right-hand side of the formula are the predictors; we use + to separate them. Kernel regression will automatically include interactions between all variables, so there is no special notation for interactions. Similarly, there is no point in either including or excluding intercepts. If we want to transform either a predictor variable or the response, as in lm, we can do so.

[17] Actually, they propose something a bit more complicated, which takes into account the uncertainty in our estimate of the regression function, via bootstrapping (Chapter 6).

make.demo.df <- function(n) {
    demo.func <- function(x, z, w) {
        20 * x^2 + ifelse(w == "A", z, 10 * exp(z)/(1 + exp(z)))
    }
    x <- runif(n, -1, 1)
    z <- rnorm(n, 0, 10)
    w <- sample(c("A", "B"), size = n, replace = TRUE)
    y <- demo.func(x, z, w) + rnorm(n, 0, 0.05)
    return(data.frame(x = x, y = y, z = z, w = w))
}
demo.df <- make.demo.df(100)

Code Example 5: Generating data from Eq. 4.32.

Run like this, npreg will try to determine the best bandwidths for the predictor variables, based on a sophisticated combination of cross-validation and optimization. Let's look at the output of npreg:

summary(demo.np1)
##
## Regression Data: 100 training points, in 2 variable(s)
##                        x        z
## Bandwidth(s): 0.08108232 2.428622
##
## Kernel Regression Estimator: Local-Constant
## Bandwidth Type: Fixed
## Residual standard error: 2.228451
## R-squared: 0.9488648
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 2

The main things here are the bandwidths. We also see the root mean squared error on the training data. Note that this is the in-sample root MSE; if we wanted the in-sample MSE, we could do

demo.np1$MSE
## [1] 4.965993

(You can check that this is the square of the residual standard error above.) If we want the cross-validated MSE used to pick the bandwidths, that's

demo.np1$bws$fval
## [1] 16.93204

The fitted and residuals functions work on these objects just like they do on lm objects, while the coefficients and confint functions do not. (Why?)

The predict function also works like it does for lm, expecting a data frame containing columns whose names match those in the formula used to fit the model:

predict(demo.np1, newdata = data.frame(x = -1, z = 5))
## [1] 22.60836
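Because the data-generating process here is in our own hands (Code Example 5), we can also check out-of-sample performance directly by predicting on a fresh draw. This is a sketch; the test-set size is an arbitrary choice.

# Sketch: out-of-sample MSE of demo.np1 on newly simulated data from
# the same model; the size of the test set is an illustrative choice.
test.df <- make.demo.df(500)
test.preds <- predict(demo.np1, newdata = test.df)
mean((test.df$y - test.preds)^2)   # out-of-sample MSE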

With two predictor variables, there is a nice three-dimensional default plot (Figure 4.14).

Kernel functions can also be defined for categorical and ordered variables. These can be included in the formula by wrapping the variable in factor() or ordered(), respectively:

demo.np3 <- npreg(y ~ x + z + factor(w), data = demo.df)

Again, there is no point in, or need for, indicating interactions. Including the extra variable, not surprisingly, improves the cross-validated MSE:

demo.np3$bws$fval
## [1] 3.852239

With three or more predictor variables, we'd need a four-dimensional plot, which is hard. Instead, the default is to plot what happens as we sweep one variable with the others held fixed (by default, at their medians; see help(npplot) for changing that), as in Figure 4.15. We get something parabola-ish as we sweep X (which is right), and something near a step function as we sweep Z (which is right when W = B), so we're not doing badly for estimating a fairly complicated function of three variables with only 100 samples. We could also try fixing W at one value or another and making a perspective plot — Figure 4.16.

The default optimization of bandwidths is extremely aggressive. It keeps adjusting the bandwidths until the changes in the cross-validated MSE are very small, or the changes in the bandwidths themselves are very small. The "tolerances" for what count as "very small" are controlled by arguments to npreg called tol (for the bandwidths) and ftol (for the MSE), which default to about 10^{-8} and 10^{-7}, respectively. With a lot of data, or a lot of variables, this gets extremely slow. One can often make npreg run much faster, with no real loss of accuracy, by adjusting these options. A decent rule of thumb is to start with tol and ftol both at 0.01. One can use the bandwidth found by this initial coarse search to start a more refined one, as follows:

bigdemo.df <- make.demo.df(1000)
system.time(demo.np4 <- npreg(y ~ x + z + factor(w), data = bigdemo.df, tol = 0.01, ftol = 0.01))
##    user  system elapsed
##  30.251   0.122  30.547

This tells us how much time it took R to run npreg, dividing that between time spent exclusively on our job and on background system tasks. The result of the run is stored in demo.np4:

demo.np4$bws
##
## Regression Data (1000 observations, 3 variable(s)):
##
##                        x        z    factor(w)
## Bandwidth(s): 0.05532488 1.964943 1.535065e-07
##
## Regression Type: Local-Constant

## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: y ~ x + z + factor(w)
## Bandwidth Type: Fixed
## Objective Function Value: 0.9546005 (achieved on multistart 2)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 2
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1

The bandwidths have all shrunk (as they should), and the cross-validated MSE is also much smaller (0.95 versus 3.9 before). Figure 4.16 shows the estimated regression surfaces for both values of the categorical variable.

The package also contains a function, npregbw, which takes a formula and a data frame, and just optimizes the bandwidth. This is called automatically by npreg, and many of the relevant options are documented in its help page. One can also use the output of npregbw as an argument to npreg, in place of a formula.

As a final piece of computational advice, you will notice when you run these commands yourself that the bandwidth-selection functions by default print out lots of progress-report messages. This can be annoying, especially if you are embedding the computation in a document, and so can be suppressed by setting a global option at the start of your code:

options(np.messages = FALSE)

4.7 Further Reading

Simonoff (1996) is a good practical introduction to kernel smoothing and related methods. Wasserman (2006) provides more theory. Li and Racine (2007) is a detailed treatment of nonparametric methods for econometric problems, overwhelmingly focused on kernel regression and kernel density estimation (which we'll get to in Chapter 14); Racine (2008) summarizes.

While kernels are a nice, natural method of non-parametric smoothing, they are not the only one. We saw nearest-neighbors in §1.5.1, and will encounter splines (continuous piecewise-polynomial models) in Chapter 7 and trees (piecewise-constant functions, with cleverly chosen pieces) in Chapter 13; local linear models (§10.5) combine kernels and linear models. There are many, many more options.

Historical Notes

Kernel regression was introduced, independently, by Nadaraya (1964) and Watson (1964); both were inspired by kernel density estimation.

Exercises

4.1 Suppose we use a uniform ("boxcar") kernel extending over the region (−h/2, h/2). Show that

E[\hat{\mu}(0)] = E\left[ \mu(X) \,\middle|\, X \in \left( -\tfrac{h}{2}, \tfrac{h}{2} \right) \right]    (4.33)

= \mu(0) + \mu'(0) E\left[ X \,\middle|\, X \in \left( -\tfrac{h}{2}, \tfrac{h}{2} \right) \right] + \frac{\mu''(0)}{2} E\left[ X^2 \,\middle|\, X \in \left( -\tfrac{h}{2}, \tfrac{h}{2} \right) \right] + o(h^2)    (4.34)

Show that E[ X | X ∈ (−h/2, h/2) ] = O(h^2 f'(0)), and that E[ X^2 | X ∈ (−h/2, h/2) ] = O(h^2). Conclude that the over-all bias is O(h^2).

4.2 Use Eqs. 4.21, 4.17 and 4.16 to show that the excess risk of kernel smoothing, when the bandwidth is selected by cross-validation, is also O(n^{-4/5}).

4.3 Generate 1000 data points where X is uniformly distributed between −4 and 4, and Y = e^{7x}/(1 + e^{7x}) + ε, with ε Gaussian and with variance 0.01. Use non-parametric regression to estimate \hat{\mu}(x), and then use Eq. 4.31 to find the average predictive comparison. Now re-run the simulation with X uniform on the interval [0, 0.5] and re-calculate the average predictive comparison. What happened?

par(mfcol = c(2, 1))
true.r <- function(x) { sin(x) * cos(20 * x) }
true.s <- function(x) { log(x + 1) }
curve(true.r(x), from = 0, to = 3, xlab = "x", ylab = expression(r(x)))
curve(true.s(x), from = 0, to = 3, xlab = "x", ylab = expression(s(x)))
par(mfcol = c(1, 1))

Figure 4.3 Two curves for the running example. Above, r(x) = sin(x) cos(20x); below, s(x) = log(1 + x) (we will not use this information about the exact functional forms).

x = runif(300, 0, 3)
yr = true.r(x) + rnorm(length(x), 0, 0.15)
ys = true.s(x) + rnorm(length(x), 0, 0.15)
par(mfcol = c(2, 1))
plot(x, yr, xlab = "x", ylab = expression(r(x) + epsilon))
curve(true.r(x), col = "grey", add = TRUE)
plot(x, ys, xlab = "x", ylab = expression(s(x) + eta))
curve(true.s(x), col = "grey", add = TRUE)

Figure 4.4 The curves of Fig. 4.3 (in grey), plus IID Gaussian noise with mean 0 and standard deviation 0.15. The two curves are sampled at the same x values, but with different noise realizations.

par(mfcol = c(2, 1))
x.focus <- 1.6
x.lo <- x.focus - 0.1
x.hi <- x.focus + 0.1
colors = ifelse((x < x.hi) & (x > x.lo), "black", "grey")
plot(x, yr, xlab = "x", ylab = expression(r(x) + epsilon), col = colors)
curve(true.r(x), col = "grey", add = TRUE)
points(x.focus, mean(yr[(x < x.hi) & (x > x.lo)]), pch = 18, cex = 2)
plot(x, ys, xlab = "x", ylab = expression(s(x) + eta), col = colors)
curve(true.s(x), col = "grey", add = TRUE)
points(x.focus, mean(ys[(x < x.hi) & (x > x.lo)]), pch = 18, cex = 2)
par(mfcol = c(1, 1))

Figure 4.5 Relationship between smoothing and function roughness. In both panels we estimate the value of the regression function at x = 1.6 by averaging observations where 1.5 < x_i < 1.7 (black points; others are "ghosted" in grey). The location of the average is shown by the large black diamond. This works poorly for the rough function r in the upper panel (the bias is large), but much better for the smoother function s in the lower panel (the bias is small).

[Figure 4.6: plot of the absolute value of the estimation error against the radius of the averaging window (log scale); code not reproduced here.]

Figure 4.6 Error of estimating r(1.6) (solid line) and s(1.6) (dashed) from averaging observed values at 1.6 − h < x < 1.6 + h, for different radii h. The grey line is σ, the standard deviation of the noise — how can the estimation error be smaller than that?

rbws <- cv_bws_npreg(x, yr, bandwidths = (1:100)/200)
sbws <- cv_bws_npreg(x, ys, bandwidths = (1:100)/200)
plot(1:100/200, sqrt(rbws$CV_MSEs), xlab = "Bandwidth", ylab = "Root CV MSE", type = "l", ylim = c(0, 0.6), log = "x")
lines(1:100/200, sqrt(sbws$CV_MSEs), lty = "dashed")
abline(h = 0.15, col = "grey")

Figure 4.7 Cross-validated estimate of the (root) mean-squared error as a function of the bandwidth (solid curve, r data; dashed, s data; grey line, true noise σ). Notice that the rougher curve is more sensitive to the choice of bandwidth, and that the smoother curve is more predictable at every choice of bandwidth. CV selects bandwidths of 0.02 for r and 0.095 for s.

x.ord = order(x)
par(mfcol = c(2, 1))
plot(x, yr, xlab = "x", ylab = expression(r(x) + epsilon))
rhat <- npreg(bws = rbws$best.bw, txdat = x, tydat = yr)
lines(x[x.ord], fitted(rhat)[x.ord], lwd = 4)
curve(true.r(x), col = "grey", add = TRUE, lwd = 2)
plot(x, ys, xlab = "x", ylab = expression(s(x) + eta))
shat <- npreg(bws = sbws$best.bw, txdat = x, tydat = ys)
lines(x[x.ord], fitted(shat)[x.ord], lwd = 4)
curve(true.s(x), col = "grey", add = TRUE, lwd = 2)
par(mfcol = c(1, 1))

Figure 4.8 Data from the running examples (circles), true regression functions (grey) and kernel estimates of the regression functions with CV-selected bandwidths (black). R notes: The x values aren't sorted, so we need to put them in order before drawing lines connecting the fitted values; then we need to put the fitted values in the same order. Alternately, we could have used predict on the sorted values, as in §4.3.

x1.points <- seq(-3, 3, length.out = 100)
x2.points <- x1.points
x12grid <- expand.grid(x1 = x1.points, x2 = x2.points)
y <- matrix(0, nrow = 100, ncol = 100)
y <- outer(x1.points, x2.points, f)
library(lattice)
wireframe(y ~ x12grid$x1 * x12grid$x2, scales = list(arrows = FALSE), xlab = expression(x^1), ylab = expression(x^2), zlab = "y")

Figure 4.9 An example of a regression surface that would be very hard to learn by piling together interaction terms in a linear regression framework. (Can you guess what the mystery function f is?) — wireframe is from the graphics library lattice.

x1.noise <- runif(1000, min = -3, max = 3)
x2.noise <- runif(1000, min = -3, max = 3)
y.noise <- f(x1.noise, x2.noise) + rnorm(1000, 0, 0.05)
noise <- data.frame(y = y.noise, x1 = x1.noise, x2 = x2.noise)
cloud(y ~ x1 * x2, data = noise, col = "black", scales = list(arrows = FALSE), xlab = expression(x^1), ylab = expression(x^2), zlab = "y")

Figure 4.10 1000 points sampled from the surface in Figure 4.9, plus independent Gaussian noise (s.d. = 0.05).

noise.np <- npreg(y ~ x1 + x2, data = noise)
y.out <- matrix(0, 100, 100)
y.out <- predict(noise.np, newdata = x12grid)
wireframe(y.out ~ x12grid$x1 * x12grid$x2, scales = list(arrows = FALSE), xlab = expression(x^1), ylab = expression(x^2), zlab = "y")

Figure 4.11 Gaussian kernel regression of the points in Figure 4.10. Notice that the estimated function will make predictions at arbitrary points, not just the places where there was training data.

new.frame <- data.frame(x1 = seq(-3, 3, length.out = 300), x2 = median(x2.noise))
plot(new.frame$x1, predict(noise.np, newdata = new.frame), type = "l", xlab = expression(x^1), ylab = "y", ylim = c(0, 1))
new.frame$x2 <- quantile(x2.noise, 0.25)
lines(new.frame$x1, predict(noise.np, newdata = new.frame), lty = 2)
new.frame$x2 <- quantile(x2.noise, 0.75)
lines(new.frame$x1, predict(noise.np, newdata = new.frame), lty = 3)

Figure 4.12 Predicted mean response as a function of the first input coordinate x^1 for the example data, evaluated with the second coordinate x^2 set to its median (solid), its 25th percentile (dashed) and its 75th percentile (dotted). Note that the changing shape of the partial response curve indicates an interaction between the two inputs. Also, note that the model can make predictions at arbitrary coordinates, whether or not there were any training points there.

curve(exp(7 * x)/(1 + exp(7 * x)), from = -5, to = 5, ylab = "y")

Figure 4.13 The function of Eq. 4.28, with β = 7.

plot(demo.np1, theta = 40, view = "fixed")

Figure 4.14 Plot of the kernel regression with just two predictor variables. (See help(npplot) for plotting options.)

plot(demo.np3)

Figure 4.15 Predictions of demo.np3 as each variable is swept over its range, with the others held at their medians.

x.seq <- seq(from = -1, to = 1, length.out = 50)
z.seq <- seq(from = -30, to = 30, length.out = 50)
grid.A <- expand.grid(x = x.seq, z = z.seq, w = "A")
grid.B <- expand.grid(x = x.seq, z = z.seq, w = "B")
yhat.A <- predict(demo.np4, newdata = grid.A)
yhat.B <- predict(demo.np4, newdata = grid.B)
par(mfrow = c(1, 2))
persp(x = x.seq, y = z.seq, z = matrix(yhat.A, nrow = 50), theta = 40, main = "W=A", xlab = "x", ylab = "z", zlab = "y", ticktype = "detailed")
persp(x = x.seq, y = z.seq, z = matrix(yhat.B, nrow = 50), theta = 40, main = "W=B", xlab = "x", ylab = "z", zlab = "y", ticktype = "detailed")

Figure 4.16 The regression surfaces learned for the demo function at the two different values of the categorical variable. Note that, holding z fixed, we always see a parabolic shape as we move along x (as we should), while whether we see a line or something close to a step function at constant x depends on w, as it should.

5 Simulation

You will recall from your previous statistics courses that quantifying uncertainty in statistical inference requires us to get at the sampling distributions of things like estimators. When the very strong simplifying assumptions of basic statistics courses do not apply, [1] there is little hope of being able to write down sampling distributions in closed form. There is equally little help when the estimates are themselves complex objects, like kernel regression curves or even histograms, rather than short, fixed-length parameter vectors. We get around this by using simulation to approximate the sampling distributions we can't calculate.

5.1 What Is a Simulation?

A mathematical model is a mathematical story about how the data could have been made, or generated. Simulating the model means following that story, implementing it, step by step, in order to produce something which should look like the data — what's sometimes called synthetic data, or surrogate data, or a realization of the model. In a stochastic model, some of the steps we need to follow involve a random component, and so multiple simulations starting from exactly the same inputs or initial conditions will not give exactly the same outputs or realizations. Rather, the model specifies a distribution over the realizations, and doing many simulations gives us a good approximation to this distribution.

For a trivial example, consider a model with three random variables: X_1 ∼ N(μ_1, σ_1^2), X_2 ∼ N(μ_2, σ_2^2), X_1 independent of X_2, and X_3 = X_1 + X_2. Simulating from this model means drawing a random value from the first normal distribution for X_1, drawing a second random value for X_2, and adding them together to get X_3. The marginal distribution of X_3, and the joint distribution of (X_1, X_2, X_3), are implicit in this specification of the model, and we can find them by running the simulation.

In this particular case, we could also find the distribution of X_3, and the joint distribution, by probability calculations of the kind you learned how to do in your basic probability courses. For instance, X_3 is N(μ_1 + μ_2, σ_1^2 + σ_2^2). These analytical probability calculations can usually be thought of as just short-cuts for exhaustive simulations.

[1] As discussed in Chapter 2, in your linear models class you learned, ad nauseam, about the sampling distribution of regression coefficients when the linear model is true, and the noise is Gaussian, independent of the predictor variables, and has constant variance. As an exercise, try to get parallel results when the noise has a t distribution with 10 degrees of freedom.
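Here is a minimal sketch of running that trivial simulation in R and checking it against the analytic answer; the particular values of the means and standard deviations are arbitrary illustrative choices.

# Sketch of simulating the trivial three-variable model; mu's and
# sigma's are arbitrary illustrative values.
mu1 <- 1; sigma1 <- 2; mu2 <- -3; sigma2 <- 0.5
x1 <- rnorm(10000, mu1, sigma1)
x2 <- rnorm(10000, mu2, sigma2)
x3 <- x1 + x2
# Compare the simulated distribution of X3 to the analytic answer
c(mean(x3), var(x3))
c(mu1 + mu2, sigma1^2 + sigma2^2)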

5.2 How Do We Simulate Stochastic Models?

5.2.1 Chaining Together Random Variables

Stochastic models are usually specified by sets of conditional distributions for one random variable, given some other variable or variables. For instance, a simple linear regression model might have the specification

X \sim U(x_{min}, x_{max})    (5.1)

Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)    (5.2)

If we knew how to generate a random variable from the distributions given on the right-hand sides, we could simulate the whole model by chaining together draws from those conditional distributions. This is in fact the general strategy for simulating any sort of stochastic model: chain together random variables. [2]

You might ask why we don't start by generating a random Y, and then generate X by drawing from the X | Y distribution. The basic answer is that you could, but it would generally be messier. (Just try to work out the conditional distribution of X | Y.) More broadly, in Chapter 20, we'll see how to arrange the variables in complicated probability models in a natural order, so that we start with independent, "exogenous" variables, then first-generation variables which only need to be conditioned on the exogenous variables, then second-generation variables which are conditioned on first-generation ones, and so forth. This is also the natural order for simulation.

The upshot is that we can reduce the problem of simulating to that of generating random variables.

5.2.2 Random Variable Generation

5.2.2.1 Built-in Random Number Generators

R provides random number generators for most of the most common distributions. By convention, the names of these functions all begin with the letter "r", followed by the abbreviation of the distribution, and the first argument is always the number of draws to make, followed by the parameters of the distribution. Some examples:

rnorm(n, mean = 0, sd = 1)
runif(n, min = 0, max = 1)
rexp(n, rate = 1)
rpois(n, lambda)
rbinom(n, size, prob)

[2] In this case, we could in principle first generate Y, and then draw from X | Y, but have fun finding those distributions. Especially have fun if, say, X has a t distribution with 10 degrees of freedom. (I keep coming back to that idea, because it's really a very small change from being Gaussian.)
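Putting these generators to work, here is a sketch of chaining them together to simulate the little regression model of Eqs. 5.1-5.2; the function name and the parameter values are arbitrary illustrative choices.

# Sketch of simulating Eqs. 5.1-5.2 by chaining draws; the default
# parameter values are arbitrary.
simulate.linear.model <- function(n, xmin = 0, xmax = 10, b0 = 5, b1 = -2, sigma = 3) {
    x <- runif(n, min = xmin, max = xmax)          # Eq. 5.1
    y <- rnorm(n, mean = b0 + b1 * x, sd = sigma)  # Eq. 5.2, conditional on x
    data.frame(x = x, y = y)
}
sim <- simulate.linear.model(100)

Note that the rnorm call above already uses the vectorization convention described next: each draw gets its own conditional mean.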

A further convention is that these parameters can be vectorized. Rather than giving a single mean and standard deviation (say) for multiple draws from the Gaussian distribution, each draw can have its own:

rnorm(10, mean = 1:10, sd = 1/sqrt(1:10))

That instance is rather trivial, but the exact same principle would be at work here:

rnorm(nrow(x), mean = predict(regression.model, newdata = x), sd = predict(volatility.model, newdata = x))

where regression.model and volatility.model are previously-defined parts of the model which tell us about conditional expectations and conditional variances.

Of course, none of this explains how R actually draws from any of these distributions; it's all at the level of a black box, which is to say black magic. Because ignorance is evil, and, even worse, unhelpful when we need to go beyond the standard distributions, it's worth opening the black box just a bit. We'll look at using transformations between distributions, and, in particular, at transforming uniform distributions into others (§5.2.2.3). Appendix M explains some more advanced methods, and looks at the issue of how to get uniformly-distributed random numbers in the first place.

5.2.2.2 Transformations

If we can generate a random variable Z with some distribution, and V = g(Z), then we can generate V. So one thing which gets a lot of attention is writing random variables as transformations of one another — ideally as transformations of easy-to-generate variables.

Example: from standard to customized Gaussians. Suppose we can generate random numbers from the standard Gaussian distribution, Z ∼ N(0, 1). Then we can generate from N(μ, σ^2) as σZ + μ. We can generate χ^2 random variables with 1 degree of freedom as Z^2. We can generate χ^2 random variables with d degrees of freedom by summing d independent copies of Z^2.

In particular, if we can generate random numbers uniformly distributed between 0 and 1, we can use this to generate anything which is a transformation of a uniform distribution. How far does that extend?
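A sketch of the Gaussian and chi-squared transformations just described, with arbitrary illustrative values μ = 3, σ = 2 and d = 5:

# Sketch of generating customized Gaussians and chi-squared variables
# by transforming standard Gaussians; the constants are arbitrary.
z <- rnorm(10000)
x.gauss <- 3 + 2 * z           # N(3, 4) from standard Gaussians
chisq1 <- z^2                  # chi-squared with 1 degree of freedom
# chi-squared with d = 5 degrees of freedom: sum 5 independent Z^2's
chisq5 <- rowSums(matrix(rnorm(10000 * 5)^2, ncol = 5))
# Sanity checks against the theoretical moments
c(mean(x.gauss), var(x.gauss))   # should be near 3 and 4
c(mean(chisq5), var(chisq5))     # should be near 5 and 10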

In the quantile method (or inverse distribution transform method), we generate a uniform random number $U$ and feed it as the argument to $Q_Z$. Now $Q_Z(U)$ has the distribution function $F_Z$:

$\Pr(Q_Z(U) \le a) = \Pr(F_Z(Q_Z(U)) \le F_Z(a))$   (5.4)
$= \Pr(U \le F_Z(a))$   (5.5)
$= F_Z(a)$   (5.6)

where the last line uses the fact that $U$ is uniform on $[0,1]$, and the first line uses the fact that $F_Z$ is a non-decreasing function, so $b \le a$ is true if and only if $F_Z(b) \le F_Z(a)$.

Example. The CDF of the exponential distribution with rate $\lambda$ is $1 - e^{-\lambda z}$. The quantile function $Q(p)$ is thus $-\log(1-p)/\lambda$. (Notice that this is positive, because $p < 1$ and so $\log(1-p) < 0$, and that it has units of $1/\lambda$, which are the units of $z$, as it should.) Therefore, if $U \sim \mathrm{Unif}(0,1)$, then $-\log(1-U)/\lambda \sim \mathrm{Exp}(\lambda)$. This is the method used by rexp().

Example: Power laws. The Pareto distribution or power law is a two-parameter family, with density $f(z; \alpha, z_0) = \frac{\alpha - 1}{z_0}\left(\frac{z}{z_0}\right)^{-\alpha}$ if $z \ge z_0$, and 0 otherwise. Integration shows that the cumulative distribution function is $F(z; \alpha, z_0) = 1 - \left(\frac{z}{z_0}\right)^{-\alpha+1}$. The quantile function therefore is $Q(p; \alpha, z_0) = z_0 (1-p)^{-\frac{1}{\alpha - 1}}$. (Notice that this has the same units as $z$, as it should.)

Example: Gaussians. The standard Gaussian $\mathcal{N}(0,1)$ does not have a closed form for its quantile function, but there are fast and accurate ways of calculating it numerically (they're what stand behind qnorm), so the quantile method can be used. In practice, there are other transformation methods which are even faster, but rely on special tricks.

Since $Q_Z(U)$ has the same distribution function as $Z$, we can use the quantile method, as long as we can calculate $Q_Z$. Since $Q_Z$ always exists, in principle this solves the problem. In practice, we need to calculate $Q_Z$ before we can use it, and this may not have a closed form, and numerical approximations may be intractable.[3] In such situations, we turn to more advanced methods, like those described in Appendix M.

[3] In essence, we have to solve the nonlinear equation $F_Z(z) = p$ for $z$ over and over for different $p$ — and that assumes we can easily calculate $F_Z$.
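The quantile method is easy to put into code. Here is a sketch of my own (not the book's generators) applying the two quantile functions just derived, with a quick check against R's built-in exponential generator:

    # Quantile-method generators based on the quantile functions derived above.
    rexp.quantile <- function(n, rate) {
        -log(1 - runif(n))/rate
    }
    rpareto.quantile <- function(n, exponent, z0) {
        z0 * (1 - runif(n))^(-1/(exponent - 1))
    }
    # Sanity checks: the exponential's mean should be about 1/rate, and a
    # two-sample Kolmogorov-Smirnov test should not reject agreement with rexp().
    mean(rexp.quantile(1e5, rate = 2))
    ks.test(rexp.quantile(1e4, rate = 2), rexp(1e4, rate = 2))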

5.2.3 Sampling

A complement to drawing from given distributions is to sample from a given collection of objects. This is a common task, so R has a function to do it:

    sample(x, size, replace = FALSE, prob = NULL)

Here x is a vector which contains the objects we're going to sample from. size is the number of samples we want to draw from x. replace says whether the samples are drawn with or without replacement. (If replace=TRUE, then size can be arbitrarily larger than the length of x. If replace=FALSE, having a larger size doesn't make sense.) Finally, the optional argument prob allows for weighted sampling; ideally, prob is a vector of probabilities as long as x, giving the probability of drawing each element of x.[4]

As a convenience for a common situation, running sample with one argument produces a random permutation of the input, i.e., sample(x) is equivalent to

    sample(x, size = length(x), replace = FALSE)

For example, the code for k-fold cross-validation, Code Example 3, had the lines

    fold.labels <- sample(rep(1:nfolds, length.out = nrow(data)))

Here, rep repeats the numbers from 1 to nfolds until we have one number for each row of the data frame, say 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2 if there were twelve rows. Then sample shuffles the order of those numbers randomly. This then would give an assignment of each row of df to one (and only one) of five folds.

5.2.3.1 Sampling Rows from Data Frames

When we have multivariate data (which is the usual situation), we typically arrange it into a data frame, where each row records one unit of observation, with multiple interdependent columns. The natural notion of sampling is then to draw a random sample of the data points, which in that representation amounts to a random sample of the rows. We can implement this simply by sampling row numbers. For instance, this command,

    df[sample(1:nrow(df), size = b), ]

will create a new data frame from df, by selecting b rows without replacement. It is an easy exercise to figure out how to sample from a data frame with replacement, and with unequal probabilities per row (one possible sketch follows below).

[4] If the elements of prob do not add up to 1, but are positive, they will be normalized by their sum, e.g., setting prob=c(9,9,1) will assign probabilities (9/19, 9/19, 1/19) to the three elements of x.
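Here is one such sketch; the function name and the use of a column of weights are my own illustration, not something defined in the text.

    # Sample b rows from a data frame, with replacement, with probability of
    # selection proportional to a vector of per-row weights.
    resample.rows <- function(df, b, weights = rep(1, nrow(df))) {
        rows <- sample(1:nrow(df), size = b, replace = TRUE, prob = weights)
        return(df[rows, ])
    }
    # e.g., over-sampling rows in proportion to some column w:
    # resample.rows(df, b = 20, weights = df$w)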

5.2.3.2 Multinomials and Multinoullis

If we want to draw one value from a multinomial distribution with probabilities $p = (p_1, p_2, \ldots, p_k)$, then we can use sample:

    sample(1:k, size = 1, prob = p)

If we want to simulate a "multinoulli" process[5], i.e., a sequence of independent and identically distributed multinomial random variables, then we can easily do so:

    rmultinoulli <- function(n, prob) {
        k <- length(prob)
        return(sample(1:k, size = n, replace = TRUE, prob = prob))
    }

Of course, the labels needn't be the integers 1:k (Exercise 5.1).

5.2.3.3 Probabilities of Observation

Often, our models of how the data are generated will break up into two parts. One part is a model of how actual variables are related to each other out in the world. (E.g., we might model how education and racial categories are related to occupation, and occupation is related to income.) The other part is a model of how variables come to be recorded in our data, and the distortions they might undergo in the course of doing so. (E.g., we might model the probability that someone appears in a survey as a function of race and income.) Plausible sampling mechanisms often make the probability of appearing in the data a function of some of the variables. This can then have important consequences when we try to draw inferences about the whole population or process from the sample we happen to have seen (see, e.g., App. K).

    income <- rnorm(n, mean = predict(income.model, x), sd = sigma)
    capture.probabilities <- predict(observation.model, x)
    observed.income <- sample(income, size = b, prob = capture.probabilities)

5.3 Repeating Simulations

Because simulations are often most useful when they are repeated many times, R has a command to repeat a whole block of code:

    replicate(n, expr)

Here expr is some executable "expression" in R, basically something you could type in the terminal, and n is the number of times to repeat it. For instance,

[5] A handy term I learned from Gustavo Lacerda.

    output <- replicate(1000, rnorm(length(x), beta0 + beta1 * x, sigma))

will replicate, 1000 times, sampling from the predictive distribution of a Gaussian linear regression model. Conceptually, this is equivalent to doing something like

    output <- matrix(0, nrow = 1000, ncol = length(x))
    for (i in 1:1000) {
        output[i, ] <- rnorm(length(x), beta0 + beta1 * x, sigma)
    }

but the replicate version has two great advantages. First, it is faster, because R processes it with specially-optimized code. (Loops are especially slow in R.) Second, and far more importantly, it is clearer: it makes it obvious what is being done, in one line, and leaves the computer to figure out the boring and mundane details of how best to implement it.

5.4 Why Simulate?

There are three major uses for simulation: to understand a model, to check it, and to fit it. We will deal with the first two here, and return to fitting in Chapter 26, after we've looked at dealing with dependence and hidden variables.

5.4.1 Understanding the Model; Monte Carlo

We understand a model by seeing what it predicts about the variables we care about, and the relationships between them. Sometimes those predictions are easy to extract from a mathematical representation of the model, but often they aren't. With a model we can simulate, however, we can just run the model and see what happens.

Our stochastic model gives a distribution for some random variable $Z$, which in general is a complicated, multivariate object with lots of interdependent components. We may also be interested in some complicated function $g$ of $Z$, such as, say, the ratio of two components of $Z$, or even some nonparametric curve fit through the data points. How do we know what the model says about $g$?

Assuming we can make draws from the distribution of $Z$, we can find the distribution of any function of it we like, to as much precision as we want. Suppose that $\tilde{Z}_1, \tilde{Z}_2, \ldots, \tilde{Z}_b$ are the outputs of $b$ independent runs of the model — $b$ different replicates of the model. (The tilde is a reminder that these are just simulations.) We can calculate $g$ on each of them, getting $g(\tilde{Z}_1), g(\tilde{Z}_2), \ldots, g(\tilde{Z}_b)$. If averaging makes sense for these values, then

$\frac{1}{b} \sum_{i=1}^{b} g(\tilde{Z}_i) \xrightarrow{b \to \infty} \mathbf{E}[g(Z)]$   (5.7)

by the law of large numbers. So simulation and averaging lets us get expectation values.
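For concreteness, here is a small Monte Carlo sketch of my own (the choice of $Z$ and $g$ is arbitrary): estimating $\mathbf{E}[g(Z)]$ for $g(z) = z_1/z_2$ when $Z = (Z_1, Z_2)$ has independent components $Z_1 \sim \mathrm{Exp}(1)$ and $Z_2 \sim \mathrm{Unif}(1,2)$. In this toy case the answer happens to be $\log 2 \approx 0.69$, so the simulation can be checked.

    # Monte Carlo estimation of E[g(Z)] for g(z) = z1/z2.
    b <- 1e5
    z1 <- rexp(b, rate = 1)
    z2 <- runif(b, min = 1, max = 2)
    g <- z1/z2
    mean(g)            # Monte Carlo estimate of E[g(Z)]; here the truth is log(2)
    sd(g)/sqrt(b)      # its Monte Carlo standard error, via the CLT
    mean(g > 1)        # Monte Carlo estimate of Pr(g(Z) > 1)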

This basic observation is the seed of the Monte Carlo method.[6] If our simulations are independent, we can even use the central limit theorem to say that $\frac{1}{b}\sum_{i=1}^{b} g(\tilde{Z}_i)$ has approximately the distribution $\mathcal{N}(\mathbf{E}[g(Z)], \mathbf{V}[g(Z)]/b)$.

Of course, if you can get expectation values, you can also get variances. (This is handy if trying to apply the central limit theorem!) You can also get any higher moments — if, for whatever reason, you need the kurtosis, you just have to simulate enough.

You can also pick any set $s$ and get the probability that $g(Z)$ falls into that set:

$\frac{1}{b} \sum_{i=1}^{b} \mathbf{1}_s(g(\tilde{Z}_i)) \xrightarrow{b \to \infty} \Pr(g(Z) \in s)$   (5.8)

The reason this works is of course that $\Pr(g(Z) \in s) = \mathbf{E}[\mathbf{1}_s(g(Z))]$, and we can use the law of large numbers again. So we can get the whole distribution of any complicated function of the model that we want, as soon as we can simulate the model. It is really only a little harder to get the complete sampling distribution than it is to get the expectation value, and the exact same ideas apply.

5.4.2 Checking the Model

An important but under-appreciated use for simulation is to check models after they have been fit. If the model is right, after all, it represents the mechanism which generates the data. This means that when we simulate, we run that mechanism, and the surrogate data which comes out of the machine should look like the real data. More exactly, the real data should look like a typical realization of the model. If it does not, then the model's account of the data-generating mechanism is systematically wrong in some way. By carefully choosing the simulations we perform, we can learn a lot about how the model breaks down and how it might need to be improved.[7]

5.4.2.1 "Exploratory" Analysis of Simulations

Often the comparison between simulations and data can be done qualitatively and visually. For example, a classic data set concerns the time between eruptions of the Old Faithful geyser in Yellowstone, and how they relate to the duration of the latest eruption. A common exercise is to fit a regression line to the data by ordinary least squares:

    library(MASS)
    data(geyser)
    fit.ols <- lm(waiting ~ duration, data = geyser)

[6] The name was coined by the physicists who used the method to do calculations relating to designing the hydrogen bomb; see Metropolis et al. (1953). Folklore among physicists says that the method goes back at least to Enrico Fermi in the 1930s, without the cutesy name.

[7] "Might", because sometimes (e.g., §1.4.2) we're better off with a model that makes systematic mistakes, if they're small and getting it right would be a hassle.

[Figure 5.1 about here: scatter-plot of waiting against duration for the geyser data, with the OLS regression line; produced by the following code.]

    plot(geyser$duration, geyser$waiting, xlab = "duration", ylab = "waiting")
    abline(fit.ols)

Figure 5.1: Data for the geyser data set, plus the OLS regression line.

Figure 5.1 shows the data, together with the OLS line. It doesn't look that great, but if someone insisted it was a triumph of quantitative vulcanology, how could you show they were wrong? We'll consider general tests of regression specifications in Chapter 9. For now, let's focus on the way OLS is usually presented as part of a stochastic model for the response conditional on the input, with Gaussian and homoskedastic noise. In this case, the stochastic model is $\text{waiting} = \beta_0 + \beta_1\,\text{duration} + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

    rgeyser <- function() {
        n <- nrow(geyser)
        sigma <- summary(fit.ols)$sigma
        new.waiting <- rnorm(n, mean = fitted(fit.ols), sd = sigma)
        new.geyser <- data.frame(duration = geyser$duration, waiting = new.waiting)
        return(new.geyser)
    }

Code Example 6: Function for generating surrogate data sets from the linear model fit to geyser.

If we simulate from this probability model, we'll get something we can compare to the actual data, to help us assess whether the scatter around that regression line is really bothersome. Since OLS doesn't require us to assume a distribution for the input variable (here, duration), the simulation function in Code Example 6 leaves those values alone, but regenerates values of the response (waiting) according to the model assumptions.

A useful principle for model checking is that if we do some exploratory data analyses of the real data, doing the same analyses to realizations of the model should give roughly the same results (Gelman, 2003; Hunter et al., 2008; Gelman and Shalizi, 2013). This is a test the model fails. Figure 5.2 shows the actual histogram of waiting, plus the histogram produced by simulating — reality is clearly bimodal, but the model is unimodal. Similarly, Figure 5.3 shows the real data, the OLS line, and a simulation from the OLS model. It's visually clear that the deviations of the real data from the regression line are both bigger and more patterned than those we get from simulating the model, so something is wrong with the latter.

By itself, just seeing that the data doesn't look like a realization of the model isn't super informative, since we'd really like to know how the model's broken, and so how to fix it. Further simulations, comparing more detailed analyses of the data to analyses of the simulation output, are often very helpful here. Looking at Figure 5.3, we might suspect that one problem is heteroskedasticity — the variance isn't constant. This suspicion is entirely correct, and will be explored in §10.3.2.

5.4.3 Sensitivity Analysis

Often, the statistical inference we do on the data is predicated on certain assumptions about how the data is generated. We've talked a lot about the Gaussian-noise assumptions that usually accompany linear regression, but there are many others. For instance, if we have missing values for some variables and just ignore incomplete rows, we are implicitly assuming that data are "missing at random", rather than missing in some systematic way that would carry information about what the missing values were (see App. K). Often, these assumptions make our analysis much neater than it otherwise would be, so it would be convenient if they were true.

[Figure 5.2 about here: histogram of the waiting times, with the distribution implied by simulating the OLS model overlaid as a dashed outline; produced by the following code.]

    hist(geyser$waiting, freq = FALSE, xlab = "waiting", main = "", sub = "",
         col = "grey")
    lines(hist(rgeyser()$waiting, plot = FALSE), freq = FALSE, lty = "dashed")

Figure 5.2: Actual density of the waiting time between eruptions (grey bars, solid lines) and that produced by simulating the OLS model (dashed lines).

As a wise man said long ago, "The method of 'postulating' what we want has many advantages; they are the same as the advantages of theft over honest toil" (Russell, 1920, ch. VII, p. 71). In statistics, honest toil often takes the form of sensitivity analysis, of seeing how much our conclusions would change if the assumptions were violated, i.e., of checking how sensitive our inferences are to the assumptions. In principle, this means setting up models where the assumptions are more or less violated, or violated in different ways, analyzing them as though the assumptions held, and seeing how badly wrong we go.

[Figure 5.3 about here: the geyser scatter-plot and OLS line of Figure 5.1, with one simulated realization from the OLS model added as small black dots; produced by the following code.]

    plot(geyser$duration, geyser$waiting, xlab = "duration", ylab = "waiting")
    abline(fit.ols)
    points(rgeyser(), pch = 20, cex = 0.5)

Figure 5.3: As in Figure 5.1, plus one realization of simulating the OLS model (small black dots).

Of course, if that was easy to do in closed form, we often wouldn't have needed to make those assumptions in the first place. On the other hand, it's usually pretty easy to simulate a model where the assumption is violated, run our original, assumption-laden analysis on the simulation output, and see what happens. Because it's a simulation, we know the complete truth about the data-generating process, and can assess how far off our inferences are.
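As a small illustration of the idea (a sketch of my own, not an analysis from the text), we might check how well the usual OLS confidence interval holds up when the Gaussian-noise assumption is violated by heavy-tailed noise:

    # Simulate a linear model whose noise is t-distributed with 3 degrees of
    # freedom (violating the Gaussian assumption), analyze it with lm() as
    # though the assumption held, and see how often the nominal 95% confidence
    # interval for the slope actually covers the true value.
    beta0 <- 99; beta1 <- -8; n <- 299
    covers <- replicate(1000, {
        x <- runif(n, min = 1, max = 5)
        y <- beta0 + beta1 * x + 5 * rt(n, df = 3)
        ci <- confint(lm(y ~ x), "x", level = 0.95)
        (ci[1] <= beta1) && (beta1 <= ci[2])
    })
    mean(covers)   # compare to the nominal coverage of 0.95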

In favorable circumstances, our inferences don't mess up too much even when the assumptions we used to motivate the analysis are badly wrong. Sometimes, however, we discover that even tiny violations of our initial assumptions lead to large errors in our inferences. Then we either need to make some compelling case for those assumptions, or be very cautious in our inferences.

5.5 Further Reading

Simulation will be used in nearly every subsequent chapter. It is the key to the "bootstrap" technique for quantifying uncertainty (Ch. 6), and the foundation for a whole set of methods for dealing with complex models of dependent data (Ch. 26).

Many texts on scientific programming discuss simulation, including Press et al. (1992) and, using R, Jones et al. (2009). There are also many more specialized texts on simulation in various applied areas. It must be said that many references on simulation present it as almost completely disconnected from statistics and data analysis, giving the impression that probability models just fall from the sky. Guttorp (1995) is an excellent exception.

For further reading on methods of drawing random variables from a given distribution, on Monte Carlo, and on generating uniform random numbers, see Appendix M. For doing statistical inference by comparing simulations to data, see Chapter 26.

When all (!) you need to do is draw numbers from a probability distribution which isn't one of the ones built in to R, it's worth checking CRAN's "task view" on probability distributions, https://cran.r-project.org/web/views/Distributions.html.

For sensitivity analyses, Miller (1998) describes how to use modern optimization methods to actively search for settings in simulation models which break desired behaviors or conclusions. I have not seen this idea applied to sensitivity analyses for statistical models, but it really ought to be.

Exercises

5.1 Modify rmultinoulli from §5.2.3.2 so that the values in the output are not the integers from 1 to k, but come from a vector of arbitrary labels.

6 The Bootstrap

We are now several chapters into a statistics class and have said basically nothing about uncertainty. This should seem odd, and may even be disturbing if you are very attached to your p-values and saying variables have "significant effects". It is time to remedy this, and talk about how we can quantify uncertainty for complex models. The key technique here is what's called bootstrapping, or the bootstrap.

6.1 Stochastic Models, Uncertainty, Sampling Distributions

Statistics is the branch of mathematical engineering which studies ways of drawing inferences from limited and imperfect data. We want to know how a neuron in a rat's brain responds when one of its whiskers gets tweaked, or how many rats live in Pittsburgh, or how high the water will get under the 16th Street bridge during May, or the typical course of daily temperatures in the city over the year, or the relationship between the number of birds of prey in Schenley Park in the spring and the number of rats the previous fall. We have some data on all of these things. But we know that our data is incomplete, and experience tells us that repeating our experiments or observations, even taking great care to replicate the conditions, gives more or less different answers every time. It is foolish to treat any inference from the data in hand as certain.

If all data sources were totally capricious, there'd be nothing to do beyond piously qualifying every conclusion with "but we could be wrong about this". A mathematical discipline of statistics is possible because while repeating an experiment gives different results, some kinds of results are more common than others; their relative frequencies are reasonably stable. We thus model the data-generating mechanism through probability distributions and stochastic processes. When and why we can use stochastic models are very deep questions, but ones for another time. If we can use them in our problem, quantities like the ones I mentioned above are represented as functions of the stochastic model, i.e., of the underlying probability distribution. Since a function of a function is a "functional", and these quantities are functions of the true probability distribution function, we'll call these functionals or statistical functionals.[1] Functionals could be single numbers (like the total rat population), or vectors, or even whole curves (like the expected time-course of temperature over the year, or the regression of hawks now on rats earlier).

[1] Most writers in theoretical statistics just call them "parameters" in a generalized sense, but I will try to restrict that word to actual parameters specifying statistical models, to minimize confusion. I may slip up.

Statistical inference becomes estimating those functionals, or testing hypotheses about them.

These estimates and other inferences are functions of the data values, which means that they inherit variability from the underlying stochastic process. If we "re-ran the tape" (as the late, great Stephen Jay Gould used to say), we would get different data, with a certain characteristic distribution, and applying a fixed procedure would yield different inferences, again with a certain distribution. Statisticians want to use this distribution to quantify the uncertainty of the inferences. For instance, the standard error is an answer to the question "By how much would our estimate of this functional vary, typically, from one replication of the experiment to another?" (It presumes a particular meaning for "typically vary", as the root-mean-square deviation around the mean.) A confidence region on a parameter, likewise, is the answer to "What are all the values of the parameter which could have produced this data with at least some specified probability?", i.e., all the parameter values under which our data are not low-probability outliers. The confidence region is a promise that either the true parameter point lies in that region, or something very unlikely under any circumstances happened, or our stochastic model is wrong.

To get things like standard errors or confidence intervals, we need to know the distribution of our estimates around the true values of our functionals. These sampling distributions follow, remember, from the distribution of the data, since our estimates are functions of the data. Mathematically the problem is well-defined, but actually computing anything is another story. Estimates are typically complicated functions of the data, and mathematically-convenient distributions may all be poor approximations to the data source. Saying anything in closed form about the distribution of estimates can be simply hopeless.

The two classical responses of statisticians were to focus on tractable special cases, and to appeal to asymptotics.

Your introductory statistics courses mostly drilled you in the special cases. From one side, limit the kind of estimator we use to those with a simple mathematical form — say, means and other linear functions of the data. From the other, assume that the probability distributions featured in the stochastic model take one of a few forms for which exact calculation is possible, analytically or via tabulated special functions. Most such distributions have origin myths: the Gaussian arises from averaging many independent variables of equal size (say, the many genes which contribute to height in humans); the Poisson distribution comes from counting how many of a large number of independent and individually-improbable events have occurred (say, radioactive nuclei decaying in a given second), etc. Squeezed from both ends, the sampling distribution of estimators and other functions of the data becomes exactly calculable in terms of the aforementioned special functions.

That these origin myths invoke various limits is no accident. The great results of probability theory — the laws of large numbers, the ergodic theorem, the central limit theorem, etc. — describe limits in which all stochastic processes in broad classes of models display the same asymptotic behavior. The central limit theorem, for instance, says that if we average more and more independent random quantities with a common distribution, and that common distribution isn't too pathological, then the average becomes closer and closer to a Gaussian.[2] Typically, as in the CLT, the limits involve taking more and more data from the source, so statisticians use the theorems to find the asymptotic, large-sample distributions of their estimates. We have been especially devoted to re-writing our estimates as averages of independent quantities, so that we can use the CLT to get Gaussian asymptotics.

Up through about the 1960s, statistics was split between developing general ideas about how to draw and evaluate inferences with stochastic models, and working out the properties of inferential procedures in tractable special cases (especially the linear-and-Gaussian case), or under asymptotic approximations. This yoked a very broad and abstract theory of inference to very narrow and concrete practical formulas, an uneasy combination often preserved in basic statistics classes.

The arrival of (comparatively) cheap and fast computers made it feasible for scientists and statisticians to record lots of data and to fit models to it, so they did. Sometimes the models were conventional ones, including the special-case assumptions, which often enough turned out to be detectably, and consequentially, wrong. At other times, scientists wanted more complicated or flexible models, some of which had been proposed long before, but now moved from being theoretical curiosities to stuff that could run overnight.[3] In principle, asymptotics might handle either kind of problem, but convergence to the limit could be unacceptably slow, especially for more complex models.

By the 1970s, then, statistics faced the problem of quantifying the uncertainty of inferences without using either implausibly-helpful assumptions or asymptotics; all of the solutions turned out to demand even more computation. Here we will examine what may be the most successful solution, Bradley Efron's proposal to combine estimation with simulation, which he gave the less-than-clear but persistent name of "the bootstrap" (Efron, 1979).

6.2 The Bootstrap Principle

Remember (from baby stats.) that the key to dealing with uncertainty in parameters and functionals is the sampling distribution of estimators. Knowing what distribution we'd get for our estimates on repeating the experiment would give us things like standard errors. Efron's insight was that we can simulate replication.

[2] The reason is that the non-Gaussian parts of the distribution wash away under averaging, but the average of two Gaussians is another Gaussian.

[3] Kernel regression (§1.5.2), kernel density estimation (Ch. 14), and nearest-neighbors prediction (§1.5.1) were all proposed in the 1950s or 1960s, but didn't begin to be widely used until about 1980.

[Figure 6.1 about here: schematic flow-chart for model-based bootstrapping, running from the data, through the estimator and fitted model, to simulated data and re-estimation, illustrated with a small numerical example in which the functional of interest is a quantile $q_{0.01}$.]

Figure 6.1: Schematic for model-based bootstrapping: simulated values are generated from the fitted model, then treated like the original data, yielding a new estimate of the functional of interest, here called $q_{0.01}$.

After all, we have already fitted a model to the data, which is a guess at the mechanism which generated the data. Running that mechanism generates simulated data which, by hypothesis, has the same distribution as the real data. Feeding the simulated data through our estimator gives us one draw from the sampling distribution; repeating this many times yields the sampling distribution. Since we are using the model to give us its own uncertainty, Efron called this "bootstrapping"; unlike the Baron Munchhausen's plan for getting himself out of a swamp by pulling on his own bootstraps, it works.

Figure 6.1 sketches the over-all process: fit a model to data, use the model to calculate the functional, then get the sampling distribution by generating new, synthetic data from the model and repeating the estimation on the simulation output.

To fix notation, we'll say that the original data is $x$. (In general this is a whole data frame, not a single number.) Our parameter estimate from the data is $\hat{\theta}$. Surrogate data sets simulated from the fitted model will be $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_B$. The corresponding re-estimates of the parameters on the surrogate data are $\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_B$.

The functional of interest is estimated by the statistic $T$, with sample value $\hat{t} = T(x)$, and values on the surrogates of $\tilde{t}_1 = T(\tilde{X}_1)$, $\tilde{t}_2 = T(\tilde{X}_2)$, ..., $\tilde{t}_B = T(\tilde{X}_B)$. (The statistic $T$ may be a direct function of the estimated parameters, and only indirectly a function of $x$.)[4] Everything which follows applies without modification when the functional of interest is the parameter, or some component of the parameter.

In this section, we will assume that the model is correct for some value of $\theta$, which we will call $\theta_0$. This means that we are employing a parametric model-based bootstrap. The true (population or ensemble) value of the functional is likewise $t_0$.

6.2.1 Variances and Standard Errors

The simplest thing to do is to get the variance or standard error:

$\widehat{\mathrm{V}}[\hat{t}\,] = \mathrm{Var}[\tilde{t}\,]$   (6.1)
$\widehat{\mathrm{se}}(\hat{t}\,) = \mathrm{sd}(\tilde{t}\,)$   (6.2)

That is, we approximate the variance of our estimate of $t_0$ under the true but unknown distribution $\theta_0$ by the variance of the re-estimates $\tilde{t}$ on surrogate data from the fitted model $\hat{\theta}$. Similarly, we approximate the true standard error by the standard deviation of the re-estimates. The logic here is that the simulated $\tilde{X}$ has about the same distribution as the real $X$ that our data, $x$, was drawn from, so applying the same estimation procedure to the surrogate data gives us the sampling distribution. This assumes, of course, that our model is right, and that $\hat{\theta}$ is not too far from $\theta_0$.

A code sketch is provided in Code Example 7. Note that this may not work as given in some circumstances, depending on the syntax details of, say, exactly what kind of data structure is needed to store $\hat{t}$.

6.2.2 Bias Correction

We can use bootstrapping to correct for a biased estimator. Since the sampling distribution of $\tilde{t}$ is close to that of $\hat{t}$, and $\hat{t}$ itself is close to $t_0$,

$\mathbf{E}[\hat{t} - t_0] \approx \mathbf{E}[\tilde{t} - \hat{t}\,]$   (6.3)

The left-hand side is the bias that we want to know, and the right-hand side is what we can calculate with the bootstrap.

In fact, Eq. 6.3 remains valid so long as the sampling distribution of $\hat{t} - t_0$ is close to that of $\tilde{t} - \hat{t}$. This is a weaker requirement than asking for $\hat{t}$ and $\tilde{t}$ themselves to have similar distributions, or asking for $\hat{t}$ to be close to $t_0$. In statistical theory, a random variable whose distribution does not depend on the parameters is called a pivot. (The metaphor is that it stays in one place while the parameters turn around it.) A sufficient (but not necessary) condition for Eq. 6.3 to hold is that $\hat{t} - t_0$ be a pivot, or approximately pivotal.

[4] $T$ is a common symbol in the literature on the bootstrap for a generic function of the data. It may or may not have anything to do with Student's $t$ test for difference in means.

    rboot <- function(statistic, simulator, B) {
        tboots <- replicate(B, statistic(simulator()))
        if (is.null(dim(tboots))) {
            tboots <- array(tboots, dim = c(1, B))
        }
        return(tboots)
    }
    bootstrap <- function(tboots, summarizer, ...) {
        summaries <- apply(tboots, 1, summarizer, ...)
        return(t(summaries))
    }
    bootstrap.se <- function(statistic, simulator, B) {
        bootstrap(rboot(statistic, simulator, B), summarizer = sd)
    }

Code Example 7: Code for calculating bootstrap standard errors. The function rboot generates B bootstrap samples (using the simulator function) and calculates the statistic on them (using statistic). simulator needs to be a function which returns a surrogate data set in a form suitable for statistic. (How would you modify the code to pass arguments to simulator and/or statistic?) Because every use of bootstrapping is going to need to do this, it makes sense to break it out as a separate function, rather than writing the same code many times (with many chances of getting it wrong). The function bootstrap takes the output of rboot and applies a summarizing function. bootstrap.se just calls rboot and makes the summarizing function sd, which takes a standard deviation. Important Note: This is just a code sketch, because depending on the data structure which the statistic returns, it may not (e.g.) be feasible to just run sd on it, and so it might need some modification. See detailed examples below.

    bootstrap.bias <- function(simulator, statistic, B, t.hat) {
        expect <- bootstrap(rboot(statistic, simulator, B), summarizer = mean)
        return(expect - t.hat)
    }

Code Example 8: Sketch of code for bootstrap bias correction. Arguments are as in Code Example 7, except that t.hat is the estimate on the original data. Important Note: As with Code Example 7, this is just a code sketch, because it won't work with all data types that might be returned by statistic, and so might require modification.

6.2.3 Confidence Intervals

A confidence interval is a random interval which contains the truth with high probability (the confidence level). If the confidence interval for g is $C$, and the confidence level is $1-\alpha$, then we want

$\Pr(t_0 \in C) = 1 - \alpha$   (6.4)

no matter what the true value of $t_0$. When we calculate a confidence interval, our inability to deal with distributions exactly means that the true confidence level, or coverage of the interval, is not quite the desired confidence level $1-\alpha$; the closer it is, the better the approximation, and the more accurate the confidence interval.[5]

[5] You might wonder why we'd be unhappy if the coverage level was greater than $1-\alpha$. This is certainly better than if it's less than the nominal confidence level, but it usually means we could have used a smaller set, and so been more precise about $t_0$, without any more real risk. Confidence intervals whose coverage is greater than the nominal level are called conservative; those with less than nominal coverage are anti-conservative (and not, say, "liberal").

When we simulate, we get samples of $\tilde{t}$, but what we really care about is the distribution of $\hat{t}$. When we have enough data to start with, those two distributions will be approximately the same. But at any given amount of data, the distribution of $\tilde{t} - \hat{t}$ will usually be closer to that of $\hat{t} - t_0$ than the distribution of $\tilde{t}$ is to that of $\hat{t}$. That is, the distribution of fluctuations around the true value usually converges quickly. (Think of the central limit theorem.) We can use this to turn information about the distribution of $\tilde{t}$ into accurate confidence intervals for $t_0$, essentially by re-centering $\tilde{t}$ around $\hat{t}$.

Specifically, let $q_{\alpha/2}$ and $q_{1-\alpha/2}$ be the $\alpha/2$ and $1-\alpha/2$ quantiles of $\tilde{t}$. Then

$1 - \alpha = \Pr\left(q_{\alpha/2} \le \tilde{T} \le q_{1-\alpha/2}\right)$   (6.5)
$= \Pr\left(q_{\alpha/2} - \hat{T} \le \tilde{T} - \hat{T} \le q_{1-\alpha/2} - \hat{T}\right)$   (6.6)
$\approx \Pr\left(q_{\alpha/2} - \hat{T} \le \hat{T} - t_0 \le q_{1-\alpha/2} - \hat{T}\right)$   (6.7)
$= \Pr\left(q_{\alpha/2} - 2\hat{T} \le -t_0 \le q_{1-\alpha/2} - 2\hat{T}\right)$   (6.8)
$= \Pr\left(2\hat{T} - q_{1-\alpha/2} \le t_0 \le 2\hat{T} - q_{\alpha/2}\right)$   (6.9)

The interval $C = [2\hat{T} - q_{1-\alpha/2},\ 2\hat{T} - q_{\alpha/2}]$ is random, because $\hat{T}$ is a random quantity, so it makes sense to talk about the probability that it contains the true value $t_0$. Also, notice that the upper and lower quantiles of $\tilde{T}$ have, as it were, swapped roles in determining the upper and lower confidence limits. Finally, notice that we do not actually know those quantiles exactly, but they're what we approximate by bootstrapping.

This is the basic bootstrap confidence interval, or the pivotal CI. It is simple and reasonably accurate, and makes a very good default choice for finding confidence intervals.

6.2.3.1 Other Bootstrap Confidence Intervals

The basic bootstrap CI relies on the distribution of $\tilde{t} - \hat{t}$ being approximately the same as that of $\hat{t} - t_0$. Even when this is false, however, it can be that the distribution of

$\tau = \frac{\hat{t} - t_0}{\widehat{\mathrm{se}}(\hat{t}\,)}$   (6.10)

is close to that of

$\tilde{\tau} = \frac{\tilde{t} - \hat{t}}{\widetilde{\mathrm{se}}(\tilde{t}\,)}$   (6.11)

    equitails <- function(x, alpha) {
        lower <- quantile(x, alpha/2)
        upper <- quantile(x, 1 - alpha/2)
        return(c(lower, upper))
    }
    bootstrap.ci <- function(statistic = NULL, simulator = NULL, tboots = NULL,
                             B = if (!is.null(tboots)) { ncol(tboots) },
                             t.hat, level) {
        if (is.null(tboots)) {
            stopifnot(!is.null(statistic))
            stopifnot(!is.null(simulator))
            stopifnot(!is.null(B))
            tboots <- rboot(statistic, simulator, B)
        }
        alpha <- 1 - level
        intervals <- bootstrap(tboots, summarizer = equitails, alpha = alpha)
        upper <- t.hat + (t.hat - intervals[, 1])
        lower <- t.hat + (t.hat - intervals[, 2])
        CIs <- cbind(lower = lower, upper = upper)
        return(CIs)
    }

Code Example 9: Sketch of code for calculating the basic bootstrap confidence interval. See Code Example 7 for rboot and bootstrap, and cautions about blindly applying this to arbitrary data-types.

This is like what we calculate in a $t$-test, and since the $t$-test was invented by "Student", these are called studentized quantities. If $\tau$ and $\tilde{\tau}$ have the same distribution, then we can reason as above and get a confidence interval

$\left(\hat{t} - \widehat{\mathrm{se}}(\hat{t}\,)\, Q_{\tilde{\tau}}(1 - \alpha/2),\ \hat{t} - \widehat{\mathrm{se}}(\hat{t}\,)\, Q_{\tilde{\tau}}(\alpha/2)\right)$   (6.12)

This is the same as the basic interval when $\widehat{\mathrm{se}}(\hat{t}\,) = \widetilde{\mathrm{se}}(\tilde{t}\,)$, but different otherwise. To find $\widetilde{\mathrm{se}}(\tilde{t}\,)$, we need to actually do a second level of bootstrapping, as follows.

1. Fit the model with $\hat{\theta}$, find $\hat{t}$.
2. For $i \in 1 : B_1$
   1. Generate $\tilde{X}_i$ from $\hat{\theta}$
   2. Estimate $\tilde{\theta}_i$, $\tilde{t}_i$
   3. For $j \in 1 : B_2$
      1. Generate $X^{\dagger}_{ij}$ from $\tilde{\theta}_i$
      2. Calculate $t^{\dagger}_{ij}$
   4. Set $\tilde{\sigma}_i$ = standard deviation of the $t^{\dagger}_{ij}$
   5. Set $\tilde{\tau}_{ij} = (t^{\dagger}_{ij} - \tilde{t}_i)/\tilde{\sigma}_i$ for all $j$
3. Set $\widehat{\mathrm{se}}(\hat{t}\,)$ = standard deviation of the $\tilde{t}_i$
4. Find the $\alpha/2$ and $1 - \alpha/2$ quantiles of the distribution of the $\tilde{\tau}$
5. Plug into Eq. 6.12.
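Here is one way the nested procedure might be coded; this is a sketch of my own rather than an implementation from the text, and like the other sketches it assumes the statistic returns a single number. The names simulator.from.estimate and estimator are illustrative assumptions: the former must simulate a data set from given parameter values, the latter must re-fit the model.

    # Studentized bootstrap confidence interval (Eq. 6.12) via a double bootstrap.
    bootstrap.ci.student <- function(statistic, estimator, simulator.from.estimate,
                                     theta.hat, t.hat, B1, B2, level) {
        alpha <- 1 - level
        t.tilde <- numeric(B1)
        tau.tilde <- matrix(NA, nrow = B1, ncol = B2)
        for (i in 1:B1) {
            x.tilde <- simulator.from.estimate(theta.hat)
            theta.tilde <- estimator(x.tilde)
            t.tilde[i] <- statistic(x.tilde)
            # Inner bootstrap: re-simulate from the re-estimated parameters
            t.inner <- replicate(B2, statistic(simulator.from.estimate(theta.tilde)))
            tau.tilde[i, ] <- (t.inner - t.tilde[i])/sd(t.inner)
        }
        se.hat <- sd(t.tilde)                                    # step 3
        q <- unname(quantile(tau.tilde, c(1 - alpha/2, alpha/2)))  # step 4
        c(lower = t.hat - se.hat * q[1], upper = t.hat - se.hat * q[2])  # Eq. 6.12
    }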

    boot.pvalue <- function(test, simulator, B, testhat) {
        testboot <- rboot(B = B, statistic = test, simulator = simulator)
        p <- (sum(testboot >= testhat) + 1)/(B + 1)
        return(p)
    }

Code Example 10: Bootstrap p-value calculation. testhat should be the value of the test statistic on the actual data. test is a function which takes in a data set and calculates the test statistic, presuming that large values indicate departure from the null hypothesis. Note the +1 in the numerator and denominator of the p-value — it would be more straightforward to leave them off, but this is a little more stable when B is comparatively small. (Also, it keeps us from ever reporting a p-value of exactly 0.)

The advantage of the studentized intervals is that they are more accurate than the basic ones; the disadvantage is that they are more work! At the other extreme, the percentile method simply sets the confidence interval to

$\left(Q_{\tilde{t}}(\alpha/2),\ Q_{\tilde{t}}(1 - \alpha/2)\right)$   (6.13)

This is definitely easier to calculate, but not as accurate as the basic, pivotal CI.

All of these methods have many variations, described in the monographs referred to at the end of this chapter (§6.9).

6.2.4 Hypothesis Testing

For hypothesis tests, we may want to calculate two sets of sampling distributions: the distribution of the test statistic under the null tells us about the size of the test and significance levels, and the distribution under the alternative tells us about power and realized power. We can find either with bootstrapping, by simulating from either the null or the alternative. In such cases, the statistic of interest, which I've been calling $T$, is the test statistic. Code Example 10 illustrates how to find a p-value by simulating under the null hypothesis. The same procedure would work to calculate power, only we'd need to simulate from the alternative hypothesis, and testhat would be set to the critical value of $T$ separating acceptance from rejection, not the observed value.

6.2.4.1 Double bootstrap hypothesis testing

When the hypothesis we are testing involves estimated parameters, we may need to correct for this. Suppose, for instance, that we are doing a goodness-of-fit test. If we estimate our parameters on the data set, we adjust our distribution so that it matches the data. It is thus not surprising if it seems to fit the data well! (Essentially, it's the problem of evaluating performance by looking at in-sample fit, which gave us so much trouble in Chapter 3.) Some test statistics have distributions which are not affected by estimating parameters, at least not asymptotically. In other cases, one can analytically come up with correction terms. When these routes are blocked, one uses a double bootstrap, where a second level of bootstrapping checks how much estimation improves the apparent fit of the model. This is perhaps most easily explained in pseudo-code (Code Example 11).

    doubleboot.pvalue <- function(test, simulator, B1, B2, estimator,
                                  thetahat, testhat, ...) {
        testboot <- vector(length = B1)
        pboot <- vector(length = B1)
        for (i in 1:B1) {
            xboot <- simulator(theta = thetahat, ...)
            thetaboot <- estimator(xboot)
            testboot[i] <- test(xboot)
            pboot[i] <- boot.pvalue(test, simulator, B2, testhat = testboot[i],
                theta = thetaboot)
        }
        p <- (sum(testboot >= testhat) + 1)/(B1 + 1)
        p.adj <- (sum(pboot <= p) + 1)/(B1 + 1)
        return(p.adj)
    }

Code Example 11: Code sketch for "double bootstrap" significance testing. The inner or second bootstrap is used to calculate the distribution of nominal bootstrap p-values. For this to work, we need to draw our second-level bootstrap samples from $\tilde{\theta}$, the bootstrap re-estimate, not from $\hat{\theta}$, the data estimate. The code presumes the simulator function takes a theta argument allowing this. Exercise: replace the for loop with replicate.

6.2.5 Model-Based Bootstrapping Example: Pareto's Law of Wealth Inequality

The Pareto distribution, or power law[6], is a popular model for data with "heavy tails", i.e. where the probability density $f(x)$ goes to zero only very slowly as $x \to \infty$. The probability density is

$f(x) = \frac{\theta - 1}{x_0}\left(\frac{x}{x_0}\right)^{-\theta}$   (6.14)

where $x_0$ is the minimum scale of the distribution, and $\theta$ is the scaling exponent (Exercise 6.1). The Pareto is highly right-skewed, with the mean being much larger than the median.

If we know $x_0$, one can show that the maximum likelihood estimator of the exponent $\theta$ is

$\hat{\theta} = 1 + \frac{n}{\sum_{i=1}^{n} \log \frac{x_i}{x_0}}$   (6.15)

and that this is consistent (Exercise 6.3), and efficient. Picking $x_0$ is a harder problem (see Clauset et al. 2009) — for the present purposes, pretend that the Oracle tells us. The file pareto.R, on the book website, contains a number of functions related to the Pareto distribution, including a function pareto.fit for estimating it. (There's an example of its use below.)

[6] Named after Vilfredo Pareto (1848–1923), the highly influential economist, political scientist, and proto-Fascist.
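The MLE of Eq. 6.15 is simple enough to compute directly. The following one-liner is just an illustration of the formula (it is not the book's pareto.fit), and it assumes the threshold x0 is known and that every entry of x is at least x0:

    # Maximum likelihood estimator of the Pareto scaling exponent, Eq. 6.15.
    pareto.mle.exponent <- function(x, x0) {
        1 + length(x)/sum(log(x/x0))
    }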

    sim.wealth <- function() {
        rpareto(n = n.tail, threshold = wealth.pareto$xmin,
                exponent = wealth.pareto$exponent)
    }
    est.pareto <- function(data) {
        pareto.fit(data, threshold = x0)$exponent
    }

Code Example 12: Simulator and estimator for model-based bootstrapping of the Pareto distribution.

Pareto came up with this density when he attempted to model the distribution of personal wealth. Approximately, but quite robustly across countries and time-periods, the upper tail of the distribution of income and wealth follows a power law, with the exponent varying as money is more or less concentrated among the very richest individuals and households.[7] Figure 6.2 shows the distribution of net worth for the 400 richest Americans in 2003.[8]

    source("http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/code/pareto.R")
    wealth <- scan("http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/data/wealth.dat")
    x0 <- 9e+08
    n.tail <- sum(wealth >= x0)
    wealth.pareto <- pareto.fit(wealth, threshold = x0)

Taking $x_0 = 9 \times 10^8$ (again, see Clauset et al. 2009), the number of individuals in the tail is 302, and the estimated exponent is $\hat{\theta} = 2.34$.

How much uncertainty is there in this estimate of the exponent? Naturally, we'll bootstrap. We need a function to generate Pareto-distributed random variables; this, along with some related functions, is part of the file pareto.R on the course website. With that tool, model-based bootstrapping proceeds as in Code Example 12.

Using these functions, we can now calculate the bootstrap standard error, bias and 95% confidence interval for $\hat{\theta}$, setting $B = 10^4$:

    pareto.se <- bootstrap.se(statistic = est.pareto, simulator = sim.wealth,
        B = 10000)
    pareto.bias <- bootstrap.bias(statistic = est.pareto, simulator = sim.wealth,
        t.hat = wealth.pareto$exponent, B = 10000)
    pareto.ci <- bootstrap.ci(statistic = est.pareto, simulator = sim.wealth,
        B = 10000, t.hat = wealth.pareto$exponent, level = 0.95)

This gives a standard error of $\pm 0.078$, matching the asymptotic approximation reasonably well[9], but not needing asymptotic assumptions.

[7] Most of the distribution, for ordinary people, roughly conforms to a log-normal.

[8] For the data source and a fuller analysis, see Clauset et al. (2009).

[9] "In Asymptopia", the variance of the MLE should be $(\hat{\theta} - 1)^2/n$, in this case $(0.076)^2$. The intuition is that this variance depends on how sharp the maximum of the likelihood function is — if it's sharply peaked, we can find the maximum very precisely, but a broad maximum is hard to pin down. Variance is thus inversely proportional to the second derivative of the negative log-likelihood. (The minus sign is because the second derivative has to be negative at a maximum, while variance has to be positive.) For one sample, the expected second derivative of the negative log-likelihood is $(\theta - 1)^{-2}$. (This is called the Fisher information of the model.) Log-likelihood adds across independent samples, giving us an over-all factor of $n$. In the large-sample limit, the actual log-likelihood will converge on the expected log-likelihood, so this gives us the asymptotic variance. (See also §H.5.5.)

[Figure 6.2 about here: log-log plot of the upper cumulative distribution of net worth for the 400 richest Americans, with the fitted Pareto distribution shown as a dashed line; produced by the following code.]

    plot.survival.loglog(wealth, xlab = "Net worth (dollars)",
        ylab = "Fraction of top 400 above that worth")
    rug(wealth, side = 1, col = "grey")
    curve((n.tail/400) * ppareto(x, threshold = x0,
        exponent = wealth.pareto$exponent, lower.tail = FALSE),
        add = TRUE, lty = "dashed", from = x0, to = 2 * max(wealth))

Figure 6.2: Upper cumulative distribution function (or "survival function") of net worth for the 400 richest individuals in the US (2000 data). The solid line shows the fraction of the 400 individuals whose net worth $W$ equaled or exceeded a given value $w$, $\Pr(W \ge w)$. (Note the logarithmic scale for both axes.) The dashed line is a maximum-likelihood estimate of the Pareto distribution, taking $x_0 = \$9 \times 10^8$. (This threshold was picked using the method of Clauset et al. 2009.) Since there are 302 individuals at or above the threshold, the cumulative distribution function of the Pareto has to be reduced by a factor of (302/400).

    ks.stat.pareto <- function(x, exponent, x0) {
        x <- x[x >= x0]
        ks <- ks.test(x, ppareto, exponent = exponent, threshold = x0)
        return(ks$statistic)
    }
    ks.pvalue.pareto <- function(B, x, exponent, x0) {
        testhat <- ks.stat.pareto(x, exponent, x0)
        testboot <- vector(length = B)
        for (i in 1:B) {
            xboot <- rpareto(length(x), exponent = exponent, threshold = x0)
            exp.boot <- pareto.fit(xboot, threshold = x0)$exponent
            testboot[i] <- ks.stat.pareto(xboot, exp.boot, x0)
        }
        p <- (sum(testboot >= testhat) + 1)/(B + 1)
        return(p)
    }

Code Example 13: Calculating a p-value for the Pareto distribution, using the Kolmogorov-Smirnov test and adjusting for the way estimating the scaling exponent moves the fitted distribution closer to the data.

Asymptotically, the bias is known to go to zero; at this size, bootstrapping gives a bias of 0.0051, which is effectively negligible.

We can also get the confidence interval; with the same $10^4$ replications, the 95% CI is 2.17, 2.48. In theory, the confidence interval could be calculated exactly, but it involves the inverse gamma distribution (Arnold, 1983), and it is quite literally faster to write and do the bootstrap than to go look it up.

A more challenging problem is goodness-of-fit; we'll use the Kolmogorov-Smirnov statistic.[10] Code Example 13 calculates the p-value. With ten thousand bootstrap replications,

    signif(ks.pvalue.pareto(10000, wealth, wealth.pareto$exponent, x0), 4)
    ## [1] 0.0101

Ten thousand replicates is enough that we should be able to accurately estimate probabilities of around 0.01 (since the binomial standard error will be $\sqrt{(0.01)(0.99)/10^4} \approx 9.9 \times 10^{-4}$); if it weren't, we might want to increase $B$.

Simply plugging in to the standard formulas, and thereby ignoring the effects of estimating the scaling exponent, gives a p-value of 0.171, which is not outstanding but not awful either. Properly accounting for the flexibility of the model, however, the discrepancy between what it predicts and what the data shows is so large that it would take a big (one-in-a-hundred) coincidence to produce it. We have, therefore, detected that the Pareto distribution makes systematic errors for this data, but we don't know much about what they are. In Chapter 15, we'll look at techniques which can begin to tell us something about how it fails.

[10] The file pareto.R contains a function, pareto.tail.ks.test, which does a goodness-of-fit test for fitting a power law to the tail of the distribution. That differs somewhat from what follows, because it takes into account the extra uncertainty which comes from having to estimate $x_0$. Here, I am pretending that an Oracle told us $x_0 = 9 \times 10^8$.

6.3 Resampling

The bootstrap approximates the sampling distribution, with three sources of approximation error. First, simulation error: using finitely many replications to stand for the full sampling distribution. Clever simulation design can shrink this, but brute force — just using enough replicates — can also make it arbitrarily small. Second, statistical error: the sampling distribution of the bootstrap re-estimates under our estimated model is not exactly the same as the sampling distribution of estimates under the true data-generating process. The sampling distribution changes with the parameters, and our initial estimate is not completely accurate. But it often turns out that the distribution of estimates around the truth is more nearly invariant than the distribution of estimates themselves, so subtracting the initial estimate from the bootstrapped values helps reduce the statistical error; there are many subtler tricks to the same end. Third, specification error: the data source doesn't exactly follow our model at all. Simulating the model then never quite matches the actual sampling distribution.

Efron had a second brilliant idea, which is to address specification error by replacing simulation from the model with re-sampling from the data. After all, our initial collection of data gives us a lot of information about the relative probabilities of different values. In a sense the empirical distribution is the least prejudiced estimate possible of the underlying distribution — anything else imposes biases or pre-conceptions, possibly accurate but also potentially misleading.[11] Lots of quantities can be estimated directly from the empirical distribution, without the mediation of a model. Efron's resampling bootstrap (a.k.a. the non-parametric bootstrap) treats the original data set as a complete population and draws a new, simulated sample from it, picking each observation with equal probability (allowing repeated values) and then re-running the estimation (Figure 6.3, Code Example 14). In fact, this is usually what people mean when they talk about "the bootstrap" without any modifier.

Everything we did with model-based bootstrapping can also be done with resampling bootstrapping — the only thing that's changing is the distribution the surrogate data is coming from.

The resampling bootstrap should remind you of k-fold cross-validation. The analog of leave-one-out CV is a procedure called the jackknife, where we repeat the estimate $n$ times on $n-1$ of the data points, holding each one out in turn. It's historically important (it dates back to the 1940s), but generally doesn't work as well as resampling.

[11] See §14.6 in Chapter 14.
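For comparison, here is a minimal sketch of the jackknife for a statistic of a one-dimensional data vector; the function is my own illustration, not code from the text.

    # Jackknife estimate of the standard error of a statistic: recompute the
    # statistic n times, each time leaving one observation out.
    jackknife.se <- function(x, statistic) {
        n <- length(x)
        t.jack <- sapply(1:n, function(i) statistic(x[-i]))
        sqrt(((n - 1)/n) * sum((t.jack - mean(t.jack))^2))
    }
    # e.g., jackknife.se(rnorm(100), statistic = median)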

An important variant is the smoothed bootstrap, where we re-sample the data points and then perturb each by a small amount of noise, generally Gaussian.[12]

[Figure 6.3 about here: schematic for the resampling bootstrap, parallel to Figure 6.1, with re-sampling from the empirical distribution in place of simulation from a fitted model.]

Figure 6.3: Schematic for resampling bootstrapping. New data is simulated by re-sampling from the original data (with replacement), and functionals are calculated either directly from the empirical distribution, or by estimating a model on this surrogate data.

    resample <- function(x) {
        sample(x, size = length(x), replace = TRUE)
    }
    resample.data.frame <- function(data) {
        sample.rows <- resample(1:nrow(data))
        return(data[sample.rows, ])
    }

Code Example 14: A utility function to resample from a vector, and another which resamples from a data frame. Can you write a single function which determines whether its argument is a vector or a data frame, and does the right thing in each case?

[12] We will see in Chapter 14 that this corresponds to sampling from a kernel density estimate.

Back to the Pareto example

Let's see how to use re-sampling to get a 95% confidence interval for the Pareto exponent.[13]

wealth.resample <- function() {
    resample(wealth[wealth >= x0])
}
pareto.CI.resamp <- bootstrap.ci(statistic = est.pareto, simulator = wealth.resample,
    t.hat = wealth.pareto$exponent, level = 0.95, B = 10000)

The interval is 2.17, 2.48; this is very close to the interval we got from the model-based bootstrap, which should actually reassure us about the latter's validity.

6.3.1 Model-Based vs. Resampling Bootstraps

When we have a properly specified model, simulating from the model gives more accurate results (at the same n) than does re-sampling the empirical distribution — parametric estimates of the distribution converge faster than the empirical distribution does. If on the other hand the model is mis-specified, then it is rapidly converging to the wrong distribution. This is of course just another bias-variance trade-off, like those we've seen in regression.

Since I am suspicious of most parametric modeling assumptions, I prefer re-sampling, when I can figure out how to do it, or at least until I have convinced myself that a parametric model is a good approximation to reality.

6.4 Bootstrapping Regression Models

Let's recap what we're doing when estimating regression models. We want to learn the regression function μ(x) = E[Y | X = x]. We estimate the model on a set of predictor-response pairs, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), resulting in an estimated curve (or surface) μ̂(x), fitted values μ̂_i = μ̂(x_i), and residuals, ε̂_i = y_i − μ̂_i. For any such model, we have a choice of several ways of bootstrapping, in decreasing order of reliance on the model.

• Simulate new X values from the model's distribution of X, and then draw Y | X from the specified conditional distribution.
• Hold the x_i fixed, but draw Y | X from the specified distribution (a code sketch of this option follows the list).
• Hold the x_i fixed, but make Y equal to μ̂(x) plus a randomly re-sampled ε̂_j.
• Re-sample (x, y) pairs.

13 Even if the Pareto model is wrong, the estimator of the exponent will converge on the value which gives, in a certain sense, the best approximation to the true distribution from among all power laws. Econometricians call such parameter values the pseudo-truth; we are getting a confidence interval for the pseudo-truth. In this case, the pseudo-true scaling exponent can still be a useful way of summarizing how heavy-tailed the income distribution is, despite the fact that the power law makes systematic errors.
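Here is a minimal sketch of the second option (mine, not the text's) for a Gaussian linear model: hold the observed predictors fixed and draw new responses from the estimated conditional distribution. The names mdl, df and response are hypothetical stand-ins for a fitted lm object, its data frame, and its response column.

# Sketch: simulate Y | X from a fitted Gaussian linear model, holding x fixed.
# 'mdl' is an lm object and 'df' the data frame it was fit to; both names are
# hypothetical, as is the default response column name.
sim.gaussian.lm <- function(mdl, df, response = "y") {
    new.frame <- df
    sigma.hat <- summary(mdl)$sigma   # estimated noise standard deviation
    new.frame[[response]] <- fitted(mdl) + rnorm(nrow(df), mean = 0, sd = sigma.hat)
    return(new.frame)
}

Feeding such a simulator to bootstrap.ci, together with an estimator, gives a fully model-based bootstrap for the regression coefficients.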

The first case is pure model-based bootstrapping. (So is the second, sometimes, when the regression model is agnostic about X.) The last case is just re-sampling from the joint distribution of (X, Y). The next-to-last case is called re-sampling the residuals or re-sampling the errors. When we do that, we rely on the regression model to get the conditional expectation function right, but we don't count on it getting the distribution of the noise around the expectations.

The specific procedure of re-sampling the residuals is to re-sample the ε̂_i, with replacement, to get ε̃_1, ε̃_2, ..., ε̃_n, and then set x̃_i = x_i, ỹ_i = μ̂(x̃_i) + ε̃_i. This surrogate data set is then re-analyzed like new data.

6.4.1 Re-sampling Points: Parametric Model Example

A classic data set contains the time between 299 eruptions of the Old Faithful geyser in Yellowstone, and the length of the subsequent eruptions; these variables are called waiting and duration. (We saw this data set already in § 5.4.2.1, and will see it again in § 10.3.2.) We'll look at the linear regression of waiting on duration. We'll re-sample (duration, waiting) pairs, and would like confidence intervals for the regression coefficients. This is a confidence interval for the coefficients of the best linear predictor, a functional of the distribution, which, as we saw in Chapters 1 and 2, exists no matter how nonlinear the process really is. It's only a confidence interval for the true regression parameters if the real regression function is linear.

Before anything else, look at the model:

library(MASS)
data(geyser)
geyser.lm <- lm(waiting ~ duration, data = geyser)

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       99.3       1.960     50.7         0
duration          -7.8       0.537    -14.5         0

The first step in bootstrapping this is to build our simulator, which just means sampling rows from the data frame:

resample.geyser <- function() {
    resample.data.frame(geyser)
}

We can check this by running summary(resample.geyser()), and seeing that it gives about the same quartiles and mean for both variables as summary(geyser),[14] but that the former gives different numbers each time it's run. Next, we define the estimator:

14 The minimum and maximum won't match up well — why not?

est.geyser.lm <- function(data) {
    fit <- lm(waiting ~ duration, data = data)
    return(coefficients(fit))
}

We can check that this function works by seeing that est.geyser.lm(geyser) matches coefficients(geyser.lm), but that est.geyser.lm(resample.geyser()) is different every time we run it. Put the pieces together:

geyser.lm.ci <- bootstrap.ci(statistic = est.geyser.lm, simulator = resample.geyser,
    level = 0.95, t.hat = coefficients(geyser.lm), B = 1e4)

               lower    upper
(Intercept)    96.60   102.00
duration       -8.72    -6.95

Notice that we do not have to assume homoskedastic Gaussian noise — fortunately, because that's a very bad assumption here.[15]

6.4.2 Re-sampling Points: Non-parametric Model Example

Nothing in the logic of re-sampling data points for regression requires us to use a parametric model. Here we'll provide 95% confidence bounds for the kernel smoothing of the geyser data. Since the functional is a whole curve, the confidence set is often called a confidence band.

We use the same simulator, but start with a different regression curve, and need a different estimator.

evaluation.points <- data.frame(duration = seq(from = 0.8, to = 5.5, length.out = 200))
library(np)
npr.geyser <- function(data, tol = 0.1, ftol = 0.1, plot.df = evaluation.points) {
    bw <- npregbw(waiting ~ duration, data = data, tol = tol, ftol = ftol)
    mdl <- npreg(bw)
    return(predict(mdl, newdata = plot.df))
}

Now we construct pointwise 95% confidence bands for the regression curve.

15 We have calculated 95% confidence intervals for the intercept β_0 and the slope β_1 separately. These intervals cover their coefficients all but 5% of the time. Taken together, they give us a rectangle in (β_0, β_1) space, but the coverage probability of this rectangle could be anywhere from 95% all the way down to 90%. To get a confidence region which simultaneously covers both coefficients 95% of the time, we have two big options. One is to stick to a box-shaped region and just increase the confidence level on each coordinate (to 97.5%). The other is to define some suitable metric of how far apart coefficient vectors are (e.g., ordinary Euclidean distance), find the 95% percentile of the distribution of this metric, and trace the appropriate contour around (β̂_0, β̂_1).
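The first option in footnote 15 (widen each coordinate's interval so that the box has joint 95% coverage) is easy to sketch with the same pieces; this is my illustration, not the text's, and it simply re-runs the bootstrap.ci helper used above at the 97.5% level for each coefficient.

# Sketch: Bonferroni-style joint coverage for the two regression coefficients,
# by raising each coordinate's confidence level from 95% to 97.5%.
geyser.lm.ci.joint <- bootstrap.ci(statistic = est.geyser.lm, simulator = resample.geyser,
    level = 0.975, t.hat = coefficients(geyser.lm), B = 1e4)

The resulting rectangle covers both coefficients simultaneously at least 95% of the time.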

main.curve <- npr.geyser(geyser)
# We already defined this in a previous example, but it doesn't hurt
resample.geyser <- function() {
    resample.data.frame(geyser)
}
geyser.resampled.curves <- rboot(statistic = npr.geyser, simulator = resample.geyser, B = 800)

Code Example 15: Generating multiple kernel-regression curves for the geyser data, by resampling that data frame and re-estimating the model on each simulation. geyser.resampled.curves stores the predictions of those 800 models, evaluated at a common set of values for the predictor variable, which we'll use presently to get confidence intervals. The vector main.curve stores predictions of the model fit to the whole data, evaluated at that same set of points.

To this end, we don't really need to keep around the whole kernel regression object — we'll just use its predicted values on a uniform grid of points, extending slightly beyond the range of the data (Code Example 15).

Observe that this will go through bandwidth selection again for each bootstrap sample. This is slow, but it is the most secure way of getting good confidence bands. Applying the bandwidth we found on the data to each re-sample would be faster, but would introduce an extra level of approximation, since we wouldn't be treating each simulation run the same as the original data. (A sketch of that shortcut appears after Figure 6.4.)

Figure 6.4 shows the curve fit to the data, the 95% confidence limits, and (faintly) all of the bootstrapped curves. Doing the 800 bootstrap replicates took 4 minutes on my laptop.[16]

6.4.3 Re-sampling Residuals: Example

As an example of re-sampling the residuals, rather than data points, let's take a linear regression, based on the data-analysis assignment in § A.14. We will regress gdp.growth on log(gdp), pop.growth, invest and trade:

penn <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/13/hw/02/penn-select.csv")
penn.formula <- "gdp.growth ~ log(gdp) + pop.growth + invest + trade"
penn.lm <- lm(penn.formula, data = penn)

(Why make the formula a separate object here?)

16 Specifically, I ran system.time(geyser.resampled.curves <- rboot(statistic=npr.geyser, simulator=resample.geyser, B=800)), which not only did the calculations and stored them in geyser.resampled.curves, but told me how much time it took R to do all that.

plot(0, type = "n", xlim = c(0.8, 5.5), ylim = c(0, 100), xlab = "Duration (min)",
    ylab = "Waiting (min)")
for (i in 1:ncol(geyser.resampled.curves)) {
    lines(evaluation.points$duration, geyser.resampled.curves[, i], lwd = 0.1, col = "grey")
}
geyser.npr.cis <- bootstrap.ci(tboots = geyser.resampled.curves, t.hat = main.curve,
    level = 0.95)
lines(evaluation.points$duration, geyser.npr.cis[, "lower"])
lines(evaluation.points$duration, geyser.npr.cis[, "upper"])
lines(evaluation.points$duration, main.curve)
rug(geyser$duration, side = 1)
points(geyser$duration, geyser$waiting)

Figure 6.4 Kernel regression curve for Old Faithful (central black line), with 95% confidence bands (other black lines), the 800 bootstrapped curves (thin, grey lines), and the data points. Notice that the confidence bands get wider where there is less data. Caution: doing the bootstrap took 4 minutes to run on my computer.
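As for the faster-but-less-secure shortcut mentioned above, here is a minimal sketch (mine, not the text's) that re-uses the bandwidths selected on the original data instead of re-running cross-validation on every bootstrap sample. It assumes, per the np package's documentation, that npregbw accepts pre-set bandwidths through bws together with bandwidth.compute = FALSE; geyser.bw and npr.geyser.fixed are hypothetical names.

# Sketch: fix the bandwidths at the values chosen on the full data, so each
# bootstrap re-fit skips the slow cross-validation search.
geyser.bw <- npregbw(waiting ~ duration, data = geyser, tol = 0.1, ftol = 0.1)
npr.geyser.fixed <- function(data, bws = geyser.bw$bw, plot.df = evaluation.points) {
    bw <- npregbw(waiting ~ duration, data = data, bws = bws,
                  bandwidth.compute = FALSE)   # re-use, don't re-select
    return(predict(npreg(bw), newdata = plot.df))
}

Swapping npr.geyser.fixed for npr.geyser in Code Example 15 would then trade some accuracy for speed.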

The estimated parameters are

                        x
(Intercept)      5.71e-04
log(gdp)         5.07e-04
pop.growth      -1.87e-01
invest           7.15e-04
trade            3.11e-05

resample.residuals.penn <- function() {
    new.frame <- penn
    new.growths <- fitted(penn.lm) + resample(residuals(penn.lm))
    new.frame$gdp.growth <- new.growths
    return(new.frame)
}
penn.estimator <- function(data) {
    mdl <- lm(penn.formula, data = data)
    return(coefficients(mdl))
}
penn.lm.cis <- bootstrap.ci(statistic = penn.estimator, simulator = resample.residuals.penn,
    B = 10000, t.hat = coefficients(penn.lm), level = 0.95)

Code Example 16: Re-sampling the residuals to get confidence intervals in a linear model.

Code Example 16 shows the new simulator for this set-up (resample.residuals.penn[17]), the new estimation function (penn.estimator[18]), and the confidence interval calculation (penn.lm.cis):

                    lower      upper
(Intercept)     -1.62e-02   1.71e-02
log(gdp)        -1.46e-03   2.49e-03
pop.growth      -3.58e-01  -1.75e-02
invest           4.94e-04   9.37e-04
trade           -1.94e-05   8.21e-05

Doing ten thousand linear regressions took 45 seconds on my computer, as opposed to 4 minutes for eight hundred kernel regressions.

17 How would you check that this worked?
18 How would you check that this worked?
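Footnotes 17 and 18 ask how you would check these pieces; here is a minimal sketch of such checks (mine, not the text's). The estimator applied to the original data should reproduce the coefficients of penn.lm exactly, while the simulator should return a data frame whose gdp.growth column has roughly the same summary as the original, but changes from run to run.

# Sketch of sanity checks for the pieces of Code Example 16.
all.equal(penn.estimator(penn), coefficients(penn.lm))    # should be TRUE
summary(penn$gdp.growth)                                  # original response
summary(resample.residuals.penn()$gdp.growth)             # roughly similar...
summary(resample.residuals.penn()$gdp.growth)             # ...but different each run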

6.5 Bootstrap with Dependent Data

If the data points we are looking at are vectors (or more complicated structures) with dependence between components, but each data point is independently generated from the same distribution, then dependence isn't really an issue. We re-sample vectors, or generate vectors from our model, and proceed as usual. In fact, that's what we've done so far in several cases.

If there is dependence across data points, things are more tricky. If our model incorporates this dependence, then we can just simulate whole data sets from it. An appropriate re-sampling method is trickier — just re-sampling individual data points destroys the dependence, so it won't do. We will revisit this question when we look at time series in Chapter 25.

6.6 Confidence Bands for Nonparametric Regression

Many of the examples in this chapter use bootstrapping to get confidence bands for nonparametric regression. It is worth mentioning that there is a subtle issue with doing so, but one which I do not think really matters, usually, for practice.

The issue is that when we do nonparametric regression, we accept some bias in our estimate of the regression function. In fact, we saw in Chapter 4 that minimizing the total MSE means accepting matching amounts of bias and variance. So our nonparametric estimate of μ is biased. If we simulate from it, we're simulating from something biased; if we simulate from the residuals, those residuals contain bias; and even if we do a pure resampling bootstrap, we're comparing the bootstrap replicates to a biased estimate. This means that we are really looking at sampling intervals around the biased estimate, rather than confidence intervals around μ. The two questions this raises are (1) how much this matters, and (2) whether there is any alternative. As for the size of the bias, we know from Chapter 4 that the squared bias, in 1D, goes like n^{-4/5}, so the bias itself goes like n^{-2/5}. This does go to zero, but slowly.

[[Living with it vs. Hall and Horowitz (2013) paper, which gives 1 − α coverage at a 1 − η fraction of points. Essentially, construct naive bands, and then work out by how much they need to be expanded to achieve desired coverage]]

6.7 Things Bootstrapping Does Poorly

The principle behind bootstrapping is that sampling distributions under the true process should be close to sampling distributions under good estimates of the truth. If small perturbations to the data-generating process produce huge swings in the sampling distribution, bootstrapping will not work well, and may fail spectacularly. For model-based bootstrapping, this means that small changes to the underlying parameters must produce small changes to the functionals of interest. Similarly, for resampling, it means that adding or removing a few data points must change the functionals only a little.[19]

Re-sampling in particular has trouble with extreme values. Here is a simple example.

19 More generally, moving from one distribution function f to another (1 − ε)f + εg mustn't change the functional very much when ε is small, no matter in what "direction" g we perturb it. Making this idea precise calls for some fairly deep mathematics, about differential calculus on spaces of functions (see, e.g., van der Vaart 1998, ch. 20).

Our data points X_i are IID, with X_i ∼ Unif(0, θ_0), and we want to estimate θ_0. The maximum likelihood estimate θ̂ is just the sample maximum of the x_i. We'll use resampling to get a confidence interval for this, as above — but I will fix the true θ_0 = 1, and see how often the 95% confidence interval covers the truth.

max.boot.ci <- function(x, B) {
    max.boot <- replicate(B, max(resample(x)))
    return(2 * max(x) - quantile(max.boot, c(0.975, 0.025)))
}
boot.cis <- replicate(1000, max.boot.ci(x = runif(100), B = 1000))
(true.coverage <- mean((1 >= boot.cis[1, ]) & (1 <= boot.cis[2, ])))
## [1] 0.87

That is, the actual coverage probability is not 95% but about 87%. If you suspect that your use of the bootstrap may be setting yourself up for a similar epic fail, your two options are (1) learn some of the theory of the bootstrap from the references in the "Further Reading" section below, or (2) set up a simulation experiment like this one.

6.8 Which Bootstrap When?

This chapter has introduced a bunch of different bootstraps, and before it closes it's worth reviewing the general principles, and some of the considerations which go into choosing among them in a particular problem.

When we bootstrap, we try to approximate the sampling distribution of some statistic (mean, median, correlation coefficient, regression coefficients, smoothing curve, difference in MSEs. . . ) by running simulations, and calculating the statistic on the simulation. We've seen three major ways of doing this:

• The model-based bootstrap: we estimate the model, and then simulate from the estimated model;
• Resampling residuals: we estimate the model, and then simulate by resampling the residuals of that estimate and adding them back to the fitted values;
• Resampling cases or whole data points: we ignore the estimated model completely in our simulation, and just re-sample whole rows from the data frame.

Which kind of bootstrap is appropriate depends on how much trust we have in our model.

The model-based bootstrap trusts the model to be completely correct for some parameter value. In, e.g., regression, it trusts that we have the right shape for the regression function and that we have the right distribution for the noise. When we trust our model this much, we could in principle work out sampling distributions analytically; the model-based bootstrap replaces hard math with simulation.

Resampling residuals doesn't trust the model as much. In regression problems, it assumes that the model gets the shape of the regression function right, and that the noise around the regression function is independent of the predictor

variables, but doesn't make any further assumption about how the fluctuations are distributed.[20] It is therefore more secure than the model-based bootstrap.

Finally, resampling cases assumes nothing at all about either the shape of the regression function or the distribution of the noise; it just assumes that each data point (row in the data frame) is an independent observation. Because it assumes so little, and doesn't depend on any particular model being correct, it is very safe.

The reason we do not always use the safest bootstrap, which is resampling cases, is that there is, as usual, a bias-variance trade-off. Generally speaking, if we compare three sets of bootstrap confidence intervals on the same data for the same statistic, the model-based bootstrap will give the narrowest intervals, followed by resampling residuals, and resampling cases will give the loosest bounds. If the model really is correct about the shape of the curve, we can get more precise results, without any loss of accuracy, by resampling residuals rather than resampling cases. If the model is also correct about the distribution of noise, we can do even better with a model-based bootstrap.

To sum up: resampling cases is safer than resampling residuals, but gives wider, weaker bounds. If you have good reason to trust a model's guess at the shape of the regression function, then resampling residuals is preferable. If you don't, or it's not a regression problem so there are no residuals, then you prefer to resample cases. The model-based bootstrap works best when the over-all model is correct, and we're just uncertain about the exact parameter values we need.

6.9 Further Reading

Davison and Hinkley (1997) is both a good textbook, and the reference I consult most often. Efron and Tibshirani (1993), while also very good, is more theoretical. Canty et al. (2006) has useful advice for serious applications.

All the bootstraps discussed in this chapter presume IID observations. For bootstraps for time series, see § 25.5.

Software

For professional purposes, I strongly recommend using the R package boot (Canty and Ripley, 2013), based on Davison and Hinkley (1997). I deliberately do not use it in this chapter, or later in the book, for pedagogical reasons; I have found that forcing students to write their own bootstrapping code helps build character, or at least understanding.

The bootstrap vs. robust standard errors

For linear regression coefficients, econometricians have developed a variety of "robust" standard errors which are valid under weaker conditions than the usual assumptions.

20 You could also imagine simulations where we presume that the noise takes a very particular form (e.g., a t-distribution with 10 degrees of freedom), but are agnostic about the shape of the regression function, and learn that non-parametrically. It's harder to think of situations where this is really plausible, however, except maybe Gaussian noise arising from central-limit-theorem considerations.

Buja et al. (2014) shows their equivalence to resampling cases. (See also King and Roberts 2015.)

Historical notes

The original paper on the bootstrap, Efron (1979), is extremely clear, and for the most part presented in the simplest possible terms; it's worth reading. His later small book (Efron, 1982), while often cited, is not in my opinion so useful nowadays.[21] As the title of that last reference suggests, the bootstrap is in some ways a successor to an older method, apparently dating back to the 1940s if not before, called the "jackknife", in which each data point is successively held back and the estimate is re-calculated; the variance of these re-estimates, appropriately scaled, is then taken as the variance of estimation, and similarly for the bias.[22] The jackknife is appealing in its simplicity, but is only valid under much stronger conditions than the bootstrap.

Exercises

6.1 Show that x_0 is the mode of the Pareto distribution.

6.2 Derive the maximum likelihood estimator for the Pareto distribution (Eq. 6.15) from the density (Eq. 6.14).

6.3 Show that the MLE of the Pareto distribution is consistent.
    1. Using the law of large numbers, show that θ̂ (Eq. 6.15) converges to a limit which depends on E[log X/x_0].
    2. Find an expression for E[log X/x_0] in terms of θ, from the density (Eq. 6.14). Hint: Write E[log X/x_0] as an integral, change the variable of integration from x to z = log(x/x_0), and remember that the mean of an exponential random variable with rate λ is 1/λ.

6.4 Find confidence bands for the linear regression model of § 6.4.1 using
    1. The usual Gaussian assumptions (hint: try the interval="confidence" option to predict);
    2. Resampling of residuals; and
    3. Resampling of cases.

6.5 (Computational) Writing new functions to simulate every particular linear model is somewhat tedious.
    1. Write a function which takes, as inputs, an lm model and a data frame, and returns a new data frame where the response variable is replaced by the model's predictions plus Gaussian noise, but all other columns are left alone.
    2. Write a function which takes, as inputs, an lm model and a data frame, and returns a new data frame where the response variable is replaced by the model's predictions plus resampled residuals.

21 It seems to have done a good job of explaining things to people who were already professional statisticians in 1982.
22 A "jackknife" is a knife with a blade which folds into the handle; think of the held-back data point as the folded-away blade.

    3. Will your functions work with npreg models, as well as lm models? If not, what do you have to modify? Hint: See Code Example 3 in Chapter 3 for some R tricks to extract the name of the response variable from the estimated model.

7 Splines

7.1 Smoothing by Penalizing Curve Flexibility

Let's go back to the problem of smoothing one-dimensional data. We have data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), and we want to find a good approximation μ̂ to the true conditional expectation or regression function μ. Previously, we controlled how smooth we made μ̂ indirectly, through the bandwidth of our kernels. But why not be more direct, and control smoothness itself?

A natural way to do this is to minimize the spline objective function

    L(m, λ) ≡ (1/n) Σ_{i=1}^{n} (y_i − m(x_i))² + λ ∫ (m''(x))² dx        (7.1)

The first term here is just the mean squared error of using the curve m(x) to predict y. We know and like this; it is an old friend.

The second term, however, is something new for us. m'' is the second derivative of m with respect to x — it would be zero if m were linear, so this measures the curvature of m at x. The sign of m''(x) says whether the curvature at x is concave or convex, but we don't care about that so we square it. We then integrate this over all x to say how curved m is, on average. Finally, we multiply by λ and add that to the MSE. This is adding a penalty to the MSE criterion — given two functions with the same MSE, we prefer the one with less average curvature. We will accept changes in m that increase the MSE by 1 unit if they also reduce the average curvature by at least 1/λ.

The curve or function which solves this minimization problem,

    μ̂_λ = argmin_m L(m, λ)        (7.2)

is called a smoothing spline, or spline curve. The name "spline" comes from a simple tool used by craftsmen to draw smooth curves, which was a thin strip of a flexible material like a soft wood; you pin it in place at particular points, called knots, and let it bend between them. (When the gas company dug up my front yard and my neighbor's driveway, the contractors who put everything back used a plywood board to give a smooth, curved edge to the new driveway. That board was a spline, and the knots were pairs of metal stakes on either side of the board. Figure 7.1 shows the spline after concrete was poured on one side of it.) Bending the spline takes energy — the stiffer the material, the more energy has to go into bending it through the same shape, and so the material makes a straighter curve between given points.

Figure 7.1 A wooden spline used to create a smooth, curved border for a paved area (Shadyside, Pittsburgh, October 2014).

For smoothing splines, using a stiffer material corresponds to increasing λ.

It is possible to show (§ 7.6 below) that all solutions to Eq. 7.1, no matter what the data might be, are piecewise cubic polynomials which are continuous and have continuous first and second derivatives — i.e., not only is μ̂ continuous, so are μ̂' and μ̂''. The boundaries between the pieces sit at the original data points. By analogy with the craftsman's spline, the boundary points are called the knots of the smoothing spline. The function is continuous beyond the largest and smallest data points, but it is always linear in those regions.[1]

I will also assert, without proof, that, with enough pieces, such piecewise cubic polynomials can approximate any well-behaved function arbitrarily closely. Finally, smoothing splines are linear smoothers, in the sense of Chapter 1: predicted values are linear combinations of the training-set response values y_i — see Eq. 7.21 below.

7.1.1 The Meaning of the Splines

Look back to the optimization problem. As λ → ∞, any curvature at all becomes infinitely costly, and only linear functions are allowed. But we know how to minimize mean squared error with linear functions: that's OLS. So we understand that limit.

On the other hand, as λ → 0, we decide that we don't care about curvature. In that case, we can always come up with a function which just interpolates between the data points, an interpolation spline passing exactly through each point. More specifically, of the infinitely many functions which interpolate between those points, we pick the one with the minimum average curvature.

At intermediate values of λ, μ̂_λ becomes a function which compromises between having low curvature, and bending to approach all the data points closely (on average).

1 Can you explain why it is linear outside the data range, in terms of the optimization problem?
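To make the objective in Eq. 7.1 concrete, here is a small sketch (mine, not the text's) that evaluates L(m, λ) for any candidate curve, approximating the roughness integral by finite differences on a grid; the candidate function m, the data vectors x and y, and the grid size are all hypothetical choices.

# Sketch: numerically evaluate the spline objective of Eq. 7.1 for a given
# (vectorized, twice-differentiable) function m, data x and y, and penalty lambda.
spline.objective <- function(m, x, y, lambda, from = min(x), to = max(x),
                             n.grid = 1000) {
    mse <- mean((y - m(x))^2)
    grid <- seq(from = from, to = to, length.out = n.grid)
    h <- grid[2] - grid[1]
    second.deriv <- diff(m(grid), differences = 2)/h^2   # finite-difference m''
    penalty <- sum(second.deriv^2) * h                   # approximate integral
    return(mse + lambda * penalty)
}

For instance, an interpolating curve such as m <- splinefun(x, y) makes the MSE term zero at the data points but typically pays a large curvature penalty, while a straight line pays no penalty at all; intermediate values of λ trade these off, which is what Eq. 7.2 formalizes.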

The larger we make λ, the more curvature is penalized. There is a bias-variance trade-off here. As λ grows, the spline becomes less sensitive to the data, with lower variance to its predictions but more bias. As λ shrinks, so does bias, but variance grows. For consistency, we want to let λ → 0 as n → ∞, just as, with kernel smoothing, we let the bandwidth h → 0 while n → ∞.

We can also think of the smoothing spline as the function which minimizes the mean squared error, subject to a constraint on the average curvature. This turns on a general correspondence between penalized optimization and optimization under constraints, which is explored in Appendix H.3. The short version is that each level of λ corresponds to imposing a cap on how much curvature the function is allowed to have, on average, and the spline we fit with that λ is the MSE-minimizing curve subject to that constraint.[2] As we get more data, we have more information about the true regression function and can relax the constraint (let λ shrink) without losing reliable estimation.

It will not surprise you to learn that we select λ by cross-validation. Ordinary k-fold CV is entirely possible, but leave-one-out CV works quite well for splines. In fact, the default in most spline software is either leave-one-out CV, or the even faster approximation called "generalized cross-validation" or GCV (see § 3.4.3).

7.2 Computational Example: Splines for Stock Returns

The default R function for fitting a smoothing spline is smooth.spline:

smooth.spline(x, y, cv = FALSE)

where x should be a vector of values for the input variable, y is a vector of values for the response (in the same order), and the switch cv controls whether to pick λ by generalized cross-validation (the default) or by leave-one-out cross-validation. The object which smooth.spline returns has an $x component, re-arranged in increasing order, a $y component of fitted values, a $yin component of original values, etc. See help(smooth.spline) for more.

As a concrete illustration, Figure 7.2 looks at the daily logarithmic returns[3] of the S&P 500 stock index, on 5542 consecutive trading days, from 9 February 1993 to 9 February 2015.[4]

2 The slightly longer version: Consider minimizing the MSE (not the penalized MSE), but only over functions m where ∫ (m''(x))² dx is at most some maximum level C. λ would then be the Lagrange multiplier enforcing the constraint. The constrained but unpenalized optimization is equivalent to the penalized but unconstrained one. In economics, λ would be called the "shadow price" of average curvature in units of MSE, the rate at which we'd be willing to pay to have the constraint level C marginally increased.
3 For a financial asset whose price on day t is p_t, and which pays a dividend on that day of d_t, the log-returns on day t are log((p_t + d_t)/p_{t−1}). Financiers and other professional gamblers care more about the log returns than about the price change, p_t − p_{t−1}, because the log returns give the rate of profit (or loss) on investment. We are using a price series which is adjusted to incorporate dividend (and related) payments.
4 This uses the handy pdfetch library, which downloads data from such public domain sources as the Federal Reserve, Yahoo Finance, etc.

require(pdfetch)
## Loading required package: pdfetch
sp <- pdfetch_YAHOO("SPY", fields = "adjclose", from = as.Date("1993-02-09"),
    to = as.Date("2015-02-09"))
sp <- diff(log(sp))
sp <- sp[-1]

We want to use the log-returns on one day to predict what they will be on the next. The horizontal axis in the figure shows the log-returns for each of 2527 days t, and the vertical axis shows the corresponding log-return for the succeeding day t + 1. A linear model fitted to this data displays a slope of −0.0642 (grey line in the figure). Fitting a smoothing spline with cross-validation selects λ = 0.0127, the black curve:

sp.today <- head(sp, -1)
sp.tomorrow <- tail(sp, -1)
coefficients(lm(sp.tomorrow ~ sp.today))
##  (Intercept)     sp.today
## 0.0003716837 -0.0640901257
sp.spline <- smooth.spline(x = sp.today, y = sp.tomorrow, cv = TRUE)
sp.spline
## Call:
## smooth.spline(x = sp.today, y = sp.tomorrow, cv = TRUE)
##
## Smoothing Parameter spar= 1.346847 lambda= 0.01299752 (11 iterations)
## Equivalent Degrees of Freedom (Df): 5.855613
## Penalized Criterion (RSS): 0.7825304
## PRESS(l.o.o. CV): 0.0001428132
sp.spline$lambda
## [1] 0.01299752

(PRESS is the "prediction sum of squares", i.e., the sum of the squared leave-one-out prediction errors.) This is the curve shown in black in the figure. The blue curves are for large values of λ, and clearly approach the linear regression; the red curves are for smaller values of λ.

The spline can also be used for prediction. For instance, if we want to know what the return to expect following a day when the log return was +0.01, we do

predict(sp.spline, x = 0.01)
## $x
## [1] 0.01
##
## $y
## [1] 0.0001948564

R Syntax Note: The syntax for predict with smooth.spline differs slightly from the syntax for predict with lm or np. The latter two want a newdata argument, which should be a data-frame with column names matching those in the formula used to fit the model. The predict function for smooth.spline, though, just wants a vector called x. Also, while predict for lm or np returns a vector of predictions, predict for smooth.spline returns a list with an x component (in increasing order) and a y component, which is the sort of thing that can be put directly into points or lines for plotting.

[Figure 7.2 appears here: a scatterplot of today's log-return (horizontal axis) against tomorrow's log-return (vertical axis), with the fitted lines described in the caption below.]

plot(as.vector(sp.today), as.vector(sp.tomorrow), xlab = "Today's log-return",
    ylab = "Tomorrow's log-return", pch = 16, cex = 0.5, col = "grey")
abline(lm(sp.tomorrow ~ sp.today), col = "darkgrey")
sp.spline <- smooth.spline(x = sp.today, y = sp.tomorrow, cv = TRUE)
lines(sp.spline)
lines(smooth.spline(sp.today, sp.tomorrow, spar = 1.5), col = "blue")
lines(smooth.spline(sp.today, sp.tomorrow, spar = 2), col = "blue", lty = 2)
lines(smooth.spline(sp.today, sp.tomorrow, spar = 1.1), col = "red")
lines(smooth.spline(sp.today, sp.tomorrow, spar = 0.5), col = "red", lty = 2)

Figure 7.2 The S&P 500 log-returns data (grey dots), with the OLS linear regression (dark grey line), the spline selected by cross-validation (solid black, λ = 0.0127), some more-smoothed splines (blue, λ = 0.178 and 727), and some less-smooth splines (red, λ = 2.88 × 10^{-4} and 1.06 × 10^{-8}).

Inconveniently, smooth.spline does not let us control λ directly, but rather a somewhat complicated but basically exponential transformation of it called spar. (See help(smooth.spline) for the gory details.) The equivalent λ can be extracted from the return value, e.g., smooth.spline(sp.today, sp.tomorrow, spar = 2)$lambda.

7.2.1 Confidence Bands for Splines

Continuing the example, the smoothing spline selected by cross-validation has a negative slope everywhere, like the regression line, but it's asymmetric — the slope is more negative to the left, and then levels off towards the regression line. (See Figure 7.2 again.) Is this real, or might the asymmetry be a sampling artifact?

We'll investigate by finding confidence bands for the spline, much as we did for kernel regression in Chapter 6 and Problem Set A.27, problem 5. Again, we need to bootstrap, and we can do it either by resampling the residuals or resampling whole data points. Let's take the latter approach, which assumes less about the data. We'll need a simulator:

sp.frame <- data.frame(today = sp.today, tomorrow = sp.tomorrow)
sp.resampler <- function() {
    n <- nrow(sp.frame)
    resample.rows <- sample(1:n, size = n, replace = TRUE)
    return(sp.frame[resample.rows, ])
}

This treats the points in the scatterplot as a complete population, and then draws a sample from them, with replacement, just as large as the original.[5] We'll also need an estimator. What we want to do is get a whole bunch of spline curves, one on each simulated data set. But since the values of the input variable will change from one simulation to another, to make everything comparable we'll evaluate each spline function on a fixed grid of points, that runs along the range of the data.

grid.300 <- seq(from = min(sp.today), to = max(sp.today), length.out = 300)
sp.spline.estimator <- function(data, eval.grid = grid.300) {
    fit <- smooth.spline(x = data[, 1], y = data[, 2], cv = TRUE)
    return(predict(fit, x = eval.grid)$y)
}

This sets the number of evaluation points to 300, which is large enough to give visually smooth curves, but not so large as to be computationally unwieldy. Now put these together to get confidence bands:

sp.spline.cis <- function(B, alpha, eval.grid = grid.300) {
    spline.main <- sp.spline.estimator(sp.frame, eval.grid = eval.grid)
    spline.boots <- replicate(B, sp.spline.estimator(sp.resampler(), eval.grid = eval.grid))
    cis.lower <- 2 * spline.main - apply(spline.boots, 1, quantile, probs = 1 - alpha/2)
    cis.upper <- 2 * spline.main - apply(spline.boots, 1, quantile, probs = alpha/2)
    return(list(main.curve = spline.main, lower.ci = cis.lower, upper.ci = cis.upper,
        x = eval.grid))
}

5 § 25.5 covers more refined ideas about bootstrapping time series.

The return value here is a list which includes the original fitted curve, the lower and upper confidence limits, and the points at which all the functions were evaluated.

Figure 7.3 shows the resulting 95% confidence limits, based on B = 1000 bootstrap replications. (Doing all the bootstrapping took 45 seconds on my laptop.) These are pretty clearly asymmetric in the same way as the curve fit to the whole data, but notice how wide they are, and how they get wider the further we go from the center of the distribution in either direction.

[Figure 7.3 appears here: the same scatterplot of today's versus tomorrow's log-returns, with the spline fit and its bootstrapped confidence band.]

sp.cis <- sp.spline.cis(B = 1000, alpha = 0.05)
plot(as.vector(sp.today), as.vector(sp.tomorrow), xlab = "Today's log-return",
    ylab = "Tomorrow's log-return", pch = 16, cex = 0.5, col = "grey")
abline(lm(sp.tomorrow ~ sp.today), col = "darkgrey")
lines(x = sp.cis$x, y = sp.cis$main.curve, lwd = 2)
lines(x = sp.cis$x, y = sp.cis$lower.ci)
lines(x = sp.cis$x, y = sp.cis$upper.ci)

Figure 7.3 Bootstrapped pointwise confidence band for the smoothing spline of the S&P 500 data, as in Figure 7.2. The 95% confidence limits around the main spline estimate are based on 1000 bootstrap re-samplings of the data points in the scatterplot.

7.3 Basis Functions and Degrees of Freedom

7.3.1 Basis Functions

Splines, I said, are piecewise cubic polynomials. To see how to fit them, let's think about how to fit a global cubic polynomial. We would define four basis functions^6,

    B_1(x) = 1    (7.3)
    B_2(x) = x    (7.4)
    B_3(x) = x^2    (7.5)
    B_4(x) = x^3    (7.6)

and choose to only consider regression functions that are linear combinations of the basis functions,

    μ(x) = Σ_{j=1}^{4} β_j B_j(x)    (7.7)

Such regression functions would be linear in the transformed variables B_1(x), ..., B_4(x), even though nonlinear in x. To estimate the coefficients of the cubic polynomial, we would apply each basis function to each data point x_i and gather the results in an n × 4 matrix B,

    B_{ij} = B_j(x_i)    (7.8)

Then we would do OLS using the B matrix in place of the usual data matrix x:

    β̂ = (B^T B)^{-1} B^T y    (7.9)

6. See App. B.11 for brief reminders about basis functions.
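To make Eqs. 7.3-7.9 concrete, here is a small R check of my own (not code from the text): build the basis matrix by hand, apply the OLS formula, and compare with lm run on the same transformed variables. The variable names are arbitrary.

# Global cubic polynomial via basis functions and the OLS formula (7.9)
x <- runif(100, -2, 2)
y <- x^3 - x + rnorm(100, sd = 0.5)
B <- cbind(1, x, x^2, x^3)                  # B[i, j] = B_j(x_i), as in Eq. 7.8
beta.hat <- solve(t(B) %*% B, t(B) %*% y)   # Eq. 7.9: (B^T B)^{-1} B^T y
cbind(beta.hat, coef(lm(y ~ x + I(x^2) + I(x^3))))   # the two columns agree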

Since splines are piecewise cubics, things proceed similarly, but we need to be a little more careful in defining the basis functions. Recall that we have n values of the input variable x, x_1, x_2, ..., x_n. For the rest of this section, I will assume that these are in increasing order, because it simplifies the notation. These n "knots" define n + 1 pieces or segments: n − 1 of them between the knots, one from −∞ to x_1, and one from x_n to +∞. A third-order polynomial on each segment would seem to need a constant, linear, quadratic and cubic term per segment. So the segment running from x_i to x_{i+1} would need the basis functions

    1_{(x_i,x_{i+1})}(x),  (x − x_i) 1_{(x_i,x_{i+1})}(x),  (x − x_i)^2 1_{(x_i,x_{i+1})}(x),  (x − x_i)^3 1_{(x_i,x_{i+1})}(x)    (7.10)

where as usual the indicator function 1_{(x_i,x_{i+1})}(x) is 1 if x ∈ (x_i, x_{i+1}) and 0 otherwise. This makes it seem like we need 4(n + 1) = 4n + 4 basis functions. However, we know from linear algebra that the number of basis vectors we need is equal to the number of dimensions of the vector space. The number of adjustable coefficients for an arbitrary piecewise cubic with n + 1 segments is indeed 4n + 4, but splines are constrained to be smooth. The spline must be continuous, which means that at each x_i, the value of the cubic from the left, defined on (x_{i−1}, x_i), must match the value of the cubic from the right, defined on (x_i, x_{i+1}). This gives us one constraint per data point, reducing the number of adjustable coefficients to at most 3n + 4. Since the first and second derivatives are also continuous, we are down to just n + 4 coefficients. Finally, we know that the spline function is linear outside the range of the data, i.e., on (−∞, x_1) and on (x_n, ∞), lowering the number of coefficients to n. There are no more constraints, so we end up needing only n basis functions. And in fact, from linear algebra, any set of n piecewise cubic functions which are linearly independent^7 can be used as a basis. One common choice is

    B_1(x) = 1    (7.11)
    B_2(x) = x    (7.12)
    B_{i+2}(x) = [(x − x_i)_+^3 − (x − x_n)_+^3]/(x_n − x_i) − [(x − x_{n−1})_+^3 − (x − x_n)_+^3]/(x_n − x_{n−1})    (7.13)

where (a)_+ = a if a > 0, and = 0 otherwise. This rather unintuitive-looking basis has the nice property that the second and third derivatives of each B_j are zero outside the interval (x_1, x_n).

Now that we have our basis functions, we can once again write the spline as a weighted sum of them,

    m(x) = Σ_{j=1}^{m} β_j B_j(x)    (7.14)

and put together the matrix B where B_{ij} = B_j(x_i). We can write the spline objective function in terms of the basis functions,

    L(β) = (y − Bβ)^T (y − Bβ) + nλ β^T Ω β    (7.15)

where the matrix Ω encodes information about the curvature of the basis functions:

    Ω_{jk} = ∫ B_j''(x) B_k''(x) dx    (7.16)

Notice that only the quadratic and cubic basis functions will make non-zero contributions to Ω. With the choice of basis above, the second derivatives are non-zero on, at most, the interval (x_1, x_n), so each of the integrals in Ω is going to be finite. This is something we (or, realistically, R) can calculate once, no matter what λ is. Now we can find the smoothing spline by differentiating with respect to β:

    0 = −2 B^T y + 2 B^T B β̂ + 2 nλ Ω β̂    (7.17)
    B^T y = (B^T B + nλ Ω) β̂    (7.18)
    β̂ = (B^T B + nλ Ω)^{-1} B^T y    (7.19)

7. Recall that vectors v⃗_1, v⃗_2, ..., v⃗_d are linearly independent when there is no way to write any one of the vectors as a weighted sum of the others. The same definition applies to functions.
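In practice, R's smooth.spline handles all of this internally, so there is no need to build the basis by hand; still, as a rough sanity check of Eqs. 7.13-7.19, here is a numerical sketch of my own (not from the text) which constructs B and Ω for a small simulated data set and then applies Eq. 7.19. The helper names (d, basis, basis.dd) are ad hoc, and the value of λ is arbitrary.

# Rough numerical illustration of Eqs. 7.13-7.19 on simulated data
set.seed(7)
n <- 20
x <- sort(runif(n)); y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
d <- function(t, i) {   # the ratio appearing twice in Eq. 7.13
    (pmax(t - x[i], 0)^3 - pmax(t - x[n], 0)^3)/(x[n] - x[i])
}
basis <- function(t) {  # evaluate all n basis functions at the points t
    cbind(1, t, sapply(1:(n - 2), function(i) d(t, i) - d(t, n - 1)))
}
basis.dd <- function(t, j) {  # second derivative of B_j at t
    if (j <= 2) return(rep(0, length(t)))  # constant and linear terms
    ddr <- function(t, i) (6 * pmax(t - x[i], 0) - 6 * pmax(t - x[n], 0))/(x[n] - x[i])
    ddr(t, j - 2) - ddr(t, n - 1)
}
B <- basis(x)
Omega <- matrix(0, n, n)  # Eq. 7.16, by numerical integration
for (j in 1:n) {
    for (k in 1:n) {
        Omega[j, k] <- integrate(function(t) basis.dd(t, j) * basis.dd(t, k),
                                 lower = x[1], upper = x[n])$value
    }
}
lambda <- 0.001
beta.hat <- solve(t(B) %*% B + n * lambda * Omega, t(B) %*% y)  # Eq. 7.19
plot(x, y); lines(x, B %*% beta.hat)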

Notice, incidentally, that we can now show splines are linear smoothers:

    μ̂(x) = B β̂    (7.20)
         = B (B^T B + nλ Ω)^{-1} B^T y    (7.21)

Once again, if this were ordinary linear regression, the OLS estimate of the coefficients would be (x^T x)^{-1} x^T y. In comparison to that, we've made two changes. First, we've substituted the basis function matrix B for the original matrix of independent variables, x — a change we'd have made already for a polynomial regression. Second, the "denominator" is not x^T x, but B^T B + nλ Ω. Since x^T x is n times the covariance matrix of the independent variables, we are taking the covariance matrix of the spline basis functions and adding some extra covariance — how much depends on the shapes of the functions (through Ω) and how much smoothing we want to do (through λ). The larger we make λ, the less the actual data matters to the fit.

In addition to explaining how splines can be fit quickly (do some matrix arithmetic), this illustrates two important tricks. One, which we won't explore further here, is to turn a nonlinear regression problem into one which is linear in another set of basis functions. This is like using not just one transformation of the input variables, but a whole library of them, and letting the data decide which transformations are important. There remains the issue of selecting the basis functions, which can be quite tricky. In addition to the spline basis^8, the most common choices are various sorts of waves — sine and cosine waves of different frequencies, various wave-forms of limited spatial extent ("wavelets"), etc. The ideal is to choose a function basis where only a few non-zero coefficients would need to be estimated, but this requires some understanding of the data...

The other trick is that of stabilizing an unstable estimation problem by adding a penalty term. This reduces variance at the cost of introducing some bias. Exercise 7.2 explores this idea.

Effective degrees of freedom

In §1.5.3.2, we defined the number of effective degrees of freedom for a linear smoother with smoothing matrix w as just tr w. Thus, Eq. 7.21 lets us calculate the effective degrees of freedom of a spline, as tr (B (B^T B + nλ Ω)^{-1} B^T). You should be able to convince yourself from this that increasing λ will, all else being equal, reduce the effective degrees of freedom of the fit.

8. Or, really, bases; there are multiple sets of basis functions for the splines, just like there are multiple sets of basis vectors for the plane. Phrases like "B splines" and "P splines" refer to particular choices of spline basis functions.
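As a concrete check of the trace formula above, and of the claim that a smoothing spline is a linear smoother, here is a little illustration of my own (not from the text): hold the amount of smoothing fixed through smooth.spline's df argument, recover the smoother matrix column by column by smoothing indicator vectors, and confirm that its trace comes back close to the requested degrees of freedom.

# Recover the smoother matrix of a smoothing spline and check its trace
n <- 100
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
w <- matrix(0, n, n)
for (i in 1:n) {
    e_i <- numeric(n); e_i[i] <- 1
    # With df held fixed, the spline is the same linear operator for any
    # response, so smoothing the i-th indicator gives the i-th column of w
    w[, i] <- predict(smooth.spline(x, e_i, df = 6), x)$y
}
sum(diag(w))   # should be close to 6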

7.4 Splines in Multiple Dimensions

Suppose we have two input variables, x and z, and a single response y. How could we do a spline fit?

One approach is to generalize the spline optimization problem so that we penalize the curvature of the spline surface (no longer a curve). The appropriate penalized least-squares objective function to minimize is

    L(m, λ) = Σ_{i=1}^{n} (y_i − m(x_i, z_i))^2 + λ ∫ [ (∂^2 m/∂x^2)^2 + 2 (∂^2 m/∂x∂z)^2 + (∂^2 m/∂z^2)^2 ] dx dz    (7.22)

The solution is called a thin-plate spline. This is appropriate when the two input variables x and z should be treated more or less symmetrically^9.

An alternative is to use the spline basis functions from section 7.3. We write

    m(x, z) = Σ_{j=1}^{M_1} Σ_{k=1}^{M_2} β_{jk} B_j(x) B_k(z)    (7.23)

Doing all possible multiplications of one set of numbers or functions with another is said to give their outer product or tensor product, so this is known as a tensor product spline or tensor spline. We have to choose the number of terms to include for each variable (M_1 and M_2), since using n for each would give n^2 basis functions, and fitting n^2 coefficients to n data points is asking for trouble.

7.5 Smoothing Splines versus Kernel Regression

For one input variable and one output variable, smoothing splines can basically do everything which kernel regression can do^10. The advantages of splines are their computational speed and (once we've calculated the basis functions) simplicity, as well as the clarity of controlling curvature directly. Kernels however are easier to program (if slower to run), easier to analyze mathematically^11, and extend more straightforwardly to multiple variables, and to combinations of discrete and continuous variables.

9. Generalizations to more than two input variables are conceptually straightforward — just keep adding up more partial derivatives — but the book-keeping gets annoying.
10. In fact, there is a technical sense in which, for large n, splines act like a kernel regression with a specific non-Gaussian kernel, and a bandwidth which varies over the data, being smaller in high-density regions. See Simonoff (1996, §5.6.2), or, for more details, Silverman (1984).
11. Most of the bias-variance analysis for kernel regression can be done with basic calculus, as we did in Chapter 4. The corresponding analysis for splines requires working in infinite-dimensional function spaces called "Hilbert spaces". It's a pretty theory, if you like that sort of thing.
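Both of these multi-dimensional options are available in the mgcv package, which we will use in Chapter 8; as a hedged illustration of my own (not code from the text), on simulated data s(x, z) fits a thin-plate spline surface and te(x, z) a tensor-product spline.

# Thin-plate versus tensor-product spline surfaces in mgcv (illustration only)
library(mgcv)
n <- 500
x <- runif(n); z <- runif(n)
y <- sin(2 * pi * x) * cos(2 * pi * z) + rnorm(n, sd = 0.1)
fit.tp <- gam(y ~ s(x, z))    # thin-plate spline; x and z on comparable scales
fit.te <- gam(y ~ te(x, z))   # tensor product; fine even if the scales differ
plot(fit.te, scheme = 2)      # contour/heat-map view of the fitted surface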

7.6 Some of the Math Behind Splines

Above, I claimed that a solution to the optimization problem Eq. 7.1 exists, and is a continuous, piecewise-cubic polynomial, with continuous first and second derivatives, with pieces at the x_i, and linear outside the range of the x_i. I do not know of any truly elementary way of showing this, but I will sketch here how it's established, if you're interested.

Eq. 7.1 asks us to find the function which minimizes the sum of the MSE and a certain integral. Even the MSE can be brought inside the integral, using Dirac delta functions:

    L = ∫ [ (1/n) Σ_{i=1}^{n} (y_i − m(x))^2 δ(x − x_i) + λ (m''(x))^2 ] dx    (7.24)

In what follows, without loss of generality, assume that the x_i are ordered, so x_1 ≤ x_2 ≤ ... ≤ x_i ≤ x_{i+1} ≤ ... ≤ x_n. With some loss of generality but a great gain in simplicity, assume none of the x_i are equal, so we can make those inequalities strict.

The subject which deals with maximizing or minimizing integrals of functions is the calculus of variations^12, and one of its basic tricks is to write the integrand as a function of x, the function, and its derivatives:

    L = ∫ ℒ(x, m, m', m'') dx    (7.25)

where, in our case,

    ℒ = (1/n) Σ_{i=1}^{n} (y_i − m(x))^2 δ(x − x_i) + λ (m''(x))^2    (7.26)

This sets us up to use a general theorem of the calculus of variations, to the effect that any function m̂ which minimizes L must also solve ℒ's Euler-Lagrange equation:

    [ ∂ℒ/∂m − d/dx ∂ℒ/∂m' + d^2/dx^2 ∂ℒ/∂m'' ]_{m = m̂} = 0    (7.27)

In our case, the Euler-Lagrange equation reads

    −(2/n) Σ_{i=1}^{n} (y_i − m̂(x)) δ(x − x_i) + 2λ d^2/dx^2 m̂''(x) = 0    (7.28)

Remembering that m̂''(x) = d^2 m̂/dx^2,

    d^4/dx^4 m̂(x) = (1/nλ) Σ_{i=1}^{n} (y_i − m̂(x)) δ(x − x_i)    (7.29)

The right-hand side is zero at any point x other than one of the x_i, so the fourth derivative has to be zero in between the x_i. This in turn means that the function must be piecewise cubic. Now fix an x_i, and pick any two points which bracket it, but are both greater than x_{i−1} and less than x_{i+1}; call them u and l.

12. In addition to its uses in statistics, the calculus of variations also shows up in physics ("what is the path of least action?"), control theory ("what is the cheapest route to the objective?") and stochastic processes ("what is the most probable trajectory?"). Gershenfeld (1999, ch. 4) is a good starting point.

Integrate our Euler-Lagrange equation from l to u:

    ∫_l^u d^4/dx^4 m̂(x) dx = (1/nλ) ∫_l^u Σ_{i=1}^{n} (y_i − m̂(x)) δ(x − x_i) dx    (7.30)

    m̂'''(u) − m̂'''(l) = (y_i − m̂(x_i))/(nλ)    (7.31)

That is, the third derivative makes a jump when we move across x_i, though (since the fourth derivative is zero), it doesn't matter which pair of points above and below x_i we compare third derivatives at. Integrating the equation again,

    m̂''(u) − m̂''(l) = (y_i − m̂(x_i))(u − l)/(nλ)    (7.32)

Letting u and l approach x_i from either side, so u − l → 0, we see that m̂'' makes no jump at x_i. Repeating this trick twice more, we conclude the same about m̂' and m̂ itself. In other words, m̂ must be continuous, with continuous first and second derivatives, and a third derivative that is constant on each interval (x_i, x_{i+1}). Since the fourth derivative is zero on those intervals (and undefined at the x_i), the function must be a piecewise cubic, with the piece boundaries at the x_i, and continuity (up to the second derivative) across pieces.

To see that the optimal function must be linear below x_1 and above x_n, suppose that it wasn't. Clearly, though, we could reduce the curvature as much as we want in those regions, without altering the value of the function at the boundary, or even its first derivative there. This would yield a better function, i.e., one with a lower value of L, since the MSE would be unchanged and the average curvature would be smaller. Taking this to the limit, then, the function must be linear outside the observed data range.

We have now shown^13 that the optimal function m̂, if it exists, must have all the properties I claimed for it. We have not shown either that there is a solution, or that a solution is unique if it does exist. However, we can use the fact that any solutions, if there are any, are piecewise cubics obeying continuity conditions to set up a system of equations to find their coefficients. In fact, we did so already in §7.3.1, where we saw it's a system of n independent linear equations in n unknowns. Such a thing does indeed have a unique solution, here Eq. 7.19.

7.7 Further Reading

There are good discussions of splines in Simonoff (1996, ch. 5), Hastie et al. (2009, ch. 5) and Wasserman (2006, §5.5). Wood (2006, ch. 4) includes a thorough practical treatment of splines as a preparation for additive models (see Chapter 8 below) and generalized additive models (see Chapters 11–12). The classic reference, by one of the inventors of splines as a useful statistical tool, is Wahba (1990); it's great if you already know what a Hilbert space is and how to navigate one.

13. For a very weak value of "shown", admittedly.

Historical notes

The first introduction of spline smoothing in the statistical literature seems to be Whittaker (1922). (His "graduation" is more or less our "smoothing".) He begins with an "inverse probability" (we would now say "Bayesian") argument for minimizing Eq. 7.1 to find the most probable curve, based on the a priori hypothesis of smooth Gaussian curves observed through Gaussian error, and gives tricks for fitting splines more easily with the mathematical technology available in 1922.

The general optimization problem, and the use of the word "spline", seems to have its roots in numerical analysis in the early 1960s; those spline functions were intended as ways of smoothly interpolating between given points. The connection to statistical smoothing was made by Schoenberg (1964) (who knew about Whittaker's earlier work) and by Reinsch (1967) (who gave code). Splines were then developed as a practical tool in statistics and in applied mathematics in the 1960s and 1970s. Silverman (1985) is a still-readable and insightful summary of this work.

In econometrics, spline smoothing a time series is called the "Hodrick-Prescott filter", after two economists who re-discovered the technique in 1981, along with a fallacious argument that λ should always take a particular value (1600, as it happens), regardless of the data. See Paige and Trindade (2010) for a (polite) discussion, and demonstration of the advantages of cross-validation.

Exercises

7.1 The smooth.spline function lets you set the effective degrees of freedom explicitly. Write a function which chooses the number of degrees of freedom by five-fold cross-validation.

7.2 When we can't measure our predictor variables perfectly, it seems like a good idea to try to include multiple measurements for each one of them. For instance, if we were trying to predict grades in college from grades in high school, we might include the student's grade from each year separately, rather than simply averaging them. Multiple measurements of the same variable will however tend to be strongly correlated, so this means that a linear regression will be multi-collinear. This in turn means that it will tend to have multiple, nearly mutually-canceling large coefficients. This makes it hard to interpret the regression and hard to treat the predictions seriously. (See §2.1.1.)

One strategy for coping with this situation is to carefully select the variables one uses in the regression. Another, however, is to add a penalty for large coefficient values. For historical reasons, this second strategy is called ridge regression, or Tikhonov regularization. Specifically, while the OLS estimate is

    β̂_OLS = argmin_β (1/n) Σ_{i=1}^{n} (y_i − x_i · β)^2    (7.33)

the regularized or penalized estimate is

    β̂_RR = argmin_β [ (1/n) Σ_{i=1}^{n} (y_i − x_i · β)^2 + λ Σ_{j=1}^{p} β_j^2 ]    (7.34)

1. Show that the matrix form of the ridge-regression objective function is

    n^{-1} (y − xβ)^T (y − xβ) + λ β^T β    (7.35)

2. Show that the optimum is

    β̂_RR = (x^T x + nλ I)^{-1} x^T y    (7.36)

(This is where the name "ridge regression" comes from: we take x^T x and add a "ridge" along the diagonal of the matrix.)

3. What happens as λ → 0? As λ → ∞? (For the latter, it may help to think about the case of a one-dimensional X first.)

4. Let Y = Z + ε, with Z ∼ U(−1, 1) and ε ∼ N(0, 0.05). Generate 2000 draws from Z and Y. Now let X_i = 0.9 Z + η, with η ∼ N(0, 0.05), for i ∈ 1:50. Generate corresponding X_i values. Using the first 1000 rows of the data only, do ridge regression of Y on the X_i (not on Z), plotting the 50 coefficients as functions of λ. Explain why ridge regression is called a shrinkage estimator.

5. Use cross-validation with the first 1000 rows to pick the optimal value of λ. Compare the out-of-sample performance you get with this penalty to the out-of-sample performance of OLS.

For more on ridge regression, see Appendix H.3.5.

8 Additive Models

8.1 Additive Models

[[TODO: Reorganize: bring curse of dimensionality up, then additive models as compromise, so same order as lectures?]]

The additive model for regression is that the conditional expectation function is a sum of partial response functions, one for each predictor variable. Formally, when the vector X⃗ of predictor variables has p dimensions, x_1, ..., x_p, the model says that

    E[Y | X⃗ = x⃗] = α + Σ_{j=1}^{p} f_j(x_j)    (8.1)

This includes the linear model as a special case, where f_j(x_j) = β_j x_j, but it's clearly more general, because the f_j's can be arbitrary nonlinear functions. The idea is still that each input feature makes a separate contribution to the response, and these just add up (hence "partial response function"), but these contributions don't have to be strictly proportional to the inputs. We do need to add a restriction to make it identifiable; without loss of generality, say that E[Y] = α and E[f_j(X_j)] = 0.^1

Additive models keep a lot of the nice properties of linear models, but are more flexible. One of the nice things about linear models is that they are fairly straightforward to interpret: if you want to know how the prediction changes as you change x_j, you just need to know β_j. The partial response function f_j plays the same role in an additive model: of course the change in prediction from changing x_j will generally depend on the level x_j had before perturbation, but since that's also true of reality that's really a feature rather than a bug. It's true that a set of plots for f_j's takes more room than a table of β_j's, but it's also nicer to look at, conveys more information, and imposes fewer systematic distortions on the data.

Of course, none of this would be of any use if we couldn't actually estimate these models, but we can, through a clever computational trick which is worth knowing for its own sake. The use of the trick is also something they share with linear models, so we'll start there.

1. To see why we need to do this, imagine the simple case where p = 2. If we add constants c_1 to f_1 and c_2 to f_2, but subtract c_1 + c_2 from α, then nothing observable has changed about the model. This degeneracy or lack of identifiability is a little like the way collinearity keeps us from defining true slopes in linear regression. But it's less harmful than collinearity because we can fix it with this convention.

8.2 Partial Residuals and Back-fitting

8.2.1 Back-fitting for Linear Models

The general form of a linear regression model is

    E[Y | X⃗ = x⃗] = β_0 + β⃗ · x⃗ = Σ_{j=0}^{p} β_j x_j    (8.2)

where x_0 is always the constant 1. (Adding this fictitious constant variable lets us handle the intercept just like any other regression coefficient.)

Suppose we don't condition on all of X⃗ but just one component of it, say X_k. What is the conditional expectation of Y?

    E[Y | X_k = x_k] = E[ E[Y | X_1, X_2, ..., X_k, ..., X_p] | X_k = x_k ]    (8.3)
                     = E[ Σ_{j=0}^{p} β_j X_j | X_k = x_k ]    (8.4)
                     = β_k x_k + E[ Σ_{j ≠ k} β_j X_j | X_k = x_k ]    (8.5)

where the first line uses the law of total expectation^2, and the second line uses Eq. 8.2. Turned around,

    β_k x_k = E[Y | X_k = x_k] − E[ Σ_{j ≠ k} β_j X_j | X_k = x_k ]    (8.6)
            = E[ Y − Σ_{j ≠ k} β_j X_j | X_k = x_k ]    (8.7)

The expression in the expectation is the k-th partial residual — the (total) residual is the difference between Y and its expectation, the partial residual is the difference between Y and what we expect it to be ignoring the contribution from X_k. Let's introduce a symbol for this, say Y^{(k)}.

    β_k x_k = E[ Y^{(k)} | X_k = x_k ]    (8.8)

In words, if the over-all model is linear, then the partial residuals are linear. And notice that X_k is the only input feature appearing here — if we could somehow get hold of the partial residuals, then we can find β_k by doing a simple regression, rather than a multiple regression. Of course to get the partial residual we need to know all the other β_j's...

This suggests the following estimation scheme for linear models, known as the Gauss-Seidel algorithm, or more commonly and transparently as back-fitting; the pseudo-code is in Example 17. [Margin note: "You say 'vicious circle', I say 'iterative improvement'."]

2. As you learned in baby prob., this is the fact that E[Y | X] = E[ E[Y | X, Z] | X ] — that we can always condition on more variables, provided we then average over those extra variables when we're done.

Given: n × (p + 1) inputs x (0th column all 1s)
       n × 1 responses y
       small tolerance δ > 0
center y and each column of x
β̂_j ← 0 for j ∈ 1:p
until (all |β̂_j − γ_j| ≤ δ) {
    for k ∈ 1:p {
        y_i^{(k)} = y_i − Σ_{j ≠ k} β̂_j x_ij
        γ_k ← regression coefficient of y^{(k)} on x_·k
        β̂_k ← γ_k
    }
}
β̂_0 ← n^{-1} Σ_{i=1}^{n} y_i − Σ_{j=1}^{p} β̂_j n^{-1} Σ_{i=1}^{n} x_ij
Return: (β̂_0, β̂_1, ..., β̂_p)

Code Example 17: Pseudocode for back-fitting linear models. Assume we make at least one pass through the until loop. Recall from Chapter 1 that centering the data does not change the β_j's; this way the intercept only has to be calculated once, at the end. [[ATTN: Fix horizontal lines]]

This is an iterative approximation algorithm. Initially, we look at how far each point is from the global mean, and do a simple regression of those deviations on the first input variable. This then gives us a better idea of what the regression surface really is, and we use the deviations from that surface in a simple regression on the next variable; this should catch relations between Y and X_2 that weren't already caught by regressing on X_1. We then go on to the next variable in turn. At each step, each coefficient is adjusted to fit in with what we have already guessed about the other coefficients — that's why it's called "back-fitting". It is not obvious^3 that this will ever converge, but it (generally) does, and the fixed point on which it converges is the usual least-squares estimate of β.

Back-fitting is rarely used to fit linear models these days, because with modern computers and numerical linear algebra it's faster to just calculate (x^T x)^{-1} x^T y. But the cute thing about back-fitting is that it doesn't actually rely on linearity.

3. Unless, I suppose, you're Gauss.
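For concreteness, here is a minimal R translation of Code Example 17; it is a sketch of mine, not the text's code, and the function name backfit.lm is arbitrary. Comparing its output with coef(lm(y ~ x)) on any data set is a quick check that the fixed point is the least-squares solution.

# Back-fitting a linear model (cf. Code Example 17); x is an n x p matrix
# of predictors without the constant column, y the response vector
backfit.lm <- function(x, y, tol = 1e-6, max.iter = 100) {
    x.c <- scale(x, center = TRUE, scale = FALSE)   # center the columns
    y.c <- y - mean(y)
    p <- ncol(x)
    beta <- rep(0, p)
    for (iteration in 1:max.iter) {
        beta.old <- beta
        for (k in 1:p) {
            # Partial residual: take out the current fits of the other predictors
            partial <- y.c - x.c[, -k, drop = FALSE] %*% beta[-k]
            # Simple regression through the origin (everything is centered)
            beta[k] <- sum(x.c[, k] * partial)/sum(x.c[, k]^2)
        }
        if (max(abs(beta - beta.old)) <= tol) break
    }
    beta0 <- mean(y) - sum(beta * colMeans(x))
    c(beta0, beta)
}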

Given: n × p inputs x
       n × 1 responses y
       small tolerance δ > 0
       one-dimensional smoother S
α̂ ← n^{-1} Σ_{i=1}^{n} y_i
f̂_j ← 0 for j ∈ 1:p
until (all |f̂_j − g_j| ≤ δ) {
    for k ∈ 1:p {
        y_i^{(k)} = y_i − Σ_{j ≠ k} f̂_j(x_ij)
        g_k ← S(y^{(k)} ∼ x_·k)
        g_k ← g_k − n^{-1} Σ_{i=1}^{n} g_k(x_ik)
        f̂_k ← g_k
    }
}
Return: (α̂, f̂_1, ..., f̂_p)

Code Example 18: Pseudo-code for back-fitting additive models. Notice the extra step, as compared to back-fitting linear models, which keeps each partial response function centered.

If we knew how to estimate arbitrary one-dimensional regressions, we could now use back-fitting to estimate additive models. But we have spent a lot of time learning how to use smoothers to fit one-dimensional regressions! We could use nearest neighbors, or splines, or kernels, or local-linear regression, or anything else we feel like substituting here. Our new, improved back-fitting algorithm is in Example 18. Once again, while it's not obvious that this converges, it does.

Also, the back-fitting procedure works well with some complications or refinements of the additive model. If we know the functional form of one or another of the f_j, we can fit those parametrically (rather than with the smoother) at the appropriate points in the loop. (This would be a semiparametric model.) If we think that there is an interaction between x_j and x_k, rather than their making separate additive contributions for each variable, we can smooth them together; etc.

There are actually two standard packages for fitting additive models in R: gam and mgcv. Both have commands called gam, which fit generalized additive models — the generalization is to use the additive model for things like the probabilities of categorical responses, rather than the response variable itself. If that sounds obscure right now, don't worry — we'll come back to this in Chapters 11–12 after we've looked at generalized linear models. §8.4 below illustrates using one of these packages to fit an additive model.
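Similarly, here is a minimal sketch of mine (not the text's code) of Code Example 18, using smooth.spline as the one-dimensional smoother S; backfit.am is an arbitrary name, and the gam functions in the gam and mgcv packages are what one would actually use in practice.

# Back-fitting an additive model (cf. Code Example 18) with smooth.spline
# as the smoother; x is an n x p matrix, y the response vector
backfit.am <- function(x, y, tol = 1e-6, max.iter = 20) {
    n <- nrow(x); p <- ncol(x)
    alpha <- mean(y)
    fitted <- matrix(0, nrow = n, ncol = p)   # current values of f_j(x_ij)
    for (iteration in 1:max.iter) {
        fitted.old <- fitted
        for (k in 1:p) {
            # Partial residuals: remove the other partial-response functions
            partial <- y - alpha - rowSums(fitted[, -k, drop = FALSE])
            g <- predict(smooth.spline(x[, k], partial), x[, k])$y
            fitted[, k] <- g - mean(g)        # keep each partial response centered
        }
        if (max(abs(fitted - fitted.old)) <= tol) break
    }
    list(alpha = alpha, fitted.values = alpha + rowSums(fitted), partials = fitted)
}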

8.3 The Curse of Dimensionality

Before illustrating how additive models work in practice, let's talk about why we'd want to use them. So far, we have looked at two extremes for regression models; additive models are somewhere in between.

On the one hand, we had linear regression, which is a parametric method (with p + 1 parameters). Its weakness is that the true regression function μ is hardly ever linear, so even with infinite data linear regression will always make systematic mistakes in its predictions — there's always some approximation bias, bigger or smaller depending on how non-linear μ is. The strength of linear regression is that it converges very quickly as we get more data. Generally speaking,

    MSE_linear = σ^2 + a_linear + O(n^{-1})    (8.11)

where the first term is the intrinsic noise around the true regression function, the second term is the (squared) approximation bias, and the last term is the estimation variance. Notice that the rate at which the estimation variance shrinks doesn't depend on p — factors like that are all absorbed into the big O.^4 Other parametric models generally converge at the same rate.

At the other extreme, we've seen a number of completely nonparametric regression methods, such as kernel regression, local polynomials, k-nearest neighbors, etc. Here the limiting approximation bias is actually zero, at least for any reasonable regression function μ. The problem is that they converge more slowly, because we need to use the data not just to figure out the coefficients of a parametric model, but the sheer shape of the regression function. We saw in Chapter 4 that the mean-squared error of kernel regression in one dimension is σ^2 + O(n^{-4/5}). Splines, k-nearest-neighbors (with growing k), etc., all attain the same rate. But in p dimensions, this becomes (Wasserman, 2006, §5.12)

    MSE_nonpara − σ^2 = O(n^{-4/(p+4)})    (8.12)

There's no ultimate approximation bias term here. Why does the rate depend on p? Well, to hand-wave a bit, think of kernel smoothing, where μ̂(x⃗) is an average over y_i for x⃗_i near x⃗. In a p-dimensional space, the volume within ε of x⃗ is O(ε^p), so the probability that a training point x⃗_i falls in the averaging region around x⃗ gets exponentially smaller as p grows. Turned around, to get the same number of training points per x⃗, we need exponentially larger sample sizes. The appearance of the 4s is a little more mysterious, but can be resolved from an error analysis of the kind we did for kernel regression in Chapter 4.^5

4. See Appendix C if you are not familiar with "big O" notation.
5. Remember that in one dimension, the bias of a kernel smoother with bandwidth h is O(h^2), and the variance is O(1/nh), because only samples falling in an interval about h across contribute to the prediction at any one point, and when h is small, the number of such samples is proportional to nh. Adding bias squared to variance gives an error of O(h^4) + O((nh)^{-1}); solving for the best bandwidth gives h_opt = O(n^{-1/5}), and the total error is then O(n^{-4/5}). Suppose for the moment that in p dimensions we use the same bandwidth along each dimension. (We get the same end result with more work if we let each dimension have its own bandwidth.) The bias is still O(h^2), because the Taylor expansion still goes through. But now only samples falling into a region of volume O(h^p) around x contribute to the prediction at x, so the variance is O((nh^p)^{-1}). The best bandwidth is now h_opt = O(n^{-1/(p+4)}), yielding an error of O(n^{-4/(p+4)}) as promised.

curve(x^(-1), from = 1, to = 1e4, log = "x", xlab = "n", ylab = "Excess MSE")
curve(x^(-4/5), add = TRUE, lty = "dashed")
curve(x^(-1/26), add = TRUE, lty = "dotted")
legend("topright", legend = c(expression(n^{-1}),
    expression(n^{-4/5}), expression(n^{-1/26})),
    lty = c("solid", "dashed", "dotted"))

Figure 8.1 Schematic of rates of convergence of MSEs for parametric models (O(n^{-1})), one-dimensional nonparametric regressions or additive models (O(n^{-4/5})), and a 100-dimensional nonparametric regression (O(n^{-1/26})). Note that the horizontal but not the vertical axis is on a logarithmic scale.

This slow rate isn't just a weakness of kernel smoothers, but turns out to be the best any nonparametric estimator can do.

For p = 1, the nonparametric rate is O(n^{-4/5}), which is of course slower than O(n^{-1}), but not all that much, and the improved bias usually more than makes up for it. But as p grows, the nonparametric rate gets slower and slower, and the fully nonparametric estimate more and more imprecise, yielding the infamous curse of dimensionality. For p = 100, say, we get a rate of O(n^{-1/26}), which is not very good at all. (See Figure 8.1.) Said another way, to get the same precision with p inputs that n data points gives us with one input takes n^{(4+p)/5} data points. For p = 100, this is n^{20.8}, which tells us that matching the error of n = 100 one-dimensional observations requires O(4 × 10^{41}) hundred-dimensional observations. So completely unstructured nonparametric regressions won't work very well in high dimensions, at least not with plausible amounts of data.
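The sample-size arithmetic in the last paragraph is easy to check directly; this is my own two-line calculation, following the rule of thumb just stated.

# Matching the error of n one-dimensional points with p-dimensional data
# takes about n^((4+p)/5) points, by the rate comparison above
n <- 100; p <- 100
n^((4 + p)/5)   # roughly 4e41, as claimed in the text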

The trouble is that there are just too many possible high-dimensional functions, and seeing only a trillion points from the function doesn't pin down its shape very well at all.

This is where additive models come in. Not every regression function is additive, so they have, even asymptotically, some approximation bias. But we can estimate each f_j by a simple one-dimensional smoothing, which converges at O(n^{-4/5}), almost as good as the parametric rate. So overall

    MSE_additive = σ^2 + a_additive + O(n^{-4/5})    (8.13)

[[ATTN: More mathematical explanation in appendix?]]

Since linear models are a sub-class of additive models, a_additive ≤ a_lm. From a purely predictive point of view, the only time to prefer linear models to additive models is when n is so small that O(n^{-4/5}) − O(n^{-1}) exceeds this difference in approximation biases; eventually the additive model will be more accurate.^6

8.4 Example: California House Prices Revisited

As an example, we'll look at data on median house prices across Census tracts from the data-analysis assignment in §A.13. This has both California and Pennsylvania, but it's hard to visually see patterns with both states; I'll do California, and let you replicate this all on Pennsylvania, and even on the combined data.

Start with getting the data:

housing <- read.csv("http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/data/calif_penn_2011.csv")
housing <- na.omit(housing)
calif <- housing[housing$STATEFP == 6, ]

(How do I know that the STATEFP code of 6 corresponds to California?)

We'll fit a linear model for the log price, on the thought that it makes some sense for the factors which raise or lower house values to multiply together, rather than just adding.

calif.lm <- lm(log(Median_house_value) ~ Median_household_income +
    Mean_household_income + POPULATION + Total_units + Vacant_units +
    Owners + Median_rooms + Mean_household_size_owners +
    Mean_household_size_renters + LATITUDE + LONGITUDE, data = calif)

This is very fast — about a fifth of a second on my laptop.

Here are the summary statistics^7:

print(summary(calif.lm), signif.stars = FALSE, digits = 3)
##
## Call:
## lm(formula = log(Median_house_value) ~ Median_household_income +
##     Mean_household_income + POPULATION + Total_units + Vacant_units +
##     Owners + Median_rooms + Mean_household_size_owners + Mean_household_size_renters +
##     LATITUDE + LONGITUDE, data = calif)
##

6. Unless the best additive approximation to μ is linear; then the linear model has no more bias and less variance.
7. I have suppressed the usual stars on "significant" regression coefficients, because, as discussed in Chapter ??, those aren't really the most important variables, and I have reined in R's tendency to use far too many decimal places.

## Residuals:
##    Min     1Q Median     3Q    Max
## -3.855 -0.153  0.034  0.189  1.214
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 -5.74e+00   5.28e-01  -10.86  < 2e-16
## Median_household_income     1.34e-06   4.63e-07    2.90   0.0038
## Mean_household_income       1.07e-05   3.88e-07   27.71  < 2e-16
## POPULATION                  -4.15e-05   5.03e-06   -8.27  < 2e-16
## Total_units                  8.37e-05   1.55e-05    5.41  6.4e-08
## Vacant_units                 8.37e-07   2.37e-05    0.04   0.9719
## Owners                      -3.98e-03   3.21e-04  -12.41  < 2e-16
## Median_rooms                -1.62e-02   8.37e-03   -1.94   0.0525
## Mean_household_size_owners   5.60e-02   7.16e-03    7.83  5.8e-15
## Mean_household_size_renters -7.47e-02   6.38e-03  -11.71  < 2e-16
## LATITUDE                    -2.14e-01   5.66e-03  -37.76  < 2e-16
## LONGITUDE                   -2.15e-01   5.94e-03  -36.15  < 2e-16
##
## Residual standard error: 0.317 on 7469 degrees of freedom
## Multiple R-squared: 0.639, Adjusted R-squared: 0.638
## F-statistic: 1.2e+03 on 11 and 7469 DF, p-value: <2e-16

Figure 8.2 plots the predicted prices, ±2 standard errors, against the actual prices. The predictions are not all that accurate — the RMS residual is 0.317 on the log scale (i.e., 37% on the original scale), but they do have pretty reasonable coverage; about 96% of actual prices fall within the prediction limits.^8

predlims <- function(preds, sigma) {
    prediction.sd <- sqrt(preds$se.fit^2 + sigma^2)
    upper <- preds$fit + 2 * prediction.sd
    lower <- preds$fit - 2 * prediction.sd
    lims <- cbind(lower = lower, upper = upper)
    return(lims)
}

Code Example 19: Calculating quick-and-dirty prediction limits from a prediction object (preds) containing fitted values and their standard errors, plus an estimate of the noise level. Because those are two (presumably uncorrelated) sources of noise, we combine the standard deviations by "adding in quadrature".

8. Remember from your linear regression class that there are two kinds of confidence intervals we might want to use for prediction. One is a confidence interval for the conditional mean at a given value of x; the other is a confidence interval for the realized values of Y at a given x. Earlier examples have emphasized the former, but since we don't know the true conditional means here, we need to use the latter sort of intervals, prediction intervals proper, to evaluate coverage. The predlims function in Code Example 19 calculates a rough prediction interval by taking the standard error of the conditional mean, combining it with the estimated standard deviation, and multiplying by 2. Strictly speaking, we ought to worry about using a t-distribution rather than a Gaussian here, but with 7469 residual degrees of freedom, this isn't going to matter much. (Assuming Gaussian noise is likely to be more of a concern, but this is only meant to be a rough cut anyway.)
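The coverage figure quoted above can be checked along the following lines; this is a sketch of mine rather than the text's exact code, using predlims from Code Example 19 and the calif.lm fit.

# Rough check of the ~96% coverage claim for the linear model
preds.lm <- predict(calif.lm, se.fit = TRUE)
lims.lm <- predlims(preds.lm, sigma = summary(calif.lm)$sigma)
mean(log(calif$Median_house_value) >= lims.lm[, "lower"] &
     log(calif$Median_house_value) <= lims.lm[, "upper"])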

On the other hand, the predictions are quite precise, with the median of the calculated standard errors being 0.011 on the log scale (i.e., 1.1% in dollars). This linear model thinks it knows what's going on.

Next, we'll fit an additive model, using the gam function from the mgcv package; this automatically sets the bandwidths using a fast approximation to leave-one-out CV called generalized cross-validation, or GCV (§3.4.3).

system.time(calif.gam <- gam(log(Median_house_value) ~ s(Median_household_income) +
    s(Mean_household_income) + s(POPULATION) + s(Total_units) + s(Vacant_units) +
    s(Owners) + s(Median_rooms) + s(Mean_household_size_owners) +
    s(Mean_household_size_renters) + s(LATITUDE) + s(LONGITUDE), data = calif))
##    user  system elapsed
##   3.452   0.144   3.614

(That is, it took about five seconds total to run this.) The s() terms in the gam formula indicate which terms are to be smoothed — if we wanted particular parametric forms for some variables, we could do that as well. (Unfortunately we can't just write MedianHouseValue ~ s(.), we have to list all the variables on the right-hand side.^9) The smoothing here is done by splines (hence s()), and there are lots of options for controlling the splines, or replacing them by other smoothers, if you know what you're doing.

Figure 8.3 compares the predicted to the actual responses. The RMS error has improved (0.27 on the log scale, or 31%, with 96% of observations falling within ±2 standard errors of their fitted values), at only a fairly modest cost in the claimed precision (the median standard error of prediction is 0.02, or 2.1%). Figure 8.4 shows the partial response functions.

It makes little sense to have latitude and longitude make separate additive contributions here; presumably they interact. We can just smooth them together^10:

calif.gam2 <- gam(log(Median_house_value) ~ s(Median_household_income) +
    s(Mean_household_income) + s(POPULATION) + s(Total_units) + s(Vacant_units) +
    s(Owners) + s(Median_rooms) + s(Mean_household_size_owners) +
    s(Mean_household_size_renters) + s(LONGITUDE, LATITUDE), data = calif)

This gives an RMS error of 0.25 (log-scale) and 96% coverage, with a median standard error of 0.021, so accuracy is improving (at least in sample), with little loss of precision.

Figures 8.6 and 8.7 show two different views of the joint smoothing of longitude and latitude. In the perspective plot, it's quite clear that price increases specifically towards the coast, and even more specifically towards the great coastal cities. In the contour plot, one sees more clearly an inward bulge of a negative, but not too very negative, contour line (between -122 and -120 longitude) which embraces Napa, Sacramento, and some related areas, which are comparatively more developed and more expensive than the rest of central California, and so more expensive than one would expect based on their distance from the coast and San Francisco.

9. Alternately, we could use Kevin Gilbert's formulaTools functions — see https://gist.github.com/kgilbert-cmu.
10. If the two variables which interact have very different magnitudes, it's better to smooth them with a te() term than an s() term, but here they are comparable. See §8.5 for more, and help(gam.models).

graymapper <- function(z, x = calif$LONGITUDE, y = calif$LATITUDE, n.levels = 10,
    breaks = NULL, break.by = "length", legend.loc = "topright", digits = 3, ...) {
    my.greys = grey(((n.levels - 1):0)/n.levels)
    if (!is.null(breaks)) {
        stopifnot(length(breaks) == (n.levels + 1))
    } else {
        if (identical(break.by, "length")) {
            breaks = seq(from = min(z), to = max(z), length.out = n.levels + 1)
        } else {
            breaks = quantile(z, probs = seq(0, 1, length.out = n.levels + 1))
        }
    }
    z = cut(z, breaks, include.lowest = TRUE)
    colors = my.greys[z]
    plot(x, y, col = colors, bg = colors, ...)
    if (!is.null(legend.loc)) {
        breaks.printable <- signif(breaks[1:n.levels], digits)
        legend(legend.loc, legend = breaks.printable, fill = my.greys)
    }
    invisible(breaks)
}

Code Example 20: Map-making code. In its basic use, this takes vectors for x and y coordinates, and draws gray points whose color depends on a third vector for z, with darker points indicating higher values of z. Options allow for the control of the number of gray levels, setting the breaks between levels automatically, and using a legend. Returning the break-points makes it easier to use the same scale in multiple maps. See online for commented code.

If you worked through problem set A.13, you will recall that one of the big things wrong with the linear model is that its errors (the residuals) are highly structured and very far from random. In essence, it totally missed the existence of cities, and the fact that houses cost more in cities (because land costs more there). It's a good idea, therefore, to make some maps, showing the actual values, and then, by way of contrast, the residuals of the models. Rather than do the plotting by hand over and over, let's write a function (Code Example 20).

Figures 8.8 and 8.9 show that allowing for the interaction of latitude and longitude (the smoothing term plotted in Figures 8.6–8.7) leads to a much more random and less systematic clumping of residuals. This is desirable in itself, even if it does little to improve the mean prediction error. Essentially, what that smoothing term is doing is picking out the existence of California's urban regions, and their distinction from the rural background. Examining the plots of the interaction term should suggest to you how inadequate it would be to just put in a LONGITUDE × LATITUDE term in a linear model.
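One plausible way to produce the maps just described, using graymapper from Code Example 20; the exact calls behind the book's figures may differ, and the graphical options here (point size, titles, quantile breaks) are my own choices.

# Possible maps: the data, and the residuals of the interaction model
par(mfrow = c(1, 2))
graymapper(calif$Median_house_value, break.by = "quantiles", pch = 16, cex = 0.3,
    xlab = "Longitude", ylab = "Latitude", main = "Median house values")
graymapper(residuals(calif.gam2), break.by = "quantiles", pch = 16, cex = 0.3,
    xlab = "Longitude", ylab = "Latitude", main = "Residuals, interaction model")
par(mfrow = c(1, 1))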

Including an interaction between latitude and longitude in a spatial problem is pretty obvious. There are other potential interactions which might be important here — for instance, between the two measures of income, or between the total number of housing units available and the number of vacant units. We could, of course, just use a completely unrestricted nonparametric regression — going to the opposite extreme from the linear model. In addition to the possible curse-of-dimensionality issues, however, getting something like npreg to run with 7000 data points and 11 predictor variables requires a lot of patience. Other techniques, like nearest neighbor regression (§1.5.1) or regression trees (Ch. 13), may run faster, though cross-validation can be demanding even there.

8.5 Interaction Terms and Expansions

One way to think about additive models, and about (possibly) including interaction terms, is to imagine doing a sort of Taylor series or power series expansion of the true regression function. The zero-th order expansion would be a constant:

    μ(x⃗) ≈ α    (8.14)

The best constant to use here would just be E[Y]. ("Best" here is in the mean-square sense, as usual.) A purely additive model would correspond to a first-order expansion:

    μ(x⃗) ≈ α + Σ_{j=1}^{p} f_j(x_j)    (8.15)

Two-way interactions come in when we go to a second-order expansion:

    μ(x⃗) ≈ α + Σ_{j=1}^{p} f_j(x_j) + Σ_{j=1}^{p} Σ_{k=j+1}^{p} f_jk(x_j, x_k)    (8.16)

(Why do I limit k to run from j + 1 to p, rather than from 1 to p?) We will, of course, insist that E[f_jk(X_j, X_k)] = 0 for all j, k. If we want to estimate these terms in R, using mgcv, we use the syntax s(xj, xk) or te(xj, xk). The former fits a thin-plate spline over the (x_j, x_k) plane, and is appropriate when those variables are measured on similar scales, so that curvatures along each direction are comparable. The latter uses a tensor product of smoothing splines along each coordinate, and is more appropriate when the measurement scales are very different.^11

There is an important ambiguity here: for any j, with additive partial-response function f_j, I could take any of its interactions, set f'_jk(x_j, x_k) = f_jk(x_j, x_k) + f_j(x_j) and f'_j(x_j) = 0, and get exactly the same predictions under all circumstances. This is the parallel to being able to add and subtract constants from the first-order functions, provided we made corresponding changes to the intercept term. We therefore need to similarly fix the two-way interaction functions.

11. For the distinction between thin-plate and tensor-product splines, see §7.4. If we want to interact a continuous variable x_j with a categorical x_k, mgcv's syntax is s(xj, by=xk) or te(xj, by=xk).

A natural way to do this is to insist that the second-order function f_jk should be uncorrelated with ("orthogonal to") the first-order functions f_j and f_k; this is the analog to insisting that the first-order functions all have expectation zero. The f_jk's then represent purely interactive contributions to the response, which could not be captured by additive terms. If this is what we want to do, the best syntax to use in mgcv is ti, which specifically separates the first- and higher-order terms, e.g., ti(xj) + ti(xk) + ti(xj, xk) will estimate three functions, for the additive contributions and their interaction.

An alternative is to just pick a particular f_j, and absorb f_j into f_jk. The model then looks like

    μ(x⃗) ≈ α + Σ_{j=1}^{p} Σ_{k=j+1}^{p} f_jk(x_j, x_k)    (8.17)

We can also mix these two approaches, if we specifically do not want additive or interactive terms for certain predictor variables. This is what I did above, where I estimated a single second-order smoothing term for both latitude and longitude, with no additive components for either.

Of course, there is nothing special about two-way interactions. If you're curious about what a three-way term would be like, and you're lucky enough to have data which is amenable to fitting it, you could certainly try

    μ(x⃗) ≈ α + Σ_{j=1}^{p} f_j(x_j) + Σ_{j=1}^{p} Σ_{k=j+1}^{p} f_jk(x_j, x_k) + Σ_{j,k,l} f_jkl(x_j, x_k, x_l)    (8.18)

(How should the indices for the last term go?) More ambitious combinations are certainly possible, though they tend to become a confused mass of algebra and indices.

Geometric interpretation

It's often convenient to think of the regression function as living in a big (infinite-dimensional) vector space of functions. Within this space, the constant functions form a linear sub-space^12, and we can ask for the projection of the true regression function μ on to that sub-space; this would be the best approximation to μ as a constant.^13 This is, of course, the expectation value. The additive functions of all p variables also form a linear sub-space^14, so the right-hand side of Eq. 8.15 is just the projection of μ on to that space, and so forth and so on. When we insist on having the higher-order interaction functions be uncorrelated with the additive functions, we're taking the projection of μ on to the space of all functions orthogonal to the additive functions.

12. Because if f and g are two constant functions, af + bg is also a constant, for any real numbers a and b.
13. Remember that projecting a vector on to a linear sub-space finds the point in the sub-space closest to the original vector. This is equivalent to minimizing the (squared) bias.
14. By parallel reasoning to the previous footnote.
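Here is a hedged toy example of my own (not from the text) of the ti() decomposition described above, on simulated data where the truth really does have additive pieces plus a pure interaction; the variable and object names are arbitrary.

# Separating additive terms from a pure interaction with ti() in mgcv
library(mgcv)
n <- 1000
xj <- runif(n); xk <- runif(n)
y <- sin(2 * pi * xj) + xk^2 + xj * xk + rnorm(n, sd = 0.1)
anova.fit <- gam(y ~ ti(xj) + ti(xk) + ti(xj, xk))
summary(anova.fit)   # ti(xj, xk) picks up what the additive terms cannot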

Selecting interactions

There are two issues with interaction terms. First, the curse of dimensionality returns: an order-q interaction term will converge at the rate O(n^{-4/(4+q)}), so they can dominate the over-all uncertainty. Second, there are lots of possible interactions ((p choose q), in fact), which can make it very demanding in time and data to fit them all, and hard to interpret. Just as with linear models, therefore, it can make a lot of sense to selectively examine interactions based on subject-matter knowledge, or residuals of additive models.

Varying-coefficient models

In some contexts, people like to use models of the form

    μ(x) = α + Σ_{j=1}^{p} x_j f_j(x_{−j})    (8.19)

where f_j is a function of the non-j predictor variables, or some subset of them. These varying-coefficient models are obviously a subset of the usual class of additive models, but there are occasions where they have some scientific justification^15. These are conveniently estimated in mgcv through the by option, e.g., s(xk, by=xj) will estimate a term of the form x_j f_j(x_k).^16

15. They can also serve as a "transitional object" when giving up the use of purely linear models.
16. As we saw above, by does something slightly different when given a categorical variable. How are these two uses related?
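As a toy illustration of my own (not from the text) of the varying-coefficient syntax: with a continuous by variable, s(xk, by = xj) fits a term of the form x_j f(x_k); the names below are arbitrary.

# A varying-coefficient term in mgcv, on simulated data
library(mgcv)
n <- 1000
xj <- runif(n); xk <- runif(n)
y <- xj * cos(2 * pi * xk) + rnorm(n, sd = 0.1)
vc.fit <- gam(y ~ s(xk, by = xj))
plot(vc.fit)   # the estimated f(xk) that multiplies xj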

8.6 Closing Modeling Advice

With modern computing power, there are very few situations in which it is actually better to do linear regression than to fit an additive model. In fact, there seem to be only two good reasons to prefer linear models.

1. Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure (not others, for which our observables serve as imperfect proxies).
2. Our data set is so massive that either the extra processing time, or the extra computer memory, needed to fit and store an additive rather than a linear model is prohibitive.

Even when the first reason applies, and we have good reasons to believe a linear theory, the truly scientific thing to do would be to check linearity, by fitting a flexible non-linear model and seeing if it looks close to linear. (We will see formal tests based on this idea in Chapter 9.) Even when the second reason applies, we would like to know how much bias we're introducing by using linear predictors, which we could do by randomly selecting a subset of the data which is small enough for us to manage, and fitting an additive model.

In the vast majority of cases when users of statistical software fit linear models, neither of these justifications applies: theory doesn't tell us to expect linearity, and our machines don't compel us to use it. Linear regression is then employed for no better reason than that users know how to type lm but not gam. You now know better, and can spread the word.

8.7 Further Reading

Simon Wood, who wrote the mgcv package, has a nice book about additive models and their generalizations, Wood (2006); at this level it's your best source for further information. Buja et al. (1989) is a thorough theoretical treatment.

The expansions of §8.5 are sometimes called "functional analysis of variance" or "functional ANOVA". Making those ideas precise requires exploring some of the geometry of infinite-dimensional spaces of functions ("Hilbert space"). See Wahba (1990) for a treatment of the statistical topic, and Halmos (1957) for a classic introduction to Hilbert spaces.

Historical notes

Ezekiel (1924) seems to be the first publication advocating the use of additive models as a general method, which he called "curvilinear multiple correlation". His paper was complete with worked examples on simulated data (with known answers) and real data (from economics)^17. He was explicit that any reasonable smoothing or regression technique could be used to find what we'd call the partial response functions. He also gave a successive-approximation algorithm for estimating the over-all model: start with an initial guess about all the partial responses; plot all the partial residuals; refine the partial responses simultaneously; repeat. This differs from back-fitting in that the partial response functions are updated in parallel within each cycle, not one after the other. This is a subtle difference, and Ezekiel's method will often work, but can run into trouble with correlated predictor variables, when back-fitting will not.

The Gauss-Seidel or backfitting algorithm was invented by Gauss in the early 1800s during his work on least squares estimation in linear models; he mentioned it in letters to students, described it as something one could do "while half asleep", but never published it. Seidel gave the first published version in 1874. (For all this history, see Benzi 2009.) I am not sure when the connection was made between additive statistical models and back-fitting.

Exercises

8.1 Repeat the analyses of California housing prices with Pennsylvania housing prices. Which partial response functions might one reasonably hope would stay the same? Do they? (How can you tell?)

17. "Each of these curves illustrates and substantiates conclusions reached by theoretical economic analysis. Equally important, they provide definite quantitative statements of the relationships. The method of ... curvilinear multiple correlation enable[s] us to use the favorite tool of the economist, caeteris paribus, in the analysis of actual happenings equally as well as in the intricacies of theoretical reasoning" (p. 453). (See also Exercise 8.4.)

Exercises

8.1 Repeat the analyses of California housing prices with Pennsylvania housing prices. Which partial response functions might one reasonably hope would stay the same? Do they? (How can you tell?)

8.2 Additive?  For general p, let ‖x‖ be the (ordinary, Euclidean) length of the vector x. Is this an additive function of the (ordinary, Cartesian) coordinates? Is ‖x‖² an additive function? Is ‖x − x_0‖ for a fixed x_0? ‖x − x_0‖²?

8.3 Additivity vs. parallelism
1. Take any additive function f of p arguments x_1, x_2, ..., x_p. Fix a coordinate index i and a real number c. Prove that f(x_1, x_2, ..., x_i + c, ..., x_p) − f(x_1, x_2, ..., x_i, ..., x_p) depends only on x_i and c, and not on the other coordinates.
2. Suppose p = 2, and continue to assume f is additive. Consider the curve formed by plotting f(x_1, x_2) against x_1 for a fixed value of x_2, and the curve formed by plotting f(x_1, x_2') against x_1 with x_2 fixed at a different value, say x_2'. Prove that the curves are parallel, i.e., that the vertical distance between them is constant.
3. For general p and additive f, consider the surfaces formed by varying all but one of the coordinates, holding that coordinate fixed at different values. Prove that these surfaces are always parallel to each other.
4. Is the converse true? That is, do parallel regression surfaces imply an additive model?

8.4 Additivity vs. partial derivatives
1. Suppose that the true regression function μ is additive, with partial response functions f_j. Show that ∂μ/∂x_j = f_j′(x_j), so that each partial derivative is a function of that coordinate alone.
2. (Much harder) Suppose that, for each coordinate x_j, there is some function f_j of x_j alone such that ∂μ/∂x_j = f_j(x_j). Is μ necessarily additive?

8.5 Suppose that an additive model holds, so that Y = α + ∑_{j=1}^{p} f_j(X_j) + ε, with E[f_j(X_j)] = 0 for each j, and E[ε | X = x] = 0 for all x.
1. For each j, let μ_j(x_j) = E[Y | X_j = x_j]. Show that

μ_j(x_j) = α + f_j(x_j) + E[ ∑_{k ≠ j} f_k(X_k) | X_j = x_j ]

2. Show that if X_j is statistically independent of X_k for all k ≠ j, then μ_j(x_j) − α = f_j(x_j).
3. Does the conclusion of part 2 still hold if one or more of the X_k's is statistically dependent on X_j? Explain why this should be the case, or give a counter-example to show that it's not true. Hint: All linear models are additive models, so if it is true for all additive models, it's true for all linear models. Is it true for all linear models?

[Figure 8.2 scatterplot omitted from transcript; produced by the following code.]

plot(calif$Median_house_value, exp(preds.lm$fit), type = "n",
     xlab = "Actual price ($)", ylab = "Predicted ($)", main = "Linear model",
     ylim = c(0, exp(max(predlims.lm))))
segments(calif$Median_house_value, exp(predlims.lm[, "lower"]),
         calif$Median_house_value, exp(predlims.lm[, "upper"]), col = "grey")
abline(a = 0, b = 1, lty = "dashed")
points(calif$Median_house_value, exp(preds.lm$fit), pch = 16, cex = 0.1)

Figure 8.2  Actual median house values (horizontal axis) versus those predicted by the linear model (black dots), plus or minus two predictive standard errors (grey bars). The dashed line shows where actual and predicted prices are equal.

Here predict gives both a fitted value for each point, and a standard error for that prediction. (Without a newdata argument, predict defaults to the data used to estimate calif.lm, which here is what we want.) Predictions are exponentiated so they're comparable to the original values (and because it's easier to grasp dollars than log-dollars).
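The objects preds.lm and predlims.lm are constructed earlier in the chapter; as a reminder, here is one way they could be built. This is a sketch under the assumption that the ±2-standard-error limits combine the standard error of the fitted value with the residual standard deviation on the log scale, not necessarily the book's exact code.

preds.lm <- predict(calif.lm, se.fit = TRUE)      # fitted values and their standard errors
sigma.lm <- summary(calif.lm)$sigma               # estimated noise sd (log-dollar scale)
pred.sd <- sqrt(preds.lm$se.fit^2 + sigma.lm^2)   # predictive sd for a new observation
predlims.lm <- cbind(lower = preds.lm$fit - 2 * pred.sd,
                     upper = preds.lm$fit + 2 * pred.sd)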

[Figure 8.3 scatterplot omitted from transcript; produced by the following code.]

plot(calif$Median_house_value, exp(preds.gam$fit), type = "n",
     xlab = "Actual price ($)", ylab = "Predicted ($)", main = "First additive model",
     ylim = c(0, exp(max(predlims.gam))))
segments(calif$Median_house_value, exp(predlims.gam[, "lower"]),
         calif$Median_house_value, exp(predlims.gam[, "upper"]), col = "grey")
abline(a = 0, b = 1, lty = "dashed")
points(calif$Median_house_value, exp(preds.gam$fit), pch = 16, cex = 0.1)

Figure 8.3  Actual versus predicted prices for the additive model, as in Figure 8.2. Note that the sig2 attribute of a model returned by gam() is the estimate of the noise variance around the regression surface (σ²).
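The analogous limits for the additive model can use the sig2 component mentioned in the caption. A sketch, assuming the fitted model object is called calif.gam (calif.gam2 appears in later figures, so this name is an assumption):

preds.gam <- predict(calif.gam, se.fit = TRUE)
pred.sd <- sqrt(preds.gam$se.fit^2 + calif.gam$sig2)   # sig2 is already a variance
predlims.gam <- cbind(lower = preds.gam$fit - 2 * pred.sd,
                      upper = preds.gam$fit + 2 * pred.sd)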

[Figure 8.4 panels: estimated smooths s(POPULATION,2.27), s(Total_units,5.37), s(Mean_household_income,6.52), s(Median_household_income,5.03), s(Median_rooms,7.39), s(Owners,3.88), s(Vacant_units,5.85), s(Mean_household_size_owners,7.82), s(Mean_household_size_renters,3.11), s(LATITUDE,8.81), s(LONGITUDE,8.85); the number after each variable is the smooth's effective degrees of freedom. Plots omitted from transcript.]

Figure 8.4  The estimated partial response functions for the additive model, with a shaded region showing ±2 standard errors. The tick marks along the horizontal axis show the observed values of the input variables (a rug plot); note that the error bars are wider where there are fewer observations. Setting pages=0 (the default) would produce eight separate plots, with the user prompted to cycle through them. Setting scale=0 gives each plot its own vertical scale; the default is to force them to share the same one. Finally, note that here the vertical scales are logarithmic.
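The call that produced Figure 8.4 is not shown on this page; by analogy with the call shown for Figure 8.5, it was presumably something like the following (the exact options are an assumption, not taken from the text):

plot(calif.gam, scale = 0, se = 2, shade = TRUE, pages = 1)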

[Figure 8.5 panels: partial response functions with partial residuals for each predictor, plus a contour panel (with −1 se and +1 se surfaces) for the joint smooth s(LONGITUDE,LATITUDE,28.47). Plots omitted from transcript.]

plot(calif.gam2, scale = 0, se = 2, shade = TRUE, resid = TRUE, pages = 1)

Figure 8.5  Partial response functions and partial residuals for calif.gam2, as in Figure 8.4. See subsequent figures for the joint smoothing of longitude and latitude, which here is an illegible mess. See help(plot.gam) for the plotting options used here.

[Figure 8.6 perspective plot of the joint smooth s(LONGITUDE,LATITUDE,28.47) omitted from transcript; produced by the following code.]

plot(calif.gam2, select = 10, phi = 60, pers = TRUE, ticktype = "detailed", cex.axis = 0.5)

Figure 8.6  The result of the joint smoothing of longitude and latitude.

[Figure 8.7 contour plot of the joint smooth s(LONGITUDE,LATITUDE,28.47) over longitude and latitude, omitted from transcript; produced by the following code.]

plot(calif.gam2, select = 10, se = FALSE)

Figure 8.7  The result of the joint smoothing of longitude and latitude. Setting se=TRUE, the default, adds standard errors for the contour lines in multiple colors. Again, note that these are log units.

[Figure: side-by-side maps with panels labeled "Data" and "Linear model", showing actual and predicted median house values by geographic location (latitude on the vertical axes); the shared legend of prices runs from about $16,200 to $591,000. Maps omitted from transcript.]
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 34 34 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 705000 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l −114 −116 −114 −120 −122 −118 −122 −116 −124 −120 −124 −118 Longitude Longitude Second additive model First additive model l l l l l l l l l l l l l l l l l l 42 42 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 40 l l 40 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 38 38 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Latitude Latitude l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 36 36 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 34 l l l l 34 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l −124 −122 −118 −116 −114 −114 −120 −116 −118 −120 −124 −122 Longitude Longitude par(mfrow = c(2, 2)) calif.breaks <- graymapper(calif$Median_house_value, pch = 16, xlab = "Longitude", ylab = "Latitude", main = "Data", break.by = "quantiles") graymapper(exp(preds.lm$fit), breaks = calif.breaks, pch = 16, xlab = "Longitude", ylab = "Latitude", legend.loc = NULL, main = "Linear model") graymapper(exp(preds.gam$fit), breaks = calif.breaks, legend.loc = NULL, pch = 16, xlab = "Longitude", ylab = "Latitude", main = "First additive model") graymapper(exp(preds.gam2$fit), breaks = calif.breaks, legend.loc = NULL, pch = 16, xlab = "Longitude", ylab 
= "Latitude", main = "Second additive model") par(mfrow = c(1, 1)) Figure 8.8 Maps of real prices (top left), and those predicted by the linear model (top right), the purely additive model (bottom left), and the additive model with interaction between latitude and longitude (bottom right). Categories are deciles of the actual prices.
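The graymapper plotting helper called above is defined elsewhere in the text. In case that definition is not to hand, here is a minimal sketch of a helper with the same calling pattern; the argument names are taken from the calls above, but the body is an illustrative assumption rather than the book's code. It bins the values (by quantiles unless breaks are supplied), shades points on a Longitude-Latitude scatterplot by bin, optionally draws a legend, and returns the breaks so later panels can reuse them.

# Sketch of a graymapper-like helper (illustrative assumption, not the text's
# definition): bin z, shade a Longitude-Latitude scatterplot by bin, optionally
# add a legend, and return the breaks so they can be reused across panels.
graymapper.sketch <- function(z, x = calif$Longitude, y = calif$Latitude,
                              breaks = NULL, break.by = "length",
                              n.levels = 10, legend.loc = "topright", ...) {
    if (is.null(breaks)) {
        if (break.by == "quantiles") {
            breaks <- quantile(z, probs = seq(0, 1, length.out = n.levels + 1))
        } else {
            breaks <- seq(from = min(z), to = max(z), length.out = n.levels + 1)
        }
    }
    z.categories <- cut(z, breaks = breaks, include.lowest = TRUE)
    # Darker gray = higher category
    shades <- gray(seq(from = 0.9, to = 0.1, length.out = nlevels(z.categories)))
    plot(x, y, col = shades[as.integer(z.categories)], ...)
    if (!is.null(legend.loc)) {
        legend(legend.loc, legend = levels(z.categories), col = shades,
               pch = 16, cex = 0.5)
    }
    invisible(breaks)
}

With this sketch in place of the original helper, the calls above run unchanged apart from the function name.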

[Figure: two map panels, "Data" (actual prices) and "Residuals of linear model", each plotted over Longitude and Latitude with points shaded by decile.]
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 34 34 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 705000 0.329 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l −114 −124 −122 −120 −120 −118 −114 −118 −116 −116 −122 −124 Longitude Longitude Residuals of second additive model Residuals errors of first additive model l l l l l l l l l l l l l l l l l l 42 42 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 40 40 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 38 38 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Latitude Latitude l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 36 36 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 34 l l l l 34 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l −124 −122 −120 −118 −116 −114 −120 −118 −116 −122 −124 −114 Longitude Longitude Figure 8.9 Actual housing values (top left), and the 
residuals of the three models. (The residuals are all plotted with the same color codes.) Notice that both the linear model and the additive model without spatial interaction systematically mis-price urban areas. The model with spatial interaction does much better at having randomly-scattered errors, though hardly perfect. — How would you make a map of the magnitude of regression errors?
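One simple way to approach the caption's closing question, sketched here purely as an illustration: plot the locations again, but let the size (or color) of each point reflect the absolute residual. The object names below (calif for the data frame of locations and calif.fit for a fitted model) are hypothetical stand-ins for whatever was used in the running example, not code from the text.

# Map the magnitude of the errors: bigger points = bigger |residual|
# (calif and calif.fit are placeholder names)
abs.res <- abs(residuals(calif.fit))
plot(calif$Longitude, calif$Latitude, pch = 19, col = "grey",
     cex = sqrt(abs.res/median(abs.res)),
     xlab = "Longitude", ylab = "Latitude",
     main = "Magnitude of regression errors")

A color scale running from light for small errors to dark for large ones would work just as well as point size.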

9 Testing Parametric Regression Specifications with Nonparametric Regression

9.1 Testing Functional Forms

One important, but under-appreciated, use of nonparametric regression is in testing whether parametric regressions are well-specified. The typical parametric regression model is something like

Y = f(X; \theta) + \epsilon    (9.1)

where f is some function which is completely specified except for the adjustable parameters \theta, and \epsilon, as usual, is uncorrelated noise. Usually, but not necessarily, people use a function f that is linear in the variables in X, or perhaps includes some interactions between them.

How can we tell if the specification is right? If, for example, it's a linear model, how can we check whether there might not be some nonlinearity? One common approach is to modify the specification by adding in specific departures from the modeling assumptions (say, adding a quadratic term) and seeing whether the coefficients that go with those terms are significantly non-zero, or whether the improvement in fit is significant.[1] For example, one might compare the model

Y = \theta_1 x_1 + \theta_2 x_2 + \epsilon    (9.2)

to the model

Y = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \epsilon    (9.3)

by checking whether the estimated \theta_3 is significantly different from 0, or whether the residuals from the second model are significantly smaller than the residuals from the first.

This can work, if you have chosen the right nonlinearity to test. It has the power to detect certain mis-specifications, if they exist, but not others. (What if the departure from linearity is not quadratic but cubic?) If you have good reasons to think that when the model is wrong, it can only be wrong in certain ways, fine; if not, though, why only check for those errors? Nonparametric regression effectively lets you check for all kinds of systematic errors, rather than singling out a particular one.

[1] In my experience, this is second in popularity only to ignoring the issue.
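As a concrete version of the quadratic-term check just described, one could fit the models of Equations 9.2 and 9.3 and compare them with a t-test on the extra coefficient or an F-test on the improvement in fit. This is only an illustrative sketch; the data frame df and its columns y, x1 and x2 are hypothetical placeholders, not objects from the text.

# Hypothetical data frame `df` with columns y, x1, x2
fit.lin <- lm(y ~ x1 + x2, data = df)              # the model of Eq. 9.2
fit.quad <- lm(y ~ x1 + x2 + I(x1^2), data = df)   # the model of Eq. 9.3
summary(fit.quad)$coefficients["I(x1^2)", ]        # is the estimated theta_3 significantly non-zero?
anova(fit.lin, fit.quad)                           # F-test for the reduction in residual sum of squares

As the text emphasizes, such a check only has power against the particular departure (here, a quadratic in x_1) that we thought to write down.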

There are three basic approaches, which I give in order of increasing sophistication.

• If the parametric model is right, it should predict as well as, or even better than, the non-parametric one, and we can check whether MSE_p(\hat{\theta}) - MSE_{np}(\hat{\mu}) is sufficiently small.

• If the parametric model is right, the non-parametric estimated regression curve should be very close to the parametric one. So we can check whether f(x; \hat{\theta}) - \hat{\mu}(x) is approximately zero everywhere.

• If the parametric model is right, then its residuals should be patternless and independent of input features, because

E[Y - f(x; \theta) | X] = E[f(x; \theta) + \epsilon - f(x; \theta) | X] = E[\epsilon | X] = 0    (9.4)

So we can apply non-parametric smoothing to the parametric residuals, y - f(x; \hat{\theta}), and see if their expectation is approximately zero everywhere.

We'll stick with the first procedure, because it's simpler for us to implement computationally. However, it turns out to be easier to develop theory for the other two, and especially for the third; see Li and Racine (2007, ch. 12), or Hart (1997).

Here is the basic procedure.

1. Get data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
2. Fit the parametric model, getting an estimate \hat{\theta}, and in-sample mean-squared error MSE_p(\hat{\theta}).
3. Fit your favorite nonparametric regression (using cross-validation to pick control settings as necessary), getting curve \hat{\mu} and in-sample mean-squared error MSE_{np}(\hat{\mu}).
4. Calculate \hat{d} = MSE_p(\hat{\theta}) - MSE_{np}(\hat{\mu}).
5. Simulate from the parametric model \hat{\theta} to get faked data (x^*_1, y^*_1), ..., (x^*_n, y^*_n).
   1. Fit the parametric model to the simulated data, getting estimate \hat{\theta}^* and MSE_p(\hat{\theta}^*).
   2. Fit the nonparametric model to the simulated data, getting estimate \hat{\mu}^* and MSE_{np}(\hat{\mu}^*).
   3. Calculate D^* = MSE_p(\hat{\theta}^*) - MSE_{np}(\hat{\mu}^*).
6. Repeat step 5 b times to get an estimate of the distribution of D under the null hypothesis.
7. The approximate p-value is (1 + #{D^* > \hat{d}}) / (1 + b).
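Schematically, steps 4 through 7 fit in a few lines of R. This is only a sketch: calc.diff and simulate.null are placeholder arguments standing for "compute the difference in in-sample MSEs on a data set" and "draw a new data set from the parametric model fitted to the real data"; concrete functions playing these roles for the linear-model/kernel-smoother pair (calc.D and sim.lm) appear later in the chapter.

spec.test.pvalue <- function(data, calc.diff, simulate.null, b = 200) {
    d.hat <- calc.diff(data)                                  # step 4: observed statistic
    null.d <- replicate(b, calc.diff(simulate.null(data)))    # steps 5 and 6: null distribution
    return((1 + sum(null.d > d.hat))/(1 + b))                 # step 7: approximate p-value
}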

Let's step through the logic. In general, the error of the non-parametric model will be converging to the smallest level compatible with the intrinsic noise of the process. What about the parametric model?

Suppose on the one hand that the parametric model is correctly specified. Then its error will also be converging to the minimum: by assumption, it's got the functional form right, so bias will go to zero, and as \hat{\theta} \to \theta_0, the variance will also go to zero. In fact, with enough data the correctly-specified parametric model will actually generalize better than the non-parametric model.[2]

Suppose on the other hand that the parametric model is mis-specified. Then its predictions are systematically wrong, even with unlimited amounts of data: there's some bias which never goes away, no matter how big the sample. Since the non-parametric smoother does eventually come arbitrarily close to the true regression function, the smoother will end up predicting better than the parametric model.

Smaller errors for the smoother, then, suggest that the parametric model is wrong. But since the smoother has higher capacity, it could easily get smaller errors on a particular sample by chance and/or over-fitting, so only big differences in error count as evidence. Simulating from the parametric model gives us surrogate data which looks just like reality ought to, if the model is true. We then see how much better we could expect the non-parametric smoother to fit under the parametric model. If the non-parametric smoother fits the actual data much better than this, we can reject the parametric model with high confidence: it's really unlikely that we'd see that big an improvement from using the nonparametric model just by luck.[3] As usual, we simulate from the parametric model simply because we have no hope of working out the distribution of the differences in MSEs from first principles. This is an example of our general strategy of bootstrapping.

9.1.1 Examples of Testing a Parametric Model

Let's see this in action. First, let's detect a reasonably subtle nonlinearity. Take the non-linear function g(x) = \log(1 + x), and say that Y = g(x) + \epsilon, with \epsilon being IID Gaussian noise with mean 0 and standard deviation 0.15. (This is one of the examples from §4.2.) Figure 9.1 shows the regression function and the data. The nonlinearity is clear with the curve to "guide the eye", but fairly subtle. A simple linear regression looks pretty good:

glinfit = lm(y ~ x, data = gframe)
print(summary(glinfit), signif.stars = FALSE, digits = 2)
##
## Call:
## lm(formula = y ~ x, data = gframe)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -0.499 -0.091  0.002  0.106  0.425
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.182      0.017      10   <2e-16
## x              0.434      0.010      43   <2e-16
##
## Residual standard error: 0.15 on 298 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.86
## F-statistic: 1.8e+03 on 1 and 298 DF, p-value: <2e-16

[2] Remember that the smoother must, so to speak, use up some of the information in the data to figure out the shape of the regression function. The parametric model, on the other hand, takes that basic shape as given, and uses all the data's information to tune its parameters.

[3] As usual with p-values, this is not symmetric. A high p-value might mean that the true regression function is very close to \mu(x; \theta), or it might mean that we don't have enough data to draw conclusions (or that we were unlucky).

[Figure 9.1: scatter-plot of the simulated data with the true regression curve overlaid in grey; generated by the code below.]

x <- runif(300, 0, 3)
yg <- log(x + 1) + rnorm(length(x), 0, 0.15)
gframe <- data.frame(x = x, y = yg)
plot(x, yg, xlab = "x", ylab = "y", pch = 16, cex = 0.5)
curve(log(1 + x), col = "grey", add = TRUE, lwd = 4)

Figure 9.1 True regression curve (grey) and data points (circles). The curve is g(x) = \log(1 + x).

[Figure 9.2: as in the previous figure, but adding the least-squares regression line (black). Line widths exaggerated for clarity.]

R^2 is ridiculously high: the regression line preserves 86 percent of the variance in the data. The p-value reported by R is also very, very low, but remember all this really means is "you'd have to be crazy to think a flat line fit better than a straight line with a slope" (Figure 9.2).

sim.lm <- function(linfit, test.x) {
    # simulate response values at test.x from the fitted linear model linfit,
    # adding homoskedastic Gaussian noise
    n <- length(test.x)
    sim.frame <- data.frame(x = test.x)
    # scale R's noise estimate back down toward the in-sample level
    sigma <- summary(linfit)$sigma * (n - 2)/n
    y.sim <- predict(linfit, newdata = sim.frame)
    y.sim <- y.sim + rnorm(n, 0, sigma)
    sim.frame <- data.frame(sim.frame, y = y.sim)
    return(sim.frame)
}

Code Example 21: Simulate a new data set from a linear model, assuming homoskedastic Gaussian noise. It also assumes that there is one input variable, x, and that the response variable is called y. Could you modify it to work with multiple regression?

calc.D <- function(data) {
    # difference in in-sample MSEs: parametric (linear) minus nonparametric (kernel)
    MSE.p <- mean((lm(y ~ x, data = data)$residuals)^2)
    MSE.np.bw <- npregbw(y ~ x, data = data)   # re-select the bandwidth by cross-validation
    MSE.np <- npreg(MSE.np.bw)$MSE
    return(MSE.p - MSE.np)
}

Code Example 22: Calculate the difference-in-MSEs test statistic.

The in-sample MSE of the linear fit is[4]

signif(mean(residuals(glinfit)^2), 3)
## [1] 0.0233

The nonparametric regression has a somewhat smaller MSE:[5]

library(np)
gnpr <- npreg(y ~ x, data = gframe)
signif(gnpr$MSE, 3)
## [1] 0.0204

So \hat{d} is

signif((d.hat = mean(glinfit$residual^2) - gnpr$MSE), 3)
## [1] 0.00294

Now we need to simulate from the fitted parametric model, using its estimated coefficients and noise level. We have seen several times now how to do this. The function sim.lm in Code Example 21 does this, along the same lines as the examples in Chapter 6; it assumes homoskedastic Gaussian noise. Again, as before, we need a function which will calculate the difference in MSEs between a linear model and a kernel smoother fit to the same data set; calc.D in Code Example 22 does automatically what we did by hand above. Note that the kernel bandwidth has to be re-tuned to each new data set. If we call calc.D on the output of sim.lm, we get one value of the test statistic under the null distribution:

[4] If we ask R for the MSE, by squaring summary(glinfit)$sigma, we get 0.0234815. This differs from the mean of the squared residuals by a factor of n/(n - 2) = 300/298 = 1.0067, because R is trying to estimate the out-of-sample error by scaling up the in-sample error, the same way the estimated population variance scales up the sample variance. We want to compare in-sample fits.

[5] npreg does not apply the kind of correction mentioned in the previous footnote.

calc.D(sim.lm(glinfit, x))
## [1] 0.0005368707

Now we just repeat this a lot to get a good approximation to the sampling distribution of D under the null hypothesis:

null.samples.D <- replicate(200, calc.D(sim.lm(glinfit, x)))

This takes some time, because each replication involves not just generating a new simulation sample, but also cross-validation to pick a bandwidth. This adds up to about a second per replicate on my laptop, and so a couple of minutes for 200 replicates.

(While the computer is thinking, look at the command a little more closely. It leaves the x values alone, and only uses simulation to generate new y values. This is appropriate here because our model doesn't really say where the x values came from; it's just about the conditional distribution of Y given X. If the model we were testing specified a distribution for X, we should generate x each time we invoke calc.D. If the specification is vague, like "x is IID" but with no particular distribution, then resample X.)

When it's done, we can plot the distribution and see that the observed value \hat{d} is pretty far out along the right tail (Figure 9.3). This tells us that it's very unlikely that npreg would improve so much on the linear model if the latter were true. In fact, exactly 0 of the simulated values of the test statistic were that big:

sum(null.samples.D > d.hat)
## [1] 0

Thus our estimated p-value is ≤ 0.00498. We can reject the linear model pretty confidently.[6]

As a second example, let's suppose that the linear model is right; then the test should give us a high p-value. So let us stipulate that in reality

Y = 0.2 + 0.5 x + \eta    (9.5)

with \eta \sim N(0, 0.15^2). Figure 9.4 shows data from this, of the same size as before. Repeating the same exercise as before, we get \hat{d} = 7.7 \times 10^{-4}, together with a slightly different null distribution (Figure 9.5). Now the p-value is 0.3, so it would be quite rash to reject the linear model.

9.1.2 Remarks

Other Nonparametric Regressions

There is nothing especially magical about using kernel regression here. Any consistent nonparametric estimator (say, your favorite spline) would work. They may differ somewhat in their answers on particular cases.

[6] If we wanted a more precise estimate of the p-value, we'd need to use more bootstrap samples.
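To make that remark concrete, here is what a spline-based version of calc.D might look like; this is a sketch, not code from the text, using R's built-in smooth.spline (which picks its smoothing level by generalized cross-validation) in place of the kernel smoother.

calc.D.spline <- function(data) {
    # same test statistic as calc.D, but with a smoothing spline as the
    # nonparametric alternative
    MSE.p <- mean((lm(y ~ x, data = data)$residuals)^2)
    spline.fit <- smooth.spline(x = data$x, y = data$y)
    # predict() keeps the fitted values lined up with the original data order
    MSE.np <- mean((data$y - predict(spline.fit, x = data$x)$y)^2)
    return(MSE.p - MSE.np)
}

Swapping this in for calc.D, with everything else unchanged, gives a spline-based version of the same test.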

hist(null.samples.D, n = 31, xlim = c(min(null.samples.D), 1.1 * d.hat),
    probability = TRUE)
abline(v = d.hat)

Figure 9.3: Histogram of the distribution of D = MSE_p − MSE_np for data simulated from the parametric model. The vertical line marks the observed value. Notice that the mode is positive and the distribution is right-skewed; this is typical.

Curse of Dimensionality

For multivariate regressions, testing against a fully nonparametric alternative can be very time-consuming, as well as running up against curse-of-dimensionality

issues.[7] A compromise is to test the parametric regression against an additive model. Essentially nothing has to change.

[7] This curse manifests itself here as a loss of power in the test. Said another way, because unconstrained non-parametric regression must use a lot of data points just to determine the general shape of the regression function, even more data is needed to tell whether a particular parametric guess is wrong.

y2 <- 0.2 + 0.5 * x + rnorm(length(x), 0, 0.15)
y2.frame <- data.frame(x = x, y = y2)
plot(x, y2, xlab = "x", ylab = "y")
abline(0.2, 0.5, col = "grey", lwd = 2)

Figure 9.4: Data from the linear model (true regression line in grey).

Figure 9.5: As in Figure 9.3, but using the data and fits from Figure 9.4.

Testing E[ε̂ | X] = 0

I mentioned at the beginning of the chapter that one way to test whether the parametric model is correctly specified is to test whether the residuals have expectation zero everywhere. Setting r(x; m) ≡ E[Y − m(X) | X = x], we know from Chapter 1 that r(x; μ) = 0 everywhere, and that, for any other function m, r(x; m) ≠ 0 for at least some values of x. Thus, if we take the residuals from our parametric model and we smooth them, we get an estimated function r̂(x) that should be converging to 0 everywhere if the parametric model is well-specified.
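As a concrete illustration of this idea (the full testing procedure is spelled out just below), here is a minimal sketch of the smoothing step, in the style of the running example. It assumes the fitted linear model glinfit and the data frame gframe from above; the object names r.hat and size.rhat are mine, not the text's.

# Smooth the parametric residuals against x, and measure the "size" of the
# resulting curve by the average of its squared fitted values.
library(np)
r.hat <- npreg(residuals(glinfit) ~ gframe$x)
size.rhat <- mean(fitted(r.hat)^2)
size.rhat
# To calibrate this, one would repeat the same two steps on data simulated
# from the fitted parametric model (e.g., with sim.lm above) and compare.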

A natural test statistic is therefore some measure of the "size" of r̂,[8] such as ∫ r̂²(x) dx, or ∫ r̂²(x) f(x) dx (where f(x) is the pdf of X). (The latter, in particular, can be approximated by n⁻¹ Σᵢ₌₁ⁿ r̂²(xᵢ).) Our testing procedure would then amount to (i) finding the residuals by fitting the parametric model, (ii) smoothing the residuals to get r̂, (iii) calculating the size of r̂, and (iv) simulating to get a distribution for how big r̂ should be, under the null hypothesis that the parametric model is right.

An alternative to measuring the size of the expected-residuals function would be to try to predict the residuals. We would compare the MSE of the "model" that the residuals have conditional expectation 0 everywhere to the MSE of the model that predicts the residuals by smoothing against X, and proceed much as before.[9]

Stabilizing the Sampling Distribution of the Test Statistic

I have just looked at the difference in MSEs. The bootstrap principle being invoked is that the sampling distribution of the test statistic, under the estimated parametric model, should be close to the distribution under the true parameter value. As discussed in Chapter 6, sometimes some massaging of the test statistic helps bring these distributions closer. Some modifications to consider:

• Divide the MSE difference by an estimate of the noise σ.
• Divide by an estimate of the noise σ times the difference in degrees of freedom, using the effective degrees of freedom (§1.5.3.2) of the nonparametric regression.
• Use the log of the ratio in MSEs instead of the MSE difference.

Doing a double bootstrap can help you assess whether these are necessary.

9.2 Why Use Parametric Models At All?

It might seem by this point that there is little point to using parametric models at all. Either our favorite parametric model is right, or it isn't. If it is right, then a consistent nonparametric estimate will eventually approximate it arbitrarily closely. If the parametric model is wrong, it will not self-correct, but the nonparametric estimate will eventually show us that the parametric model doesn't work. Either way, the parametric model seems superfluous.

There are two things wrong with this line of reasoning — two good reasons to use parametric models.

1. One use of statistical models, like regression models, is to connect scientific theories to data. The theories are ideas about the mechanisms generating the data. Sometimes these ideas are precise enough to tell us what the functional form of the regression should be, or even what the distribution of noise terms should be, but still contain unknown parameters. In this case, the parameters

[8] If you've taken functional analysis or measure theory, you may recognize these as the (squared) L₂ and L₂(f) norms of the function r̂.
[9] Can you write the difference in MSEs for the residuals in terms of either of the measures of the size of r̂?

themselves are substantively meaningful and interesting — we don't just care about prediction.[10]

2. Even if all we care about is prediction accuracy, there is still the bias-variance trade-off to consider. Non-parametric smoothers will have larger variance in their predictions, at the same sample size, than correctly-specified parametric models, simply because the former are more flexible. Both models are converging on the true regression function, but the parametric model converges faster, because it searches over a more confined space. In terms of total prediction error, the parametric model's low variance plus vanishing bias beats the non-parametric smoother's larger variance plus vanishing bias. (Remember that this is part of the logic of testing parametric models in the previous section.) In the next section, we will see that this argument can actually be pushed further, to work with not-quite-correctly specified models.

Of course, both of these advantages of parametric models only obtain if they are well-specified. If we want to claim those advantages, we need to check the specification.

9.2.1 Why We Sometimes Want Mis-Specified Parametric Models

Low-dimensional parametric models have potentially high bias (if the real regression curve is very different from what the model posits), but low variance (because there isn't that much to estimate). Non-parametric regression models have low bias (they're flexible) but high variance (they're flexible). If the parametric model is true, it can converge faster than the non-parametric one. Even if the parametric model isn't quite true, a small bias plus low variance can sometimes still beat a non-parametric smoother's smaller bias and substantial variance. With enough data the non-parametric smoother will eventually over-take the mis-specified parametric model, but with small samples we might be better off embracing bias.

To illustrate, suppose that the true regression function is

    E[Y | X = x] = 0.2 + (x/2)(1 + sin(x)/10)    (9.6)

This is very nearly linear over small ranges — say x ∈ [0, 3] (Figure 9.6). I will use the fact that I know the true model here to calculate the actual expected generalization error, by averaging over many samples (Code Example 23). Figure 9.7 shows that, out to a fairly substantial sample size (≈ 500), the lower bias of the non-parametric regression is systematically beaten by the lower variance of the linear model — though admittedly not by much.

[10] On the other hand, it is not uncommon for scientists to write down theories positing linear relationships between variables, not because they actually believe that, but because that's the only thing they know how to estimate statistically.

h <- function(x) { 0.2 + 0.5 * (1 + sin(x)/10) * x }
curve(h(x), from = 0, to = 3)

Figure 9.6: Graph of h(x) = 0.2 + (x/2)(1 + sin(x)/10) over [0, 3].

nearly.linear.out.of.sample = function(n) {
    x <- seq(from = 0, to = 3, length.out = n)
    y <- h(x) + rnorm(n, 0, 0.15)
    data <- data.frame(x = x, y = y)
    y.new <- h(x) + rnorm(n, 0, 0.15)
    sim.lm <- lm(y ~ x, data = data)
    lm.mse <- mean((fitted(sim.lm) - y.new)^2)
    sim.np.bw <- npregbw(y ~ x, data = data)
    sim.np <- npreg(sim.np.bw)
    np.mse <- mean((fitted(sim.np) - y.new)^2)
    mses <- c(lm.mse, np.mse)
    return(mses)
}

nearly.linear.generalization <- function(n, m = 100) {
    raw <- replicate(m, nearly.linear.out.of.sample(n))
    reduced <- rowMeans(raw)
    return(reduced)
}

Code Example 23: Evaluating the out-of-sample error for the nearly-linear problem as a function of n, and evaluating the generalization error by averaging over many samples.

sizes <- c(5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000)
generalizations <- sapply(sizes, nearly.linear.generalization)
plot(sizes, sqrt(generalizations[1, ]), type = "l", xlab = "n",
    ylab = "RMS generalization error", log = "xy",
    ylim = range(sqrt(generalizations)))
lines(sizes, sqrt(generalizations[2, ]), lty = "dashed")
abline(h = 0.15, col = "grey")

Figure 9.7: Root-mean-square generalization error for the linear model (solid line) and the kernel smoother (dashed line), fit to the same sample of the indicated size. The true regression curve is as in Figure 9.6, and observations are corrupted by IID Gaussian noise with σ = 0.15 (grey horizontal line). The cross-over after which the nonparametric regressor has better generalization performance happens shortly before n = 500.

9.3 Further Reading

This chapter has been on specification testing for regression models, focusing on whether they are correctly specified for the conditional expectation function. I am not aware of any other treatment of this topic at this level, other than the not-wholly-independent Spain et al. (2012). If you have somewhat more statistical theory than this book demands, there are very good treatments of related tests in Li and Racine (2007), and of tests based on smoothing residuals in Hart (1997).

Econometrics seems to have more of a tradition of formal specification testing than many other branches of statistics. Godfrey (1988) reviews tests based on looking for parametric extensions of the model, i.e., refinements of the idea of testing whether θ₃ = 0 in Eq. 9.3. White (1994) combines a detailed theory of specification testing within parametric stochastic models, not presuming any particular parametric model is correct, with an analysis of when we can and cannot still draw useful inferences from estimates within a mis-specified model. Because of its generality, it, too, is at a higher theoretical level than this book, but is strongly recommended. White was also the co-author of a paper (Hong and White, 1995) presenting a theoretical analysis of the difference-in-MSEs test used in this chapter, albeit for a particular sort of nonparametric regression we've not really touched on.

We will return to specification testing in Appendix E and Chapter 15, but for models of distributions, rather than regressions.

10 Moving Beyond Conditional Expectations: Weighted Least Squares, Heteroskedasticity, Local Polynomial Regression

So far, all our estimates have been based on the mean squared error, giving equal importance to all observations, as is generally appropriate when looking at conditional expectations. In this chapter, we'll start to work with giving more or less weight to different observations, through weighted least squares. The oldest reason to want to use weighted least squares is to deal with non-constant variance, or heteroskedasticity, by giving more weight to lower-variance observations. This leads us naturally to estimating the conditional variance function, just as we've been estimating conditional expectations. On the other hand, weighted least squares lets us generalize kernel regression to locally polynomial regression.

10.1 Weighted Least Squares

When we use ordinary least squares to estimate linear regression, we (naturally) minimize the mean squared error:

    MSE(β) = (1/n) Σᵢ₌₁ⁿ (yᵢ − xᵢ · β)²    (10.1)

The solution is of course

    β̂_OLS = (xᵀx)⁻¹ xᵀy    (10.2)

We could instead minimize the weighted mean squared error,

    WMSE(β, w) = (1/n) Σᵢ₌₁ⁿ wᵢ (yᵢ − xᵢ · β)²    (10.3)

This includes ordinary least squares as the special case where all the weights wᵢ = 1. We can solve it by the same kind of linear algebra we used to solve the ordinary linear least squares problem. If we write w for the matrix with the wᵢ on the diagonal and zeroes everywhere else, the solution is

    β̂_WLS = (xᵀwx)⁻¹ xᵀwy    (10.4)
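To make Eq. 10.4 concrete, here is a small sketch (not from the text) checking that lm() with a weights argument reproduces the explicit matrix solution. The data and weights (x.demo, y.demo, w.demo) are made up purely for illustration.

# Weighted least squares two ways, on made-up data.
set.seed(1)
x.demo <- runif(100, -5, 5)
y.demo <- 3 - 2 * x.demo + rnorm(100)
w.demo <- runif(100, 0.5, 2)          # arbitrary positive weights, for illustration
# Built-in: lm() with weights minimizes the weighted MSE of Eq. 10.3
coef(lm(y.demo ~ x.demo, weights = w.demo))
# By hand: Eq. 10.4, with a column of 1s for the intercept
X <- cbind(1, x.demo)
W <- diag(w.demo)
solve(t(X) %*% W %*% X, t(X) %*% W %*% y.demo)
# The two sets of coefficients agree.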

But why would we want to minimize Eq. 10.3?

1. Focusing accuracy. We may care more strongly about predicting the response for certain values of the input — ones we expect to see often again, ones where mistakes are especially costly or embarrassing or painful, etc. — than for others. If we give the points near those values big weights wᵢ, and points elsewhere smaller weights, the regression will be pulled towards matching the data in that region.

2. Discounting imprecision. Ordinary least squares is the maximum likelihood estimate when the noise ε in Y = X · β + ε is IID Gaussian white noise. This means that the variance of ε has to be constant, and we measure the regression curve with the same precision everywhere. This situation, of constant noise variance, is called homoskedasticity. Often however the magnitude of the noise is not constant, and the data are heteroskedastic.

   When we have heteroskedasticity, even if each noise term is still Gaussian, ordinary least squares is no longer the maximum likelihood estimate, and so no longer efficient. If however we know the noise variance σᵢ² at each measurement i, and set wᵢ = 1/σᵢ², we get the heteroskedastic MLE, and recover efficiency. (See below.)

   To say the same thing slightly differently, there's just no way that we can estimate the regression function as accurately where the noise is large as we can where the noise is small. Trying to give equal attention to all parts of the input space is a waste of time; we should be more concerned about fitting well where the noise is small, and expect to fit poorly where the noise is big.

3. Sampling bias. In many situations, our data comes from a survey, and some members of the population may be more likely to be included in the sample than others. When this happens, the sample is a biased representation of the population. If we want to draw inferences about the population, it can help to give more weight to the kinds of data points which we've under-sampled, and less to those which were over-sampled. In fact, typically the weight put on data point i would be inversely proportional to the probability of i being included in the sample (Exercise 10.1). Strictly speaking, if we are willing to believe that the linear model is exactly correct, that there are no omitted variables, and that the inclusion probabilities pᵢ do not vary with yᵢ, then this sort of survey weighting is redundant (DuMouchel and Duncan, 1983). When those assumptions are not met — when there are non-linearities, omitted variables, or "selection on the dependent variable" — survey weighting is advisable, if we know the inclusion probabilities fairly well.

   The same trick works under the same conditions when we deal with "covariate shift", a change in the distribution of X. If the old probability density function was p(x) and the new one is q(x), the weight we'd want to use is wᵢ = q(xᵢ)/p(xᵢ) (Quiñonero-Candela et al., 2009). This can involve estimating both densities, or their ratio (Chapter 14); see the sketch after this list.

4. Doing something else. There are a number of other optimization problems which can be transformed into, or approximated by, weighted least squares. The most important of these arises from generalized linear models, where the mean response is some nonlinear function of a linear predictor; we will look at them in Chapters 11 and 12.
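A hedged sketch of the covariate-shift weighting mentioned in point 3: estimate the old and new densities of X and weight each training point by their ratio. The function name, the use of base R's density(), and the variable names are my illustrative choices, not the text's; Chapter 14 discusses density estimation properly.

# Sketch: covariate-shift weights w_i = q(x_i)/p(x_i), both densities estimated
# by kernel density estimation. x.old: training inputs; x.new: inputs drawn
# under the new distribution.
covariate.shift.weights <- function(x.old, x.new) {
    p.hat <- density(x.old)   # estimated density under the old distribution
    q.hat <- density(x.new)   # estimated density under the new distribution
    # evaluate both estimates at the training points (constant extrapolation at the ends)
    p.at.old <- approx(p.hat$x, p.hat$y, xout = x.old, rule = 2)$y
    q.at.old <- approx(q.hat$x, q.hat$y, xout = x.old, rule = 2)$y
    return(q.at.old/p.at.old)
}
# hypothetical usage:
# lm(y ~ x, data = old.data, weights = covariate.shift.weights(old.data$x, new.x))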

In the first case, we decide on the weights to reflect our priorities. In the third case, the weights come from the optimization problem we'd really rather be solving. What about the second case, of heteroskedasticity?

Figure 10.1: Black line: linear response function (y = 3 − 2x). Grey curve: standard deviation as a function of x (σ(x) = 1 + x²/2). (Code deliberately omitted; can you reproduce this figure?)

10.2 Heteroskedasticity

Suppose the noise variance is itself variable. For example, Figure 10.1 shows a simple linear relationship between the input X and the response Y, but also a nonlinear relationship between X and V[Y].

In this particular case, the ordinary least squares estimate of the regression line is 2.69 − 1.36x, with R reporting standard errors in the coefficients of ±0.71

and 0.24, respectively. Those are however calculated under the assumption that the noise is homoskedastic, which it isn't. And in fact we can see, pretty much, that there is heteroskedasticity — if looking at the scatter-plot didn't convince us, we could always plot the residuals against x, which we should do anyway.

plot(x, y)
abline(a = 3, b = -2, col = "grey")
fit.ols = lm(y ~ x)
abline(fit.ols, lty = "dashed")

Figure 10.2: Scatter-plot of n = 100 data points from the above model. (Here X is Gaussian with mean 0 and variance 9.) Grey: true regression line. Dashed: ordinary least squares regression line.

par(mfrow = c(1, 2))
plot(x, residuals(fit.ols))
plot(x, (residuals(fit.ols))^2)
par(mfrow = c(1, 1))

Figure 10.3: Residuals (left) and squared residuals (right) of the ordinary least squares regression as a function of x. Note the much greater range of the residuals at large absolute values of x than towards the center; this changing dispersion is a sign of heteroskedasticity.

To see whether that makes a difference, let's re-do this many times with different draws from the same model (Code Example 24).

ols.heterosked.example = function(n) {
    y = 3 - 2 * x + rnorm(n, 0, sapply(x, function(x) {
        1 + 0.5 * x^2
    }))
    fit.ols = lm(y ~ x)
    return(fit.ols$coefficients - c(3, -2))
}

ols.heterosked.error.stats = function(n, m = 10000) {
    ols.errors.raw = t(replicate(m, ols.heterosked.example(n)))
    intercept.se = sd(ols.errors.raw[, "(Intercept)"])
    slope.se = sd(ols.errors.raw[, "x"])
    return(c(intercept.se = intercept.se, slope.se = slope.se))
}

Code Example 24: Functions to generate heteroskedastic data and fit OLS regression to it, and to collect error statistics on the results.

Running ols.heterosked.error.stats(1e4) produces 10⁴ random simulated data sets, which all have the same x values as the first one, but different values of y, generated however from the same model. It then uses those samples to get the standard error of the ordinary least squares estimates. (Bias remains a non-issue.) What we find is that the standard error of the intercept is only a little inflated (simulation value of 0.81 versus official value of 0.71), but the standard error of the slope is much larger than what R reports, 0.55 versus 0.24. Since the intercept is fixed by the need to make the regression line go through the center of the data (Chapter 2), the real issue here is that our estimate of the slope is much less precise than ordinary least squares makes it out to be. Our estimate is still consistent, but not as good as it was when things were homoskedastic. Can we get back some of that efficiency?

10.2.1 Weighted Least Squares as a Solution to Heteroskedasticity

Suppose we visit the Oracle of Regression (Figure 10.4), who tells us that the noise has a standard deviation that goes as 1 + x²/2. We can then use this to improve our regression, by solving the weighted least squares problem rather than ordinary least squares (Figure 10.5).

This not only looks better, it is better: the estimated line is now 2.98 − 1.84x, with reported standard errors of 0.3 and 0.18. This checks out with simulation (Code Example 25): the standard errors from the simulation are 0.26 for the intercept and 0.23 for the slope, so R's internal calculations are working very well.

Why does putting these weights into WLS improve things?

10.2.2 Some Explanations for Weighted Least Squares

Qualitatively, the reason WLS with inverse-variance weights works is the following. OLS tries equally hard to match observations at each data point.[1]

[1] Less anthropomorphically, the objective function in Eq. 10.1 has the same derivative with respect to the squared error at each point, ∂MSE/∂(yᵢ − xᵢ · β)² = 1/n.

Figure 10.4: Statistician (right) consulting the Oracle of Regression (left) about the proper weights to use to overcome heteroskedasticity. (Image from http://en.wikipedia.org/wiki/Image:Pythia1.jpg.)

wls.heterosked.example = function(n) {
    y = 3 - 2 * x + rnorm(n, 0, sapply(x, function(x) {
        1 + 0.5 * x^2
    }))
    fit.wls = lm(y ~ x, weights = 1/(1 + 0.5 * x^2))
    return(fit.wls$coefficients - c(3, -2))
}

wls.heterosked.error.stats = function(n, m = 10000) {
    wls.errors.raw = t(replicate(m, wls.heterosked.example(n)))
    intercept.se = sd(wls.errors.raw[, "(Intercept)"])
    slope.se = sd(wls.errors.raw[, "x"])
    return(c(intercept.se = intercept.se, slope.se = slope.se))
}

Code Example 25: Linear regression of heteroskedastic data, using weighted least-squares regression.

Weighted least squares, naturally enough, tries harder to match observations where the weights are big, and less hard to match them where the weights are small. But each yᵢ contains not only the true regression function μ(xᵢ) but also some noise εᵢ. The noise terms have large magnitudes where the variance is large. So we should want to have small weights where the noise variance is large, because there the data tends to be far from the true regression. Conversely, we should put big weights where the noise variance is small, and the data points are close to the true regression.

plot(x, y)
abline(a = 3, b = -2, col = "grey")
fit.ols = lm(y ~ x)
abline(fit.ols, lty = "dashed")
fit.wls = lm(y ~ x, weights = 1/(1 + 0.5 * x^2))
abline(fit.wls, lty = "dotted")

Figure 10.5: Figure 10.2, plus the weighted least squares regression line (dotted).

The qualitative reasoning in the last paragraph doesn't explain why the weights should be inversely proportional to the variances, wᵢ ∝ 1/σ²_{x_i} — why not wᵢ ∝ 1/σ_{x_i}, for instance? Seeing why those are the right weights requires investigating how well different, indeed arbitrary, choices of weights would work. Look at the equation for the WLS estimates again:

    β̂_WLS = (xᵀwx)⁻¹ xᵀwy    (10.5)
          = h(w) y    (10.6)

defining the matrix h(w) = (xᵀwx)⁻¹ xᵀw for brevity. (The notation reminds us that everything depends on the weights in w.) Imagine holding x constant, but repeating the experiment multiple times, so that we get noisy values of y. In each

experiment, Yᵢ = xᵢ · β + εᵢ, where E[εᵢ] = 0 and V[εᵢ] = σ²_{x_i}. So

    β̂_WLS = h(w) x β + h(w) ε    (10.7)
          = β + h(w) ε    (10.8)

Since E[ε] = 0, the WLS estimator is unbiased:

    E[β̂_WLS] = β    (10.9)

In fact, for the jth coefficient,

    β̂_j = β_j + [h(w) ε]_j    (10.10)
        = β_j + Σᵢ₌₁ⁿ h_{ji}(w) εᵢ    (10.11)

Since the WLS estimate is unbiased, it's natural to want it to also have a small variance, and

    V[β̂_j] = Σᵢ₌₁ⁿ h_{ji}(w)² σ²_{x_i}    (10.12)

It can be shown — the result is called the Gauss-Markov theorem — that picking weights to minimize the variance in the WLS estimate has the unique solution wᵢ = 1/σ²_{x_i}. It does not require us to assume the noise is Gaussian,[2] but the proof is a bit tricky, so I will confine it to §10.2.2.1 below.

A less general but easier-to-grasp result comes from adding the assumption that the noise around the regression line is Gaussian — that

    Y = x · β + ε,  ε ∼ N(0, σ²_x)    (10.13)

The log-likelihood is then (Exercise 10.2)

    −(n/2) ln 2π − (1/2) Σᵢ₌₁ⁿ log σ²_{x_i} − (1/2) Σᵢ₌₁ⁿ (yᵢ − xᵢ · β)²/σ²_{x_i}    (10.14)

If we maximize this with respect to β, everything except the final sum is irrelevant, and so we minimize

    Σᵢ₌₁ⁿ (yᵢ − xᵢ · β)²/σ²_{x_i}    (10.15)

which is just weighted least squares with wᵢ = 1/σ²_{x_i}. So, if the probabilistic assumption holds, WLS is the efficient maximum likelihood estimator.

[2] Despite the first part of the name! Gauss himself was much less committed to assuming Gaussian distributions than many later statisticians.

10.2.2.1 Proof of the Gauss-Markov Theorem[3]

We want to prove that, when we are doing weighted least squares for linear regression, the best choice of weights is wᵢ = 1/σ²_{x_i}. We saw that WLS is unbiased (Eq. 10.9), so "best" here means minimizing the variance. We have also already seen (Eq. 10.6) that

    β̂_WLS = h(w) y    (10.16)

where the matrix h(w) is

    h(w) = (xᵀwx)⁻¹ xᵀw    (10.17)

It would be natural to try to write out the variance as a function of the weights w, set the derivative equal to zero, and solve. This is tricky, partly because we need to make sure that all the weights are positive and add up to one, but mostly because of the matrix inversion in the definition of h(w). A slightly less direct approach is actually much cleaner.

Write w₀ for the inverse-variance weight matrix, and h₀ for the hat matrix we get with those weights. Then for any other choice of weights, we have h(w) = h₀ + c. (c is implicitly a function of the weights, but let's suppress that in the notation for brevity.) Since we know all WLS estimates are unbiased, we must have

    (h₀ + c) x β = β    (10.18)

but using the inverse-variance weights is a particular WLS estimate so

    h₀ x β = β    (10.19)

and so we can deduce that

    c x = 0    (10.20)

from unbiasedness.

Now consider the covariance matrix of the estimates, V[β̃]. This will be V[(h₀ + c) Y],

[3] You can skip this section, without loss of continuity.

which we can expand:

    V[β̃] = V[(h₀ + c) Y]    (10.21)
         = (h₀ + c) V[Y] (h₀ + c)ᵀ    (10.22)
         = (h₀ + c) w₀⁻¹ (h₀ + c)ᵀ    (10.23)
         = h₀ w₀⁻¹ h₀ᵀ + c w₀⁻¹ h₀ᵀ + h₀ w₀⁻¹ cᵀ + c w₀⁻¹ cᵀ    (10.24)
         = (xᵀw₀x)⁻¹ xᵀw₀ w₀⁻¹ w₀x (xᵀw₀x)⁻¹
           + c w₀⁻¹ w₀x (xᵀw₀x)⁻¹
           + (xᵀw₀x)⁻¹ xᵀw₀ w₀⁻¹ cᵀ
           + c w₀⁻¹ cᵀ    (10.25)
         = (xᵀw₀x)⁻¹ xᵀw₀x (xᵀw₀x)⁻¹
           + cx (xᵀw₀x)⁻¹ + (xᵀw₀x)⁻¹ xᵀcᵀ
           + c w₀⁻¹ cᵀ    (10.26)
         = (xᵀw₀x)⁻¹ + c w₀⁻¹ cᵀ    (10.27)

where in the last step we use the fact that cx = 0 (and so xᵀcᵀ = 0). Since c w₀⁻¹ cᵀ ≥ 0, because w₀⁻¹ is a positive-definite matrix, we see that the variance is minimized by setting c = 0, and using the inverse-variance weights.

Notes:

1. If all the variances are equal, then we've proved the optimality of OLS.
2. The proof actually works when comparing the inverse-variance weights to any other linear, unbiased estimator; WLS with different weights is just a special case.
3. We can write the WLS problem as that of minimizing (y − xβ)ᵀ w (y − xβ). If we allow w to be a non-diagonal, but still positive-definite, matrix, then we have the generalized least squares problem. This is appropriate when there are correlations between the noise terms at different observations, i.e., when Cov[εᵢ, εⱼ] ≠ 0 even though i ≠ j. In this case, the proof is easily adapted to show that the optimal weight matrix w is the inverse of the noise covariance matrix. (This is why I wrote everything as a function of w.)

10.2.3 Finding the Variance and Weights

All of this was possible because the Oracle told us what the variance function was. What do we do when the Oracle is not available (Figure 10.6)? Sometimes we can work things out for ourselves, without needing an oracle.

• We know, empirically, the precision of our measurement of the response variable — we know how precise our instruments are, or the response is really an average of several measurements with known standard deviations, etc.
• We know how the noise in the response must depend on the input variables. For example, when taking polls or surveys, the variance of the proportions we

find should be inversely proportional to the sample size. So we can make the weights proportional to the sample size.

Both of these outs rely on kinds of background knowledge which are easier to get in the natural or even the social sciences than in many industrial applications. However, there are approaches for other situations which try to use the observed residuals to get estimates of the heteroskedasticity; this is the topic of the next section.

Figure 10.6: The Oracle may be out (left), or too creepy to go visit (right). What then? (Left, the sacred oak of the Oracle of Dodona, copyright 2006 by Flickr user "essayen", http://flickr.com/photos/essayen/245236125/; right, the entrance to the cave of the Sibyl of Cumæ, copyright 2005 by Flickr user "pverdicchio", http://flickr.com/photos/occhio/17923096/. Both used under Creative Commons license.) [[ATTN: Both are only licensed for non-commercial use, so find substitutes OR obtain rights for the for-money version of the book]]

10.3 Estimating Conditional Variance Functions

Remember that there are two equivalent ways of defining the variance:

    V[X] = E[X²] − (E[X])² = E[(X − E[X])²]    (10.28)

The latter is more useful for us when it comes to estimating variance functions. We have already figured out how to estimate means — that's what all this previous work on smoothing and regression is for — and the deviation of a random variable from its mean shows up as a residual.

There are two generic ways to estimate conditional variances, which differ slightly in how they use non-parametric smoothing. We can call these the squared residuals method and the log squared residuals method. Here is how the first one goes.

1. Estimate μ(x) with your favorite regression method, getting μ̂(x).
2. Construct the squared residuals, uᵢ = (yᵢ − μ̂(xᵢ))².

3. Use your favorite non-parametric method to estimate the conditional mean of the uᵢ, call it q̂(x).
4. Predict the variance using σ̂²_x = q̂(x).

The log-squared residuals method goes very similarly.

1. Estimate μ(x) with your favorite regression method, getting μ̂(x).
2. Construct the log squared residuals, zᵢ = log (yᵢ − μ̂(xᵢ))².
3. Use your favorite non-parametric method to estimate the conditional mean of the zᵢ, call it ŝ(x).
4. Predict the variance using σ̂²_x = exp ŝ(x).

The quantity yᵢ − μ̂(xᵢ) is the ith residual. If μ̂ ≈ μ, then the residuals should have mean zero. Consequently the variance of the residuals (which is what we want) should equal the expected squared residual. So squaring the residuals makes sense, and the first method just smoothes these values to get at their expectations.

What about the second method — why the log? Basically, this is a convenience — squares are necessarily non-negative numbers, but lots of regression methods don't easily include constraints like that, and we really don't want to predict negative variances.[4] Taking the log gives us an unbounded range for the regression.

Strictly speaking, we don't need to use non-parametric smoothing for either method. If we had a parametric model for σ²_x, we could just fit the parametric model to the squared residuals (or their logs). But even if you think you know what the variance function should look like, why not check it?

We came to estimating the variance function because of wanting to do weighted least squares, but these methods can be used more generally. It's often important to understand variance in its own right, and this is a general method for estimating it. Our estimate of the variance function depends on first having a good estimate of the regression function.

[4] Occasionally people do things like claiming that gene differences explain more than 100% of the variance in some psychological trait, and so environment and up-bringing contribute negative variance. Some of them — like Alford et al. (2005) — say this with a straight face.
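Both recipes can be written as one short function; this is a sketch of my own (the name cond.variance and the use of npreg for the smoothing step are my choices, not the text's).

# Sketch: estimate a conditional variance function from the residuals of a
# fitted mean regression (any fit with a residuals() method will do).
library(np)
cond.variance <- function(fit, x, method = c("squared", "log-squared")) {
    method <- match.arg(method)
    if (method == "squared") {
        u <- residuals(fit)^2             # squared residuals
        smooth <- npreg(u ~ x)            # smooth them against x
        return(fitted(smooth))            # estimated sigma^2 at each x_i
    } else {
        z <- log(residuals(fit)^2)        # log squared residuals
        smooth <- npreg(z ~ x)
        return(exp(fitted(smooth)))       # back-transform to the variance scale
    }
}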

10.3.1 Iterative Refinement of Mean and Variance: An Example

The estimate σ̂²_x depends on the initial estimate of the regression function μ̂. But, as we saw when we looked at weighted least squares, taking heteroskedasticity into account can change our estimates of the regression function. This suggests an iterative approach, where we alternate between estimating the regression function and the variance function, using each to improve the other. That is, we take either method above, and then, once we have estimated the variance function σ̂²_x, we re-estimate μ̂ using weighted least squares, with weights inversely proportional to our estimated variance. Since this will generally change our estimated regression, it will change the residuals as well. Once the residuals have changed, we should re-estimate the variance function. We keep going around this cycle until the change in the regression function becomes so small that we don't care about further modifications. It's hard to give a strict guarantee, but usually this sort of iterative improvement will converge.

Let's apply this idea to our example. Figure 10.3b already plotted the residuals from OLS. Figure 10.7 shows those squared residuals again, along with the true variance function and the estimated variance function.

The OLS estimate of the regression line is not especially good (β̂₀ = 2.69 versus β₀ = 3, β̂₁ = −1.36 versus β₁ = −2), so the residuals are systematically off, but it's clear from the figure that kernel smoothing of the squared residuals is picking up on the heteroskedasticity, and getting a pretty reasonable picture of the variance function.

Now we use the estimated variance function to re-estimate the regression line, with weighted least squares.

fit.wls1 <- lm(y ~ x, weights = 1/fitted(var1))
coefficients(fit.wls1)
## (Intercept)           x
##    2.978753   -1.905204
var2 <- npreg(residuals(fit.wls1)^2 ~ x)

The slope has changed substantially, and in the right direction (Figure 10.8a). The residuals have also changed (Figure 10.8b), and the new variance function is closer to the truth than the old one.

Since we have a new variance function, we can re-weight the data points and re-estimate the regression:

fit.wls2 <- lm(y ~ x, weights = 1/fitted(var2))
coefficients(fit.wls2)
## (Intercept)           x
##    2.990366   -1.928978
var3 <- npreg(residuals(fit.wls2)^2 ~ x)

Since we know that the true coefficients are 3 and −2, we know that this is moving in the right direction. If I hadn't told you what they were, you could still observe that the difference in coefficients between fit.wls1 and fit.wls2 is smaller than that between fit.ols and fit.wls1, which is a sign that this is converging.

I will spare you the plot of the new regression and of the new residuals. Let's iterate a few more times:

fit.wls3 <- lm(y ~ x, weights = 1/fitted(var3))
coefficients(fit.wls3)
## (Intercept)           x
##    2.990687   -1.929818
var4 <- npreg(residuals(fit.wls3)^2 ~ x)

fit.wls4 <- lm(y ~ x, weights = 1/fitted(var4))
coefficients(fit.wls4)
## (Intercept)           x
##    2.990697   -1.929848

By now, the coefficients of the regression are changing in the fourth significant

digit, and we only have 100 data points, so the imprecision from a limited sample surely swamps the changes we're making, and we might as well stop.

plot(x, residuals(fit.ols)^2, ylab = "squared residuals")
curve((1 + x^2/2)^2, col = "grey", add = TRUE)
require(np)
var1 <- npreg(residuals(fit.ols)^2 ~ x)
grid.x <- seq(from = min(x), to = max(x), length.out = 300)
lines(grid.x, predict(var1, exdat = grid.x))

Figure 10.7: Points: actual squared residuals from the OLS line. Grey curve: true variance function, σ²(x) = (1 + x²/2)². Black curve: kernel smoothing of the squared residuals, using npreg.

Manually going back and forth between estimating the regression function and

estimating the variance function is tedious. We could automate it with a function, which would look something like this:

iterative.wls <- function(x, y, tol = 0.01, max.iter = 100) {
    iteration <- 1
    old.coefs <- NA
    regression <- lm(y ~ x)
    coefs <- coefficients(regression)
    while (any(is.na(old.coefs)) ||
           ((max(abs(coefs - old.coefs)) > tol) && (iteration < max.iter))) {
        variance <- npreg(residuals(regression)^2 ~ x)
        old.coefs <- coefs
        iteration <- iteration + 1
        regression <- lm(y ~ x, weights = 1/fitted(variance))
        coefs <- coefficients(regression)
    }
    return(list(regression = regression, variance = variance, iterations = iteration))
}

This starts by doing an unweighted linear regression, and then alternates between WLS for getting the regression and kernel smoothing for getting the variance. It stops when no parameter of the regression changes by more than tol, or when it's gone around the cycle max.iter times.[5] This code is a bit too inflexible to be really "industrial strength" (what if we wanted to use a data frame, or a more complex regression formula?), but shows the core idea.

[5] The condition in the while loop is a bit complicated, to ensure that the loop is executed at least once. Some languages have an until control structure which would simplify this.

Figure 10.8: Left: As in Figure 10.2, but with the addition of the weighted least squares regression line (dotted), using the estimated variance from Figure 10.7 for weights. Right: As in Figure 10.7, but with the addition of the residuals from the WLS regression (black squares), and the new estimated variance function (dotted curve).
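Assuming the x and y of the running heteroskedastic example are still in the workspace, iterative.wls might be called along these lines (a sketch; output not shown):

its <- iterative.wls(x, y)
coefficients(its$regression)   # final WLS coefficient estimates
its$iterations                 # how many passes around the cycle were needed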

10.3.2 Real Data Example: Old Heteroskedastic

§5.4.2 introduced the geyser data set, which is about predicting the waiting time between consecutive eruptions of the "Old Faithful" geyser at Yellowstone National Park from the duration of the latest eruption. Our exploration there showed that a simple linear model (of the kind often fit to this data in textbooks and elementary classes) is not very good, and raised the suspicion that one important problem was heteroskedasticity. Let's follow up on that, building on the computational work done in that section.

The estimated variance function geyser.var (Figure 10.9) does not look particularly flat, but it comes from applying a fairly complicated procedure (kernel smoothing with data-driven bandwidth selection) to a fairly limited amount of data (299 observations). Maybe that's the amount of wiggliness we should expect to see due to finite-sample fluctuations? To rule this out, we can make surrogate data from the homoskedastic model, treat it the same way as the real data, and plot the resulting variance functions (Figure 10.10). The conditional variance functions estimated from the homoskedastic model are flat or gently varying, with much less range than what's seen in the data.

While that sort of qualitative comparison is genuinely informative, one can also be more quantitative. One might measure heteroskedasticity by, say, evaluating the conditional variance at all the data points, and looking at the ratio of the interquartile range to the median. This would be zero for perfect homoskedasticity, and grow as the dispersion of actual variances around the "typical" variance increased. For the data, this is IQR(fitted(geyser.var))/median(fitted(geyser.var)) = 0.86. Simulations from the OLS model give values around 10⁻¹⁵.

There is nothing particularly special about this measure of heteroskedasticity — after all, I just made it up. The broad point it illustrates is the one made in §5.4.2.1: whenever we have some sort of quantitative summary statistic we can calculate on our real data, we can also calculate the same statistic on realizations of the model, and the difference will then tell us something about how close the simulations, and so the model, come to the data. In this case, we learn that the linear, homoskedastic model seriously understates the variability of this data. That leaves open the question of whether the problem is the linearity or the homoskedasticity; I will leave that question to Exercise 10.6.
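Those "values around 10⁻¹⁵" can be reproduced, in outline, along these lines. This is a hedged sketch: it assumes the rgeyser() simulator from §5.4.2 and the geyser.ols fit from Figure 10.9 are available, and the function name heterosked.measure is mine.

# Sketch: the made-up heteroskedasticity measure (IQR/median of the fitted
# conditional variances), applied to a data frame with waiting and duration.
heterosked.measure <- function(data) {
    fit <- lm(waiting ~ duration, data = data)
    var.func <- npreg(residuals(fit)^2 ~ duration, data = data)
    IQR(fitted(var.func))/median(fitted(var.func))
}
heterosked.measure(geyser)                                      # the real data
null.measures <- replicate(10, heterosked.measure(rgeyser()))   # homoskedastic surrogates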

library(MASS)
data(geyser)
geyser.ols <- lm(waiting ~ duration, data = geyser)
plot(geyser$duration, residuals(geyser.ols)^2, cex = 0.5, pch = 16,
    xlab = "Duration (minutes)",
    ylab = expression(`Squared residuals of linear model `(minutes^2)))
geyser.var <- npreg(residuals(geyser.ols)^2 ~ geyser$duration)
duration.order <- order(geyser$duration)
lines(geyser$duration[duration.order], fitted(geyser.var)[duration.order])
abline(h = summary(geyser.ols)$sigma^2, lty = "dashed")
legend("topleft", legend = c("data", "kernel variance", "homoskedastic (OLS)"),
    lty = c("blank", "solid", "dashed"), pch = c(16, NA, NA))

Figure 10.9: Squared residuals from the linear model of Figure 5.1, plotted against duration, along with the unconditional, homoskedastic variance implicit in OLS (dashed), and a kernel-regression estimate of the conditional variance (solid).

duration.grid <- seq(from = min(geyser$duration), to = max(geyser$duration),
    length.out = 300)
plot(duration.grid, predict(geyser.var, exdat = duration.grid), ylim = c(0, 300),
    type = "l", xlab = "Duration (minutes)",
    ylab = expression(`Squared residuals of linear model `(minutes^2)))
abline(h = summary(geyser.ols)$sigma^2, lty = "dashed")
one.var.func <- function() {
    fit <- lm(waiting ~ duration, data = rgeyser())
    var.func <- npreg(residuals(fit)^2 ~ geyser$duration)
    lines(duration.grid, predict(var.func, exdat = duration.grid), col = "grey")
}
invisible(replicate(30, one.var.func()))

Figure 10.10: The actual conditional variance function estimated from the Old Faithful data (and the linear regression), in black, plus the results of applying the same procedure to simulations from the homoskedastic linear regression model (grey lines; see §5.4.2 for the rgeyser function). The fact that the estimates from the simulations are mostly flat or gently sloped suggests that the changes in variance found in the data are likely too large to just be sampling noise.

10.4 Re-sampling Residuals with Heteroskedasticity

Re-sampling the residuals of a regression, as described in §6.4, assumes that the distribution of fluctuations around the regression curve is the same for all values of the input x. Under heteroskedasticity, this is of course not the case. Nonetheless, we can still re-sample residuals to get bootstrap confidence intervals, standard errors, and so forth, provided we define and scale them properly. If we have a conditional variance function σ̂²(x), as well as the estimated regression function μ̂(x), we can combine them to re-sample heteroskedastic residuals.

1. Construct the standardized residuals, by dividing the actual residuals by the conditional standard deviation:

       ηᵢ = εᵢ / σ̂(xᵢ)    (10.29)

   The ηᵢ should now be all the same magnitude (in distribution!), no matter where xᵢ is in the space of predictors.
2. Re-sample the ηᵢ with replacement, to get η̃₁, ..., η̃ₙ.
3. Set x̃ᵢ = xᵢ.
4. Set ỹᵢ = μ̂(x̃ᵢ) + σ̂(x̃ᵢ) η̃ᵢ.
5. Analyze the surrogate data (x̃₁, ỹ₁), ..., (x̃ₙ, ỹₙ) like it was real data.

Of course, this still assumes that the only difference in distribution for the noise at different values of x is the scale.
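A hedged sketch of this scheme in code, assuming we have functions mu.hat() and sigma.hat() that return the estimated conditional mean and conditional standard deviation at given x values (for instance, wrappers around npreg fits); these names, and resample.hetero.residuals, are mine.

# Sketch: generate one surrogate data set by resampling standardized residuals.
resample.hetero.residuals <- function(x, y, mu.hat, sigma.hat) {
    eta <- (y - mu.hat(x))/sigma.hat(x)                     # step 1: standardize
    eta.tilde <- sample(eta, size = length(eta), replace = TRUE)  # step 2: resample
    y.tilde <- mu.hat(x) + sigma.hat(x) * eta.tilde         # steps 3-4: keep x, rescale
    return(data.frame(x = x, y = y.tilde))                  # step 5: analyze as real data
}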

10.5 Local Linear Regression

Switching gears, recall from Chapter 2 that one reason it can be sensible to use a linear approximation to the true regression function μ is that we can typically Taylor-expand (see the appendix on Taylor approximation) the latter around any point x₀,

    μ(x) = μ(x₀) + Σ_{k=1}^∞ [(x − x₀)^k / k!] (d^k μ / dx^k)|_{x=x₀}    (10.30)

and similarly with all the partial derivatives in higher dimensions. Truncating the series at first order, μ(x) ≈ μ(x₀) + (x − x₀)μ′(x₀), we see the first derivative μ′(x₀) is the best linear prediction coefficient, at least if x is close enough to x₀. The snag in this line of argument is that if μ(x) is nonlinear, then μ′ isn't a constant, and the optimal linear predictor changes depending on where we want to make predictions.

However, statisticians are thrifty people, and having assembled all the machinery for linear regression, they are loath to throw it away just because the fundamental model is wrong. If we can't fit one line, why not fit many? If each point has a different best linear regression, why not estimate them all? Thus the idea of local linear regression: fit a different linear regression everywhere, weighting the data points by how close they are to the point of interest x.[6]

The simplest approach we could take would be to divide up the range of x into so many bins, and fit a separate linear regression for each bin. This has at least three drawbacks. First, we get weird discontinuities at the boundaries between bins. Second, we pick up an odd sort of bias, where our predictions near the boundaries of a bin depend strongly on data from one side of the bin, and not at all on nearby data points just across the border, which is weird. Third, we need to pick the bins.

The next simplest approach would be to first figure out where we want to make a prediction (say x), and do a linear regression with all the data points which were sufficiently close, |xᵢ − x| ≤ h for some h. Now we are basically using a uniform-density kernel to weight the data points. This eliminates two problems from the binning idea — the examples we include are always centered on the x we're trying to get a prediction for, and we just need to pick one bandwidth h rather than placing all the bin boundaries. But still, each example point always has either weight 0 or weight 1, so our predictions change jerkily as training points fall into or out of the window. It generally works nicer to have the weights change more smoothly with the distance, starting off large and then gradually trailing to zero.

By now bells may be going off, as this sounds very similar to kernel regression. In fact, kernel regression is what happens when we truncate Eq. 10.30 at zeroth order, getting locally constant regression. We set up the problem

    μ̂(x) = argmin_m (1/n) Σᵢ₌₁ⁿ wᵢ(x) (yᵢ − m)²    (10.31)

and get the solution

    μ̂(x) = Σᵢ₌₁ⁿ [wᵢ(x) / Σⱼ₌₁ⁿ wⱼ(x)] yᵢ    (10.32)

which is just our kernel regression, when the weights are proportional to the kernels, wᵢ(x) ∝ K(xᵢ, x). (Without loss of generality, we can take the constant of proportionality to be 1.)

What about locally linear regression? The optimization problem is

    (μ̂(x), β̂(x)) = argmin_{m,β} (1/n) Σᵢ₌₁ⁿ wᵢ(x) (yᵢ − m − β · (xᵢ − x))²    (10.33)

where again we can make wᵢ(x) proportional to some kernel function, wᵢ(x) ∝ K(xᵢ, x). To solve this, abuse notation slightly to define zᵢ = (1, xᵢ − x), i.e., the

[6] Some people say "local linear" and some "locally linear".

displacement from x, with a 1 stuck at the beginning to (as usual) handle the intercept. Now, by the machinery above,

    (μ̂(x), β̂(x)) = (zᵀ w(x) z)⁻¹ zᵀ w(x) y    (10.34)

and the prediction is just the intercept, μ̂(x). If you need an estimate of the first derivatives, those are the β̂(x). Eq. 10.34 guarantees that the weights given to each training point change smoothly with x, so the predictions will also change smoothly.[7]

Using a smooth kernel whose density is positive everywhere, like the Gaussian, ensures that the weights will change smoothly. But we could also use a kernel which goes to zero outside some finite range, so long as the kernel rises gradually from zero inside the range. For locally linear regression, a common choice of kernel is therefore the tri-cubic,

    K(xᵢ, x₀) = (1 − (|xᵢ − x₀|/h)³)³    (10.35)

if |xᵢ − x₀| < h, and = 0 otherwise (Figure 10.11).

curve((1 - abs(x)^3)^3, from = -1, to = 1, ylab = "tricubic function")

Figure 10.11: The tricubic kernel, with a broad plateau where |x| ≈ 0, and smooth fall-off to zero at |x| = 1.

[7] Notice that local linear predictors are still linear smoothers as defined in Chapter 1 (i.e., the predictions are linear in the yᵢ), but they are not, strictly speaking, kernel smoothers, since you can't re-write the last equation in the form of a kernel average.
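Eq. 10.34 can be spelled out directly in R. This is an illustrative sketch, not the text's code: it makes a locally linear prediction at a single point x0 by running a weighted least squares fit with tricube weights; the function name local.linear.predict is mine, and the bandwidth h must be supplied.

# Sketch: locally linear prediction at one point, via weighted least squares
# with tricube weights. x and y are the training data; x0 is the point of interest.
local.linear.predict <- function(x0, x, y, h) {
    u <- abs(x - x0)/h
    w <- ifelse(u < 1, (1 - u^3)^3, 0)     # tricube weights, zero outside the window
    fit <- lm(y ~ I(x - x0), weights = w)  # regress on the displacement from x0
    return(unname(coefficients(fit)[1]))   # the intercept is the prediction at x0
}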

10.5.1 For and Against Locally Linear Regression

Why would we use locally linear regression, if we already have kernel regression?

1. You may recall that when we worked out the bias of kernel smoothers (Eq. 4.10 in Chapter 4), we got a contribution that was proportional to μ′(x). If we do an analogous analysis for locally linear regression, the bias is the same, except that this derivative term goes away.

2. Relatedly, that analysis we did of kernel regression tacitly assumed the point we were looking at was in the middle of the training data (or at least rather more than h from the border). The bias gets worse near the edges of the training data. Suppose that the true μ(x) is decreasing in the vicinity of the largest xᵢ. (See the grey curve in Figure 10.12.) When we make our predictions there, in kernel regression we can only average values of yᵢ which tend to be systematically larger than the value we want to predict. This means that our kernel predictions are systematically biased upwards, and the size of the bias grows with μ′(x). (See the black line in Figure 10.12 at the lower right.) If we use a locally linear model, however, it can pick up that there is a trend, and reduce the edge bias by extrapolating it (dashed line in the figure).

3. The predictions of locally linear regression tend to be smoother than those of kernel regression, simply because we are locally fitting a smooth line rather than a flat constant. As a consequence, estimates of the derivative dμ̂/dx tend to be less noisy when μ̂ comes from a locally linear model than from a kernel regression.

Of course, total prediction error depends not only on the bias but also on the variance. Remarkably enough, the variance for kernel regression and locally linear regression is the same, at least asymptotically. Since locally linear regression has smaller bias, local-linear fits are often better predictors.

Despite all these advantages, local linear models have a real drawback. To make a prediction with a kernel smoother, we have to calculate a weighted average. To make a prediction with a local linear model, we have to solve a (weighted) linear least squares problem for each point, or each prediction. This takes much more computing time.[8]

There are several packages which implement locally linear regression. Since we are already using np, one of the simplest is to set regtype="ll" in npreg.[9]

[8] Let's think this through. To find μ̂(x) with a kernel smoother, we need to calculate K(xᵢ, x) for each xᵢ. If we've got p predictor variables and use a product kernel, that takes O(pn) computational steps. We then need to add up the kernels to get the denominator, which we could certainly do in O(n) more steps. (Could you do it faster?) Multiplying each weight by its yᵢ is a further O(n), and the final adding up is at most O(n); total, O(pn). To make a prediction with a local linear model, we need to calculate the right-hand side of Eq. 10.34. Finding zᵀw(x)z means multiplying [(p + 1) × n][n × n][n × (p + 1)] matrices, which will take O((p + 1)²n) = O(p²n) steps. Inverting a q × q matrix takes O(q³) steps, so our inversion takes O((p + 1)³) = O(p³) steps. Just getting (zᵀw(x)z)⁻¹ thus requires O(p²n + p³). Finding the (p + 1) × 1 matrix zᵀw(x)y similarly takes O((p + 1)n) = O(pn) steps, and the final matrix multiplication is O((p + 1)(p + 1)) = O(p²). Total, O(p²n) + O(p³).
The speed advantage of kernel smoothing thus gets increasingly extreme as the number of predictor variables p grows.

x <- runif(30, max = 3)
y <- 9 - x^2 + rnorm(30, sd = 0.1)
plot(x, y)
rug(x, side = 1, col = "grey")
rug(y, side = 2, col = "grey")
curve(9 - x^2, col = "grey", add = TRUE, lwd = 3)
grid.x <- seq(from = 0, to = 3, length.out = 300)
np0 <- npreg(y ~ x)
lines(grid.x, predict(np0, exdat = grid.x))
np1 <- npreg(y ~ x, regtype = "ll")
lines(grid.x, predict(np1, exdat = grid.x), lty = "dashed")

Figure 10.12: Points are samples from the true, nonlinear regression function shown in grey. The solid black line is a kernel regression, and the dashed line is a locally linear regression. Note that the locally linear model is smoother than the kernel regression, and less biased when the true curve has a non-zero derivative at a boundary of the data (far right).

There are several other packages which support it, notably KernSmooth (through its locpoly function). As the name of the latter suggests, there is no reason we have to stop at locally linear models; we could use local polynomials of any order. The main reason to use a higher-order local polynomial, rather than a locally-linear or locally-constant model, is to estimate higher derivatives. Since this is a somewhat specialized topic, I will not say more about it.

10.5.2 Lowess

There is however one additional topic in locally linear models which is worth mentioning. This is the variant called lowess or loess.¹⁰ The basic idea is to fit a locally linear model, with a kernel which goes to zero outside a finite window and rises gradually inside it, typically the tri-cubic I plotted earlier. The wrinkle, however, is that rather than solving a least squares problem, it minimizes a different and more "robust" loss function,

    \operatorname*{argmin}_{\beta(x)} \frac{1}{n} \sum_{i=1}^{n} w_i(x)\, \ell(y_i - \vec{x}_i \cdot \beta)    (10.36)

where ℓ(a) doesn't grow as rapidly for large a as a². The idea is to make the fitting less vulnerable to occasional large outliers, which would have very large squared errors, unless the regression curve went far out of its way to accommodate them. For instance, we might have ℓ(a) = a² if |a| < 1, and ℓ(a) = 2|a| − 1 otherwise.¹¹ There is a large theory of robust estimation, largely parallel to the more familiar least-squares theory. In the interest of space, we won't pursue it further, but lowess is worth mentioning because it's such a common smoothing technique, especially for sheer visualization.

Lowess smoothing is implemented in base R through the function lowess (rather basic), and through the function loess (more sophisticated), as well as in the CRAN package locfit (more sophisticated still). The lowess idea can be combined with local fitting of higher-order polynomials; the loess and locfit commands both support this.

⁹ "ll" stands for "locally linear", of course; the default is regtype="lc", for "locally constant".
¹⁰ I have heard this name explained as an acronym for both "locally weighted scatterplot smoothing" and "locally weighted sum of squares".
¹¹ This is called the Huber loss; it continuously interpolates between looking like squared error and looking like absolute error. This means that when errors are small, it gives results very like least-squares, but it is resistant to outliers. See also App. L.6.1.
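To make these interfaces concrete, here is a minimal sketch, mine rather than the text's, of smoothing the toy data (x, y) simulated for Figure 10.12 with base R's lowess and loess; the evaluation grid and line types are arbitrary choices.

## Illustrative sketch: lowess() and loess() on the Figure 10.12 toy data.
plot(x, y)
curve(9 - x^2, col = "grey", add = TRUE, lwd = 3)   # true regression function
lines(lowess(x, y), lty = "dotted")                 # basic lowess smoother
loess.fit <- loess(y ~ x)                           # more sophisticated interface
eval.x <- seq(from = min(x), to = max(x), length.out = 300)
lines(eval.x, predict(loess.fit, newdata = data.frame(x = eval.x)),
    lty = "dashed")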

10.6 Further Reading

Weighted least squares goes back to the 19th century, almost as far back as ordinary least squares; see the references in Chapters 1 and 2. I am not sure where the use of smoothing to estimate variance functions comes from; I learned it from Wasserman (2006, pp. 87–88). I've occasionally seen it done with a linear model for the conditional variance; I don't recommend that. Simonoff (1996) is a good reference on local linear and local polynomial models, including actually doing the bias-variance analyses where I've just made empty "it can be shown" promises. Fan and Gijbels (1996) is more comprehensive, but also a much harder read. Lowess was introduced by Cleveland (1979), but the name evidently came later (since it doesn't appear in that paper).

Exercises

10.1 Imagine we are trying to estimate the mean value of Y from a large population of size n_0, so ȳ = n_0^{-1} \sum_{j=1}^{n_0} y_j. We observe n members of the population, with individual i being included in our sample with a probability proportional to π_i.
  1. Show that \sum_{i=1}^{n} y_i/\pi_i \big/ \sum_{i=1}^{n} 1/\pi_i is a consistent estimator of ȳ, by showing that it is unbiased and that it has a variance that shrinks towards 0 with n.
  2. Is the unweighted sample mean n^{-1} \sum_{i=1}^{n} y_i a consistent estimator of ȳ when the π_i are not all equal?

10.2 Show that the model of Eq. 10.13 has the log-likelihood given by Eq. 10.14.

10.3 Do the calculus to verify Eq. 10.4.

10.4 Is w_i = 1 a necessary as well as a sufficient condition for Eq. 10.3 and Eq. 10.1 to have the same minimum?

10.5 §10.2.2 showed that WLS gives better parameter estimates than OLS when there is heteroskedasticity, and we know and use the variance. Modify the code from that section to see which one has better generalization error.

10.6 §10.3.2 looked at the residuals of the linear regression model for the Old Faithful geyser data, and showed that they would imply lots of heteroskedasticity. This might, however, be an artifact of inappropriately using a linear model. Use either kernel regression (cf. §6.4.2) or local linear regression to estimate the conditional mean of waiting given duration, and see whether the apparent heteroskedasticity goes away.

10.7 Should local linear regression do better or worse than ordinary least squares under heteroskedasticity? What exactly would this mean, and how might you test your ideas?

11 Logistic Regression

11.1 Modeling Conditional Probabilities

So far, we have either looked at estimating the conditional expectations of continuous variables (as in regression), or at estimating distributions. There are many situations, however, where we are interested in input-output relationships, as in regression, but the output variable is discrete rather than continuous. In particular there are many situations where we have binary outcomes (it snows in Pittsburgh on a given day, or it doesn't; this squirrel carries plague, or it doesn't; this loan will be paid back, or it won't; this person will get heart disease in the next five years, or they won't). In addition to the binary outcome, we have some input variables, which may or may not be continuous. How could we model and analyze such data?

We could try to come up with a rule which guesses the binary output from the input variables. This is called classification, and is an important topic in statistics and machine learning. However, guessing "yes" or "no" is pretty crude, especially if there is no perfect rule. (Why should there be a perfect rule?) Something which takes noise into account, and doesn't just give a binary answer, will often be useful. In short, we want probabilities, which means we need to fit a stochastic model.

What would be nice, in fact, would be to have the conditional distribution of the response Y, given the input variables, Pr(Y | X). This would tell us about how precise our predictions should be. If our model says that there's a 51% chance of snow and it doesn't snow, that's better than if it had said there was a 99% chance of snow (though even a 99% chance is not a sure thing). We will see, in Chapter 14, general approaches to estimating conditional probabilities non-parametrically, which can use the kernels for discrete variables from Chapter 4. While there are a lot of merits to this approach, it does involve coming up with a model for the joint distribution of outputs Y and inputs X, which can be quite time-consuming.

Let's pick one of the classes and call it "1" and the other "0". (It doesn't matter which is which.) Then Y becomes an indicator variable, and you can convince yourself that Pr(Y = 1) = E[Y]. Similarly, Pr(Y = 1 | X = x) = E[Y | X = x]. (In a phrase, "conditional probability is the conditional expectation of the indicator".) This helps us because by this point we know all about estimating conditional expectations. The most straightforward thing for us to do at this point

would be to pick out our favorite smoother and estimate the regression function for the indicator variable; this will be an estimate of the conditional probability function.

There are two reasons not to just plunge ahead with that idea. One is that probabilities must be between 0 and 1, but our smoothers will not necessarily respect that, even if all the observed y_i they get are either 0 or 1. The other is that we might be better off making more use of the fact that we are trying to estimate probabilities, by more explicitly modeling the probability.

Assume that Pr(Y = 1 | X = x) = p(x; θ), for some function p parameterized by θ, and further assume that observations are independent of each other. Then the (conditional) likelihood function is

    \prod_{i=1}^{n} \Pr(Y = y_i \mid X = x_i) = \prod_{i=1}^{n} p(x_i; \theta)^{y_i} (1 - p(x_i; \theta))^{1 - y_i}    (11.1)

Recall that in a sequence of Bernoulli trials y_1, ..., y_n, where there is a constant probability of success p, the likelihood is

    \prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}    (11.2)

As you learned in basic statistics, this likelihood is maximized when p = \hat{p} = n^{-1} \sum_{i=1}^{n} y_i. If each trial had its own success probability p_i, this likelihood becomes

    \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}    (11.3)

Without some constraints, estimating this "inhomogeneous Bernoulli" model by maximum likelihood doesn't work; we'd get \hat{p}_i = 1 when y_i = 1, \hat{p}_i = 0 when y_i = 0, and learn nothing. If on the other hand we assume that the p_i aren't just arbitrary numbers but are linked together, if we model the probabilities, those constraints give non-trivial parameter estimates, and let us generalize. In the kind of model we are talking about, the constraint p_i = p(x_i; θ) tells us that p_i must be the same whenever x_i is the same, and if p is a continuous function, then similar values of x_i must lead to similar values of p_i. Assuming p is known (up to parameters), the likelihood is a function of θ, and we can estimate θ by maximizing the likelihood. This chapter will be about this approach.
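As a quick aside, here is a small numerical check, mine rather than the text's, of the claim above that the homogeneous Bernoulli likelihood is maximized at the sample mean; the simulated data and the use of optimize() are purely illustrative.

## Illustrative check: the Bernoulli log-likelihood peaks at p-hat = mean(y).
set.seed(42)
y.toy <- rbinom(30, size = 1, prob = 0.3)
bernoulli.loglik <- function(p) sum(y.toy * log(p) + (1 - y.toy) * log(1 - p))
optimize(bernoulli.loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
mean(y.toy)  # the two numbers should agree, up to numerical tolerance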

11.2 Logistic Regression

To sum up: we have a binary output variable Y, and we want to model the conditional probability Pr(Y = 1 | X = x) as a function of x; any unknown parameters in the function are to be estimated by maximum likelihood. By now, it will not surprise you to learn that statisticians have approached this problem by asking themselves "how can we use linear regression to solve this?"

1. The most obvious idea is to let p(x) be a linear function of x. Every increment of a component of x would add or subtract so much to the probability. The conceptual problem here is that p must be between 0 and 1, and linear functions are unbounded. Moreover, in many situations we empirically see "diminishing returns": changing p by the same amount requires a bigger change in x when p is already large (or small) than when p is close to 1/2. Linear models can't do this.

2. The next most obvious idea is to let log p(x) be a linear function of x, so that changing an input variable multiplies the probability by a fixed amount. The problem is that logarithms of probabilities are unbounded in only one direction, and linear functions are not.

3. Finally, the easiest modification of log p which has an unbounded range is the logistic (or logit) transformation, log p/(1 − p). We can make this a linear function of x without fear of nonsensical results. (Of course the results could still happen to be wrong, but they're not guaranteed to be wrong.)

This last alternative is logistic regression. Formally, the logistic regression model is that

    \log \frac{p(x)}{1 - p(x)} = \beta_0 + x \cdot \beta    (11.4)

Solving for p, this gives

    p(x; \beta_0, \beta) = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}} = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}}    (11.5)

Notice that the overall specification is a lot easier to grasp in terms of the transformed probability than in terms of the untransformed probability.¹

To minimize the mis-classification rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5 (Exercise 11.1). This means guessing 1 whenever β_0 + x·β is non-negative, and 0 otherwise. So logistic regression gives us a linear classifier. The decision boundary separating the two predicted classes is the solution of β_0 + x·β = 0, which is a point if x is one dimensional, a line if it is two dimensional, etc. One can show (exercise!) that the distance from the decision boundary is β_0/‖β‖ + x·β/‖β‖. Logistic regression not only says where the boundary between the classes is, but also says (via Eq. 11.5) that the class probabilities depend on distance from the boundary, in a particular way, and that they go towards the extremes (0 and 1) more rapidly when ‖β‖ is larger. It's these statements about probabilities which make logistic regression more than just a classifier. It makes stronger, more detailed predictions, and can be fit in a different way; but those strong predictions could be wrong.

Using logistic regression to predict class probabilities is a modeling choice, just like it's a modeling choice to predict quantitative variables with linear regression. In neither case is the appropriateness of the model guaranteed by the gods, nature, mathematical necessity, etc.

¹ Unless you've taken thermodynamics or physical chemistry, in which case you recognize that this is the Boltzmann distribution for a system with two states, which differ in energy by β_0 + x·β.
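To fix ideas, here is a small sketch, mine rather than the text's, of the logistic transformation and its inverse; base R's qlogis() and plogis() compute the same two functions, and the faraway package's ilogit() used later in this chapter is the same inverse.

## Sketch of Eq. 11.4 (the logit) and Eq. 11.5 (its inverse).
logit <- function(p) log(p / (1 - p))
inv.logit <- function(z) 1 / (1 + exp(-z))
## The inverse logit squashes any linear predictor into (0, 1):
curve(inv.logit(x), from = -6, to = 6, xlab = "linear predictor", ylab = "p")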

x <- matrix(runif(n = 50 * 2, min = -1, max = 1), ncol = 2)
par(mfrow = c(2, 2))
plot.logistic.sim(x, beta.0 = -0.1, beta = c(-0.2, 0.2))
y.1 <- plot.logistic.sim(x, beta.0 = -0.5, beta = c(-1, 1))
plot.logistic.sim(x, beta.0 = -2.5, beta = c(-5, 5))
plot.logistic.sim(x, beta.0 = -250, beta = c(-500, 500))

Figure 11.1 Effects of scaling logistic regression parameters. Values of x_1 and x_2 are the same in all plots (~ Unif(−1, 1) for both coordinates), but labels were generated randomly from logistic regressions with β_0 = −0.1, β = (−0.2, 0.2) (top left); from β_0 = −0.5, β = (−1, 1) (top right); from β_0 = −2.5, β = (−5, 5) (bottom left); and from β_0 = −2.5 × 10², β = (−5 × 10², 5 × 10²) (bottom right). Notice how, as the parameters get increased in constant ratio to each other, we approach a deterministic relation between Y and x, with a linear boundary between the classes. (We save one set of the random binary responses for use later, as the imaginatively-named y.1.)

sim.logistic <- function(x, beta.0, beta, bind = FALSE) {
    # Simulate binary responses from a logistic regression with the given coefficients
    require(faraway)
    linear.parts <- beta.0 + (x %*% beta)
    y <- rbinom(nrow(x), size = 1, prob = ilogit(linear.parts))
    if (bind) {
        return(cbind(x, y))
    } else {
        return(y)
    }
}

plot.logistic.sim <- function(x, beta.0, beta, n.grid = 50, labcex = 0.3, col = "grey", ...) {
    # Plot the model's probability contours, then simulate and plot +/- labels
    grid.seq <- seq(from = -1, to = 1, length.out = n.grid)
    plot.grid <- as.matrix(expand.grid(grid.seq, grid.seq))
    require(faraway)
    p <- matrix(ilogit(beta.0 + (plot.grid %*% beta)), nrow = n.grid)
    contour(x = grid.seq, y = grid.seq, z = p, xlab = expression(x[1]),
        ylab = expression(x[2]), main = "", labcex = labcex, col = col)
    y <- sim.logistic(x, beta.0, beta, bind = FALSE)
    points(x[, 1], x[, 2], pch = ifelse(y == 1, "+", "-"),
        col = ifelse(y == 1, "blue", "red"))
    invisible(y)
}

Code Example 26: Code to simulate binary responses from a logistic regression model, and to plot a 2D logistic regression's probability contours and simulated binary values. (How would you modify this to take the responses from a data frame?)

We begin by positing the model, to get something to work with, and we end (if we know what we're doing) by checking whether it really does match the data, or whether it has systematic flaws.

Logistic regression is one of the most commonly used tools for applied statistics and discrete data analysis. There are basically four reasons for this.

1. Tradition.

2. In addition to the heuristic approach above, the quantity log p/(1 − p) plays an important role in the analysis of contingency tables (the "log odds"). Classification is a bit like having a contingency table with two columns (classes) and infinitely many rows (values of x). With a finite contingency table, we can estimate the log-odds for each row empirically, by just taking counts in the table. With infinitely many rows, we need some sort of interpolation scheme; logistic regression is linear interpolation for the log-odds.

3. It's closely related to "exponential family" distributions, where the probability of some vector v is proportional to exp{β_0 + Σ_{j=1}^{m} f_j(v) β_j}. If one of the components of v is binary, and the functions f_j are all the identity function, then we get a logistic regression. Exponential families arise in many contexts in statistical theory (and in physics!), so there are lots of problems which can be turned into logistic regression.

4. It often works surprisingly well as a classifier. But, many simple techniques

often work surprisingly well as classifiers, and this doesn't really testify to logistic regression getting the probabilities right.

11.2.1 Likelihood Function for Logistic Regression

Because logistic regression predicts probabilities, rather than just classes, we can fit it using likelihood. For each training data-point, we have a vector of features, x_i, and an observed class, y_i. The probability of that class was either p(x_i), if y_i = 1, or 1 − p(x_i), if y_i = 0. The likelihood is then

    L(\beta_0, \beta) = \prod_{i=1}^{n} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}    (11.6)

(I could substitute in the actual equation for p, but things will be clearer in a moment if I don't.) The log-likelihood turns products into sums:

    \ell(\beta_0, \beta) = \sum_{i=1}^{n} y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i))    (11.7)

    = \sum_{i=1}^{n} \log(1 - p(x_i)) + \sum_{i=1}^{n} y_i \log \frac{p(x_i)}{1 - p(x_i)}    (11.8)

    = \sum_{i=1}^{n} \log(1 - p(x_i)) + \sum_{i=1}^{n} y_i (\beta_0 + x_i \cdot \beta)    (11.9)

    = -\sum_{i=1}^{n} \log\left(1 + e^{\beta_0 + x_i \cdot \beta}\right) + \sum_{i=1}^{n} y_i (\beta_0 + x_i \cdot \beta)    (11.10)

where in the next-to-last step we finally use equation 11.4.

Typically, to find the maximum likelihood estimates we'd differentiate the log-likelihood with respect to the parameters, set the derivatives equal to zero, and solve. To start that, take the derivative with respect to one component of β, say β_j:

    \frac{\partial \ell}{\partial \beta_j} = -\sum_{i=1}^{n} \frac{1}{1 + e^{\beta_0 + x_i \cdot \beta}} e^{\beta_0 + x_i \cdot \beta} x_{ij} + \sum_{i=1}^{n} y_i x_{ij}    (11.11)

    = \sum_{i=1}^{n} \left(y_i - p(x_i; \beta_0, \beta)\right) x_{ij}    (11.12)

We are not going to be able to set this to zero and solve exactly. (That's a transcendental equation, and there is no closed-form solution.) We can however approximately solve it numerically.
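For readers who like to see the algebra as code, here is a small sketch, mine rather than the text's, of Eq. 11.10 and Eq. 11.12 as R functions; the function and argument names are arbitrary, and x is taken to be an n × p matrix with y a 0/1 vector.

## Illustrative sketch: log-likelihood (Eq. 11.10) and its gradient in beta (Eq. 11.12).
logistic.loglik <- function(beta.0, beta, x, y) {
    linear.parts <- beta.0 + (x %*% beta)
    sum(y * linear.parts) - sum(log(1 + exp(linear.parts)))
}
logistic.loglik.grad <- function(beta.0, beta, x, y) {
    p <- 1 / (1 + exp(-(beta.0 + (x %*% beta))))
    colSums(as.vector(y - p) * x)   # one partial derivative per coordinate of beta
}

These are exactly the ingredients a generic optimizer such as optim() would need, though in practice we will let R's built-in fitters do the work.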

11.3 Numerical Optimization of the Likelihood

While our likelihood isn't nice enough that we have an explicit expression for the maximum (the way we do in OLS or WLS), it is a pretty well-behaved function, and one which is amenable to lots of the usual numerical methods for optimization. In particular, like most log-likelihood functions, it's suitable for an application of Newton's method. Briefly (see Appendix H.2 for details), Newton's method starts with an initial guess about the optimal parameters, and then calculates the gradient of the log-likelihood with respect to those parameters. It then adds an amount proportional to the gradient to the parameters, moving up the surface of the log-likelihood function. The size of the step in the gradient direction is dictated by the second derivatives: it takes bigger steps when the second derivatives are small (so the gradient is a good guide to what the function looks like), and small steps when the curvature is large.

11.3.1 Iteratively Re-Weighted Least Squares

This discussion of Newton's method is quite general, and therefore abstract. In the particular case of logistic regression, we can make everything look much more like a good, old-fashioned linear regression problem.

Logistic regression, after all, is a linear model for a transformation of the probability. Let's call this transformation g:

    g(p) \equiv \log \frac{p}{1 - p}    (11.13)

So the model is

    g(p) = \beta_0 + x \cdot \beta    (11.14)

and Y | X = x ~ Binom(1, g^{-1}(\beta_0 + x \cdot \beta)). It seems that what we should want to do is take g(Y) and regress it linearly on x. Of course, the variance of Y, according to the model, is going to change depending on x: it will be g^{-1}(\beta_0 + x \cdot \beta)(1 - g^{-1}(\beta_0 + x \cdot \beta)). So we really ought to do a weighted linear regression, with weights inversely proportional to that variance. Since writing g^{-1}(\beta_0 + x \cdot \beta) is getting annoying, let's abbreviate it by p(x) or just p, and let's abbreviate that variance as V(p).

The problem is that y is either 0 or 1, so g(y) is either −∞ or +∞. We will evade this by using a Taylor expansion,

    g(y) \approx g(p) + (y - p) g'(p) \equiv z    (11.15)

The right-hand side, z, will be our effective response variable, which we will regress on x. To see why this should give us the right coefficients, substitute for g(p) in the definition of z,

    z = \beta_0 + x \cdot \beta + (y - p) g'(p)    (11.16)

and notice that, if we've got the coefficients right, E[Y | X = x] = p, so (y − p) should be mean-zero noise. In other words, when we have the right coefficients, z is a linear function of x plus mean-zero noise. (This is our excuse for throwing away the rest of the Taylor expansion, even though we know the discarded terms

are infinitely large!) That noise doesn't have constant variance, but we can work it out,

    V[Z \mid X = x] = V[(Y - p) g'(p) \mid X = x] = (g'(p))^2 V(p)    (11.17)

and so use that variance in weighted least squares to recover β.

Notice that z and the weights both involve the parameters of our logistic regression, through p(x). So having done this once, we should really use the new parameters to update z and the weights, and do it again. Eventually, we come to a fixed point, where the parameter estimates no longer change. This loop (start with a guess about the parameters, use it to calculate the z_i and their weights, regress the z_i on the x_i to get new parameters, and repeat) is known as iteratively reweighted least squares (IRLS or IRWLS), iterative weighted least squares (IWLS), etc.

The treatment above is rather heuristic², but it turns out to be equivalent to using Newton's method, only with the expected second derivative of the log-likelihood, instead of its actual value. This takes a reasonable amount of algebra to show, so we'll skip it (but see Exercise 11.3)³. Since, with a large number of observations, the observed second derivative should be close to the expected second derivative, this is only a small approximation.

² That is, mathematically incorrect.
³ The two key points are as follows. First, the gradient of the log-likelihood turns out to be the sum of the z_i x_i. (Cf. Eq. 11.12.) Second, take a single Bernoulli observation with success probability p. The log-likelihood is Y log p + (1 − Y) log(1 − p). The first derivative with respect to p is Y/p − (1 − Y)/(1 − p), and the second derivative is −Y/p² − (1 − Y)/(1 − p)². Taking expectations of the second derivative gives −1/p − 1/(1 − p) = −1/(p(1 − p)). In other words, V(p) = −1/E[ℓ'']. Using weights inversely proportional to the variance thus turns out to be equivalent to dividing by the expected second derivative. But gradient divided by second derivative is the increment we use in Newton's method, QED.
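To make the IRLS recipe concrete, here is a minimal sketch, mine rather than the text's, of the loop for logistic regression; lm() with weights does each weighted regression, and the starting values, tolerance and iteration cap are arbitrary choices.

## Illustrative IRLS loop for logistic regression (cf. Section 11.3.1).
## x: n-by-p matrix of inputs; y: vector of 0/1 responses.
irls.logistic <- function(x, y, tol = 1e-08, max.iter = 100) {
    beta <- rep(0, ncol(x) + 1)                      # start at beta.0 = 0, beta = 0
    for (iteration in 1:max.iter) {
        eta <- as.vector(beta[1] + x %*% beta[-1])   # current linear predictor
        p <- 1 / (1 + exp(-eta))                     # current fitted probabilities
        z <- eta + (y - p) / (p * (1 - p))           # effective response (Eq. 11.15); g'(p) = 1/(p(1-p))
        w <- p * (1 - p)                             # weights, inverse to (g'(p))^2 V(p)
        new.beta <- coefficients(lm(z ~ x, weights = w))
        if (max(abs(new.beta - beta)) < tol) {
            break
        }
        beta <- new.beta
    }
    return(new.beta)
}

On the simulated data of Figure 11.1, irls.logistic(x, y.1) should essentially reproduce the coefficients R's built-in fitter reports later in the chapter.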

11.4 Generalized Linear and Additive Models

Logistic regression is part of a broader family of generalized linear models (GLMs), where the conditional distribution of the response falls in some parametric family, and the parameters are set by the linear predictor. Ordinary least-squares regression is the case where the response is Gaussian, with mean equal to the linear predictor, and constant variance. Logistic regression is the case where the response is binomial, with n equal to the number of data-points with the given x (usually but not always 1), and p given by Equation 11.5. Changing the relationship between the parameters and the linear predictor is called changing the link function. For computational reasons, the link function is actually the function you apply to the mean response to get back the linear predictor, rather than the other way around, i.e., (11.4) rather than (11.5). There are thus other forms of binomial regression besides logistic regression.⁴ There is also Poisson regression (appropriate when the data are counts without any upper limit), gamma regression, etc.; we will say more about these in Chapter 12.

⁴ My experience is that these tend to give similar error rates as classifiers, but have rather different guesses about the underlying probabilities.

In R, any standard GLM can be fit using the (base) glm function, whose syntax is very similar to that of lm. The major wrinkle is that, of course, you need to specify the family of probability distributions to use, by the family option; family=binomial defaults to logistic regression. (See help(glm) for the gory details on how to do, say, probit regression.) All of these are fit by the same sort of numerical likelihood maximization.

Perfect Classification

One caution about using maximum likelihood to fit logistic regression is that it can seem to work badly when the training data can be linearly separated. The reason is that, to make the likelihood large, p(x_i) should be large when y_i = 1, and p(x_i) should be small when y_i = 0. If β_0, β is a set of parameters which perfectly classifies the training data, then cβ_0, cβ is too, for any c > 1, but in a logistic regression the second set of parameters will have more extreme probabilities, and so a higher likelihood. For linearly separable data, then, there is no parameter vector which maximizes the likelihood, since ℓ can always be increased by making the vector larger but keeping it pointed in the same direction. You should, of course, be so lucky as to have this problem. (A small illustration appears at the end of this section.)

11.4.1 Generalized Additive Models

A natural step beyond generalized linear models is generalized additive models (GAMs), where instead of making the transformed mean response a linear function of the inputs, we make it an additive function of the inputs. This means combining a function for fitting additive models with likelihood maximization. This is actually done in R with the same gam function we used for additive models (hence the name). We will look at how this works in some detail in Chapter 12. For now, the basic idea is that the iteratively re-weighted least squares procedure of §11.3.1 doesn't really require the model for the log odds to be linear. We get a GAM when we fit an additive model to the z_i; we could even fit an arbitrary non-parametric model, like a kernel regression, though that's not often done.

GAMs can be used to check GLMs in much the same way that smoothers can be used to check parametric regressions: fit a GAM and a GLM to the same data, then simulate from the GLM, and re-fit both models to the simulated data. Repeated many times, this gives a distribution for how much better the GAM will seem to fit than the GLM does, even when the GLM is true. You can then read a p-value off of this distribution. This is illustrated in §11.6 below.
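As a tiny illustration, mine rather than the text's, of the perfect-classification problem described above, here is a made-up, linearly separable data set; glm will try, and fail, to drive the likelihood up to its supremum.

## Illustrative only: with perfectly separable data the logistic MLE runs away.
x.sep <- c(-2, -1, 1, 2)
y.sep <- c(0, 0, 1, 1)
coefficients(glm(y.sep ~ x.sep, family = binomial))
## Expect huge coefficients, and (typically) warnings from glm about fitted
## probabilities numerically equal to 0 or 1.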

11.5 Model Checking

The validity of the logistic regression model is no more a fact of mathematics or nature than is the validity of the linear regression model. Both are sometimes convenient assumptions, but neither is guaranteed to be correct, nor even some sort of generally-correct default. In either case, if we want to use the model, the proper scientific (and statistical) procedure is to check the validity of the modeling assumptions.

11.5.1 Residuals

In your linear models course, you learned a lot of checks based on the residuals of the model (see Chapter 2). Many of these ideas translate to logistic regression, but we need to re-define residuals. Sometimes people work with the "response" residuals,

    y_i - p(x_i)    (11.18)

which should have mean zero (why?), but are heteroskedastic even when the model is true (why?). Others work with standardized or Pearson residuals,

    \frac{y_i - p(x_i)}{\sqrt{V(p(x_i))}}    (11.19)

and there are yet other notions of residuals for logistic models. Still, both the response and the Pearson residuals should be unpredictable from the covariates, and the latter should have constant variance.

11.5.2 Non-parametric Alternatives

Chapter 9 discussed how non-parametric regression models can be used to check whether parametric regressions are well-specified. The same ideas apply to logistic regressions, with the minor modification that in place of the difference in MSEs, one should use the difference in log-likelihoods, or (what comes to the same thing, up to a factor of 2) the difference in deviances. The use of generalized additive models (§11.4.1) as the alternative model class is illustrated in §11.6 below.

11.5.3 Calibration

Because logistic regression predicts actual probabilities, we can check its predictions in a more stringent way than an ordinary regression, which just tells us the mean value of Y, but is otherwise silent about its distribution. If we've got a model which tells us that the probability of rain on a certain class of days is 50%, it had better rain on half of those days, or the model is just wrong about the probability of rain. More generally, we'll say that the model is calibrated (or well-calibrated) when

    \Pr(Y = 1 \mid \hat{p}(X) = p) = p    (11.20)

That is, the actual probabilities should match the predicted probabilities. If we have a large sample, by the law of large numbers, observed relative frequencies will converge on true probabilities. Thus, the observed relative frequencies should be close to the predicted probabilities, or else the model is making systematic mistakes.

In practice, each case often has its own unique predicted probability p̂, so we can't really accumulate many cases with the same p̂ and check the relative frequency among those cases. When that happens, one option is to look at all the cases where the predicted probability is in some small range [p, p + ε); the observed relative frequency had then better be in that range too. §11.7 below illustrates some of the relevant calculations.

A second option is to use what is called a proper scoring rule, which is a function of the outcome variables and the predicted probabilities that attains its minimum when, and only when, the predicted probabilities are calibrated. For binary outcomes, one proper scoring rule (historically the oldest) is the Brier score,

    n^{-1} \sum_{i=1}^{n} (y_i - p_i)^2    (11.21)

Another, however, is simply the (normalized) negative log-likelihood,

    -n^{-1} \sum_{i=1}^{n} y_i \log p_i + (1 - y_i) \log(1 - p_i)    (11.22)

Of course, proper scoring rules are better evaluated out-of-sample, or, failing that, through cross-validation, than in-sample. Even an in-sample evaluation is better than nothing, however, which is too often what happens.
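To make these checks concrete, here is a small sketch, mine rather than the text's, of computing the residuals of §11.5.1, the Brier score, and a crude binned calibration check; fitted.logr is a placeholder name for any fitted binomial glm object with observed 0/1 responses y, and the bin width of 0.1 is arbitrary.

## Illustrative model-checking quantities for a fitted logistic regression
## `fitted.logr` (placeholder name) with observed 0/1 responses y.
p.hat <- fitted(fitted.logr)                                # predicted probabilities
response.resid <- y - p.hat                                 # Eq. 11.18
pearson.resid <- (y - p.hat) / sqrt(p.hat * (1 - p.hat))    # Eq. 11.19
brier <- mean((y - p.hat)^2)                                # Eq. 11.21
## Crude calibration check: bin cases by predicted probability and compare the
## average prediction in each bin to the observed relative frequency there.
bins <- cut(p.hat, breaks = seq(from = 0, to = 1, by = 0.1))
cbind(predicted = tapply(p.hat, bins, mean), observed = tapply(y, bins, mean))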

11.6 A Toy Example

Here's a worked R example, using the data from the upper right panel of Figure 11.1. The 50 × 2 matrix x holds the input variables (the coordinates are independently and uniformly distributed on [−1, 1]), and y.1 the corresponding class labels, themselves generated from a logistic regression with β_0 = −0.5, β = (−1, 1).

df <- data.frame(y = y.1, x1 = x[, 1], x2 = x[, 2])
logr <- glm(y ~ x1 + x2, data = df, family = "binomial")

The deviance of a model fitted by maximum likelihood is twice the difference between its log-likelihood and the maximum log-likelihood for a saturated model, i.e., a model with one parameter per observation. Hopefully, the saturated model can give a perfect fit.⁵ Here the saturated model would assign probability 1 to the observed outcomes⁶, and the logarithm of 1 is zero, so D = −2ℓ(β̂_0, β̂). The null deviance is what's achievable by using just a constant bias β_0 and setting the rest of β to 0. The fitted model definitely improves on that.

⁵ The factor of two is so that the deviance will have a χ² distribution. Specifically, if the model with p parameters is right, the deviance will have a χ² distribution with n − p degrees of freedom. See Appendix I for the connection between log-likelihood ratios and χ² distributions.
⁶ This is not possible when there are multiple observations with the same input features, but different classes.

If we're interested in inferential statistics on the estimated model, we can see those with summary, as with lm:

summary(logr, digits = 2, signif.stars = FALSE)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7313  -1.0306  -0.6665   1.0914   2.1593  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.3226     0.3342  -0.965   0.3345
## x1           -1.0528     0.5356  -1.966   0.0493 *
## x2            1.3493     0.7052   1.913   0.0557 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.593  on 49  degrees of freedom
## Residual deviance: 60.758  on 47  degrees of freedom
## AIC: 66.758
## 
## Number of Fisher Scoring iterations: 4

⁷ AIC is of course the Akaike information criterion, −2ℓ + 2p, with p being the number of parameters (here, p = 3). (Some people divide this through by n.) AIC has some truly devoted adherents, especially among non-statisticians, but I have been deliberately ignoring it and will continue to do so. Basically, to the extent AIC succeeds, it works as a fast, large-sample approximation to doing leave-one-out cross-validation. Claeskens and Hjort (2008) is a thorough, modern treatment of AIC and related model-selection criteria from a statistical viewpoint; see especially §2.9 of that book for the connection between AIC and leave-one-out.

The fitted values of the logistic regression are the class probabilities; this next line gives us the (in-sample) mis-classification rate.

mean(ifelse(fitted(logr) < 0.5, 0, 1) != df$y)
## [1] 0.26

An error rate of 26% may sound bad, but notice from the contour lines in Figure 11.1 that lots of the probabilities are near 0.5, meaning that the classes are just genuinely hard to predict.

To see how well the logistic regression assumption holds up, let's compare this to a GAM. We'll use the same package for estimating the GAM, mgcv, that we used to fit the additive models in Chapter 8.

library(mgcv)
(gam.1 <- gam(y ~ s(x1) + s(x2), data = df, family = "binomial"))
## 
## Family: binomial 
## Link function: logit

## 
## Formula:
## y ~ s(x1) + s(x2)
## 
## Estimated degrees of freedom:
## 1.22 8.70  total = 10.92 
## 
## UBRE score: 0.1544972

This fits a GAM to the same data, using spline smoothing of both input variables. (Figure 11.2 shows the partial response functions.) The (in-sample) deviance is

signif(gam.1$deviance, 3)
## [1] 35.9

which is lower than the logistic regression's, so the GAM gives the data higher likelihood. We expect this; the question is whether the difference is significant, or within the range of what we should expect when logistic regression is valid. To test this, we need to simulate from the logistic regression model.

simulate.from.logr <- function(df, mdl) {
    probs <- predict(mdl, newdata = df, type = "response")
    df$y <- rbinom(n = nrow(df), size = 1, prob = probs)
    return(df)
}

Code Example 27: Code for simulating from an estimated logistic regression model. By default (type="link"), predict for logistic regressions returns predictions for the log odds; changing type to "response" returns a probability.

Now we simulate from our fitted model, and re-fit both the logistic regression and the GAM.

delta.deviance.sim <- function(df, mdl) {
    sim.df <- simulate.from.logr(df, mdl)
    GLM.dev <- glm(y ~ x1 + x2, data = sim.df, family = "binomial")$deviance
    GAM.dev <- gam(y ~ s(x1) + s(x2), data = sim.df, family = "binomial")$deviance
    return(GLM.dev - GAM.dev)
}

Notice that in this simulation we are not generating new X values. The logistic regression and the GAM are both models for the response conditional on the inputs, and are agnostic about how the inputs are distributed, or even whether it's meaningful to talk about their distribution.

Finally, we repeat the simulation a bunch of times, and see where the observed difference in deviances falls in the sampling distribution.

(delta.dev.observed <- logr$deviance - gam.1$deviance)
## [1] 24.86973
delta.dev <- replicate(100, delta.deviance.sim(df, logr))
mean(delta.dev.observed <= delta.dev)
## [1] 0.11

plot(gam.1, residuals = TRUE, pages = 0)

Figure 11.2 Partial response functions estimated when we fit a GAM to the data simulated from a logistic regression. Notice that the vertical axes are on the logit scale.

In other words, the amount by which a GAM fits the data better than logistic regression is pretty near the middle of the null distribution. Since the example data really did come from a logistic regression, this is a relief.

11.7 Weather Forecasting in Snoqualmie Falls

For our worked data example, we are going to build a simple weather forecaster. Our data consist of daily records, from the start of 1948 to the end of 1983, of

precipitation at Snoqualmie Falls, Washington (Figure 11.4).⁸ Each row of the data file is a different year; each column records, for that day of the year, the day's precipitation (rain or snow), in units of 1/100 inch.

hist(delta.dev, main = "",
    xlab = "Amount by which GAM fits better than logistic regression")
abline(v = delta.dev.observed, col = "grey", lwd = 4)

Figure 11.3 Sampling distribution for the difference in deviance between a GAM and a logistic regression, on data generated from a logistic regression. The observed difference in deviances is shown by the grey vertical line.

⁸ I learned of this data set from Guttorp (1995); the data file is available from http://www.stat.washington.edu/peter/stoch.mod.data.html. See Code Example 28 for the commands used to read it in, and to reshape it into a form more convenient for R.

snoqualmie <- scan("http://www.stat.washington.edu/peter/book.data/set1", skip = 1)
snoq <- data.frame(tomorrow = c(tail(snoqualmie, -1), NA), today = snoqualmie)
years <- 1948:1983
days.per.year <- rep(c(366, 365, 365, 365), length.out = length(years))
snoq$year <- rep(years, times = days.per.year)
snoq$day <- rep(c(1:366, 1:365, 1:365, 1:365), times = length(years)/4)
snoq <- snoq[-nrow(snoq), ]

Code Example 28: Read in and re-shape the Snoqualmie data set. Prof. Guttorp, who has kindly provided the data, formatted it so that each year was a different row, which is rather inconvenient for R.

Because of leap-days, there are 366 columns, with the last column having an NA value for three out of four years.

What we want to do is predict tomorrow's weather from today's. This would be of interest if we lived in Snoqualmie Falls, or if we operated one of the local hydroelectric power plants, or the tourist attraction of the Falls themselves. Examining the distribution of the data (Figures 11.5 and 11.6) shows that there is a big spike in the distribution at zero precipitation, and that days of no precipitation can follow days of any amount of precipitation but seem to be less common after heavy precipitation.

These facts suggest that "no precipitation" is a special sort of event which would be worth predicting in its own right (as opposed to just being when the precipitation happens to be zero), so we will attempt to do so with logistic regression. Specifically, the input variable X_i will be the amount of precipitation on the i-th day, and the response Y_i will be the indicator variable for whether there was any precipitation on day i + 1; that is, Y_i = 1 if X_{i+1} > 0, and Y_i = 0 if X_{i+1} = 0. We expect from Figure 11.6, as well as common experience, that the coefficient on X should be positive.⁹ The estimation is straightforward:

⁹ This does not attempt to model how much precipitation there will be tomorrow, if there is any. We could make that a separate model, if we can get this part right.

snoq.logistic <- glm((tomorrow > 0) ~ today, data = snoq, family = binomial)

To see what came from the fitting, run summary:

print(summary(snoq.logistic), digits = 3, signif.stars = FALSE)
## 
## Call:
## glm(formula = (tomorrow > 0) ~ today, family = binomial, data = snoq)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.525  -0.999   0.167   1.170   1.367  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.43520    0.02163   -20.1   <2e-16
## today        0.04523    0.00131    34.6   <2e-16

## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18191  on 13147  degrees of freedom
## Residual deviance: 15896  on 13146  degrees of freedom
## AIC: 15900
## 
## Number of Fisher Scoring iterations: 5

Figure 11.4 Snoqualmie Falls, Washington, on a low-precipitation day. Photo by Jeannine Hall Gailey, from http://myblog.webbish6.com/2011/07/17-years-and-hoping-for-another-17.html. [[TODO: Get permission for photo use!]]

The coefficient on the amount of precipitation today is indeed positive, and (if we can trust R's assumptions) highly significant. There is also an intercept term, which is slightly positive, but not very significant.

We can see what the intercept term means by considering what happens on days of no precipitation. The linear predictor is then just the intercept, -0.435, and the predicted probability of precipitation is 0.393. That is, even when there is no precipitation today, it's almost as likely as not that there will be some precipitation tomorrow.¹⁰

¹⁰ For western Washington State, this is plausible, but see below.

hist(snoqualmie, n = 50, probability = TRUE, xlab = "Precipitation (1/100 inch)")
rug(snoqualmie, col = "grey")

Figure 11.5 Histogram of the amount of daily precipitation at Snoqualmie Falls.
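To see where the 0.393 above comes from, here is the one-line arithmetic, using base R's inverse-logit function plogis(); this is just a check of the text's numbers, not new analysis.

## Inverse logit of the intercept alone gives the predicted probability of
## precipitation tomorrow following a dry day:
plogis(-0.43520)                        # approximately 0.393
exp(-0.43520) / (1 + exp(-0.43520))     # the same thing, written out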

[A full-page scatter-plot appears here: precipitation tomorrow (1/100 inch) plotted against precipitation today; the individual plotted points are omitted from this transcript.]

We can get a more global view of what the model is doing by plotting the data
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 0 300 200 100 0 400 Precipitation today (1/100 inch) plot(tomorrow ~ today, data = snoq, xlab = "Precipitation today (1/100 inch)", ylab = "Precipitation tomorrow (1/100 inch)", cex = 0.1) rug(snoq$today, side = 1, col = "grey") rug(snoq$tomorrow, side = 2, col = "grey") Figure 11.6 Scatterplot showing relationship between amount of precipitation on successive days. Notice that days of no precipitation can follow days of any amount of precipitation, but seem to be more common when there is little or no precipitation to start with. and the predictions (Figure 11.7). This shows a steady increase in the probability of precipitation tomorrow as the precipitation today increases, though with the leveling off characteristic of logistic regression. The (approximate) 95% confidence limits for the predicted probability are (on close inspection) asymmetric.
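The predictions in Figure 11.7 come from the logistic regression object snoq.logistic, which is fit earlier in the chapter and is not shown in this excerpt. As a minimal sketch of what such a fit looks like (the exact call in the book may differ), regress the indicator of positive precipitation tomorrow on today's precipitation:

# Sketch only: the book's snoq.logistic is created earlier in the chapter.
# Assumes the snoq data frame with columns `today` and `tomorrow`
# used in the plotting code above.
snoq.logistic.sketch <- glm((tomorrow > 0) ~ today, data = snoq, family = binomial)
summary(snoq.logistic.sketch)

The coefficients reported by summary() are on the log-odds (logit) scale, which is why the plotting code below transforms predictions with ilogit() before drawing them.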

plot((tomorrow > 0) ~ today, data = snoq, xlab = "Precipitation today (1/100 inch)",
    ylab = "Positive precipitation tomorrow?")
rug(snoq$today, side = 1, col = "grey")
data.plot <- data.frame(today = (0:500))
pred.bands <- function(mdl, data, col = "black", mult = 1.96) {
    preds <- predict(mdl, newdata = data, se.fit = TRUE)
    lines(data[, 1], ilogit(preds$fit), col = col)
    lines(data[, 1], ilogit(preds$fit + mult * preds$se.fit), col = col, lty = "dashed")
    lines(data[, 1], ilogit(preds$fit - mult * preds$se.fit), col = col, lty = "dashed")
}
pred.bands(snoq.logistic, data.plot)

Figure 11.7 Data (dots), plus predicted probabilities (solid line) and approximate 95% confidence intervals from the logistic regression model (dashed lines). Note that calculating standard errors for predictions on the logit scale, and then transforming, is better practice than getting standard errors directly on the probability scale.
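The pred.bands() function relies on ilogit(), the inverse of the logit transformation, to convert predictions from the log-odds scale back to probabilities. Its definition does not appear in this excerpt; if it is not already available in your session (some utility packages provide it), a minimal version is the following sketch:

# Minimal inverse-logit, used here only if ilogit() is not already defined:
# maps a log-odds value x to the probability exp(x) / (1 + exp(x)).
ilogit <- function(x) {
    exp(x) / (1 + exp(x))
}

With this in hand, pred.bands() computes standard errors on the logit scale, where the Gaussian approximation behind the 1.96 multiplier is more trustworthy, and only then maps the limits through ilogit(), which is the practice recommended in the caption.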

How well does this work? We can get a first sense of this by comparing it to a simple nonparametric smoothing of the data. Remembering that when Y is binary, Pr(Y = 1 | X = x) = E[Y | X = x], we can use a smoothing spline to estimate E[Y | X = x] (Figure 11.8). This would not be so great as a model — it ignores the fact that the response is a binary event and we're trying to estimate a probability, the fact that the variance of Y therefore depends on its mean, etc. — but it's at least suggestive. The result starts out notably above the logistic regression, then levels out and climbs much more slowly. It also has a bunch of dubious-looking wiggles, despite the cross-validation.

We can try to do better by fitting a generalized additive model. In this case, with only one predictor variable, this means using non-parametric smoothing to estimate the log odds — we're still using the logistic transformation, but only requiring that the log odds change smoothly with X, not that they be linear in X. The result (Figure 11.9) is initially similar to the spline, but has some more exaggerated undulations, and has confidence intervals. At the largest values of X, the latter span nearly the whole range from 0 to 1, which is not unreasonable considering the sheer lack of data there. Visually, the logistic regression curve is hardly ever within the confidence limits of the non-parametric predictor.

What can we say about the difference between the two models more quantitatively? Numerically, the deviance is 1.5895596 × 10^4 for the logistic regression, and 1.5121622 × 10^4 for the GAM. We can go through the testing procedure outlined in § 11.6. We need a simulator (which presumes that the logistic regression model is true), and we need to calculate the difference in deviance on simulated data many times.

snoq.sim <- function(model = snoq.logistic) {
    fitted.probs <- fitted(model)
    return(rbinom(n = length(fitted.probs), size = 1, prob = fitted.probs))
}

A quick check of the simulator against the observed values:

summary(ifelse(snoq[, 1] > 0, 1, 0))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  1.0000  0.5262  1.0000  1.0000
summary(snoq.sim())
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  1.0000  0.5176  1.0000  1.0000

This suggests that the simulator is not acting crazily. Now for the difference in deviances:

diff.dev <- function(model = snoq.logistic, x = snoq[, "today"]) {
    y.new <- snoq.sim(model)
    GLM.dev <- glm(y.new ~ x, family = binomial)$deviance
    GAM.dev <- gam(y.new ~ s(x), family = binomial)$deviance
    return(GLM.dev - GAM.dev)
}
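The chapter goes on to repeat this simulation to build up a null distribution for the deviance difference. As a sketch of how the pieces defined above fit together (the replicate count is an arbitrary choice here, and a gam() implementation supporting s() must be loaded, as in the chapter's earlier GAM fits), one could compute a Monte Carlo p-value along these lines:

# Sketch: Monte Carlo test of the logistic regression against the GAM.
# obs.diff.dev uses the two deviances reported in the text above;
# 200 replicates is an arbitrary number chosen for illustration.
obs.diff.dev <- 1.5895596e4 - 1.5121622e4
null.diffs <- replicate(200, diff.dev())
p.value <- (1 + sum(null.diffs >= obs.diff.dev)) / (1 + length(null.diffs))
p.value

If the simulated differences are often as large as the observed one, a gap of this size is compatible with the logistic regression being correct; a tiny p-value points the other way.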

[Figure: positive precipitation tomorrow plotted against precipitation today (1/100 inch).]
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 0.0 400 200 100 300 0 Precipitation today (1/100 inch) plot((tomorrow > 0) ~ today, data = snoq, xlab = "Precipitation today (1/100 inch)", ylab = "Positive precipitation tomorrow?") rug(snoq$today, side = 1, col = "grey") data.plot <- data.frame(today = (0:500)) pred.bands(snoq.logistic, data.plot) snoq.spline <- smooth.spline(x = snoq$today, y = (snoq$tomorrow > 0)) lines(snoq.spline, col = "red") Figure 11.8 As Figure 11.7, plus a smoothing spline (red). A single run of this takes about 0 . 6 seconds on my computer. Finally, we calculate the distribution of difference in deviances under the null (that the logistic regression is properly specified), and the corresponding p -value:

library(mgcv)
plot((tomorrow > 0) ~ today, data = snoq, xlab = "Precipitation today (1/100 inch)",
    ylab = "Positive precipitation tomorrow?")
rug(snoq$today, side = 1, col = "grey")
pred.bands(snoq.logistic, data.plot)
lines(snoq.spline, col = "red")
snoq.gam <- gam((tomorrow > 0) ~ s(today), data = snoq, family = binomial)
pred.bands(snoq.gam, data.plot, "blue")

Figure 11.9 As Figure 11.8, but with the addition of a generalized additive model (blue line) and its confidence limits (dashed blue lines).

diff.dev.obs <- snoq.logistic$deviance - snoq.gam$deviance
null.dist.of.diff.dev <- replicate(100, diff.dev())
p.value <- (1 + sum(null.dist.of.diff.dev > diff.dev.obs))/(1 + length(null.dist.of.diff.dev))
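The code above replicates a function diff.dev() that is defined elsewhere in the chapter and not reproduced here. As a rough guide only, the following is a minimal sketch of what such a simulate-and-refit function could look like, assuming it draws new binary responses from the fitted logistic regression, refits both models to the simulated data, and returns the difference in deviances; the names diff.dev.sketch and sim.frame are illustrative, not the book's.

# Minimal sketch (not the book's definition): simulate from the fitted
# logistic regression, refit both models, return the deviance difference
diff.dev.sketch <- function() {
    sim.response <- rbinom(nrow(snoq), size = 1, prob = fitted(snoq.logistic))
    sim.frame <- data.frame(sim.response = sim.response, today = snoq$today)
    sim.glm <- glm(sim.response ~ today, data = sim.frame, family = binomial)
    sim.gam <- gam(sim.response ~ s(today), data = sim.frame, family = binomial)
    return(sim.glm$deviance - sim.gam$deviance)
}

With a function of this kind in hand, replicate(100, diff.dev.sketch()) would give a null distribution analogous to null.dist.of.diff.dev above.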

Using a hundred replicates takes about 67 seconds, or a bit over a minute; it gives a p-value of < 1/101. (A longer run of 1000 replicates, not shown, gives p-values of < 10^-3.)

Having detected that there is a problem with the logistic model, we can ask where it lies. We could just use the GAM, but it's more interesting to try to diagnose what's going on.

In this respect Figure 11.9 is actually a little misleading, because it leads the eye to emphasize the disagreement between the models at large X, when actually there are very few data points there, and so even large differences in predicted probabilities there contribute little to the over-all likelihood difference. What is actually more important is what happens at X = 0, which contains a very large number of observations (about 47% of all observations), and which we have reason to think is a special value anyway.

Let's try introducing a dummy variable for X = 0 into the logistic regression, and see what happens. It will be convenient to augment the data frame with an extra column, recording 1 whenever X = 0 and 0 otherwise.

snoq2 <- data.frame(snoq, dry = ifelse(snoq$today == 0, 1, 0))
snoq2.logistic <- glm((tomorrow > 0) ~ today + dry, data = snoq2, family = binomial)
snoq2.gam <- gam((tomorrow > 0) ~ s(today) + dry, data = snoq2, family = binomial)

Notice that I allow the GAM to treat zero as a special value as well, by giving it access to that dummy variable. In principle, with enough data it can decide whether or not that is useful on its own, but since we have guessed that it is, we might as well include it. The new GLM has a deviance of 1.4954796 × 10^4, lower than even the GAM before, and the new GAM has a deviance of 1.4841671 × 10^4. I will leave repeating the specification test as an exercise (one possible setup is sketched at the end of this section).

Figure 11.10 shows the data and the two new models. These are extremely close to each other at low precipitation, and diverge thereafter. The new GAM is the smoothest model we've seen yet, which suggests that before it was being under-smoothed to help capture the special value at zero.

Let's turn now to looking at calibration. The actual fraction of no-precipitation days which are followed by precipitation is

signif(mean(snoq$tomorrow[snoq$today == 0] > 0), 3)

## [1] 0.287

What does the new logistic model predict?

signif(predict(snoq2.logistic, newdata = data.frame(today = 0, dry = 1), type = "response"), 3)

##     1
## 0.287

This should not be surprising: we've given the model a special parameter dedicated to getting this one probability exactly right! The hope, however, is that this will change the predictions made on days with precipitation so that they are better. Looking at a histogram of fitted values ( hist(fitted(snoq2.logistic)) )

[Figure 11.10 plot: positive precipitation tomorrow versus precipitation today, showing the data and the two new models.]
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 0.0 400 200 100 300 0 Precipitation today (1/100 inch) plot((tomorrow > 0) ~ today, data = snoq, xlab = "Precipitation today (1/100 inch)", ylab = "Positive precipitation tomorrow?") rug(snoq$today, side = 1, col = "grey") data.plot = data.frame(data.plot, dry = ifelse(data.plot$today == 0, 1, 0)) lines(snoq.spline, col = "red") pred.bands(snoq2.logistic, data.plot) pred.bands(snoq2.gam, data.plot, "blue") Figure 11.10 As Figure 11.9, but allowing the two models to use a dummy variable indicating when today is completely dry ( X = 0). shows a gap in the distribution of predicted probabilities below 0 . 63, so we’ll look first at days where the predicted probability is between 0 . 63 and 0 . 64.

signif(mean(snoq$tomorrow[(fitted(snoq2.logistic) >= 0.63) & (fitted(snoq2.logistic) < 0.64)] > 0), 3)
## [1] 0.526

Not bad — but a bit painful to write out. Let's write a function:

frequency.vs.probability <- function(p.lower, p.upper = p.lower + 0.01,
    model = snoq2.logistic, events = (snoq$tomorrow > 0)) {
    fitted.probs <- fitted(model)
    indices <- (fitted.probs >= p.lower) & (fitted.probs < p.upper)
    ave.prob <- mean(fitted.probs[indices])
    frequency <- mean(events[indices])
    se <- sqrt(ave.prob * (1 - ave.prob)/sum(indices))
    return(c(frequency = frequency, ave.prob = ave.prob, se = se))
}

I have added a calculation of the average predicted probability, and a crude estimate of the standard error we should expect if the observations really are binomial with the predicted probabilities [11]. Try the function out before doing anything rash:

frequency.vs.probability(0.63)
##  frequency   ave.prob         se
## 0.52603037 0.63414568 0.01586292

This agrees with our previous calculation. Now we can do this for a lot of probability brackets:

f.vs.p <- sapply(c(0.28, (63:100)/100), frequency.vs.probability)

This comes with some unfortunate R cruft, removable thus:

f.vs.p <- data.frame(frequency = f.vs.p["frequency", ], ave.prob = f.vs.p["ave.prob", ],
    se = f.vs.p["se", ])

and we're ready to plot (Figure 11.11). The observed frequencies are generally reasonably near the predicted probabilities. While I wouldn't want to say this was the last word in weather forecasting [12], it's surprisingly good for such a simple model. I will leave calibration checking for the GAM as another exercise.

[11] This could be improved by averaging predicted variances for each point, but using probability ranges of 0.01 makes it hardly worth the effort.
[12] There is an extensive discussion of this data in Guttorp (1995, ch. 2), including many significant refinements, such as dependence across multiple days.

11.8 Logistic Regression with More Than Two Classes

If Y can take on more than two values, say k of them, we can still use logistic regression. Instead of having one set of parameters β_0, β, each class c in 0 : (k − 1) will have its own offset β_0^(c) and vector β^(c), and the predicted conditional probabilities will be

Pr(Y = c | X = x) = e^{β_0^(c) + x·β^(c)} / Σ_{c'} e^{β_0^(c') + x·β^(c')}    (11.23)

You can check that when there are only two classes (say, 0 and 1), equation 11.23 reduces to equation 11.5, with β_0 = β_0^(1) − β_0^(0) and β = β^(1) − β^(0). In fact, no matter how many classes there are, we can always pick one of them, say c = 0, and fix its parameters at exactly zero, without any loss of generality (Exercise 11.2) [13].

[13] Since we can arbitrarily choose which class's parameters to "zero out" without affecting the predicted probabilities, strictly speaking the model in Eq. 11.23 is unidentified. That is, different parameter settings lead to exactly the same outcome, so we can't use the data to tell which one is right. The usual response is to deal with this by convention: we decide to zero out the parameters of the first class, and then estimate the contrasting parameters for the others.

Calculation of the likelihood now proceeds as before (only with more book-keeping), and so does maximum likelihood estimation.

As for R implementations, for a long time the easiest way to do this was actually to use the nnet package for neural networks (Venables and Ripley, 2002). More recently, the multiclass function from the mgcv package does the same sort of job, with an interface closer to what you will be familiar with from glm and gam.

Exercises

11.1 "We minimize the mis-classification rate by predicting the most likely class": Let μ̂(x) be our predicted class, either 0 or 1. Our error rate is then Pr(Y ≠ μ̂). Show that Pr(Y ≠ μ̂) = E[(Y − μ̂)²]. Further show that E[(Y − μ̂(x))² | X = x] = Pr(Y = 1 | X = x)(1 − 2μ̂(x)) + μ̂²(x). Conclude by showing that if Pr(Y = 1 | X = x) > 0.5 the risk of mis-classification is minimized by taking μ̂ = 1, that if Pr(Y = 1 | X = x) < 0.5 the risk is minimized by taking μ̂ = 0, and that when Pr(Y = 1 | X = x) = 0.5 both predictions are equally risky.

11.2 A multiclass logistic regression, as in Eq. 11.23, has parameters β_0^(c) and β^(c) for each class c. Show that we can always get the same predicted probabilities by setting β_0^(c) = 0, β^(c) = 0 for any one class c, and adjusting the parameters for the other classes appropriately.

11.3 Find the first and second derivatives of the log-likelihood for logistic regression with one predictor variable. Explicitly write out the formula for doing one step of Newton's method. Explain how this relates to re-weighted least squares.

[Figure 11.11 appears here: observed frequencies plotted against predicted probabilities, with the diagonal line of perfect calibration, drawn by the code below.]

plot(frequency ~ ave.prob, data = f.vs.p, xlim = c(0, 1), ylim = c(0, 1),
    xlab = "Predicted probabilities", ylab = "Observed frequencies")
rug(fitted(snoq2.logistic), col = "grey")
abline(0, 1, col = "grey")
segments(x0 = f.vs.p$ave.prob, y0 = f.vs.p$ave.prob - 1.96 * f.vs.p$se,
    y1 = f.vs.p$ave.prob + 1.96 * f.vs.p$se)

Figure 11.11 Calibration plot for the modified logistic regression model snoq2.logistic. Points show the actual frequency of precipitation for each level of predicted probability. Vertical lines are (approximate) 95% sampling intervals for the frequency, given the predicted probability and the number of observations.

12 Generalized Linear Models and Generalized Additive Models

12.1 Generalized Linear Models and Iterative Least Squares

Logistic regression is a particular instance of a broader kind of model, called a generalized linear model (GLM). You are familiar, of course, from your regression class with the idea of transforming the response variable, what we've been calling Y, and then predicting the transformed variable from X. This was not what we did in logistic regression. Rather, we transformed the conditional expected value, and made that a linear function of X. This seems odd, because it is odd, but it turns out to be useful.

Let's be specific. Our usual focus in regression modeling has been the conditional expectation function, μ(x) = E[Y | X = x]. In plain linear regression, we try to approximate μ(x) by β_0 + x·β. In logistic regression, μ(x) = E[Y | X = x] = Pr(Y = 1 | X = x), and it is a transformation of μ(x) which is linear. The usual notation says

η(x) = β_0 + x·β    (12.1)
η(x) = log(μ(x)/(1 − μ(x)))    (12.2)
     = g(μ(x))    (12.3)

defining the logistic link function by g(m) = log(m/(1 − m)). The function η(x) is called the linear predictor.

Now, the first impulse for estimating this model would be to apply the transformation g to the response. But Y is always zero or one, so g(Y) = ±∞, and regression will not be helpful here. The standard strategy is instead to use (what else?) Taylor expansion. Specifically, we try expanding g(Y) around μ(x), and stop at first order:

g(Y) ≈ g(μ(x)) + (Y − μ(x)) g′(μ(x))    (12.4)
     = η(x) + (Y − μ(x)) g′(μ(x)) ≡ z    (12.5)

We define this to be our effective response after transformation. Notice that if there were no noise, so that y was always equal to its conditional mean μ(x), then regressing z on x would give us back exactly the coefficients β_0, β. What this suggests is that we can estimate those parameters by regressing z on x.

The term Y − μ(x) has expectation zero, so it acts like the noise, with the factor of g′ telling us about how the noise is scaled by the transformation.

This lets us work out the variance of z:

V[Z | X = x] = V[η(x) | X = x] + V[(Y − μ(x)) g′(μ(x)) | X = x]    (12.6)
             = 0 + (g′(μ(x)))² V[Y | X = x]    (12.7)

For logistic regression, with Y binary, V[Y | X = x] = μ(x)(1 − μ(x)). On the other hand, with the logistic link function, g′(μ(x)) = 1/(μ(x)(1 − μ(x))). Thus, for logistic regression, V[Z | X = x] = [μ(x)(1 − μ(x))]^(−1).

Because the variance of Z changes with X, this is a heteroskedastic regression problem. As we saw in chapter 10, the appropriate way of dealing with such a problem is to use weighted least squares, with weights inversely proportional to the variances. This means that, in logistic regression, the weight at x should be proportional to μ(x)(1 − μ(x)). Notice two things about this. First, the weights depend on the current guess about the parameters. Second, we give a lot of weight to cases where μ(x) ≈ 0.5, and little weight to those where μ(x) ≈ 0 or μ(x) ≈ 1. This focuses our attention on the observations which carry the most information about the linear predictor: near μ(x) = 0.5, a small change in η(x) produces a comparatively large change in the predicted probability, while near 0 or 1 even a big change in η(x) barely moves it, so those nearly-deterministic cases tell us little about the coefficients.

We can now put all this together into an estimation strategy for logistic regression.

1. Get the data (x_1, y_1), ..., (x_n, y_n), and some initial guesses β_0, β.
2. Until β_0, β converge:
   1. Calculate η(x_i) = β_0 + x_i·β and the corresponding μ̂(x_i)
   2. Find the effective transformed responses z_i = η(x_i) + (y_i − μ̂(x_i))/(μ̂(x_i)(1 − μ̂(x_i)))
   3. Calculate the weights w_i = μ̂(x_i)(1 − μ̂(x_i))
   4. Do a weighted linear regression of z_i on x_i with weights w_i, and set β_0, β to the intercept and slopes of this regression

Our initial guess about the parameters tells us about the heteroskedasticity, which we use to improve our guess about the parameters, which we use to improve our guess about the variance, and so on, until the parameters stabilize. This is called iterative reweighted least squares (or "iteratively reweighted least squares", "iterative weighted least squares", "iteratively weighted least squares", etc.), abbreviated IRLS, IRWLS, IWLS, etc. As mentioned in the last chapter, this turns out to be almost equivalent to Newton's method, at least for this problem.
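To make the loop concrete, here is a minimal R sketch of the procedure just described, written from scratch rather than with glm(); the function name irls.logistic, the starting values, the convergence tolerance, and the reliance on lm() with a weights argument are all choices made for this illustration, not anything from the text.

# A bare-bones IRLS loop for logistic regression with a single predictor.
# x: numeric predictor; y: 0/1 responses. Assumes the fitted probabilities stay
# strictly between 0 and 1, so the weights never hit zero.
irls.logistic <- function(x, y, tol = 1e-8, max.iter = 100) {
    beta <- c(0, 0)   # crude initial guesses for (intercept, slope)
    for (iteration in 1:max.iter) {
        eta <- beta[1] + beta[2] * x          # linear predictor
        mu <- 1/(1 + exp(-eta))               # inverse logistic link
        z <- eta + (y - mu)/(mu * (1 - mu))   # effective transformed response
        w <- mu * (1 - mu)                    # weights, inversely proportional to Var(Z|X=x)
        new.beta <- unname(coef(lm(z ~ x, weights = w)))   # weighted linear regression
        if (max(abs(new.beta - beta)) < tol) break
        beta <- new.beta
    }
    c(intercept = new.beta[1], slope = new.beta[2])
}

Running this on a 0/1 response and a predictor, and comparing the output to coef(glm(y ~ x, family = binomial)), should give matching estimates to many decimal places, since glm itself fits by the same kind of iteratively reweighted least squares.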

12.1.1 GLMs in General

The set-up for an arbitrary GLM is a generalization of that for logistic regression. We need

• A linear predictor, η(x) = β_0 + x·β
• A link function g, so that η(x) = g(μ(x)). For logistic regression, we had g(μ) = log(μ/(1 − μ)).
• A dispersion scale function V, so that V[Y | X = x] = σ² V(μ(x)). For logistic regression, we had V(μ) = μ(1 − μ), and σ² = 1.

With these, we know the conditional mean and conditional variance of the response for each value of the input variables x.

As for estimation, basically everything in the IRWLS set-up carries over unchanged. In fact, we can go through this algorithm:

1. Get the data (x_1, y_1), ..., (x_n, y_n), fix the link function g(μ) and dispersion scale function V(μ), and make some initial guesses β_0, β.
2. Until β_0, β converge:
   1. Calculate η(x_i) = β_0 + x_i·β and the corresponding μ̂(x_i)
   2. Find the effective transformed responses z_i = η(x_i) + (y_i − μ̂(x_i)) g′(μ̂(x_i))
   3. Calculate the weights w_i = [(g′(μ̂(x_i)))² V(μ̂(x_i))]^(−1)
   4. Do a weighted linear regression of z_i on x_i with weights w_i, and set β_0, β to the intercept and slopes of this regression

Notice that even if we don't know the over-all variance scale σ², that's OK, because the weights just have to be proportional to the inverse variance.

12.1.2 Examples of GLMs

12.1.2.1 Vanilla Linear Models

To re-assure ourselves that we are not doing anything crazy, let's see what happens when g(μ) = μ (the "identity link"), and V[Y | X = x] = σ², so that V(μ) = 1. Then g′ = 1, all weights w_i = 1, and the effective transformed response z_i = y_i. So we just end up regressing y_i on x_i with no weighting at all — we do ordinary least squares. Since neither the weights nor the transformed response will change, IRWLS will converge exactly after one step. So if we get rid of all this nonlinearity and heteroskedasticity and go all the way back to our very first days of doing regression, we get the OLS answers we know and love.

12.1.2.2 Binomial Regression

In many situations, our response variable y_i will be an integer count running between 0 and some pre-determined upper limit n_i. (Think: number of patients in a hospital ward with some condition, number of children in a classroom passing a test, number of widgets produced by a factory which are defective, number of people in a village with some genetic mutation.) One way to model this would be as a binomial random variable, with n_i trials, and a success probability p_i which is a logistic function of the predictors x. The logistic regression we have done so far is the special case where n_i = 1 always. I will leave it as an Exercise (12.1) for you to work out the link function and the weights for general binomial regression, where the n_i are treated as known.
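As an aside on computation, binomial regression with known n_i can be fit in R with glm() by supplying a two-column response of successes and failures; the simulated data and variable names below (n, x, passed) are purely illustrative, not from the text, and this is only a sketch of the call.

# Hypothetical grouped data: n_i trials per observation, passed_i successes.
set.seed(42)
n <- rep(20, 50)                           # known numbers of trials
x <- runif(50, -2, 2)                      # a single predictor
p <- 1/(1 + exp(-(0.5 + 1.2 * x)))         # true success probabilities
passed <- rbinom(50, size = n, prob = p)   # observed success counts
# The two-column response cbind(successes, failures) tells glm about the n_i.
fit.binom <- glm(cbind(passed, n - passed) ~ x, family = binomial)
coef(fit.binom)   # intercept and slope on the log-odds scale

Internally, glm fits this by the same sort of iteratively reweighted least squares described above.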

One implication of this model is that each of the n_i "trials" aggregated together in y_i is independent of all the others, at least once we condition on the predictors x. (So, e.g., whether any student passes the test is independent of whether any of their classmates pass, once we have conditioned on, say, teacher quality and average previous knowledge.) This may or may not be a reasonable assumption. When the successes or failures are dependent, even after conditioning on the predictors, the binomial model will be mis-specified. We can either try to get more information, and hope that conditioning on a richer set of predictors makes the dependence go away, or we can just try to account for the dependence by modifying the variance ("overdispersion" or "underdispersion"); we'll return to both topics in §12.1.4.

12.1.2.3 Poisson Regression

Recall that the Poisson distribution has probability mass function

p(y) = e^(−μ) μ^y / y!    (12.8)

with E[Y] = V[Y] = μ. As you remember from basic probability, a Poisson distribution is what we get from a binomial if the probability of success per trial shrinks towards zero but the number of trials grows to infinity, so that we keep the mean number of successes the same:

Binom(n, μ/n) → Pois(μ)    (12.9)

This makes the Poisson distribution suitable for modeling counts with no fixed upper limit, but where the probability that any one of the many individual trials is a success is fairly low. If μ is allowed to change with the predictor variables, we get Poisson regression. Since the variance is equal to the mean, Poisson regression is always going to be heteroskedastic.

Since μ has to be non-negative, a natural link function is g(μ) = log μ. This produces g′(μ) = 1/μ, and so weights w = μ. When the expected count is large, so is the variance, which normally would reduce the weight put on an observation in regression, but in this case large expected counts also provide more information about the coefficients, so they end up getting increasing weight.

12.1.3 Uncertainty

Standard errors for coefficients can be worked out as in the case of weighted least squares for linear regression. Confidence intervals for the coefficients will be approximately Gaussian in large samples, for the usual likelihood-theory reasons, when the model is properly specified. One can, of course, also use either a parametric bootstrap, or resampling of cases/data-points, to assess uncertainty. Resampling of residuals can be trickier, because it is not so clear what counts as a residual. When the response variable is continuous, we can get "standardized" or "Pearson" residuals, ε̂_i = (y_i − μ̂(x_i))/√(V(μ̂(x_i))), resample them to get ε̃_i, and then add ε̃_i √(V(μ̂(x_i))) to the fitted values. This does not really work when the response is discrete-valued, however. [[ATTN: Look up if anyone has a good trick for this]]
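To illustrate "resampling of cases" for a GLM, here is a minimal sketch that bootstraps the coefficients of a Poisson regression by resampling whole data points; the data frame df, its variable names, and the number of replicates are all invented for this illustration.

# Hypothetical data: counts y with one predictor x.
set.seed(7)
df <- data.frame(x = runif(200))
df$y <- rpois(200, lambda = exp(1 + 2 * df$x))
fit <- glm(y ~ x, family = poisson, data = df)

# Case-resampling bootstrap: resample rows with replacement, refit, keep coefficients.
boot.coefs <- replicate(500, {
    resample <- df[sample(nrow(df), replace = TRUE), ]
    coef(glm(y ~ x, family = poisson, data = resample))
})
# Rows of boot.coefs are coefficients, columns are bootstrap replicates.
apply(boot.coefs, 1, quantile, probs = c(0.025, 0.975))   # crude 95% intervals

The percentile intervals printed at the end are the crudest choice; they could be replaced by the more careful bootstrap confidence intervals of Chapter 6.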

12.1.4 Modeling Dispersion

When we pick a family for the conditional distribution of Y, we get a predicted conditional variance function, V(μ(x)). The actual conditional variance V[Y | X = x] may however not track this. When the variances are larger, the process is over-dispersed; when they are smaller, under-dispersed. Over-dispersion is more common and more worrisome. In many cases, it arises from some un-modeled aspect of the process — some unobserved heterogeneity, or some missed dependence. For instance, if we observe count data with an upper limit and use a binomial model, we're assuming that each "trial" within a data point is independent; positive correlation between the trials will give larger variance around the mean than the mp(1 − p) we'd expect [1].

[1] If (for simplicity) all the trials have the same covariance ρ, then the variance of their sum is mp(1 − p) + m(m − 1)ρ (why?).

The most satisfying solution to over-dispersion is to actually figure out where it comes from, and model its origin. Failing that, however, we can fall back on more "phenomenological" modeling. One strategy is to say that

V[Y | X = x] = φ(x) V(μ(x))    (12.10)

and try to estimate the function φ — a modification of the variance-estimation idea we saw in §10.3. In doing so, we need a separate estimate of V[Y | X = x_i]. This can come from repeated measurements at the same value of x_i, or from the squared residuals at each data point. Once we have some noisy but independent estimate of V[Y | X = x_i], the ratio V[Y | X = x_i]/V(μ(x_i)) can be regressed on x_i to estimate φ. Some people recommend doing this step, itself, through a generalized linear or generalized additive model, with a gamma distribution for the response, so that the response is guaranteed to be positive.
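A much cruder special case of this strategy, common in practice, is to assume a constant dispersion factor φ and estimate it from the Pearson residuals; this is only a quick diagnostic, not the function-estimation approach just described. A sketch, assuming some fitted glm object called fit (for instance the Poisson fit from the earlier sketch):

# Constant-dispersion estimate: average squared Pearson residual,
# with the residual degrees of freedom n - (p+1) in the denominator.
pearson.resids <- residuals(fit, type = "pearson")   # (y_i - mu_i)/sqrt(V(mu_i))
phi.hat <- sum(pearson.resids^2)/df.residual(fit)
phi.hat   # well above 1 suggests over-dispersion, well below 1 under-dispersion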

12.1.5 Likelihood and Deviance

When dealing with GLMs, it is conventional to report not the log-likelihood, but the deviance. The deviance of a model with parameters (β_0, β) is defined as

D(β_0, β) = 2[ℓ(saturated) − ℓ(β_0, β)]    (12.11)

Here ℓ(β_0, β) is the log-likelihood of our model, and ℓ(saturated) is the log-likelihood of a saturated model which has one parameter per data point. Thus, models with high likelihoods will have low deviances, and vice versa. If our model is correct and has p + 1 parameters in all (including the intercept), then the deviance will generally approach a χ² distribution asymptotically, with n − (p + 1) degrees of freedom (Appendix I); the factor of 2 in the definition is to ensure this. For discrete response variables, the saturated model can usually ensure that Pr(Y = y_i | X = x_i) = 1, so ℓ(saturated) = 0, and the deviance is just twice the negative log-likelihood. If there are multiple data points with the same value of x but different values of y, then ℓ(saturated) < 0. In any case, even for repeated values of x or even continuous response variables, differences in deviance are just twice differences in log-likelihood: D(model_1) − D(model_2) = 2[ℓ(model_2) − ℓ(model_1)].

12.1.5.1 Maximum Likelihood and the Choice of Link Function

Having chosen a family of conditional distributions, it may happen that when we write out the log-likelihood, the latter depends jointly on the response variables y_i and the coefficients only through the product of y_i with some transformation of the conditional mean μ̂_i:

ℓ = Σ_{i=1}^{n} [f(y_i) + h(θ, x_i) + y_i g(μ̂_i)]    (12.12)

In the case of logistic regression, examining Eq. 11.8 (§11.2.1, p. 237) shows that the log-likelihood can be put in this form with g(μ̂_i) = log(μ̂_i/(1 − μ̂_i)). In the case of a Gaussian conditional distribution for Y, we would have f = −y_i²/2, h(θ) = −μ̂_i²/2, and g(μ̂_i) = μ̂_i. When the log-likelihood can be written in this form, g(·) is the "natural" transformation to apply to the conditional mean, i.e., the natural link function, and using it assures us that the solution to iterative least squares will converge on the maximum likelihood estimate [2]. Of course we are free to nonetheless use other transformations of the conditional expectation.

[2] To be more technical, we say that a distribution with parameters θ is an exponential family if its probability density function at x is exp(f(x) + T(x)·g(θ))/z(θ), for some vector of statistics T and some transformation g of the parameters. (To ensure normalization, z(θ) = ∫ exp(f(x) + T(x)·g(θ)) dx. Of course, if the sample space is discrete, replace this integral with a sum.) We then say that T(·) are the "natural" or "canonical" sufficient statistics, and g(θ) are the "natural" parameters. Eq. 12.12 is picking out the natural parameters, presuming the response variable is itself the natural sufficient statistic. Many of the familiar families of distributions, like Gaussians, exponentials, gammas, Paretos, binomials and Poissons, are exponential families. Exponential families are very important in classical statistical theory, and have deep connections to thermodynamics and statistical mechanics (where they're called "canonical ensembles", "Boltzmann distributions" or "Gibbs distributions" (Mandelbrot, 1962)), and to information theory (where they're "maximum entropy distributions", or "minimax codes" (Grünwald, 2007)). Despite their coolness, they are a rather peripheral topic for our sort of data analysis — though see Guttorp (1995) for examples of using them in modeling discrete processes. Any good book on statistical theory (e.g., Casella and Berger 2002) will have a fairly extensive discussion; Barndorff-Nielsen (1978) and Brown (1986) are comprehensive treatments.

12.1.6 R: glm

As with logistic regression, the workhorse R function for all manner of GLMs is, simply, glm. The syntax is strongly parallel to that of lm, with the addition of a family argument that specifies the intended distribution of the response variable (binomial, gaussian, poisson, etc.), and, optionally, a link function appropriate to the family. (See help(family) for the details.) With family="gaussian" and an identity link function, its intended behavior is the same as lm.
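As a small check on that last claim, here is a sketch comparing glm with the Gaussian family (and its default identity link) to lm on simulated data; the variables are invented for the illustration.

set.seed(1)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)
coef(lm(y ~ x))                       # ordinary least squares
coef(glm(y ~ x, family = gaussian))   # the same estimates from glm
# Other families follow the same pattern, optionally with an explicit link, e.g.
# glm(y01 ~ x, family = binomial(link = "logit")) or glm(counts ~ x, family = poisson(link = "log"))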

12.2 Generalized Additive Models

In the development of generalized linear models, we use the link function g to relate the conditional mean μ(x) to the linear predictor η(x). But really nothing in what we were doing required η to be linear in x. In particular, it all works perfectly well if η is an additive function of x. We form the effective responses z_i and the weights w_i as before, but now instead of doing a linear regression on x_i we do an additive regression, using backfitting (or whatever). This gives us a generalized additive model (GAM).

Essentially everything we know about the relationship between linear models and additive models carries over. GAMs converge somewhat more slowly as n grows than do GLMs, but the former have less bias, and strictly include GLMs as special cases. The transformed (mean) response is related to the predictor variables not just through coefficients, but through whole partial response functions. If we want to test whether a GLM is well-specified, we can do so by comparing it to a GAM, and so forth.

In fact, one could even make η(x) an arbitrary smooth function of x, to be estimated through (say) kernel smoothing of z_i on x_i. This is rarely done, however, partly because of curse-of-dimensionality issues, but also because, if one is going to go that far, one might as well just use kernels to estimate conditional distributions, as we will see in Chapter 14.

12.3 Further Reading

At our level of theory, good references on generalized linear and generalized additive models include Faraway (2006) and Wood (2006), both of which include extensive examples in R. Tutz (2012) offers an extensive treatment of GLMs with categorical response distributions, along with comparisons to other models for that task.

Overdispersion is the subject of a large literature of its own. All of the references just named discuss methods for it. Lambert and Roeder (1995) is worth mentioning for introducing some simple-to-calculate ways of detecting and describing over-dispersion which give some information about why the response is over-dispersed. One of these (the "relative variance curve") is closely related to the idea sketched above about estimating the dispersion factor.

Exercises

12.1 In binomial regression, we have Y | X = x ∼ Binom(n, p(x)), where p(x) follows a logistic model. Work out the link function g(μ), the variance function V(μ), and the weights w, assuming that n is known and not random.

12.2 Problem set A.15, on predicting the death rate in Chicago, is a good candidate for using Poisson regression. Repeat the exercises in that problem set with Poisson-response GAMs. How do the estimated functions change? Why is this any different from just taking the log of the death counts, as suggested in that problem set?

13 Classification and Regression Trees

[[TODO: Notes taken from another course; integrate]]

So far, the models we've worked with have been built on the principle of every point in the data set contributing (at least potentially) to every prediction. An alternative is to divide up, or partition, the data set, so that each prediction will only use points from one chunk of the space. If this partition is done in a recursive or hierarchical manner, we get a prediction tree, which comes in two varieties, regression trees and classification trees. These may seem too crude to actually work, but they can, in fact, be both powerful and computationally efficient.

13.1 Prediction Trees

The basic idea is simple. We want to predict a response or class Y from inputs X_1, X_2, ..., X_p. We do this by growing a binary tree. At each internal node in the tree, we apply a test to one of the inputs, say X_i. Depending on the outcome of the test, we go to either the left or the right sub-branch of the tree. Eventually we come to a leaf node, where we make a prediction. This prediction aggregates or averages all the training data points which reach that leaf. Figure 13.1 should help clarify this.

Why do this? Predictors like linear or polynomial regression are global models, where a single predictive formula is supposed to hold over the entire data space. When the data has lots of features which interact in complicated, nonlinear ways, assembling a single global model can be very difficult, and hopelessly confusing when you do succeed. As we've seen, non-parametric smoothers try to fit models locally and then paste them together, but again they can be hard to interpret. (Additive models are at least pretty easy to grasp.) An alternative approach to nonlinear prediction is to sub-divide, or partition, the space into smaller regions, where the interactions are more manageable. We then partition the sub-divisions again — this is recursive partitioning (or hierarchical partitioning) — until finally we get to chunks of the space which are so tame that we can fit simple models to them. The global model thus has two parts: one is just the recursive partition, the other is a simple model for each cell of the partition.

[Figure 13.1 appears here.]

Figure 13.1 Classification tree for county-level outcomes in the 2008 Democratic Party primary (as of April 16), by Amanda Cox for the New York Times. [[TODO: Get figure permission!]]

Now look back at Figure 13.1 and the description which came before it. Prediction trees use the tree to represent the recursive partition. Each of the terminal nodes, or leaves, of the tree represents a cell of the partition, and has attached to it a simple model which applies in that cell only. A point x belongs to a leaf if x falls in the corresponding cell of the partition. To figure out which cell we are in, we start at the root node of the tree, and ask a sequence of questions about the features. The interior nodes are labeled with questions, and the edges or branches between them labeled by the answers. Which question we ask next depends on the answers to previous questions. In the classic version, each question refers to only a single attribute, and has a yes or no answer, e.g., "Is HSGrad > 0.78?" or "Is Region == Midwest?" The variables can be of any combination of types (continuous, discrete but ordered, categorical, etc.). You could do more-than-binary questions, but that can always be accommodated as a larger binary tree. Asking questions about multiple variables at once is, again, equivalent to asking multiple questions about single variables.

That's the recursive partition part; what about the simple local models? For classic regression trees, the model in each cell is just a constant estimate of Y. That is, suppose the points (x_1, y_1), (x_2, y_2), ..., (x_c, y_c) are all the samples belonging to the leaf-node l. Then our model for l is just ŷ = (1/c) Σ_{i=1}^{c} y_i, the sample mean of the response variable in that cell. This is a piecewise-constant model [1]. There are several advantages to this:

• Making predictions is fast (no complicated calculations, just looking up constants in the tree)
• It's easy to understand what variables are important in making the prediction (look at the tree)
• If some data is missing, we might not be able to go all the way down the tree to a leaf, but we can still make a prediction by averaging all the leaves in the sub-tree we do reach
• The model gives a jagged response, so it can work when the true regression surface is not smooth. If it is smooth, though, the piecewise-constant surface can approximate it arbitrarily closely (with enough leaves)
• There are fast, reliable algorithms to learn these trees

[1] We could instead fit, say, a different linear regression for the response in each leaf node, using only the data points in that leaf (and using dummy variables for non-quantitative variables). This would give a piecewise-linear model, rather than a piecewise-constant one. If we've built the tree well, however, all the points in each leaf are pretty similar, so the regression surface would be nearly constant anyway.

A last analogy before we go into some of the mechanics. One of the most comprehensible non-parametric methods is k-nearest-neighbors: find the k points which are most similar to you, and do what, on average, they do. There are two big drawbacks to it: first, you're defining "similar" entirely in terms of the inputs, not the response; second, k is constant everywhere, when some points just might have more very-similar neighbors than others. Trees get around both problems: leaves correspond to regions of the input space (a neighborhood), but one where the responses are similar, as well as the inputs being nearby; and their size can vary arbitrarily. Prediction trees are, in a way, adaptive nearest-neighbor methods.
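To make the piecewise-constant idea concrete, here is a tiny hedged sketch in R: given a vector of leaf assignments (however the partition was obtained), the fitted value for each point is just its leaf's sample mean. The toy variables y and leaf are invented for the illustration.

# y: responses; leaf: a factor saying which cell of the partition each point is in.
y <- c(1.2, 0.9, 1.1, 3.8, 4.2, 4.0)
leaf <- factor(c("A", "A", "A", "B", "B", "B"))
leaf.means <- tapply(y, leaf, mean)          # the constant estimate in each cell
yhat <- leaf.means[as.character(leaf)]       # prediction = mean of the leaf the point falls in
yhat

Fitted tree packages do this bookkeeping automatically; the real work is choosing the partition, which is the subject of the next section.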

13.2 Regression Trees

Let's start with an example.

13.2.1 Example: California Real Estate Again

We'll revisit the California house-price data from Chapter 8, where we try to predict the median house price in each census tract of California from the attributes of the houses and of the inhabitants of the tract. We'll try growing a regression tree for this. There are several R packages for regression trees; the easiest one is called, simply, tree (Ripley, 2015).

calif <- read.table("http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat", header = TRUE)
require(tree)
treefit <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data = calif)

This does a tree regression of the log price on longitude and latitude. What does this look like? Figure 13.2 shows the tree itself; Figure 13.3 shows the partition, overlaid on the actual prices in the state. (The ability to show the partition is why I picked only two input variables.) Qualitatively, this looks like it does a fair job of capturing the interaction between longitude and latitude, and the way prices are higher around the coasts and the big cities. Quantitatively, the error isn't bad:

summary(treefit)
##
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ Longitude + Latitude,
##     data = calif)
## Number of terminal nodes:  12
## Residual mean deviance:  0.1662 = 3429 / 20630
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -2.75900  -0.26080  -0.01359   0.00000   0.26310   1.84100

Here "deviance" is just mean squared error; this gives us an RMS error of 0.41, which is higher than the smooth non-linear models in Chapter 8, but not shocking since we're using only two variables, and have only twelve leaves.

The flexibility of a tree is basically controlled by how many leaves it has, since that's how many cells it partitions things into. The tree-fitting function has a number of control settings which limit how much it will grow — each node has to contain a certain number of points, and adding a node has to reduce the error by at least a certain amount. The default for the latter, mindev, is 0.01; let's turn it down and see what happens.

treefit2 <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data = calif, mindev = 0.001)
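The RMS errors quoted for these trees (0.41 above, and 0.32 for the bigger tree below) can be checked directly by comparing predictions to the actual log prices; a minimal sketch, using only the objects already defined:

# Root-mean-squared error on the training data, for both trees.
sqrt(mean((log(calif$MedianHouseValue) - predict(treefit))^2))
sqrt(mean((log(calif$MedianHouseValue) - predict(treefit2))^2))

These should come out close to 0.41 and 0.32 respectively; small differences can arise from whether one divides by n or by the residual degrees of freedom.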

[Figure 13.2 appears here: the fitted regression tree treefit, with twelve leaves, drawn by the code below.]

plot(treefit)
text(treefit, cex = 0.75)

Figure 13.2 Regression tree for predicting California housing prices from geographic coordinates. At each internal node, we ask the associated question, and go to the left child if the answer is "yes", to the right child if the answer is "no". Note that leaves are labeled with log prices; the plotting function isn't flexible enough, unfortunately, to apply transformations to the labels.

Figure 13.4 shows the tree itself; with 68 nodes, the plot is fairly hard to read, but by zooming in on any part of it, you can check what it's doing. Figure 13.5 shows the corresponding partition. It's obviously much finer-grained than that in Figure 13.3, and does a better job of matching the actual prices (RMS error 0.32). More interestingly, it doesn't just uniformly divide up the big cells from

272 272 Trees l l l l 42 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 11.3 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 40 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 11.7 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.1 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 38 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Latitude l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 11.3 11.8 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 36 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.5 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l 12.1 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 11.6 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 34 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.5 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.5 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.1 11.2 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l −116 −118 −120 −122 −124 −114 Longitude price.deciles <- quantile(calif$MedianHouseValue, 0:10/10) cut.prices <- cut(calif$MedianHouseValue, price.deciles, include.lowest = TRUE) plot(calif$Longitude, calif$Latitude, col = grey(10:2/11)[cut.prices], pch = 20, xlab = "Longitude", ylab = "Latitude") partition.tree(treefit, ordvars = c("Longitude", "Latitude"), add = TRUE) Figure 13.3 Map of actual median house prices (color-coded by decile, darker being more expensive), and the partition of the treefit tree. the first partition; some of the new cells are very small, others quite large. The metropolitan areas get a lot more detail than the Mojave. Of course there’s nothing magic about the geographic coordinates, except that they make for pretty plots. We can include all the input features in our model:

Figure 13.4 As Figure 13.2, but allowing splits for smaller reductions in error (mindev = 0.001 rather than the default mindev = 0.01). The fact that the plot is nearly illegible is deliberate.

treefit3 <- tree(log(MedianHouseValue) ~ ., data = calif)

with the result shown in Figure 13.6. This model has fifteen leaves, as opposed to sixty-eight for treefit2, but the RMS error is almost as good (0.36). This is highly interactive: latitude and longitude are only used if the income level is sufficiently low. (Unfortunately, this does mean that we don't have a spatial partition to compare to the previous ones, but we can map the predictions; Figure 13.7.) Many of the features, while they were available to the tree fit, aren't used at all.
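The leaf counts and the RMS error just quoted can be checked directly from the fitted trees. Here is a minimal sketch, not code from the text, assuming calif, treefit2 (the sixty-eight-leaf tree of Figure 13.4) and treefit3 are all in the workspace; the helper rms.tree is ad hoc, introduced only for this check.

library(tree)
summary(treefit2)   # prints, among other things, the number of terminal nodes (68)
summary(treefit3)   # 15 terminal nodes
# Ad hoc helper: RMS error of the fitted values, on the log-price scale
rms.tree <- function(fit, data = calif) {
    sqrt(mean((log(data$MedianHouseValue) - predict(fit, newdata = data))^2))
}
rms.tree(treefit2)
rms.tree(treefit3)  # should come out near the 0.36 quoted above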

Now let's turn to how we actually grow these trees.

[Figure: the partition for the sixty-eight-leaf tree, overlaid on the map of California (Longitude versus Latitude), with the predicted log price printed in each cell.]
l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.1 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.9 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.1 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.7 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 12.5 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 11.2 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l 
plot(calif$Longitude, calif$Latitude, col = grey(10:2/11)[cut.prices], pch = 20,
    xlab = "Longitude", ylab = "Latitude")
partition.tree(treefit2, ordvars = c("Longitude", "Latitude"), add = TRUE, cex = 0.3)

Figure 13.5 Partition for treefit2. Note the high level of detail around the cities, as compared to the much coarser cells covering rural areas where variations in prices are less extreme.

13.2.2 Regression Tree Fitting

Once we fix the tree, the local models are completely determined, and easy to find (we just average), so all the effort should go into finding a good tree, which is to say into finding a good partitioning of the data. Ideally, we would like to maximize the information the partition gives us about the response variable.

plot(treefit3)
text(treefit3, cex = 0.5, digits = 3)

Figure 13.6 Regression tree for log price when all other features are included as (potential) inputs. Note that many of the features are not used by the tree.

Since we are doing regression, what we would really like is for the conditional mean E[Y | X = x] to be nearly constant in x over each cell of the partition, and for adjoining cells to have distinct expected values. (It's OK if two cells of the partition that are far apart have similar average values.) We can't do this directly, so we do a greedy search.

cut.predictions <- cut(predict(treefit3), log(price.deciles), include.lowest = TRUE)
plot(calif$Longitude, calif$Latitude, col = grey(10:2/11)[cut.predictions], pch = 20,
    xlab = "Longitude", ylab = "Latitude")

Figure 13.7 Predicted prices for the treefit3 model. Same color scale as in previous plots (where the dots indicated actual prices).

We start by finding the one binary question we can ask about the predictors which maximizes the information we get about the average value of Y; this gives us our root node and two daughter nodes.[2] At each daughter node, we repeat our initial procedure, asking which question would give us the maximum information about the average value of Y, given where we already are in the tree. We repeat this recursively. Every recursive algorithm needs to know when it's done, a stopping criterion.

[2] Mixing botanical and genealogical metaphors for trees is ugly, but I can't find a way around it.

Here this means when to stop trying to split nodes. Obviously nodes which contain only one data point cannot be split, but giving each observation its own leaf is unlikely to generalize well. A more typical criterion is something like: halt when each child would contain less than five data points, or when splitting increases the information by less than some threshold. Picking the criterion is important to get a good tree, so we'll come back to it presently.

To really make this operational, we need to be more precise about what we mean by “information about the average value of Y”. This can be measured straightforwardly by the mean squared error. The MSE for a tree T is

\mathrm{MSE}(T) = \frac{1}{n} \sum_{c \in \mathrm{leaves}(T)} \sum_{i \in c} (y_i - m_c)^2 \qquad (13.1)

where m_c = \frac{1}{n_c} \sum_{i \in c} y_i, the prediction for leaf c.

The basic regression-tree-growing algorithm then is as follows:

1. Start with a single node containing all points. Calculate m_c and the MSE.
2. If all the points in the node have the same value for all the input variables, stop. Otherwise, search over all binary splits of all variables for the one which will reduce the MSE as much as possible. If the largest decrease in the MSE would be less than some threshold δ, or one of the resulting nodes would contain less than q points, stop. Otherwise, take that split, creating two new nodes.
3. In each new node, go back to step 1.

Trees use only one feature (input variable) at each step. If multiple features are equally good, which one is chosen is a matter of chance, or arbitrary programming decisions.

One problem with the straightforward algorithm I've just given is that it can stop too early, in the following sense. There can be variables which are not very informative themselves, but which lead to very informative subsequent splits. This suggests a problem with stopping when the decrease in the MSE becomes less than some δ. Similar problems can arise from arbitrarily setting a minimum number of points q per node.
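To make the greedy search concrete, here is a minimal sketch of the split-finding step under the MSE criterion of Eq. 13.1. This is not the code used by any R package; the function names, variable names, and the made-up data are all purely illustrative.

# Sum of squared errors around the mean; for a fixed sample, minimizing this
# is equivalent to minimizing the MSE (they differ only by the factor 1/n).
sse <- function(y) { sum((y - mean(y))^2) }

# Search all binary splits "x <= s" of all candidate predictors, and return
# the one giving the largest decrease in the within-leaf sum of squares.
best.split <- function(df, response, predictors) {
    y <- df[[response]]
    best <- list(var = NA, cutpoint = NA, decrease = 0)
    for (v in predictors) {
        x <- df[[v]]
        vals <- sort(unique(x))
        if (length(vals) < 2) { next }
        cuts <- (vals[-1] + vals[-length(vals)]) / 2  # midpoints between distinct values
        for (s in cuts) {
            left <- (x <= s)
            decrease <- sse(y) - (sse(y[left]) + sse(y[!left]))
            if (decrease > best$decrease) {
                best <- list(var = v, cutpoint = s, decrease = decrease)
            }
        }
    }
    return(best)
}

# Illustration on made-up data: the best split should be found on x1, near 0.5
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- ifelse(df$x1 > 0.5, 3, 0) + rnorm(200, sd = 0.5)
best.split(df, "y", c("x1", "x2"))

Growing a whole tree would just mean applying something like this recursively to each new node, stopping according to δ and q as in the algorithm above.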

A more successful approach to finding regression trees uses the idea of cross-validation (Chapter 3), especially k-fold cross-validation. We initially grow a large tree, looking only at the error on the training data. (We might even set q = 1 and δ = 0 to get the largest tree we can.) This tree is generally too large and will over-fit the data. The issue is basically about the number of leaves in the tree. For a given number of leaves, there is a unique best tree. As we add more leaves, we can only lower the bias, but we also increase the variance, since we have to estimate more. At any finite sample size, then, there is a tree with a certain number of leaves which will generalize better than any other. We would like to find this optimal number of leaves.

The reason we start with a big (lush? exuberant? spreading?) tree is to make sure that we've got an upper bound on the optimal number of leaves. Thereafter, we consider simpler trees, which we obtain by pruning the large tree. At each pair of leaves with a common parent, we evaluate the error of the tree on the testing data, and also of the sub-tree which removes those two leaves and puts a leaf at the common parent. We then prune that branch of the tree, and so forth until we come back to the root. Starting the pruning from different leaves may give multiple pruned trees with the same number of leaves; we'll look at which sub-tree does best on the testing set. The reason this is superior to arbitrary stopping criteria, or to rewarding parsimony as such, is that it directly checks whether the extra capacity (nodes in the tree) pays for itself by improving generalization error. If it does, great; if not, get rid of the complexity.

There are lots of other cross-validation tricks for trees. One cute one is to alternate growing and pruning. We divide the data into two parts, as before, and first grow and then prune the tree. We then exchange the roles of the training and testing sets, and try to grow our pruned tree to fit the second half. We then prune again, on the first half. We keep alternating in this manner until the size of the tree doesn't change.

13.2.2.1 Cross-Validation and Pruning in R

The tree package contains functions prune.tree and cv.tree for pruning trees by cross-validation.

The function prune.tree takes a tree you fit by tree, and evaluates the error of the tree and various prunings of the tree, all the way down to the stump. The evaluation can be done either on new data, if supplied, or on the training data (the default). If you ask it for a particular size of tree, it gives you the best pruning of that size.[3] If you don't ask it for the best tree, it gives an object which shows the number of leaves in the pruned trees, and the error of each one. This object can be plotted.

[3] Or, if there is no tree with that many leaves, the smallest number of leaves ≥ the requested size.

my.tree <- tree(y ~ x1 + x2, data = my.data)
prune.tree(my.tree, best = 5)
prune.tree(my.tree, best = 5, newdata = test.set)
my.tree.seq <- prune.tree(my.tree)
plot(my.tree.seq)
my.tree.seq$dev
opt.trees <- which(my.tree.seq$dev == min(my.tree.seq$dev))
min(my.tree.seq$size[opt.trees])

Finally, prune.tree has an optional method argument. The default is method = "deviance", which fits by minimizing the mean squared error (for continuous responses) or the negative log-likelihood (for discrete responses; see below).[4]

[4] With discrete responses, you may get better results by saying method = "misclass", which looks at the misclassification rate.

The function cv.tree does k-fold cross-validation (default is k = 10). It requires as an argument a fitted tree, and a function which will take that tree and new data. By default, this function is prune.tree.
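Putting the pieces together, here is a self-contained sketch of the grow-big-then-prune-by-cross-validation recipe, using the functions just described. The simulated data frame and all object names here are my own, not part of the California example, and the mindev setting is just one convenient way to force a large initial tree.

library(tree)
set.seed(42)
sim <- data.frame(x1 = runif(500), x2 = runif(500))
sim$y <- ifelse(sim$x1 < 0.3, 1, 2) + ifelse(sim$x2 < 0.6, 0, 1) + rnorm(500, sd = 0.3)

# Grow a deliberately over-large tree; a tiny mindev plays roughly the role
# of a near-zero threshold delta in the algorithm above
big.tree <- tree(y ~ x1 + x2, data = sim,
    control = tree.control(nobs = nrow(sim), mindev = 1e-4))

big.tree.cv <- cv.tree(big.tree)  # 10-fold CV over the sequence of prunings
# smallest size attaining the minimum cross-validated deviance
best.size <- min(big.tree.cv$size[big.tree.cv$dev == min(big.tree.cv$dev)])
pruned.tree <- prune.tree(big.tree, best = best.size)

The treefit2 illustration discussed below follows the same pattern, with the size/deviance trade-off examined by plotting the cv.tree object.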

my.tree.cv <- cv.tree(my.tree)   # 10-fold CV, pruning with prune.tree by default

The type of output of cv.tree is the same as that of the function it applies. If I do cv.tree(my.tree, best = 19), I get the best tree (per cross-validation) of no more than 19 leaves. If I do cv.tree(my.tree), I get information about the cross-validated performance of the whole sequence of pruned trees, which can be plotted, e.g., plot(cv.tree(my.tree)). Optional arguments to cv.tree can include the number of folds, and any additional arguments for the function it applies (e.g., any arguments taken by prune.tree).

To illustrate, think back to treefit2, which predicted California house prices based on geographic coordinates, but had a very large number of nodes because the tree-growing algorithm was told to split at the least provocation. Figure 13.8 shows the size/performance trade-off. Figures 13.9 and 13.10 show the result of pruning to the smallest size compatible with minimum cross-validated error.

13.2.3 Uncertainty in Regression Trees

Even when we are making point predictions, we have some uncertainty, because we've only seen a finite amount of data, and this is not an entirely representative sample of the underlying probability distribution. With a regression tree, we can separate the uncertainty in our predictions into two parts. First, we have some uncertainty in what our predictions should be, assuming the tree is correct. Second, we may of course be wrong about the tree.

The first source of uncertainty — imprecise estimates of the conditional means within a given partition — is fairly easily dealt with. We can consistently estimate the standard error of the mean for leaf c just like we would for any other mean of IID samples. The second source is more troublesome; as the response values shift, the tree itself changes, and discontinuously so, tree shape being a discrete variable. What we want is some estimate of how different the tree could have been, had we just drawn a different sample from the same source distribution. One way to estimate this, from the data at hand, is to use bootstrapping (ch. 6). It is important that we apply the bootstrap to the predicted values, which can change smoothly if we make a tiny perturbation to the distribution, and not to the shape of the tree itself (which can only change abruptly).
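As a sketch of what bootstrapping the predicted values might look like in the simplest case (resampling whole rows, as in the basic nonparametric bootstrap of ch. 6), here is one version; my.data, y, x1, x2 and eval.points (the points where we want prediction intervals) are placeholders, and the percentile limits at the end are only a crude way of turning the bootstrap replicates into a confidence statement.

library(tree)
boot.tree.preds <- function(data, eval.points, B = 800) {
    preds <- replicate(B, {
        resampled <- data[sample(nrow(data), replace = TRUE), ]  # resample whole cases
        boot.fit <- tree(y ~ x1 + x2, data = resampled)          # re-grow the tree
        predict(boot.fit, newdata = eval.points)                 # keep predictions, not shapes
    })
    # preds has one row per evaluation point and one column per replicate
    apply(preds, 1, quantile, probs = c(0.025, 0.975))           # rough 95% limits
}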

13.3 Classification Trees

Classification trees work just like regression trees, only they try to predict a discrete category (the class), rather than a numerical value. The variables which go into the classification — the inputs — can be numerical or categorical themselves, the same way they can with a regression tree. They are useful for the same reasons regression trees are — they provide fairly comprehensible predictors in situations where there are many variables which interact in complicated, nonlinear ways.

We find classification trees in almost the same way we found regression trees: we start with a single node, and then look for the binary distinction which gives us the most information about the class. We then take each of the resulting new nodes and repeat the process there, continuing the recursion until we reach some stopping criterion. The resulting tree will often be too large (i.e., over-fit), so we prune it back using (say) cross-validation. The differences from regression-tree growing have to do with (1) how we measure information, (2) what kind of predictions the tree makes, and (3) how we measure predictive error.

13.3.1 Measuring Information

The response variable Y is categorical, so we can use information theory to measure how much we learn about it from knowing the value of another discrete variable A:

    I[Y; A] \equiv \sum_a \Pr(A = a) \, I[Y; A = a]    (13.2)

where

    I[Y; A = a] \equiv H[Y] - H[Y \mid A = a]    (13.3)

and you remember the definitions of entropy H[Y] and conditional entropy H[Y | A = a],

    H[Y] \equiv -\sum_y \Pr(Y = y) \log_2 \Pr(Y = y)    (13.4)

and

    H[Y \mid A = a] \equiv -\sum_y \Pr(Y = y \mid A = a) \log_2 \Pr(Y = y \mid A = a)    (13.5)

I[Y; A = a] is how much our uncertainty about Y decreases from knowing that A = a. (Less subjectively: how much less variable Y becomes when we go from the full population to the sub-population where A = a.) I[Y; A] is how much our uncertainty about Y shrinks, on average, from knowing the value of A.

For classification trees, A isn't (necessarily) one of the predictors, but rather the answer to some question, generally binary, about one of the predictors X, i.e., A = \mathbf{1}_A(X) for some set A. This doesn't change any of the math above, however. So we choose the question in the first, root node of the tree so as to maximize I[Y; A], which we calculate from the formula above, using the relative frequencies in our data to get the probabilities.
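Here is a small sketch of the calculation in Eqs. 13.2–13.5, with the probabilities replaced by relative frequencies in the data; the helper names entropy and info.gain are mine, y is a vector of classes, and a is the answer to a candidate binary question, e.g. a <- (x1 < 3) for some made-up threshold.

entropy <- function(y) {
    p <- table(y) / length(y)
    p <- p[p > 0]                # 0 log 0 counts as 0
    -sum(p * log2(p))            # Eq. 13.4, with relative frequencies
}

info.gain <- function(y, a) {
    H.conditional <- sapply(split(y, a), entropy)   # H[Y | A = a] for each answer a
    weights <- table(a) / length(a)                 # relative frequency of each answer
    entropy(y) - sum(weights * H.conditional)       # Eq. 13.2
}

Finding the root split then amounts to evaluating info.gain(y, a) for every candidate question a and taking the largest.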

When we want to get good questions at subsequent nodes, we have to take into account what we know already at each stage. Computationally, we do this by computing the probabilities and informations using only the cases in that node, rather than the complete data set. (Remember that we're doing recursive partitioning, so at each stage the sub-problem looks just like a smaller version of the original problem.) Mathematically, what this means is that if we reach the node when A = a and B = b, we look for the question C which maximizes I[Y; C | A = a, B = b], the information conditional on A = a, B = b. Algebraically,

    I[Y; C \mid A = a, B = b] = H[Y \mid A = a, B = b] - H[Y \mid A = a, B = b, C]    (13.6)

Computationally, rather than looking at all the cases in our data set, we just look at the ones where A = a and B = b, and calculate as though that were all the data. Also, notice that the first term on the right-hand side, H[Y | A = a, B = b], does not depend on the next question C. So rather than maximizing I[Y; C | A = a, B = b], we can just minimize H[Y | A = a, B = b, C].

13.3.2 Making Predictions

There are two kinds of predictions which a classification tree can make. One is a point prediction, a single guess as to the class or category: to say "this is a flower" or "this is a tiger" and nothing more. The other, a distributional prediction, gives a probability for each class. This is slightly more general, because if we need to extract a point prediction from a probability forecast we can always do so, but we can't go in the other direction.

For probability forecasts, each terminal node in the tree gives us a distribution over the classes. If the terminal node corresponds to the sequence of answers A = a, B = b, ..., Q = q, then ideally this would give us \Pr(Y = y \mid A = a, B = b, \ldots, Q = q) for each possible value y of the response. A simple way to get close to this is to use the empirical relative frequencies of the classes in that node. E.g., if there are 33 cases at a certain leaf, 22 of which are tigers and 11 of which are flowers, the leaf should predict "tiger with probability 2/3, flower with probability 1/3". This is the maximum likelihood estimate of the true probability distribution, and we'll write it \widehat{\Pr}(\cdot).

Incidentally, while the empirical relative frequencies are consistent estimates of the true probabilities under many circumstances, nothing particularly compels us to use them. When the number of classes is large relative to the sample size, we may easily fail to see any samples at all of a particular class. The empirical relative frequency of that class is then zero. This is good if the actual probability is zero, not so good otherwise. (In fact, under the negative log-likelihood error discussed below, it's infinitely bad, because we will eventually see that class, but our model will say it's impossible.) The empirical relative frequency estimator is in a sense too reckless in following the data, without allowing for the possibility that the data are wrong; it may under-smooth. Other probability estimators "shrink away" or "back off" from the empirical relative frequencies; Exercise 13.3 involves one such estimator.
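To make these probability estimates concrete, here is a sketch for a single leaf with made-up class counts; the second estimator is Laplace's "rule of succession" from Exercise 13.3 (Eq. 13.9), one example of the "shrinking" estimators just mentioned.

leaf.counts <- c(tiger = 22, flower = 11, shrub = 0)   # made-up counts at one leaf
empirical.probs <- leaf.counts / sum(leaf.counts)      # maximum likelihood: 2/3, 1/3, 0
laplace.probs <- (leaf.counts + 1) /
                 (sum(leaf.counts) + length(leaf.counts))  # Eq. 13.9
empirical.probs   # the unseen class gets probability exactly 0
laplace.probs     # every class gets strictly positive probability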

For point forecasts, the best strategy depends on the loss function. If it is just the mis-classification rate, then the best prediction at each leaf is the class with the highest conditional probability in that leaf. With other loss functions, we should make the guess which minimizes the expected loss. But this leads us to the topic of measuring error.

13.3.3 Measuring Error

There are three common ways of measuring error for classification trees, or indeed other classification algorithms: misclassification rate, expected loss, and normalized negative log-likelihood, a.k.a. cross-entropy.

13.3.3.1 Misclassification Rate

We've already seen this: it's the fraction of cases assigned to the wrong class.

13.3.3.2 Average Loss

The idea of the average loss is that some errors are more costly than others. For example, we might try classifying cells into "cancerous" or "not cancerous" based on their gene expression profiles. If we think a healthy cell from someone's biopsy is cancerous, we refer them for further tests, which are frightening and unpleasant, but not, as the saying goes, the end of the world. If we think a cancer cell is healthy, the consequences are much more serious! There will be a different cost for each combination of the real class and the guessed class; write L_{ij} for the cost ("loss") we incur by saying that the class is j when it's really i.

For an observation x, the classifier gives class probabilities \Pr(Y = i \mid X = x). Then the expected cost of predicting j is:

    \mathrm{Loss}(Y = j \mid X = x) = \sum_i L_{ij} \Pr(Y = i \mid X = x)    (13.7)

A cost matrix might look as follows:

                      prediction
    truth        "cancer"   "healthy"
    "cancer"         0          100
    "healthy"        1            0

We run an observation through the tree and wind up with class probabilities (0.4, 0.6). The most likely class is "healthy", but it is not the most cost-effective decision. The expected cost of predicting "cancer" is 0.4 * 0 + 0.6 * 1 = 0.6, while the expected cost of predicting "healthy" is 0.4 * 100 + 0.6 * 0 = 40. The probability of Y = "healthy" must be more than 100 times higher than that of Y = "cancer" before "healthy" becomes the cost-effective prediction.

Notice that if our estimate of the class probabilities is very bad, we can go through the math above correctly, but still come out with the wrong answer. If our estimates were exact, however, we'd always be doing as well as we could, given the data.

You can show (Exercise 13.6) that if the costs are symmetric, we get the mis-classification rate back as our error function, and should always predict the most likely class.
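Here is a minimal sketch of the expected-loss calculation in Eq. 13.7, using the cancer/healthy cost matrix and the class probabilities (0.4, 0.6) from the example above.

L <- matrix(c(0, 100,    # truth "cancer":  predicting "cancer" costs 0, "healthy" costs 100
              1,   0),   # truth "healthy": predicting "cancer" costs 1, "healthy" costs 0
            nrow = 2, byrow = TRUE,
            dimnames = list(truth = c("cancer", "healthy"),
                            prediction = c("cancer", "healthy")))
class.probs <- c(cancer = 0.4, healthy = 0.6)  # probabilities from some leaf
expected.costs <- class.probs %*% L            # Eq. 13.7 for each possible guess
expected.costs                                 # 0.6 for "cancer", 40 for "healthy"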

13.3.3.3 Likelihood and Cross-Entropy

The normalized negative log-likelihood is a way of looking not just at whether the model made the wrong call, but whether it made the wrong call with confidence or tentatively. ("Often wrong, never in doubt" is not a good way to go through life.) More precisely, this loss function for a model Q is

    L(\mathrm{data}, Q) = -\frac{1}{n} \sum_{i=1}^{n} \log Q(Y = y_i \mid X = x_i)    (13.8)

where Q(Y = y | X = x) is the conditional probability the model predicts. If perfect classification were possible, i.e., if Y were a function of X, then the best classifier would give the actual value of Y a probability of 1, and L = 0. If there is some irreducible uncertainty in the classification, then the best possible classifier would give L = H[Y | X], the conditional entropy of Y given the inputs X. Less-than-ideal predictors have L > H[Y | X]. To see this, try re-writing L so we sum over values rather than data-points:

    L = -\frac{1}{n} \sum_{x,y} N(Y = y, X = x) \log Q(Y = y \mid X = x)
      = -\sum_{x,y} \widehat{\Pr}(Y = y, X = x) \log Q(Y = y \mid X = x)
      = -\sum_{x,y} \widehat{\Pr}(X = x) \widehat{\Pr}(Y = y \mid X = x) \log Q(Y = y \mid X = x)
      = -\sum_{x} \widehat{\Pr}(X = x) \sum_{y} \widehat{\Pr}(Y = y \mid X = x) \log Q(Y = y \mid X = x)

If the quantity in the log were \Pr(Y = y \mid X = x), this would be H[Y | X]. Since it's the model's estimated probability, rather than the real probability, it turns out that this is always larger than the conditional entropy. L is also called the cross-entropy for this reason.

There is a slightly subtle issue here about the difference between the in-sample loss and the expected generalization error or risk. N(Y = y, X = x)/n = \widehat{\Pr}(Y = y, X = x), the empirical relative frequency or empirical probability. The law of large numbers says that this converges to the true probability, N(Y = y, X = x)/n → \Pr(Y = y, X = x) as n → ∞. Consequently, the model which minimizes the cross-entropy in sample may not be the one which minimizes it on future data, though the two ought to converge. Generally, the in-sample cross-entropy is lower than its expected value.

Notice that to compare two models, or the same model on two different data sets, etc., we do not need to know the true conditional entropy H[Y | X]. All we need to know is that L is smaller the closer we get to the true class probabilities. If we could get L down to the conditional entropy, we would be exactly reproducing all the class probabilities, and then we could use our model to minimize any loss function we liked (as we saw above).⁵

⁵ Technically, if our model gets the class probabilities right, then the model's predictions are just as informative as the original data. We then say that the predictions are a sufficient statistic for forecasting the class. In fact, if the model gets the exact probabilities wrong, but has the correct partition of the feature space, then its prediction is still a sufficient statistic. Under any loss function, the optimal strategy can be implemented using only a sufficient statistic, rather than needing the full, original data. This is an interesting but much more advanced topic; see, e.g., Blackwell and Girshick (1954) for details.
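As a sketch of how Eq. 13.8 might be computed, suppose we have a matrix of predicted class probabilities, one row per case and one column per class, with the columns named by class (which is what predict gives for a classification tree grown with tree()), together with the vector of actual classes; the function name cross.entropy is mine.

cross.entropy <- function(predicted.probs, y) {
    n <- length(y)
    # column of the probability matrix matching each case's actual class
    col.index <- match(as.character(y), colnames(predicted.probs))
    # probability the model gave to the class which actually occurred
    prob.of.truth <- predicted.probs[cbind(seq_len(n), col.index)]
    -mean(log(prob.of.truth))    # Eq. 13.8
}

For a classification tree my.class.tree fit to my.data, the call would then be something like cross.entropy(predict(my.class.tree), my.data$class).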

13.3.3.4 Neyman-Pearson Approach

Using a loss function which assigns different weights to different error types has two noticeable drawbacks. First of all, we have to pick the weights, and this is often quite hard to do. Second, whether our classifier will do well in the future depends on getting the same proportion of cases in the future. Suppose that we're developing a tree to classify cells as cancerous or not from their gene expression profiles. We will probably want to include lots of cancer cells in our training data, so that we can get a good idea of what cancers look like, biochemically. But, fortunately, most cells are not cancerous, so if doctors start applying our test to their patients, they're going to find that it massively over-diagnoses cancer — it's been calibrated to a sample where the proportion (cancer):(healthy) is, say, 1:1, rather than, say, 1:20.⁶

There is an alternative to weighting which deals with both of these issues, and deserves to be better known and more widely-used than it is. This was introduced by Scott and Nowak (2005), under the name of the "Neyman-Pearson approach" to statistical learning. The reasoning goes as follows. When we do a binary classification problem, we're really doing a hypothesis test, and the central issue in hypothesis testing, as first recognized by Neyman and Pearson, is to distinguish between the rates of different kinds of errors: false positives and false negatives, false alarms and misses, type I and type II. The Neyman-Pearson approach to designing a hypothesis test is to first fix a limit on the false positive probability, the size of the test, canonically α. Then, among all tests of size α, we want to minimize the false negative rate, or equivalently maximize the power, β.

In the traditional theory of testing, we know the distribution of the data under the null and alternative hypotheses, and so can (in principle) calculate α and β for any given test. This is not the case in many applied problems, but then we often do have large samples generated under both distributions (depending on the class of the data point). If we fix α, we can ask, for any classifier — say, a tree — whether its false alarm rate is ≤ α. If so, we keep it for further consideration; if not, we discard it. Among those with acceptable false alarm rates, then, we ask "which classifier has the lowest false negative rate, the highest β?" This is the one we select.

Notice that this solves both problems with weighting. We don't have to pick a weight for the two errors; we just have to say what rate of false positives α we're willing to accept. There are many situations where this will be easier to do than to fix on a relative cost. Second, the error rates α and β are properties of the conditional distributions of the features, \Pr(X \mid Y). If those conditional distributions stay the same but the proportions of the classes change, then the error rates are unaffected. Thus, training the classifier with a different mix of cases than we'll encounter in the future is not an issue.

Unfortunately, I don't know of any R implementation of Neyman-Pearson learning; it wouldn't be hard, I think, but goes beyond one problem set at this level.

⁶ Cancer is rarer than that, but realistically doctors aren't going to run a test like this unless they have some reason to suspect cancer might be present.
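To make the selection rule concrete, here is a minimal sketch (not a packaged implementation, and not Scott and Nowak's own algorithm). It assumes a list candidates of fitted classification trees (say, prunings of different sizes), a held-out data frame test.set whose true classes are in test.set$y, and two classes labeled "cancer" and "healthy" as in the running example, with "cancer" counted as the positive class.

np.select <- function(candidates, test.set, alpha = 0.05) {
    rates <- sapply(candidates, function(fit) {
        guess <- predict(fit, newdata = test.set, type = "class")
        c(false.pos = mean(guess == "cancer" & test.set$y == "healthy") /
                      mean(test.set$y == "healthy"),      # false alarm rate
          false.neg = mean(guess == "healthy" & test.set$y == "cancer") /
                      mean(test.set$y == "cancer"))       # miss rate
    })
    acceptable <- which(rates["false.pos", ] <= alpha)    # keep only classifiers of size <= alpha
    if (length(acceptable) == 0) return(NULL)             # nothing meets the size constraint
    acceptable[which.min(rates["false.neg", acceptable])] # the most powerful of the survivors
}

The function returns the index of the selected classifier, or NULL if nothing meets the size constraint.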

13.4 Further Reading

The classic book on prediction trees, which basically introduced them into statistics and data mining, is Breiman et al. (1984). Chapter three in Berk (2008) is clear, easy to follow, and draws heavily on Breiman et al. Another very good chapter is the one on trees in Ripley (1996), which is especially useful for us because Ripley wrote the tree package. (The whole book is strongly recommended.) There is another tradition of trying to learn tree-structured models which comes out of artificial intelligence and inductive logic; see Mitchell (1997).

The clearest explanation of the Neyman-Pearson approach to hypothesis testing I have ever read is that in Reid (1982), which is one of the books which made me decide to learn statistics.

Exercises

13.1 Repeat the analysis of the California house-price data with the Pennsylvania data from Problem Set A.13.

13.2 Explain why, for a fixed partition, a regression tree is a linear smoother.

13.3 Suppose that we see each of k classes n_i times, with \sum_{i=1}^{k} n_i = n. The maximum likelihood estimate of the probability of the i-th class is \hat{p}_i = n_i / n. Suppose that instead we use the estimates

    \tilde{p}_i = \frac{n_i + 1}{\sum_{j=1}^{k} (n_j + 1)}    (13.9)

This estimator goes back to Laplace, who called it the "rule of succession".

1. Show that \sum_{i=1}^{k} \tilde{p}_i = 1, no matter what the sample is.
2. Show that if \hat{p} → p as n → ∞, then \tilde{p} → p as well.
3. Using the result of the previous part, show that if we observe an IID sample, then \tilde{p} → p, i.e., that \tilde{p} is a consistent estimator of the true distribution.
4. Does \tilde{p} → p imply \hat{p} → p?
5. Which of these properties still hold if the +1s in the numerator and denominator are replaced by +d for an arbitrary d > 0?

13.4 Fun with Laplace's rule of succession: will the Sun rise tomorrow? One illustration Laplace gave of the probability estimator in Eq. 13.9 was the following. Suppose we know, from written records, that the Sun has risen in the east every day for the last 4000 years.⁷

⁷ Laplace was thus ignoring people who live above the Arctic circle, or below the Antarctic circle. The latter seems particularly unfair, because so many of them are scientists.

1. Calculate the probability of the event "the Sun will rise in the east tomorrow", using Eq. 13.9. You may take the year as containing 365.256 days.
2. Calculate the probability that the Sun will rise in the east every day for the next four thousand years, assuming this is an IID event each day. Is this a reasonable assumption?
3. Calculate the probability of the event "the Sun will rise in the east every day for four thousand years" directly from Eq. 13.9, treating that as a single event. Why does your answer here not agree with that of part 2?

(Laplace did not, of course, base his belief that the Sun will rise in the morning on such calculations; besides everything else, he was the world's expert in celestial mechanics! But this shows a problem with the "rule of succession".)

13.5 It's reasonable to wonder why we should measure the complexity of a tree by just the number of leaves it has, rather than by the total number of nodes. Show that for a binary tree with |T| leaves, the total number of nodes (including the leaves) is 2|T| − 1. (Thus, controlling the number of leaves is equivalent to controlling the number of nodes.)

13.6 Show that, when all the off-diagonal elements of L_{ij} (from §13.3.3.2) are equal (and positive!), the best class to predict is always the most probable class.

[Plot: cross-validated deviance (vertical axis) against tree size (lower horizontal axis) for successive prunings of treefit2; the upper horizontal axis shows the corresponding cost-complexity values.]

treefit2.cv <- cv.tree(treefit2)
plot(treefit2.cv)

Figure 13.8 Size (horizontal axis) versus cross-validated sum of squared errors (vertical axis) for successive prunings of the treefit2 model. (The upper scale on the horizontal axis refers to the "cost/complexity" penalty. The idea is that the pruning minimizes (total error) + λ(complexity) for a certain value of λ, which is what's shown on that scale. Here complexity is taken to just be the number of leaves in the tree, i.e., its size, though sometimes other measures of complexity are used. λ then acts as a Lagrange multiplier (§H.3.2) which enforces a constraint on the complexity of the tree. See Ripley (1996, §7.2, pp. 221–226) for details.)

[Plot: the pruned tree, drawn with its split rules (on Latitude and Longitude) at the internal nodes and the predicted values at the leaves.]

opt.trees <- which(treefit2.cv$dev == min(treefit2.cv$dev))
best.leaves <- min(treefit2.cv$size[opt.trees])
treefit2.pruned <- prune.tree(treefit2, best = best.leaves)
plot(treefit2.pruned)
text(treefit2.pruned, cex = 0.75)

Figure 13.9 treefit2, after being pruned by ten-fold cross-validation.

[Figure 13.10 plot: the data points mapped by longitude and latitude, labeled with the pruned tree's predictions; graphical residue omitted.]