# Front Matter

## Transcript

The R Book

The R Book, Second Edition. Michael J. Crawley, Imperial College London at Silwood Park, UK. http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm. A John Wiley & Sons, Ltd., Publication

### Chapters

- Preface
- 1 Getting Started
- 2 Essentials of the R Language
- 3 Data Input
- 4 Dataframes
- 5 Graphics
- 6 Tables
- 7 Mathematics
- 8 Classical Tests
- 9 Statistical Modelling
- 10 Regression
- 11 Analysis of Variance
- 12 Analysis of Covariance
- 13 Generalized Linear Models
- 14 Count Data
- 15 Count Data in Tables
- 16 Proportion Data
- 17 Binary Response Variables
- 18 Generalized Additive Models
- 19 Mixed-Effects Models
- 20 Non-Linear Regression
- 21 Meta-Analysis
- 22 Bayesian Statistics
- 23 Tree Models
- 24 Time Series Analysis
- 25 Multivariate Statistics
- 26 Spatial Statistics
- 27 Survival Analysis
- 28 Simulation Models
- 29 Changing the Look of Graphics
- References and Further Reading
- Index

### Detailed Contents

- Preface
- 1 Getting Started
- 1.1 How to use this book
- 1.1.1 Beginner in both computing and statistics
- 1.1.2 Student needing help with project work
- 1.1.3 Done some R and some statistics, but keen to learn more of both
- 1.1.4 Done regression and ANOVA, but want to learn more advanced statistical modelling
- 1.1.5 Experienced in statistics, but a beginner in R
- 1.1.6 Experienced in computing, but a beginner in R
- 1.1.7 Familiar with statistics and computing, but need a friendly reference manual
- 1.2 Installing R
- 1.3 Running R
- 1.4 The Comprehensive R Archive Network
- 1.4.1 Manuals
- 1.4.2 Frequently asked questions
- 1.4.3 Contributed documentation
- 1.5 Getting help in R
- 1.5.1 Worked examples of functions
- 1.5.2 Demonstrations of R functions
- 1.6 Packages in R
- 1.6.1 Contents of packages
- 1.6.2 Installing packages
- 1.7 Command line versus scripts
- 1.8 Data editor
- 1.9 Changing the look of the R screen
- 1.10 Good housekeeping
- 1.11 Linking to other computer languages
- 2 Essentials of the R Language
- 2.1 Calculations
- 2.1.1 Complex numbers in R
- 2.1.2 Rounding
- 2.1.3 Arithmetic
- 2.1.4 Modulo and integer quotients

- 2.1.5 Variable names and assignment
- 2.1.6 Operators
- 2.1.7 Integers
- 2.1.8 Factors
- 2.2 Logical operations
- 2.2.1 TRUE and T with FALSE and F
- 2.2.2 Testing for equality with real numbers
- 2.2.3 Equality of floating point numbers using all.equal
- 2.2.4 Summarizing differences between objects using all.equal
- 2.2.5 Evaluation of combinations of TRUE and FALSE
- 2.2.6 Logical arithmetic
- 2.3 Generating sequences
- 2.3.1 Generating repeats
- 2.3.2 Generating factor levels
- 2.4 Membership: Testing and coercing in R
- 2.5 Missing values, infinity and things that are not numbers
- 2.5.1 Missing values: NA
- 2.6 Vectors and subscripts
- 2.6.1 Extracting elements of a vector using subscripts
- 2.6.2 Classes of vector
- 2.6.3 Naming elements within vectors
- 2.6.4 Working with logical subscripts
- 2.7 Vector functions
- 2.7.1 Obtaining tables of means using tapply
- 2.7.2 The aggregate function for grouped summary statistics
- 2.7.3 Parallel minima and maxima: pmin and pmax
- 2.7.4 Summary information from vectors by groups
- 2.7.5 Addresses within vectors
- 2.7.6 Finding closest values
- 2.7.7 Sorting, ranking and ordering
- 2.7.8 Understanding the difference between unique and duplicated
- 2.7.9 Looking for runs of numbers within vectors
- 2.7.10 Sets: union, intersect and setdiff
- 2.8 Matrices and arrays
- 2.8.1 Matrices
- 2.8.2 Naming the rows and columns of matrices
- 2.8.3 Calculations on rows or columns of the matrix
- 2.8.4 Adding rows and columns to the matrix
- 2.8.5 The sweep function
- 2.8.6 Applying functions with apply, sapply and lapply
- 2.8.7 Using the max.col function
- 2.8.8 Restructuring a multi-dimensional array using aperm
- 2.9 Random numbers, sampling and shuffling
- 2.9.1 The sample function
- 2.10 Loops and repeats
- 2.10.1 Creating the binary representation of a number
- 2.10.2 Loop avoidance

- 2.10.3 The slowness of loops
- 2.10.4 Do not ‘grow’ data sets by concatenation or recursive function calls
- 2.10.5 Loops for producing time series
- 2.11 Lists
- 2.11.1 Lists and lapply
- 2.11.2 Manipulating and saving lists
- 2.12 Text, character strings and pattern matching
- 2.12.1 Pasting character strings together
- 2.12.2 Extracting parts of strings
- 2.12.3 Counting things within strings
- 2.12.4 Upper- and lower-case text
- 2.12.5 The match function and relational databases
- 2.12.6 Pattern matching
- 2.12.7 Dot . as the ‘anything’ character
- 2.12.8 Substituting text within character strings
- 2.12.9 Locations of a pattern within a vector using regexpr
- 2.12.10 Using %in% and which
- 2.12.11 More on pattern matching
- 2.12.12 Perl regular expressions
- 2.12.13 Stripping patterned text out of complex strings
- 2.13 Dates and times in R
- 2.13.1 Reading time data from files
- 2.13.2 The strptime function
- 2.13.3 The difftime function
- 2.13.4 Calculations with dates and times
- 2.13.5 The difftime and as.difftime functions
- 2.13.6 Generating sequences of dates
- 2.13.7 Calculating time differences between the rows of a dataframe
- 2.13.8 Regression using dates and times
- 2.13.9 Summary of dates and times in R
- 2.14 Environments
- 2.14.1 Using with rather than attach
- 2.14.2 Using attach in this book
- 2.15 Writing R functions
- 2.15.1 Arithmetic mean of a single sample
- 2.15.2 Median of a single sample
- 2.15.3 Geometric mean
- 2.15.4 Harmonic mean
- 2.15.5 Variance
- 2.15.6 Degrees of freedom
- 2.15.7 Variance ratio test
- 2.15.8 Using variance
- 2.15.9 Deparsing: A graphics function for error bars
- 2.15.10 The switch function
- 2.15.11 The evaluation environment of a function
- 2.15.12 Scope
- 2.15.13 Optional arguments

- 4.7 Complex ordering with mixed directions
- 4.8 A dataframe with row names instead of row numbers
- 4.9 Creating a dataframe from another kind of object
- 4.10 Eliminating duplicate rows from a dataframe
- 4.11 Dates in dataframes
- 4.12 Using the match function in dataframes
- 4.13 Merging two dataframes
- 4.14 Adding margins to a dataframe
- 4.15 Summarizing the contents of dataframes
- 5 Graphics
- 5.1 Plots with two variables
- 5.2 Plotting with two continuous explanatory variables: Scatterplots
- 5.2.1 Plotting symbols: pch
- 5.2.2 Colour for symbols in plots
- 5.2.3 Adding text to scatterplots
- 5.2.4 Identifying individuals in scatterplots
- 5.2.5 Using a third variable to label a scatterplot
- 5.2.6 Joining the dots
- 5.2.7 Plotting stepped lines
- 5.3 Adding other shapes to a plot
- 5.3.1 Placing items on a plot with the cursor, using the locator function
- 5.3.2 Drawing more complex shapes with polygon
- 5.4 Drawing mathematical functions
- 5.4.1 Adding smooth parametric curves to a scatterplot
- 5.4.2 Fitting non-parametric curves through a scatterplot
- 5.5 Shape and size of the graphics window
- 5.6 Plotting with a categorical explanatory variable
- 5.6.1 Boxplots with notches to indicate significant differences
- 5.6.2 Barplots with error bars
- 5.6.3 Plots for multiple comparisons
- 5.6.4 Using colour palettes with categorical explanatory variables
- 5.7 Plots for single samples
- 5.7.1 Histograms and bar charts
- 5.7.2 Histograms
- 5.7.3 Histograms of integers
- 5.7.4 Overlaying histograms with smooth density functions
- 5.7.5 Density estimation for continuous variables
- 5.7.6 Index plots
- 5.7.7 Time series plots
- 5.7.8 Pie charts
- 5.7.9 The stripchart function
- 5.7.10 A plot to test for normality
- 5.8 Plots with multiple variables
- 5.8.1 The pairs function
- 5.8.2 The coplot function
- 5.8.3 Interaction plots

- 5.9 Special plots
- 5.9.1 Design plots
- 5.9.2 Bubble plots
- 5.9.3 Plots with many identical values
- 5.10 Saving graphics to file
- 5.11 Summary
- 6 Tables
- 6.1 Tables of counts
- 6.2 Summary tables
- 6.3 Expanding a table into a dataframe
- 6.4 Converting from a dataframe to a table
- 6.5 Calculating tables of proportions with prop.table
- 6.6 The scale function
- 6.7 The expand.grid function
- 6.8 The model.matrix function
- 6.9 Comparing table and tabulate
- 7 Mathematics
- 7.1 Mathematical functions
- 7.1.1 Logarithmic functions
- 7.1.2 Trigonometric functions
- 7.1.3 Power laws
- 7.1.4 Polynomial functions
- 7.1.5 Gamma function
- 7.1.6 Asymptotic functions
- 7.1.7 Parameter estimation in asymptotic functions
- 7.1.8 Sigmoid (S-shaped) functions
- 7.1.9 Biexponential model
- 7.1.10 Transformations of the response and explanatory variables
- 7.2 Probability functions
- 7.3 Continuous probability distributions
- 7.3.1 Normal distribution
- 7.3.2 The central limit theorem
- 7.3.3 Maximum likelihood with the normal distribution
- 7.3.4 Generating random numbers with exact mean and standard deviation
- 7.3.5 Comparing data with a normal distribution
- 7.3.6 Other distributions used in hypothesis testing
- 7.3.7 The chi-squared distribution
- 7.3.8 Fisher’s F distribution
- 7.3.9 Student’s t distribution
- 7.3.10 The gamma distribution
- 7.3.11 The exponential distribution
- 7.3.12 The beta distribution
- 7.3.13 The Cauchy distribution
- 7.3.14 The lognormal distribution
- 7.3.15 The logistic distribution
- 7.3.16 The log-logistic distribution

- 7.3.17 The Weibull distribution
- 7.3.18 Multivariate normal distribution
- 7.3.19 The uniform distribution
- 7.3.20 Plotting empirical cumulative distribution functions
- 7.4 Discrete probability distributions
- 7.4.1 The Bernoulli distribution
- 7.4.2 The binomial distribution
- 7.4.3 The geometric distribution
- 7.4.4 The hypergeometric distribution
- 7.4.5 The multinomial distribution
- 7.4.6 The Poisson distribution
- 7.4.7 The negative binomial distribution
- 7.4.8 The Wilcoxon rank-sum statistic
- 7.5 Matrix algebra
- 7.5.1 Matrix multiplication
- 7.5.2 Diagonals of matrices
- 7.5.3 Determinant
- 7.5.4 Inverse of a matrix
- 7.5.5 Eigenvalues and eigenvectors
- 7.5.6 Matrices in statistical models
- 7.5.7 Statistical models in matrix notation
- 7.6 Solving systems of linear equations using matrices
- 7.7 Calculus
- 7.7.1 Derivatives
- 7.7.2 Integrals
- 7.7.3 Differential equations
- 8 Classical Tests
- 8.1 Single samples
- 8.1.1 Data summary
- 8.1.2 Plots for testing normality
- 8.1.3 Testing for normality
- 8.1.4 An example of single-sample data
- 8.2 Bootstrap in hypothesis testing
- 8.3 Skew and kurtosis
- 8.3.1 Skew
- 8.3.2 Kurtosis
- 8.4 Two samples
- 8.4.1 Comparing two variances
- 8.4.2 Comparing two means
- 8.4.3 Student’s t test
- 8.4.4 Wilcoxon rank-sum test
- 8.5 Tests on paired samples
- 8.6 The sign test
- 8.7 Binomial test to compare two proportions
- 8.8 Chi-squared contingency tables
- 8.8.1 Pearson’s chi-squared
- 8.8.2 G test of contingency

- 8.8.3 Unequal probabilities in the null hypothesis
- 8.8.4 Chi-squared tests on table objects
- 8.8.5 Contingency tables with small expected frequencies: Fisher’s exact test
- 8.9 Correlation and covariance
- 8.9.1 Data dredging
- 8.9.2 Partial correlation
- 8.9.3 Correlation and the variance of differences between variables
- 8.9.4 Scale-dependent correlations
- 8.10 Kolmogorov–Smirnov test
- 8.11 Power analysis
- 8.12 Bootstrap
- 9 Statistical Modelling
- 9.1 First things first
- 9.2 Maximum likelihood
- 9.3 The principle of parsimony (Occam’s razor)
- 9.4 Types of statistical model
- 9.5 Steps involved in model simplification
- 9.5.1 Caveats
- 9.5.2 Order of deletion
- 9.6 Model formulae in R
- 9.6.1 Interactions between explanatory variables
- 9.6.2 Creating formula objects
- 9.7 Multiple error terms
- 9.8 The intercept as parameter 1
- 9.9 The update function in model simplification
- 9.10 Model formulae for regression
- 9.11 Box–Cox transformations
- 9.12 Model criticism
- 9.13 Model checking
- 9.13.1 Heteroscedasticity
- 9.13.2 Non-normality of errors
- 9.14 Influence
- 9.15 Summary of statistical models in R
- 9.16 Optional arguments in model-fitting functions
- 9.16.1 Subsets
- 9.16.2 Weights
- 9.16.3 Missing values
- 9.16.4 Offsets
- 9.16.5 Dataframes containing the same variable names
- 9.17 Akaike’s information criterion
- 9.17.1 AIC as a measure of the fit of a model
- 9.18 Leverage
- 9.19 Misspecified model
- 9.20 Model checking in R
- 9.21 Extracting information from model objects
- 9.21.1 Extracting information by name
- 9.21.2 Extracting information by list subscripts

- 9.21.3 Extracting components of the model using \$
- 9.21.4 Using lists with models
- 9.22 The summary tables for continuous and categorical explanatory variables
- 9.23 Contrasts
- 9.23.1 Contrast coefficients
- 9.23.2 An example of contrasts in R
- 9.23.3 A priori contrasts
- 9.24 Model simplification by stepwise deletion
- 9.25 Comparison of the three kinds of contrasts
- 9.25.1 Treatment contrasts
- 9.25.2 Helmert contrasts
- 9.25.3 Sum contrasts
- 9.26 Aliasing
- 9.27 Orthogonal polynomial contrasts: contr.poly
- 9.28 Summary of statistical modelling
- 10 Regression
- 10.1 Linear regression
- 10.1.1 The famous five in R
- 10.1.2 Corrected sums of squares and sums of products
- 10.1.3 Degree of scatter
- 10.1.4 Analysis of variance in regression: SSY = SSR + SSE
- 10.1.5 Unreliability estimates for the parameters
- 10.1.6 Prediction using the fitted model
- 10.1.7 Model checking
- 10.2 Polynomial approximations to elementary functions
- 10.3 Polynomial regression
- 10.4 Fitting a mechanistic model to data
- 10.5 Linear regression after transformation
- 10.6 Prediction following regression
- 10.7 Testing for lack of fit in a regression
- 10.8 Bootstrap with regression
- 10.9 Jackknife with regression
- 10.10 Jackknife after bootstrap
- 10.11 Serial correlation in the residuals
- 10.12 Piecewise regression
- 10.13 Multiple regression
- 10.13.1 The multiple regression model
- 10.13.2 Common problems arising in multiple regression
- 11 Analysis of Variance
- 11.1 One-way ANOVA
- 11.1.1 Calculations in one-way ANOVA
- 11.1.2 Assumptions of ANOVA
- 11.1.3 A worked example of one-way ANOVA
- 11.1.4 Effect sizes
- 11.1.5 Plots for interpreting one-way ANOVA
- 11.2 Factorial experiments
- 11.3 Pseudoreplication: Nested designs and split plots

- 11.3.1 Split-plot experiments
- 11.3.2 Mixed-effects models
- 11.3.3 Fixed effect or random effect?
- 11.3.4 Removing the pseudoreplication
- 11.3.5 Derived variable analysis
- 11.4 Variance components analysis
- 11.5 Effect sizes in ANOVA: aov or lm?
- 11.6 Multiple comparisons
- 11.7 Multivariate analysis of variance
- 12 Analysis of Covariance
- 12.1 Analysis of covariance in R
- 12.2 ANCOVA and experimental design
- 12.3 ANCOVA with two factors and one continuous covariate
- 12.4 Contrasts and the parameters of ANCOVA models
- 12.5 Order matters in summary.aov
- 13 Generalized Linear Models
- 13.1 Error structure
- 13.2 Linear predictor
- 13.3 Link function
- 13.3.1 Canonical link functions
- 13.4 Proportion data and binomial errors
- 13.5 Count data and Poisson errors
- 13.6 Deviance: Measuring the goodness of fit of a GLM
- 13.7 Quasi-likelihood
- 13.8 The quasi family of models
- 13.9 Generalized additive models
- 13.10 Offsets
- 13.11 Residuals
- 13.11.1 Misspecified error structure
- 13.11.2 Misspecified link function
- 13.12 Overdispersion
- 13.13 Bootstrapping a GLM
- 13.14 Binomial GLM with ordered categorical variables
- 14 Count Data
- 14.1 A regression with Poisson errors
- 14.2 Analysis of deviance with count data
- 14.3 Analysis of covariance with count data
- 14.4 Frequency distributions
- 14.5 Overdispersion in log-linear models
- 14.6 Negative binomial errors
- 15 Count Data in Tables
- 15.1 A two-class table of counts
- 15.2 Sample size for count data
- 15.3 A four-class table of counts
- 15.4 Two-by-two contingency tables
- 15.5 Using log-linear models for simple contingency tables

- 15.6 The danger of contingency tables
- 15.7 Quasi-Poisson and negative binomial models compared
- 15.8 A contingency table of intermediate complexity
- 15.9 Schoener’s lizards: A complex contingency table
- 15.10 Plot methods for contingency tables
- 15.11 Graphics for count data: Spine plots and spinograms
- 16 Proportion Data
- 16.1 Analyses of data on one and two proportions
- 16.2 Count data on proportions
- 16.3 Odds
- 16.4 Overdispersion and hypothesis testing
- 16.5 Applications
- 16.5.1 Logistic regression with binomial errors
- 16.5.2 Estimating LD50 and LD90 from bioassay data
- 16.5.3 Proportion data with categorical explanatory variables
- 16.6 Averaging proportions
- 16.7 Summary of modelling with proportion count data
- 16.8 Analysis of covariance with binomial data
- 16.9 Converting complex contingency tables to proportions
- 16.9.1 Analysing Schoener’s lizards as proportion data
- 17 Binary Response Variables
- 17.1 Incidence functions
- 17.2 Graphical tests of the fit of the logistic to data
- 17.3 ANCOVA with a binary response variable
- 17.4 Binary response with pseudoreplication
- 18 Generalized Additive Models
- 18.1 Non-parametric smoothers
- 18.2 Generalized additive models
- 18.2.1 Technical aspects
- 18.3 An example with strongly humped data
- 18.4 Generalized additive models with binary data
- 18.5 Three-dimensional graphic output from gam
- 19 Mixed-Effects Models
- 19.1 Replication and pseudoreplication
- 19.2 The lme and lmer functions
- 19.2.1 lme
- 19.2.2 lmer
- 19.3 Best linear unbiased predictors
- 19.4 Designed experiments with different spatial scales: Split plots
- 19.5 Hierarchical sampling and variance components analysis
- 19.6 Mixed-effects models with temporal pseudoreplication
- 19.7 Time series analysis in mixed-effects models
- 19.8 Random effects in designed experiments
- 19.9 Regression in mixed-effects models
- 19.10 Generalized linear mixed models
- 19.10.1 Hierarchically structured count data

- 20 Non-Linear Regression
- 20.1 Comparing Michaelis–Menten and asymptotic exponential
- 20.2 Generalized additive models
- 20.3 Grouped data for non-linear estimation
- 20.4 Non-linear time series models (temporal pseudoreplication)
- 20.5 Self-starting functions
- 20.5.1 Self-starting Michaelis–Menten model
- 20.5.2 Self-starting asymptotic exponential model
- 20.5.3 Self-starting logistic
- 20.5.4 Self-starting four-parameter logistic
- 20.5.5 Self-starting Weibull growth function
- 20.5.6 Self-starting first-order compartment function
- 20.6 Bootstrapping a family of non-linear regressions
- 21 Meta-Analysis
- 21.1 Effect size
- 21.2 Weights
- 21.3 Fixed versus random effects
- 21.3.1 Fixed-effect meta-analysis of scaled differences
- 21.3.2 Random effects with a scaled mean difference
- 21.4 Random-effects meta-analysis of binary data
- 22 Bayesian Statistics
- 22.1 Background
- 22.2 A continuous response variable
- 22.3 Normal prior and normal likelihood
- 22.4 Priors
- 22.4.1 Conjugate priors
- 22.5 Bayesian statistics for realistically complicated models
- 22.6 Practical considerations
- 22.7 Writing BUGS models
- 22.8 Packages in R for carrying out Bayesian analysis
- 22.9 Installing JAGS on your computer
- 22.10 Running JAGS in R
- 22.11 MCMC for a simple linear regression
- 22.12 MCMC for a model with temporal pseudoreplication
- 22.13 MCMC for a model with binomial errors
- 23 Tree Models
- 23.1 Background
- 23.2 Regression trees
- 23.3 Using rpart to fit tree models
- 23.4 Tree models as regressions
- 23.5 Model simplification
- 23.6 Classification trees with categorical explanatory variables
- 23.7 Classification trees for replicated data
- 23.8 Testing for the existence of humps
- 24 Time Series Analysis
- 24.1 Nicholson’s blowflies

- 24.2 Moving average
- 24.3 Seasonal data
- 24.3.1 Pattern in the monthly means
- 24.4 Built-in time series functions
- 24.5 Decompositions
- 24.6 Testing for a trend in the time series
- 24.7 Spectral analysis
- 24.8 Multiple time series
- 24.9 Simulated time series
- 24.10 Time series models
- 25 Multivariate Statistics
- 25.1 Principal components analysis
- 25.2 Factor analysis
- 25.3 Cluster analysis
- 25.3.1 Partitioning
- 25.3.2 Taxonomic use of kmeans
- 25.4 Hierarchical cluster analysis
- 25.5 Discriminant analysis
- 25.6 Neural networks
- 26 Spatial Statistics
- 26.1 Point processes
- 26.1.1 Random points in a circle
- 26.2 Nearest neighbours
- 26.2.1 Tessellation
- 26.3 Tests for spatial randomness
- 26.3.1 Ripley’s K
- 26.3.2 Quadrat-based methods
- 26.3.3 Aggregated pattern and quadrat count data
- 26.3.4 Counting things on maps
- 26.4 Packages for spatial statistics
- 26.4.1 The spatstat package
- 26.4.2 The spdep package
- 26.4.3 Polygon lists
- 26.5 Geostatistical data
- 26.6 Regression models with spatially correlated errors: Generalized least squares
- 26.7 Creating a dot-distribution map from a relational database
- 27 Survival Analysis
- 27.1 A Monte Carlo experiment
- 27.2 Background
- 27.3 The survivor function
- 27.4 The density function
- 27.5 The hazard function
- 27.6 The exponential distribution
- 27.6.1 Density function
- 27.6.2 Survivor function
- 27.6.3 Hazard function

- 27.7 Kaplan–Meier survival distributions
- 27.8 Age-specific hazard models
- 27.9 Survival analysis in R
- 27.9.1 Parametric models
- 27.9.2 Cox proportional hazards model
- 27.9.3 Cox’s proportional hazard or a parametric model?
- 27.10 Parametric analysis
- 27.11 Cox’s proportional hazards
- 27.12 Models with censoring
- 27.12.1 Parametric models
- 27.12.2 Comparing coxph and survreg survival analysis
- 28 Simulation Models
- 28.1 Temporal dynamics: Chaotic dynamics in population size
- 28.1.1 Investigating the route to chaos
- 28.2 Temporal and spatial dynamics: A simulated random walk in two dimensions
- 28.3 Spatial simulation models
- 28.3.1 Metapopulation dynamics
- 28.3.2 Coexistence resulting from spatially explicit (local) density dependence
- 28.4 Pattern generation resulting from dynamic interactions
- 29 Changing the Look of Graphics
- 29.1 Graphs for publication
- 29.2 Colour
- 29.2.1 Palettes for groups of colours
- 29.2.2 The RColorBrewer package
- 29.2.3 Coloured plotting symbols with contrasting margins
- 29.2.4 Colour in legends
- 29.2.5 Background colours
- 29.2.6 Foreground colours
- 29.2.7 Different colours and font styles for different parts of the graph
- 29.2.8 Full control of colours in plots
- 29.3 Cross-hatching
- 29.4 Grey scale
- 29.5 Coloured convex hulls and other polygons
- 29.6 Logarithmic axes
- 29.7 Different font families for text
- 29.8 Mathematical and other symbols on plots
- 29.9 Phase planes
- 29.10 Fat arrows
- 29.11 Three-dimensional plots
- 29.12 Complex 3D plots with wireframe
- 29.13 An alphabetical tour of the graphics parameters
- 29.13.1 Text justification, adj
- 29.13.2 Annotation of graphs, ann
- 29.13.3 Delay moving on to the next in a series of plots, ask
- 29.13.4 Control over the axes, axis
- 29.13.5 Background colour for plots, bg

- 29.13.6 Boxes around plots, bty
- 29.13.7 Size of plotting symbols using the character expansion function, cex
- 29.13.8 Changing the shape of the plotting region, plt
- 29.13.9 Locating multiple graphs in non-standard layouts using fig
- 29.13.10 Two graphs with a common x scale but different y scales using fig
- 29.13.11 The layout function
- 29.13.12 Creating and controlling multiple screens on a single device
- 29.13.13 Orientation of numbers on the tick marks, las
- 29.13.14 Shapes for the ends and joins of lines, lend and ljoin
- 29.13.15 Line types, lty
- 29.13.16 Line widths, lwd
- 29.13.17 Several graphs on the same page, mfrow and mfcol
- 29.13.18 Margins around the plotting area, mar
- 29.13.19 Plotting more than one graph on the same axes, new
- 29.13.20 Two graphs on the same plot with different scales for their y axes
- 29.13.21 Outer margins, oma
- 29.13.22 Packing graphs closer together
- 29.13.23 Square plotting region, pty
- 29.13.24 Character rotation, srt
- 29.13.25 Rotating the axis labels
- 29.13.26 Tick marks on the axes
- 29.13.27 Axis styles
- 29.14 Trellis graphics
- 29.14.1 Panel box-and-whisker plots
- 29.14.2 Panel scatterplots
- 29.14.3 Panel barplots
- 29.14.4 Panels for conditioning plots
- 29.14.5 Panel histograms
- 29.14.6 Effect sizes
- 29.14.7 More panel functions
- References and Further Reading
- Index

### 1.5 Getting help in R

The simplest way to get help in R is to click on the Help button on the toolbar of the RGui window (this stands for R’s Graphic User Interface). Alternatively, if you are connected to the internet, you can type CRAN into Google and search for the help you need at CRAN (see Section 1.4). However, if you know the name of the function you want help with, you just type a question mark ? at the command line prompt followed by the name of the function. So to get help on read.table, just type

    ?read.table

Sometimes you cannot remember the precise name of the function, but you know the subject on which you want help (e.g. data input in this case). Use the help.search function (without a question mark) with your query in double quotes like this:

    help.search("data input")

and (with any luck) you will see the names of the R functions associated with this query. Then you can use ?read.table to get detailed help.

Other useful functions are find and apropos. The find function tells you what package something is in:

    find("lowess")
    [1] "package:stats"

while apropos returns a character vector giving the names of all objects in the search list that match your (potentially partial) enquiry:

    apropos("lm")
     [1] ".__C__anova.glm"  ".__C__anova.glm.null"  ".__C__glm"
     [4] ".__C__glm.null"   ".__C__lm"              ".__C__mlm"
     [7] "anova.glm"        "anova.glmlist"         "anova.lm"
    [10] "anova.lmlist"     "anova.mlm"             "anovalist.lm"
    [13] "contr.helmert"    "glm"                   "glm.control"
    [16] "glm.fit"          "glm.fit.null"          "hatvalues.lm"
    [19] "KalmanForecast"   "KalmanLike"            "KalmanRun"
    [22] "KalmanSmooth"     "lm"                    "lm.fit"
    [25] "lm.fit.null"      "lm.influence"          "lm.wfit"
    [28] "lm.wfit.null"     "model.frame.glm"       "model.frame.lm"
    [31] "model.matrix.lm"  "nlm"                   "nlminb"
    [34] "plot.lm"          "plot.mlm"              "predict.glm"
    [37] "predict.lm"       "predict.mlm"           "print.glm"
    [40] "print.lm"         "residuals.glm"         "residuals.lm"
    [43] "rstandard.glm"    "rstandard.lm"          "rstudent.glm"
    [46] "rstudent.lm"      "summary.glm"           "summary.lm"
    [49] "summary.mlm"      "kappa.lm"

#### 1.5.1 Worked examples of functions

To see a worked example just type the function name (e.g. linear models, lm)

    example(lm)

and you will see the printed and graphical output produced by the lm function.
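
The lookup functions described in this section can also be combined in a short script. What follows is a minimal sketch, assuming a standard R session with the stats and utils packages attached. The variable names are illustrative only, and the interactive calls (`?read.table`, `help.search`, `example`) are left out because they open a help window or a plot.

```r
# Sketch: locating functions by name before opening their help pages.
# The names queried here ("lowess", "lm", "read.table") come from the text.

# find() reports where an object lives on the search path
where_is_lowess <- find("lowess")
print(where_is_lowess)

# apropos() returns every visible object whose name matches a partial string
lm_like <- apropos("lm")
print(head(lm_like))

# The query is a regular expression, so anchoring it with ^ keeps only
# names that start with "lm"; ignore.case = FALSE makes the match exact
lm_prefix <- apropos("^lm", ignore.case = FALSE)
print(lm_prefix)

# exists() confirms that a name is defined before asking for help on it
print(exists("read.table"))
```

Anchoring the pattern is one way to shorten a long `apropos` listing when only the leading part of a function name is remembered.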

#### 1.5.2 Demonstrations of R functions

These can be useful for seeing the range of things that R can do. Here are some for you to try:

    demo(persp)
    demo(graphics)
    demo(Hershey)
    demo(plotmath)

### 1.6 Packages in R

Finding your way around the contributed packages can be tricky, simply because there are so many of them, and the name of the package is not always as indicative of its function as you might hope. There is no comprehensive cross-referenced index, but there is a very helpful feature called ‘Task Views’ on CRAN, which explains the packages available under a limited number of usefully descriptive headings. Click on Packages on the CRAN home page, then inside Contributed Packages, you can click on CRAN Task Views, which allows you to browse bundles of packages assembled by topic. Currently, there are 29 Task Views on CRAN as follows:

- Bayesian: Bayesian Inference
- ChemPhys: Chemometrics and Computational Physics
- ClinicalTrials: Clinical Trial Design, Monitoring, and Analysis
- Cluster: Cluster Analysis & Finite Mixture Models
- DifferentialEquations: Differential Equations
- Distributions: Probability Distributions
- Econometrics: Computational Econometrics
- Environmetrics: Analysis of Ecological and Environmental Data
- ExperimentalDesign: Design of Experiments (DoE) & Analysis of Experimental Data
- Finance: Empirical Finance
- Genetics: Statistical Genetics
- Graphics: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
- HighPerformanceComputing: High-Performance and Parallel Computing with R
- MachineLearning: Machine Learning & Statistical Learning
- MedicalImaging: Medical Image Analysis
- Multivariate: Multivariate Statistics
- NaturalLanguageProcessing: Natural Language Processing
- OfficialStatistics: Official Statistics & Survey Methodology
- Optimization: Optimization and Mathematical Programming
- Pharmacokinetics: Analysis of Pharmacokinetic Data
- Phylogenetics: Phylogenetics, Especially Comparative Methods
- Psychometrics: Psychometric Models and Methods
- ReproducibleResearch: Reproducible Research
- Robust: Robust Statistical Methods
- SocialSciences: Statistics for the Social Sciences
- Spatial: Analysis of Spatial Data
- Survival: Survival Analysis
- TimeSeries: Time Series Analysis
- gR: gRaphical Models in R
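
Before downloading packages found through a Task View, it can help to check which ones are already installed. The sketch below uses only base R. The vector `wanted` is a hypothetical selection, and its last entry is a deliberately made-up name so that both outcomes are visible.

```r
# Sketch: see which of a set of packages is already installed.
# 'wanted' is an illustrative vector, not a list taken from CRAN.
wanted <- c("stats", "MASS", "aPackageThatDoesNotExist")

# installed.packages() returns a matrix with one row per installed package
installed <- rownames(installed.packages())

present <- intersect(wanted, installed)   # already on this machine
absent  <- setdiff(wanted, installed)     # would need to be downloaded

print(present)
print(absent)

# Installing needs an internet connection and real package names,
# so the call is left commented out:
# install.packages(absent)
```

Keeping the install call separate from the check means the script can be run offline without side effects.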

mirror nearest to you for fast downloading (e.g. London), then everything else is automatic. The packages used in this book are

    install.packages("akima")
    install.packages("boot")
    install.packages("car")
    install.packages("lme4")
    install.packages("meta")
    install.packages("mgcv")
    install.packages("nlme")
    install.packages("deSolve")
    install.packages("R2jags")
    install.packages("RColorBrewer")
    install.packages("RODBC")
    install.packages("rpart")
    install.packages("spatstat")
    install.packages("spdep")
    install.packages("tree")

If you want other packages, then go to CRAN and browse the list called ‘Packages’ to select the ones you want to investigate.

### 1.7 Command line versus scripts

When writing functions and other multi-line sections of input you will find it useful to use a text editor rather than execute everything directly at the command line. Some people prefer to use R’s own built-in editor. It is accessible from the RGui menu bar. Click on File then click on New script. At this point R will open a window entitled Untitled - R Editor. You can type and edit in this, then when you want to execute a line or group of lines, just highlight them and press Ctrl+R (the Control key and R together). The lines are automatically transferred to the command window and executed.

By pressing Ctrl+S you can save the contents of the R Editor window in a file that you will have to name. It will be given a .R file extension automatically. In a subsequent session you can click on File/Open script... when you will see all your saved .R files and can select the one you want to open.

Other people prefer to use an editor with more features. Tinn-R (“this is not notepad” for R) is very good, or you might like to try RStudio, which has the nice feature of allowing you to scroll back through all of the graphics produced in a session. These and others are free to download from the web.

### 1.8 Data editor

There is a data editor within R that can be accessed from the menu bar by selecting Edit/Data editor... . You provide the name of the matrix or dataframe containing the material you want to edit (this has to be a dataframe that is active in the current R session, rather than one which is stored on file), and a Data Editor window appears. Alternatively, you can do this from the command line using the fix function (e.g. fix(data.frame.name)). Suppose you want to edit the bacteria dataframe which is part of the MASS library:

    library(MASS)
    attach(bacteria)
    fix(bacteria)
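
For changes that need to be repeatable, the same kind of edit can be made by assignment instead of through the Data Editor window. This is a sketch and not the book's method. It assumes the MASS package is installed, and the name `edited` and the altered value are purely illustrative.

```r
# Sketch: a scripted alternative to the interactive Data Editor.
library(MASS)                 # provides the bacteria dataframe

edited <- bacteria            # work on a copy; the original stays untouched

# Change one cell by row index and column name instead of pointing and clicking
old_week <- edited$week[1]
edited$week[1] <- old_week + 1    # arbitrary change, for illustration only

# Compare the copy with the original to see exactly which rows were altered
changed_rows <- which(edited$week != bacteria$week)
print(changed_rows)
```

Because the edit lives in a script, it can be rerun after the raw data are reloaded, which the point-and-click editor does not allow.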

The window has the look of a spreadsheet, and you can change the contents of the cells, navigating with the cursor or with the arrow keys. My preference is to do all of my data preparation and data editing in a spreadsheet before even thinking about using R. Once checked and edited, I save the data from the spreadsheet to a tab-delimited text file (*.txt) that can be imported to R very simply using the function called read.table (p. 20). One of the most persistent frustrations for beginners is that they cannot get their data imported into R. Things that typically go wrong at the data input stage and the necessary remedial actions are described on p. 139.

1.9 Changing the look of the R screen

The default settings of the command window are inoffensive to most people, but you can change them if you do not like them. The Rgui Configuration Editor under Edit/GUI preferences . . . is used to change the look of the screen. You can change the colour of the input line (default is red), the output line (default navy) or the background (default white). The default numbers of rows (25) and columns (80) can be changed, and you have control over the font (default Courier New) and font size (default 10).

1.10 Good housekeeping

To see what variables you have created in the current session, type:

    objects()
    [1] "colour.factor" "colours"   "dates"   "index"
    [5] "last.warning"  "nbnumbers" "nbtable" "nums"
    [9] "wanted"        "x"         "xmat"    "xv"

To see which packages and dataframes are currently attached:

    search()
    [1] ".GlobalEnv"        "nums"          "nums"
    [4] "package:methods"   "package:stats" "package:graphics"
    [7] "package:grDevices" "package:utils" "package:datasets"
    [10] "Autoloads"        "package:base"

At the end of a session in R, it is good practice to remove (rm) any variable names you have created (using, say, x <- 5.6) and to detach any dataframes you have attached earlier in the session. That way, variables with the same names but different properties will not get in each other’s way in subsequent work:

    rm(x,y,z)
    detach(worms)

The detach command does not make the dataframe called worms disappear; it just means that the variables within worms, such as Slope and Area, are no longer accessible directly by name. To get rid of everything, including all the dataframes, type

    rm(list=ls())

but be absolutely sure that you really want to be as draconian as this before you execute the command.
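The housekeeping steps can be tried out in a few lines. This is only a sketch: the objects x, y and z are placeholders, and the worms dataframe is made up here (with invented values) purely so that the example runs without a data file.

```r
# Hypothetical objects standing in for work done during a session
x <- 5.6
y <- 1:3
z <- "text"
worms <- data.frame(Area = c(3.6, 5.1), Slope = c(11, 2))  # invented values

attach(worms)                   # Area and Slope are now accessible by name
"worms" %in% search()           # TRUE

# Tidy up: remove the variables and detach the dataframe
rm(x, y, z)
detach(worms)

exists("x", inherits = FALSE)   # FALSE: x has gone
"worms" %in% search()           # FALSE: no longer on the search path
exists("worms")                 # TRUE: detach did not delete the dataframe
```

The last line illustrates the point made in the text: detaching only removes the dataframe from the search path, and the object itself survives until you use rm on it.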

1.11 Linking to other computer languages

Advanced users can employ the functions .C and .Fortran to provide a standard interface to compiled code that has been linked into R, either at build time or via dyn.load. They are primarily intended for compiled C and Fortran code respectively, but the .C function can be used with other languages which can generate C interfaces, for example C++. The .Internal and .Primitive interfaces are used to call C code compiled into R at build time. Functions .Call and .External provide interfaces which allow compiled code (primarily compiled C code) to manipulate R objects.

2 Essentials of the R Language

There is an enormous range of things that R can do, and one of the hardest parts of learning R is finding your way around. Likewise, there is no obvious order in which different people will want to learn the different components of the R language. I suggest that you quickly scan down the following bullet points, which represent the order in which I have chosen to present the introductory material, and if you are relatively experienced in statistical computing, you might want to skip directly to the relevant section. I strongly recommend that beginners work through the material in the order presented, because successive sections build upon knowledge gained from previous sections. This chapter is divided into the following sections:

• 2.1 Calculations
• 2.2 Logical operations
• 2.3 Sequences
• 2.4 Testing and coercion
• 2.5 Missing values and things that are not numbers
• 2.6 Vectors and subscripts
• 2.7 Vectorized functions
• 2.8 Matrices and arrays
• 2.9 Sampling
• 2.10 Loops and repeats
• 2.11 Lists
• 2.12 Text, character strings and pattern matching
• 2.13 Dates and times
• 2.14 Environments
• 2.15 Writing R functions
• 2.16 Writing to file from R

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.

Other essential material is elsewhere: beginners will want to master data input (Chapter 3), dataframes (Chapter 4) and graphics (Chapter 5).

2.1 Calculations

The screen prompt > is an invitation to put R to work. The convention in this book is that material that you need to type into the command line after the screen prompt is shown in red in Courier New font. Just press the Return key to see the answer. You can use the command line as a calculator, like this:

    > log(42/7.3)
    [1] 1.749795

Each line can have at most 8192 characters, but if you want to see a lengthy instruction or a complicated expression on the screen, you can continue it on one or more further lines simply by ending the line at a place where the line is obviously incomplete (e.g. with a trailing comma, operator, or with more left parentheses than right parentheses, implying that more right parentheses will follow). When continuation is expected, the prompt changes from > to +, as follows:

    > 5+6+3+6+4+2+4+8+
    + 3+2+7
    [1] 50

Note that the + continuation prompt does not carry out arithmetic plus. If you have made a mistake, and you want to get rid of the + prompt and return to the > prompt, then press the Esc key and use the Up arrow to edit the last (incomplete) line.

From here onwards and throughout the book, the prompt character > will be omitted. The output from R is shown in blue in Courier New font, which uses absolute rather than proportional spacing, so that columns of numbers remain neatly aligned on the page or on the screen.

Two or more expressions can be placed on a single line so long as they are separated by semi-colons:

    2+3; 5*7; 3-7
    [1] 5
    [1] 35
    [1] -4

For very big numbers or very small numbers R uses the following scheme (called exponents):

    1.2e3     means 1200 because the e3 means ‘move the decimal point 3 places to the right’;
    1.2e-2    means 0.012 because the e-2 means ‘move the decimal point 2 places to the left’;
    3.9+4.5i  is a complex number with real (3.9) and imaginary (4.5) parts, and i is the square root of –1.

2.1.1 Complex numbers in R

Complex numbers consist of a real part and an imaginary part, which is identified by lower-case i like this:

    z <- 3.5-8i

The elementary trigonometric, logarithmic, exponential, square root and hyperbolic functions are all implemented for complex values. The following are the special R functions that you can use with complex numbers. Determine the real part:

    Re(z)
    [1] 3.5

Determine the imaginary part:

    Im(z)
    [1] -8

Calculate the modulus (the distance from z to 0 in the complex plane by Pythagoras; if x is the real part and y is the imaginary part, then the modulus is $\sqrt{x^2 + y^2}$):

    Mod(z)
    [1] 8.732125

Calculate the argument (Arg(x+yi) = atan(y/x) when x is positive; in general it is atan2(y,x)):

    Arg(z)
    [1] -1.158386

Work out the complex conjugate (change the sign of the imaginary part):

    Conj(z)
    [1] 3.5+8i

Membership and coercion are dealt with in the usual way (p. 30):

    is.complex(z)
    [1] TRUE
    as.complex(3.8)
    [1] 3.8+0i

2.1.2 Rounding

Various sorts of rounding (rounding up, rounding down, rounding to the nearest integer) can be done easily. Take the number 5.7 as an example. The ‘greatest integer less than’ function is floor:

    floor(5.7)
    [1] 5

The ‘next integer’ function is ceiling:

    ceiling(5.7)
    [1] 6

You can round to the nearest integer by adding 0.5 to the number, then using floor. There is a built-in function for this, but we can easily write one of our own to introduce the notion of function writing. Call it rounded, then define it as a function like this:

    rounded <- function(x) floor(x+0.5)

Now we can use the new function:

    rounded(5.7)
    [1] 6
    rounded(5.4)
    [1] 5

The hard part is deciding how you want to round negative numbers, because the concept of up and down is more subtle (remember that –5 is a bigger number than –6). You need to think, instead, of whether you want to round towards zero or away from zero. For negative numbers, rounding up means rounding towards zero so do not be surprised when the value of the positive part is different:

    ceiling(-5.7)
    [1] -5

With floor, negative values are rounded away from zero:

    floor(-5.7)
    [1] -6

You can simply strip off the decimal part of the number using the function trunc, which returns the integers formed by truncating the values in x towards zero:

    trunc(5.7)
    [1] 5
    trunc(-5.7)
    [1] -5

There is an R function called round that you can use by specifying 0 decimal places in the second argument:

    round(5.7,0)
    [1] 6
    round(5.5,0)
    [1] 6
    round(5.4,0)
    [1] 5
    round(-5.7,0)
    [1] -6

The number of decimal places is not the same as the number of significant digits. You can control the number of significant digits in a number using the function signif. Take a big number like 12 345 678 (roughly 12.35 million). Here is what happens when we ask for 4, 5 or 6 significant digits:

    signif(12345678,4)
    [1] 12350000
    signif(12345678,5)
    [1] 12346000
    signif(12345678,6)
    [1] 12345700

and so on. Why you would want to do this would need to be explained.

2.1.3 Arithmetic

The screen prompt in R is a fully functional calculator. You can add and subtract using the obvious + and - symbols, while division is achieved with a forward slash / and multiplication is done by using an asterisk * like this:

    7+3-5*2
    [1] 0

Notice from this example that multiplication (5 × 2) is done before the additions and subtractions. Powers (like squared or cube root) use the caret symbol ^ and are done before multiplication or division, as you can see from this example:

    3^2 / 2
    [1] 4.5

All the mathematical functions you could ever want are here (see Table 2.1). The log function gives logs to the base e (e = 2.718282), for which the antilog function is exp:

    log(10)
    [1] 2.302585
    exp(1)
    [1] 2.718282

If you are old fashioned, and want logs to the base 10, then there is a separate function, log10:

    log10(6)
    [1] 0.7781513

Logs to other bases are possible by providing the log function with a second argument which is the base of the logs you want to take. Suppose you want log to base 3 of 9:

    log(9,3)
    [1] 2
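The rounding functions described above can be compared side by side. The sketch below reuses the rounded function from the text; the particular test values are chosen here for illustration. One point the text does not cover is how exact halves are treated: rounded always moves a half upwards, whereas the built-in round sends it to the nearest even integer.

```r
# rounded() as defined in the text: add 0.5, then take the floor
rounded <- function(x) floor(x + 0.5)

rounded(c(5.7, 5.4, -5.7))   # 6 5 -6
trunc(c(5.7, -5.7))          # 5 -5 (towards zero)

# Exact halves: rounded() goes up, round() goes to the even neighbour
rounded(c(0.5, 2.5))         # 1 3
round(c(0.5, 2.5))           # 0 2

signif(12345678, 4)          # 12350000
log(9, base = 3)             # 2
```

All of these functions are vectorized, so a whole vector of values can be rounded in a single call.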

Table 2.1. Mathematical functions used in R.

    Function                      Meaning
    log(x)                        log to base e of x
    exp(x)                        antilog of x (e^x)
    log(x,n)                      log to base n of x
    log10(x)                      log to base 10 of x
    sqrt(x)                       square root of x
    factorial(x)                  x! = x × (x − 1) × (x − 2) × ··· × 3 × 2
    choose(n,x)                   binomial coefficients n!/(x!(n − x)!)
    gamma(x)                      Γ(x), for real x; (x − 1)!, for integer x
    lgamma(x)                     natural log of Γ(x)
    floor(x)                      greatest integer less than x
    ceiling(x)                    smallest integer greater than x
    trunc(x)                      closest integer to x between x and 0, e.g. trunc(1.5) = 1, trunc(–1.5) = –1; trunc is like floor for positive values and like ceiling for negative values
    round(x, digits=0)            round the value of x to an integer
    signif(x, digits=6)           give x to 6 digits in scientific notation
    runif(n)                      generates n random numbers between 0 and 1 from a uniform distribution
    cos(x)                        cosine of x in radians
    sin(x)                        sine of x in radians
    tan(x)                        tangent of x in radians
    acos(x), asin(x), atan(x)     inverse trigonometric transformations of real or complex numbers
    acosh(x), asinh(x), atanh(x)  inverse hyperbolic trigonometric transformations of real or complex numbers
    abs(x)                        the absolute value of x, ignoring the minus sign if there is one

The trigonometric functions in R measure angles in radians. A circle is 2π radians, and this is 360°, so a right angle (90°) is π/2 radians. R knows the value of π as pi:

    pi
    [1] 3.141593
    sin(pi/2)
    [1] 1
    cos(pi/2)
    [1] 6.123234e-017

Notice that the cosine of a right angle does not come out as exactly zero, even though the sine came out as exactly 1. The e-017 means ‘times $10^{-17}$’. While this is a very small number, it is clearly not exactly zero (so you need to be careful when testing for exact equality of real numbers; see p. 23).

2.1.4 Modulo and integer quotients

Integer quotients and remainders are obtained using the notation %/% (percent, divide, percent) and %% (percent, percent) respectively. Suppose we want to know the integer part of a division: say, how many 13s are there in 119:

    119 %/% 13
    [1] 9

Now suppose we wanted to know the remainder (what is left over when 119 is divided by 13): in maths this is known as modulo:

    119 %% 13
    [1] 2

Modulo is very useful for testing whether numbers are odd or even: odd numbers have modulo 2 value 1 and even numbers have modulo 2 value 0:

    9%%2
    [1] 1
    8%%2
    [1] 0

Likewise, you use modulo to test if one number is an exact multiple of some other number. For instance, to find out whether 15 421 is a multiple of 7 (which it is), then ask:

    15421 %% 7 == 0
    [1] TRUE

Note the use of ‘double equals’ to test for equality (this is explained in detail on p. 26).

2.1.5 Variable names and assignment

There are three important things to remember when selecting names for your variables in R:

• Variable names in R are case sensitive, so y is not the same as Y.
• Variable names should not begin with numbers (e.g. 1x) or symbols (e.g. %x).
• Variable names should not contain blank spaces (use back.pay not back pay).

In terms of your work–life balance, make your variable names as short as possible, so that you do not spend most of your time typing, and the rest of your time correcting spelling mistakes in your ridiculously long variable names.

Objects obtain values in R by assignment (‘x gets a value’). This is achieved by the gets arrow <- which is a composite symbol made up from ‘less than’ and ‘minus’ with no space between them. Thus, to create a scalar constant x with value 5 we type:

    x<-5

and not x=5. Notice that there is a potential ambiguity if you get the spacing wrong. Compare our x<-5, ‘x gets 5’, with x< -5 where there is a space between the ‘less than’ and ‘minus’ symbol. In R, this is actually a question, asking ‘is x less than minus 5?’ and, depending on the current value of x, would evaluate to the answer either TRUE or FALSE.
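A short sketch ties together the two ideas on this page: integer quotient and remainder, and the effect of spacing around the gets arrow. The vector n is introduced here only for illustration.

```r
119 %/% 13        # 9: integer quotient
119 %% 13         # 2: remainder

# quotient * divisor + remainder recovers the original number
(119 %/% 13) * 13 + 119 %% 13 == 119    # TRUE

# Modulo is vectorized, so it can classify a whole vector at once
n <- 1:6
n[n %% 2 == 0]    # 2 4 6: the even values

# Spacing matters: assignment versus comparison
x <- 5            # x gets 5
x < -5            # FALSE: asks whether x is less than minus 5
```

Putting spaces on both sides of <- as a habit makes the two meanings easy to tell apart when reading code.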

2.1.6 Operators

R uses the following operator tokens:

    + - * / %/% %% ^     arithmetic (plus, minus, times, divide, integer quotient, modulo, power)
    > >= < <= == !=      relational (greater than, greater than or equals, less than, less than or equals, equals, not equals)
    ! & |                logical (not, and, or)
    ~                    model formulae (‘is modelled as a function of’)
    <- ->                assignment (gets)
    $                    list indexing (the ‘element name’ operator)
    :                    create a sequence

Several of these operators have different meaning inside model formulae. Thus * indicates the main effects plus interaction (rather than multiplication), : indicates the interaction between two variables (rather than generate a sequence) and ^ means all interactions up to the indicated power (rather than raise to the power). You will learn more about these ideas in Chapter 9.

2.1.7 Integers

Integer vectors exist so that data can be passed to C or Fortran code which expects them, and so that small integer data can be represented exactly and compactly. The range of integers is roughly from −2 000 000 000 to +2 000 000 000 (-2*10^9 to +2*10^9, which R could portray as -2e+09 to 2e+09); the exact limit is 2 147 483 647, held in .Machine$integer.max.

Be careful. Do not try to change the class of a vector by using the integer function. Here is a numeric vector of whole numbers that you want to convert into a vector of integers:

    x <- c(5,3,7,8)
    is.integer(x)
    [1] FALSE
    is.numeric(x)
    [1] TRUE

Applying the integer function to it replaces all your numbers with zeros; definitely not what you intended.

    x <- integer(x)
    x
    [1] 0 0 0 0 0

Make the numeric object first, then convert the object to integer using the as.integer function like this:

    x <- c(5,3,7,8)
    x <- as.integer(x)
    is.integer(x)
    [1] TRUE
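The difference between creating an integer vector and coercing an existing one can be sketched as follows. The variable name xi and the use of the L suffix for integer constants are additions for this illustration; integer is called with a single length here, which is its intended use.

```r
x <- c(5, 3, 7, 8)
class(x)                 # "numeric": whole numbers typed like this are doubles

xi <- as.integer(x)      # coerce the existing values
class(xi)                # "integer"
identical(xi, c(5L, 3L, 7L, 8L))   # TRUE: the L suffix writes integer constants

integer(4)               # 0 0 0 0: a new zero-filled vector of length 4

.Machine$integer.max     # 2147483647, the largest representable integer
```

In other words, integer(n) builds a fresh vector of n zeros, while as.integer(x) keeps the values of x and changes only how they are stored.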

The as.integer function works as trunc when applied to real numbers, and removes the imaginary part when applied to complex numbers:

    as.integer(5.7)
    [1] 5
    as.integer(-5.7)
    [1] -5
    as.integer(5.7 -3i)
    [1] 5
    Warning message:
    imaginary parts discarded in coercion

2.1.8 Factors

Factors are categorical variables that have a fixed number of levels. A simple example of a factor might be a variable called gender with two levels: ‘female’ and ‘male’. If you had three females and two males, you could create the factor like this:

    gender <- factor(c("female", "male", "female", "male", "female"))
    class(gender)
    [1] "factor"
    mode(gender)
    [1] "numeric"

More often, you will create a dataframe by reading your data from a file using read.table. When you do this, all variables containing one or more character strings will be converted automatically into factors. Here is an example:

    data <- read.table("c:\\temp\\daphnia.txt",header=T)
    attach(data)
    head(data)
      Growth.rate Water Detergent Daphnia
    1    2.919086  Tyne    BrandA  Clone1
    2    2.492904  Tyne    BrandA  Clone1
    3    3.021804  Tyne    BrandA  Clone1
    4    2.350874  Tyne    BrandA  Clone2
    5    3.148174  Tyne    BrandA  Clone2
    6    4.423853  Tyne    BrandA  Clone2

This dataframe contains a continuous response variable (Growth.rate) and three categorical explanatory variables (Water, Detergent and Daphnia), all of which are factors. In statistical modelling, factors are associated with analysis of variance (all the explanatory variables are categorical) and analysis of covariance (some of the explanatory variables are categorical and some are continuous).
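To try the conversion of text columns into factors without the daphnia.txt file, the sketch below reads a few rows typed inline through the text argument of read.table. The rows are invented for illustration (including the second river name). One thing to be aware of: from R 4.0.0 onwards read.table leaves character columns as character unless you ask for factors, so stringsAsFactors = TRUE is set explicitly here to reproduce the behaviour described in the text.

```r
# A few rows typed inline (invented values) instead of reading from a file
txt <- "Growth.rate Water Detergent
2.9 Tyne BrandA
2.5 Tyne BrandB
3.0 Wear BrandA"

d <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

is.factor(d$Water)         # TRUE
levels(d$Detergent)        # "BrandA" "BrandB"
is.numeric(d$Growth.rate)  # TRUE: the continuous response stays numeric
```

With a real file, the same stringsAsFactors = TRUE argument can be added to the read.table call shown above.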

There are some important functions for dealing with factors. You will often want to check that a variable is a factor (especially if the factor levels are numbers rather than characters):

    is.factor(Water)
    [1] TRUE

To discover the names of the factor levels, we use the levels function:

    levels(Detergent)
    [1] "BrandA" "BrandB" "BrandC" "BrandD"

To discover the number of levels of a factor, we use the nlevels function:

    nlevels(Detergent)
    [1] 4

The same result is achieved by applying the length function to the levels of a factor:

    length(levels(Detergent))
    [1] 4

By default, factor levels are treated in alphabetical order. If you want to change this (as you might, for instance, in ordering the bars of a bar chart) then this is straightforward: just type the factor levels in the order that you want them to be used, and provide this vector as the second argument to the factor function.

Suppose we have an experiment with three factor levels in a variable called treatment, and we want them to appear in this order: ‘nothing’, ‘single’ dose and ‘double’ dose. We shall need to override R’s natural tendency to order them ‘double’, ‘nothing’, ‘single’:

    frame <- read.table("c:\\temp\\trial.txt",header=T)
    attach(frame)
    tapply(response,treatment,mean)
     double nothing  single
         25      60      34

This is achieved using the factor function like this:

    treatment <- factor(treatment,levels=c("nothing","single","double"))

Now we get the order we want:

    tapply(response,treatment,mean)
    nothing  single  double
         60      34      25

Only == and != can be used for factors. Note, also, that a factor can only be compared to another factor with an identical set of levels (not necessarily in the same ordering) or to a character vector. For example, you cannot ask quantitative questions about factor levels, like > or <=, even if these levels are numeric.

To turn factor levels into numbers (integers) use the unclass function like this:

    as.vector(unclass(Daphnia))
    [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1
    [39] 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
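The reordering of factor levels can be reproduced without the trial.txt file. In this sketch the six response values are invented so that the group means match those shown in the text, and the names f_default and f_ordered are introduced only for the illustration.

```r
treatment <- c("single", "nothing", "double", "nothing", "single", "double")
response  <- c(30, 62, 24, 58, 38, 26)     # invented values

f_default <- factor(treatment)
levels(f_default)                 # "double" "nothing" "single" (alphabetical)

f_ordered <- factor(treatment, levels = c("nothing", "single", "double"))
levels(f_ordered)                 # "nothing" "single" "double"

tapply(response, f_ordered, mean) # means in the chosen order: 60 34 25

# Integer codes behind the levels follow the new ordering
as.vector(unclass(f_ordered))     # 2 1 3 1 2 3
```

Note that the integer codes returned by unclass depend on the level order, so they change when the levels are rearranged even though the data are the same.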

Table 2.2. Logical and relational operations.

    Symbol       Meaning
    !            logical NOT
    &            logical AND
    |            logical OR
    <            less than
    <=           less than or equal to
    >            greater than
    >=           greater than or equal to
    ==           logical equals (double =)
    !=           not equal
    &&           AND with IF
    ||           OR with IF
    xor(x,y)     exclusive OR
    isTRUE(x)    an abbreviation of identical(TRUE,x)

2.2 Logical operations

A crucial part of computing involves asking questions about things. Is one thing bigger than another? Are two things the same size? Questions can be joined together using words like ‘and’, ‘or’, ‘not’. Questions in R typically evaluate to TRUE or FALSE but there is the option of a ‘maybe’ (when the answer is not available, NA). In R, < means ‘less than’, > means ‘greater than’, and ! means ‘not’ (see Table 2.2).

2.2.1 TRUE and T with FALSE and F

You can use T for TRUE and F for FALSE, but you should be aware that T and F might have been allocated as variables. So this is obvious:

    TRUE == FALSE
    [1] FALSE
    T==F
    [1] FALSE

This, however, is not so obvious:

    T<-0
    T == FALSE
    [1] TRUE
    F<-1
    TRUE == F
    [1] TRUE

But now, of course, T is not equal to F:

    T!=F
    [1] TRUE

To be sure, always write TRUE and FALSE in full, and never use T or F as variable names.

2.2.2 Testing for equality with real numbers

There are international standards for carrying out floating point arithmetic, but on your computer these standards are beyond the control of R. Roughly speaking, integer arithmetic will be exact between $-10^{16}$ and $10^{16}$, but for fractions and other real numbers we lose accuracy because of round-off error. This is only likely to become a real problem in practice if you have to subtract similarly sized but very large numbers. A dramatic loss in accuracy under these circumstances is called ‘catastrophic cancellation error’. It occurs when an operation on two numbers increases relative error substantially more than it increases absolute error.

You need to be careful in programming when you want to test whether or not two computed numbers are equal. R will assume that you mean ‘exactly equal’, and what that means depends upon machine precision. Most numbers are rounded to an accuracy of 53 binary digits. Typically therefore, two floating point numbers will not reliably be equal unless they were computed by the same algorithm, and not always even then. You can see this by squaring the square root of 2: surely these values are the same?

    x <- sqrt(2)
    x*x==2
    [1] FALSE

In fact, they are not the same. We can see by how much the two values differ by subtraction:

    x*x-2
    [1] 4.440892e-16

This is not a big number, but it is not zero either. So how do we test for equality of real numbers? The best advice is not to do it. Try instead to use the alternatives ‘less than’ with ‘greater than or equal to’, or conversely ‘greater than’ with ‘less than or equal to’. Then you will not go wrong. Sometimes, however, you really do want to test for equality. In those circumstances, do not use double equals to test for equality, but employ the all.equal function instead.

2.2.3 Equality of floating point numbers using all.equal

The nature of floating point numbers used in computing is the cause of some initially perplexing features. You would imagine that, since 0.3 minus 0.2 is 0.1, the logic presented below would evaluate to TRUE. Not so:

    x <- 0.3 - 0.2
    y <- 0.1
    x==y
    [1] FALSE

The function called identical gives the same result.

    identical(x,y)
    [1] FALSE

The solution is to use the function called all.equal which allows for insignificant differences:

    all.equal(x,y)
    [1] TRUE

Do not use all.equal directly in if expressions. Either use isTRUE(all.equal(...)) or identical as appropriate.
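The reason for wrapping all.equal in isTRUE becomes clear when the two values really do differ: all.equal then returns a character description rather than FALSE, and if cannot use that. The sketch below shows this, together with a hand-written tolerance test as one possible alternative; the tolerance of 1e-9 is an arbitrary choice for this illustration.

```r
x <- 0.3 - 0.2
y <- 0.1

x == y                       # FALSE: exact comparison
isTRUE(all.equal(x, y))      # TRUE: equal within tolerance

# When values differ, all.equal returns a character string, not FALSE
all.equal(1, 1.1)
is.character(all.equal(1, 1.1))   # TRUE

# Safe use inside if()
if (isTRUE(all.equal(x, y))) message("x and y are equal within tolerance")

# A hand-rolled alternative with an explicit (arbitrary) tolerance
abs(x - y) < 1e-9            # TRUE
```

The explicit tolerance version is useful when you want to state exactly how close is close enough for the problem in hand.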

2.2.4 Summarizing differences between objects using all.equal

The function all.equal is very useful in programming for checking that objects are as you expect them to be. Where differences occur, all.equal does a useful job in describing all the differences it finds. Here, for instance, it reports on the difference between a which is a vector of characters and b which is a factor:

    a <- c("cat","dog","goldfish")
    b <- factor(a)

In the all.equal function, the object on the left (a) is called the ‘target’ and the object on the right (b) is ‘current’:

    all.equal(a,b)
    [1] "Modes: character, numeric"
    [2] "Attributes: < target is NULL, current is list >"
    [3] "target is character, current is factor"

Recall that factors are stored internally as integers, so they have mode = numeric.

    class(b)
    [1] "factor"
    mode(b)
    [1] "numeric"

The reason why ‘current is list’ in line [2] of the output is that factors have two attributes and these are stored as a list – namely, their levels and their class:

    attributes(b)
    $levels
    [1] "cat" "dog" "goldfish"
    $class
    [1] "factor"

The all.equal function is also useful for obtaining feedback on differences in things like the lengths of vectors:

    n1 <- c(1,2,3)
    n2 <- c(1,2,3,4)
    all.equal(n1,n2)
    [1] "Numeric: lengths (3, 4) differ"

It works well, too, for multiple differences:

    n2 <- as.character(n2)
    all.equal(n1,n2)
    [1] "Modes: numeric, character"
    [2] "Lengths: 3, 4"
    [3] "target is numeric, current is character"

Note that ‘target’ is the first argument to the function and ‘current’ is the second. If you supply more than two objects to be compared, the third and subsequent objects are simply ignored.

2.2.5 Evaluation of combinations of TRUE and FALSE

It is important to understand how combinations of logical variables evaluate, and to appreciate how logical operations (such as those in Table 2.2) work when there are missing values, NA. Here are all the possible outcomes expressed as a logical vector called x:

    x <- c(NA, FALSE, TRUE)
    names(x) <- as.character(x)

To see the logical combinations of & (logical AND) we can use the outer function with x to evaluate all nine combinations of NA, FALSE and TRUE like this:

    outer(x, x, "&")
           <NA> FALSE  TRUE
    <NA>     NA FALSE    NA
    FALSE FALSE FALSE FALSE
    TRUE     NA FALSE  TRUE

Only TRUE & TRUE evaluates to TRUE. Note the behaviour of NA & NA and NA & TRUE. Where one of the two components is NA, the result will be NA if the outcome is ambiguous. Thus, NA & TRUE evaluates to NA, but NA & FALSE evaluates to FALSE.

To see the logical combinations of | (logical OR) write:

    outer(x, x, "|")
          <NA> FALSE TRUE
    <NA>    NA    NA TRUE
    FALSE   NA FALSE TRUE
    TRUE  TRUE  TRUE TRUE

Only FALSE | FALSE evaluates to FALSE. Note the behaviour of NA | NA and NA | FALSE.

2.2.6 Logical arithmetic

Arithmetic involving logical expressions is very useful in programming and in selection of variables. If logical arithmetic is unfamiliar to you, then persevere with it, because it will become clear how useful it is, once the penny has dropped. The key thing to understand is that logical expressions evaluate to either true or false (represented in R by TRUE or FALSE), and that R can coerce TRUE or FALSE into numerical values: 1 for TRUE and 0 for FALSE. Suppose that x is a sequence from 0 to 6 like this:

    x <- 0:6

Now we can ask questions about the contents of the vector called x. Is x less than 4?

    x<4
    [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE

The answer is yes for the first four values (0, 1, 2 and 3) and no for the last three (4, 5 and 6). Two important logical functions are all and any. They check an entire vector but return a single logical value: TRUE or FALSE. Are all the x values bigger than 0?

    all(x>0)
    [1] FALSE

No. The first x value is a zero. Are any of the x values negative?

    any(x<0)
    [1] FALSE

No. The smallest x value is a zero.

We can use the answers of logical functions in arithmetic. We can count the true values of (x<4), using sum:

    sum(x<4)
    [1] 4

We can multiply (x<4) by other vectors:

    (x<4)*runif(7)
    [1] 0.9433433 0.9382651 0.6248691 0.9786844 0.0000000 0.0000000
    [7] 0.0000000

Logical arithmetic is particularly useful in generating simplified factor levels during statistical modelling. Suppose we want to reduce a five-level factor (a, b, c, d, e) called treatment to a three-level factor called t2 by lumping together the levels a and e (new factor level 1) and c and d (new factor level 3) while leaving b distinct (with new factor level 2):

    (treatment <- letters[1:5])
    [1] "a" "b" "c" "d" "e"
    (t2 <- factor(1+(treatment=="b")+2*(treatment=="c")+2*(treatment=="d")))
    [1] 1 2 3 3 1
    Levels: 1 2 3

The new factor t2 gets a value 1 as default for all the factor levels, and we want to leave this as it is for levels a and e. Thus, we do not add anything to the 1 if the old factor level is a or e. For old factor level b, however, we want the result that t2=2 so we add 1 (treatment=="b") to the original 1 to get the answer we require. This works because the logical expression evaluates to 1 (TRUE) for every case in which the old factor level is b and to 0 (FALSE) in all other cases. For old factor levels c and d we want the result that t2=3 so we add 2 to the baseline value of 1 if the original factor level is either c (2*(treatment=="c")) or d (2*(treatment=="d")). You may need to read this several times before the penny drops. Note that ‘logical equals’ is a double equals sign without a space in between (==). You need to understand the distinction between:

    x<-y    x is assigned the value of y (x gets the values of y);
    x=y     in a function or a list x is set to y unless you specify otherwise;
    x==y    produces TRUE if x is exactly equal to y and FALSE otherwise.
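The pieces of this section can be gathered into one short sketch. The use of mean to obtain a proportion and of %in% to test membership of several levels at once are additions for this illustration; they give the same result as the longer expression in the text.

```r
x <- 0:6
sum(x < 4)            # 4: number of values below 4
mean(x < 4)           # proportion of values below 4 (4/7)

# Lumping factor levels: a and e -> 1, b -> 2, c and d -> 3
treatment <- letters[1:5]
t2 <- factor(1 + (treatment == "b") + 2 * (treatment %in% c("c", "d")))
as.vector(t2)         # "1" "2" "3" "3" "1"

# Missing values give NA only when the answer is genuinely unknown
NA & FALSE            # FALSE
NA | TRUE             # TRUE
NA & TRUE             # NA
```

The last three lines restate the rule from the outer tables: a missing value does not matter when the other operand already decides the outcome.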

2.3 Generating sequences

An important way of creating vectors is to generate a sequence of numbers. The simplest sequences are in steps of 1, and the colon operator is the simplest way of generating such sequences. All you do is specify the first and last values separated by a colon. Here is a sequence from 0 up to 10:

    0:10
    [1] 0 1 2 3 4 5 6 7 8 9 10

Here is a sequence from 15 down to 5:

    15:5
    [1] 15 14 13 12 11 10 9 8 7 6 5

To generate a sequence in steps other than 1, you use the seq function. There are various forms of this, of which the simplest has three arguments: from, to, by (the initial value, the final value and the increment). If the initial value is smaller than the final value, the increment should be positive, like this:

    seq(0, 1.5, 0.1)
    [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5

If the initial value is larger than the final value, the increment should be negative, like this:

    seq(6,4,-0.2)
    [1] 6.0 5.8 5.6 5.4 5.2 5.0 4.8 4.6 4.4 4.2 4.0

In many cases, you want to generate a sequence to match an existing vector in length. Rather than having to figure out the increment that will get from the initial to the final value and produce a vector of exactly the appropriate length, R provides the along and length options. Suppose you have a vector of population sizes:

    N <- c(55,76,92,103,84,88,121,91,65,77,99)

You need to plot this against a sequence that starts at 0.04 in steps of 0.01:

    seq(from=0.04,by=0.01,length=11)
    [1] 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14

But this requires you to figure out the length of N. A simpler method is to use the along argument and specify the vector, N, whose length has to be matched:

    seq(0.04,by=0.01,along=N)
    [1] 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14

Alternatively, you can get R to work out the increment (0.01 in this example), by specifying the start and the end values (from and to), and the name of the vector (N) whose length has to be matched:

    seq(from=0.04,to=0.14,along=N)
    [1] 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14

An important application of the last option is to get the x values for drawing smooth lines through a scatterplot of data using predicted values from a model (see p. 207).
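The three ways of matching a sequence to the length of N can be checked against one another. The argument names length and along used above are abbreviations that R accepts through partial matching; the full names, used in this sketch, are length.out and along.with. The names s1, s2 and s3 are introduced only for the comparison.

```r
N <- c(55, 76, 92, 103, 84, 88, 121, 91, 65, 77, 99)

s1 <- seq(from = 0.04, by = 0.01, length.out = length(N))
s2 <- seq(from = 0.04, by = 0.01, along.with = N)
s3 <- seq(from = 0.04, to = 0.14, length.out = length(N))

all.equal(s1, s2)         # TRUE
all.equal(s1, s3)         # TRUE
length(s1) == length(N)   # TRUE

seq_along(N)              # 1 2 ... 11: the index vector for N
```

Using length(N) rather than a typed-in 11 means the code keeps working if the data vector later changes length.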

Notice that when the increment does not match the final value, the generated sequence stops short of the last value (rather than overstepping it):

    seq(1.4,2.1,0.3)
    [1] 1.4 1.7 2.0

If you want a vector made up of sequences of unequal lengths, then use the sequence function. Suppose that most of the six sequences you want to string together are from 1 to 4, but the second one is 1 to 3 and the last one is 1 to 5, then:

    sequence(c(4,3,4,4,4,5))
    [1] 1 2 3 4 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 5

2.3.1 Generating repeats

You will often want to generate repeats of numbers or characters, for which the function is rep. The object that is named in the first argument is repeated a number of times as specified in the second argument. At its simplest, we would generate five 9s like this:

    rep(9,5)
    [1] 9 9 9 9 9

You can see the issues involved by a comparison of these three increasingly complicated uses of the rep function:

    rep(1:4, 2)
    [1] 1 2 3 4 1 2 3 4
    rep(1:4, each = 2)
    [1] 1 1 2 2 3 3 4 4
    rep(1:4, each = 2, times = 3)
     [1] 1 1 2 2 3 3 4 4 1 1 2 2
    [13] 3 3 4 4 1 1 2 2 3 3 4 4

In the simplest case, the entire first argument is repeated (i.e. the sequence 1 to 4 appears twice). You often want each element of the sequence to be repeated, and this is accomplished with the each argument. Finally, you might want each number repeated and the whole series repeated a certain number of times (here three times).

When each element of the series is to be repeated a different number of times, then the second argument must be a vector of the same length as the vector comprising the first argument (length 4 in this example). So if we want one 1, two 2s, three 3s and four 4s we would write:

    rep(1:4,1:4)
    [1] 1 2 2 3 3 3 4 4 4 4

In a more complicated case, there is a different but irregular repeat of each of the elements of the first argument. Suppose that we need four 1s, one 2, four 3s and two 4s. Then we use the concatenation function c to create a vector of length 4, c(4,1,4,2), which will act as the second argument to the rep function:

    rep(1:4,c(4,1,4,2))
    [1] 1 1 1 1 2 3 3 3 3 4 4

Here is the most complex case with character data rather than numbers: each element of the series is repeated an irregular number of times:

    rep(c("cat","dog","gerbil","goldfish","rat"),c(2,3,2,1,3))
    [1] "cat"      "cat"      "dog"      "dog"      "dog"      "gerbil"
    [7] "gerbil"   "goldfish" "rat"      "rat"      "rat"

This is the most general, and also the most useful form of the rep function.

2.3.2 Generating factor levels

The function gl ('generate levels') is useful when you want to encode long vectors of factor levels. The syntax for the three arguments is: 'up to', 'with repeats of', 'to total length'. Here is the simplest case where we want factor levels up to 4 with repeats of 3 repeated only once (i.e. to total length 12):

    gl(4,3)
     [1] 1 1 1 2 2 2 3 3 3 4 4 4
    Levels: 1 2 3 4

Here is the function when we want that whole pattern repeated twice:

    gl(4,3,24)
     [1] 1 1 1 2 2 2 3 3 3 4 4 4
    [13] 1 1 1 2 2 2 3 3 3 4 4 4
    Levels: 1 2 3 4

If you want text for the factor levels, rather than numbers, use labels like this:

    Temp <- gl(2, 2, 24, labels = c("Low", "High"))
    Soft <- gl(3, 8, 24, labels = c("Hard","Medium","Soft"))
    M.user <- gl(2, 4, 24, labels = c("N", "Y"))
    Brand <- gl(2, 1, 24, labels = c("X", "M"))
    data.frame(Temp,Soft,M.user,Brand)
       Temp   Soft M.user Brand
    1   Low   Hard      N     X
    2   Low   Hard      N     M
    3  High   Hard      N     X
    4  High   Hard      N     M
    5   Low   Hard      Y     X
    6   Low   Hard      Y     M
    7  High   Hard      Y     X
    8  High   Hard      Y     M
    9   Low Medium      N     X
    10  Low Medium      N     M
    11 High Medium      N     X
    12 High Medium      N     M
    13  Low Medium      Y     X
    14  Low Medium      Y     M
    15 High Medium      Y     X
    16 High Medium      Y     M
    17  Low   Soft      N     X
    18  Low   Soft      N     M
    19 High   Soft      N     X
    20 High   Soft      N     M
    21  Low   Soft      Y     X
    22  Low   Soft      Y     M
    23 High   Soft      Y     X
    24 High   Soft      Y     M

2.4 Membership: Testing and coercing in R

The concepts of membership and coercion may be unfamiliar. Membership relates to the class of an object in R. Coercion changes the class of an object. For instance, a logical variable has class logical and mode logical. This is how we create the variable:

    lv <- c(T,F,T)

We can assess its membership by asking if it is a logical variable using the is.logical function:

    is.logical(lv)
    [1] TRUE

It is not a factor, and so it does not have levels:

    levels(lv)
    NULL

But we can coerce it to be a two-level factor like this:

    (fv <- as.factor(lv))
    [1] TRUE  FALSE TRUE
    Levels: FALSE TRUE
    is.factor(fv)
    [1] TRUE

We can coerce a logical variable to be numeric: TRUE evaluates to 1 and FALSE evaluates to zero, like this:

    (nv <- as.numeric(lv))
    [1] 1 0 1

This is particularly useful as a shortcut when creating new factors with reduced numbers of levels (as we do in model simplification).
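One way this shortcut might look is sketched below, assuming a hypothetical four-level factor called treatment (invented for illustration). A logical test is coerced to numeric 0/1 and then to a two-level factor; the original objects are untouched because the results are assigned to new names.

```r
# Hypothetical four-level factor, invented for illustration
treatment <- factor(c("control", "low", "medium", "high", "control", "high"))

# Logical test -> numeric 0/1 -> factor with only two levels
treated <- as.factor(as.numeric(treatment != "control"))
levels(treated)
is.factor(treated)

# Coercion returns a new object; the original keeps its class
lv <- c(TRUE, FALSE, TRUE)
nv <- as.numeric(lv)
class(lv)
class(nv)
```

Here the levels of treated are the character strings "0" and "1", while treatment still has its four original levels.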

Table 2.3. Functions for testing (is) the attributes of different categories of object (arrays, lists, etc.) and for coercing (as) the attributes of an object into a specified form. Neither operation changes the attributes of the object unless you overwrite its name.

| Type | Testing | Coercing |
|---|---|---|
| Array | is.array | as.array |
| Character | is.character | as.character |
| Complex | is.complex | as.complex |
| Dataframe | is.data.frame | as.data.frame |
| Double | is.double | as.double |
| Factor | is.factor | as.factor |
| List | is.list | as.list |
| Logical | is.logical | as.logical |
| Matrix | is.matrix | as.matrix |
| Numeric | is.numeric | as.numeric |
| Raw | is.raw | as.raw |
| Time series (ts) | is.ts | as.ts |
| Vector | is.vector | as.vector |

In general, the expression as(object, value) is the way to coerce an object to a particular class. Membership functions ask is.something and coercion functions say as.something.

Objects have a type, and you can test the type of an object using an is.type function (Table 2.3). For instance, mathematical functions expect numeric input and text-processing functions expect character input. Some types of objects can be coerced into other types. A familiar type of coercion occurs when we interpret the TRUE and FALSE of logical variables as numeric 1 and 0, respectively. Factor levels can be coerced to numbers. Numbers can be coerced into characters, but non-numeric characters cannot be coerced into numbers.

    as.numeric(factor(c("a","b","c")))
    [1] 1 2 3
    as.numeric(c("a","b","c"))
    [1] NA NA NA
    Warning message:
    NAs introduced by coercion
    as.numeric(c("a","4","c"))
    [1] NA  4 NA
    Warning message:
    NAs introduced by coercion

If you try to coerce complex numbers to numeric the imaginary part will be discarded. Note that is.complex and is.numeric are never both TRUE.

We often want to coerce tables into the form of vectors as a simple way of stripping off their dimnames (using as.vector), and to turn matrices into dataframes (as.data.frame). A lot of testing involves the NOT operator ! in functions to return an error message if the wrong type is supplied. For instance, if you were writing a function to calculate geometric means you might want to test to ensure that the input was numeric using !is.numeric:

    geometric <- function(x) {
      if(!is.numeric(x)) stop ("Input must be numeric")
      exp(mean(log(x)))
    }

Here is what happens when you try to work out the geometric mean of character data:

    geometric(c("a","b","c"))
    Error in geometric(c("a", "b", "c")) : Input must be numeric

You might also want to check that there are no zeros or negative numbers in the input, because it would make no sense to try to calculate a geometric mean of such data:

    geometric <- function(x) {
      if(!is.numeric(x)) stop ("Input must be numeric")
      if(min(x)<=0) stop ("Input must be greater than zero")
      exp(mean(log(x)))
    }

Testing this:

    geometric(c(2,3,0,4))
    Error in geometric(c(2, 3, 0, 4)) : Input must be greater than zero

But when the data are OK there will be no messages, just the numeric answer:

    geometric(c(10,1000,10,1,1))
    [1] 10

When vectors are created by calculation from other vectors, the new vector will be as long as the longest vector used in the calculation and the shorter variable will be recycled as necessary: here A is of length 10 and B is of length 3:

    A <- 1:10
    B <- c(2,4,8)
    A*B
    [1]  2  8 24  8 20 48 14 32 72 20
    Warning message:
    longer object length is not a multiple of shorter object length in: A * B

The vector B is recycled three times in full (and once in part), and a warning message is printed to indicate that the length of the longer vector (A) is not a multiple of the length of the shorter vector (B).

2.5 Missing values, infinity and things that are not numbers

Calculations can lead to answers that are plus infinity, represented in R by Inf, or minus infinity, which is represented as -Inf:

    3/0
    [1] Inf

    -12/0
    [1] -Inf

Calculations involving infinity can be evaluated: for instance,

    exp(-Inf)
    [1] 0
    0/Inf
    [1] 0
    (0:3)^Inf
    [1]   0   1 Inf Inf

Other calculations, however, lead to quantities that are not numbers. These are represented in R by NaN ('not a number'). Here are some of the classic cases:

    0/0
    [1] NaN
    Inf-Inf
    [1] NaN
    Inf/Inf
    [1] NaN

You need to understand clearly the distinction between NaN and NA (this stands for 'not available' and is the missing-value symbol in R; see below). The function is.nan is provided to check specifically for NaN, and is.na also returns TRUE for NaN. Coercing NaN to logical or integer type gives an NA of the appropriate type. There are built-in tests to check whether a number is finite or infinite:

    is.finite(10)
    [1] TRUE
    is.infinite(10)
    [1] FALSE
    is.infinite(Inf)
    [1] TRUE

2.5.1 Missing values: NA

Missing values in dataframes are a real source of irritation, because they affect the way that model-fitting functions operate and they can greatly reduce the power of the modelling that we would like to do. You may want to discover which values in a vector are missing. Here is a simple case:

    y <- c(4,NA,7)

The missing value question should evaluate to FALSE TRUE FALSE. There are two ways of looking for missing values that you might think should work, but do not. These involve treating NA as if it was a piece of text and using double equals (==) to test for it. So this does not work:

    y==NA
    [1] NA NA NA

because it turns all the values into NA (definitely not what you intended). This does not work either:

    y == "NA"
    [1] FALSE    NA FALSE

It correctly reports that the numbers are not character strings, but it returns NA for the missing value itself, rather than TRUE as required. This is how you do it properly:

    is.na(y)
    [1] FALSE  TRUE FALSE

To produce a vector with the NA stripped out, use subscripts with the not operator ! like this:

    y[! is.na(y)]
    [1] 4 7

This syntax is useful in editing out rows containing missing values from large dataframes. Here is a very simple example of a dataframe with four rows and four columns:

    y1 <- c(1,2,3,NA)
    y2 <- c(5,6,NA,8)
    y3 <- c(9,NA,11,12)
    y4 <- c(NA,14,15,16)
    full.frame <- data.frame(y1,y2,y3,y4)
    reduced.frame <- full.frame[!is.na(full.frame$y1),]

so the new reduced.frame will have fewer rows than full.frame when the variable called y1 in full.frame contains one or more missing values.

    reduced.frame
      y1 y2 y3 y4
    1  1  5  9 NA
    2  2  6 NA 14
    3  3 NA 11 15

Some functions do not work with their default settings when there are missing values in the data, and mean is a classic example of this:

    x <- c(1:8,NA)
    mean(x)
    [1] NA

In order to calculate the mean of the non-missing values, you need to specify that the NA are to be removed, using the na.rm=TRUE argument:

    mean(x,na.rm=T)
    [1] 4.5

Here is an example where we want to find the locations (7 and 8) of missing values within a vector called vmv:

    vmv <- c(1:6,NA,NA,9:12)
    vmv
    [1]  1  2  3  4  5  6 NA NA  9 10 11 12

Making an index of the missing values in an array could use the seq function, like this:

    seq(along=vmv)[is.na(vmv)]
    [1] 7 8

However, the result is achieved more simply using the which function like this:

    which(is.na(vmv))
    [1] 7 8

If the missing values are genuine counts of zero, you might want to edit the NA to 0. Use the is.na function to generate subscripts for this:

    vmv[is.na(vmv)] <- 0
    vmv
    [1]  1  2  3  4  5  6  0  0  9 10 11 12

Or use the ifelse function like this:

    vmv <- c(1:6,NA,NA,9:12)
    ifelse(is.na(vmv),0,vmv)
    [1]  1  2  3  4  5  6  0  0  9 10 11 12

Be very careful when doing this, because most missing values are not genuine zeros.

2.6 Vectors and subscripts

A vector is a variable with one or more values of the same type. For instance, the numbers of peas in six pods were 4, 7, 6, 5, 6 and 7. The vector called peas is one object of length 6. In this case, the class of the object is numeric. The easiest way to create a vector in R is to concatenate (link together) the six values using the concatenate function, c, like this:

    peas <- c(4, 7, 6, 5, 6, 7)

We can ask all sorts of questions about the vector called peas. For instance, what type of vector is it?

    class(peas)
    [1] "numeric"

How big is the vector?

    length(peas)
    [1] 6

The great advantage of a vector-based language is that it is very simple to ask quite involved questions that involve all of the values in the vector. These vector functions are often self-explanatory:

    mean(peas)
    [1] 5.833333
    max(peas)
    [1] 7
    min(peas)
    [1] 4

Others might be more opaque:

    quantile(peas)
      0%  25%  50%  75% 100%
    4.00 5.25 6.00 6.75 7.00

Another way to create a vector is to input data from the keyboard using the function called scan:

    peas <- scan()

The prompt 1: appears, which means type in the first number of peas (4) then press the return key; then the prompt 2: appears (you type in 7) and so on. When you have typed in all six values, and the prompt 7: has appeared, you just press the return key to tell R that the vector is now complete. R replies by telling you how many items it has read:

    1: 4
    2: 7
    3: 6
    4: 5
    5: 6
    6: 7
    7:
    Read 6 items

For more realistic applications, the usual way of creating vectors is to read the data from a pre-prepared computer file (as described in Chapter 3).

2.6.1 Extracting elements of a vector using subscripts

You will often want to use some but not all of the contents of a vector. To do this, you need to master the use of subscripts (or indices as they are also known). In R, subscripts involve the use of square brackets [ ]. Our vector called peas shows the numbers of peas in six pods:

    peas
    [1] 4 7 6 5 6 7

The first element of peas is 4, the second 7, and so on. The elements are indexed left to right, 1 to 6. It could not be more straightforward. If we want to extract the fourth element of peas (which you can see is a 5) then this is what we do:

    peas[4]
    [1] 5

If we want to extract several values (say the 2nd, 3rd and 6th) we use a vector to specify the pods we want as subscripts, either in two stages like this:

    pods <- c(2,3,6)
    peas[pods]
    [1] 7 6 7

or in a single step, like this:

    peas[c(2,3,6)]
    [1] 7 6 7

You can drop values from a vector by using negative subscripts. Here are all but the first value of peas:

    peas[-1]
    [1] 7 6 5 6 7

Here are all but the last (note the use of the length function to decide what is last):

    peas[-length(peas)]
    [1] 4 7 6 5 6

We can use these ideas to write a function called trim to remove (say) the largest two and the smallest two values from a vector called x. First we have to sort the vector, then remove the smallest two values (these will have subscripts 1 and 2), then remove the largest two values (which will have subscripts length(x) and length(x)-1):

    trim <- function(x) sort(x)[-c(1,2,length(x)-1,length(x))]

We can use trim on the vector called peas, expecting to get 6 and 6 as the result:

    trim(peas)
    [1] 6 6

Finally, we can use sequences of numbers to extract values from a vector. Here are the first three values of peas:

    peas[1:3]
    [1] 4 7 6

Here are the values of peas in the even-numbered positions:

    peas[seq(2,length(peas),2)]
    [1] 7 5 7

or alternatively:

    peas[1:length(peas) %% 2 == 0]
    [1] 7 5 7

using the modulo function %% on the sequence 1 to 6 to extract the even numbers 2, 4 and 6. Note that vectors in R could have length 0, and this could be useful in writing functions:

    y <- 4.3
    z <- y[-1]
    length(z)
    [1] 0

2.6.2 Classes of vector

The vector called peas contained numbers: in the jargon, it is of class numeric. R allows vectors of six types, so long as all of the elements in one vector belong to the same class. The classes are logical, integer, real, complex, string (or character) or raw. You will use numeric, logical and character variables all the time. Engineers and mathematicians will use complex numbers. But you could go a whole career without ever needing to use integer or raw.

2.6.3 Naming elements within vectors

It is often useful to have the values in a vector labelled in some way. For instance, if our data are counts of 0, 1, 2, . . . occurrences in a vector called counts,

    (counts <- c(25,12,7,4,6,2,1,0,2))
    [1] 25 12  7  4  6  2  1  0  2

so that there were 25 zeros, 12 ones and so on, it would be useful to name each of the counts with the relevant number 0 to 8:

    names(counts) <- 0:8

Now when we inspect the vector called counts we see both the names and the frequencies:

    counts
     0  1  2  3  4  5  6  7  8
    25 12  7  4  6  2  1  0  2

If you have computed a table of counts, and you want to remove the names, then use the as.vector function like this:

    (st <- table(rpois(2000,2.3)))
      0   1   2   3   4   5   6   7   8   9
    205 455 510 431 233 102  43  13   7   1
    as.vector(st)
    [1] 205 455 510 431 233 102 43 13 7 1
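As a small sketch of what the names make possible (the variables total and mean.occ are hypothetical, introduced only for illustration), a named element can be extracted by its name as well as by its position, and the names can be converted back to numbers for use in calculations:

```r
counts <- c(25, 12, 7, 4, 6, 2, 1, 0, 2)
names(counts) <- 0:8          # the names are stored as character strings

counts["3"]                   # the element named "3" (a frequency of 4)
counts[4]                     # the same element, extracted by position

# Names coerced back to numbers: total occurrences and mean per case
total <- sum(as.numeric(names(counts)) * counts)
mean.occ <- total / sum(counts)

# unname strips the names from an ordinary named vector
unname(counts)
```

Note that the subscript "3" is in quotes: it is matched against the names, whereas the unquoted 3 would be taken as a position.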

2.6.4 Working with logical subscripts

Take the example of a vector containing the 11 numbers 0 to 10:

    x <- 0:10

There are two quite different kinds of things we might want to do with this. We might want to add up the values of the elements:

    sum(x)
    [1] 55

Alternatively, we might want to count the elements that passed some logical criterion. Suppose we wanted to know how many of the values were less than 5:

    sum(x<5)
    [1] 5

You see the distinction. We use the vector function sum in both cases. But sum(x) adds up the values of the xs and sum(x<5) counts up the number of cases that pass the logical condition 'x is less than 5'. This works because of coercion (p. 30). Logical TRUE has been coerced to numeric 1 and logical FALSE has been coerced to numeric 0.

That is all well and good, but how do you add up the values of just some of the elements of x? We specify a logical condition, but we do not want to count the number of cases that pass the condition, we want to add up all the values of the cases that pass. This is the final piece of the jigsaw, and involves the use of logical subscripts. Note that when we counted the number of cases, the counting was applied to the entire vector, using sum(x<5). To find the sum of the values of x that are less than 5, we write:

    sum(x[x<5])
    [1] 10

Let us look at this in more detail. The logical condition x<5 is either true or false:

    x<5
    [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

You can imagine false as being numeric 0 and true as being numeric 1. Then the vector of subscripts [x<5] is five 1s followed by six 0s:

    1*(x<5)
    [1] 1 1 1 1 1 0 0 0 0 0 0

Now imagine multiplying the values of x by the values of the logical vector:

    x*(x<5)
    [1] 0 1 2 3 4 0 0 0 0 0 0

When the function sum is applied, it gives us the answer we want: the sum of the values of the numbers 0 + 1 + 2 + 3 + 4 = 10.

    sum(x*(x<5))
    [1] 10

This produces the same answer as sum(x[x<5]), but is rather less elegant.

Suppose we want to work out the sum of the three largest values in a vector. There are two steps: first sort the vector into descending order; then add up the values of the first three elements of the reverse-sorted array. Let us do this in stages. First, the values of y:

    y <- c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11)

Now if you apply sort to this, the numbers will be in ascending sequence, and this makes life slightly harder for the present problem:

    sort(y)
     [1]  2  3  3  4  4  5  6  6  7  8  8  9
    [13]  9 10 11

We can use the reverse function, rev, like this (use the Up arrow key to save typing):

    rev(sort(y))
     [1] 11 10  9  9  8  8  7  6  6  5  4  4
    [13]  3  3  2

So the answer to our problem is 11 + 10 + 9 = 30. But how to compute this? A range of subscripts is simply a series generated using the colon operator. We want the subscripts 1 to 3, so this is:

    rev(sort(y))[1:3]
    [1] 11 10 9

So the answer to the exercise is just:

    sum(rev(sort(y))[1:3])
    [1] 30

Note that we have not changed the vector y in any way, nor have we created any new space-consuming vectors during intermediate computational steps.

You will often want to find out which value in a vector is the maximum or the minimum. This is a question about indices, and the answer you want is an integer indicating which element of the vector contains the maximum (or minimum) out of all the values in that vector. Here is the vector:

    x <- c(2,3,4,1,5,8,2,3,7,5,7)

So the answers we want are 6 (the maximum) and 4 (the minimum). The slow way to do it is like this:

    which(x == max(x))
    [1] 6
    which(x == min(x))
    [1] 4

Better, however, to use the much quicker built-in functions which.max or which.min like this:

    which.max(x)
    [1] 6
    which.min(x)
    [1] 4

2.7 Vector functions

One of R's great strengths is its ability to evaluate functions over entire vectors, thereby avoiding the need for loops and subscripts. The most important vector functions are listed in Table 2.4.

Table 2.4. Vector functions used in R.

| Operation | Meaning |
|---|---|
| max(x) | maximum value in x |
| min(x) | minimum value in x |
| sum(x) | total of all the values in x |
| mean(x) | arithmetic average of the values in x |
| median(x) | median value in x |
| range(x) | vector of min(x) and max(x) |
| var(x) | sample variance of x |
| cor(x,y) | correlation between vectors x and y |
| sort(x) | a sorted version of x |
| rank(x) | vector of the ranks of the values in x |
| order(x) | an integer vector containing the permutation to sort x into ascending order |
| quantile(x) | vector containing the minimum, lower quartile, median, upper quartile, and maximum of x |
| cumsum(x) | vector containing the sum of all of the elements up to that point |
| cumprod(x) | vector containing the product of all of the elements up to that point |
| cummax(x) | vector of non-decreasing numbers which are the cumulative maxima of the values in x up to that point |
| cummin(x) | vector of non-increasing numbers which are the cumulative minima of the values in x up to that point |
| pmax(x,y,z) | vector, of length equal to the longest of x, y or z, containing the maximum of x, y or z for the ith position in each |
| pmin(x,y,z) | vector, of length equal to the longest of x, y or z, containing the minimum of x, y or z for the ith position in each |
| colMeans(x) | column means of dataframe or matrix x |
| colSums(x) | column totals of dataframe or matrix x |
| rowMeans(x) | row means of dataframe or matrix x |
| rowSums(x) | row totals of dataframe or matrix x |

Here is a numeric vector:

    y <- c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11)

Some vector functions produce a single number:

    mean(y)
    [1] 6.333333

Others produce two numbers:

    range(y)
    [1] 2 11

here showing that the minimum was 2 and the maximum was 11. Other functions produce several numbers:

    fivenum(y)
    [1]  2.0  4.0  6.0  8.5 11.0

This is Tukey's famous five-number summary: the minimum, the lower hinge, the median, the upper hinge and the maximum (the hinges are explained on p. 346).

Perhaps the single most useful vector function in R is table. You need to see it in action to appreciate just how good it is. Here is a huge vector called counts containing 10 000 random integers from a negative binomial distribution (counts of fungal lesions on 10 000 individual leaves, for instance):

    counts <- rnbinom(10000,mu=0.92,size=1.1)

Here is a look at the first 30 values in counts:

    counts[1:30]
    [1] 3 1 0 0 1 0 0 0 0 1 1 0 0 2 0 1 3 1 0 1 0 1 1 0 0 2 1 4 0 1

The question is this: how many zeros are there in the whole vector of 10 000 numbers, how many 1s, and so on right up to the largest value within counts? A formidable task for you or me, but for R it is just:

    table(counts)
    counts
       0    1    2    3    4    5    6    7    8    9   10   11   13
    5039 2574 1240  607  291  141   54   29   11    9    3    1    1

There were 5039 zeros, 2574 ones, and so on up to the largest counts (there was one 11 and one 13 in this realization; you will have obtained different random numbers on your computer).

2.7.1 Obtaining tables of means using tapply

One of the most important functions in all of R is tapply. It does not sound like much from the name, but you will use it time and again for calculating means, variances, sample sizes, minima and maxima. With weather data, for instance, we might want the 12 monthly mean temperatures rather than the whole-year average. We have a response variable, temperature, and a categorical explanatory variable, month:

    data<-read.table("c:\\temp\\temperatures.txt",header=T)
    attach(data)
    names(data)
    [1] "temperature" "lower" "rain" "month" "yr"

The function that we want to apply is mean.
All we do is invoke the tapply function with three arguments: the response variable, the categorical explanatory variable and the name of the function that we want to apply:

    tapply(temperature,month,mean)
            1         2         3         4         5         6
     7.930051  8.671136 11.200508 13.813708 17.880847 20.306151
            7         8         9        10        11        12
    22.673854 23.104924 19.344211 15.125976 10.720702  8.299830

It is easy to apply other functions in the same way: here are the monthly variances

    tapply(temperature,month,var)

and the monthly minima

    tapply(temperature,month,min)

If R does not have a built-in function to do what you want (Table 2.4), then you can easily write your own. Here, for instance, is a function to calculate the standard error of each mean (these are called anonymous functions in R, because they are unnamed):

    tapply(temperature,month,function(x) sqrt(var(x)/length(x)))
            1         2         3         4         5         6
    0.1401489 0.1414445 0.1358934 0.1476242 0.1673197 0.1596439
            7         8         9        10        11        12
    0.1539661 0.1516091 0.1309294 0.1155612 0.1291703 0.1398438

The tapply function is very flexible. It can produce multi-dimensional tables simply by replacing the one categorical variable (month) by a list of categorical variables. Here are the monthly means calculated separately for each year, as specified by list(yr,month).
The variable you name first in the list (yr) will appear as the rows of the results table and the second (month) will appear as the columns:

    tapply(temperature,list(yr,month),mean)[,1:6]
                 1         2         3        4        5        6
    1987  3.170968  6.871429  8.132258 14.92667 15.60645 17.73667
    1988  8.048387  8.248276  9.959375 12.74483 17.31935 18.71667
    1989  8.841935  9.482143 11.919355 11.09333 20.40323 21.23667
    1990  9.445161 11.028571 12.487097 13.80000 20.16129 18.51667
    1991  6.980645  4.817857 12.022581 13.14333 15.58065 16.88000
    1992  6.964516  8.686207 11.477419 13.35000 20.45806 22.21667
    1993 10.119355  6.985714 11.209677 14.17000 17.79355 21.10000
    1994  8.825806  7.217857 11.806452 12.61667 16.23226 20.86000
    1995  8.309677 10.439286 10.667742 14.79667 18.74063 19.94483
    1996  7.019355  6.065517  8.487097 13.99667 14.38710 21.93667
    1997  4.932258 10.178571 13.370968 15.00667 18.17419 19.93000
    1998  8.759375 11.242857 11.719355 12.55333 19.43226 19.35000
    1999  9.523333  8.485714 11.790323 14.65000 18.94839 20.00667
    2000  8.229032 10.324138 11.900000 12.59000 18.22581 20.63333
    2001  7.067742  9.121429  9.012903 12.65667 18.96452 20.52667
    2002  9.067742 11.396429 12.319355 15.68667 16.81290 19.67667
    2003  8.012903  8.171429 13.425806 15.69000 17.36452 22.80000
    2004  8.261290  8.993103 10.354839 15.17000 17.98065 21.73667
    2005  9.116129  7.032143 10.787097 13.78333 17.12258 22.00000

The subscripts [,1:6] simply restrict the output to the first six months. You can see at once that January (month 1) 1993 was exceptionally warm and January 1987 exceptionally cold.

There is just one thing about tapply that might confuse you. If you try to apply a function that has built-in protection against missing values, then tapply may not do what you want, producing NA instead of the numerical answer. This is most likely to happen with the mean function because its default is to produce NA when there are one or more missing values. The remedy is to provide an extra argument to tapply, specifying that you want to see the average of the non-missing values. Use na.rm=TRUE to remove the missing values like this:

    tapply(temperature,yr,mean,na.rm=TRUE)

You might want to trim some of the extreme values before calculating the mean (the arithmetic mean is famously sensitive to outliers). The trim option allows you to specify the fraction of the data (between 0 and 0.5) that you want to be omitted from the left- and right-hand tails of the sorted vector of values before computing the mean of the central values:

    tapply(temperature,yr,mean,trim=0.2)
        1987     1988     1989     1990     1991     1992     1993
    13.46000 13.74500 14.99726 15.16301 13.92237 14.32091 14.28000

2.7.2 The aggregate function for grouped summary statistics

Suppose that we have two response variables (y and z) and two explanatory variables (x and w) that we might want to use to summarize functions like mean or variance of y and/or z. The aggregate function has a formula method which allows elegant summaries of four kinds (here data stands for the dataframe containing the variables):

    one to one      aggregate(y ~ x, data, mean)
    one to many     aggregate(y ~ x + w, data, mean)
    many to one     aggregate(cbind(y,z) ~ x, data, mean)
    many to many    aggregate(cbind(y,z) ~ x + w, data, mean)

This is very useful for removing pseudoreplication from dataframes.
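The four kinds of summary can be sketched on a tiny invented dataframe. The dataframe d and its values are hypothetical, chosen only so that the group means are easy to check by eye; the arguments are named (data = and FUN =) to make their roles explicit.

```r
# Hypothetical dataframe, invented for illustration
d <- data.frame(
  x = rep(c("a", "b"), each = 4),      # a a a a b b b b
  w = rep(c("p", "q"), times = 4),     # p q p q p q p q
  y = c(1, 2, 3, 4, 5, 6, 7, 8),
  z = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# One response, one grouping variable
one.to.one   <- aggregate(y ~ x, data = d, FUN = mean)

# One response, two grouping variables
one.to.many  <- aggregate(y ~ x + w, data = d, FUN = mean)

# Two responses, one grouping variable
many.to.one  <- aggregate(cbind(y, z) ~ x, data = d, FUN = mean)

# Two responses, two grouping variables
many.to.many <- aggregate(cbind(y, z) ~ x + w, data = d, FUN = mean)

one.to.one
many.to.many
```

Each result is itself a dataframe with one row per combination of grouping levels, which is why this form is convenient for collapsing repeated measurements to a single value per group.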
Here is an example using a dataframe with two continuous variables (Growth.rate and pH) and three categorical explanatory variables (Water, Detergent and Daphnia):

    data<-read.table("c:\\temp\\pHDaphnia.txt",header=T)
    names(data)
    [1] "Growth.rate" "Water" "Detergent" "Daphnia" "pH"

Here is a one-to-one use of aggregate to find mean growth rate in the two water samples:

    aggregate(Growth.rate~Water,data,mean)
      Water Growth.rate
    1  Tyne    3.685862
    2  Wear    4.017948

Here is a one-to-many use to look at the interaction between Water and Detergent:

    aggregate(Growth.rate~Water+Detergent,data,mean)
      Water Detergent Growth.rate
    1  Tyne    BrandA    3.661807
    2  Wear    BrandA    4.107857
    3  Tyne    BrandB    3.911116
    4  Wear    BrandB    4.108972
    5  Tyne    BrandC    3.814321
    6  Wear    BrandC    4.094704
    7  Tyne    BrandD    3.356203
    8  Wear    BrandD    3.760259

Finally, here is a many-to-many use to find mean pH as well as mean Growth.rate for the interaction between Water and Detergent:

    aggregate(cbind(pH,Growth.rate)~Water+Detergent,data,mean)
      Water Detergent       pH Growth.rate
    1  Tyne    BrandA 4.883908    3.661807
    2  Wear    BrandA 5.054835    4.107857
    3  Tyne    BrandB 5.043797    3.911116
    4  Wear    BrandB 4.892346    4.108972
    5  Tyne    BrandC 4.847069    3.814321
    6  Wear    BrandC 4.912128    4.094704
    7  Tyne    BrandD 4.809144    3.356203
    8  Wear    BrandD 5.097039    3.760259

2.7.3 Parallel minima and maxima: pmin and pmax

Here are three vectors of the same length, x, y and z. The parallel minimum function, pmin, finds the minimum from any one of the three variables for each subscript, and produces a vector as its result (of length equal to the longest of x, y or z):

    x
    [1] 0.99822644 0.98204599 0.20206455 0.65995552 0.93456667 0.18836278
    y
    [1] 0.51827913 0.30125005 0.41676059 0.53641449 0.07878714 0.49959328
    z
    [1] 0.26591817 0.13271847 0.44062782 0.65120395 0.03183403 0.36938092
    pmin(x,y,z)
    [1] 0.26591817 0.13271847 0.20206455 0.53641449 0.03183403 0.18836278

Thus the first and second minima came from z, the third from x, the fourth from y, the fifth from z, and the sixth from x. The functions min and max produce scalar results, not vectors.
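The contrast between the parallel and the scalar functions can be sketched with three short invented vectors (the values, and the names origin and p, are hypothetical). The last two lines show one common use of the parallel functions, which is an assumption of this sketch rather than something from the text: clamping values to a fixed interval, where the single numbers 0 and 1 are recycled along the vector.

```r
# Invented vectors, chosen so the results are easy to check by eye
x <- c(5, 2, 9)
y <- c(3, 8, 1)
z <- c(4, 4, 4)

pmin(x, y, z)    # element-wise minima: 3 2 1
pmax(x, y, z)    # element-wise maxima: 5 8 9
min(x, y, z)     # one number for all nine values: 1

# Which vector supplied each minimum (1 = x, 2 = y, 3 = z)
origin <- apply(cbind(x, y, z), 1, which.min)

# Clamping values to the interval [0, 1]
p <- c(-0.2, 0.4, 1.3)
clamped <- pmax(0, pmin(1, p))
```

Binding the vectors into a matrix and applying which.min to each row is just one way of recovering where each minimum came from.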

2.7.4 Summary information from vectors by groups

The vector function tapply is one of the most important and useful vector functions to master. The 't' stands for 'table' and the idea is to apply a function to produce a table from the values in the vector, based on one or more grouping variables (often the grouping is by factor levels). This sounds much more complicated than it really is:

    data <- read.table("c:\\temp\\daphnia.txt",header=T)
    attach(data)
    names(data)
    [1] "Growth.rate" "Water" "Detergent" "Daphnia"

The response variable is Growth.rate and the other three variables are factors (the analysis is on p. 528). Suppose we want the mean growth rate for each detergent:

    tapply(Growth.rate,Detergent,mean)
    BrandA BrandB BrandC BrandD
      3.88   4.01   3.95   3.56

This produces a table with four entries, one for each level of the factor called Detergent. To produce a two-dimensional table we put the two grouping variables in a list. Here we calculate the median growth rate for water type and daphnia clone:

    tapply(Growth.rate,list(Water,Daphnia),median)
         Clone1 Clone2 Clone3
    Tyne   2.87   3.91   4.62
    Wear   2.59   5.53   4.30

The first variable in the list creates the rows of the table and the second the columns. More detail on the tapply function is given in Chapter 6 (p. 245).

2.7.5 Addresses within vectors

There is an important function called which for finding addresses within vectors. The vector y looks like this:

    y <- c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11)

Suppose we wanted to know which elements of y contained values bigger than 5. We type:

    which(y>5)
    [1] 1 4 5 6 7 8 11 13 15

Notice that the answer to this enquiry is a set of subscripts. We do not use subscripts inside the which function itself. The function is applied to the whole array. To see the values of y that are larger than 5, we just type:

    y[y>5]
    [1]  8  7  6  6  8  9  9 10 11

Note that this is a shorter vector than y itself, because values of 5 or less have been left out:

    length(y)
    [1] 15
    length(y[y>5])
    [1] 9

2.7.6 Finding closest values

Finding the value in a vector that is closest to a specified value is straightforward using which. The vector xv contains 1000 random numbers from a normal distribution with mean = 100 and standard deviation = 10:

    xv <- rnorm(1000,100,10)

Here, we want to find the value of xv that is closest to 108.0. The logic is to work out the difference between 108 and each of the 1000 random numbers, then find which of these differences is the smallest. This is what the R code looks like:

    which(abs(xv-108)==min(abs(xv-108)))
    [1] 332

The closest value to 108.0 is in location 332 within xv. But just how close to 108.0 is this 332nd value? We use 332 as a subscript on xv to find this out:

    xv[332]
    [1] 108.0076

Now we can write a function to return the closest value to a specified value (sv) in any vector (xv):

    closest <- function(xv,sv) {
      xv[which(abs(xv-sv)==min(abs(xv-sv)))]
    }

and run it like this:

    closest(xv,108)
    [1] 108.0076

2.7.7 Sorting, ranking and ordering

These three related concepts are important, and one of them (order) is difficult to understand on first acquaintance. Let us take a simple example:

    houses <- read.table("c:\\temp\\houses.txt",header=T)
    attach(houses)
    names(houses)
    [1] "Location" "Price"

We apply the three different functions to the vector called Price:

    ranks <- rank(Price)
    sorted <- sort(Price)
    ordered <- order(Price)

Then we make a dataframe out of the four vectors like this:

    view <- data.frame(Price,ranks,sorted,ordered)
    view
       Price ranks sorted ordered
    1    325  12.0     95       9
    2    201  10.0    101       6
    3    157   5.0    117      10
    4    162   6.0    121      12
    5    164   7.0    157       3
    6    101   2.0    162       4
    7    211  11.0    164       5
    8    188   8.5    188       8
    9     95   1.0    188      11
    10   117   3.0    201       2
    11   188   8.5    211       7
    12   121   4.0    325       1

Rank

The prices themselves are in no particular sequence. The ranks column contains the value that is the rank of the particular data point (value of Price), where 1 is assigned to the lowest data point and length(Price) (here 12) is assigned to the highest data point. So the first element, a price of 325, happens to be the highest value in Price. You should check that there are 11 values smaller than 325 in the vector called Price. Fractional ranks indicate ties. There are two 188s in Price and their ranks are 8 and 9. Because they are tied, each gets the average of their two ranks (8 + 9)/2 = 8.5. The lowest price is 95, indicated by a rank of 1.

Sort

The sorted vector is very straightforward. It contains the values of Price sorted into ascending order. If you want to sort into descending order, use the reverse order function rev like this: y <- rev(sort(x)). Note that sort is potentially very dangerous, because it uncouples values that might need to be in the same row of the dataframe (e.g. because they are the explanatory variables associated with a particular value of the response variable). It is bad practice, therefore, to write x <- sort(x), not least because there is no ‘unsort’ function.

Order

This is the most important of the three functions, and much the hardest to understand on first acquaintance. The numbers in this column are subscripts between 1 and 12.
The order function returns an integer vector containing the permutation that will sort the input into ascending order. You will need to think about this

one. The lowest value of Price is 95. Look at the dataframe and ask yourself what is the subscript in the original vector called Price where 95 occurred. Scanning down the column, you find it in row number 9. This is the first value in ordered, ordered[1]. Where is the next smallest value (101) to be found within Price? It is in position 6, so this is ordered[2]. The third smallest value of Price (117) is in position 10, so this is ordered[3]. And so on.

This function is particularly useful in sorting dataframes, as explained on p. 166. Using order with subscripts is a much safer option than using sort, because with sort the values of the response variable and the explanatory variables could be uncoupled with potentially disastrous results if this is not realized at the time that modelling was carried out. The beauty of order is that we can use order(Price) as a subscript for Location to obtain the price-ranked list of locations:

    Location[order(Price)]
    [1] Reading     Staines     Winkfield   Newbury
    [5] Bracknell   Camberley   Bagshot     Maidenhead
    [9] Warfield    Sunninghill Windsor     Ascot

When you see it used like this, you can see exactly why the function is called order. If you want to reverse the order, just use the rev function like this:

    Location[rev(order(Price))]
    [1] Ascot       Windsor     Sunninghill Warfield
    [5] Maidenhead  Bagshot     Camberley   Bracknell
    [9] Newbury     Winkfield   Staines     Reading

Make sure you understand why some of the brackets are round and some are square.

2.7.8 Understanding the difference between unique and duplicated

The difference is best seen with a simple example. Here is a vector of names:

    names <- c("Williams","Jones","Smith","Williams","Jones","Williams")

We can see how many times each name appears using table:

    table(names)
    names
       Jones    Smith Williams
           2        1        3

It is clear that the vector contains just three different names.
The function called unique extracts these three unique names, creating a vector of length 3, unsorted, in the order in which the names are encountered in the vector: unique(names) [1] "Williams" "Jones" "Smith" In contrast, the function called duplicated produces a vector, of the same length as the vector of names, containing the logical values either FALSE or TRUE , depending upon whether or not that name has appeared already (reading from the left). You need to see this in action to understand what is happening, and why it

might be useful:

    duplicated(names)
    [1] FALSE FALSE FALSE TRUE TRUE TRUE

The first three names are not duplicated (FALSE), but the last three are all duplicated (TRUE). We can mimic the unique function by using this vector as subscripts like this:

    names[!duplicated(names)]
    [1] "Williams" "Jones" "Smith"

Note the use of the NOT operator (!) in front of the duplicated function. There you have it: if you want a shortened vector, containing only the unique values in names, then use unique, but if you want a vector of the same length as names then use duplicated. You might use this to extract values from a different vector (salaries, for instance) if you wanted the mean salary, ignoring the repeats:

    salary <- c(42,42,48,42,42,42)
    mean(salary)
    [1] 43
    salary[!duplicated(names)]
    [1] 42 42 48
    mean(salary[!duplicated(names)])
    [1] 44

Note that this is not the same answer as would be obtained by omitting the duplicate salaries, because two of the people (Jones and Williams) had the same salary (42). Here is the wrong answer:

    mean(salary[!duplicated(salary)])
    [1] 45

2.7.9 Looking for runs of numbers within vectors

The function called rle, which stands for ‘run length encoding’, is most easily understood with an example. Here is a vector of 150 random numbers from a Poisson distribution with mean 0.7:

    (poisson <- rpois(150,0.7))
    [1] 1 1 0 0 2 1 0 1 0 0 1 1 1 0 1 1 2 1 1 0 1 0 2 1 1 2 0 2 0 1 0 0 0 2 0
    [36] 1 0 4 0 0 1 0 1 0 1 0 2 1 1 1 0 1 0 1 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0
    [71] 2 1 1 1 1 0 1 0 1 0 0 1 1 0 1 0 2 1 1 2 0 1 0 1 0 0 0 1 1 0 1 2 2 0 1
    [106] 0 0 0 0 0 0 1 0 0 2 1 2 0 2 0 2 2 1 1 0 2 0 1 1 2 2 2 1 1 1 1 0 0 0 1
    [141] 0 2 1 4 0 0 2 1 0 1

We can do our own run length encoding on the vector by eye: there is a run of two 1s, then a run of two 0s, then a single 2, then a single 1, then a single 0, and so on. So the run lengths are 2, 2, 1, 1, 1, 1, ... The values associated with these runs were 1, 0, 2, 1, 0, 1, ... Here is the output from rle:

    rle(poisson)
    Run Length Encoding

    lengths: int [1:93] 2 2 1 2 1 1 2 3 1 2 1 ...
    values : num [1:93] 1 0 2 1 0 1 0 1 2 1 ...

The object produced by rle is a list of two vectors: the lengths of the runs and the values that did the running. To find the longest run, and the value associated with that longest run, we use the indexed lists like this:

    max(rle(poisson)[[1]])
    [1] 7

So the longest run in this vector of numbers was 7. But 7 of what? We use which to find the location of the 7 in lengths, then apply this index to values to find the answer:

    which(rle(poisson)[[1]]==7)
    [1] 55
    rle(poisson)[[2]][55]
    [1] 0

So, not surprisingly given that the mean was just 0.7, the longest run was of zeros. Here is a function to return the length of the run and its value for any vector (note that the body must refer to the argument x, not to poisson, if it is to work on any vector):

    run.and.value <- function (x) {
      a <- max(rle(x)[[1]])                        # length of the longest run
      b <- rle(x)[[2]][which(rle(x)[[1]] == a)]    # value(s) making up that run
      cat("length = ",a," value = ",b, "\n")
    }

Testing the function on the vector of 150 Poisson data gives:

    run.and.value(poisson)
    length = 7 value = 0

It is sometimes of interest to know the number of runs in a given vector (for instance, the lower the number of runs, the more aggregated the numbers; and the greater the number of runs, the more regularly spaced out). We use the length function for this:

    length(rle(poisson)[[2]])
    [1] 93

indicating that the 150 values were arranged in 93 runs (this is an intermediate value, characteristic of a random pattern). The value 93 appears in square brackets [1:93] in the output of the run length encoding function.

In a different example, suppose we had n1 values of 1 representing ‘present’ and n2 values of 0 representing ‘absent’; then the minimum number of runs would be 2 (a solid block of 1s then a solid block of 0s). The maximum number of runs would be 2n + 1 if they alternated (until the smaller number n = min(n1,n2) ran out).
Here is a simple runs test based on 10 000 randomizations of 25 ones and 30 zeros:

    n1 <- 25
    n2 <- 30
    y <- c(rep(1,n1),rep(0,n2))
    len <- numeric(10000)
    for (i in 1:10000) len[i] <- length(rle(sample(y))[[2]])
    quantile(len,c(0.025,0.975))

     2.5% 97.5%
       21    35

Thus, for these data (n1 = 25 and n2 = 30) an aggregated pattern would score 21 or fewer runs, and a regular pattern would score 35 or more runs. Any scores between 21 and 35 fall within the realm of random patterns.

2.7.10 Sets: union, intersect and setdiff

There are three essential functions for manipulating sets. The principles are easy to see if we work with an example of two sets:

    setA <- c("a", "b", "c", "d", "e")
    setB <- c("d", "e", "f", "g")

Make a mental note of what the two sets have in common, and what is unique to each. The union of two sets is everything in the two sets taken together, but counting elements only once that are common to both sets:

    union(setA,setB)
    [1] "a" "b" "c" "d" "e" "f" "g"

The intersection of two sets is the material that they have in common:

    intersect(setA,setB)
    [1] "d" "e"

Note, however, that the difference between two sets is order-dependent. It is the material that is in the first named set, that is not in the second named set. Thus setdiff(A,B) gives a different answer than setdiff(B,A). For our example:

    setdiff(setA,setB)
    [1] "a" "b" "c"
    setdiff(setB,setA)
    [1] "f" "g"

Thus, it should be the case that setdiff(setA,setB) plus intersect(setA,setB) plus setdiff(setB,setA) is the same as the union of the two sets. Let us check:

    all(c(setdiff(setA,setB),intersect(setA,setB),setdiff(setB,setA))==
      union(setA,setB))
    [1] TRUE

There is also a built-in function setequal for testing if two sets are equal:

    setequal(c(setdiff(setA,setB),intersect(setA,setB),setdiff(setB,setA)),
      union(setA,setB))
    [1] TRUE
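The two checks above are not interchangeable, and a small sketch makes the difference visible. The comparison with == works position by position, so it succeeds only when the elements happen to arrive in the same order; setequal ignores order. The object names parts, u.ba, elementwise and as.sets are introduced here purely for illustration.

```r
setA <- c("a", "b", "c", "d", "e")
setB <- c("d", "e", "f", "g")

# The three disjoint pieces, joined end to end
parts <- c(setdiff(setA, setB), intersect(setA, setB), setdiff(setB, setA))

# Same elements as union(setA, setB), but setB's members now come first
u.ba <- union(setB, setA)

elementwise <- all(parts == u.ba)     # position-by-position comparison
as.sets     <- setequal(parts, u.ba)  # comparison as sets, ignoring order
```

Here elementwise is FALSE while as.sets is TRUE, which suggests that setequal is the safer test whenever the order of the elements is not under your control.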

You can use %in% for comparing sets. The result is a logical vector whose length matches the vector on the left:

    setA %in% setB
    [1] FALSE FALSE FALSE TRUE TRUE
    setB %in% setA
    [1] TRUE TRUE FALSE FALSE

Using these vectors of logical values as subscripts, we can demonstrate, for instance, that setA[setA %in% setB] is the same as intersect(setA,setB):

    setA[setA %in% setB]
    [1] "d" "e"
    intersect(setA,setB)
    [1] "d" "e"

2.8 Matrices and arrays

An array is a multi-dimensional object. The dimensions of an array are specified by its dim attribute, which gives the maximal indices in each dimension. So for a three-dimensional array consisting of 24 numbers in a sequence 1:24, with dimensions 2 × 4 × 3, we write:

    y <- 1:24
    dim(y) <- c(2,4,3)
    y
    , , 1
         [,1] [,2] [,3] [,4]
    [1,]    1    3    5    7
    [2,]    2    4    6    8
    , , 2
         [,1] [,2] [,3] [,4]
    [1,]    9   11   13   15
    [2,]   10   12   14   16
    , , 3
         [,1] [,2] [,3] [,4]
    [1,]   17   19   21   23
    [2,]   18   20   22   24

This produces three two-dimensional tables, because the third dimension is 3. This is what happens when you change the dimensions:

    dim(y) <- c(3,2,4)
    y

    , , 1
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    , , 2
         [,1] [,2]
    [1,]    7   10
    [2,]    8   11
    [3,]    9   12
    , , 3
         [,1] [,2]
    [1,]   13   16
    [2,]   14   17
    [3,]   15   18
    , , 4
         [,1] [,2]
    [1,]   19   22
    [2,]   20   23
    [3,]   21   24

Now we have four two-dimensional tables, each of three rows and two columns. Keep looking at these two examples until you are sure that you understand exactly what has happened here.

A matrix is a two-dimensional array containing numbers. A dataframe is a two-dimensional list containing (potentially a mix of) numbers, text or logical variables in different columns. When there are two subscripts [5,3] to an object like a matrix or a dataframe, the first subscript refers to the row number (5 in this example; the rows are defined as margin number 1) and the second subscript refers to the column number (3 in this example; the columns are margin number 2). There is an important and powerful convention in R, such that when a subscript appears as a blank it is understood to mean ‘all of’. Thus:

• [,4] means all rows in column 4 of an object;
• [2,] means all columns in row 2 of an object.

2.8.1 Matrices

There are several ways of making a matrix. You can create one directly like this:

    X <- matrix(c(1,0,0,0,1,0,0,0,1),nrow=3)
    X
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    1    0
    [3,]    0    0    1

where, by default, the numbers are entered column-wise. The class and attributes of X indicate that it is a matrix of three rows and three columns (these are its dim attributes):

    class(X)
    [1] "matrix"
    attributes(X)
    $dim
    [1] 3 3

(From R 4.0.0 onwards, class(X) reports both "matrix" and "array".)

In the next example, the data in the vector appear row-wise, so we indicate this with byrow=T:

    vector <- c(1,2,3,4,4,3,2,1)
    V <- matrix(vector,byrow=T,nrow=2)
    V
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    4
    [2,]    4    3    2    1

Another way to convert a vector into a matrix is by providing the vector object with two dimensions (rows and columns) using the dim function like this:

    dim(vector) <- c(4,2)

We can check that vector has now become a matrix:

    is.matrix(vector)
    [1] TRUE

We need to be careful, however, because we have made no allowance at this stage for the fact that the data were entered row-wise into vector:

    vector
         [,1] [,2]
    [1,]    1    4
    [2,]    2    3
    [3,]    3    2
    [4,]    4    1

The matrix we want is the transpose, t, of this matrix:

    (vector <- t(vector))
         [,1] [,2] [,3] [,4]
    [1,]    1    2    3    4
    [2,]    4    3    2    1

2.8.2 Naming the rows and columns of matrices

At first, matrices have numbers naming their rows and columns (see above). Here is a 4 × 5 matrix of random integers from a Poisson distribution with mean 1.5:

    X <- matrix(rpois(20,1.5),nrow=4)
    X
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    0    2    5    3
    [2,]    1    1    3    1    3
    [3,]    3    1    0    2    2
    [4,]    1    0    2    1    0

Suppose that the rows refer to four different trials and we want to label the rows ‘Trial.1’ etc. We employ the function rownames to do this. We could use the paste function (see p. 87) but here we take advantage of the prefix option:

    rownames(X) <- rownames(X,do.NULL=FALSE,prefix="Trial.")
    X
            [,1] [,2] [,3] [,4] [,5]
    Trial.1    1    0    2    5    3
    Trial.2    1    1    3    1    3
    Trial.3    3    1    0    2    2
    Trial.4    1    0    2    1    0

For the columns we want to supply a vector of different names for the five drugs involved in the trial, and use this to specify the colnames(X):

    drug.names <- c("aspirin", "paracetamol", "nurofen", "hedex", "placebo")
    colnames(X) <- drug.names
    X
            aspirin paracetamol nurofen hedex placebo
    Trial.1       1           0       2     5       3
    Trial.2       1           1       3     1       3
    Trial.3       3           1       0     2       2
    Trial.4       1           0       2     1       0

Alternatively, you can use the dimnames function to give names to the rows and/or columns of a matrix. In this example we want the rows to be unlabelled (NULL) and the column names to be of the form ‘drug.1’, ‘drug.2’, etc. The argument to dimnames has to be a list (rows first, columns second, as usual) with the elements of the list of exactly the correct lengths (4 and 5 in this particular case):

    dimnames(X) <- list(NULL,paste("drug.",1:5,sep=""))
    X
         drug.1 drug.2 drug.3 drug.4 drug.5
    [1,]      1      0      2      5      3
    [2,]      1      1      3      1      3
    [3,]      3      1      0      2      2
    [4,]      1      0      2      1      0

2.8.3 Calculations on rows or columns of the matrix

We could use subscripts to select parts of the matrix, with a blank meaning ‘all of the rows’ or ‘all of the columns’. Here is the mean of the rightmost column (number 5), calculated over all the rows (blank

then comma),

    mean(X[,5])
    [1] 2

or the variance of the bottom row, calculated over all of the columns (a blank in the second position),

    var(X[4,])
    [1] 0.7

There are some special functions for calculating summary statistics on matrices:

    rowSums(X)
    [1] 11 9 8 4
    colSums(X)
    [1] 6 2 7 9 8
    rowMeans(X)
    [1] 2.2 1.8 1.6 0.8
    colMeans(X)
    [1] 1.50 0.50 1.75 2.25 2.00

These functions are built for speed, and blur some of the subtleties of dealing with NA or NaN. If such subtlety is an issue, then use apply instead (p. 61). Remember that columns are margin number 2 and rows are margin number 1:

    apply(X,2,mean)
    [1] 1.50 0.50 1.75 2.25 2.00

You might want to sum groups of rows within columns, and rowsum (singular and all lower case, in contrast to rowSums, above) is a very efficient function for this. In this example, we want to group together row 1 and row 4 (as group A) and row 2 and row 3 (group B). Note that the grouping vector has to have length equal to the number of rows:

    group=c("A","B","B","A")
    rowsum(X, group)
      [,1] [,2] [,3] [,4] [,5]
    A    2    0    4    6    3
    B    4    2    3    3    5

You could achieve the same ends (but more slowly) with tapply or aggregate:

    tapply(X, list(group[row(X)], col(X)), sum)
      1 2 3 4 5
    A 2 0 4 6 3
    B 4 2 3 3 5

Note the use of row(X) and col(X), with row(X) used as a subscript on group.

    aggregate(X,list(group),sum)
      Group.1 V1 V2 V3 V4 V5
    1       A  2  0  4  6  3
    2       B  4  2  3  3  5

Suppose that we want to shuffle the elements of each column of a matrix independently. We apply the function sample to each column (margin number 2) like this:

    apply(X,2,sample)
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    1    2    1    3
    [2,]    3    1    0    1    3
    [3,]    1    0    3    2    0
    [4,]    1    0    2    5    2
    apply(X,2,sample)
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    1    0    5    2
    [2,]    1    1    2    1    3
    [3,]    3    0    2    2    3
    [4,]    1    0    3    1    0

and so on, for as many shuffled samples as you need.

2.8.4 Adding rows and columns to the matrix

In this particular case we have been asked to add a row at the bottom showing the column means, and a column at the right showing the row variances:

    X <- rbind(X,apply(X,2,mean))
    X <- cbind(X,apply(X,1,var))
    X
         [,1] [,2] [,3] [,4] [,5]    [,6]
    [1,]  1.0  0.0 2.00 5.00    3 3.70000
    [2,]  1.0  1.0 3.00 1.00    3 1.20000
    [3,]  3.0  1.0 0.00 2.00    2 1.30000
    [4,]  1.0  0.0 2.00 1.00    0 0.70000
    [5,]  1.5  0.5 1.75 2.25    2 0.45625

Note that the number of decimal places varies across columns, with one in columns 1 and 2, two in columns 3 and 4, none in column 5 (integers) and five in column 6. The default in R is to print the minimum number of decimal places consistent with the contents of the column as a whole.

Next, we need to label the sixth column as ‘variance’ and the fifth row as ‘mean’:

    colnames(X) <- c(1:5,"variance")
    rownames(X) <- c(1:4,"mean")
    X
           1   2    3    4 5 variance
    1    1.0 0.0 2.00 5.00 3  3.70000
    2    1.0 1.0 3.00 1.00 3  1.20000

    3    3.0 1.0 0.00 2.00 2  1.30000
    4    1.0 0.0 2.00 1.00 0  0.70000
    mean 1.5 0.5 1.75 2.25 2  0.45625

When a matrix with a single row or column is created by a subscripting operation, for example row <- mat[2,], it is by default turned into a vector. In a similar way, if an array with dimension, say, 2 × 3 × 1 × 4 is created by subscripting it will be coerced into a 2 × 3 × 4 array, losing the unnecessary dimension. After much discussion this has been determined to be a feature of R. To prevent this happening, add the option drop = FALSE to the subscripting. For example:

    rowmatrix <- mat[2, , drop = FALSE]
    colmatrix <- mat[, 2, drop = FALSE]
    a <- b[1, 1, 1, drop = FALSE]

The drop = FALSE option should be used defensively when programming. For example, the statement

    somerows <- mat[index,]

will return a vector rather than a matrix if index happens to have length 1, and this might cause errors later in the code. It should be written as:

    somerows <- mat[index , , drop = FALSE]

2.8.5 The sweep function

The sweep function is used to ‘sweep out’ array summaries from vectors, matrices, arrays or dataframes. In this example we want to express a matrix in terms of the departures of each value from its column mean.

    matdata <- read.table("c:\\temp\\sweepdata.txt")

First, you need to create a vector containing the parameters that you intend to sweep out of the matrix. In this case we want to compute the four column means:

    (cols <- apply(matdata,2,mean))
        V1     V2     V3     V4
      4.60  13.30   0.44 151.60

Now it is straightforward to express all of the data in matdata as departures from the relevant column means:

    sweep(matdata,2,cols)
         V1   V2    V3    V4
    1  -1.6 -1.3 -0.04 -26.6
    2   0.4 -1.3  0.26  14.4
    3   2.4  1.7  0.36  22.4
    4   2.4  0.7  0.26 -23.6
    5   0.4  4.7 -0.14 -15.6
    6   4.4 -0.3 -0.24   3.4
    7   2.4  1.7  0.06 -36.6
    8  -2.6 -0.3  0.06  17.4
    9  -3.6 -3.3 -0.34  30.4
    10 -4.6 -2.3 -0.24  14.4
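Since sweepdata.txt is not reproduced here, the following sketch repeats the idea on a small invented matrix m (all object names and values are illustrative assumptions, not the book's data). It sweeps out the column means, checks that the centred columns have mean zero, compares the result with scale, and also shows the effect of drop = FALSE described earlier on this page.

```r
# Invented 3 x 2 matrix, filled column-wise: columns are (1,2,3) and (10,20,30)
m <- matrix(c(1, 2, 3,
              10, 20, 30), nrow = 3)

cols <- colMeans(m)             # the summaries to sweep out
centred <- sweep(m, 2, cols)    # the default operation in sweep is subtraction
col.check <- colMeans(centred)  # each centred column should average zero

# scale() with scale = FALSE performs the same centring
same <- all.equal(centred, scale(m, center = TRUE, scale = FALSE),
                  check.attributes = FALSE)

# Subscripting a single row: with and without drop = FALSE
row.vec <- m[2, ]                # dimensions dropped, a plain vector
row.mat <- m[2, , drop = FALSE]  # still a matrix, with 1 row and 2 columns
```

The comparison uses check.attributes = FALSE because scale attaches the column means to its result as an extra attribute, which sweep does not.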

Note the use of margin = 2 as the second argument to indicate that we want the sweep to be carried out on the columns (rather than on the rows). A related function, scale, is used for centring and scaling data in terms of standard deviations (p. 254).

You can see what sweep has done by doing the calculation long-hand. The operation of this particular sweep is simply one of subtraction. The only issue is that the subtracted object has to have the same dimensions as the matrix to be swept (in this example, 10 rows of 4 columns). Thus, to sweep out the column means, the object to be subtracted from matdata must have each column mean repeated in each of the 10 rows of 4 columns:

    (col.means <- matrix(rep(cols,rep(10,4)),nrow=10))
          [,1] [,2] [,3]  [,4]
     [1,]  4.6 13.3 0.44 151.6
     [2,]  4.6 13.3 0.44 151.6
     [3,]  4.6 13.3 0.44 151.6
     [4,]  4.6 13.3 0.44 151.6
     [5,]  4.6 13.3 0.44 151.6
     [6,]  4.6 13.3 0.44 151.6
     [7,]  4.6 13.3 0.44 151.6
     [8,]  4.6 13.3 0.44 151.6
     [9,]  4.6 13.3 0.44 151.6
    [10,]  4.6 13.3 0.44 151.6

Then the same result as we got from sweep is obtained simply by

    matdata-col.means

Suppose that you want to obtain the subscripts for a column-wise or a row-wise sweep of the data. Here are the row subscripts repeated in each column:

    apply(matdata,2,function (x) 1:10)
          V1 V2 V3 V4
     [1,]  1  1  1  1
     [2,]  2  2  2  2
     [3,]  3  3  3  3
     [4,]  4  4  4  4
     [5,]  5  5  5  5
     [6,]  6  6  6  6
     [7,]  7  7  7  7
     [8,]  8  8  8  8
     [9,]  9  9  9  9
    [10,] 10 10 10 10

Here are the column subscripts repeated in each row:

    t(apply(matdata,1,function (x) 1:4))
       [,1] [,2] [,3] [,4]
    1     1    2    3    4
    2     1    2    3    4
    3     1    2    3    4
    4     1    2    3    4
    5     1    2    3    4

    6     1    2    3    4
    7     1    2    3    4
    8     1    2    3    4
    9     1    2    3    4
    10    1    2    3    4

Here is the same procedure using sweep:

    sweep(matdata,1,1:10,function(a,b) b)
          [,1] [,2] [,3] [,4]
     [1,]    1    1    1    1
     [2,]    2    2    2    2
     [3,]    3    3    3    3
     [4,]    4    4    4    4
     [5,]    5    5    5    5
     [6,]    6    6    6    6
     [7,]    7    7    7    7
     [8,]    8    8    8    8
     [9,]    9    9    9    9
    [10,]   10   10   10   10
    sweep(matdata,2,1:4,function(a,b) b)
          [,1] [,2] [,3] [,4]
     [1,]    1    2    3    4
     [2,]    1    2    3    4
     [3,]    1    2    3    4
     [4,]    1    2    3    4
     [5,]    1    2    3    4
     [6,]    1    2    3    4
     [7,]    1    2    3    4
     [8,]    1    2    3    4
     [9,]    1    2    3    4
    [10,]    1    2    3    4

2.8.6 Applying functions with apply, sapply and lapply

The apply function is used for applying functions to the rows or columns of matrices or dataframes. For example, here is a matrix with four rows and six columns:

    (X <- matrix(1:24,nrow=4))
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    5    9   13   17   21
    [2,]    2    6   10   14   18   22
    [3,]    3    7   11   15   19   23
    [4,]    4    8   12   16   20   24

Note that placing the expression to be evaluated in parentheses (as above) causes the value of the result to be printed on the screen. Often you want to apply a function across one of the margins of a matrix. Margin 1

refers to the rows and margin 2 to the columns. Here are the row totals (four of them):

    apply(X,1,sum)
    [1] 66 72 78 84

and here are the column totals (six of them):

    apply(X,2,sum)
    [1] 10 26 42 58 74 90

Note that in both cases, the answer produced by apply is a vector rather than a matrix. You can apply functions to the individual elements of the matrix rather than to the margins. The margin you specify influences only the shape of the resulting matrix.

    apply(X,1,sqrt)
             [,1]     [,2]     [,3]     [,4]
    [1,] 1.000000 1.414214 1.732051 2.000000
    [2,] 2.236068 2.449490 2.645751 2.828427
    [3,] 3.000000 3.162278 3.316625 3.464102
    [4,] 3.605551 3.741657 3.872983 4.000000
    [5,] 4.123106 4.242641 4.358899 4.472136
    [6,] 4.582576 4.690416 4.795832 4.898979
    apply(X,2,sqrt)
             [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
    [1,] 1.000000 2.236068 3.000000 3.605551 4.123106 4.582576
    [2,] 1.414214 2.449490 3.162278 3.741657 4.242641 4.690416
    [3,] 1.732051 2.645751 3.316625 3.872983 4.358899 4.795832
    [4,] 2.000000 2.828427 3.464102 4.000000 4.472136 4.898979

Here are the shuffled numbers from each of the rows, using sample without replacement:

    apply(X,1,sample)
         [,1] [,2] [,3] [,4]
    [1,]    5   14   19    8
    [2,]   21   10    7   16
    [3,]   17   18   15   24
    [4,]    1   22   23    4
    [5,]    9    2    3   12
    [6,]   13    6   11   20

Note that the resulting matrix has six rows and four columns (i.e. it has been transposed).

You can supply your own function definition (here x^2 + x) within apply like this:

    apply(X,1,function(x) x^2+x)
         [,1] [,2] [,3] [,4]
    [1,]    2    6   12   20
    [2,]   30   42   56   72
    [3,]   90  110  132  156
    [4,]  182  210  240  272

    [5,]  306  342  380  420
    [6,]  462  506  552  600

This is an anonymous function because the function is not named.

If you want to apply a function to a vector (rather than to the margin of a matrix) then use sapply. Here is the code to generate a list of sequences from 1:3 up to 1:7 (see p. 30):

    sapply(3:7, seq)
    [[1]]
    [1] 1 2 3
    [[2]]
    [1] 1 2 3 4
    [[3]]
    [1] 1 2 3 4 5
    [[4]]
    [1] 1 2 3 4 5 6
    [[5]]
    [1] 1 2 3 4 5 6 7

The sapply function is most useful with complicated iterative calculations. The following data show decay of radioactive emissions over a 50-day period, and we intend to use non-linear least squares (see p. 715) to estimate the decay rate a in y = exp(−ax):

    sapdecay <- read.table("c:\\temp\\sapdecay.txt",header=T)
    attach(sapdecay)
    names(sapdecay)
    [1] "x" "y"

We need to write a function to calculate the sum of the squares of the differences between the observed (y) and predicted (yf) values of y, when provided with a specific value of the parameter a:

    sumsq <- function(a,xv=x,yv=y) {
      yf <- exp(-a*xv)
      sum((yv-yf)^2)
    }

We can get a rough idea of the decay constant, a, for these data by linear regression of log(y) against x, like this:

    lm(log(y)~x)
    Coefficients:
    (Intercept)        x
        0.04688 -0.05849

So our parameter a is somewhere close to 0.058. We generate a range of values for a spanning an interval on either side of 0.058:

    a <- seq(0.01,0.2,.005)

Now we can use sapply to apply the sum of squares function for each of these values of a (without writing a loop), and plot the deviance against the parameter value for a:

    plot(a,sapply(a,sumsq),type="l")

[Figure: sapply(a, sumsq) plotted as a line against a, for a between 0.01 and 0.20]

This shows that the least-squares estimate of a is indeed close to 0.06 (this is the value of a associated with the minimum deviance). To extract the value of a at which the sum of squares is smallest we use min with subscripts (square brackets):

    a[min(sapply(a,sumsq))==sapply(a,sumsq)]
    [1] 0.055

Finally, we could use this value of a to generate a smooth exponential function to fit through our scatter of data points:

    plot(x,y)
    xv <- seq(0,50,0.1)
    lines(xv,exp(-0.055*xv))

[Figure: scatterplot of y against x, for x from 0 to 50, with the fitted curve exp(-0.055x)]
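The grid search can be tried without sapdecay.txt. The sketch below is an illustration under stated assumptions: x and y are invented, noise-free values generated from a decay constant of 0.055, and ss and best are names introduced here. It repeats the sapply step and then uses which.min, a base R function that returns the subscript of the smallest element, as a more direct alternative to comparing every element with min.

```r
# Invented, noise-free decay data (assumption: true decay constant 0.055)
x <- 0:50
y <- exp(-0.055 * x)

# Sum of squared differences between observed and predicted y for a given a
sumsq <- function(a, xv = x, yv = y) {
  yf <- exp(-a * xv)
  sum((yv - yf)^2)
}

a <- seq(0.01, 0.2, 0.005)   # trial values of the decay constant
ss <- sapply(a, sumsq)       # one sum of squares per trial value, no loop needed
best <- a[which.min(ss)]     # trial value with the smallest sum of squares
```

With real, noisy data the minimum would not fall exactly on a grid point, and the answer is only as fine as the spacing of the trial values in a.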

Here is the same procedure streamlined by using the optimize function. Write a function showing how the sum of squares depends on the value of the parameter a:

    fa <- function(a) sum((y-exp(-a*x))^2)

Now use optimize with a specified range of values for a, here c(0.01,0.1), to find the value of a that minimizes the sum of squares:

    optimize(fa,c(0.01,0.1))
    $minimum
    [1] 0.05538411
    $objective
    [1] 0.01473559

The value of a that minimizes the sum of squares is 0.05538 and the minimum value of the sum of squares is 0.0147.

What if we had chosen a different way of assessing the fit of the model to the data? Instead of minimizing the sum of the squares of the residuals, we might want to minimize the sum of the absolute values of the residuals. We need to write a new function to calculate this quantity:

    fb <- function(a) sum(abs(y-exp(-a*x)))

Then we use optimize as before:

    optimize(fb,c(0.01,0.1))
    $minimum
    [1] 0.05596058
    $objective
    [1] 0.3939221

The results differ only in the fourth digit after the decimal point, and you could not choose between the two methods from a plot of the model. Sums of squares are not the only way of doing statistics, just the conventional way.

2.8.7 Using the function max.col

The task is to work out the number of plots on which a species is dominant in the Park Grass dataframe. This involves scanning each row of a matrix and reporting on the column number that contains the maximum value.

    data <- read.table("c:\\temp\\pgfull.txt",header=T)
    attach(data)
    names(data)
    [1] "AC" "AE" "AM" "AO" "AP" "AR" "AS"
    [8] "AU" "BH" "BM" "CC" "CF" "CM" "CN"
    [15] "CX" "CY" "DC" "DG" "ER" "FM" "FP"
    [22] "FR" "GV" "HI" "HL" "HP" "HS" "HR"
    [29] "KA" "LA" "LC" "LH" "LM" "LO" "LP"
    [36] "OR" "PL" "PP" "PS" "PT" "QR" "RA"
    [43] "RB" "RC" "SG" "SM" "SO" "TF" "TG"
    [50] "TO" "TP" "TR" "VC" "VK" "plot" "lime"
    [57] "species" "hay" "pH"

The species names are represented by two-letter codes (so, for example, ‘AC’ is Agrostis capillaris). We define the dominant as the species that has the maximum biomass on a given plot. The first task is to create a dataframe that contains only the species abundances (we do not want the plot numbers, or the treatments, or the values of any covariates). For the Park Grass data, the first 54 columns contain species abundance values, so we select all of the rows in the first 54 columns like this:

    species <- data[,1:54]

Now we use the function max.col to go through all of the 89 rows, and for each row return the column number that contains the maximum biomass:

    max.col(species)
    [1] 22 22 22 1 32 32 22 1 22 22 22 1 22 22 1 1 22 22 22 4 2 2 51 2 1 1125142214222222422252522525
    [26] 1 22 22 2221151222722223551511221132
    [51] 32 1 22 2
    [76] 32 1 1111114121122

To get the identity of the dominant, we then extract the name of this column, using the index returned by max.col as a subscript to the object called names(species):

    names(species)[max.col(species)]

Finally, we use table to count up the total number of plots on which each species was dominant. The code looks like this:

    table(names(species)[max.col(species)])
    AC AE AO AP CN FR HL HS LH LP TP
    26 23  4  2  1 19  3  1  5  1  4

So AC was dominant on more plots than any other species, with AE in second place and FR in third. The total number of species that were dominant on one or more plots is given by determining the length of this table:

    length(table(names(species)[max.col(species)]))
    [1] 11

So the number of species that were present in the system, but never attained dominance was 54 − 11 = 43:

    length(names(species))-length(table(names(species)[max.col(species)]))
    [1] 43

There is no such function as ‘min.col’, but you can easily emulate it by using max.col with the negatives of your data.
It makes no sense to do it with this example, because several species are absent from every plot, and the function would just pick one of the absent species at random. But, anyway, max.col(-species) picks out the identity (the column number) of one of the zeros from each row of the dataframe. In a case where there was a unique minimum in each row, then this would find it.
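The random choice among tied values comes from the ties.method argument of max.col, whose default is "random"; the alternatives are "first" and "last". The sketch below uses a small invented abundance matrix (abund, with four hypothetical species codes as column names) rather than the Park Grass data, and sets ties.method = "first" so that the result is the same every time it is run.

```r
# Invented abundances: 3 plots (rows) by 4 species (columns)
abund <- matrix(c(5, 0, 2, 0,
                  1, 7, 7, 3,
                  0, 0, 4, 9), nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("AC", "AE", "FR", "LH")))

# Column number of the maximum in each row; ties go to the leftmost column
dominant <- max.col(abund, ties.method = "first")
dominant.names <- colnames(abund)[dominant]

# Emulating 'min.col': the maximum of the negated data is the minimum
rarest <- max.col(-abund, ties.method = "first")
```

In the second row AE and FR are tied at 7, so the default setting would pick one of them at random, whereas "first" always reports AE.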

2.8.8 Restructuring a multi-dimensional array using aperm

There are circumstances where you may want to reorder the dimensions of an array. Here is an example of an array with three dimensions: two sexes, three ages and four income groups. For simplicity and ease of illustration the values in the array are just the numbers 1 to 24 in order (2 × 3 × 4 = 24):

    data <- array(1:24, 2:4)

The second argument to the array function specifies the number of levels in dimensions 1, 2, and 3 using the sequence-generator 2:4 to produce the numbers 2, 3 and 4. This is what the array looks like:

    data
    , , 1
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6
    , , 2
         [,1] [,2] [,3]
    [1,]    7    9   11
    [2,]    8   10   12
    , , 3
         [,1] [,2] [,3]
    [1,]   13   15   17
    [2,]   14   16   18
    , , 4
         [,1] [,2] [,3]
    [1,]   19   21   23
    [2,]   20   22   24

There are four sub-tables, each with 2 rows and 3 columns. Now we give names to the factor levels in each of the three dimensions: these are called the dimnames attributes and are allocated as lists like this:

    dimnames(data)[[1]] <- list("male","female")
    dimnames(data)[[2]] <- list("young","mid","old")
    dimnames(data)[[3]] <- list("A","B","C","D")
    dimnames(data)
    [[1]]
    [1] "male" "female"
    [[2]]
    [1] "young" "mid" "old"
    [[3]]
    [1] "A" "B" "C" "D"

You can see the advantage of naming the dimensions by comparing the output of the array with (below) and without names (above):

    data
    , , A
           young mid old
    male       1   3   5
    female     2   4   6
    , , B
           young mid old
    male       7   9  11
    female     8  10  12
    , , C
           young mid old
    male      13  15  17
    female    14  16  18
    , , D
           young mid old
    male      19  21  23
    female    20  22  24

Suppose, however, that we want the four income groups (A–D) to be the columns in each of the sub-tables, and the separate sub-tables to represent the two genders. This is a job for aperm. We need to specify the order ‘age then income then gender’ in terms of the order of their dimensions (row, column, sub-table, namely 2 then 3 then 1) like this:

    new.data <- aperm(data,c(2,3,1))
    new.data
    , , male
          A  B  C  D
    young 1  7 13 19
    mid   3  9 15 21
    old   5 11 17 23
    , , female
          A  B  C  D
    young 2  8 14 20
    mid   4 10 16 22
    old   6 12 18 24

This will be tricky to see at first, but you should persevere, because aperm is a very useful function.
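One way to convince yourself of what aperm has done is to look up the same cell before and after the permutation. The sketch below rebuilds the array under the assumed names arr and arr.perm (chosen here to avoid reusing data), supplying the dimnames in the call to array, which is an alternative to assigning them one dimension at a time.

```r
# The same 2 x 3 x 4 array, with the dimnames supplied when it is created
arr <- array(1:24, 2:4,
             dimnames = list(c("male", "female"),
                             c("young", "mid", "old"),
                             c("A", "B", "C", "D")))

# New dimension 1 is old dimension 2 (age), new 2 is old 3 (income),
# new 3 is old 1 (gender)
arr.perm <- aperm(arr, c(2, 3, 1))

new.dims <- dim(arr.perm)                   # now age x income x gender
before <- arr["female", "old", "C"]         # gender, age, income
after  <- arr.perm["old", "C", "female"]    # age, income, gender
```

Both before and after pick out the value 18: the numbers are unchanged, and only the order in which the subscripts are written has been permuted.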

91

2.9 Random numbers, sampling and shuffling

When debugging a program it is often useful to be able to get the same string of random numbers as you had last time. Use the set.seed function to control this:

    set.seed(375)
    runif(3)
    [1] 0.9613669 0.6918535 0.7302684
    runif(3)
    [1] 0.9228566 0.1603804 0.9642799
    runif(3)
    [1] 0.52880907 0.08660864 0.29075809

If you reset the seed with the same value, you get the same random numbers as last time:

    set.seed(375)
    runif(3)
    [1] 0.9613669 0.6918535 0.7302684

You might want to restart the series from an intermediate point rather than from the beginning. To do this, save the current state of the generator, which R holds in .Random.seed, like this:

    current<-.Random.seed
    runif(3)
    [1] 0.9228566 0.1603804 0.9642799
    runif(3)
    [1] 0.52880907 0.08660864 0.29075809
    runif(3)
    [1] 0.02590182 0.85520652 0.31350305

Resetting .Random.seed recreates the same series of random numbers from that point:

    .Random.seed<-current
    runif(3)
    [1] 0.9228566 0.1603804 0.9642799

Randomization is central to a great many scientific and statistical procedures. Generating random numbers from a variety of probability distributions is explained in Chapter 7 (p. 272). Here we are concerned with randomizing (shuffling or sampling from) the elements of a vector, as we might use when planning a designed experiment (e.g. allocating treatments to individuals). There are two ways of sampling:

• sampling without replacement (where all of the values in the vector appear in the output, but in a randomized sequence; i.e. the values have been shuffled);

• sampling with replacement (where some values may be omitted, and other values may appear more than once in the output).
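The two kinds of sampling listed above can be sketched with the sample function, which is described in detail in the next section. The vector `v` and the seed value are illustrative choices. The checks rely only on properties that hold for any seed.

```r
set.seed(375)                 # any fixed seed makes the draws repeatable
v <- 1:10                     # illustrative vector, not from the text

shuffled  <- sample(v)                  # without replacement: a permutation of v
resampled <- sample(v, replace = TRUE)  # with replacement: repeats are possible

# A shuffle always contains exactly the original values.
identical(sort(shuffled), v)

# A resample has the same length, and every value is drawn from v.
length(resampled) == length(v)
all(resampled %in% v)

# Resetting the seed reproduces the shuffle exactly.
set.seed(375)
identical(sample(v), shuffled)
```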

92

2.9.1 The sample function

The default sample function shuffles the contents of a vector into a random sequence while maintaining all the numerical values intact. It is extremely useful for randomization in experimental design, in simulation and in computationally intensive hypothesis testing. The vector y looks like this:

    y <- c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11)

Here are two different shufflings of y:

    sample(y)
     [1]  8  8  9  9  2 10  6  7  3 11  5  4
    [13]  6  3  4
    sample(y)
     [1]  9  3  9  8  8  6  5 11  4  6  4  7
    [13]  3  2 10

The order of the values is different each time sample is invoked, but the same numbers are shuffled in every case, and all the numbers in the original vector appear once in the output (so if there are two 9s in the original data, there will be two 9s in the shuffled vector). This is called sampling without replacement. You can specify the size of the sample you want as an optional second argument. Suppose we want five random elements from y, in any one sample:

    sample(y,5)
    [1]  9  4 10  8 11
    sample(y,5)
    [1] 9 3 4 2 8

The option replace=T allows for sampling with replacement, which is the basis of bootstrapping (see p. 570). The vector produced by the sample function with replace=T is the same length as the vector sampled, but some values are left out at random and other values, again at random, appear two or more times. In this sample, 10 has been left out, and there are now three 9s:

    sample(y,replace=T)
     [1]  9  6 11  2  9  4  6  8  8  4  4  4
    [13]  3  9  3

In this next case, there are two 10s and only one 9:

    sample(y,replace=T)
     [1]  3  7 10  6  8  2  5 11  4  6  3  9
    [13] 10  7  4

More advanced options in sample include specifying different probabilities with which each element is to be sampled (prob=). For example, if we want to take four numbers at random from the sequence 1:10 without replacement where the probability of selection (p) is 5 times greater for the middle numbers (5 and 6) than for the first or last numbers, and we want to do this five times, we could write:

    p <- c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1)
    x <- 1:10

93

    sapply(1:5,function(i) sample(x,4,prob=p))
         [,1] [,2] [,3] [,4] [,5]
    [1,]    8    7    4   10    8
    [2,]    7    5    7    8    7
    [3,]    4    4    3    4    5
    [4,]    9   10    8    7    6

Thus, the four random numbers in the first trial were 8, 7, 4 and 9 (i.e. column 1). To learn more about sapply, see p. 63.

2.10 Loops and repeats

The classic, Fortran-like loop is available in R. The syntax is a little different, but the idea is identical; you request that an index, i, takes on a sequence of values, and that one or more lines of commands are executed as many times as there are different values of i. Here is a loop executed five times with the values of i from 1 to 5; we print the square of each value:

    for (i in 1:5) print(i^2)
    [1] 1
    [1] 4
    [1] 9
    [1] 16
    [1] 25

For multiple lines of code, you use curly brackets {} to enclose material over which the loop is to work. Note that the ‘hard return’ (the Enter key) at the end of each command line is an essential part of the structure (you can replace the hard returns by semicolons if you like, but clarity is improved if you put each command on a separate line):

    j<-k<-0
    for (i in 1:5) {
        j <- j+1
        k <- k+i*j
        print(i+j+k)
    }
    [1] 3
    [1] 9
    [1] 20
    [1] 38
    [1] 65

Here we use a for loop to write a function to calculate factorial x (written x!) which is

$x! = x \times (x-1) \times (x-2) \times (x-3) \cdots \times 2 \times 1$

So $4! = 4 \times 3 \times 2 = 24$. Here is the function:

    fac1 <- function(x) {
        f <- 1

94

        if (x<2) return (1)
        for (i in 2:x) {
            f <- f*i
        }
        f
    }

That seems rather complicated for such a simple task, but we can try it out for the numbers 0 to 5:

    sapply(0:5,fac1)
    [1]   1   1   2   6  24 120

There are two other looping functions in R: repeat and while. We demonstrate their use for the purpose of illustration, but we can do much better in terms of writing a compact function for finding factorials (see below). First, the while function:

    fac2 <- function(x) {
        f <- 1
        t <- x
        while(t>1) {
            f <- f*t
            t <- t-1
        }
        return(f)
    }

The key point is that if you want to use while, you need to set up an indicator variable (t in this case) and change its value within each iteration (t <- t-1). We test the function on the numbers 0 to 5:

    sapply(0:5,fac2)
    [1]   1   1   2   6  24 120

Finally, we demonstrate the use of the repeat function:

    fac3 <- function(x) {
        f <- 1
        t <- x
        repeat {
            if (t<2) break
            f <- f*t
            t <- t-1
        }
        return(f)
    }

Because the repeat function contains no explicit limit, you need to be careful not to program an infinite loop. You must include a logical escape clause that leads to a break command:

    sapply(0:5,fac3)
    [1]   1   1   2   6  24 120

It is almost always better to use a built-in function that operates on the entire vector and hence removes the need for loops or repeats of any sort. In this case, we can make use of the cumulative product function, cumprod. Here it is in action:

    cumprod(1:5)
    [1]   1   2   6  24 120

95

This is already pretty close to what we need for our factorial function. It does not work for 0! of course, because the whole vector would end up full of zeros if the first element in the vector was zero (try 0:5 and see). The factorial of x > 0 is the maximum value from the vector produced by cumprod:

    fac4 <- function(x) max(cumprod(1:x))

This definition has the desirable side effect that it also gets 0! correct: when x is 0 the sequence 1:0 is the vector 1, 0, whose cumulative product is 1, 0, so the function finds the maximum of 1 and 0, which is 1.

    max(cumprod(1:0))
    [1] 1
    sapply(0:5,fac4)
    [1]   1   1   2   6  24 120

Alternatively, you could adapt an existing built-in function to do the job. x! is the same as $\Gamma(x+1)$, so

    fac5 <- function(x) gamma(x+1)
    sapply(0:5,fac5)
    [1]   1   1   2   6  24 120

Until quite recently there was no built-in factorial function in R, but now there is:

    sapply(0:5,factorial)
    [1]   1   1   2   6  24 120

2.10.1 Creating the binary representation of a number

Here is a function that uses the while function in converting a specified number to its binary representation. The trick is that the smallest digit (0 for even or 1 for odd numbers) is always at the right-hand side of the answer (in location 32 in this case). At each step the remainder x %% 2 gives the next digit, and the integer quotient x %/% 2 is carried forward:

    binary <- function(x) {
        i <- 0
        string <- numeric(32)
        while(x>0) {
            string[32-i] <- x %% 2
            x <- x %/% 2
            i <- i+1
        }
        first <- match(1,string)
        string[first:32]
    }

The leading zeros (1 to first - 1) within the string are not printed. We run the function to find the binary representation of the numbers 15 to 17:

    sapply(15:17,binary)
    [[1]]
    [1] 1 1 1 1

    [[2]]
    [1] 1 0 0 0 0

96

    [[3]]
    [1] 1 0 0 0 1

The next function uses while to generate the Fibonacci series 1, 1, 2, 3, 5, 8, . . . in which each term is the sum of its two predecessors. The key point about while loops is that the logical variable controlling their operation is altered inside the loop. In this example, we alter n, the number whose Fibonacci number we want, starting at n, reducing the value of n by 1 each time around the loop, and ending when n gets down to 0. Here is the code:

    fibonacci <- function(n) {
        a <- 1
        b <- 0
        while(n>0) {
            swap <- a
            a <- a+b
            b <- swap
            n <- n-1
        }
        b
    }

An important general point about computing involves the use of the swap variable above. When we replace a by a+b on line 6 we lose the original value of a. If we had not stored this value in swap, we could not set the new value of b to the old value of a on line 7. Now test the function by generating the Fibonacci numbers 1 to 10:

    sapply(1:10,fibonacci)
    [1]  1  1  2  3  5  8 13 21 34 55

2.10.2 Loop avoidance

It is good R programming practice to avoid using loops wherever possible. The use of vector functions (p. 41) makes this particularly straightforward in many cases. Suppose that you wanted to replace all of the negative values in an array by zeros. In the old days, you might have written something like this:

    for (i in 1:length(y)) {
        if(y[i] < 0) y[i] <- 0
    }

Now, however, you would use logical subscripts (p. 39) like this:

    y[y<0] <- 0

The ifelse function

Sometimes you want to do one thing if a condition is true and a different thing if the condition is false (rather than do nothing, as in the last example). The ifelse function allows you to do this for entire vectors without using for loops.
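To make the equivalence of these approaches concrete, the following sketch applies the loop, the logical subscript and ifelse to the same small vector and checks that they agree. The values in `y` and the names `y.loop`, `y.vec` and `y.ifelse` are illustrative and do not come from the text.

```r
y <- c(-3, 0, 2, -1, 5)       # illustrative values, not from the text

# Loop version: visit each element in turn.
y.loop <- y
for (i in seq_along(y.loop)) {
    if (y.loop[i] < 0) y.loop[i] <- 0
}

# Logical-subscript version: one vectorized assignment.
y.vec <- y
y.vec[y.vec < 0] <- 0

# ifelse version: zero where the condition holds, the original value otherwise.
y.ifelse <- ifelse(y < 0, 0, y)

identical(y.loop, y.vec)
identical(y.loop, y.ifelse)
```

The sketch uses seq_along(y) instead of 1:length(y) because it behaves sensibly when the vector is empty.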
We might want to replace any negative values of y by -1 and any positive values and zero by +1:

    z <- ifelse (y < 0, -1, 1)

Next we use ifelse to convert the continuous variable called Area into a new, two-level categorical variable with values big and small defined by the median Area of the fields:

    data <- read.table("c:\\temp\\worms.txt",header=T)
    attach(data)
    ifelse(Area>median(Area),"big","small")

97

     [1] "big"   "big"   "small" "small" "big"   "big"   "big"   "small"
     [9] "small" "small" "small" "big"   "big"   "small" "big"   "big"
    [17] "small" "big"   "small" "small"

You should use the much more powerful function called cut when you want to convert a continuous variable like Area into many levels (p. 838).

Another use of ifelse is to override R’s natural inclinations. The log of zero in R is -Inf, as you see in these 20 random numbers from a Poisson process with a mean count of 1.5:

    y <- log(rpois(20,1.5))
    y
     [1]      -Inf 0.6931472      -Inf 0.0000000      -Inf 0.0000000
     [7] 0.0000000      -Inf 0.6931472 1.6094379 1.3862944 0.0000000
    [13] 1.3862944      -Inf 0.0000000 0.0000000 0.6931472 0.6931472
    [19] 0.0000000      -Inf

However, if we want the log of zero to be represented by NA in our particular application we can write:

    ifelse(y<0,NA,y)
     [1]        NA 0.6931472        NA 0.0000000        NA 0.0000000
     [7] 0.0000000        NA 0.6931472 1.6094379 1.3862944 0.0000000
    [13] 1.3862944        NA 0.0000000 0.0000000 0.6931472 0.6931472
    [19] 0.0000000        NA

2.10.3 The slowness of loops

To see how slow loops can be, we compare two ways of finding the maximum number in a vector of 10 million random numbers from a uniform distribution:

    x <- runif(10000000)

First, using the vector function max:

    system.time(max(x))
       user  system elapsed
       0.03    0.00    0.03

As you see, this operation took just 0.03 seconds to solve using the vector function max to look at the 10 million numbers in x. Using a loop, however, took more than 9 seconds:

    pc <- proc.time()
    cmax <- x[1]
    for (i in 2:10000000) {
        if(x[i]>cmax) cmax <- x[i]
    }
    proc.time()-pc
       user  system elapsed
       9.39    0.13    9.51
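The comparison above can be repeated on a smaller vector to confirm that the loop and the built-in function return the same answer. This is a sketch. The function name `loop.max`, the seed and the vector size are illustrative assumptions, and timings depend on the machine and the version of R, so none are quoted.

```r
set.seed(1)                   # illustrative seed
x <- runif(100000)            # smaller than the 10 million used in the text

# Explicit loop keeping a running maximum (hypothetical helper for this sketch).
loop.max <- function(v) {
    cmax <- v[1]
    for (i in seq_along(v)[-1]) {
        if (v[i] > cmax) cmax <- v[i]
    }
    cmax
}

# Both approaches must agree on the answer.
identical(loop.max(x), max(x))

# Timings vary by machine and R version, so no expected values are shown.
system.time(max(x))
system.time(loop.max(x))
```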

98

The functions system.time and proc.time produce a vector of three numbers, showing the user, system and total elapsed times for the currently running R process. It is the third number (elapsed time in seconds, 9.51 in this case) that is typically the most useful.

2.10.4 Do not ‘grow’ data sets by concatenation or recursive function calls

Here is an extreme example of what not to do. We want to create a vector containing 100 000 numbers in sequence from 1 to 100 000. First, the quickest way using the built-in sequence generator which is invoked by the colon symbol (:)

    test1 <- function() {
        y <- 1:100000
    }

Now we obtain the same result using a loop, where we tell R in advance how long the final vector is going to be, using the numeric function. This is called preallocation.

    test2 <- function() {
        y <- numeric(100000)
        for (i in 1:100000) y[i] <- i
    }

Finally, the most inefficient way. Each time we go round the loop we concatenate the new value onto the right-hand end of the vector that has been created up to this point. We start with a NULL vector, then build it up, one step at a time, which looks like a neat idea, but is extremely inefficient, because changing the size of a vector takes roughly the same time as setting a vector up from scratch, and we change the length of our vector 100 000 times in this example. This ill-advised procedure is called re-dimensioning.

    test3 <- function() {
        y <- NULL
        for (i in 1:100000) y <- c(y,i)
    }

To compare the efficiency of the three methods, we shall work out how long each takes to complete the task.
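Before timing the three approaches it is worth confirming that they build the same vector. The sketch below uses variants that take the length as an argument and return the vector explicitly. The names `by.colon`, `by.prealloc` and `by.growing`, and the reduced size `n`, are illustrative and are not the functions defined in the text.

```r
n <- 10000   # illustrative size, smaller than the 100 000 used in the text

by.colon <- function(n) 1:n

by.prealloc <- function(n) {
    y <- numeric(n)            # allocate the full vector once
    for (i in 1:n) y[i] <- i   # fill it in place
    y
}

by.growing <- function(n) {
    y <- NULL                  # start empty and extend on every pass
    for (i in 1:n) y <- c(y, i)
    y
}

# All three produce the numbers 1 to n in order.
all(by.colon(n) == by.prealloc(n))
all(by.colon(n) == by.growing(n))
```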
The function called proc.time determines how much real time and central processing unit (CPU) time (in seconds) the currently running R process has already taken:

    proc.time()
       user  system elapsed
      53.15    5.14 2483.00

The user time is the CPU time charged for the execution of user instructions of the calling process, the system time is the CPU time charged for execution by the system on behalf of the calling process, and the elapsed time is the real time that has passed, which includes time the computer spent on other things unrelated to your R session. The function system.time calls the function proc.time, then evaluates your expression, and then calls proc.time once more, returning the difference between the two proc.time calls as its output. We can compare the efficiencies of our three different functions using system.time like this:

    system.time(test1())
       user  system elapsed
          0       0       0

99

    system.time(test2())
       user  system elapsed
       0.16    0.02    0.17
    system.time(test3())
       user  system elapsed
       8.95    0.02    8.97

The first method is so lightning fast that it does not even register on the clock. The loop using a pre-determined vector length is also very fast (0.16 seconds). In contrast, the last method, where we grew the vector at each iteration, is staggeringly slow (8.95 seconds). The moral: do not grow vectors by repeated concatenation.

2.10.5 Loops for producing time series

Wherever we can, we use vectorized functions in R because this leads to compact, efficient and easily readable code. Sometimes, however, we need to resort to using loops. Suppose we are interested in the dynamics of a population which is governed by two parameters: the per capita reproductive rate ($\lambda$) and the maximum supportable population ($N_{\max}$, which for convenience we shall set to 1.0). Next year’s population $N(t+1)$ is given by this year’s population, $N(t)$, multiplied by lambda, multiplied again by the fraction of $N_{\max}$ that is currently unrealized (i.e. $(N_{\max} - N(t))/N_{\max} = 1 - N(t)$ in the current case). Thus, we have a difference equation

$N(t+1) = \lambda N(t)[1 - N(t)]$

To simulate the dynamics of this population in R, we start by writing the difference equation as a function (call it next.year for instance):

    next.year <- function(x) lambda * x * (1 - x)

So if we begin with a population of N = 0.6 and set λ = 3.7 we can predict next year’s population like this:

    lambda <- 3.7
    next.year(0.6)
    [1] 0.888

The population has increased by 48% (0.888/0.6 = 1.48). What happens in the second year?

    next.year(0.888)
    [1] 0.3679872

The population crashes to less than half its previous value (0.3679872/0.888 = 0.4144). We could go on repeating these calculations, modelling year after year, but this is an obvious case where using a loop would be the best solution. Let us assume that we want to model the population over 20 years.
It is good practice in cases like this to define a vector to contain the 20 population sizes at the outset (preallocation) using numeric like this:

    N <- numeric(20)

We set the initial population size (0.6) like this:

    N[1] <- 0.6

100

Now if we run through a loop to simulate the years 2 through 20 using an index called t (for time), we can invoke the function called next.year repeatedly, employing t as a subscript like this:

    for (t in 2:20) N[t] <- next.year(N[t-1])

Finally, we might want to plot a time series of the population dynamics over the course of 20 years.

    plot(N,type="l")

[Figure: line plot of N (vertical axis, about 0.3 to 0.9) against Index (horizontal axis, 1 to 20)]

This famous difference equation is known as the quadratic map, and it played a central role in the development of chaos theory (May, 1974). For large values of λ (as we used in the example above), the function is capable of producing series of numbers that are, to all intents and purposes, random. This led to a definition of chaos as behaviour that exhibited extreme sensitivity to initial conditions: tiny differences in initial population size would lead to radically different time series in population dynamics.

2.11 Lists

Lists are extremely important objects in R. You will have heard of the problems of ‘comparing apples and oranges’ or how two things are ‘as different as chalk and cheese’. You can think of lists as a way of getting around these problems. Here are four completely different objects: a numeric vector, a logical vector, a vector of character strings and a vector of complex numbers:

    apples <- c(4,4.5,4.2,5.1,3.9)
    oranges <- c(TRUE, TRUE, FALSE)
    chalk <- c("limestone", "marl","oolite", "CaC03")
    cheese <- c(3.2-4.5i,12.8+2.2i)

101

We cannot bundle them together into a dataframe, because the vectors are of different lengths:

    data.frame(apples,oranges,chalk,cheese)
    Error in data.frame(apples, oranges, chalk, cheese) :
      arguments imply differing number of rows: 5, 3, 4, 2

Despite their differences, however, we can bundle them together in a single list called items:

    items <- list(apples,oranges,chalk,cheese)
    items
    [[1]]
    [1] 4.0 4.5 4.2 5.1 3.9

    [[2]]
    [1]  TRUE  TRUE FALSE

    [[3]]
    [1] "limestone" "marl"      "oolite"    "CaC03"

    [[4]]
    [1]  3.2-4.5i 12.8+2.2i

Subscripts on vectors, matrices, arrays and dataframes have one set of square brackets [6], [3,4] or [2,3,2,1], but subscripts on lists have double square brackets [[2]] or [[i,j]]. If we want to extract the chalk from the list, we use subscript [[3]]:

    items[[3]]
    [1] "limestone" "marl"      "oolite"    "CaC03"

If we want to extract the third element within chalk (oolite) then we use single subscripts after the double subscripts like this:

    items[[3]][3]
    [1] "oolite"

R is forgiving about failure to use double brackets on their own, because single brackets return a list of length one that contains the element. It is not forgiving when you then try to access a component of an object within the list, because the second subscript is applied to that one-element list and not to the vector inside it:

    items[3]
    [[1]]
    [1] "limestone" "marl"      "oolite"    "CaC03"

    items[3][3]
    [[1]]
    NULL

There is another indexing convention in R which is used to extract named components from lists using the element names operator \$. This is known as ‘indexing tagged lists’. For this to work, the elements of the list must have names. At the moment our list called items has no names:

    names(items)
    NULL
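The difference between single and double brackets described above comes down to the class of what is returned, which can be checked directly. In this sketch the list is called `things` (an illustrative name) so that it does not overwrite items.

```r
# 'things' is a hypothetical stand-in for the first three components of items.
things <- list(c(4, 4.5, 4.2, 5.1, 3.9),
               c(TRUE, TRUE, FALSE),
               c("limestone", "marl", "oolite", "CaC03"))

class(things[[3]])     # "character": double brackets return the element itself
class(things[3])       # "list": single brackets return a list of length one

things[[3]][3]         # "oolite"
things[3][[1]][3]      # also "oolite", after unwrapping the one-element list

length(things[3])      # 1, which is why things[3][3] has nothing to return
```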

102

You can give names to the elements of a list in the function that creates the list by using the equals sign like this:

    items <- list(first=apples,second=oranges,third=chalk,fourth=cheese)

Now you can extract elements of the list by name

    items\$fourth
    [1]  3.2-4.5i 12.8+2.2i

2.11.1 Lists and lapply

We can ask a variety of questions about our new list object:

    class(items)
    [1] "list"
    mode(items)
    [1] "list"
    is.numeric(items)
    [1] FALSE
    is.list(items)
    [1] TRUE
    length(items)
    [1] 4

Note that the length of a list is the number of items in the list, not the lengths of the individual vectors within the list.

The function lapply applies a specified function to each of the elements of a list in turn (without the need for specifying a loop, and not requiring us to know how many elements there are in the list). A useful function to apply to lists is the length function; this asks how many elements comprise each component of the list. Technically we want to know the length of each of the vectors making up the list:

    lapply(items,length)
    \$first
    [1] 5

    \$second
    [1] 3

    \$third
    [1] 4

    \$fourth
    [1] 2

This shows that items consists of four vectors, and shows that there were 5 elements in the first vector, 3 in the second, 4 in the third and 2 in the fourth. But 5 of what, and 3 of what? To find out, we apply the function

103

class to the list:

    lapply(items,class)
    \$first
    [1] "numeric"

    \$second
    [1] "logical"

    \$third
    [1] "character"

    \$fourth
    [1] "complex"

So the answer is there were 5 numbers in the first vector, 3 logical variables in the second, 4 character strings in the third vector and 2 complex numbers in the fourth.

Applying numeric functions to lists will only work for objects of class numeric or complex, or objects (like logical values) that can be coerced into numbers. Here is what happens when we use lapply to apply the function mean to items:

    lapply(items,mean)
    \$first
    [1] 4.34

    \$second
    [1] 0.6666667

    \$third
    [1] NA

    \$fourth
    [1] 8-1.15i

    Warning message:
    In mean.default(X[[3L]], ...) :
      argument is not numeric or logical: returning NA

We get a warning message pointing out that the third vector cannot be coerced to a number (it is not numeric, complex or logical), so NA appears in the output. The second vector produces the answer 2/3 because logical false (FALSE) is coerced to numeric 0 and logical true (TRUE) is coerced to numeric 1.

The summary function works for lists:

    summary(items)
           Length Class  Mode
    first  5      -none- numeric
    second 3      -none- logical
    third  4      -none- character
    fourth 2      -none- complex

104

but the most useful overview of the contents of a list is obtained with str, the structure function:

    str(items)
    List of 4
     \$ first : num [1:5] 4 4.5 4.2 5.1 3.9
     \$ second: logi [1:3] TRUE TRUE FALSE
     \$ third : chr [1:4] "limestone" "marl" "oolite" "CaC03"
     \$ fourth: cplx [1:2] 3.2-4.5i 12.8+2.2i

2.11.2 Manipulating and saving lists

Saving lists to files is tricky, because the components of a list typically have different lengths, so the list does not form a rectangular table and we cannot use write.table. Here is a dataframe on species presence (1) or absence (0), with species’ Latin binomials in the first column as the row names:

    data<-read.csv("c:\\temp\\pa.csv",row.names=1)
    data
                           Carmel Derry Daneswall Erith Foggen Highbury Slatewell Uppington York
    Bartsia alpina              0     0         1     0      0        0         1         0    0
    Cleome serrulata            1     1         0     0      0        1         0         0    0
    Conopodium majus            0     0         0     0      0        0         0         1    1
    Corydalis sempervirens      1     0         0     1      0        1         0         0    0
    Nitella flexilis            1     0         0     0      0        0         0         0    1
    Ranunculus baudotii         1     0         1     1      0        0         0         1    0
    Rhododendron luteum         1     1         1     1      1        0         1         1    1
    Rodgersia podophylla        0     1         0     0      0        1         0         0    0
    Tiarella wherryi            0     0         1     1      1        0         0         0    0
    Veronica opaca              1     0         0     0      0        1         1         1    0

There are two kinds of operations you might want to do with a dataframe like this:

• produce lists of the sites at which each species is found;

• produce lists of the species found in any given site.

We shall do each of these tasks in turn. The issue is that the numbers of place names differ from species to species, and the numbers of species differ from place to place. However, it is easy to create a list showing the column numbers that contain locations for each species:

    sapply(1:10,function(i) which(data[i,]>0))
    [[1]]
    [1] 3 7

    [[2]]
    [1] 1 2 6

    [[3]]
    [1] 8 9

105

    [[4]]
    [1] 1 4 6

    [[5]]
    [1] 1 9

    [[6]]
    [1] 1 3 4 8

    [[7]]
    [1] 1 2 3 4 5 7 8 9

    [[8]]
    [1] 2 6

    [[9]]
    [1] 3 4 5

    [[10]]
    [1] 1 6 7 8

This indicates that Bartsia alpina (the first species) is found in locations 3 and 7 (Daneswall and Slatewell). If we save this list (calling it spp for instance), then we can extract the column names at which each species is present, using the elements of spp as subscripts on the column names of data, like this:

    spp<-sapply(1:10,function(i) which(data[i,]>0))
    sapply(1:10, function(i)names(data)[spp[[i]]] )
    [[1]]
    [1] "Daneswall" "Slatewell"

    [[2]]
    [1] "Carmel"   "Derry"    "Highbury"

    [[3]]
    [1] "Uppington" "York"

    [[4]]
    [1] "Carmel"   "Erith"    "Highbury"

    [[5]]
    [1] "Carmel" "York"

    [[6]]
    [1] "Carmel"    "Daneswall" "Erith"     "Uppington"

    [[7]]
    [1] "Carmel"    "Derry"     "Daneswall" "Erith"     "Foggen"    "Slatewell" "Uppington" "York"

    [[8]]
    [1] "Derry"    "Highbury"

    [[9]]
    [1] "Daneswall" "Erith"     "Foggen"

106

    [[10]]
    [1] "Carmel"    "Highbury"  "Slatewell" "Uppington"

This completes the first task. The second task is to get species lists for each location. We apply a similar method to extract the appropriate species (this time from rownames(data)) on the basis that the presence score for this site is data[,j] > 0:

    sapply(1:9, function (j) rownames(data)[data[,j]>0] )
    [[1]]
    [1] "Cleome serrulata"       "Corydalis sempervirens" "Nitella flexilis"       "Ranunculus baudotii"
    [5] "Rhododendron luteum"    "Veronica opaca"

    [[2]]
    [1] "Cleome serrulata"     "Rhododendron luteum"  "Rodgersia podophylla"

    [[3]]
    [1] "Bartsia alpina"      "Ranunculus baudotii" "Rhododendron luteum" "Tiarella wherryi"

    [[4]]
    [1] "Corydalis sempervirens" "Ranunculus baudotii"    "Rhododendron luteum"    "Tiarella wherryi"

    [[5]]
    [1] "Rhododendron luteum" "Tiarella wherryi"

    [[6]]
    [1] "Cleome serrulata"       "Corydalis sempervirens" "Rodgersia podophylla"   "Veronica opaca"

    [[7]]
    [1] "Bartsia alpina"      "Rhododendron luteum" "Veronica opaca"

    [[8]]
    [1] "Conopodium majus"    "Ranunculus baudotii" "Rhododendron luteum" "Veronica opaca"

    [[9]]
    [1] "Conopodium majus"    "Nitella flexilis"    "Rhododendron luteum"

Because the species lists for different sites are of different lengths, the simplest solution is to create a separate file for each species list. We need to create a set of nine file names incorporating the site name, then use write.table in a loop:

    spplists<-sapply(1:9, function (j) rownames(data)[data[,j]>0] )
    for (i in 1:9) {
        slist<-data.frame(spplists[[i]])
        names(slist)<-names(data)[i]
        fn<-paste("c:\\temp\\",names(data)[i],".txt",sep="")
        write.table(slist,fn)
    }

We have produced nine separate files. Here, for instance, are the contents of the file called c:\\temp\\Carmel.txt as viewed in a text editor like Notepad:

    "Carmel"
    "1" "Cleome serrulata"
    "2" "Corydalis sempervirens"
    "3" "Nitella flexilis"

107

    "4" "Ranunculus baudotii"
    "5" "Rhododendron luteum"
    "6" "Veronica opaca"

Perhaps the simplest and best solution is to turn the whole presence/absence matrix into a long dataframe with one row for each combination of species and site. Then both tasks become very straightforward. Start by using stack to create a dataframe of place names and presence/absence information:

    newframe<-stack(data)
    head(newframe)
      values    ind
    1      0 Carmel
    2      1 Carmel
    3      0 Carmel
    4      1 Carmel
    5      1 Carmel
    6      1 Carmel

Now extract the species names from the row names, repeat the list of names nine times, and add the resulting vector of species names to the dataframe:

    newframe<-data.frame(newframe, rep(rownames(data),9))

Finally, give the three columns of the new dataframe sensible names:

    names(newframe)<-c("present","location","species")
    head(newframe)
      present location                species
    1       0   Carmel         Bartsia alpina
    2       1   Carmel       Cleome serrulata
    3       0   Carmel       Conopodium majus
    4       1   Carmel Corydalis sempervirens
    5       1   Carmel       Nitella flexilis
    6       1   Carmel    Ranunculus baudotii

Unlike the lists, you can easily save this object to file:

    write.table(newframe,"c:\\temp\\spplists.txt")

Now it is simple to do both our tasks. Here is a location list for species = Bartsia alpina:

    newframe[newframe\$species=="Bartsia alpina" & newframe\$present==1,2]
    [1] Daneswall Slatewell

108

and here is a species list for location = Carmel:

    newframe[newframe\$location=="Carmel" & newframe\$present==1,3]
    [1] Cleome serrulata       Corydalis sempervirens Nitella flexilis       Ranunculus baudotii
    [5] Rhododendron luteum    Veronica opaca

Lists are great, but dataframes are better. The cost of the dataframe is the potentially substantial redundancy in storage requirement. In practice, with relatively small dataframes, this seldom matters.

2.12 Text, character strings and pattern matching

In R, character strings are defined by double quotation marks:

    a <- "abc"
    b <- "123"

Numbers can be coerced to characters (as in b above), but non-numeric characters cannot be coerced to numbers:

    as.numeric(a)
    [1] NA
    Warning message:
    NAs introduced by coercion
    as.numeric(b)
    [1] 123

One of the initially confusing things about character strings is the distinction between the length of a character object (a vector), and the numbers of characters (nchar) in the strings that comprise that object. An example should make the distinction clear:

    pets <- c("cat","dog","gerbil","terrapin")

Here, pets is a vector comprising four character strings:

    length(pets)
    [1] 4

and the individual character strings have 3, 3, 6 and 8 characters, respectively:

    nchar(pets)
    [1] 3 3 6 8

When first defined, character strings are not factors:

    class(pets)
    [1] "character"
    is.factor(pets)
    [1] FALSE

109

However, if the vector of characters called pets was part of a dataframe (e.g. if it was input using read.table) then R would coerce all the character variables to act as factors. This was the default behaviour before R 4.0.0; in later versions you need to ask for it with stringsAsFactors=TRUE:

    df <- data.frame(pets,stringsAsFactors=TRUE)
    is.factor(df\$pets)
    [1] TRUE

There are built-in vectors in R that contain the 26 letters of the alphabet in lower case (letters) and in upper case (LETTERS):

    letters
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"
    [17] "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
    LETTERS
     [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
    [17] "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

To discover which number in the alphabet the letter n is, you can use the which function like this:

    which(letters=="n")
    [1] 14

For the purposes of printing you might want to suppress the quotes that appear around character strings by default. The function to do this is called noquote:

    noquote(letters)
     [1] a b c d e f g h i j k l m n o p q r s t u v w x y z

2.12.1 Pasting character strings together

You can amalgamate individual strings into vectors of character information:

    c(a,b)
    [1] "abc" "123"

This shows that the concatenation produces a vector of two strings. It does not convert two 3-character strings into one 6-character string. The R function to do that is paste:

    paste(a,b,sep="")
    [1] "abc123"

The third argument, sep="", means that the two character strings are to be pasted together without any separator between them: the default for paste is to insert a single blank space, like this:

    paste(a,b)
    [1] "abc 123"

Notice that you do not lose blanks that are within character strings when you use the sep="" option in paste.

110

    paste(a,b," a longer phrase containing blanks",sep="")
    [1] "abc123 a longer phrase containing blanks"

If one of the arguments to paste is a vector, each of the elements of the vector is pasted to the specified character string to produce an object of the same length as the vector:

    d <- c(a,b,"new")
    e <- paste(d,"a longer phrase containing blanks")
    e
    [1] "abc a longer phrase containing blanks"
    [2] "123 a longer phrase containing blanks"
    [3] "new a longer phrase containing blanks"

You may need to think about why there are three lines of output.

In this next example, we have four fields of information and we want to paste them together to make a file path for reading data into R:

    drive <- "c:"
    folder <- "temp"
    file <- "file"
    extension <- ".txt"

Now use the function paste to put them together:

    paste(drive, folder, file, extension)
    [1] "c: temp file .txt"

This has the essence of what we want, but it is not quite there yet. We need to replace the blank spaces " " that are the default separator with "" (no space), and to insert slashes "\\" between the drive and the directory, and the directory and file names:

    paste(drive, "\\", folder, "\\", file, extension, sep="")
    [1] "c:\\temp\\file.txt"

2.12.2 Extracting parts of strings

We begin by defining a phrase:

    phrase <- "the quick brown fox jumps over the lazy dog"

The function called substr is used to extract substrings of a specified number of characters from within a character string. Here is the code to extract the first, the first and second, the first, second and third, . . . , the first 20 characters from our phrase:

    q <- character(20)
    for (i in 1:20) q[i] <- substr(phrase,1,i)
    q
     [1] "t"                    "th"                   "the"
     [4] "the "                 "the q"                "the qu"

    [7] "the qui" "the quic" "the quick"
    [10] "the quick " "the quick b" "the quick br"
    [13] "the quick bro" "the quick brow" "the quick brown"
    [16] "the quick brown " "the quick brown f" "the quick brown fo"
    [19] "the quick brown fox" "the quick brown fox "

The second argument in substr is the number of the character at which extraction is to begin (in this case always the first), and the third argument is the number of the character at which extraction is to end (in this case, the ith).

2.12.3 Counting things within strings

Counting the total number of characters in a string could not be simpler; just use the nchar function directly, like this:

    nchar(phrase)
    [1] 43

So there are 43 characters including the blanks between the words. To count the numbers of separate individual characters (including blanks) you need to split up the character string into individual characters (43 of them), using strsplit like this:

    strsplit(phrase,split=character(0))
    [[1]]
    [1] "t" "h" "e" " " "q" "u" "i" "c" "k" " " "b" "r"
    [13] "o" "w" "n" " " "f" "o" "x" " " "j" "u" "m" "p"
    [25] "s" " " "o" "v" "e" "r" " " "t" "h" "e" " " "l"
    [37] "a" "z" "y" " " "d" "o" "g"

You could use NULL in place of split=character(0) (see below). The table function can then be used for counting the number of occurrences of each of the characters:

    table(strsplit(phrase,split=character(0)))
      a b c d e f g h i j k l m n o p q r s t u v w x y z
    8 1 1 1 1 3 1 1 2 1 1 1 1 1 1 4 1 1 2 1 2 2 1 1 1 1 1

This demonstrates that all of the letters of the alphabet were used at least once within our phrase, and that there were eight blanks within the string called phrase. This suggests a way of counting the number of words in a phrase, given that this will always be one more than the number of blanks (so long as there are no leading or trailing blanks in the string):

    words <- 1+table(strsplit(phrase,split=character(0)))[1]
    words
    9

What about the lengths of the words within phrase?
Here are the separate words:

    strsplit(phrase, " ")
    [[1]]
    [1] "the" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"

We work out their lengths using lapply to apply the function nchar to each element of the list produced by strsplit. Then we use table to count how many words of each length are present:

    table(lapply(strsplit(phrase, " "), nchar))
    3 4 5
    4 2 3

showing there were 4 three-letter words, 2 four-letter words and 3 five-letter words.

This is how you reverse a character string. The logic is that you need to break it up into individual characters, then reverse their order, then paste them all back together again. It seems long-winded until you think about what the alternative would be:

    strsplit(phrase,NULL)
    [[1]]
    [1] "t" "h" "e" " " "q" "u" "i" "c" "k" " " "b" "r"
    [13] "o" "w" "n" " " "f" "o" "x" " " "j" "u" "m" "p"
    [25] "s" " " "o" "v" "e" "r" " " "t" "h" "e" " " "l"
    [37] "a" "z" "y" " " "d" "o" "g"

    lapply(strsplit(phrase,NULL),rev)
    [[1]]
    [1] "g" "o" "d" " " "y" "z" "a" "l" " " "e" "h" "t"
    [13] " " "r" "e" "v" "o" " " "s" "p" "m" "u" "j" " "
    [25] "x" "o" "f" " " "n" "w" "o" "r" "b" " " "k" "c"
    [37] "i" "u" "q" " " "e" "h" "t"

    sapply(lapply(strsplit(phrase, NULL), rev), paste, collapse="")
    [1] "god yzal eht revo spmuj xof nworb kciuq eht"

Note that the collapse argument is necessary to reduce the answer back to a single character string. The word lengths are retained, so this would be a poor method of encryption.

When we specify a particular string to form the basis of the split, we end up with a list made up from the components of the string that do not contain the specified string. This is hard to understand without an example.
Suppose we split our phrase using ‘the’:

    strsplit(phrase,"the")
    [[1]]
    [1] "" " quick brown fox jumps over " " lazy dog"

There are three elements in this list: the first one is the empty string "" because the first three characters within phrase were exactly ‘the’; the second element contains the part of the phrase between the two occurrences of the string ‘the’; and the third element is the end of the phrase, following the second ‘the’. Suppose that we want to extract the characters between the first and second occurrences of ‘the’. This is achieved very simply, using subscripts to extract the second element of the list:

    strsplit(phrase,"the")[[1]][2]
    [1] " quick brown fox jumps over "

Note that the first subscript in double square brackets refers to the number within the list (there is only one list in this case), and the second subscript refers to the second element within this list. So if we want to know how many characters there are between the first and second occurrences of the word ‘the’ within our phrase, we put:

    nchar(strsplit(phrase,"the")[[1]][2])
    [1] 28

2.12.4 Upper- and lower-case text

It is easy to switch between upper and lower cases using the toupper and tolower functions:

    toupper(phrase)
    [1] "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
    tolower(toupper(phrase))
    [1] "the quick brown fox jumps over the lazy dog"

2.12.5 The match function and relational databases

The match function answers the question: ‘Where (if at all) do the values in the first vector appear in the second vector?’ It is a really important function, but it is impossible to understand without an example:

    first <- c(5,8,3,5,3,6,4,4,2,8,8,8,4,4,6)
    second <- c(8,6,4,2)
    match(first,second)
    [1] NA 1 NA NA NA 2 3 3 4 1 1 1 3 3 2

The first thing to note is that match produces a vector of subscripts (index values) and that these are subscripts within the second vector. The length of the vector produced by match is the length of the first vector (15 in this example). If elements of the first vector do not occur anywhere in the second vector, then match produces NA. It works like this. Where does 5 (from the first position in the first vector) appear in the second vector? Answer: it does not (NA). Then, where does 8 (the second element of the first vector) appear in the second vector? Answer: in position number 1. And so on.

Why would you ever want to use this? The answer turns out to be very general and extremely useful in data management. Large and/or complicated databases are always best stored as relational databases (e.g. Oracle or Access). In these, data are stored in sets of two-dimensional spreadsheet-like objects called tables.
Data are divided into small tables with strict rules as to what data they can contain. You then create relationships between the tables that allow the computer to look from one table to another in order to assemble the data you want for a particular application. The relationship between two tables is based on fields whose values (if not their variable names) are common to both tables. The rules for constructing effective relational databases were first proposed by Dr E.F. Codd of the IBM Research Laboratory at San Jose, California, in an extremely influential paper in 1970:

• All data are in tables.
• There is a separate table for each set of related variables.

• The order of the records within tables is irrelevant (so you can add records without reordering the existing records).
• The first column of each table is a unique ID number for every row in that table (a simple way to make sure that this works is to have the rows numbered sequentially from 1 at the top, so that when you add new rows you are sure that they get unique identifiers).
• There is no unnecessary repetition of data so the storage requirement is minimized, and when we need to edit a record, we only need to edit it once (the last point is very important).
• Each piece of data is ‘granular’ (meaning as small as possible); so you would split a customer’s name into title (Dr), first name (Charles), middle name (Urban), surname (Forrester), and preferred form of address (Chuck), so that if they were promoted, for instance, we would only need to convert Dr to Prof. in the title field.

These are called the ‘normalization rules’ for creating bullet-proof databases. The use of Structured Query Language (SQL) in R to interrogate relational databases is discussed in Chapter 3 (p. 154). Here, the only point is to see how the match function relates information in one vector (or table) to information in another.

Take a medical example. You have a vector containing the anonymous identifiers of nine patients (subjects):

    subjects <- c("A", "B", "G", "M", "N", "S", "T", "V", "Z")

Suppose you wanted to give a new drug to all the patients identified in the second vector called suitable.patients, and the conventional drug to all the others. Here are the suitable patients:

    suitable.patients <- c("E", "G", "S", "U", "Z")

Notice that two of the suitable patients (E and U) are not part of this trial.
This is what the match function does:

    match(subjects, suitable.patients)
    [1] NA NA 2 NA NA 3 NA NA 5

For each of the individuals in the first vector (subjects) it finds the subscript in the second vector (suitable.patients), returning NA if that patient does not appear in the second vector. The key point to understand is that the vector produced by match is the same length as the first vector supplied to match, and that the numbers in the result are subscripts within the second vector. The last bit is what people find hard to understand at first. Let us go through the output term by term and see what each means. Patient A is not in the suitable vector, so NA is returned. The same is true for patient B. Patient G is suitable, so we get a number in the third position. That number is a 2 because patient G is the second element of the vector called suitable.patients. Neither patient M nor N is in the second vector, so they both appear as NA. Patient S is suitable and so produces a number. The number is 3 because that is the position of S within the second vector.

To complete the job, we want to produce a vector of the drugs to be administered to each of the subjects. We create a vector containing the two treatment names:

    drug <- c("new", "conventional")

Then we use the result of the match to give the right drug to the right patient:

    drug[ifelse(is.na(match(subjects, suitable.patients)),2,1)]
    [1] "conventional" "conventional" "new" "conventional"

    [5] "conventional" "new" "conventional" "conventional"
    [9] "new"

Note the use of ifelse with is.na to produce a subscript 2 (to use with drug) for the unsuitable patients (i.e. when the result of the match is NA) and a 1 when the result of the match is not NA (i.e. for the suitable patients). You may need to work through this example several times (but it is well worth mastering it).

2.12.6 Pattern matching

We need a dataframe with a serious amount of text in it to make these exercises relevant:

    wf <- read.table("c:\\temp\\worldfloras.txt",header=T)
    attach(wf)
    names(wf)
    [1] "Country" "Latitude" "Area" "Population" "Flora"
    [6] "Endemism" "Continent"
    Country

As you can see, there are 161 countries in this dataframe (strictly, 161 places, since some of the entries, such as Sicily and Balearic Islands, are not countries). The idea is that we want to be able to select subsets of countries on the basis of specified patterns within the character strings that make up the country names (factor levels). The function to do this is grep. This searches for matches to a pattern (specified in its first argument) within the character vector which forms the second argument. It returns a vector of indices (subscripts) within the vector appearing as the second argument, where the pattern was found in whole or in part. The topic of pattern matching is very easy to master once the penny drops, but it is hard to grasp without simple, concrete examples.
Perhaps the simplest task is to select all the countries containing a particular letter – for instance, upper-case R:

    as.vector(Country[grep("R",as.character(Country))])
    [1] "Central African Republic" "Costa Rica"
    [3] "Dominican Republic" "Puerto Rico"
    [5] "Reunion" "Romania"
    [7] "Rwanda" "USSR"

To restrict the search to countries whose first name begins with R use the ^ character like this:

    as.vector(Country[grep("^R",as.character(Country))])
    [1] "Reunion" "Romania" "Rwanda"

To select those countries with multiple names with upper-case R as the first letter of their second or subsequent names, we specify the character string as ‘blank R’ like this:

    as.vector(Country[grep(" R",as.character(Country))])
    [1] "Central African Republic" "Costa Rica"
    [3] "Dominican Republic" "Puerto Rico"
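The worldfloras.txt file is not reproduced here, so the following minimal sketch repeats the three searches on a short stand-in vector. The object name places and its five entries are assumptions chosen for illustration (the names are taken from the output shown above). Setting value=TRUE asks grep to return the matching strings themselves rather than their subscripts.

```r
# Hypothetical stand-in for the Country variable (not the full dataframe)
places <- c("Costa Rica", "Peru", "Romania", "Rwanda", "USSR")

# Upper-case R anywhere in the name
anywhere <- grep("R", places, value = TRUE)

# Upper-case R at the start of the name
at.start <- grep("^R", places, value = TRUE)

# Upper-case R at the start of a second or later word (blank followed by R)
later.word <- grep(" R", places, value = TRUE)

anywhere
at.start
later.word
```

With this vector, the first search should pick out every name except Peru, the second only Romania and Rwanda, and the third only Costa Rica.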

To find all the countries with two or more names, just search for a blank " ":

    as.vector(Country[grep(" ",as.character(Country))])
    [1] "Balearic Islands" "Burkina Faso"
    [3] "Central African Republic" "Costa Rica"
    [5] "Dominican Republic" "El Salvador"
    [7] "French Guiana" "Germany East"
    [9] "Germany West" "Hong Kong"
    [11] "Ivory Coast" "New Caledonia"
    [13] "New Zealand" "Papua New Guinea"
    [15] "Puerto Rico" "Saudi Arabia"
    [17] "Sierra Leone" "Solomon Islands"
    [19] "South Africa" "Sri Lanka"
    [21] "Trinidad & Tobago" "Tristan da Cunha"
    [23] "United Kingdom" "Viet Nam"
    [25] "Yemen North" "Yemen South"

To find countries with names ending in ‘y’ use the $ symbol like this:

    as.vector(Country[grep("y$",as.character(Country))])
    [1] "Hungary" "Italy" "Norway" "Paraguay" "Sicily" "Turkey"
    [7] "Uruguay"

To recap: the start of the character string is denoted by ^ and the end of the character string is denoted by $. For conditions that can be expressed as groups (say, series of numbers or alphabetically grouped lists of letters), use square brackets inside the quotes to indicate the range of values that is to be selected. For instance, to select countries with names containing upper-case letters from C to E inclusive, write:

    as.vector(Country[grep("[C-E]",as.character(Country))])
    [1] "Cameroon" "Canada"
    [3] "Central African Republic" "Chad"
    [5] "Chile" "China"
    [7] "Colombia" "Congo"
    [9] "Corsica" "Costa Rica"
    [11] "Crete" "Cuba"
    [13] "Cyprus" "Czechoslovakia"
    [15] "Denmark" "Dominican Republic"
    [17] "Ecuador" "Egypt"
    [19] "El Salvador" "Ethiopia"
    [21] "Germany East" "Ivory Coast"
    [23] "New Caledonia" "Tristan da Cunha"

Notice that this formulation picks out countries like Ivory Coast and Tristan da Cunha that contain upper-case Cs in places other than as their first letters.
To restrict the choice to first letters use the ^ operator before the list of capital letters:

    as.vector(Country[grep("^[C-E]",as.character(Country))])
    [1] "Cameroon" "Canada"
    [3] "Central African Republic" "Chad"

    [5] "Chile" "China"
    [7] "Colombia" "Congo"
    [9] "Corsica" "Costa Rica"
    [11] "Crete" "Cuba"
    [13] "Cyprus" "Czechoslovakia"
    [15] "Denmark" "Dominican Republic"
    [17] "Ecuador" "Egypt"
    [19] "El Salvador" "Ethiopia"

How about selecting the countries not ending with a specified pattern? The answer is simply to use negative subscripts to drop the selected items from the vector. Here are the countries that do not end with a letter between ‘a’ and ‘t’:

    as.vector(Country[-grep("[a-t]$",as.character(Country))])
    [1] "Hungary" "Italy" "Norway" "Paraguay" "Peru" "Sicily"
    [7] "Turkey" "Uruguay" "USA" "USSR" "Vanuatu"

You see that USA and USSR are included in the list because we specified lower-case letters as the endings to omit. To omit these other countries, put ranges for both upper- and lower-case letters inside the square brackets, separated by a space:

    as.vector(Country[-grep("[A-T a-t]$",as.character(Country))])
    [1] "Hungary" "Italy" "Norway" "Paraguay" "Peru" "Sicily"
    [7] "Turkey" "Uruguay" "Vanuatu"

2.12.7 Dot . as the ‘anything’ character

Countries with ‘y’ as their second letter are specified by ^.y. The ^ shows ‘starting’, then a single dot means one character of any kind, so y is the specified second character:

    as.vector(Country[grep("^.y",as.character(Country))])
    [1] "Cyprus" "Syria"

To search for countries with ‘y’ as third letter:

    as.vector(Country[grep("^..y",as.character(Country))])
    [1] "Egypt" "Guyana" "Seychelles"

If we want countries with ‘y’ as their sixth letter:

    as.vector(Country[grep("^.{5}y",as.character(Country))])
    [1] "Norway" "Sicily" "Turkey"

(Five ‘anythings’ is shown by ‘.’ then curly brackets {5} then y.) Which are the countries with four or fewer letters in their names?

    as.vector(Country[grep("^.{,4}$",as.character(Country))])
    [1] "Chad" "Cuba" "Iran" "Iraq" "Laos" "Mali" "Oman"
    [8] "Peru" "Togo" "USA" "USSR"

The ‘.’ means ‘anything’ while the {,4} means ‘repeat up to four’ anythings (dots) before $ (the end of the string). So to find all the countries with 15 or more characters in their name:

    as.vector(Country[grep("^.{15,}$",as.character(Country))])
    [1] "Balearic Islands" "Central African Republic"
    [3] "Dominican Republic" "Papua New Guinea"
    [5] "Solomon Islands" "Trinidad & Tobago"
    [7] "Tristan da Cunha"

2.12.8 Substituting text within character strings

Search-and-replace operations are carried out in R using the functions sub and gsub. The two substitution functions differ only in that sub replaces only the first occurrence of a pattern within a character string, whereas gsub replaces all occurrences. An example should make this clear. Here is a vector comprising seven character strings, called text:

    text <- c("arm", "leg", "head", "foot", "hand", "hindleg", "elbow")

We want to replace all lower-case ‘h’ with upper-case ‘H’:

    gsub("h","H",text)
    [1] "arm" "leg" "Head" "foot" "Hand" "Hindleg" "elbow"

Now suppose we want to convert the first occurrence of a lower-case ‘o’ into an upper-case ‘O’. We use sub for this (not gsub):

    sub("o","O",text)
    [1] "arm" "leg" "head" "fOot" "hand" "hindleg" "elbOw"

You can see the difference between sub and gsub in the following, where both instances of ‘o’ in foot are converted to upper case by gsub but not by sub:

    gsub("o","O",text)
    [1] "arm" "leg" "head" "fOOt" "hand" "hindleg" "elbOw"

More general patterns can be specified in the same way as we learned for grep (above). For instance, to replace the first character of every string with upper-case ‘O’ we use the dot notation (. stands for ‘anything’) coupled with ^ (the ‘start of string’ marker):

    gsub("^.","O",text)
    [1] "Orm" "Oeg" "Oead" "Ooot" "Oand" "Oindleg" "Olbow"

It is useful to be able to manipulate the cases of character strings.
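Here, we capitalize the first character in each string. The first bracketed group (\\w) captures a single word character, which \\U converts to upper case, and the second group (\\w*) captures the rest of the word, which \\L converts to lower case:

    gsub("(\\w)(\\w*)", "\\U\\1\\L\\2",text, perl=TRUE)
    [1] "Arm" "Leg" "Head" "Foot" "Hand" "Hindleg" "Elbow"

Here we convert all the characters to upper case:

    gsub("(\\w*)", "\\U\\1",text, perl=TRUE)
    [1] "ARM" "LEG" "HEAD" "FOOT" "HAND" "HINDLEG" "ELBOW"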
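As a cross-check on the substitution functions, here is a minimal sketch that puts sub and gsub side by side and compares the regular-expression route to capitalization with a plain toupper and substring version. The object names first.only, every.one, by.regex and by.substring are illustrative assumptions, not names used elsewhere in the text.

```r
text <- c("arm", "leg", "head", "foot", "hand", "hindleg", "elbow")

# sub changes only the first match in each string; gsub changes every match
first.only <- sub("o", "O", text)
every.one  <- gsub("o", "O", text)

# Capitalize the first letter with a Perl-style replacement:
# \\1 is the first word character, \\2 is the rest of the word
by.regex <- sub("^(\\w)(\\w*)", "\\U\\1\\L\\2", text, perl = TRUE)

# The same result without regular expressions
by.substring <- paste(toupper(substring(text, 1, 1)),
                      substring(text, 2), sep = "")

first.only
every.one
identical(by.regex, by.substring)
```

The two capitalization routes should agree for simple lower-case words like these; the substring version is easier to read, while the regular-expression version generalizes to patterns inside longer strings.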

2.12.9 Locations of a pattern within a vector using regexpr

Instead of substituting the pattern, we might want to know if it occurs in a string and, if so, where it occurs within each string. The result of regexpr, therefore, is a numeric vector (as with grep, above), but now indicating the position of the first instance of the pattern within the string (rather than just whether the pattern was there). If the pattern does not appear within the string, the default value returned by regexpr is -1. An example is essential to get the point of this:

    text
    [1] "arm" "leg" "head" "foot" "hand" "hindleg" "elbow"
    regexpr("o",text)
    [1] -1 -1 -1 2 -1 -1 4
    attr(,"match.length")
    [1] -1 -1 -1 1 -1 -1 1

This indicates that there were lower-case ‘o’s in two of the elements of text, and that they occurred in positions 2 and 4, respectively. Remember that if we wanted just the subscripts showing which elements of text contained an ‘o’ we would use grep like this:

    grep("o",text)
    [1] 4 7

and we would extract the character strings like this:

    text[grep("o",text)]
    [1] "foot" "elbow"

Counting how many ‘o’s there are in each string is a different problem again, and this involves the use of gregexpr:

    freq <- as.vector(unlist(lapply(gregexpr("o",text),length)))
    present <- ifelse(regexpr("o",text)<0,0,1)
    freq*present
    [1] 0 0 0 2 0 0 1

indicating that there are no ‘o’s in the first three character strings, two in the fourth and one in the last string. You will need lots of practice with these functions to appreciate all of the issues involved.

The function charmatch is for matching characters at the start of strings. If there are multiple matches (two or more) then the function returns the value 0 (e.g. when all the elements begin with ‘m’):

    charmatch("m", c("mean", "median", "mode"))
    [1] 0

If there is a unique match the function returns the index of the match within the vector of character strings (here in location number 2):

    charmatch("med", c("mean", "median", "mode"))
    [1] 2
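The multiplication by present in the counting example is needed because gregexpr returns a vector of length 1 (containing -1) for a string with no match, so length alone would report 1 instead of 0. A minimal alternative sketch, assuming the base R function regmatches is available, extracts the matched substrings first; strings without a match then give character(0), whose length is zero. The object name matches is an illustrative assumption.

```r
text <- c("arm", "leg", "head", "foot", "hand", "hindleg", "elbow")

# gregexpr locates every match; regmatches extracts the matched substrings,
# returning character(0) for strings in which the pattern does not occur
matches <- regmatches(text, gregexpr("o", text))

# The number of matches in each string is the length of each list element
freq <- sapply(matches, length)
freq
```

This should give the same counts as freq*present above without the separate regexpr step.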

2.12.10 Using %in% and which

You want to know all of the matches between one character vector and another:

    stock <- c("car","van")
    requests <- c("truck","suv","van","sports","car","waggon","car")

Use which to find the locations in the first-named vector of any and all of the entries in the second-named vector:

    which(requests %in% stock)
    [1] 3 5 7

If you want to know what the matches are as well as where they are:

    requests[which(requests %in% stock)]
    [1] "van" "car" "car"

You could use the match function to obtain the same result (p. 91):

    stock[match(requests,stock)][!is.na(match(requests,stock))]
    [1] "van" "car" "car"

but this is more clumsy. A slightly more complicated way of doing it involves sapply:

    which(sapply(requests, "%in%", stock))
    van car car
      3   5   7

Note the use of quotes around the %in% function. Note also that the match must be perfect for this to work (‘car’ with ‘car’ is not the same as ‘car’ with ‘cars’).

2.12.11 More on pattern matching

For the purposes of specifying these patterns, certain characters are called metacharacters, specifically \ | ( ) [ { ^ $ * + ? Any metacharacter with special meaning in your string may be quoted by preceding it with a backslash: \{, \$ or \*, for instance (inside an R character string the backslash itself must be doubled, as in "\\$"). You might be used to specifying one or more ‘wildcards’ by * in DOS-like applications. In R, however, the regular expressions used are those specified by POSIX (Portable Operating System Interface) 1003.2, either extended or basic, depending on the value of the extended argument, unless perl = TRUE when they are those of PCRE (see ?grep for details).

Note that the square brackets in these class names [ ] are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list. For example, [[:alnum:]] means [0-9A-Za-z], except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set.
The interpretation below is that of the POSIX locale:

    [:alnum:]   Alphanumeric characters: [:alpha:] and [:digit:].
    [:alpha:]   Alphabetic characters: [:lower:] and [:upper:].
    [:blank:]   Blank characters: space and tab.
    [:cntrl:]   Control characters in ASCII, octal codes 000 through 037, and 177 (DEL).
    [:digit:]   Digits: 0 1 2 3 4 5 6 7 8 9.

    [:graph:]   Graphical characters: [:alnum:] and [:punct:].
    [:lower:]   Lower-case letters in the current locale.
    [:print:]   Printable characters: [:alnum:], [:punct:] and space.
    [:punct:]   Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
    [:space:]   Space characters: tab, newline, vertical tab, form feed, carriage return, space.
    [:upper:]   Upper-case letters in the current locale.
    [:xdigit:]  Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.

Most metacharacters lose their special meaning inside lists. Thus, to include a literal ], place it first in the list. Similarly, to include a literal ^, place it anywhere but first. Finally, to include a literal -, place it first or last. Only these and \ remain special inside character classes. To recap:

• dot . matches any single character.
• caret ^ matches the empty string at the beginning of a line.
• dollar sign $ matches the empty string at the end of a line.
• symbols \< and \> respectively match the empty string at the beginning and end of a word.
• the symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it is not at the edge of a word.

A regular expression may be followed by one of several repetition quantifiers:

    ?       the preceding item is optional and will be matched at most once.
    *       the preceding item will be matched zero or more times.
    +       the preceding item will be matched one or more times.
    {n}     the preceding item is matched exactly n times.
    {n,}    the preceding item is matched n or more times.
    {,m}    the preceding item is matched up to m times.
    {n,m}   the preceding item is matched at least n times, but not more than m times.

You can use the OR operator | so that "abba|cde" matches either the string "abba" or the string "cde".
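The effect of quoting a metacharacter, and of the OR operator, can be seen in a minimal sketch. The vector prices and its contents are assumptions invented for illustration; the fixed=TRUE argument of grep is used as an alternative to escaping when the whole pattern is meant literally.

```r
# Hypothetical strings, two of which involve the metacharacter $
prices <- c("cost: $5", "cost: 5 dollars", "abba", "abcde")

# Unescaped, $ means 'end of string', so every element matches
unescaped <- grep("$", prices)

# Escaped with a doubled backslash, it matches a literal dollar sign
escaped <- grep("\\$", prices, value = TRUE)

# fixed = TRUE treats the whole pattern as literal text
literal <- grep("$", prices, value = TRUE, fixed = TRUE)

# The OR operator: strings containing either "abba" or "cde"
either <- grep("abba|cde", prices, value = TRUE)

unescaped
escaped
literal
either
```

Only the first element contains a literal dollar sign, so escaped and literal should both return that single string, while the unescaped pattern matches all four elements.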
Here are some simple examples to illustrate the issues involved:

    text <- c("arm","leg","head", "foot","hand", "hindleg", "elbow")

The following lines demonstrate the ‘consecutive characters’ {n} in operation:

    grep("o{1}",text,value=T)
    [1] "foot" "elbow"
    grep("o{2}",text,value=T)
    [1] "foot"
    grep("o{3}",text,value=T)
    character(0)

The following lines demonstrate the use of {n,} ‘n or more’ character counting in words:

    grep("[[:alnum:]]{4,}",text,value=T)
    [1] "head" "foot" "hand" "hindleg" "elbow"
    grep("[[:alnum:]]{5,}",text,value=T)
    [1] "hindleg" "elbow"
    grep("[[:alnum:]]{6,}",text,value=T)
    [1] "hindleg"
    grep("[[:alnum:]]{7,}",text,value=T)
    [1] "hindleg"

2.12.12 Perl regular expressions

The perl = TRUE argument switches to the PCRE library that implements regular expression pattern matching using the same syntax and semantics as Perl 5.6 or later (with just a few differences). For details (and there are many) see ?regexp.

2.12.13 Stripping patterned text out of complex strings

Suppose that we want to tease apart the information in these complicated strings (note the two blanks before ‘cervicornis’ in the first one):

    (entries <- c("Trial 1 58  cervicornis (52 match)",
    "Trial 2 60 terrestris (51 matched)",
    "Trial 8 109 flavicollis (101 matches)"))
    [1] "Trial 1 58  cervicornis (52 match)"
    [2] "Trial 2 60 terrestris (51 matched)"
    [3] "Trial 8 109 flavicollis (101 matches)"

The first task is to remove the material on numbers of matches including the brackets:

    gsub(" *$", "", gsub("\\(.*\\)$", "", entries))
    [1] "Trial 1 58  cervicornis" "Trial 2 60 terrestris"
    [3] "Trial 8 109 flavicollis"

The outer call, with the pattern " *$", removes the ‘trailing blanks’, while the inner call deletes everything (.*) between the left-hand \\( and right-hand \\) brackets, using the pattern "\\(.*\\)$" and substituting this with nothing "". The next job is to strip out the material in brackets and to extract that material, ignoring the brackets themselves:

    pos <- regexpr("\\(.*\\)$", entries)
    substring(entries, first=pos+1, last=pos+attr(pos,"match.length")-2)
    [1] "52 match" "51 matched" "101 matches"

To see how this has worked it is useful to inspect the values of pos that have emerged from the regexpr function:

    pos
    [1] 25 23 25

    attr(,"match.length")
    [1] 10 12 13

The left-hand bracket appears in position 25 in the first and third elements (note that there are two blanks before ‘cervicornis’) but in position 23 in the second element. Now the lengths of the strings matching the pattern \\(.*\\)$ can be checked; it is the number of ‘anything’ characters between the two brackets, plus one for each bracket: 10, 12 and 13.

Thus, to extract the material in brackets, but to ignore the brackets themselves, we need to locate the first character to be extracted (pos+1) and the last character to be extracted (pos+attr(pos,"match.length")-2), then use the substring function to do the extracting. Note that first and last are vectors of length 3 (= length(entries)).

2.13 Dates and times in R

The measurement of time is highly idiosyncratic. Successive years start on different days of the week. There are months with different numbers of days. Leap years have an extra day in February. Americans and Britons put the day and the month in different places: 3/4/2006 is March 4 for the former and April 3 for the latter. Occasional years have an additional ‘leap second’ added to them because friction from the tides is slowing down the rotation of the earth from when the standard time was set on the basis of the tropical year in 1900. The cumulative effect of having set the atomic clock too slow accounts for the continual need to insert leap seconds (32 of them since 1958). There is currently a debate about abandoning leap seconds and introducing a ‘leap minute’ every century or so instead. Calculations involving times are complicated by the operation of time zones and daylight saving schemes in different countries. All these things mean that working with dates and times is excruciatingly complicated. Fortunately, R has a robust system for dealing with this complexity.
To see how R handles dates and times, have a look at Sys.time():

    Sys.time()
    [1] "2014-01-24 16:24:54 GMT"

This description of date and time is strictly hierarchical from left to right: the longest time scale (years) comes first, then month, then day, separated by hyphens, then there is a blank space, followed by the time, with hours first (using the 24-hour clock), then minutes, then seconds, separated by colons. Finally, there is a character string explaining the time zone (GMT stands for Greenwich Mean Time). This representation of the date and time as a character string is user-friendly and familiar, but it is no good for calculations. For that, we need a single numeric representation of the combined date and time. The convention in R is to base this on seconds (the smallest time scale that is accommodated in Sys.time). You can always aggregate upwards to days or years, but you cannot do the reverse. The baseline for expressing today’s date and time in seconds is 1 January 1970:

    as.numeric(Sys.time())
    [1] 1390580694

This is fine for plotting time series graphs, but it is not much good for computing monthly means (e.g. is the mean for June significantly different from the July mean?) or daily means (e.g. is the Monday mean significantly different from the Friday mean?). To answer questions like these we have to be able to access a broad set of categorical variables associated with the date: the year, the month, the day of the week, and so forth. To accommodate this, R uses the POSIX system for representing times and dates:

    class(Sys.time())
    [1] "POSIXct" "POSIXt"

You can think of the class POSIXct, with suffix ‘ct’, as continuous time (i.e. a number of seconds), and POSIXlt, with suffix ‘lt’, as list time (i.e. a list of all the various categorical descriptions of the time, including day of the week and so forth). It is hard to remember these acronyms, but it is well worth making the effort. Naturally, you can easily convert from one representation to the other:

    time.list <- as.POSIXlt(Sys.time())
    unlist(time.list)
    sec min hour mday mon year wday yday isdst
     54  24   16   24   0  114    5   23     0

Here you see the nine components of the list. The time is represented by the number of seconds (sec), minutes (min) and hours (on the 24-hour clock). Next comes the day of the month (mday, starting from 1), then the month of the year (mon, starting at January = 0), then the year (starting at 0 = 1900). The day of the week (wday) is coded from Sunday = 0 to Saturday = 6. The day within the year (yday) is coded from 0 = January 1. Finally, there is a logical variable isdst which asks whether daylight saving time is in operation (0 = FALSE in this case). The ones you are most likely to use include year (to get yearly mean values), mon (to get monthly means) and wday (to get means for the different days of the week).

2.13.1 Reading time data from files

It is most likely that your data files contain dates in Excel format, for example 03/09/2014 (a character string showing day/month/year separated by slashes).

    data <- read.table("c:\\temp\\dates.txt",header=T)
    attach(data)
    head(data)
      x       date
    1 3 15/06/2014
    2 1 16/06/2014
    3 6 17/06/2014
    4 7 18/06/2014
    5 8 19/06/2014
    6 9 20/06/2014

When you read character data into R using read.table, the default option (in versions of R before 4.0.0) is to convert the character variables into factors.
Factors are of mode numeric and class factor : mode(date) [1] "numeric" class(date) [1] "factor" For our present purposes, the point is that the data are not recognized by R as being dates. To convert a factor or a character string into a POSIXlt object, we employ an important function called ‘strip time’, written strptime.
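The coding conventions for the components of a POSIXlt object are easy to forget, so here is a minimal sketch that checks them. It assumes a fixed instant (the one printed by Sys.time() above) and sets the time zone to UTC explicitly, so that the results do not depend on when or where the code is run; the object names ct and lt are ours.

```r
# A fixed instant in UTC, so the result does not depend on the clock or locale
ct <- as.POSIXct("2014-01-24 16:24:54", tz = "UTC")   # continuous time
lt <- as.POSIXlt(ct)                                    # list time

as.numeric(ct)        # seconds since 1 January 1970
lt$year + 1900        # calendar year (year is stored as years since 1900)
lt$mon + 1            # calendar month (mon is stored as 0 to 11)
lt$wday               # day of the week, Sunday = 0
lt$yday               # day within the year, 1 January = 0

# Converting back to POSIXct recovers the same number of seconds
as.numeric(as.POSIXct(lt)) == as.numeric(ct)
```

Adding 1900 to year and 1 to mon is the usual way to get back to calendar values before printing or tabulating.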

2.13.2 The strptime function

To convert a factor or a character string into dates using the strptime function, we provide a format statement enclosed in double quotes to tell R exactly what to expect, in what order, and separated by what kind of symbol. For our present example we have day (as two digits), then slash, then month (as two digits), then slash, then year (with the century, making four digits).

Rdate <- strptime(as.character(date),"%d/%m/%Y")
class(Rdate)
[1] "POSIXlt" "POSIXt"

It is always a good idea at this stage to add the R-formatted date to your dataframe:

data <- data.frame(data,Rdate)
head(data)
  x       date      Rdate
1 3 15/06/2014 2014-06-15
2 1 16/06/2014 2014-06-16
3 6 17/06/2014 2014-06-17
4 7 18/06/2014 2014-06-18
5 8 19/06/2014 2014-06-19
6 9 20/06/2014 2014-06-20

Now, at last, we can do things with the date information. We might want the mean value of x for each day of the week. The name of this object is Rdate$wday:

tapply(x,Rdate$wday,mean)
    0     1     2     3     4     5     6
5.660 2.892 5.092 7.692 8.692 9.692 8.892

The lowest mean is on Mondays (wday = 1) and the highest on Fridays (wday = 5).

It is hard to remember all the format codes for strip time, but they are roughly mnemonic and they are always preceded by a percent symbol. Here is the full list of format components:

%a Abbreviated weekday name
%A Full weekday name
%b Abbreviated month name
%B Full month name
%c Date and time, locale-specific
%d Day of the month as decimal number (01–31)
%H Hours as decimal number (00–23) on the 24-hour clock
%I Hours as decimal number (01–12) on the 12-hour clock
%j Day of year as decimal number (001–366)
%m Month as decimal number (01–12)
%M Minute as decimal number (00–59)
%p AM/PM indicator in the locale
%S Second as decimal number (00–61, allowing for two ‘leap seconds’)
%U Week of the year (00–53) using the first Sunday as day 1 of week 1

%w Weekday as decimal number (0–6, Sunday is 0)
%W Week of the year (00–53) using the first Monday as day 1 of week 1
%x Date, locale-specific
%X Time, locale-specific
%Y Year with century
%y Year without century
%Z Time zone as a character string (output only)

Note the difference between the upper case for year %Y (this is the unambiguous year including the century, 2014), and the potentially ambiguous lower case %y (it is not clear whether 14 means 1914 or 2014).

There is a useful function called weekdays (note the plural) for turning the day number into the appropriate name:

y <- strptime("01/02/2014",format="%d/%m/%Y")
weekdays(y)
[1] "Saturday"

which is converted from

y$wday
[1] 6

because the days of the week are numbered from Sunday = 0.

Here is another kind of date, with years in two-digit form (%y), and the months as abbreviated names (%b) using no separators:

other.dates <- c("1jan99", "2jan05", "31mar04", "30jul05")
strptime(other.dates, "%d%b%y")
[1] "1999-01-01" "2005-01-02" "2004-03-31" "2005-07-30"

Here is yet another possibility with year, then month in full, then week of the year, then day of the week abbreviated, all separated by a single blank space:

yet.another.date <- c("2016 January 2 Mon","2017 February 6 Fri","2018 March 10 Tue")
strptime(yet.another.date,"%Y %B %W %a")
[1] "2016-01-11" "2017-02-10" "2018-03-06"

The system is clever in that it knows the date of the Monday in week number 2 of January in 2016, and of the Tuesday in week 10 of 2018 (the information on month is redundant in this case):

yet.more.dates <- c("2016 2 Mon","2017 6 Fri","2018 10 Tue")
strptime(yet.more.dates,"%Y %W %a")
[1] "2016-01-11" "2017-02-10" "2018-03-06"

2.13.3 The difftime function

The function difftime calculates a difference of two date-time objects and returns an object of class difftime with an attribute indicating the units. You can use various arithmetic operations on a difftime

object including round, signif, floor, ceiling, trunc, abs, sign and certain logical operations. You can create a difftime object like this (the numbers printed depend on the day on which the code is run, because the character strings are measured from the start of the current day):

as.difftime(yet.more.dates,"%Y %W %a")
Time differences in days
[1] 1434 1830 2219
attr(,"tzone")
[1] ""

or like this:

difftime("2014-02-06","2014-07-06")
Time difference of -149.9583 days
round(difftime("2014-02-06","2014-07-06"),0)
Time difference of -150 days

2.13.4 Calculations with dates and times

You can do the following calculations with dates and times:

• time + number
• time – number
• time1 – time2
• time1 ‘logical operation’ time2

where the logical operations are one of ==, !=, <, <=, > or >=. You can add or subtract a number of seconds or a difftime object from a date-time object, but you cannot add two date-time objects. Subtraction of two date-time objects is equivalent to using difftime. Unless a time zone has been specified, POSIXlt objects are interpreted as being in the current time zone in calculations.

The thing you need to grasp is that you should convert your dates and times into POSIXlt objects before starting to do any calculations. Once they are POSIXlt objects, it is straightforward to calculate means, differences and so on. Here we want to calculate the number of days between two dates, 22 October 2015 and 22 October 2018:

y2 <- as.POSIXlt("2015-10-22")
y1 <- as.POSIXlt("2018-10-22")

Now you can do calculations with the two dates:

y1-y2
Time difference of 1096 days

2.13.5 The difftime and as.difftime functions

Working out the time difference between two dates and times involves the difftime function, which takes two date-time objects as its arguments. The function returns an object of class difftime with an attribute indicating the units. For instance, how many days elapsed between 15 August 2013 and 21 October 2015?

difftime("2015-10-21","2013-8-15")
Time difference of 797 days

If you want only the number of days to use in calculation, then write

as.numeric(difftime("2015-10-21","2013-8-15"))
[1] 797

If you have times but no dates, then you can use as.difftime to create appropriate objects for calculations:

t1 <- as.difftime("6:14:21")
t2 <- as.difftime("5:12:32")
t1-t2
Time difference of 1.030278 hours

You will often want to create POSIXlt objects from components stored in different vectors within a dataframe. For instance, here is a dataframe with the hours, minutes and seconds from an experiment with two factor levels in four separate columns:

times <- read.table("c:\\temp\\times.txt",header=T)
attach(times)
head(times)
  hrs min sec experiment
1   2  23   6          A
2   3  16  17          A
3   3   2  56          A
4   2  45   0          A
5   3   4  42          A
6   2  56  25          A

Because the times are not in POSIXlt format, you need to paste together the hours, minutes and seconds into a character string, using colons as the separator:

paste(hrs,min,sec,sep=":")
[1] "2:23:6" "3:16:17" "3:2:56" "2:45:0" "3:4:42" "2:56:25" "3:12:28"
[8] "1:57:12" "2:22:22" "1:42:7" "2:31:17" "3:15:16" "2:28:4" "1:55:34"
[15] "2:17:7" "1:48:48"

Now save this object as a difftime vector called duration:

duration <- as.difftime(paste(hrs,min,sec,sep=":"))

Then you can carry out calculations like mean and variance using the tapply function:

tapply(duration,experiment,mean)
       A        B
2.829375 2.292882

which gives the answer in decimal hours.
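R chooses the units of a difftime object for you unless you say otherwise, which can be awkward when you want all your answers on the same scale. The sketch below shows one way of controlling the units. It assumes UTC dates, so that daylight saving time does not affect the count of days, and the object names d and gap are ours.

```r
# Ask difftime for specific units rather than letting R choose them
d <- difftime(as.POSIXct("2015-10-21", tz = "UTC"),
              as.POSIXct("2013-08-15", tz = "UTC"),
              units = "weeks")
d
as.numeric(d, units = "days")    # the same interval expressed in days

# Times without dates: change the units of an existing difftime object
t1 <- as.difftime("6:14:21")
t2 <- as.difftime("5:12:32")
gap <- t1 - t2
units(gap) <- "mins"
gap
```

Supplying units to as.numeric is safer than relying on the units that happen to be attached to the object.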

2.13.6 Generating sequences of dates

You may want to generate sequences of dates by years, months, weeks, days of the month or days of the week. Here are four sequences of dates, all starting on 4 November 2015, the first going in increments of one day:

seq(as.POSIXlt("2015-11-04"), as.POSIXlt("2015-11-15"), "1 day")
[1] "2015-11-04 GMT" "2015-11-05 GMT" "2015-11-06 GMT" "2015-11-07 GMT"
[5] "2015-11-08 GMT" "2015-11-09 GMT" "2015-11-10 GMT" "2015-11-11 GMT"
[9] "2015-11-12 GMT" "2015-11-13 GMT" "2015-11-14 GMT" "2015-11-15 GMT"

the second with increments of 2 weeks:

seq(as.POSIXlt("2015-11-04"), as.POSIXlt("2016-04-05"), "2 weeks")
[1] "2015-11-04 GMT" "2015-11-18 GMT" "2015-12-02 GMT" "2015-12-16 GMT"
[5] "2015-12-30 GMT" "2016-01-13 GMT" "2016-01-27 GMT" "2016-02-10 GMT"
[9] "2016-02-24 GMT" "2016-03-09 GMT" "2016-03-23 GMT"

the third with increments of 3 months:

seq(as.POSIXlt("2015-11-04"), as.POSIXlt("2018-10-04"), "3 months")
[1] "2015-11-04 GMT" "2016-02-04 GMT" "2016-05-04 BST" "2016-08-04 BST"
[5] "2016-11-04 GMT" "2017-02-04 GMT" "2017-05-04 BST" "2017-08-04 BST"
[9] "2017-11-04 GMT" "2018-02-04 GMT" "2018-05-04 BST" "2018-08-04 BST"

the fourth with increments of years:

seq(as.POSIXlt("2015-11-04"), as.POSIXlt("2026-02-04"), "year")
[1] "2015-11-04 GMT" "2016-11-04 GMT" "2017-11-04 GMT" "2018-11-04 GMT"
[5] "2019-11-04 GMT" "2020-11-04 GMT" "2021-11-04 GMT" "2022-11-04 GMT"
[9] "2023-11-04 GMT" "2024-11-04 GMT" "2025-11-04 GMT" "2026-11-04 GMT"

If you specify a number, rather than a recognized character string, in the by part of the sequence function, then the number is assumed to be a number of seconds, so this generates the time as well as the date:

seq(as.POSIXlt("2015-11-04"), as.POSIXlt("2015-11-05"), 8955)
[1] "2015-11-04 00:00:00 GMT" "2015-11-04 02:29:15 GMT"
[3] "2015-11-04 04:58:30 GMT" "2015-11-04 07:27:45 GMT"
[5] "2015-11-04 09:57:00 GMT" "2015-11-04 12:26:15 GMT"
[7] "2015-11-04 14:55:30 GMT" "2015-11-04 17:24:45 GMT"
[9] "2015-11-04 19:54:00 GMT" "2015-11-04 22:23:15 GMT"

As with other forms of seq, you can specify the length of the vector to be generated, instead of specifying the final date:

seq(as.POSIXlt("2015-11-04"), by="month", length=10)
[1] "2015-11-04 GMT" "2015-12-04 GMT" "2016-01-04 GMT" "2016-02-04 GMT"
[5] "2016-03-04 GMT" "2016-04-04 BST" "2016-05-04 BST" "2016-06-04 BST"
[9] "2016-07-04 BST" "2016-08-04 BST"

or you can generate a vector of dates to match the length of an existing vector, using along= instead of length=:

results <- runif(16)
seq(as.POSIXlt("2015-11-04"), by="month", along=results)
[1] "2015-11-04 GMT" "2015-12-04 GMT" "2016-01-04 GMT" "2016-02-04 GMT"
[5] "2016-03-04 GMT" "2016-04-04 BST" "2016-05-04 BST" "2016-06-04 BST"
[9] "2016-07-04 BST" "2016-08-04 BST" "2016-09-04 BST" "2016-10-04 BST"
[13] "2016-11-04 GMT" "2016-12-04 GMT" "2017-01-04 GMT" "2017-02-04 GMT"

You can use the weekdays function to extract the days of the week from a series of dates:

weekdays(seq(as.POSIXlt("2015-11-04"), by="month", along=results))
[1] "Wednesday" "Friday" "Monday" "Thursday" "Friday" "Monday"
[7] "Wednesday" "Saturday" "Monday" "Thursday" "Sunday" "Tuesday"
[13] "Friday" "Sunday" "Wednesday" "Saturday"

Suppose that you want to find the dates of all the Mondays in a sequence of dates. This involves the use of logical subscripts (see p. 39). The subscripts evaluating to TRUE will be selected, so the logical statement you need to make is wday == 1 (because Sunday is wday == 0). You create an object called y containing the first 100 days in 2016 (note that the start date is 31 December 2015), then convert this vector of dates into a POSIXlt object, a list called x, like this:

y <- as.Date(1:100,origin="2015-12-31")
x <- as.POSIXlt(y)

Now, because x is a list, you can use the $ operator to access information on weekday, and you find, of course, that they are all 7 days apart, starting from 4 January 2016:

x[x$wday==1]
[1] "2016-01-04 UTC" "2016-01-11 UTC" "2016-01-18 UTC" "2016-01-25 UTC"
[5] "2016-02-01 UTC" "2016-02-08 UTC" "2016-02-15 UTC" "2016-02-22 UTC"
[9] "2016-02-29 UTC" "2016-03-07 UTC" "2016-03-14 UTC" "2016-03-21 UTC"
[13] "2016-03-28 UTC" "2016-04-04 UTC"

Suppose you want to list the dates of the first Monday in each month. This is the date with wday == 1 (as above) but only on its first occurrence in each month of the year. This is slightly more tricky, because several months will contain five Mondays, so you cannot use seq with by = "28 days" to solve the problem (this would generate 13 dates, not the 12 required). Start with the days of 2016 (the sequence 1:365 stops at 30 December because 2016 is a leap year, but that does not matter here):

y <- as.POSIXlt(as.Date(1:365,origin="2015-12-31"))

Here is what we know so far:

data.frame(monday=y[y$wday==1],month=y$mon[y$wday==1])[1:12,]
       monday month
1  2016-01-04     0
2  2016-01-11     0
3  2016-01-18     0
4  2016-01-25     0

5  2016-02-01     1
6  2016-02-08     1
7  2016-02-15     1
8  2016-02-22     1
9  2016-02-29     1
10 2016-03-07     2
11 2016-03-14     2
12 2016-03-21     2

You want a vector to mark the 12 Mondays you require: these are those where month is not duplicated (i.e. you want to take the first row from each month). For this example, the first Monday in January is in row 1 (obviously), the first in February in row 5, the first in March in row 10, and so on. You can use the not duplicated function !duplicated to tag these rows:

wanted <- !duplicated(y$mon[y$wday==1])

Finally, select the 12 dates of the first Mondays using wanted as a subscript like this:

y[y$wday==1][wanted]
[1] "2016-01-04 UTC" "2016-02-01 UTC" "2016-03-07 UTC" "2016-04-04 UTC"
[5] "2016-05-02 UTC" "2016-06-06 UTC" "2016-07-04 UTC" "2016-08-01 UTC"
[9] "2016-09-05 UTC" "2016-10-03 UTC" "2016-11-07 UTC" "2016-12-05 UTC"

Note that every month is represented, and none of the dates is later than the 7th of the month as required.

2.13.7 Calculating time differences between the rows of a dataframe

A common action with time data is to compute the time difference between successive rows of a dataframe. The vector called duration created above is of class difftime and contains 16 times measured in decimal hours:

class(duration)
[1] "difftime"
duration
Time differences in hours
[1] 2.385000 3.271389 3.048889 2.750000 3.078333 2.940278 3.207778
[8] 1.953333 2.372778 1.701944 2.521389 3.254444 2.467778 1.926111
[15] 2.285278 1.813333
attr(,"tzone")
[1] ""

You can compute the differences between successive rows using subscripts, like this:

duration[1:15]-duration[2:16]
Time differences in hours
[1] -0.8863889 0.2225000 0.2988889 -0.3283333 0.1380556
[6] -0.2675000 1.2544444 -0.4194444 0.6708333 -0.8194444
[11] -0.7330556 0.7866667 0.5416667 -0.3591667 0.4719444
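The subscript arithmetic used here is closely related to the built-in diff function, but the sign is reversed: diff subtracts each element from the one that follows it. Here is a minimal sketch of the comparison, assuming a short difftime vector typed in by hand (the first four of the durations printed above) rather than the full dataframe.

```r
# The first four durations printed above, entered directly in hours
duration <- as.difftime(c(2.385000, 3.271389, 3.048889, 2.750000), units = "hours")
n <- length(duration)

by.subscript <- duration[1:(n - 1)] - duration[2:n]   # row i minus row i + 1
by.diff      <- -as.numeric(diff(duration))           # diff gives row i + 1 minus row i

by.subscript
all.equal(as.numeric(by.subscript), by.diff)
```

Whichever form you use, it is worth checking the sign on the first pair of rows by hand.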

You might want to make the differences between successive rows into part of the dataframe (for instance, to relate change in time to one of the explanatory variables in the dataframe). Before doing this, you need to decide on the row in which you want to put the first of the differences. You should be guided by whether the change in time between rows 1 and 2 is related to the explanatory variables in row 1 or row 2. Suppose it is row 1 that we want to contain the first time difference (–0.8864). Because we are working with differences (see p. 785) the vector of differences is shorter by one than the vector from which it was calculated:

length(duration[1:15]-duration[2:16])
[1] 15
length(duration)
[1] 16

so you need to add one NA to the bottom of the vector (in row 16):

diffs <- c(duration[1:15]-duration[2:16],NA)
diffs
[1] -0.8863889 0.2225000 0.2988889 -0.3283333 0.1380556 -0.2675000
[7] 1.2544444 -0.4194444 0.6708333 -0.8194444 -0.7330556 0.7866667
[13] 0.5416667 -0.3591667 0.4719444 NA

Now you can make this new vector part of the dataframe called times:

times$diffs <- diffs
times
   hrs min sec experiment      diffs
1    2  23   6          A -0.8863889
2    3  16  17          A  0.2225000
3    3   2  56          A  0.2988889
4    2  45   0          A -0.3283333
5    3   4  42          A  0.1380556
6    2  56  25          A -0.2675000
7    3  12  28          A  1.2544444
8    1  57  12          A -0.4194444
9    2  22  22          B  0.6708333
10   1  42   7          B -0.8194444
11   2  31  17          B -0.7330556
12   3  15  16          B  0.7866667
13   2  28   4          B  0.5416667
14   1  55  34          B -0.3591667
15   2  17   7          B  0.4719444
16   1  48  48          B         NA

You need to take care when doing things with differences. For instance, is it really appropriate that the difference in row 8 is between the last measurement on treatment A and the first measurement on treatment B? Perhaps what you really want are the time differences within the treatments, so you need to insert another NA in row number 8? If so, then:

times$diffs[8] <- NA
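If there are many treatments, inserting the NA values by hand becomes tedious. One possibility is to compute the differences separately within each treatment. The sketch below assumes a small made-up dataframe called toy (not the times.txt data) and a helper function of our own called next.diff; it uses the built-in function ave to apply the helper within each level of the factor.

```r
# Illustrative data (not the book's times.txt): durations in hours for two treatments
toy <- data.frame(duration   = c(2.4, 3.3, 3.0, 2.4, 1.7, 2.5),
                  experiment = c("A", "A", "A", "B", "B", "B"))

# Each value minus the next one; the last value in the vector gets NA
next.diff <- function(v) c(v[-length(v)] - v[-1], NA)

# ave applies next.diff separately to the rows of each treatment
toy$diffs <- ave(toy$duration, toy$experiment, FUN = next.diff)
toy
```

The last row of each treatment now contains NA, so no difference straddles the boundary between treatments.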

2.13.8 Regression using dates and times

Here is an example where the number of individual insects was monitored each month over the course of 13 months:

data <- read.table("c:\\temp\\timereg.txt",header=T)
attach(data)
head(data)
  survivors       date
1       100 01/01/2011
2        52 01/02/2011
3        28 01/03/2011
4        12 01/04/2011
5         6 01/05/2011
6         5 01/06/2011

The first job, as usual, is to use strptime to convert the character string "01/01/2011" into a date-time object:

dl <- strptime(date,"%d/%m/%Y")

You can see that the object called dl is of class POSIXlt and mode list:

class(dl)
[1] "POSIXlt" "POSIXt"
mode(dl)
[1] "list"

We start by looking at the data using plot with the date dl on the x axis:

windows(7,4)
par(mfrow=c(1,2))
plot(dl,survivors,pch=16,xlab="month")
plot(dl,log(survivors),pch=16,xlab="month")

[Figure: two panels, survivors (left) and log(survivors) (right), each plotted against month.]

Inspection of the relationship suggests an exponential decay in numbers surviving, so we shall analyse a model in which log(survivors) is modelled as a function of time. There are lots of zeros at the end of the time series (once the last of the individuals was dead), so we shall use subset to leave out all of the zeros from the model. Let us try to do the regression analysis of log(survivors) against date:

model <- lm(log(survivors)~dl,subset=(survivors>0))
Error in model.frame.default(formula = log(survivors) ~ dl, subset = (survivors > :
invalid type (list) for variable 'dl'

Oops. Why did that not work? The answer is that you cannot have a list as an explanatory variable in a linear model, and as we have just seen, dl is a list. We need to convert from a list (class = POSIXlt) to a continuous numeric variable (class = POSIXct):

dc <- as.POSIXct(dl)

Now the regression works perfectly when we use the new continuous explanatory variable dc:

model <- lm(log(survivors)~dc,subset=(survivors>0))

You would get the same effect by using as.numeric(dl) in the model formula. We can use the output from this model to add a regression line to the plot of log(survivors) against time using:

abline(model)

You need to take care in reporting the values of slopes in regressions involving date-time objects, because the slopes are rates of change of the response variable per second. Here is the summary:

summary(model)
Call:
lm(formula = log(survivors) ~ dc, subset = (survivors > 0))

Residuals:
     Min       1Q   Median       3Q      Max
-0.27606 -0.18306  0.04492  0.13760  0.39277

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.040e+02  1.531e+01   19.86 2.05e-07 ***
dc          -2.315e-07  1.174e-08  -19.72 2.15e-07 ***

Residual standard error: 0.2383 on 7 degrees of freedom
Multiple R-squared: 0.9823, Adjusted R-squared: 0.9798
F-statistic: 389 on 1 and 7 DF, p-value: 2.152e-07

The slope is $-2.315 \times 10^{-7}$; the change in log(survivors) per second. It might be more useful to express this as the monthly rate. So, with 60 seconds per minute, 60 minutes per hour, 24 hours per day, and (say) 30 days per month, the appropriate rate is

-2.315E-07 * 60 * 60 * 24 * 30
[1] -0.600048

We can check this out by calculating how many survivors we would expect from 100 starters after two months:

100*exp(-0.600048 * 2)
[1] 30.11653

which compares well with our observed count of 28 (see above).
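An alternative to rescaling the slope after the event is to express time in more convenient units before fitting the model. The following sketch assumes only the first five rows shown by head(data) above, typed in by hand, and works in days elapsed since the first observation; the names dates, days and model.days are ours, and the fitted values will not match the summary above because most of the data are missing.

```r
# The first five rows shown by head(data), entered by hand
dates     <- c("01/01/2011", "01/02/2011", "01/03/2011", "01/04/2011", "01/05/2011")
survivors <- c(100, 52, 28, 12, 6)

# Convert to continuous time, then to days elapsed since the first observation
dc   <- as.POSIXct(strptime(dates, "%d/%m/%Y", tz = "UTC"))
days <- as.numeric(difftime(dc, dc[1], units = "days"))

model.days <- lm(log(survivors) ~ days)
coef(model.days)["days"]         # change in log(survivors) per day
coef(model.days)["days"] * 30    # approximate monthly rate, taking 30 days per month
```

Working in days also gives an intercept that is easy to interpret: it is the predicted value of log(survivors) on the first date.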

2.13.9 Summary of dates and times in R

The key thing to understand is the difference between the two representations of dates and times in R. They have unfortunately non-memorable names.

• POSIXlt gives a list containing separate vectors for the year, month, day of the week, day within the year, and suchlike. It is very useful as a categorical explanatory variable (e.g. to get monthly means from data gathered over many years using date$mon).
• POSIXct gives a vector containing the date and time expressed as a continuous variable that you can use in regression models (it is the number of seconds since the beginning of 1970).

You can use other functions like date, but I do not recommend them. If you stick with POSIX you are less likely to get confused.

2.14 Environments

R is built around a highly sophisticated system of naming and locating objects. When you start a session in R, the variables you create are in the global environment .GlobalEnv, which is known more familiarly as the user’s workspace. This is the first place in which R looks for things. Technically, it is the first item on the search path. It can also be accessed by globalenv().

An environment consists of a frame, which is a collection of named objects, and a pointer to an enclosing environment. The most common example is the frame of variables that is local to a function call; its enclosure is the environment where the function was defined. The enclosing environment is distinguished from the parent frame, which is the environment of the caller of a function. There is a strict hierarchy in which R looks for things: it starts by looking in the frame, then in the enclosing frame, and so on.

2.14.1 Using with rather than attach

When you attach a dataframe you can refer to the variables within that dataframe by name. Advanced R users do not routinely employ attach in their work, because it can lead to unexpected problems in resolving names (e.g. you can end up with multiple copies of the same variable name, each of a different length and each meaning something completely different). Most modelling functions like lm or glm have a data= argument so attach is unnecessary in those cases. Even when there is no data= argument it is preferable to wrap the call using with like this:

with(dataframe, function(...))

The with function evaluates an R expression in an environment constructed from data. You will often use the with function with other functions like tapply or plot which have no built-in data argument. If your dataframe is part of the built-in package called datasets (like OrchardSprays) you can refer to the dataframe directly by name:

with(OrchardSprays,boxplot(decrease~treatment))
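Since tapply is one of the functions with no data argument, it makes a convenient illustration of what with saves you. This sketch uses the same built-in OrchardSprays dataframe; the object names means and means2 are ours.

```r
# OrchardSprays ships with R in the datasets package, so nothing needs to be read in
means <- with(OrchardSprays, tapply(decrease, treatment, mean))
means

# The same calculation without with() has to name the dataframe each time
means2 <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)
all.equal(means, means2)
```

The two calls do the same work; the version using with is simply shorter and keeps the name of the dataframe in one place.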

Here we calculate the number of ‘no’ (not infected) cases in the bacteria dataframe which is part of the MASS library:

library(MASS)
with(bacteria,tapply((y=="n"),trt,sum))
placebo    drug   drug+
     12      18      13

Here we plot brain weight against body weight for mammals on log–log axes:

with(mammals,plot(body,brain,log="xy"))

without attaching either dataframe. Here is an unattached dataframe called reg.data:

reg.data <- read.table("c:\\temp\\regression.txt",header=T)

with which we carry out a linear regression and print a summary:

with (reg.data, {
model <- lm(growth~tannin)
summary(model) } )

The linear model fitting function lm knows to look in reg.data to find the variables called growth and tannin because the with function has used reg.data for constructing the environment from which lm is called. Groups of statements (different lines of code) to which the with function applies are contained within curly brackets. An alternative is to define the data environment as an argument in the call to lm like this:

summary(lm(growth~tannin,data=reg.data))

You should compare these outputs with the same example using attach on p. 450. Note that whatever form you choose, you still need to get the dataframe into your current environment by using read.table (if, as here, it is to be read from an external file), or from a library (like MASS to get bacteria and mammals, as above). To see the names of the dataframes in the built-in package called datasets, type:

data()

To see all available data sets (including those in the installed packages), type:

data(package = .packages(all.available = TRUE))

2.14.2 Using attach in this book

I use attach throughout this book because experience has shown that it makes the code easier to understand for beginners. In particular, using attach provides simplicity and brevity, so that we can:

• refer to variables by name, so x rather than dataframe$x
• write shorter models, so lm(y~x) rather than lm(y~x,data=dataframe)
• go straight to the intended action, so plot(y~x) not with(dataframe,plot(y~x))

Nevertheless, readers are encouraged to use with or data= for their own work, and to avoid using attach wherever possible.
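One of the name-resolution problems mentioned in section 2.14.1 is easy to demonstrate. The sketch below assumes a tiny made-up dataframe called dat, and shows that attach works with a copy of the variables, so later changes to the dataframe are not seen by name.

```r
# A small illustrative dataframe (the name dat and its contents are made up)
dat <- data.frame(x = 1:3)
attach(dat)

dat$x <- dat$x * 10   # change the dataframe after attaching it

attached.x <- x       # the attached copy of x has not changed
attached.x
dat$x                 # the dataframe itself has changed

with(dat, x)          # with() looks in the dataframe as it is now
detach(dat)           # tidy up the search path
```

This is one reason for preferring with or data= in your own work: they always use the current contents of the dataframe.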

2.15 Writing R functions

You typically write functions in R to carry out operations that require two or more lines of code to execute, and that you do not want to type lots of times. We might want to write simple functions to calculate measures of central tendency (p. 116), work out factorials (p. 71) and such-like.

Functions in R are objects that carry out operations on arguments that are supplied to them and return one or more values. The syntax for writing a function is

function (argument list) { body }

The first component of the function declaration is the keyword function, which indicates to R that you want to create a function. An argument list is a comma-separated list of formal arguments. A formal argument can be a symbol (i.e. a variable name such as x or y), a statement of the form symbol = expression (e.g. pch=16) or the special formal argument ... (triple dot). The body can be any valid R expression or set of R expressions over one or more lines. Generally, the body is a group of expressions contained in curly brackets {}, with each expression on a separate line (if the body fits on a single line, no curly brackets are necessary). Functions are typically assigned to symbols, but they need not be. This will only begin to mean anything after you have seen several examples in operation.

2.15.1 Arithmetic mean of a single sample

The mean is the sum of the numbers $\sum y$ divided by the number of numbers $n = \sum 1$ (summing over the number of numbers in the vector called y). The R function for n is length(y) and for $\sum y$ is sum(y), so a function to compute arithmetic means is

arithmetic.mean <- function(x) sum(x)/length(x)

We should test the function with some data where we know the right answer:

y <- c(3,3,4,5,5)
arithmetic.mean(y)
[1] 4

Needless to say, there is a built-in function for arithmetic means called mean:

mean(y)
[1] 4

2.15.2 Median of a single sample

The median (or 50th percentile) is the middle value of the sorted values of a vector of numbers:

sort(y)[ceiling(length(y)/2)]

There is a slight hitch here, of course, because if the vector contains an even number of numbers, then there is no middle value. The logic here is that we need to work out the arithmetic average of the two values of y on either side of the middle. The question now arises as to how we know, in general, whether the vector y contains an odd or an even number of numbers, so that we can decide which of the two methods to use. The trick here is to use modulo 2 (p. 18). Now we have all the tools we need to write a general function to calculate medians. Let us call the function med and define it like this:

med <- function(x) {
odd.even <- length(x)%%2
if (odd.even == 0) (sort(x)[length(x)/2]+sort(x)[1+ length(x)/2])/2
else sort(x)[ceiling(length(x)/2)]
}

Notice that when the if statement is true (i.e. we have an even number of numbers) then the expression immediately following the if function is evaluated (this is the code for calculating the median with an even number of numbers). When the if statement is false (i.e. we have an odd number of numbers, and odd.even == 1) then the expression following the else function is evaluated (this is the code for calculating the median with an odd number of numbers). Let us try it out, first with the odd-numbered vector y, then with the even-numbered vector y[-1], after the first element of y (y[1] = 3) has been dropped (using the negative subscript):

med(y)
[1] 4
med(y[-1])
[1] 4.5

You could write the same function in a single (long) line by using ifelse instead of if. You need to remember that the second argument in ifelse is the action to be performed when the condition is true, and the third argument is what to do when the condition is false:

med <- function(x) ifelse(length(x)%%2==1, sort(x)[ceiling(length(x)/2)],
(sort(x)[length(x)/2]+sort(x)[1+ length(x)/2])/2)

Again, you will not be surprised that there is a built-in function for calculating medians, and helpfully it is called median.

2.15.3 Geometric mean

For processes that change multiplicatively rather than additively, neither the arithmetic mean nor the median is an ideal measure of central tendency. Under these conditions, the appropriate measure is the geometric mean. The formal definition of this is somewhat abstract: the geometric mean is the nth root of the product of the data. If we use capital Greek pi ($\prod$) to represent multiplication, and $\hat{y}$ (pronounced y-hat) to represent the geometric mean, then

$\hat{y} = \sqrt[n]{\prod y}$.

Let us take a simple example we can work out by hand: the numbers of insects on 5 plants were as follows: 10, 1, 1000, 1, 10. Multiplying the numbers together gives 100 000. There are five numbers, so we want the fifth root of this. Roots are hard to do in your head, so we will use R as a calculator. Remember that roots are fractional powers, so the fifth root is a number raised to the power 1/5 = 0.2. In R, powers are denoted by the ^ symbol:

100000^0.2
[1] 10

So the geometric mean of these insect numbers is 10 insects per stem. Note that two of the data were exactly like this, so it seems a reasonable estimate of central tendency. The arithmetic mean, on the other hand, is a hopeless measure of central tendency in this case, because the large value (1000) is so influential: it is given by (10 + 1 + 1000 + 1 + 10)/5 = 204.4, and none of the data is close to it.

insects <- c(1,10,1000,10,1)
mean(insects)
[1] 204.4

Another way to calculate geometric mean involves the use of logarithms. Recall that to multiply numbers together we add up their logarithms. And to take roots, we divide the logarithm by the root. So we should be able to calculate a geometric mean by finding the antilog (exp) of the average of the logarithms (log) of the data:

exp(mean(log(insects)))
[1] 10

So here is a function to calculate geometric mean of a vector of numbers x:

geometric <- function (x) exp(mean(log(x)))

We can test it with the insect data:

geometric(insects)
[1] 10

The use of geometric means draws attention to a general scientific issue. Look at the figure below, which shows numbers varying through time in two populations. Now ask yourself which population is the more variable. Chances are, you will pick the upper line:

[Figure: Number (y axis, 0 to 250) plotted against Index (x axis) for the two populations.]

But now look at the scale on the y axis. The upper population is fluctuating 100, 200, 100, 200 and so on. In other words, it is doubling and halving, doubling and halving. The lower curve is fluctuating 10, 20, 10, 20, 10, 20 and so on. It, too, is doubling and halving, doubling and halving. So the answer to the question is that they are equally variable. It is just that one population has a higher mean value than the other (150 vs. 15 in this case). In order not to fall into the trap of saying that the upper curve is more variable than the lower curve, it is good practice to graph the logarithms rather than the raw values of things like population sizes that change multiplicatively, as below.

[Figure: log numbers (y axis, 1 to 6) plotted against Index (x axis, 0 to 20) for the same two populations]

Now it is clear that both populations are equally variable. Note the change of scale, as specified using the ylim=c(1,6) option within the plot function (p. 193).

2.15.4 Harmonic mean

Consider the following problem. An elephant has a territory which is a square of side 2 km. Each morning, the elephant walks the boundary of this territory. He begins the day at a sedate pace, walking the first side of the territory at a speed of 1 km/hr. On the second side, he has sped up to 2 km/hr. By the third side he has accelerated to an impressive 4 km/hr, but this so wears him out that he has to return on the final side at a sluggish 1 km/hr. So what is his average speed over the ground? You might say he travelled at 1, 2, 4 and 1 km/hr so the average speed is (1 + 2 + 4 + 1)/4 = 8/4 = 2 km/hr. But that is wrong. Can you see how to work out the right answer?

Recall that velocity is defined as distance travelled divided by time taken. The distance travelled is easy: it is just 4 × 2 = 8 km. The time taken is a bit harder. The first edge was 2 km long, and travelling at 1 km/hr this must have taken 2 hr. The second edge was 2 km long, and travelling at 2 km/hr this must have taken 1 hr. The third edge was 2 km long and travelling at 4 km/hr this must have taken 0.5 hr. The final edge was 2 km long and travelling at 1 km/hr this must have taken 2 hr. So the total time taken was 2 + 1 + 0.5 + 2 = 5.5 hr. So the average speed is not 2 km/hr but 8/5.5 = 1.4545 km/hr. The way to solve this problem is to use the harmonic mean.
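The arithmetic above is easy to check in R. The following is only a small sketch, not code from the original text; the vector names side, speed and time.taken are chosen here purely for illustration.

```r
# Illustrative check of the elephant calculation (all names are assumptions)
side <- rep(2, 4)              # each side of the territory is 2 km long
speed <- c(1, 2, 4, 1)         # speed on each side, in km/hr
time.taken <- side / speed     # hours spent on each side: 2, 1, 0.5, 2
sum(side) / sum(time.taken)    # total distance / total time = 8/5.5
mean(speed)                    # the naive (wrong) answer of 2 km/hr
```

Dividing total distance by total time weights each speed by the time spent at that speed, which is why the slow sides dominate the answer.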

The harmonic mean is the reciprocal of the average of the reciprocals. The average of our reciprocals is:

$$\frac{1}{4}\left(\frac{1}{1} + \frac{1}{2} + \frac{1}{4} + \frac{1}{1}\right) = \frac{2.75}{4} = 0.6875.$$

The reciprocal of this average is the harmonic mean

$$\frac{1}{0.6875} = \frac{4}{2.75} = 1.4545.$$

In symbols, therefore, the harmonic mean, $\tilde{y}$ (y-curl), is given by

$$\tilde{y} = \frac{1}{\left(\sum (1/y)\right)/n} = \frac{n}{\sum (1/y)}.$$

An R function for calculating harmonic means, therefore, could be

    harmonic <- function (x) 1/mean(1/x)

and testing it on our elephant data gives

    harmonic(c(1,2,4,1))
    [1] 1.454545

2.15.5 Variance

A measure of variability is perhaps the most important quantity in statistical analysis. The greater the variability in the data, the greater will be our uncertainty in the values of parameters estimated from the data, and the less will be our ability to distinguish between competing hypotheses about the data.

The variance of a sample is measured as a function of 'the sum of the squares of the difference between the data and the arithmetic mean'. This important quantity is called the 'sum of squares':

$$SS = \sum (y - \bar{y})^2.$$

Naturally, this quantity gets bigger with every new data point you add to the sample. An obvious way to compensate for this is to measure variability as the average of the squared departures from the mean (the 'mean square deviation'). There is a slight problem, however. Look at the formula for the sum of squares, $SS$, above and ask yourself what you need to know before you can calculate it. You have the data, $y$, but the only way you can know the sample mean, $\bar{y}$, is to calculate it from the data (you will never know $\bar{y}$ in advance).

2.15.6 Degrees of freedom

To complete our calculation of the variance we need the degrees of freedom (d.f.). This important concept in statistics is defined as follows:

$$\text{d.f.} = n - k,$$

which is the sample size, $n$, minus the number of parameters, $k$, estimated from the data. For the variance, we have estimated one parameter from the data, $\bar{y}$, and so there are $n - 1$ degrees of freedom. In a linear regression, we estimate two parameters from the data, the slope and the intercept, and so there are $n - 2$ degrees of freedom in a regression analysis.

Variance is denoted by the lower-case Latin letter s squared: $s^2$. The square root of variance, $s$, is called the standard deviation. We always calculate variance as

$$\text{variance} = s^2 = \frac{\text{sum of squares}}{\text{degrees of freedom}}.$$

Consider the following data:

    y <- c(13,7,5,12,9,15,6,11,9,7,12)

We need to write a function to calculate the sample variance: we call it variance and define it like this:

    variance <- function(x) sum((x - mean(x))^2)/(length(x)-1)

and use it like this:

    variance(y)
    [1] 10.25455

Our measure of variability in these data, the variance, is thus 10.25455. It is said to be an unbiased estimator because we divide the sum of squares by the degrees of freedom ($n - 1$) rather than by the sample size, $n$, to compensate for the fact that we have estimated one parameter from the data. So the variance is close to the average squared difference between the data and the mean, especially for large samples, but it is not exactly equal to the mean squared deviation. Needless to say, R has a built-in function to calculate variance called var:

    var(y)
    [1] 10.25455

2.15.7 Variance ratio test

How do we know if two variances are significantly different from one another? One of several sensible ways to do this is to carry out Fisher's F test, which is simply the ratio of the two variances (see p. 287). Here is a function to print the p value (p. 347) associated with a comparison of the larger and smaller variances:

    variance.ratio <- function(x,y) {
        v1 <- var(x)
        v2 <- var(y)
        # put the larger variance in the numerator
        if (v1 > v2) {
            vr <- v1/v2
            df1 <- length(x)-1
            df2 <- length(y)-1
        } else {
            vr <- v2/v1
            df1 <- length(y)-1
            df2 <- length(x)-1
        }
        # two-tailed p value from the F distribution
        2*(1-pf(vr,df1,df2))
    }

The last line of our function works out the probability of getting an F ratio as big as vr or bigger by chance alone if the two variances were really the same, using the cumulative probability of the F distribution, which is an R function called pf. We need to supply pf with three arguments: the size of the variance ratio (vr), the number of degrees of freedom in the numerator (df1 = 9) and the number of degrees of freedom in the denominator (df2 = 9).

Here are some data to test our function. They are normally distributed random numbers but the first set has a variance of 4 and the second a variance of 16 (i.e. standard deviations of 2 and 4, respectively):

    a <- rnorm(10,15,2)
    b <- rnorm(10,15,4)

Here is our function in action:

    variance.ratio(a,b)
    [1] 0.01593334

We can compare our p value with the p value given by the built-in function called var.test:

    var.test(a,b)

            F test to compare two variances

    data:  a and b
    F = 0.1748, num df = 9, denom df = 9, p-value = 0.01593
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.04340939 0.70360673
    sample estimates:
    ratio of variances
             0.1747660

2.15.8 Using variance

Variance is used in two main ways: for establishing measures of unreliability (e.g. confidence intervals) and for testing hypotheses (e.g. Student's t test). Here we will concentrate on the former; the latter is discussed in Chapter 8.

Consider the properties that you would like a measure of unreliability to possess. As the variance of the data increases, what would happen to the unreliability of estimated parameters? Would it go up or down? Unreliability would go up as variance increased, so we would want to have the variance on the top (the numerator) of any divisions in our formula for unreliability:

$$\text{unreliability} \propto s^2.$$

What about sample size? Would you want your estimate of unreliability to go up or down as sample size, $n$, increased? You would want unreliability to go down as sample size went up, so you would put sample size on the bottom of the formula for unreliability (i.e. in the denominator):

$$\text{unreliability} \propto \frac{s^2}{n}.$$
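The proportionality above can be illustrated with a quick simulation. This is a sketch under stated assumptions rather than anything from the original text: the sample sizes, the seed and the object names (sizes, var.of.means) are all arbitrary choices made for illustration.

```r
# Sketch: the variance of a sample mean falls in proportion to 1/n
set.seed(1)                          # arbitrary seed, for reproducibility only
sizes <- c(5, 20, 80)                # hypothetical sample sizes
var.of.means <- sapply(sizes, function(n)
    var(replicate(5000, mean(rnorm(n, mean = 15, sd = 2)))))
# compare the simulated values with s^2/n, where s^2 = 4
cbind(n = sizes, simulated = var.of.means, expected = 4 / sizes)
```

Each fourfold increase in sample size should cut the variance of the mean to roughly a quarter, which is the behaviour the formula asks for.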

Finally, consider the units in which unreliability is measured. What are the units in which our current measure is expressed? Sample size is dimensionless, but variance is based on the sum of squared differences, so it has dimensions of mean squared. So if the mean was a length in cm, the variance would be an area in cm$^2$. This is an unfortunate state of affairs. It would make good sense to have the dimensions of the unreliability measure and of the parameter whose unreliability it is measuring the same. That is why all unreliability measures are enclosed inside a big square root term. Unreliability measures are called standard errors. What we have just worked out is the standard error of the mean,

$$se_{\bar{y}} = \sqrt{\frac{s^2}{n}},$$

where $s^2$ is the variance and $n$ is the sample size. There is no built-in R function to calculate the standard error of a mean, but it is easy to write one:

    se <- function(x) sqrt(var(x)/length(x))

You can refer to functions from within other functions. Recall that a confidence interval (CI) is 't from tables times the standard error':

$$CI = t_{\alpha/2,\,\text{d.f.}} \times se.$$

The R function qt gives the value of Student's t with $1 - \alpha/2 = 0.975$ and degrees of freedom d.f. = length(x)-1. Here is a function called ci95 which uses our function se to compute 95% confidence intervals for a mean:

    ci95 <- function(x) {
        t.value <- qt(0.975,length(x)-1)
        standard.error <- se(x)
        ci <- t.value*standard.error
        cat("95% Confidence Interval = ", mean(x)-ci, "to ", mean(x)+ci, "\n")
    }

We can test the function with 150 normally distributed random numbers with mean 25 and standard deviation 3:

    x <- rnorm(150,25,3)
    ci95(x)
    95% Confidence Interval = 24.76245 to 25.74469

If we were to repeat the experiment, we could be 95% certain that the mean of the new sample would lie between 24.76 and 25.74.

We can use the se function to investigate how the standard error of the mean changes with the sample size. First we generate one set of data from which we shall take progressively larger samples:

    xv <- rnorm(30)

Now in a loop take samples of size 2, 3, 4, ..., 30:

    sem <- numeric(30)
    sem[1] <- NA
    for(i in 2:30) sem[i] <- se(xv[1:i])

    plot(1:30,sem,ylim=c(0,0.8),
        ylab="standard error of mean",xlab="sample size n",pch=16)

You can see clearly that as the sample size falls below about n = 15, the standard error of the mean increases rapidly. The blips in the line are caused by outlying values being included in the calculations of the standard error with increases in sample size. The smooth curve is easy to compute: since the values in xv came from a standard normal distribution with mean 0 and standard deviation 1, the average curve would be $1/\sqrt{n}$, which we can add to our graph using lines:

    lines(2:30,1/sqrt(2:30))

[Figure: standard error of mean (y axis, 0.0 to 0.8) plotted against sample size n (x axis, 0 to 30), with the smooth curve added]

You can see that our single simulation captured the essence of the shape but was wrong in detail, especially for the samples with the lowest replication. However, our single sample was reasonably good for n > 24.

2.15.9 Deparsing: A graphics function for error bars

There is no function in the base package of R for drawing error bars on bar charts, although several contributed packages use the arrows function for this purpose (p. 204). Here is a simple, stripped-down function that is supplied with three arguments: the heights of the bars (yv), the lengths (up and down) of the error bars (z) and the labels for the bars on the x axis (nn).

The process of deparsing turns an unevaluated expression into a character string. One of the important uses of deparsing is in functions that produce output that you want to label with the particular names of the variables that were passed to the function. For instance, if the function is written in terms of a continuous response variable yv, you might want to label the axes of a plot produced by the function with, say, biomass in place of yv. Inside the error.bars function, the barplot function uses the deparse function to create the appropriate text for ylab.
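As a minimal sketch of the deparsing idea described above (the helper name show.name and the data values are invented for illustration and are not part of the original text), deparse combined with substitute recovers the name of whatever variable the caller supplied:

```r
# Sketch: recover the name of the variable that was passed to a function
show.name <- function(yv) deparse(substitute(yv))   # show.name is a made-up helper
biomass <- c(5.2, 6.1, 4.8)                          # invented values
show.name(biomass)                                   # the string "biomass", not "yv"
```

Here substitute returns the unevaluated expression that was supplied for yv, and deparse converts that expression into a character string suitable for an axis label.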

    error.bars <- function(yv,z,nn) {
        xv <- barplot(yv,ylim=c(0,(max(yv)+max(z))),names=nn,
            ylab=deparse(substitute(yv)))
        g <- (max(xv)-min(xv))/50
        for (i in 1:length(xv)) {
            lines(c(xv[i],xv[i]),c(yv[i]+z[i],yv[i]-z[i]))
            lines(c(xv[i]-g,xv[i]+g),c(yv[i]+z[i],yv[i]+z[i]))
            lines(c(xv[i]-g,xv[i]+g),c(yv[i]-z[i],yv[i]-z[i]))
        }
    }

Here is the error.bars function in action with the plant competition data (p. 426):

    comp <- read.table("c:\\temp\\competition.txt",header=T)
    attach(comp)
    names(comp)
    [1] "biomass" "clipping"

    se <- rep(28.75,5)
    # factor() is needed where read.table leaves text columns as character (R >= 4.0.0)
    labels <- as.character(levels(factor(clipping)))
    ybar <- as.vector(tapply(biomass,clipping,mean))

Now invoke the function with the means, standard errors and bar labels:

    error.bars(ybar,se,labels)

[Figure: bar chart of ybar (y axis, 0 to 600) with error bars for the five treatments control, n25, n50, r10 and r5]

Here is a function to plot error bars on a scatterplot in both the x and y directions:

    xy.error.bars <- function (x,y,xbar,ybar) {
        plot(x, y, pch=16, ylim=c(min(y-ybar),max(y+ybar)),
            xlim=c(min(x-xbar),max(x+xbar)))
        arrows(x, y-ybar, x, y+ybar, code=3, angle=90, length=0.1)
        arrows(x-xbar, y, x+xbar, y, code=3, angle=90, length=0.1)
    }

We test it with these data:

    x <- rnorm(10,25,5)
    y <- rnorm(10,100,20)
    xb <- runif(10)*5
    yb <- runif(10)*20
    xy.error.bars(x,y,xb,yb)

[Figure: scatterplot of y against x with error bars in both the x and y directions]

2.15.10 The switch function

When you want a function to do different things in different circumstances, then the switch function can be useful. Here we write a function that can calculate any one of four different measures of central tendency: arithmetic mean, geometric mean, harmonic mean or median (see pp. 115–119 for explanations of the separate functions). The character variable called measure should take one value of Mean, Geometric, Harmonic or Median; any other text will lead to the error message Measure not included. Alternatively, you can specify the number of the switch (e.g. 1 for Mean, 4 for Median).

    central <- function(y, measure) {
        switch(measure,
            Mean = mean(y),
            Geometric = exp(mean(log(y))),
            Harmonic = 1/mean(1/y),
            Median = median(y),
            stop("Measure not included"))
    }

Note that you have to include the character strings in quotes as arguments to the function, but they must not be in quotes within the switch function itself.

    central(rnorm(100,10,2),"Harmonic")
    [1] 9.554712

    central(rnorm(100,10,2),4)
    [1] 10.46240

2.15.11 The evaluation environment of a function

When a function is called or invoked a new evaluation frame is created. In this frame the formal arguments are matched with the supplied arguments according to the rules of argument matching (below). The statements in the body of the function are evaluated sequentially in this environment frame.

The first thing that occurs in a function evaluation is the matching of the formal to the actual or supplied arguments. This is done by a three-pass process:

• Exact matching on tags. For each named supplied argument the list of formal arguments is searched for an item whose name matches exactly.
• Partial matching on tags. Each named supplied argument is compared to the remaining formal arguments using partial matching. If the name of the supplied argument matches exactly with the first part of a formal argument then the two arguments are considered to be matched.
• Positional matching. Any unmatched formal arguments are bound to unnamed supplied arguments, in order. If there is a ... argument, it will take up the remaining arguments, tagged or not.
• If any arguments remain unmatched an error is declared.

Supplied arguments and default arguments are treated differently. The supplied arguments to a function are evaluated in the evaluation frame of the calling function. The default arguments to a function are evaluated in the evaluation frame of the function. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame.
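The three matching passes, and the fact that changes to an argument stay local, can be sketched as follows. The functions f and double.it and their argument names are hypothetical, introduced only for illustration.

```r
# Sketch of argument matching; f and its argument names are hypothetical
f <- function(colour, count, ...) {
    extras <- list(...)
    paste("colour =", colour, "| count =", count, "| extras =", length(extras))
}
r1 <- f(count = 2, colour = "red")   # exact matching on tags: order is irrelevant
r2 <- f(col = "red", cou = 2)        # partial matching: a unique leading part of each name
r3 <- f("red", 2, 99, z = 1)         # positional matching; 99 and z = 1 are taken up by ...

# Changing an argument inside a function does not alter the caller's variable
double.it <- function(v) { v <- v * 2; v }
a <- 1:3
b <- double.it(a)
a                                    # still 1 2 3 in the calling frame
```

Partial matching is convenient at the command line but fragile in programs, because adding a new formal argument can make a previously unique abbreviation ambiguous.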
2.15.12 Scope

The scoping rules are the set of rules used by the evaluator to find a value for a symbol. A symbol can be either bound or unbound. All of the formal arguments to a function provide bound symbols in the body of the function. Any other symbols in the body of the function are either local variables or unbound variables. A local variable is one that is defined within the function, typically by having it on the left-hand side of an assignment. During the evaluation process if an unbound symbol is detected then R attempts to find a value for it: the environment of the function is searched first, then its enclosure and so on until the global environment is reached. The value of the first match is then used.

2.15.13 Optional arguments

Here is a function called charplot that produces a scatterplot of x and y using solid red circles as the plotting symbols: there are two essential arguments (x and y) and two optional (pc and co) to control selection of the plotting symbol and its colour:

    charplot <- function(x,y,pc=16,co="red") {
        plot(y~x,pch=pc,col=co)
    }

The optional arguments are given their default values using = in the argument list. To execute the function you need only provide the vectors of x and y:

    charplot(1:10,1:10)

to get solid red circles. You can get a different plotting symbol simply by adding a third argument

    charplot(1:10,1:10,17)

which produces red solid triangles (pch=17). If you want to change only the colour (the fourth argument) then you have to specify the variable name because the optional arguments would not then be presented in sequence. So, for navy blue solid circles, you put:

    charplot(1:10,1:10,co="navy")

To change both the plotting symbol and the colour you do not need to specify the variable names, so long as the plotting symbol is the third argument and the colour is the fourth:

    charplot(1:10,1:10,15,"green")

This produces solid green squares. Reversing the optional arguments does not work:

    charplot(1:10,1:10,"green",15)

(this uses the letter g as the plotting symbol and colour no. 15). If you specify both variable names, then the order does not matter:

    charplot(1:10,1:10,co="green",pc=15)

This produces solid green squares despite the arguments being out of sequence.

2.15.14 Variable numbers of arguments (...)

Some applications are much more straightforward if the number of arguments does not need to be specified in advance. There is a special formal name ... (triple dot) which is used in the argument list to specify that an arbitrary number of arguments are to be passed to the function. Here is a function that takes any number of vectors and calculates their means and variances:

    many.means <- function (...) {
        data <- list(...)
        n <- length(data)
        means <- numeric(n)
        vars <- numeric(n)
        for (i in 1:n) {
            means[i] <- mean(data[[i]])
            vars[i] <- var(data[[i]])
        }
        print(means)
        print(vars)
        invisible(NULL)
    }

The main features to note are these. The function definition has ... as its only argument. The 'triple dot' argument ... allows the function to accept additional arguments of unspecified name and number, and this introduces tremendous flexibility into the structure and behaviour of functions. The first thing done inside the function is to create an object called data out of the list of vectors that are actually supplied in any particular case. The length of this list is the number of vectors, not the lengths of the vectors themselves (these could differ from one vector to another, as in the example below). Then the two output variables (means and vars) are defined to have as many elements as there are vectors in the parameter list. The loop goes from 1 to the number of vectors, and for each vector uses the built-in functions mean and var to compute the answers we require. It is important to note that because data is a list, we use double [[ ]] subscripts in addressing its elements.

Now try it out. To make things difficult we shall give it three vectors of different lengths. All come from the standard normal distribution (with mean 0 and variance 1) but x is 100 in length, y is 200 and z is 300 numbers long:

    x <- rnorm(100)
    y <- rnorm(200)
    z <- rnorm(300)

Now we invoke the function:

    many.means(x,y,z)
    [1] -0.039181830 0.003613744 0.050997841
    [1] 1.146587 0.989700 0.999505

As expected, all three means (top row) are close to 0, and all three variances are close to 1 (bottom row). You can use ... to absorb some arguments into an intermediate function which can then be extracted by functions called subsequently. R has a form of lazy evaluation of function arguments in which arguments are not evaluated until they are needed (in some cases the argument will never be evaluated).
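Both of the closing remarks above can be sketched in a few lines. The function names lazy.demo and wrapped.mean are hypothetical and are not taken from the original text; mean and its trim argument are standard R.

```r
# Sketch: an argument that is never used is never evaluated
lazy.demo <- function(a, b) {
    a * 2            # b is not referred to, so its expression is never run
}
lazy.demo(5, stop("this expression is never evaluated"))

# Sketch: ... absorbs extra arguments and hands them on to another function
wrapped.mean <- function(x, ...) mean(x, ...)   # wrapper name is illustrative
wrapped.mean(c(1, 2, 3, 100), trim = 0.25)      # trim is passed through to mean
```

The first call returns a value instead of stopping with an error, because the second argument is a promise that is never forced. In the second call the wrapper itself knows nothing about trim; it simply forwards whatever arrives in ... to mean.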
2.15.15 Returning values from a function

Often you want a function to return a single value (like a mean or a maximum), in which case you simply leave the last line of the function unassigned (i.e. there is no 'gets arrow' on the last line). Here is a function to return the median value of the parallel maxima (built-in function pmax) of two vectors supplied as arguments:

    parmax <- function (a,b) {
        c <- pmax(a,b)
        median(c)
    }

Here is the function in action: the unassigned last line median(c) returns the answer

    x <- c(1,9,2,8,3,7)
    y <- c(9,2,8,3,7,2)
    parmax(x,y)
    [1] 8

If you want to return two or more variables from a function you should use return with a list containing the variables to be returned. Suppose we wanted the median value of both the parallel maxima and the parallel minima to be returned:

    parboth <- function (a,b) {
        c <- pmax(a,b)
        d <- pmin(a,b)
        answer <- list(median(c),median(d))
        names(answer)[[1]] <- "median of the parallel maxima"
        names(answer)[[2]] <- "median of the parallel minima"
        return(answer)
    }

Here it is in action with the same x and y data as above:

    parboth(x,y)
    \$"median of the parallel maxima"
    [1] 8

    \$"median of the parallel minima"
    [1] 2

The point is that you make the multiple returns into a list, then return the list. The provision of multi-argument returns (e.g. return(median(c),median(d)) in the example above) has been deprecated in R and a warning is given (recent versions of R reject it with an error), as multi-argument returns were never documented in S, and whether or not the list was named differs from one version of S to another.

2.15.16 Anonymous functions

Here is an example of an anonymous function. It generates a vector of values but the function is not allocated a name (although the answer could be).

    (function(x,y) { z <- 2*x^2 + y^2; x+y+z })(0:7, 1)
    [1]   2   5  12  23  38  57  80 107

The function first uses the supplied values of x and y to calculate z, then returns the value of x+y+z evaluated for eight values of x (from 0 to 7) and one value of y (1). Anonymous functions are used most frequently with apply, tapply, sapply and lapply (p. 63).

2.15.17 Flexible handling of arguments to functions

Because of the lazy evaluation practised by R, it is very simple to deal with missing arguments in function calls, giving the user the opportunity to specify the absolute minimum number of arguments, but to override the default arguments if they want to. As a simple example, take a function plotx2 that we want to work when provided with either one or two arguments. In the one-argument case (only an integer x > 1 provided), we want it to plot $z^2$ against z for z = 1 to x in steps of 1. In the second case, when y is supplied, we want it to plot y against z for z = 1 to x:

    plotx2 <- function (x, y = z^2) {
        z <- 1:x
        plot(z,y,type="l")
    }

In many other languages, the first line would fail because z is not defined at this point. But R does not evaluate an expression until the body of the function actually calls for it to be evaluated (i.e. never, in the case where y is supplied as a second argument). Thus for the one-argument case we get a graph of $z^2$ against z and in the two-argument case we get a graph of y against z (in this example, the straight line 1:12 vs. 1:12). We rescale the windows (width then height in inches) so that the graphs come out looking roughly square rather than elongated:

    windows(7,4)
    par(mfrow=c(1,2))
    plotx2(12)
    plotx2(12,1:12)

[Figure: two panels of y against z for z = 1 to 12; left, the curve of z squared (y from 0 to 140); right, the straight line 1:12 against 1:12]

It is possible to access the actual (not default) expressions used as arguments inside the function. The mechanism is implemented via promises. You can find an explanation of promises by typing ?promise at the command prompt.

2.15.18 Structure of an object: str

Here is one of the simplest objects in R – a vector of length 7 containing real numbers:

    (y <- seq(0.9,0.3,-0.1))
    [1] 0.9 0.8 0.7 0.6 0.5 0.4 0.3

We can ask R about the structure of the object called y using str:

    str(y)
     num [1:7] 0.9 0.8 0.7 0.6 0.5 0.4 0.3

We discover that it is numeric (in both class and mode), a vector of length 7 [1:7], and (because the vector is short) we see all of the values listed. For longer vectors we would see the first few values, depending on what would fit on a single printed line (as affected by the number of decimal places displayed).

What about a slightly more complicated object? Here is a dataframe with two columns:

    data <- read.table("c:\\temp\\spino.txt",header=T)
    str(data)
    'data.frame': 109 obs. of 2 variables:
     \$ condition: Factor w/ 5 levels "better","much.better",..: 4 1 1 4 4 4 1 5 4 1 ...
     \$ treatment: Factor w/ 3 levels "drug.A","drug.B",..: 1 2 2 3 2 2 1 1 2 2 ...

We learn that data is a dataframe with 109 rows and 2 columns, then we get detailed information on each of the columns in turn. The first is a variable called condition which is a factor with five levels (the first two levels of which (in alphabetical order) are better and much.better). The second variable is called treatment and is a factor with three levels. The numbers are the integer representations of the factor levels in the first 10 rows of the dataframe. Because we can see only factor levels 1 and 2, we would need to do more work to discover what factor level 4 of condition, or level 3 of treatment, actually represented:

    levels(data\$condition);levels(data\$treatment)
    [1] "better" "much.better" "much.worse" "no.change" "worse"
    [1] "drug.A" "drug.B" "placebo"

We often want to know about the structure of model objects. Here is the simplest case, with a linear regression model (see p. 450 for details):

    reg <- read.table("c:\\temp\\tannin.txt",header=T)
    reg.model <- lm(growth~tannin,data=reg)
    str(reg.model)
    List of 12
     \$ coefficients : Named num [1:2] 11.76 -1.22
      ..- attr(*, "names")= chr [1:2] "(Intercept)" "tannin"
     \$ residuals : Named num [1:9] 0.244 -0.539 -1.322 2.894 -0.889 ...
      ..- attr(*, "names")= chr [1:9] "1" "2" "3" "4" ...
     \$ effects : Named num [1:9] -20.67 -9.42 -1.32 2.83 -1.01 ...
      ..- attr(*, "names")= chr [1:9] "(Intercept)" "tannin" "" "" ...
     \$ rank : int 2
     \$ fitted.values: Named num [1:9] 11.76 10.54 9.32 8.11 6.89 ...
      ..- attr(*, "names")= chr [1:9] "1" "2" "3" "4" ...
     \$ assign : int [1:2] 0 1
     \$ qr :List of 5
      ..\$ qr : num [1:9, 1:2] -3 0.333 0.333 0.333 0.333 ...
      .. ..- attr(*, "dimnames")=List of 2
      .. .. ..\$ : chr [1:9] "1" "2" "3" "4" ...
      .. .. ..\$ : chr [1:2] "(Intercept)" "tannin"
      .. ..- attr(*, "assign")= int [1:2] 0 1
      ..\$ qraux: num [1:2] 1.33 1.26
      ..\$ pivot: int [1:2] 1 2
      ..\$ tol : num 1e-07
      ..\$ rank : int 2
      ..- attr(*, "class")= chr "qr"
     \$ df.residual : int 7
     \$ xlevels : Named list()
     \$ call : language lm(formula = growth ~ tannin, data = reg)
     \$ terms :Classes 'terms', 'formula' length 3 growth ~ tannin
      .. ..- attr(*, "variables")= language list(growth, tannin)
      .. ..- attr(*, "factors")= int [1:2, 1] 0 1
      .. .. ..- attr(*, "dimnames")=List of 2
      .. .. .. ..\$ : chr [1:2] "growth" "tannin"
      .. .. .. ..\$ : chr "tannin"
      .. ..- attr(*, "term.labels")= chr "tannin"
      .. ..- attr(*, "order")= int 1
      .. ..- attr(*, "intercept")= int 1
      .. ..- attr(*, "response")= int 1
      .. ..- attr(*, ".Environment")=
      .. ..- attr(*, "predvars")= language list(growth, tannin)
      .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
      .. .. ..- attr(*, "names")= chr [1:2] "growth" "tannin"
     \$ model :'data.frame': 9 obs. of 2 variables:
      ..\$ growth: int [1:9] 12 10 8 11 6 7 2 3 3
      ..\$ tannin: int [1:9] 0 1 2 3 4 5 6 7 8
      ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 growth ~ tannin
      .. .. ..- attr(*, "variables")= language list(growth, tannin)
      .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
      .. .. .. ..- attr(*, "dimnames")=List of 2
      .. .. .. .. ..\$ : chr [1:2] "growth" "tannin"
      .. .. .. .. ..\$ : chr "tannin"
      .. .. ..- attr(*, "term.labels")= chr "tannin"
      .. .. ..- attr(*, "order")= int 1
      .. .. ..- attr(*, "intercept")= int 1
      .. .. ..- attr(*, "response")= int 1
      .. .. ..- attr(*, ".Environment")=
      .. .. ..- attr(*, "predvars")= language list(growth, tannin)
      .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
      .. .. .. ..- attr(*, "names")= chr [1:2] "growth" "tannin"
     - attr(*, "class")= chr "lm"

There are 12 elements in the list representing the structure of this linear model object: coefficients, residuals, effects, rank, fitted values, assign, qr, residual degrees of freedom, xlevels, call, terms and model. Each of these, in turn, is broken down into components; for instance, the two coefficients are numbers (11.76 and –1.22), and their names are (Intercept) and tannin. You should work down the list and see if you can figure out why each row is an important part of the model.

For more complicated models, the structure is even more involved.
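Once the structure is known, individual components can be pulled out of the model object by name. The sketch below re-enters the nine growth and tannin values as they are printed in the str output above, so that it runs without the file tannin.txt; building the dataframe this way is an assumption made for illustration, not the original workflow.

```r
# Data re-entered from the str output above (the original file is not assumed to be available)
reg <- data.frame(growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3), tannin = 0:8)
reg.model <- lm(growth ~ tannin, data = reg)
names(reg.model)              # the names of the 12 components
coef(reg.model)               # extractor function for the coefficients component
reg.model[["df.residual"]]    # direct access to a single element of the list
```

Extractor functions such as coef are generally preferable to indexing the list directly, because they continue to work for other classes of model whose internal layout differs.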
Here is the structure of a generalized linear model with a binary response and binomial errors:

    data <- read.table("c:\\temp\\spino.txt",header=T)
    attach(data)
    y <- factor(1+(condition=="better")+(condition=="much.better"))
    model <- glm(y~treatment,binomial)
    summary(model)

    Call:
    glm(formula = y ~ treatment, family = binomial)

    Deviance Residuals:
        Min      1Q  Median      3Q     Max
    -0.9741 -0.9741 -0.7747  1.3953  1.6431

    Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
    (Intercept)       -0.6131     0.3444  -1.780    0.075 .
    treatmentdrug.B    0.1141     0.4617   0.247    0.805
    treatmentplacebo  -0.4367     0.5581  -0.783    0.434
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: 139.67 on 108 degrees of freedom
    Residual deviance: 138.54 on 106 degrees of freedom
    AIC: 144.54

We have carried out a one-way analysis of deviance with a two-level response (improved or not) and a three-level factor as explanatory variable (treatment). There was no significant difference between drug.B and the placebo, nor between either of these and drug.A (the intercept). Here is the structure of the object called model:

    str(model)

As you will see, this is a very large object, comprising a list with 30 components covering all aspects of the model: the coefficients, fitted values, effects plus all the details of the family and the model formula. I recommend you work your way slowly down the whole list and try to understand why each of the rows represents an essential piece of information about the model.

2.16 Writing from R to file

You often want to save an object that you have created in R.

2.16.1 Saving your work

To save your current R session, so that you can load it again later and continue your work where you left off, use save like this:

    save(list = ls(all=TRUE), file= "c:\\temp\\session")

Then, on another occasion, when you want to restore the data, use load like this:

    load(file= "c:\\temp\\session")

2.16.2 Saving history

It is very useful to be able to see all of the lines of R code that one has typed during a particular session. You may want to copy the lines into a text editor to make minor alterations, or you may simply want to paste multiple lines back into R to repeat certain operations.
To see all of your lines of input code just type:

    history(Inf)

This opens a window called R History through which you can scroll, highlight and copy using Ctrl + C. You could then open a new Untitled R Editor window (File > New Script) and paste the selected lines of code using Ctrl + V. Alternatively, you might want to save the entire history to file, for use on a subsequent occasion:

    savehistory(file = "c:\\temp\\session18.txt")

To retrieve the history for use on another occasion use:

    loadhistory(file = "c:\\temp\\session18.txt")

Then you can access it by history(Inf) in the new session.

2.16.3 Saving graphics

For speed and simplicity, you can click on a graph (the bar on top of the R Graphics Device goes darker blue) then press Ctrl + C (to copy the graph), then switch to a word processor and paste using Ctrl + V. For publication-quality graphics, however, you will want to save each figure in a separate file as a PDF or PostScript file. There are a great many options (see ?pdf and ?postscript for details) but the basics are very simple. Here we set the graphics device to produce a PDF:

    pdf("c:\\temp\\fig1.pdf")

Now, any plot directives are sent to this file. To switch off writing graphics to file, type:

    dev.off()

2.16.4 Saving data produced within R to disc

It is often convenient to generate numbers within R and then to use them somewhere else (in a spreadsheet, say). Here are 1000 random integers from a negative binomial distribution with mean mu=1.2 and clumping parameter or aggregation parameter (k) size = 1.0, that I want to save as a single column of 1000 rows in a file called nbnumbers.txt in the temp directory on the c: drive:

    nbnumbers <- rnbinom(1000, size=1, mu=1.2)

There is a general point to note here about the number and order of arguments provided to built-in functions like rnbinom. This function can have two of three optional arguments: size, mean (mu) and probability (prob) (see ?rnbinom). R knows that the unlabelled number 1000 refers to the number of numbers required because of its position, first in the list of arguments.
If you are prepared to specify the names of the arguments, then the order in which they appear is irrelevant: rnbinom(1000, size=1, mu=1.2) and rnbinom(1000, mu=1.2, size=1) are equivalent calls (and would give the same output for a given random seed). But if optional arguments are not labelled, then their order is crucial: so rnbinom(1000, 0.9, 0.6) is different from rnbinom(1000, 0.6, 0.9) because if there are no labels, then the second argument must be size and the third argument must be prob.

To export the numbers I use write like this, specifying that the numbers are to be output in a single column (i.e. with third argument 1 because the default is 5 columns):

write(nbnumbers,"c:\\temp\\nbnumbers.txt",1)

Sometimes you will want to save a table or a matrix of numbers to file. There is an issue here, in that the write function transposes rows and columns. It is much simpler to use the write.table function which does not transpose the rows and columns. Here is a matrix of 1000 rows and 100 columns made up of random

158
• Clever is good, but clear is better.
• Test each line as you go along, to make sure it does what you want it to do.
• Put plenty of comments in the code, using # for documentation.
• Use variable names and function names that are self-explanatory.
• Do not use attach in programs.
• Use with, or refer to variables within named dataframes.
• Try different ways of doing the same thing, and select the fastest method.
• Use indents (tabs) to improve clarity of loops and if statements.
• Build up the program from small, independently tested functions.
• Stop tinkering once it works effectively.
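The advice about avoiding attach can be illustrated with a small sketch. The dataframe soils and its values are invented for this example; the point is only that with, the $ operator and the data argument all reach variables inside a named dataframe without attaching it.

```r
# A small, made-up dataframe used only for illustration
soils <- data.frame(
  y    = c(12.1, 9.8, 14.3, 11.0),
  soil = c("clay", "loam", "clay", "sand")
)

# Three ways to reach the variables without attach():
m1 <- with(soils, mean(y))            # evaluate an expression inside the dataframe
m2 <- mean(soils$y)                   # refer to the column by name
model <- lm(y ~ soil, data = soils)   # let the modelling function look inside

m1
m2
```

Because nothing is attached, no copy of y or soil is left on the search path to mask (or be masked by) other variables of the same name later in the session.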

159 3 Data Input

You can get numbers into R through the keyboard, from the Clipboard or from an external file. For a single variable of up to 10 numbers or so, it is probably quickest to type the numbers at the command line, using the concatenate function c like this:

y <- c(6,7,3,4,8,5,6,2)

For intermediate sized variables, you might want to enter data from the keyboard using the scan function. For larger data sets, and certainly for sets with several variables, you should make a dataframe externally (e.g. in a spreadsheet) and read it into R using read.table (p. 139).

3.1 Data input from the keyboard

The scan function is useful if you want to type (or paste) a few numbers into a vector called x from the keyboard:

x <- scan()
1:

At the 1: prompt type your first number, then press the Enter key. When the 2: prompt appears, type in your second number and press Enter, and so on. When you have put in all the numbers you need (suppose there are eight of them) then simply press the Enter key at the 9: prompt.

1: 6
2: 7
3: 3
4: 4
5: 8
6: 5
7: 6
8: 2
9:
Read 8 items

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.
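Interactive input from scan cannot be replayed inside a script, so the following sketch uses the text argument of scan to supply the same eight numbers as a character string. That substitution is made here only so that the example runs non-interactively; the keyboard session above is otherwise unchanged.

```r
# Supply the numbers as text instead of typing them at the 1:, 2:, ... prompts
x <- scan(text = "6 7 3 4 8 5 6 2")

# The same vector built with the concatenate function
y <- c(6, 7, 3, 4, 8, 5, 6, 2)

identical(x, y)
length(x)
```

By default scan reads numeric values, which is why the result matches the vector made with c.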

164 We want to skip the first row because that is a header containing the variable names, so we specify skip = 1. There are seven columns of data, so we specify seven fields of character variables "" in the list supplied to what:

scan("t:\\data\\worms.txt",skip=1,what=as.list(rep("",7)))
Read 20 records
[[1]]
 [1] "Nashs.Field"       "Silwood.Bottom"    "Nursery.Field"
 [4] "Rush.Meadow"       "Gunness.Thicket"   "Oak.Mead"
 [7] "Church.Field"      "Ashurst"           "The.Orchard"
[10] "Rookery.Slope"     "Garden.Wood"       "North.Gravel"
[13] "South.Gravel"      "Observatory.Ridge" "Pond.Field"
[16] "Water.Meadow"      "Cheapside"         "Pound.Hill"
[19] "Gravel.Pit"        "Farm.Wood"

[[2]]
 [1] "3.6" "5.1" "2.8" "2.4" "3.8" "3.1" "3.5" "2.1" "1.9" "1.5" "2.9" "3.3"
[13] "3.7" "1.8" "4.1" "3.9" "2.2" "4.4" "2.9" "0.8"

[[3]]
 [1] "11" "2"  "3"  "5"  "0"  "2"  "3"  "0"  "0"  "4"  "10" "1"  "2"  "6"  "0"
[16] "0"  "8"  "2"  "1"  "10"

[[4]]
 [1] "Grassland" "Arable"    "Grassland" "Meadow"    "Scrub"     "Grassland"
 [7] "Grassland" "Arable"    "Orchard"   "Grassland" "Scrub"     "Grassland"
[13] "Grassland" "Grassland" "Meadow"    "Meadow"    "Scrub"     "Arable"
[19] "Grassland" "Scrub"

[[5]]
 [1] "4.1" "5.2" "4.3" "4.9" "4.2" "3.9" "4.2" "4.8" "5.7" "5"   "5.2" "4.1"
[13] "4"   "3.8" "5"   "4.9" "4.7" "4.5" "3.5" "5.1"

[[6]]
 [1] "FALSE" "FALSE" "FALSE" "TRUE"  "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
[10] "TRUE"  "FALSE" "FALSE" "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"  "FALSE"
[19] "FALSE" "TRUE"

[[7]]
 [1] "4" "7" "2" "5" "6" "2" "3" "4" "9" "7" "8" "1" "2" "0" "6" "8" "4" "5" "1"
[20] "3"

As you can see, scan has created a list of seven vectors of character string information.
To convert this list into a dataframe, we use the as.data.frame function which turns the vectors in the list into columns in the dataframe (so long as the columns are all the same length):

data <- as.data.frame(scan("t:\\data\\worms.txt",skip=1,what=as.list(rep("",7))))

In its present form, the variable names manufactured by scan are ridiculously long, so we need to replace them with the original variable names that are in the first row of the file. For this we can use scan again, but specify that we want to read only the first line, by specifying nlines=1 and removing the skip option:

header <- unlist(scan("t:\\data\\worms.txt",nlines=1,what=as.list(rep("",7))))
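The whole procedure (read the body with scan, read the header separately, then combine them) can be sketched on a small stand-in file. The three-column file written below is invented for this illustration and is much smaller than worms.txt; the final type.convert step is an addition not shown in the text, included because scan has read every field as a character string.

```r
# Write a tiny whitespace-separated file to a temporary location (illustrative data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Field.Name Area Damp",
             "Nashs.Field 3.6 FALSE",
             "Silwood.Bottom 5.1 FALSE",
             "Rush.Meadow 2.4 TRUE"), tmp)

# Body: skip the header row and read three character fields per record
body <- scan(tmp, skip = 1, what = as.list(rep("", 3)), quiet = TRUE)

# Header: read only the first line
header <- unlist(scan(tmp, nlines = 1, what = as.list(rep("", 3)), quiet = TRUE))

# Combine, then replace the manufactured column names with the real ones
data <- as.data.frame(body, stringsAsFactors = FALSE)
names(data) <- header

# Every column is still character; convert each to its natural type
data[] <- lapply(data, type.convert, as.is = TRUE)
str(data)
```

In practice a single read.table(tmp, header = TRUE) call does all of this in one step; the sketch is only meant to show what each stage of the scan approach contributes.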

166 the information on the number of lines from method 2 and the information on the contents of each line from method 3. The first step is easy:

length(scan("c:\\temp\\rt.txt",sep="\n"))
Read 5 items
[1] 5

So we have five lines of information in this file. To find the number of items per line we divide the total number of items

length(scan("c:\\temp\\rt.txt",sep="\t"))
Read 20 items
[1] 20

by the number of lines: 20 / 5 = 4. To extract the information on each line, we want to take a line at a time, and strip out the missing values (i.e. remove the NAs). So, for line 1 this would be

scan("c:\\temp\\rt.txt",sep="\t")[1:4]
Read 20 items
[1] 138 NA NA NA

then, to remove the NA we use na.omit, to remove the Read 20 items message we use quiet=T, and to leave only the numerical value we use as.numeric:

as.numeric(na.omit(scan("c:\\temp\\rt.txt",sep="\t",quiet=T)[1:4]))
[1] 138

To complete the job, we need to apply this logic to each of the five lines in turn, to produce a list of vectors of variable lengths (1, 2, 4, 2 and 1):

sapply(1:5, function(i) as.numeric(na.omit(
scan("c:\\temp\\rt.txt",sep="\t",quiet=T)[(4*i-3):(4*i)])))
[[1]]
[1] 138
[[2]]
[1] 27 44
[[3]]
[1] 19 20 345 48
[[4]]
[1] 115 2366
[[5]]
[1] 59

That was about as complicated a procedure as you are likely to encounter in reading information from a file. In hindsight, we might have created the data as a dataframe with missing values explicitly added to the rows that had fewer than four numbers. Then a single read.table statement would have been enough.
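The closing remark can be illustrated with a sketch. It assumes a tab-delimited file in which every line carries four fields, with the empty ones left blank; that matches the 20 items on 5 lines reported above, but the file written here is a reconstruction made for this example rather than the original rt.txt.

```r
# Recreate a five-line, tab-delimited file with blank trailing fields (assumed layout)
tmp <- tempfile(fileext = ".txt")
writeLines(c("138\t\t\t",
             "27\t44\t\t",
             "19\t20\t345\t48",
             "115\t2366\t\t",
             "59\t\t\t"), tmp)

# One read.table call: blank fields in numeric columns are read as NA
rt <- read.table(tmp, sep = "\t", header = FALSE)
rt

# If the ragged list is still wanted, drop the NAs row by row
rows <- lapply(seq_len(nrow(rt)),
               function(i) as.numeric(na.omit(unlist(rt[i, ]))))
lengths(rows)
```

Reading the file once into a rectangular dataframe avoids re-scanning it for every line, which is what the sapply version above does.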

170 [[4]]
[1] "115" "2366" ""
[[5]]
[1] "59" "" ""

strsplit(readLines("c:\\temp\\rt.txt"),"\n")
[[1]]
[1] "138\t\t\t"
[[2]]
[1] "27\t44\t\t"
[[3]]
[1] "19\t20\t345\t48"
[[4]]
[1] "115\t2366\t\t"
[[5]]
[1] "59\t\t\t"

The split by tab markers is closest to what we want to achieve, so we shall work on that. First, turn the character strings into numbers:

rows<-lapply(strsplit(readLines("c:\\temp\\rt.txt"),"\t"),as.numeric)
rows
[[1]]
[1] 138 NA NA
[[2]]
[1] 27 44 NA
[[3]]
[1] 19 20 345 48
[[4]]
[1] 115 2366 NA
[[5]]
[1] 59 NA NA

Now all that we need to do is to remove the NAs from each of the vectors:

sapply(1:5, function(i) as.numeric(na.omit(rows[[i]])))
[[1]]
[1] 138
[[2]]
[1] 27 44
[[3]]
[1] 19 20 345 48

172 3.6 Masking

You may have attached the same dataframe twice. Alternatively, you may have two dataframes attached that have one or more variable names in common. The commonest cause of masking occurs with simple variable names like x and y. It is very easy to end up with multiple variables of the same name within a single session that mean totally different things. The warning after using attach should alert you to the possibility of such problems. If the vectors sharing the same name are of different lengths, then R is likely to stop you before you do anything too silly, but if the vectors are of the same length then you run the serious risk of fitting the wrong explanatory variable (e.g. fitting the wrong one from two vectors both called x) or having the wrong response variable (e.g. from two vectors both called y). The moral is:

• use longer, more self-explanatory variable names;
• do not calculate variables with the same name as a variable inside a dataframe;
• always detach dataframes once you have finished using them;
• remove calculated variables once you are finished with them (rm; see p. 10).

The best practice, however, is not to use attach in the first place, but to use functions like with instead (see p. 113). If you get into a real tangle, it is often easiest to quit R and start another R session.

The opposite problem occurs when you assign values to an existing variable name (perhaps by accident); the original contents of the name are lost.

z <- 10
...
...
z <- 2.5

Now, z is 2.5 and there is no way to retrieve the original value of 10.

3.7 Input and output formats

Formatting is controlled using escape sequences, typically within double quotes:

\n   newline
\r   carriage return
\t   tab character
\b   backspace
\a   bell
\f   form feed
\v   vertical tab

Here is an example of the cat function.
The default produces the output on the computer screen, but you can save the output to file using a file=file.name argument:

data<-read.table("c:\\temp\\catdata.txt",header=T)
attach(data)
names(data)
[1] "y" "soil"
model<-lm(y~soil)
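As a brief aside on the file argument just mentioned, here is a minimal sketch of cat writing to a file instead of the screen. The file is a temporary one and the two lines of text are placeholders chosen for illustration; append = TRUE is needed on the second call so that it does not overwrite the first.

```r
out <- tempfile(fileext = ".txt")

# \t inserts a tab and \n ends the line; sep = "" stops cat adding spaces
cat("Source\tSS\n", file = out)
cat("Treatment\t", 99.2, "\n", sep = "", file = out, append = TRUE)

# Read the file back to confirm what was written
lines <- readLines(out)
lines
```

The same calls without the file argument would print the two tab-separated lines at the console.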

173 Suppose that you wanted to produce a slightly different layout for the ANOVA table than that produced by summary.aov(model):

summary.aov(model)
            Df Sum Sq Mean Sq F value Pr(>F)
ind          2   99.2   49.60   4.245  0.025 *
Residuals   27  315.5   11.69

with the sum of squares column before the degrees of freedom column, plus a row for the total sum of squares, different row labels, and no p value, like this:

ANOVA table
Source      SS      d.f.   MS         F
Treatment   99.2    2      49.6       4.244691
Error       315.5   27     11.68519
Total       414.7   29

First, extract the necessary numbers from the summary.aov object:

df1<-unlist(summary.aov(model)[[1]] [1])[1]
df2<-unlist(summary.aov(model)[[1]] [1])[2]
ss1<-unlist(summary.aov(model)[[1]] [2])[1]
ss2<-unlist(summary.aov(model)[[1]] [2])[2]

Here is the R code to produce the ANOVA table, using cat with multiple tabs ("\t\t") and single new-line markers ("\n") at the end of each line:

{
cat("ANOVA table","\n")
cat("Source","\t\t","SS","\t","d.f.","\t","MS","\t\t","F","\n")
cat("Treatment","\t",ss1,"\t",df1,"\t",ss1/df1,"\t\t",(ss1/df1)/(ss2/df2),"\n")
cat("Error","\t\t",ss2,"\t",df2,"\t",ss2/df2,"\n")
cat("Total","\t\t",ss1+ss2,"\t",df1+df2,"\n")
}

Note the use of curly brackets to group the five cat functions into a single print object.

3.8 Checking files from the command line

It can be useful to check whether a given filename exists in the path where you think it should be. The function is file.exists and is used like this:

file.exists("c:\\temp\\Decay.txt")
[1] TRUE

For more on file handling, see ?files.

3.9 Reading dates and times from files

You need to be very careful when dealing with dates and times in any sort of computing. R has a robust system for working with dates and times, but it takes some getting used to. Typically, you will read dates and times

174 into R as character strings, then convert them into dates and times using the strptime function to explain exactly what the elements of the character string mean (e.g. which are the days, which are the months, what are the separators, and so on; see p. 103 for an explanation of the formats supported).

3.10 Built-in data files

There are many built-in data sets within the datasets package of R. You can see their names by typing:

data()

To see the data sets in extra installed packages as well, type:

data(package = .packages(all.available = TRUE))

You can read the documentation for a particular data set with the usual query:

?lynx

Many of the contributed packages contain data sets, and you can view their names by calling data with a package argument inside the try function. This evaluates an expression and traps any errors that occur during the evaluation. The try function establishes a handler for errors that uses the default error handling protocol:

try(data(package="spatstat"));Sys.sleep(3)
try(data(package="spdep"));Sys.sleep(3)
try(data(package="MASS"))

where try is a wrapper to run an expression that might fail and allow the user's code to handle error recovery, so this would work even if one of the packages was missing. Built-in data files can be attached in the normal way; then the variables within them accessed by their names:

attach(OrchardSprays)
decrease

3.11 File paths

There are several useful R functions for manipulating file paths. A file path is a character string that looks something like this:

c:\\temp\\thesis\\chapter1\\data\\problemA

and you would not want to type all of that every time you wanted to read data or save material to file.
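As a brief aside on section 3.10, the behaviour of try with a missing package can be checked directly. In this sketch the package name noSuchPackage123 is deliberately fictitious, silent = TRUE is used to suppress the printed error message, and with is used in place of attach to reach a variable in a built-in dataframe; all three are choices made for this illustration.

```r
# A package that is not installed: data() fails, try() traps the error
res <- try(data(package = "noSuchPackage123"), silent = TRUE)
inherits(res, "try-error")

# A package that is always present: the result lists its data sets
ok <- try(data(package = "datasets"), silent = TRUE)
"lynx" %in% ok$results[, "Item"]

# Reach a variable in a built-in dataframe without attaching it
m <- with(OrchardSprays, mean(decrease))
m
```

Because the failed call returns an object of class try-error rather than stopping, any code that follows it in a script carries on running.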
You can set the default file path for a session using the current working directory:

setwd("c:\\temp\\thesis\\chapter1\\data")

The basename function removes all of the path up to and including the last path separator (if any):

basename("c:\\temp\\thesis\\chapter1\\data\\problemA")
[1] "problemA"

The dirname function returns the part of the path up to but excluding the last path separator, or "." if there is no path separator:

175 dirname("c:\\temp\\thesis\\chapter1\\data\\problemA")
[1] "c:/temp/thesis/chapter1/data"

Note that this function returns forward slashes as the separator, replacing the double backslashes. Suppose that you want to construct the path to a file from components in a platform-independent way. The file.path function does this very simply:

A <- "c:"
B <- "temp"
C <- "thesis"
D <- "chapter1"
E <- "data"
F <- "problemA"
file.path(A,B,C,D,E,F)
[1] "c:/temp/thesis/chapter1/data/problemA"

The default separator is platform-dependent (/ in the example above, not \), but you can specify the separator (fsep) like this:

file.path(A,B,C,D,E,F,fsep="\\")
[1] "c:\\temp\\thesis\\chapter1\\data\\problemA"

3.12 Connections

Connections are ways of getting information into and out of R, such as your keyboard and your screen. The three standard connections are known as stdin() for input, stdout() for output, and stderr() for reporting errors. They are text-mode connections of class terminal which cannot be opened or closed, and are read-only, write-only and write-only respectively. When R is reading a script from a file, the file is the 'console': this is traditional usage to allow in-line data. Functions to create, open and close connections include file, url, gzfile, bzfile, xzfile, pipe and socketConnection. The intention is that file and gzfile can be used generally for text input (from files and URLs) and binary input, respectively. The functions file, pipe, fifo, url, gzfile, bzfile, xzfile, unz and socketConnection return a connection object which inherits from class connection: isOpen returns a logical value, indicating whether the connection is currently open; isIncomplete returns a logical value, indicating whether the last read attempt was blocked, or for an output text connection whether there is unflushed output.
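A file connection can be sketched as follows. The temporary file and the two lines of text are placeholders for illustration; the pattern shown (create the connection with file, use it, then close it) is one common way of working with connections, not the only one.

```r
path <- tempfile(fileext = ".txt")

# Open a connection for writing, check that it is open, write, then close
con <- file(path, open = "w")
was_open <- isOpen(con)
writeLines(c("first line", "second line"), con)
close(con)

# Open the same file for reading; successive reads continue where the last stopped
con <- file(path, open = "r")
first <- readLines(con, n = 1)
rest  <- readLines(con)
close(con)

first
rest
```

Keeping the connection open between the two readLines calls is what lets the second call pick up at line 2; passing the file name instead would start again from the top each time.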
The functions are used like this:

file(description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), raw = FALSE)

For file the description is a path to the file to be opened or a complete URL, or "" (the default) or "clipboard". You can use file with description = "clipboard" in modes "r" and "w" only. There is a 32Kb limit on the text to be written to the Clipboard. This can be raised by using, for example, file("clipboard-128") to give 128Kb. For gzfile the description is the path to a file compressed by gzip: it can also open for reading uncompressed files and those compressed by bzip2, xz or lzma. For bzfile the description is the path to a file compressed by bzip2. For xzfile the description is the

176 path to a file compressed by xz or lzma. A maximum of 128 connections can be allocated (not necessarily open) at any one time. Not all modes are applicable to all connections: for example, URLs can only be opened for reading. Only file and socket connections can be opened for both reading and writing. For compressed files, the type of compression involves trade-offs: gzip, bzip2 and xz are successively less widely supported, need more resources for both compression and decompression, and achieve more compression (although individual files may buck the general trend). Typical experience is that bzip2 compression is 15% better on text files than gzip compression, and xz with maximal compression 30% better. With current computers decompression times even with compress = 9 are typically modest, and reading compressed files is usually faster than uncompressed ones because of the reduction in disc activity.

3.13 Reading data from an external database

Open Data Base Connectivity (ODBC) provides a standard software interface for accessing database management systems (DBMS) that is independent of programming languages, database systems and operating systems. Thus, any application can use ODBC to query data from a database, regardless of the platform it is on or the DBMS it uses. ODBC accomplishes this by using a driver as a translation layer between the application and the DBMS. The application thus only needs to know ODBC syntax, and the driver can then pass the query to the DBMS in its native format, returning the data in a format that the application can understand. Communication with the database uses SQL ('Structured Query Language').

The example we shall use is called Northwind. This is a relational database that is downloaded for Access from Microsoft Office, and which is used in introductory texts on SQL (e.g. Kauffman et al., 2001). The database refers to a fictional company that operates as a food wholesaler.
You should download the database from http://office.microsoft.com. There are many related tables in the database, each with its unique row identifier (ID):

• Categories: 18 rows describing the categories of goods sold (from baked goods to oils) including Category name as well as Category.ID
• Products: 45 rows with details of the various products sold by Northwind including CategoryID (as above) and Product.ID
• Suppliers: 10 rows with details of the firms that supply goods to Northwind
• Shippers: 3 rows with details of the companies that ship goods to customers
• Employees: 9 rows with details of the people who work for Northwind
• Customers: 27 rows with details of the firms that buy goods from Northwind
• Orders: empty rows with orders from various customers tagged by OrderID
• Order Details: 2155 rows containing one to many rows for each order (typically 2 or 3 rows per order), each row containing the product, number required, unit price and discount, as well as the (repeated) OrderID
• Order Status: four categories – new, invoiced, shipped or completed
• Order Details: empty rows
• Order Details Status: six categories – none, no stock, back-ordered, allocated, invoiced and shipped
• Inventory: details of the 45 products held on hand and reordered.

178 library(RODBC)
channel <- odbcConnect("northwind")

You communicate with the database from R using SQL. The syntax is very simple. You create a dataframe in R from a table (or more typically from several related tables) in the database using the sqlQuery function like this:

new.dataframe <- sqlQuery(channel, query)

The channel is defined using the odbcConnect function, as shown above. The skill is in creating the (often complicated) character string called query. The key components of an SQL query are:

SELECT     a list of the variables required (or * for all variables)
FROM       the name of the table containing these variables
WHERE      specification of which rows of the table(s) are required
JOIN       the tables to be joined and the variables on which to join them
GROUP BY   columns with factors to act as grouping levels
HAVING     conditions applied after grouping
ORDER BY   sorted on which variables
LIMIT      offsets or counts

The simplest cases require only SELECT and FROM. Let us start by creating a dataframe in R called stock containing all of the variables (*) and rows from the table called Categories in Northwind:

query <- "SELECT * FROM Categories"
stock <- sqlQuery(channel, query)

This is what the R dataframe looks like – there are just two fields, ID and Category:

stock
   ID Category
1   1 Baked Goods & Mixes
2   2 Beverages
3   3 Candy
4   4 Canned Fruit & Vegetables
5   5 Canned Meat
6   6 Cereal
7   7 Chips
8   8 Condiments
9   9 Dairy Products
10 10 Dried Fruit & Nuts
11 11 Grains
12 12 Jams & Preserves
13 14 Pasta
14 15 Sauces
15 16 Snacks
16 17 Soups
17 18 Oils

180 supply <- sqlQuery(channel, query)
head(supply,10)
  ProductName                       OnHand
1 Northwind Traders Chai               200
2 Northwind Traders Syrup              300
3 Northwind Traders Cajun Seasoning    400
4 Northwind Traders Olive Oil          200
5 Northwind Traders Crab Meat          200
6 Northwind Traders Chicken Soup       500

Using WHERE with text is more challenging because the text needs to be enclosed in quotes and quotes are always tricky in character strings. The solution is to mix single and double quotes to paste together the query you want:

name <- "NWTDFN-14"
query <- paste("SELECT ProductName FROM Products WHERE ProductCode='",name,"'",sep="")
code <- sqlQuery(channel, query)
head(code,10)
  ProductName
1 Northwind Traders Walnuts

This is what the query looks like as a single character string:

query
[1] "SELECT ProductName FROM Products WHERE ProductCode='NWTDFN-14'"

You will need to ponder this query really hard in order to see why we had to do what we did to select the product on the basis of its ProductCode being "NWTDFN-14". The problem is that quotes must be part of the character string that forms the query, but quotes generally mark the end of a character string. There is another example of selecting records on the basis of character strings on p. 197 where we use data from a large relational database to produce species distribution maps. While you have the Northwind example up and running, you should practise joining together three tables, and use some of the other options for selecting records, using LIKE with wildcards * (e.g. WHERE Variable.name LIKE "Product*").
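The quoting problem can be explored without a database connection, because the query is just a character string. The sketch below rebuilds the same string in two ways; sprintf is offered only as an alternative to paste that some readers find easier to follow, and nothing is sent to Northwind.

```r
name <- "NWTDFN-14"

# Double quotes delimit the R strings; the single quotes are part of the SQL
q1 <- paste("SELECT ProductName FROM Products WHERE ProductCode='",
            name, "'", sep = "")

# The same query with a %s placeholder for the product code
q2 <- sprintf("SELECT ProductName FROM Products WHERE ProductCode='%s'", name)

identical(q1, q2)
cat(q1, "\n")   # cat shows the string exactly as the database would receive it
```

Printing the query with cat before passing it to sqlQuery is a simple way to check that the single quotes have ended up where SQL expects them.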

184 To see the contents of the whole dataframe, just type its name:

worms
          Field.Name Area Slope Vegetation Soil.pH  Damp Worm.density
1        Nashs.Field  3.6    11  Grassland     4.1 FALSE            4
2     Silwood.Bottom  5.1     2     Arable     5.2 FALSE            7
3      Nursery.Field  2.8     3  Grassland     4.3 FALSE            2
4        Rush.Meadow  2.4     5     Meadow     4.9  TRUE            5
5    Gunness.Thicket  3.8     0      Scrub     4.2 FALSE            6
6           Oak.Mead  3.1     2  Grassland     3.9 FALSE            2
7       Church.Field  3.5     3  Grassland     4.2 FALSE            3
8            Ashurst  2.1     0     Arable     4.8 FALSE            4
9        The.Orchard  1.9     0    Orchard     5.7 FALSE            9
10     Rookery.Slope  1.5     4  Grassland     5.0  TRUE            7
11       Garden.Wood  2.9    10      Scrub     5.2 FALSE            8
12      North.Gravel  3.3     1  Grassland     4.1 FALSE            1
13      South.Gravel  3.7     2  Grassland     4.0 FALSE            2
14 Observatory.Ridge  1.8     6  Grassland     3.8 FALSE            0
15        Pond.Field  4.1     0     Meadow     5.0  TRUE            6
16      Water.Meadow  3.9     0     Meadow     4.9  TRUE            8
17         Cheapside  2.2     8      Scrub     4.7  TRUE            4
18        Pound.Hill  4.4     2     Arable     4.5 FALSE            5
19        Gravel.Pit  2.9     1  Grassland     3.5 FALSE            1
20         Farm.Wood  0.8    10      Scrub     5.1  TRUE            3

Notice that R has expanded our abbreviated T and F into TRUE and FALSE. The object called worms now has all the attributes of a dataframe. For example, you can summarize it, using summary:

summary(worms)
        Field.Name      Area           Slope          Vegetation
 Ashurst     : 1   Min.   :0.800   Min.   : 0.00   Arable   :3
 Cheapside   : 1   1st Qu.:2.175   1st Qu.: 0.75   Grassland:9
 Church.Field: 1   Median :3.000   Median : 2.00   Meadow   :3
 Farm.Wood   : 1   Mean   :2.990   Mean   : 3.50   Orchard  :1
 Garden.Wood : 1   3rd Qu.:3.725   3rd Qu.: 5.25   Scrub    :4
 Gravel.Pit  : 1   Max.   :5.100   Max.   :11.00
 (Other)     :14
    Soil.pH         Damp          Worm.density
 Min.   :3.500   Mode :logical   Min.   :0.00
 1st Qu.:4.100   FALSE:14        1st Qu.:2.00
 Median :4.600   TRUE :6         Median :4.00
 Mean   :4.555   NA's :0         Mean   :4.35
 3rd Qu.:5.000                   3rd Qu.:6.25
 Max.   :5.700                   Max.   :9.00

Values of continuous variables are summarized under six headings: one parametric (the arithmetic mean) and five non-parametric (maximum, minimum, median, 25th percentile or first quartile, and 75th percentile or third quartile).
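The six headings can be reproduced for a single variable, which also makes it easy to compare the quartiles with the hinges returned by fivenum. The sketch below types in the 20 Area values from the worms listing above so that it runs without the data file; the expected values in the comments follow from those numbers.

```r
# Area values copied from the worms dataframe listed above
Area <- c(3.6, 5.1, 2.8, 2.4, 3.8, 3.1, 3.5, 2.1, 1.9, 1.5,
          2.9, 3.3, 3.7, 1.8, 4.1, 3.9, 2.2, 4.4, 2.9, 0.8)

summary(Area)                           # the six headings for one variable
q <- quantile(Area, c(0.25, 0.75))      # quartiles as used by summary: 2.175, 3.725
h <- fivenum(Area)                      # min, lower hinge, median, upper hinge, max
q
h                                       # hinges are 2.15 and 3.75 for these data
```

With 20 observations the quartiles are interpolated between the 5th and 6th (and the 15th and 16th) ordered values, whereas the hinges are simply the midpoints of those pairs, which is why the two sets of numbers differ slightly.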
Tukey’s famous five-number function (fivenum; see p. 42) is slightly different, with hinges

185 rather than first and third quartiles. Levels of categorical variables are counted. Note that the field names are not listed in full because they are unique to each row; six of them are named, then R says 'plus 14 others' (Other) :14.

The two functions by and aggregate allow summary of the dataframe on the basis of factor levels. For instance, it might be interesting to know the means of the numeric variables for each vegetation type. The function for this is by:

by(worms,Vegetation,mean)
Vegetation: Arable
Field.Name     Area    Slope Vegetation  Soil.pH     Damp Worm.density
        NA 3.866667 1.333333         NA 4.833333 0.000000     5.333333
---------------------------------------------------------------------------------------
Vegetation: Grassland
Field.Name      Area     Slope Vegetation   Soil.pH      Damp Worm.density
        NA 2.9111111 3.6666667         NA 4.1000000 0.1111111    2.4444444
---------------------------------------------------------------------------------------
Vegetation: Meadow
Field.Name     Area    Slope Vegetation  Soil.pH     Damp Worm.density
        NA 3.466667 1.666667         NA 4.933333 1.000000     6.333333
---------------------------------------------------------------------------------------
Vegetation: Orchard
Field.Name Area Slope Vegetation Soil.pH Damp Worm.density
        NA  1.9   0.0         NA     5.7  0.0          9.0
---------------------------------------------------------------------------------------
Vegetation: Scrub
Field.Name  Area Slope Vegetation Soil.pH  Damp Worm.density
        NA 2.425 7.000         NA   4.800 0.500        5.250

Notice that the logical variable Damp has been coerced to numeric (TRUE = 1, FALSE = 0) and then averaged. Warning messages are printed for the non-numeric variables to which the function mean is not applicable (e.g. the factor levels for Field.Name and Vegetation), but this is a useful and quick overview of the effects of the five types of vegetation.
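In recent versions of R, mean no longer accepts a whole dataframe, so the by call above may return NA with warnings rather than the table shown. A hedged alternative is sketched below using aggregate with a formula, and by with colMeans on the numeric columns only. The dataframe is retyped from the worms listing (three numeric columns plus Vegetation) so that the example is self-contained.

```r
# Columns copied from the worms dataframe listed earlier
worms <- data.frame(
  Vegetation = c("Grassland", "Arable", "Grassland", "Meadow", "Scrub",
                 "Grassland", "Grassland", "Arable", "Orchard", "Grassland",
                 "Scrub", "Grassland", "Grassland", "Grassland", "Meadow",
                 "Meadow", "Scrub", "Arable", "Grassland", "Scrub"),
  Slope = c(11, 2, 3, 5, 0, 2, 3, 0, 0, 4, 10, 1, 2, 6, 0, 0, 8, 2, 1, 10),
  Soil.pH = c(4.1, 5.2, 4.3, 4.9, 4.2, 3.9, 4.2, 4.8, 5.7, 5.0,
              5.2, 4.1, 4.0, 3.8, 5.0, 4.9, 4.7, 4.5, 3.5, 5.1),
  Worm.density = c(4, 7, 2, 5, 6, 2, 3, 4, 9, 7, 8, 1, 2, 0, 6, 8, 4, 5, 1, 3)
)

# Means of the numeric variables for each vegetation type, as one dataframe
means <- aggregate(cbind(Slope, Soil.pH, Worm.density) ~ Vegetation,
                   data = worms, FUN = mean)
means

# The same idea with by(), restricted to the numeric columns
by(worms[, -1], worms$Vegetation, colMeans)
```

The aggregate version returns an ordinary dataframe with one row per vegetation type, which is usually easier to reuse in later calculations than the printed by object.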
You can also fit models using by: here is worm density as a function of soil pH for each vegetation type:

by(worms, Vegetation, function(x) lm(Worm.density ~ Soil.pH, data=x))
Vegetation: Arable
Call:
lm(formula = Worm.density ~ Soil.pH, data = x)
Coefficients:
(Intercept)      Soil.pH
     -9.689        3.108
---------------------------------------------------------------------------------------
etc. for each level of vegetation in alphabetical order
---------------------------------------------------------------------------------------
Vegetation: Scrub
Call:
lm(formula = Worm.density ~ Soil.pH, data = x)

186 Coefficients:
(Intercept)      Soil.pH
     4.4758       0.1613

4.1 Subscripts and indices

The key thing about working effectively with dataframes is to become completely at ease with using subscripts (or indices, as some people call them). In R, subscripts appear in square brackets [ ]. A dataframe is a two-dimensional object, comprising rows and columns. The rows are referred to by the first (left-hand) subscript, the columns by the second (right-hand) subscript. Thus

worms[3,5]
[1] 4.3

is the value in row 3 of Soil.pH (the variable in column 5). To extract a range of values (say the 14th to 19th rows) from worm density (the variable in the seventh column) we use the colon operator : to generate a series of subscripts (14, 15, 16, 17, 18 and 19):

worms[14:19,7]
[1] 0 6 8 4 5 1

To extract a group of rows and a group of columns, you need to generate a series of subscripts for both the row and column subscripts. Suppose we want Area and Slope (columns 2 and 3) from rows 1 to 5:

worms[1:5,2:3]
  Area Slope
1  3.6    11
2  5.1     2
3  2.8     3
4  2.4     5
5  3.8     0

This next point is very important, and is hard to grasp without practice. To select all the entries in a row the syntax is 'number comma blank'. Similarly, to select all the entries in a column the syntax is 'blank comma number'. Thus, to select all the columns in row 3 we type

worms[3,]
     Field.Name Area Slope Vegetation Soil.pH  Damp Worm.density
3 Nursery.Field  2.8     3  Grassland     4.3 FALSE            2

whereas to select all the rows in column 3 we need

worms[,3]
[1] 11 2 3 5 0 2 3 0 0 4 10 1 2 6 0 0 8 2 1 10

This is a key feature of the R language, and one that causes problems for beginners. Note that these two apparently similar commands create objects of different classes:

class(worms[3,])
[1] "data.frame"

187 class(worms[,3])
[1] "integer"

You can create sets of rows or columns. For instance, to extract all the rows for Field.Name and Soil.pH (columns 1 and 5) use the concatenate function, c, to make a vector of the required column numbers c(1,5):

worms[,c(1,5)]
          Field.Name Soil.pH
1        Nashs.Field     4.1
2     Silwood.Bottom     5.2
3      Nursery.Field     4.3
4        Rush.Meadow     4.9
5    Gunness.Thicket     4.2
6           Oak.Mead     3.9
7       Church.Field     4.2
8            Ashurst     4.8
9        The.Orchard     5.7
10     Rookery.Slope     5.0
11       Garden.Wood     5.2
12      North.Gravel     4.1
13      South.Gravel     4.0
14 Observatory.Ridge     3.8
15        Pond.Field     5.0
16      Water.Meadow     4.9
17         Cheapside     4.7
18        Pound.Hill     4.5
19        Gravel.Pit     3.5
20         Farm.Wood     5.1

The commands for selecting rows and columns from the dataframe are summarized in Table 4.1.

4.2 Selecting rows from the dataframe at random

In bootstrapping or cross-validation we might want to select certain rows from the dataframe at random. We use the sample function to do this: the default replace = FALSE performs shuffling (each row is selected once and only once), while the option replace = TRUE (sampling with replacement) allows for multiple copies of certain rows and the omission of others. Here we use the default replace = F to select a unique 8 of the 20 rows at random:

worms[sample(1:20,8),]
          Field.Name Area Slope Vegetation Soil.pH  Damp Worm.density
7       Church.Field  3.5     3  Grassland     4.2 FALSE            3
3      Nursery.Field  2.8     3  Grassland     4.3 FALSE            2
19        Gravel.Pit  2.9     1  Grassland     3.5 FALSE            1
4        Rush.Meadow  2.4     5     Meadow     4.9  TRUE            5
18        Pound.Hill  4.4     2     Arable     4.5 FALSE            5

188
12      North.Gravel  3.3     1  Grassland     4.1 FALSE            1
1        Nashs.Field  3.6    11  Grassland     4.1 FALSE            4
15        Pond.Field  4.1     0     Meadow     5.0  TRUE            6

Note that the row numbers are in random sequence (not sorted), so that if you want a sorted random sample you will need to order the dataframe after the randomization.

Table 4.1. Selecting parts of a dataframe called data. Suppose that n is one of the row numbers in your dataframe that you want to select or remove, and m is one of the columns. Note that the syntax [n,] selects all of the columns, while [,m] selects all of the rows.

command                meaning
data[n,]               select all of the columns from row n of the dataframe
data[-n,]              drop the whole of row n from the dataframe
data[1:n,]             select all of the columns from rows 1 to n of the dataframe
data[-(1:n),]          drop all of the columns from rows 1 to n of the dataframe
data[c(i,j,k),]        select all of the columns from rows i, j, and k of the dataframe
data[x > y,]           use a logical test (x > y) to select all columns from certain rows
data[,m]               select all of the rows from column m of the dataframe
data[,-m]              drop the whole of column m from the dataframe
data[,1:m]             select all of the rows from columns 1 to m of the dataframe
data[,-(1:m)]          drop all of the rows from columns 1 to m of the dataframe
data[,c(i,j,k)]        select all of the rows from columns i, j, and k of the dataframe
data[,x > y]           use a logical test (x > y) to select all rows from certain columns
data[,c(1:m,i,j,k)]    add duplicate copies of columns i, j, and k to the dataframe
data[x > y,a != b]     extract certain rows (x > y) and certain columns (a != b)
data[c(1:n,i,j,k),]    add duplicate copies of rows i, j, and k to the dataframe

4.3 Sorting dataframes

It is common to want to sort a dataframe by rows, but rare to want to sort by columns. Because we are sorting by rows (the first subscript) we specify the order of the row subscripts before the comma.
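Several of the selection rules in Table 4.1, and the random sampling of section 4.2, can be tried on a cut-down copy of worms. The three columns below are retyped from the listing so that the sketch is self-contained; the seed is arbitrary, and sorting the sampled row numbers with sort is just one simple way of obtaining a sorted random sample.

```r
# Three columns copied from the worms dataframe listed earlier
worms <- data.frame(
  Area  = c(3.6, 5.1, 2.8, 2.4, 3.8, 3.1, 3.5, 2.1, 1.9, 1.5,
            2.9, 3.3, 3.7, 1.8, 4.1, 3.9, 2.2, 4.4, 2.9, 0.8),
  Slope = c(11, 2, 3, 5, 0, 2, 3, 0, 0, 4, 10, 1, 2, 6, 0, 0, 8, 2, 1, 10),
  Worm.density = c(4, 7, 2, 5, 6, 2, 3, 4, 9, 7, 8, 1, 2, 0, 6, 8, 4, 5, 1, 3)
)

one_row  <- worms[3, ]                  # 'number comma blank': still a dataframe
one_col  <- worms[, 2]                  # 'blank comma number': a plain vector
kept_df  <- worms[, 2, drop = FALSE]    # drop = FALSE keeps a one-column dataframe
trimmed  <- worms[-(1:3), ]             # drop rows 1 to 3
steep    <- worms[worms$Slope > 5, c("Area", "Slope")]   # logical test on rows

# A random sample of 8 distinct rows, returned in row-number order
set.seed(1)
idx    <- sample(1:20, 8)
picked <- worms[sort(idx), ]
picked
```

Referring to worms$Slope inside the brackets avoids any need to attach the dataframe.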
Thus, to sort the Slope ), we write dataframe on the basis of values in one of the columns (say, worms[order(Slope),] Field.Name Area Slope Vegetation Soil.pH Damp Worm.density 5 Gunness.Thicket 3.8 0 Scrub 4.2 FALSE 6 8 Ashurst 2.1 0 Arable 4.8 FALSE 4 9 The.Orchard 1.9 0 Orchard 5.7 FALSE 9 15 Pond.Field 4.1 0 Meadow 5.0 TRUE 6 16 Water.Meadow 3.9 0 Meadow 4.9 TRUE 8 12 North.Gravel 3.3 1 Grassland 4.1 FALSE 1 19 Gravel.Pit 2.9 1 Grassland 3.5 FALSE 1 2 Silwood.Bottom 5.1 2 Arable 5.2 FALSE 7 6 Oak.Mead 3.1 2 Grassland 3.9 FALSE 2 13 South.Gravel 3.7 2 Grassland 4.0 FALSE 2 18 Pound.Hill 4.4 2 Arable 4.5 FALSE 5 3 Nursery.Field 2.8 3 Grassland 4.3 FALSE 2 7 Church.Field 3.5 3 Grassland 4.2 FALSE 3

189

    10  Rookery.Slope      1.5   4      Grassland   5.0      TRUE   7
    4   Rush.Meadow        2.4   5      Meadow      4.9      TRUE   5
    14  Observatory.Ridge  1.8   6      Grassland   3.8      FALSE  0
    17  Cheapside          2.2   8      Scrub       4.7      TRUE   4
    11  Garden.Wood        2.9   10     Scrub       5.2      FALSE  8
    20  Farm.Wood          0.8   10     Scrub       5.1      TRUE   3
    1   Nashs.Field        3.6   11     Grassland   4.1      FALSE  4

There are some points to notice here. Because we wanted the sorting to apply to all the columns, the column subscript (after the comma) is blank: [order(Slope),]. The original row numbers are retained in the leftmost column. Where there are ties for the sorting variable (e.g. there are five ties for Slope = 0) then the rows are in their original order. If you want the dataframe in reverse order (descending order) then use the rev function outside the order function like this:

    worms[rev(order(Slope)),]
        Field.Name         Area  Slope  Vegetation  Soil.pH  Damp   Worm.density
    1   Nashs.Field        3.6   11     Grassland   4.1      FALSE  4
    20  Farm.Wood          0.8   10     Scrub       5.1      TRUE   3
    11  Garden.Wood        2.9   10     Scrub       5.2      FALSE  8
    17  Cheapside          2.2   8      Scrub       4.7      TRUE   4
    14  Observatory.Ridge  1.8   6      Grassland   3.8      FALSE  0
    4   Rush.Meadow        2.4   5      Meadow      4.9      TRUE   5
    10  Rookery.Slope      1.5   4      Grassland   5.0      TRUE   7
    7   Church.Field       3.5   3      Grassland   4.2      FALSE  3
    3   Nursery.Field      2.8   3      Grassland   4.3      FALSE  2
    18  Pound.Hill         4.4   2      Arable      4.5      FALSE  5
    13  South.Gravel       3.7   2      Grassland   4.0      FALSE  2
    6   Oak.Mead           3.1   2      Grassland   3.9      FALSE  2
    2   Silwood.Bottom     5.1   2      Arable      5.2      FALSE  7
    19  Gravel.Pit         2.9   1      Grassland   3.5      FALSE  1
    12  North.Gravel       3.3   1      Grassland   4.1      FALSE  1
    16  Water.Meadow       3.9   0      Meadow      4.9      TRUE   8
    15  Pond.Field         4.1   0      Meadow      5.0      TRUE   6
    9   The.Orchard        1.9   0      Orchard     5.7      FALSE  9
    8   Ashurst            2.1   0      Arable      4.8      FALSE  4
    5   Gunness.Thicket    3.8   0      Scrub       4.2      FALSE  6

Notice now that when there are ties (e.g. Slope = 0), the original rows are also in reverse order.

More complicated sorting operations might involve two or more variables. This is achieved very simply by separating a series of variable names by commas within the order function. R will sort on the basis of the left-hand variable, with ties being broken by the second variable, and so on. Suppose that we want to order the rows of the database on worm density within each vegetation type:

    worms[order(Vegetation,Worm.density),]
        Field.Name         Area  Slope  Vegetation  Soil.pH  Damp   Worm.density
    8   Ashurst            2.1   0      Arable      4.8      FALSE  4
    18  Pound.Hill         4.4   2      Arable      4.5      FALSE  5
    2   Silwood.Bottom     5.1   2      Arable      5.2      FALSE  7
    14  Observatory.Ridge  1.8   6      Grassland   3.8      FALSE  0

191

Perhaps you want only certain columns in the sorted dataframe? Suppose we want vegetation, worm density, soil pH and slope, and we want them in that order from left to right. We specify the column numbers in the sequence we want them to appear as a vector, c(4,7,5,3):

    worms[order(Vegetation,Worm.density),c(4,7,5,3)]
        Vegetation  Worm.density  Soil.pH  Slope
    8   Arable      4             4.8      0
    18  Arable      5             4.5      2
    2   Arable      7             5.2      2
    14  Grassland   0             3.8      6
    12  Grassland   1             4.1      1
    19  Grassland   1             3.5      1
    3   Grassland   2             4.3      3
    6   Grassland   2             3.9      2
    13  Grassland   2             4.0      2
    7   Grassland   3             4.2      3
    1   Grassland   4             4.1      11
    10  Grassland   7             5.0      4
    4   Meadow      5             4.9      5
    15  Meadow      6             5.0      0
    16  Meadow      8             4.9      0
    9   Orchard     9             5.7      0
    20  Scrub       3             5.1      10
    17  Scrub       4             4.7      8
    5   Scrub       6             4.2      0
    11  Scrub       8             5.2      10

You can select the columns on the basis of their variable names, but this is more fiddly to type, because you need to put the variable names in quotes like this:

    worms[order(Vegetation,Worm.density),
          c("Vegetation", "Worm.density", "Soil.pH", "Slope")]

4.4 Using logical conditions to select rows from the dataframe

A very common operation is selecting certain rows from the dataframe on the basis of values in one or more of the variables (the columns of the dataframe). Suppose we want to restrict the data to cases from damp fields. We want all the columns, so the syntax for the subscripts is [‘which rows’, blank]:

    worms[Damp == T,]
        Field.Name         Area  Slope  Vegetation  Soil.pH  Damp   Worm.density
    4   Rush.Meadow        2.4   5      Meadow      4.9      TRUE   5
    10  Rookery.Slope      1.5   4      Grassland   5.0      TRUE   7
    15  Pond.Field         4.1   0      Meadow      5.0      TRUE   6
    16  Water.Meadow       3.9   0      Meadow      4.9      TRUE   8
    17  Cheapside          2.2   8      Scrub       4.7      TRUE   4
    20  Farm.Wood          0.8   10     Scrub       5.1      TRUE   3

192

Note that because Damp is a logical variable (with just two potential values, TRUE or FALSE) we can refer to true or false in abbreviated form, T or F. Also notice that the T in this case is not enclosed in quotes: the T means true, not the character string "T". The other important point is that the symbol for the logical condition is == (two successive equals signs with no gap between them; see p. 26).

The logic for the selection of rows can refer to values (and functions of values) in more than one column. Suppose that we wanted the data from the fields where worm density was higher than the median (>median(Worm.density)) and soil pH was less than 5.2. In R, the logical operator for AND is the & (‘ampersand’) symbol:

    worms[Worm.density > median(Worm.density) & Soil.pH < 5.2,]
        Field.Name         Area  Slope  Vegetation  Soil.pH  Damp   Worm.density
    4   Rush.Meadow        2.4   5      Meadow      4.9      TRUE   5
    5   Gunness.Thicket    3.8   0      Scrub       4.2      FALSE  6
    10  Rookery.Slope      1.5   4      Grassland   5.0      TRUE   7
    15  Pond.Field         4.1   0      Meadow      5.0      TRUE   6
    16  Water.Meadow       3.9   0      Meadow      4.9      TRUE   8
    18  Pound.Hill         4.4   2      Arable      4.5      FALSE  5

Suppose that we want to extract all the columns that contain numbers (rather than characters or logical variables) from the dataframe. The function is.numeric can be applied across all the columns of worms using sapply to create the appropriate subscripts like this:

    worms[,sapply(worms,is.numeric)]
        Area  Slope  Soil.pH  Worm.density
    1   3.6   11     4.1      4
    2   5.1   2      5.2      7
    3   2.8   3      4.3      2
    4   2.4   5      4.9      5
    5   3.8   0      4.2      6
    6   3.1   2      3.9      2
    7   3.5   3      4.2      3
    8   2.1   0      4.8      4
    9   1.9   0      5.7      9
    10  1.5   4      5.0      7
    11  2.9   10     5.2      8
    12  3.3   1      4.1      1
    13  3.7   2      4.0      2
    14  1.8   6      3.8      0
    15  4.1   0      5.0      6
    16  3.9   0      4.9      8
    17  2.2   8      4.7      4
    18  4.4   2      4.5      5
    19  2.9   1      3.5      1
    20  0.8   10     5.1      3

We might want to extract the columns that were factors:

    worms[,sapply(worms,is.factor)]
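The row and column selections described on this page can be tried without the worms file. The following is a minimal sketch under stated assumptions: the dataframe fields and all of its values are made up for illustration, and drop = FALSE is an addition (not used in the text) that keeps the result a dataframe even when only one column matches.

```r
# Hypothetical stand-in for the worms dataframe (all values are made up)
fields <- data.frame(
  Field.Name   = c("A", "B", "C", "D"),
  Area         = c(3.6, 5.1, 2.8, 2.4),
  Damp         = c(FALSE, FALSE, TRUE, TRUE),
  Worm.density = c(4L, 7L, 2L, 5L),
  stringsAsFactors = TRUE
)

# sapply returns one TRUE/FALSE per column: does the column hold numbers?
is_num <- sapply(fields, is.numeric)
numeric_cols <- fields[, is_num, drop = FALSE]

# The same idea with is.factor picks out the factor columns
factor_cols <- fields[, sapply(fields, is.factor), drop = FALSE]

# Row selection: two logical conditions joined by & (AND)
picked <- fields[fields$Damp & fields$Worm.density > median(fields$Worm.density), ]
print(picked)
```

Logical columns such as Damp are neither numeric nor factors, so they are left out of both column selections.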

195

    14  Observatory.Ridge  1.8   6      Grassland   3.8      FALSE  0
    15  Pond.Field         4.1   0      Meadow      5.0      TRUE   6
    16  Water.Meadow       3.9   0      Meadow      4.9      TRUE   8
    17  Cheapside          2.2   8      Scrub       4.7      TRUE   4
    18  Pound.Hill         4.4   2      Arable      4.5      FALSE  5
    19  Gravel.Pit         NA    1      Grassland   3.5      FALSE  1
    20  Farm.Wood          0.8   10     Scrub       5.1      TRUE   3

By inspection we can see that we should like to leave out row 2 (one missing value), row 7 (three missing values) and row 19 (one missing value). This could not be simpler:

    na.omit(data)
        Field.Name         Area  Slope  Vegetation  Soil.pH  Damp   Worm.density
    1   Nashs.Field        3.6   11     Grassland   4.1      FALSE  4
    3   Nursery.Field      2.8   3      Grassland   4.3      FALSE  2
    4   Rush.Meadow        2.4   5      Meadow      4.9      TRUE   5
    5   Gunness.Thicket    3.8   0      Scrub       4.2      FALSE  6
    6   Oak.Mead           3.1   2      Grassland   3.9      FALSE  2
    8   Ashurst            2.1   0      Arable      4.8      FALSE  4
    9   The.Orchard        1.9   0      Orchard     5.7      FALSE  9
    10  Rookery.Slope      1.5   4      Grassland   5.0      TRUE   7
    11  Garden.Wood        2.9   10     Scrub       5.2      FALSE  8
    12  North.Gravel       3.3   1      Grassland   4.1      FALSE  1
    13  South.Gravel       3.7   2      Grassland   4.0      FALSE  2
    14  Observatory.Ridge  1.8   6      Grassland   3.8      FALSE  0
    15  Pond.Field         4.1   0      Meadow      5.0      TRUE   6
    16  Water.Meadow       3.9   0      Meadow      4.9      TRUE   8
    17  Cheapside          2.2   8      Scrub       4.7      TRUE   4
    18  Pound.Hill         4.4   2      Arable      4.5      FALSE  5
    20  Farm.Wood          0.8   10     Scrub       5.1      TRUE   3

and you see that rows 2, 7 and 19 have been omitted in creating the new dataframe. Alternatively, you can use the na.exclude function. This differs from na.omit only in the class of the na.action attribute of the result, which gives different behaviour in functions making use of naresid and napredict: when na.exclude is used the residuals and predictions are padded to the correct length by inserting NAs for cases omitted by na.exclude (in this example they would be of length 20, whereas na.omit would give residuals and predictions of length 17).

    new.frame <- na.exclude(data)

The function to test for the presence of missing values across a dataframe is complete.cases:

    complete.cases(data)
     [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
    [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

You could use this as a less efficient analogue of na.omit(data), but why would you?

    data[complete.cases(data),]
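The difference between na.omit, na.exclude and complete.cases described above can be checked on a tiny example. This is a sketch under stated assumptions: the dataframe d and its values are made up, and the regression of Slope on Area is only there to produce residuals of a known length.

```r
# Hypothetical five-row dataframe with two missing values (values are made up)
d <- data.frame(
  Area  = c(3.6, NA, 2.8, 2.4, 3.8),
  Slope = c(11, 2, 3, 5, 0),
  Damp  = c(FALSE, FALSE, TRUE, NA, FALSE)
)

# Both approaches keep only the rows with no NA in any column
clean_omit <- na.omit(d)
clean_cc   <- d[complete.cases(d), ]

# In a model fit, na.exclude pads residuals back to the original length,
# while na.omit leaves them at the length of the rows actually used
fit_exclude <- lm(Slope ~ Area, data = d, na.action = na.exclude)
fit_omit    <- lm(Slope ~ Area, data = d, na.action = na.omit)
n_exclude <- length(residuals(fit_exclude))
n_omit    <- length(residuals(fit_omit))
print(c(exclude = n_exclude, omit = n_omit))
```

Only Area and Slope enter the model, so the row with a missing Damp value is still used in the fit; that is why the na.omit residuals have length 4 rather than 3.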

196

It is well worth checking the individual variables separately, because it is possible that one or more variables contribute most of the missing values, and it may be preferable to remove these variables from the modelling rather than lose the valuable information about the other explanatory variables associated with these cases. Use summary to count the missing values for each variable in the dataframe, or use apply with the function is.na to sum the missing values in each variable:

    apply(apply(data,2,is.na),2,sum)
    Field.Name  Area  Slope  Vegetation  Soil.pH  Damp  Worm.density
    0           1     1      0           1        1     1

You can see that in this case no single variable contributed more missing values than any other.

4.5.1 Replacing NAs with zeros

You would need to think carefully before doing this, but there might be circumstances when you wanted to replace the missing values by zero (or by some other missing-value indicator). Continuing the missing-data worms example above, where the dataframe called data contained five missing values (NA), this is how to replace all the NAs by zeros:

    data[is.na(data)]<-0

4.6 Using order and !duplicated to eliminate pseudoreplication

In this rather more complicated example, you are asked to extract a single record for each vegetation type, and that record is to be the case within each vegetation type that has the greatest worm density. There are two steps to this: first order all of the rows in a new dataframe using rev(order(Worm.density)), then select the subset of these rows which is not duplicated (!duplicated) within each vegetation type in the new dataframe (using new$Vegetation):

    new <- worms[rev(order(Worm.density)),]
    new[!duplicated(new$Vegetation),]
        Field.Name         Area  Slope  Vegetation  Soil.pH  Damp   Worm.density
    9   The.Orchard        1.9   0      Orchard     5.7      FALSE  9
    16  Water.Meadow       3.9   0      Meadow      4.9      TRUE   8
    11  Garden.Wood        2.9   10     Scrub       5.2      FALSE  8
    10  Rookery.Slope      1.5   4      Grassland   5.0      TRUE   7
    2   Silwood.Bottom     5.1   2      Arable      5.2      FALSE  7

4.7 Complex ordering with mixed directions

Sometimes there are multiple sorting variables, but the variables have to be sorted in opposing directions. In this example, the task is to order the database first by vegetation type in alphabetical order (the default) and then within each vegetation type to sort by worm density in decreasing order (highest densities first). The trick here is to use order (rather than rev(order)) but to put a minus sign in front of Worm.density like this:
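As a minimal sketch of the minus-sign trick, and of the !duplicated step from section 4.6, the following uses a small made-up dataframe called fields rather than the worms data; the names and values are assumptions for illustration only.

```r
# Hypothetical data: three vegetation types, two fields each (values made up)
fields <- data.frame(
  Vegetation   = c("Scrub", "Arable", "Meadow", "Arable", "Scrub", "Meadow"),
  Worm.density = c(6, 4, 8, 7, 3, 5)
)

# Vegetation ascending (alphabetical), Worm.density descending within it:
# the minus sign reverses the direction for that one numeric variable
sorted <- fields[order(fields$Vegetation, -fields$Worm.density), ]
print(sorted)

# Because the highest density now comes first within each vegetation type,
# keeping the first (non-duplicated) row per type gives one record per type
best <- sorted[!duplicated(sorted$Vegetation), ]
print(best)
```

The minus sign works here because Worm.density is numeric; it cannot be applied directly to a character variable.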

198

It is less likely that you will want to select columns on the basis of logical operations, but it is perfectly possible. Suppose that for some reason you want to select the columns that contain the character ‘S’ (upper-case S). In R the function for this is grep, which returns the subscript (a number or set of numbers) indicating which character strings within a vector of character strings contained an upper-case S. The names of the variables within a dataframe are obtained by the names function:

    names(worms)
    [1] "Field.Name"   "Area"         "Slope"        "Vegetation"
    [5] "Soil.pH"      "Damp"         "Worm.density"

so we want our function grep to pick out variables numbers 3 and 5 because they are the only ones containing upper-case S:

    grep("S",names(worms))
    [1] 3 5

Finally, we can use these numbers as subscripts [,c(3,5)] to select columns 3 and 5:

    worms[,grep("S",names(worms))]
        Slope  Soil.pH
    1   11     4.1
    2   2      5.2
    3   3      4.3
    4   5      4.9
    5   0      4.2
    6   2      3.9
    7   3      4.2
    8   0      4.8
    9   0      5.7
    10  4      5.0
    11  10     5.2
    12  1      4.1
    13  2      4.0
    14  6      3.8
    15  0      5.0
    16  0      4.9
    17  8      4.7
    18  2      4.5
    19  1      3.5
    20  10     5.1

4.8 A dataframe with row names instead of row numbers

You can suppress the creation of row numbers and allocate your own unique names to each row by altering the syntax of the read.table function. The first column of the worms database contains the names of the fields in which the other variables were measured. Up to now, we have read this column as if it was the first variable (p. 161).
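One way to get row names from the first column is the row.names argument of read.table. The sketch below is an illustration under stated assumptions: it reads from an inline string through the text argument instead of a file, and the three rows are a cut-down copy of the worms data used only to keep the example self-contained.

```r
# Inline stand-in for a data file (three rows only, for illustration)
txt <- "Field.Name Area Slope
Nashs.Field 3.6 11
Silwood.Bottom 5.1 2
Nursery.Field 2.8 3"

# row.names = 1 tells read.table to use column 1 as the row names,
# so Field.Name is no longer a variable in the dataframe
worms_named <- read.table(text = txt, header = TRUE, row.names = 1)
print(worms_named)

# Rows can now be selected by name as well as by number
area_sb <- worms_named["Silwood.Bottom", "Area"]
```

Row names must be unique, so this only works when the first column has no repeated values.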

200

    3   c  FALSE  0.61765685
    4   d  TRUE   0.78541650
    5   e  FALSE  0.51168828
    6   f  TRUE   0.53526324
    7   g  TRUE   0.05552335
    8   h  TRUE   0.78486234
    9   i  FALSE  0.68385443
    10  j  FALSE  0.89367837

Note that the order of the columns is controlled simply by the sequence of the vector names (left to right) specified within the data.frame function.

In this next example, we create a table of counts of random integers from a Poisson distribution, then convert the table into a dataframe. First, we make a table object:

    y <- rpois(1500,1.5)
    table(y)
    y
      0   1   2   3   4   5   6   7
    344 502 374 199  63  11   5   2

Now it is simple to convert this table object into a dataframe with two variables, the count and the frequency, using the as.data.frame function:

    short<-as.data.frame(table(y))
    short
       y  Freq
    1  0  344
    2  1  502
    3  2  374
    4  3  199
    5  4  63
    6  5  11
    7  6  5
    8  7  2

In some cases you might want to expand a dataframe like the one above such that it had a separate row for every distinct count (i.e. 344 rows with y = 0, 502 rows with y = 1, 374 rows with y = 2, and so on). This is very straightforward using subscripts. We need to create a vector of indices containing 344 repeats of 1, 502 repeats of 2 and so on. Note that these repeats are of the row numbers (1, 2, 3, ..., 8), not repeats of the values of y (0, 1, 2, ..., 7).

    index<-rep(1:8,short$Freq)

This simple command has produced a vector with the right number of repeats of each of the row numbers:

    length(index)
    [1] 1500
    hist(index,-0.5:8.5)

201

[Figure: Histogram of index; x axis: index, y axis: Frequency]

To get the long version of the dataframe, we just use index as the row specifier [index, ]:

    long<-short[index,]

Here is a look at the bottom of this long dataframe:

    tail(long)
         y  Freq
    7.1  6  5
    7.2  6  5
    7.3  6  5
    7.4  6  5
    8    7  2
    8.1  7  2

Note the way that R has handled the duplicate row numbers, creating a nested series to indicate the repeats of each of the original row numbers. A longer-winded alternative might use lapply with rep to do the same thing:

    long2 <- as.data.frame(lapply(short, function(x) rep(x, short$Freq)))
    tail(long2)
          y  Freq
    1495  6  5
    1496  6  5
    1497  6  5
    1498  6  5

202

    1499  7  2
    1500  7  2

Note the use of the anonymous function in lapply to generate the repeats of each row by the value specified in Freq. Before you did anything useful with this longer dataframe, you would probably want to get rid of the redundant column called Freq.

4.10 Eliminating duplicate rows from a dataframe

Sometimes a dataframe will contain duplicate rows where all the variables have exactly the same values in two or more rows. Here is a simple example:

    dups <- read.table("c:\\temp\\dups.txt",header=T)
    dups
       var1  var2  var3  var4
    1  1     2     3     1
    2  1     2     2     1
    3  3     2     1     1
    4  4     4     2     1
    5  3     2     1     1
    6  6     1     2     5
    7  1     2     3     2

Note that row number 5 is an exact duplicate of row number 3. To create a dataframe with all the duplicate rows stripped out, use the unique function like this:

    unique(dups)
       var1  var2  var3  var4
    1  1     2     3     1
    2  1     2     2     1
    3  3     2     1     1
    4  4     4     2     1
    6  6     1     2     5
    7  1     2     3     2

Notice that the row names in the new dataframe are the same as in the original, so that you can spot that row number 5 was removed by the operation of the function unique.

To view the rows that are duplicates in a dataframe (if any) use the duplicated function to create a vector of TRUE and FALSE to act as the filter:

    dups[duplicated(dups),]
       var1  var2  var3  var4
    5  3     2     1     1

4.11 Dates in dataframes

There is an introduction to the complexities of using dates and times on pp. 101–113. Here we illustrate a simple example:

203

    nums <- read.table("c:\\temp\\sortdata.txt",header=T)
    attach(nums)
    head(nums)
       name     date        response    treatment
    1  albert   25/08/2003  0.05963704  A
    2  ann      21/05/2003  1.46555993  A
    3  john     12/10/2003  1.59406539  B
    4  ian      02/12/2003  2.09505949  A
    5  michael  18/10/2003  2.38330748  B
    6  ann      02/07/2003  2.86983693  B

The idea is to order the rows by date. The ordering is to be applied to all four columns of the dataframe. Note that ordering on the basis of our variable called date does not work in the way we want it to:

    nums[order(date),]
        name     date        response     treatment
    53  rachel   01/08/2003  32.98792196  B
    65  albert   02/06/2003  38.41979568  A
    6   ann      02/07/2003  2.86983693   B
    10  cecily   02/11/2003  6.81467570   A
    4   ian      02/12/2003  2.09505949   A
    29  michael  03/05/2003  15.59890900  B
    67  william  03/09/2003  38.95014474  A

This is because the format used for depicting the date is a character string in which the first characters are the day, then the month, then the year, so the dataframe has been sorted into alphabetical order, rather than date order as required. In order to sort by date we need first to convert our variable into date-time format using the strptime function (see p. 103 for details):

    dates <- strptime(date,format="%d/%m/%Y")
    dates
    [1] "2003-08-25" "2003-05-21" "2003-10-12" "2003-12-02" "2003-10-18"
    [6] "2003-07-02" "2003-09-27" "2003-06-05" "2003-06-11" "2003-11-02"

Note how strptime has produced a date object with year first, then a hyphen, then month, then a hyphen, then day, and this will sort into the desired sequence. We bind the new variable to the dataframe like this:

    nums <- cbind(nums,dates)

Now that the new variable is in the correct format, the dates can be sorted correctly:

    nums[order(dates),]
        name     date        response     treatment  dates
    49  albert   21/04/2003  30.66632632  A          2003-04-21
    63  james    24/04/2003  37.04140266  A          2003-04-24
    24  john     27/04/2003  12.70257306  A          2003-04-27
    33  william  30/04/2003  18.05707279  B          2003-04-30
    29  michael  03/05/2003  15.59890900  B          2003-05-03
    71  ian      06/05/2003  39.97237868  A          2003-05-06
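The contrast between alphabetical and chronological ordering can be reproduced with a few made-up rows. This sketch swaps in as.Date for strptime, on the assumption that only calendar dates (no times of day) are involved; the dataframe visits is hypothetical, with names and dates borrowed from the first rows shown above.

```r
# Hypothetical data with day/month/year dates stored as character strings
visits <- data.frame(
  name = c("albert", "ann", "john", "ian"),
  date = c("25/08/2003", "21/05/2003", "12/10/2003", "02/12/2003"),
  stringsAsFactors = FALSE
)

# Ordering on the character strings sorts by day of the month first
alphabetical <- visits[order(visits$date), ]

# Convert to Date class using the same format string, then order on that
visits$dates <- as.Date(visits$date, format = "%d/%m/%Y")
chronological <- visits[order(visits$dates), ]
print(chronological)
```

A Date column can be stored in a dataframe directly, so no separate cbind step is needed in this version.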

206

Because at least one of the variable names is identical in the two dataframes (in this case, two variables are identical, namely Genus and species) we can use the simplest of all merge commands:

    merge(flowering,lifeforms)
       Genus   species      flowering  lifeform
    1  Acer    platanoides  May        tree
    2  Ajuga   reptans      June       herb
    3  Lamium  album        January    herb

The important point to note is that the merged dataframe contains only those rows which had complete entries in both dataframes. Two rows from the lifeforms database were excluded because there were no flowering time data for them (Acer palmatum and Conyza sumatrensis), and three rows from the flowering database were excluded because there were no life-form data for them (Chamerion angustifolium, Conyza bilbaoana and Brassica napus).

If you want to include all the species, with missing values (NA) inserted when flowering times or life forms are not known, then use the all=T option:

    (both <- merge(flowering,lifeforms,all=T))
       Genus      species        flowering  lifeform
    1  Acer       platanoides    May        tree
    2  Acer       palmatum       <NA>       tree
    3  Ajuga      reptans        June       herb
    4  Brassica   napus          April      <NA>
    5  Chamerion  angustifolium  July       <NA>
    6  Conyza     bilbaoana      August     <NA>
    7  Conyza     sumatrensis    <NA>       annual
    8  Lamium     album          January    herb

One complexity that often arises is that the same variable has different names in the two dataframes that need to be merged. The simplest solution is often to edit the variable names in your spreadsheet before reading them into R, but failing this, you need to specify the names in the first dataframe (known conventionally as the x dataframe) and the second dataframe (known conventionally as the y dataframe) using the by.x and by.y options in merge. We have a third dataframe containing information on the seed weights of all eight species, but the variable Genus is called name1 and the variable species is called name2.

    (seeds <- read.table("c:\\temp\\seedwts.txt",header=T))
       name1      name2          seed
    1  Acer       platanoides    32.0
    2  Lamium     album          12.0
    3  Ajuga      reptans        4.0
    4  Chamerion  angustifolium  1.5
    5  Conyza     bilbaoana      0.5
    6  Brassica   napus          7.0
    7  Acer       palmatum       21.0
    8  Conyza     sumatrensis    0.6

Just using merge(both,seeds) fails miserably: you should try it, to see what happens. We need to inform the merge function that Genus and name1 are synonyms (different names for the same variable), as are species and name2.
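The by.x and by.y idea can be sketched with two small dataframes built in place. The three-row dataframes below are cut-down, partly made-up stand-ins for flowering and seeds (an assumption for illustration, not the files used in the text); the second merge adds all = TRUE to show where the NAs appear.

```r
# Cut-down stand-ins for the two dataframes (three rows each)
flowering <- data.frame(
  Genus     = c("Acer", "Ajuga", "Brassica"),
  species   = c("platanoides", "reptans", "napus"),
  flowering = c("May", "June", "April")
)
seeds <- data.frame(
  name1 = c("Acer", "Ajuga", "Conyza"),
  name2 = c("platanoides", "reptans", "bilbaoana"),
  seed  = c(32.0, 4.0, 0.5)
)

# Tell merge which columns in x (by.x) match which columns in y (by.y)
inner <- merge(flowering, seeds,
               by.x = c("Genus", "species"), by.y = c("name1", "name2"))

# all = TRUE keeps unmatched rows from both sides, filling gaps with NA
outer <- merge(flowering, seeds,
               by.x = c("Genus", "species"), by.y = c("name1", "name2"),
               all = TRUE)
print(outer)
```

In both results the key columns carry the names from the x dataframe (Genus and species), not name1 and name2.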

207

    merge(both,seeds,by.x=c("Genus","species"),by.y=c("name1","name2"))
       Genus      species        flowering  lifeform  seed
    1  Acer       palmatum       <NA>       tree      21.0
    2  Acer       platanoides    May        tree      32.0
    3  Ajuga      reptans        June       herb      4.0
    4  Brassica   napus          April      <NA>      7.0
    5  Chamerion  angustifolium  July       <NA>      1.5
    6  Conyza     bilbaoana      August     <NA>      0.5
    7  Conyza     sumatrensis    <NA>       annual    0.6
    8  Lamium     album          January    herb      12.0

Note that the variable names used in the merged dataframe are the names used in the x dataframe.

4.14 Adding margins to a dataframe

Suppose we have a dataframe showing sales by season and by person:

    frame <- read.table("c:\\temp\\sales.txt",header=T)
    frame
      name              spring  summer  autumn  winter
    1 Jane.Smith        14      18      11      12
    2 Robert.Jones      17      18      10      13
    3 Dick.Rogers       12      16      9       14
    4 William.Edwards   15      14      11      10
    5 Janet.Jones       11      17      11      16

We want to add margins to this dataframe showing departures of the seasonal means from the overall mean (as an extra row at the bottom) and departures of the people’s means (as an extra column on the right). Finally, we want the sales in the body of the dataframe to be represented by departures from the overall mean.

    people <- rowMeans(frame[,2:5])
    people <- people-mean(people)
    people
    [1]  0.30  1.05 -0.70 -0.95  0.30

It is very straightforward to add a new column to the dataframe using cbind:

    (new.frame <- cbind(frame,people))
      name              spring  summer  autumn  winter  people
    1 Jane.Smith        14      18      11      12      0.30
    2 Robert.Jones      17      18      10      13      1.05
    3 Dick.Rogers       12      16      9       14      -0.70
    4 William.Edwards   15      14      11      10      -0.95
    5 Janet.Jones       11      17      11      16      0.30

Robert Jones is the most effective sales person (+1.05) and William Edwards is the least effective (–0.95). The column means are calculated in a similar way:

208

    seasons <- colMeans(frame[,2:5])
    seasons <- seasons-mean(seasons)
    seasons
    spring  summer  autumn  winter
    0.35    3.15    -3.05   -0.45

Sales are highest in summer (+3.15) and lowest in autumn (–3.05). Now there is a hitch, however, because there are only four column means but there are six columns in new.frame, so we cannot use rbind directly. The simplest way to deal with this is to make a copy of one of the rows of the new dataframe

    new.row <- new.frame[1,]

and then edit this to include the values we want: a label in the first column to say ‘seasonal effects’, then the four column means, and then a zero for the grand mean of the effects:

    new.row[1] <- "seasonal effects"
    new.row[2:5] <- seasons
    new.row[6] <- 0
    new.row
      name              spring  summer  autumn  winter  people
    1 seasonal effects  0.35    3.15    -3.05   -0.45   0

Now we can use rbind to add our new row to the bottom of the extended dataframe:

    (new.frame <- rbind(new.frame,new.row))
      name              spring  summer  autumn  winter  people
    1 Jane.Smith        14.00   18.00   11.00   12.00   0.30
    2 Robert.Jones      17.00   18.00   10.00   13.00   1.05
    3 Dick.Rogers       12.00   16.00   9.00    14.00   -0.70
    4 William.Edwards   15.00   14.00   11.00   10.00   -0.95
    5 Janet.Jones       11.00   17.00   11.00   16.00   0.30
    6 seasonal effects  0.35    3.15    -3.05   -0.45   0.00

The last task is to replace the counts of sales in the dataframe new.frame[1:5,2:5] by departures from the overall mean sale per person per season (the grand mean, gm = 13.45). We need to use unlist to stop R from estimating a separate mean for each column, then create a vector of length 4 containing repeated values of the grand mean (one for each column of sales). Finally, we use sweep to subtract the grand mean from each value:

    gm <- mean(unlist(new.frame[1:5,2:5]))
    gm <- rep(gm,4)
    new.frame[1:5,2:5] <- sweep(new.frame[1:5,2:5],2,gm)
    new.frame
      name              spring  summer  autumn  winter  people
    1 Jane.Smith        0.55    4.55    -2.45   -1.45   0.30
    2 Robert.Jones      3.55    4.55    -3.45   -0.45   1.05
    3 Dick.Rogers       -1.45   2.55    -4.45   0.55    -0.70
    4 William.Edwards   1.55    0.55    -2.45   -3.45   -0.95

209

    5 Janet.Jones       -2.45   3.55    -2.45   2.55    0.30
    6 seasonal effects  0.35    3.15    -3.05   -0.45   0.00

To complete the table we want to put the grand mean in the bottom right-hand corner:

    new.frame[6,6] <- gm[1]
    new.frame
      name              spring  summer  autumn  winter  people
    1 Jane.Smith        0.55    4.55    -2.45   -1.45   0.30
    2 Robert.Jones      3.55    4.55    -3.45   -0.45   1.05
    3 Dick.Rogers       -1.45   2.55    -4.45   0.55    -0.70
    4 William.Edwards   1.55    0.55    -2.45   -3.45   -0.95
    5 Janet.Jones       -2.45   3.55    -2.45   2.55    0.30
    6 seasonal effects  0.35    3.15    -3.05   -0.45   13.45

The best per-season performance was shared by Jane Smith and Robert Jones who each sold 4.55 units more than the overall average in summer.

4.15 Summarizing the contents of dataframes

The usual function to obtain cross-classified summary functions like the mean or median for a single vector is tapply (p. 245), but there are three useful functions for summarizing whole dataframes:

• summary: summarize all the contents of all the variables;
• aggregate: create a table after the fashion of tapply;
• by: perform functions for each level of specified factors.

Use of summary and by with the worms database was described on p. 163. The aggregate function is used like tapply to apply a function (mean in this case) to the levels of a specified categorical variable (Vegetation in this case) for a specified range of variables (Area, Slope, Soil.pH and Worm.density) which are specified using their subscripts as a column index, worms[,c(2,3,5,7)]:

    aggregate(worms[,c(2,3,5,7)],by=list(veg=Vegetation),mean)
       veg        Area      Slope     Soil.pH   Worm.density
    1  Arable     3.866667  1.333333  4.833333  5.333333
    2  Grassland  2.911111  3.666667  4.100000  2.444444
    3  Meadow     3.466667  1.666667  4.933333  6.333333
    4  Orchard    1.900000  0.000000  5.700000  9.000000
    5  Scrub      2.425000  7.000000  4.800000  5.250000

The by argument needs to be a list even if, as here, we have only one classifying factor. Here are the aggregated summaries cross-classified by Vegetation and Damp:

    aggregate(worms[,c(2,3,5,7)],by=list(veg=Vegetation,d=Damp),mean)
       veg        d      Area      Slope     Soil.pH   Worm.density
    1  Arable     FALSE  3.866667  1.333333  4.833333  5.333333
    2  Grassland  FALSE  3.087500  3.625000  3.987500  1.875000

210

    3  Orchard    FALSE  1.900000  0.000000  5.700000  9.000000
    4  Scrub      FALSE  3.350000  5.000000  4.700000  7.000000
    5  Grassland  TRUE   1.500000  4.000000  5.000000  7.000000
    6  Meadow     TRUE   3.466667  1.666667  4.933333  6.333333
    7  Scrub      TRUE   1.500000  9.000000  4.900000  3.500000

Note that this summary is unbalanced because there were no damp arable or orchard sites and no dry meadows.
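The behaviour of aggregate with one and with two classifying factors, including the unbalanced result, can be seen on a small made-up dataframe. In this sketch the dataframe fields and its values are assumptions for illustration, and columns are picked by name rather than by subscript.

```r
# Hypothetical data: three vegetation types, not every type occurs damp and dry
fields <- data.frame(
  Vegetation   = c("Arable", "Arable", "Meadow", "Meadow", "Scrub"),
  Damp         = c(FALSE, FALSE, TRUE, TRUE, FALSE),
  Area         = c(2, 4, 3, 5, 1),
  Worm.density = c(4, 6, 5, 9, 3)
)
measured <- fields[, c("Area", "Worm.density")]

# One classifying factor: by must still be a list
by_veg <- aggregate(measured, by = list(veg = fields$Vegetation), FUN = mean)
print(by_veg)

# Two classifying factors: only combinations present in the data get a row
by_veg_damp <- aggregate(measured,
                         by = list(veg = fields$Vegetation, d = fields$Damp),
                         FUN = mean)
print(by_veg_damp)
```

With three vegetation types and two levels of Damp there could be six rows, but only three combinations occur in these data, so only three rows are returned.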

211

5 Graphics

Producing high-quality graphics is one of the main reasons for doing statistical computing. The particular plot function you need will depend on the number of variables you want to plot and the pattern you wish to highlight. The plotting functions in this chapter are dealt with under four headings:

• plots with two variables;
• plots for a single sample;
• multivariate plots;
• special plots for particular purposes.

Changes to the detailed look of the graphs are dealt with in Chapter 29.

5.1 Plots with two variables

With two variables (typically the response variable on the y axis and the explanatory variable on the x axis), the kind of plot you should produce depends upon the nature of your explanatory variable. When the explanatory variable is a continuous variable, such as length or weight or altitude, then the appropriate plot is a scatterplot. In cases where the explanatory variable is categorical, such as genotype or colour or gender, then the appropriate plot is either a box-and-whisker plot (when you want to show the scatter in the raw data) or a barplot (when you want to emphasize the effect sizes).

The most frequently used plotting functions for two variables in R are the following:

• plot(x,y) scatterplot of y against x;
• plot(factor, y) box-and-whisker plot of y at each factor level;
• barplot(y) heights from a vector of y values (one bar per factor level).

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.
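The three plotting calls listed above can be tried on simulated data. This is a minimal sketch: the variables x, y and group are made up, and pdf(NULL) is used only so that the code runs without opening a graphics window (remove that line and the dev.off() call to see the plots on screen).

```r
pdf(NULL)      # null graphics device: draw without creating a file
set.seed(1)    # make the simulated data reproducible

# Made-up data: a continuous explanatory variable and a three-level factor
x     <- runif(30, 0, 100)
y     <- 20 + 0.4 * x + rnorm(30, sd = 5)
group <- factor(rep(c("a", "b", "c"), each = 10))

# Continuous explanatory variable: scatterplot
plot(x, y)

# Categorical explanatory variable: box-and-whisker plot per factor level
plot(group, y)

# Barplot of one height per factor level (here, the group means)
means   <- tapply(y, group, mean)
bar_mid <- barplot(means)

invisible(dev.off())
```

barplot returns the x coordinates of the bar midpoints, which is useful if you later want to add labels or error bars above each bar.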

212

5.2 Plotting with two continuous explanatory variables: Scatterplots

The plot function draws axes and adds a scatterplot of points. Two extra functions, points and lines, add extra points or lines to an existing plot. There are two ways of specifying plot, points and lines, and you should choose whichever you prefer:

• Cartesian plot(x,y)
• formula plot(y~x)

The advantage of the formula-based plot is that the plot function and the model fit look and feel the same (response variable, tilde, explanatory variable). If you use Cartesian plots (eastings first, then northings, like the grid reference on a map) then the plot has ‘x then y’ while the model has ‘y then x’.

At its most basic, the plot function needs only two arguments: first the name of the explanatory variable (x in this case), and second the name of the response variable (y in this case): plot(x,y). The data we want to plot are read into R from a file:

    data1 <- read.table("c:\\temp\\scatter1.txt",header=T)
    attach(data1)
    names(data1)
    [1] "x1" "y1"

Producing the scatterplot could not be simpler: just type plot(x1,y1,col="red") with the vector of x values first, then the vector of y values (changing the colour of the points is optional).

[Figure: scatterplot of y1 against x1]

Notice that the axes are labelled with the variable names, unless you chose to override these with xlab and ylab. It is often a good idea to have longer, more explicit labels for the axes than are provided by the variable

213

names that are used as default options (x1 and y1 in this case). Suppose we want to change the label x1 into the longer label ‘Explanatory variable’ and the label on the y axis from y1 to ‘Response variable’. Then we use xlab and ylab like this:

    plot(x1,y1,col="red",xlab="Explanatory variable",ylab="Response variable")

[Figure: scatterplot of Response variable against Explanatory variable]

The great thing about graphics in R is that it is extremely straightforward to add things to your plots. In the present case, we might want to add a regression line through the cloud of data points. The function for this is abline, which can take as its argument the linear model object lm(y1~x1) as explained on p. 491:

    abline(lm(y1~x1))

[Figure: the same scatterplot with the fitted regression line added]

214

Just as it is easy to add lines to the plot, so it is straightforward to add more points. The extra points are in another file:

    data2 <- read.table("c:\\temp\\scatter2.txt",header=T)
    attach(data2)
    names(data2)
    [1] "x2" "y2"

The new points (x2,y2) are added using the points function like this:

    points(x2,y2,col="blue",pch=16)

and we can finish by adding a regression line through the extra points:

    abline(lm(y2~x2))

[Figure: scatterplot of Response variable against Explanatory variable with both data sets and both regression lines]

This example shows a very important feature of the plot function. Notice that several of the lower values from the second (blue) data set have not appeared on the graph. This is because (unless we say otherwise at the outset) R chooses ‘pretty’ scaling for the axes based on the data range in the first set of points to be drawn. If, as here, the range of subsequent data sets lies outside the scale of the x and y axes, then points are simply left off without any warning message. One way to cure this problem is to plot all the data with type="n" so that the axes are scaled to encompass all the points from all the data sets (using the concatenation function, c), then to use points and lines to add both sets of data to the blank axes, like this:

    plot(c(x1,x2),c(y1,y2),xlab="Explanatory variable",
         ylab="Response variable",type="n")
    points(x1,y1,col="red")
    points(x2,y2,col="blue",pch=16)

    abline(lm(y1~x1))
    abline(lm(y2~x2))

[Figure: scatterplot of Response variable against Explanatory variable showing all points from both data sets]

Now all of the points from both data sets appear on the scatterplot. You might want to take control of the selection of the limits for the x and y axes, rather than accept the ‘pretty’ default values. In the last plot, for instance, the minimum on the y axis was about 13 (but it is not exactly obvious). You might want to specify that the minimum on the y axis was zero. This is achieved with the ylim argument, which is a vector of length 2, specifying the minimum and maximum values for the y axis: ylim=c(0,70). You will want to control the scaling of the axes when you want two comparable graphs side by side, or when you want to overlay several lines or sets of points on the same axes.

A good way to find out the axis values is to use the range function applied to the data sets in aggregate:

    range(c(x1,x2))
    [1] 0.02849861 99.93262000
    range(c(y1,y2))
    [1] 13.41794 62.59482

Here the x axis needs to go from 0.02 up to 99.93 (0 to 100 would be pretty) and the y axis needs to go from 13.4 up to 62.6 (0 to 70 would be pretty). This is how the axes are drawn; the points and lines are added exactly as before:

    plot(c(x1,x2),c(y1,y2),xlim=c(0,100),ylim=c(0,70),
    xlab="Explanatory variable",ylab="Response variable",type="n")
    points(x1,y1,col="red")
    points(x2,y2,col="blue",pch=16)
    abline(lm(y1~x1))
    abline(lm(y2~x2))
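Rather than reading the limits off the output of range by eye, they can be computed. The sketch below uses the base function pretty, which returns round values covering its input; the data are simulated, so the values are assumptions, not those in the book's files.

```r
# Simulated stand-ins for the two data sets (assumed values)
set.seed(2)
x1 <- runif(30, 0, 100); y1 <- 25 + 0.3 * x1 + rnorm(30, sd = 4)
x2 <- runif(30, 0, 100); y2 <- 15 + 0.2 * x2 + rnorm(30, sd = 4)

# pretty() returns round values that cover the data; range() of those
# values gives limits that enclose every point
xlim <- range(pretty(c(x1, x2)))
ylim <- range(pretty(c(0, y1, y2)))   # 0 is included so the y axis extends down to zero

plot(c(x1, x2), c(y1, y2), xlim = xlim, ylim = ylim, type = "n",
     xlab = "Explanatory variable", ylab = "Response variable")
points(x1, y1, col = "red")
points(x2, y2, col = "blue", pch = 16)
abline(lm(y1 ~ x1))
abline(lm(y2 ~ x2))
```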

[Figure: scatterplot of Response variable (0 to 70) against Explanatory variable (0 to 100) with both data sets and both regression lines]

Adding a legend to the plot to explain the difference between the two colours of points would be useful. The thing to understand about the legend function is that the number of lines of text inside the legend box is determined by the length of the vector containing the labels (2 in this case: c("treatment","control")). The other two vectors must be of the same length as this: one for the plotting symbols pch=c(1,16) and one for the colours col=c("red","blue"). The legend function can be used with locator(1) to allow you to select exactly where on the plot surface the legend box should be placed. Click the mouse button when the cursor is where you want the top left of the box around the legend to be.

    legend(locator(1),c("treatment","control"),pch=c(1,16),
    col=c("red","blue"))

[Figure: the same scatterplot with a legend box (treatment, control) in the top left]
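When a script has to run without mouse input, legend also accepts a position keyword in place of locator(1). The sketch below assumes simulated data and uses hypothetical vector names (labels, symbols, colours) to make the equal-length requirement explicit.

```r
# Simulated stand-ins for the two groups (assumed values)
set.seed(3)
x1 <- runif(20, 0, 100); y1 <- 30 + 0.3 * x1 + rnorm(20, sd = 4)
x2 <- runif(20, 0, 100); y2 <- 20 + 0.2 * x2 + rnorm(20, sd = 4)

plot(c(x1, x2), c(y1, y2), type = "n",
     xlab = "Explanatory variable", ylab = "Response variable")
points(x1, y1, col = "red")                # pch = 1 by default
points(x2, y2, col = "blue", pch = 16)

# One entry per legend line in each of the three vectors
labels  <- c("treatment", "control")
symbols <- c(1, 16)
colours <- c("red", "blue")
stopifnot(length(symbols) == length(labels), length(colours) == length(labels))

# A keyword such as "topleft" places the box without a mouse click
legend("topleft", legend = labels, pch = symbols, col = colours)
```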

This is about as complicated as you would want to make any figure. Adding more information would begin to detract from the message.

5.2.1 Plotting symbols: pch

There are 256 different plotting symbols available in Windows (0 to 255). Here is a graphic showing all of them in sequence, from bottom left to top right:

    plot(0:10,0:10,xlim=c(0,32),ylim=c(0,40),type="n",xaxt="n",yaxt="n",
    xlab="",ylab="")
    x <- seq(1,31,2)
    s <- -16
    f <- -1
    for (y in seq(2,40,2.5)) {
    s <- s+16
    f <- f+16
    y2 <- rep(y,16)
    points(x,y2,pch=s:f,cex=0.7)
    text(x,y-1,as.character(s:f),cex=0.6)
    }

[Figure: grid of the plotting symbols pch = 0 to 255, sixteen per row, each with its number printed beneath]

The basic plotting symbols (pch) are shown in the bottom two rows, with their number immediately beneath. The default value, pch=1, is a small open circle in black. Note that values between 26 and 32 are not implemented at present and are ignored (blanks are plotted). Values for pch between 33 and 127 represent the ASCII character set, while values between 128 and 255 are the symbols from the Windows character set. The symbols for pch=19 and pch=20 are solid circles of different sizes (pch=20 is the ‘bullet point’ and is two-thirds the size of pch=19). The difference between pch=16 and pch=19 is that the latter uses a border, and so it is larger when line width lwd is large relative to character expansion cex. The symbol for pch=46 is the ‘dot’ and is treated specially (it is a rectangle of side 0.01 inch, scaled by cex, and if cex=1 (the default), each side is at least one pixel, which is 1/72 inch on the pdf, postscript and xfig devices, so that the dot does not disappear on scaling down the image).

5.2.2 Colour for symbols in plots

The plotting symbols (pch) numbered 21 to 25 allow you to specify the background (fill) colour and the border colour separately. In the illustration below, the background colours (bg) 1 to 8 are shown in the columns and numbered on the x axis. The border colours (col) 1 to 8 are shown in the rows and numbered on the y axis.

    plot(0:9,0:9,pch=16,type="n",xaxt="n",yaxt="n",ylab="col",xlab="bg")
    axis(1,at=1:8)
    axis(2,at=1:8)
    for (i in 1:8) points(1:8,rep(i,8),pch=c(21,22,24),bg=1:8,col=i)

[Figure: 8 × 8 grid of filled symbols, fill colour bg (1 to 8) on the x axis and border colour col (1 to 8) on the y axis]

Some combinations are visually more effective than others. Black borders in row 1 (col=1) are effective with all the shapes and fill colours, but red borders in row 2 (col=2) work well only with green, pale blue, yellow and grey fill colours. If you specify only col, then open (white) symbols bordered with the specified colour are produced. If you specify only bg, then solid symbols filled with the specified colour, but bordered in black, are produced. If you specify both col and bg with the same colours, then solid, apparently unbordered symbols are produced. If you specify both col and bg using different colours, then

[Figure: place names (Wytham Meads to Sandhurst) drawn as text labels on a plot]

5.2.4 Identifying individuals in scatterplots

The best way to identify multiple individuals in scatterplots is to use a combination of colours and symbols. A useful trick is to use as.numeric to convert a grouping factor (the variable acting as the subject identifier) into a colour and/or a symbol. Here is an example where reaction time is plotted against duration of sleep deprivation for 18 subjects:

    data <- read.table("c:\\temp\\sleep.txt",header=T)
    attach(data)
    plot(Days,Reaction)

I think you will agree that the raw scatterplot is uninformative; the individuals need to stand out more clearly from one another. The main purpose of the graphic is to show the relationship between sleep deprivation (measured in days) and reaction time. Another aim is to draw attention to the differences between the 18 subjects in their mean reaction times, and to differences in the rate of increase of reaction time with the duration of sleep deprivation. Because there are so many subjects, the graph is potentially very confusing. One improvement is to join together the time series for the individual subjects, using a non-intrusive line colour. Let us do that first. We need to create a vector s to contain the numeric values (1 to 18) of the Subject identity numbers (which range, with gaps, between 308 and 372):

    s <- as.numeric(factor(Subject))

This vector will be used in subscripts to select the x and y coordinates of each subject’s time series in turn. Next, the subjects, k, are taken one at a time in a loop, and lines with type="b" (both points and lines) are drawn in a non-intrusive colour (gray is useful for this):

    plot(Days,Reaction,type="n")
    for (k in 1:max(s)) {
    x <- Days[s==k]
    y <- Reaction[s==k]
    lines(x,y,type="b",col="gray")
    }

Next, we need to select plotting symbols and colours for each subject. The colour-filled symbols pch=21, pch=22 and pch=24 are very useful here. Let us use the non-black colours (bcol from 2 to 8) for each of the first two plotting symbols (sym), then use colours 2 to 5 for the third plotting symbol for the remaining subjects:

    sym <- rep(c(21,22,24),c(7,7,4))
    bcol <- c(2:8,2:8,2:5)

Finally, we can take each subject in turn and use points to add the coloured symbols (each with black edges, col=1) to the graph:

    for (k in 1:max(s)) {
    points(Days[s==k],Reaction[s==k],pch=sym[k],bg=bcol[k],col=1)
    }

[Figure: Reaction (200 to 450) against Days (0 to 8), one grey line and one coloured symbol type per subject]

I think that there is insufficient room on the plotting surface to insert a legend with 18 labels in it. For a plot as complicated as this, it is best to put the explanations of the plotting symbols in the text. Perhaps the clearest pattern to emerge from the graphic is that subject 331 (the yellow-filled circle) clearly had a hangover on day 6, because he/she was the third fastest reactor after 9 days of deprivation.
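The conversion that drives this example is easy to check on a small case. The sketch below uses five made-up subject identifiers (an assumption; the real file has 18) to show how as.numeric(factor()) turns identifiers with gaps into consecutive codes that can index the symbol and colour vectors.

```r
# Hypothetical subject identifiers with gaps, three measurements each
Subject <- rep(c(308, 309, 330, 331, 372), each = 3)

# factor() sorts the distinct identifiers; as.numeric() returns their ranks
s <- as.numeric(factor(Subject))
unique(s)                      # consecutive codes starting at 1

# One symbol and one fill colour per subject, looked up by the code
sym  <- rep(c(21, 22, 24), c(2, 2, 1))
bcol <- c(2:3, 2:3, 2)
data.frame(Subject = unique(Subject), code = unique(s),
           pch = sym[unique(s)], bg = bcol[unique(s)])
```

Because the codes run from 1 to the number of subjects, sym[k] and bcol[k] are always valid subscripts inside the loop over k.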

5.2.5 Using a third variable to label a scatterplot

The following example concerns the response of a grass species Festuca rubra as measured by its biomass in small samples (FR) to two explanatory variables, soil pH and total hay yield (the mass of all plant species combined). A scatterplot of pH against hay shows the locations of the various samples. The idea is to use the text function to label each of the points on the scatterplot with the dry mass of F. rubra in that particular sample, to see whether there is systematic variation in the mass of Festuca with changes in hay yield and soil pH.

    data <- read.table("c:\\temp\\pgr.txt",header=T)
    attach(data)
    names(data)
    [1] "FR" "hay" "pH"
    plot(hay,pH)
    text(hay, pH, labels=round(FR, 2), pos=1, offset=0.5,cex=0.7)

[Figure: pH (3.5 to 7.0) against hay (2 to 9), each point labelled with its FR value]

The labels are centred on the x value of the point (pos=1) and are offset half a character below the point (offset=0.5). They show the value of FR rounded to two decimal places (labels=round(FR, 2)) at 70% character expansion (cex=0.7). There is an obvious problem with this method when there is lots of overlap between the labels (as in the top right), but the technique works well for more widely spaced points. The plot shows that high values of Festuca biomass are concentrated at intermediate values of both soil pH and hay yield.

You can also use a third variable to choose the colour of the points in your scatterplot. Here the points with FR above median are shown in red, the others in black:

    plot(hay,pH,pch=16,col=ifelse(FR>median(FR),"red","black"))
    legend(locator(1),c("FR>median","FR<=median"),pch=16,col=c("red","black"))
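Both techniques can be combined on one plot. The following is a sketch on simulated values (hay, pH and FR are generated here and are not the contents of pgr.txt), with the legend placed by keyword so that no mouse click is needed.

```r
# Simulated stand-in for the pgr data (assumed values, not pgr.txt)
set.seed(4)
hay <- runif(40, 2, 9)
pH  <- runif(40, 3.5, 7)
FR  <- rexp(40, rate = 0.5)

# A logical test on the third variable picks one colour per point
cols <- ifelse(FR > median(FR), "red", "black")
table(cols)

plot(hay, pH, pch = 16, col = cols)
# Labels below each point; round(FR, 2) keeps two decimal places
text(hay, pH, labels = round(FR, 2), pos = 1, offset = 0.5, cex = 0.7)
legend("topright", legend = c("FR>median", "FR<=median"),
       pch = 16, col = c("red", "black"))
```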

[Figure: pH against hay with points coloured red (FR>median) or black (FR<=median), with legend]

For three-dimensional plots see wireframe, contour and image on p. 931.

5.2.6 Joining the dots

Sometimes you want to join the points on a scatterplot by lines. The trick is to ensure that the points on the x axis are ordered: if they are not ordered, the result is a mess, as you will see below.

    smooth <- read.table("c:\\temp\\smoothing.txt",header=T)
    attach(smooth)
    names(smooth)
    [1] "x" "y"

Begin by producing a vector of subscripts representing the ordered values of the explanatory variable. Then draw lines with this vector as subscripts to both the x and y variables:

    plot(x,y,pch=16)
    sequence <- order(x)
    lines(x[sequence],y[sequence])

If you do not order the x values, and just use the lines function, this is what happens:

    plot(x,y,pch=16)
    lines(x,y)
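The reason the ordered version works is that one vector of subscripts is applied to both variables, so each y value stays paired with its own x value. A minimal sketch on simulated data (the values, and the names xs and ys, are assumptions for illustration):

```r
# Unordered x values (simulated; assumed for illustration)
set.seed(5)
x <- runif(15, 0, 10)
y <- 4 + x + rnorm(15)

# order() returns the subscripts that sort x, so the same subscripts
# keep each y value paired with its own x value
sequence <- order(x)
xs <- x[sequence]
ys <- y[sequence]

plot(x, y, pch = 16)
lines(xs, ys)            # joins neighbouring points from left to right
```

Sorting x and y independently with sort would break the pairing, which is why order is used instead.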

[Figure: two panels of y against x; left, points joined in x order; right, points joined in file order, giving a tangle of lines]

There is a plot option type="b" (this stands for ‘both’ points and lines) which draws the points and joins them together with lines. You can choose the plotting symbol (pch) and the line type (lty) to be used.

5.2.7 Plotting stepped lines

When plotting square edges between two points, you need to decide whether to go across and then up, or up and then across. The issue should become clear with an example. We have two vectors from 0 to 10:

    x <- 0:10
    y <- 0:10
    plot(x,y)

There are three ways we can join the dots: with a straight line

    lines(x,y,col="red")

with a stepped line going across first then up, using lower-case ‘s’

    lines(x,y,col="blue",type="s")

or with a stepped green line going up first, then across using upper-case ‘S’ (‘upper case, up first’ is the way to remember it):

    lines(x,y,col="green",type="S")

[Figure: y against x from 0 to 10 with a straight red line, a blue stepped line (across then up) and a green stepped line (up then across)]

5.3 Adding other shapes to a plot

Once you have produced a set of axes using plot it is straightforward to locate and insert other kinds of things. Here are two unlabelled axes, without tick marks (xaxt="n"), both scaled from 0 to 10 but without any of the 11 points drawn on the axes (type="n"):

    plot(0:10,0:10,xlab="",ylab="",xaxt="n",yaxt="n",type="n")

You can easily add extra graphical objects to plots:

• rect: rectangles
• arrows: arrows and headed bars
• polygon: more complicated filled shapes, including objects with curved sides

For the purposes of demonstration we shall add a single-headed arrow, a double-headed arrow, a rectangle and a six-sided polygon to this space.

We want to put a solid square object in the top right-hand corner, and we know the precise coordinates to use. The syntax for the rect function is to provide four numbers:

    rect(xleft, ybottom, xright, ytop)

Thus, to plot the square from (6,6) to (9,9) involves:

    rect(6,6,9,9)

You can fill the shape with solid colour (col) or with shading lines (density, angle) as described on p. 920.

5.3.1 Placing items on a plot with the cursor, using the locator function

You might want to point with the cursor and get R to tell you the coordinates of the corners of the rectangle. You can use the locator() function for this. The rect function does not accept locator as its argument, but you can easily write a function (here called corners) to do this:

    corners <- function() {
    coos <- c(unlist(locator(1)),unlist(locator(1)))
    rect(coos[1],coos[2],coos[3],coos[4])
    }

Run the function like this:

    corners()

Then click in the bottom left-hand corner and again in the top right-hand corner, and a rectangle will be drawn from your screen-supplied pointers.

Drawing arrows is straightforward. The syntax for the arrows function is to draw a line from the point (x0, y0) to the point (x1, y1) with the arrowhead, by default, at the ‘second’ end (x1, y1):

    arrows(x0, y0, x1, y1)

Thus, to draw an arrow from (1,1) to (3,8) with the head at (3,8) type:

    arrows(1,1,3,8)

A horizontal double-headed arrow from (1,9) to (5,9) is produced by adding code=3 like this:

    arrows(1,9,5,9,code=3)

A vertical bar with two square ends (e.g. like an error bar) uses angle = 90 instead of the default angle = 30:

    arrows(4,1,4,6,code=3,angle=90)

Here is a function that draws an arrow from the cursor position of your first click to the position of your second click:

    click.arrows <- function() {
    coos <- c(unlist(locator(1)),unlist(locator(1)))
    arrows(coos[1],coos[2],coos[3],coos[4])
    }

To run this, type

    click.arrows()

then click the cursor on the two ends.

We now wish to draw a polygon. To do this, it is often useful to save the values of a series of locations. Here we intend to save the coordinates of six points in a vector called locations to define a polygon for plotting:

    locations <- locator(6)

After you have clicked over the sixth location, control returns to the screen. What kind of object has locator produced?

    class(locations)
    [1] "list"

It has produced a list, and we can extract the vectors of x and y values from the list using $ to name the elements of the list (R has created the helpful names x and y):

    locations
    $x
    [1] 5.484375 7.027344 9.019531 8.589844 6.792969 5.230469
    $y
    [1] 3.9928797 4.1894975 2.5510155 0.7377620 0.6940691 2.1796262

Now we draw a lavender-coloured polygon like this:

    polygon(locations,col="lavender")

Note that the polygon function has automatically closed the shape, drawing a line from the last point to the first.

5.3.2 Drawing more complex shapes with polygon

The polygon function can be used to draw more complicated shapes, including curved ones. In this example we are asked to shade the area beneath a standard normal curve for values of z that are less than or equal

to -1. First draw the probability density (dnorm) line for the standard normal (mean = 0 and standard deviation = 1):

    z <- seq(-3,3,0.01)
    pd <- dnorm(z)
    plot(z,pd,type="l")

Then fill the area where z ≤ -1 in red:

    polygon(c(z[z<=-1],-1),c(pd[z<=-1],pd[z==-3]),col="red")

[Figure: standard normal density pd against z from -3 to 3, with the area to the left of z = -1 shaded red]

Note the insertion of the point (-1, pd[z == -3]) to create the right-angled corner to the polygon on the z axis at z = -1 and pd set to the same value as when z is -3 to make sure that the bottom line is horizontal.

5.4 Drawing mathematical functions

The curve function is convenient for this. Here is a plot of $x^3 - 3x$ between x = -2 and x = 2:

    curve(x^3-3*x, -2, 2)
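It can help to see what curve does behind the scenes: it evaluates the expression on an equally spaced grid (101 points by default) and invisibly returns the coordinates it drew. The sketch below captures that return value; the names pts and xg are hypothetical.

```r
# Draw the cubic and keep the coordinates that curve() used
pts <- curve(x^3 - 3*x, -2, 2)

length(pts$x)            # number of points evaluated (n = 101 by default)
range(pts$y)             # extremes of the curve on the plotted interval

# The same values computed by hand on the same grid
xg <- seq(-2, 2, length.out = 101)
all.equal(pts$y, xg^3 - 3*xg)
```

For a curve with sharp features, the n argument of curve can be increased to use a finer grid.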

[Figure: the curve x^3-3*x plotted against x from -2 to 2]

Here is the more cumbersome code to do the same thing using plot:

    x <- seq(-2,2,0.01)
    y <- x^3-3*x
    plot(x,y,type="l")

With plot, you need to decide how many segments you want to generate to create the curve (using seq with steps of 0.01 in this example), then calculate the matching y values, then use plot with type="l". This stands for ‘type = line’ (rather than the default points) and can cause problems if you misread the symbol as a number ‘one’ rather than a lower-case letter ‘L’.

5.4.1 Adding smooth parametric curves to a scatterplot

Up to this point our response variable was shown as a scatter of data points. In many cases, however, we want to show the response as a smooth curve. The important tip is that to produce reasonably smooth-looking curves in R you should draw about 100 straight-line sections between the minimum and maximum values of your x axis.

The Ricker curve is named after the famous Canadian fish biologist who introduced this two-parameter hump-shaped model for describing recruitment to a fishery, y, as a function of the density of the parental stock, x. We wish to compare two Ricker curves with the following parameter values:

$$y_A = 482\,x\,e^{-0.045x}, \qquad y_B = 518\,x\,e^{-0.055x}$$

The first decision to be made is the range of x values for the plot. In our case this is easy because we know from the literature that the minimum value of x is 0 and the maximum value of x is 100. Next we need to generate about 100 values of x at which to calculate and plot the smoothed values of y:

    xv <- 0:100

Next, calculate vectors containing the values of $y_A$ and $y_B$ at each of these x values:

    yA <- 482*xv*exp(-0.045*xv)
    yB <- 518*xv*exp(-0.055*xv)

We are now ready to draw the two curves, but we do not know how to scale the y axis. We could find the maximum and minimum values of $y_A$ and $y_B$ then use ylim to specify the extremes of the y axis, but it is more convenient to use the option type="n" to draw the axes without any data, then use lines to add the two smooth functions later. The blank axes are produced like this:

    plot(c(xv,xv),c(yA,yB),xlab="stock",ylab="recruits",type="n")

We want to draw the smooth curve for $y_A$ as a dashed blue line (lty = 2, col = "blue"),

    lines(xv,yA,lty=2,col="blue")

and the curve for $y_B$ as a solid red line (lty = 1, col = "red"),

    lines(xv,yB,lty=1,col="red")

Next, we want to see which (if either) of these lines best describes our field data, by overlaying a scatter of points (as black solid circles, pch = 16) on the smooth curves:

    info <- read.table("c:\\temp\\plotfit.txt",header=T)
    attach(info)
    names(info)
    [1] "x" "y"
    points(x,y,pch=16)

[Figure: recruits (0 to 4000) against stock (0 to 100), showing the dashed blue curve, the solid red curve and the data as black solid circles]

You can see that the blue dashed line is a much better description of our data than is the solid red line. Estimating the parameters of non-linear functions like the Ricker curve from data is explained in Chapter 20.
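The height needed for the y axis can also be worked out directly. For a curve of the form $y = a x e^{-bx}$ the derivative is $a e^{-bx}(1 - bx)$, which is zero at $x = 1/b$, so the peak height is $a/(be)$. The sketch below checks this against the grid of x values; the helper names ricker and peak are hypothetical, not functions from the book.

```r
# Ricker curve: recruits as a function of parental stock
ricker <- function(x, a, b) a * x * exp(-b * x)

xv <- 0:100
yA <- ricker(xv, 482, 0.045)
yB <- ricker(xv, 518, 0.055)

# Analytical peak: dy/dx = a*exp(-b*x)*(1 - b*x) = 0 at x = 1/b
peak <- function(a, b) c(x = 1 / b, y = a / (b * exp(1)))
peak(482, 0.045)
peak(518, 0.055)

# The grid maximum should sit next to the analytical peak
xv[which.max(yA)]
xv[which.max(yB)]

# y-axis limits taken directly from the two curves
ylim <- range(c(yA, yB))
plot(xv, yA, type = "l", lty = 2, col = "blue", ylim = ylim,
     xlab = "stock", ylab = "recruits")
lines(xv, yB, lty = 1, col = "red")
```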

5.4.2 Fitting non-parametric curves through a scatterplot

It is common to want to fit a non-parametric smoothed curve through data, especially when there is no obvious candidate for a parametric function. R offers a range of options:

• lowess (a non-parametric curve fitter);
• loess (a modelling tool);
• gam (fits generalized additive models; p. 666);
• lm for polynomial regression (fit a linear model involving powers of x).

We will illustrate each of these options using the jaws data. First, we load the data:

    data <- read.table("c:\\temp\\jaws.txt",header=T)
    attach(data)
    names(data)
    [1] "age" "bone"

Before we fit our various curves to the data, we need to consider how best to display the results together. Without doubt, the graphical parameter you will change most often just happens to be the least intuitive to use. This is the number of graphs per screen, called somewhat unhelpfully, mfrow. This stands for ‘multiple frames by rows’. The idea is simple, but the syntax is hard to remember. You need to specify the number of rows of plots you want, and number of plots per row, in a vector of two numbers. The first number is the number of rows and the second number is the number of graphs per row. The vector is made using concatenate c in the normal way. The default single-plot screen is par(mfrow=c(1,1)). Two plots side by side is par(mfrow=c(1,2)) and a panel of four plots in a 2 × 2 square is par(mfrow=c(2,2)).

To move from one plot to the next, you need to execute a new plot function. Control stays within the same plot frame while you execute functions like points, lines or text. Remember to return to the default single plot when you have finished your multiple plot by executing par(mfrow=c(1,1)). Multi-panel layouts also shrink the text: in a layout with exactly two rows and two columns the base character expansion cex is reduced by a factor of 0.83, and with three or more rows or columns it is reduced by a factor of 0.66, so you get smaller characters and labels.

    par(mfrow=c(2,2))

Let us now plot our four graphs with different smooth functions fitted through the jaws data. First, the simple non-parametric smoother called lowess. You provide the lowess function with arguments for the explanatory variable and the response variable, then provide this object as an argument to the function lines like this:

    plot(age,bone,pch=16,main="lowess")
    lines(lowess(age,bone),col="red")

It is a reasonable fit overall, but a poor descriptor of the jaw size for the lowest five ages. Let us try loess, which is a model-fitting function. We use the fitted model to predict the jaw sizes:

    plot(age,bone,pch=16,main="loess")
    model <- loess(bone~age)
    xv <- 0:50
    yv <- predict(model,data.frame(age=xv))
    lines(xv,yv,col="red")

This is much better at describing the jaw size of the youngest animals, but shows a slight decrease for the oldest animals which might not be realistic. Next, we use a generalized additive model (gam, from the library mgcv) to fit bone as s(age), a smooth function of age:

    library(mgcv)
    plot(age,bone,pch=16,main="gam")
    model <- gam(bone~s(age))
    xv <- 0:50
    yv <- predict(model,list(age=xv))
    lines(xv,yv,col="red")

The line is almost indistinguishable from the line produced by loess. Finally, a polynomial:

    plot(age,bone,pch=16,main="cubic polynomial")
    model <- lm(bone~age+I(age^2)+I(age^3))
    xv <- 0:50
    yv <- predict(model,list(age=xv))
    lines(xv,yv,col="red")

As so often with polynomials, the line is more curvaceous than we really want. Note the use of capital I (the ‘as is’ function) in front of the quadratic and cubic terms. The fit is good for young animals, but is rather wavy where we might expect to see an asymptote. It tips up at the end, whereas the last two smoothers tipped down.

[Figure: four panels of bone against age, titled lowess, loess, gam and cubic polynomial, each with its fitted curve in red]
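The fit, predict, lines pattern used for each smoother can be tried without the jaws file or the mgcv package. This sketch uses simulated asymptotic data (the growth parameters, noise level and the names fit.loess and fit.poly are assumptions) and only the smoothers that ship with base R. The prediction grid is kept inside the observed range of age, since loess does not extrapolate by default.

```r
# Simulated asymptotic growth data (assumed; not jaws.txt)
set.seed(6)
age  <- runif(60, 0, 50)
bone <- 120 * (1 - exp(-0.1 * age)) + rnorm(60, sd = 8)
xv   <- seq(min(age), max(age), length.out = 100)

par(mfrow = c(1, 3))

plot(age, bone, pch = 16, main = "lowess")
lines(lowess(age, bone), col = "red")

plot(age, bone, pch = 16, main = "loess")
fit.loess <- loess(bone ~ age)
lines(xv, predict(fit.loess, data.frame(age = xv)), col = "red")

plot(age, bone, pch = 16, main = "cubic polynomial")
fit.poly <- lm(bone ~ age + I(age^2) + I(age^3))
lines(xv, predict(fit.poly, list(age = xv)), col = "red")

par(mfrow = c(1, 1))     # restore the single-plot layout
```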

Because it is a built-in function and does not require any external packages to be loaded, my recommendation is for loess (top right); it is a reasonable fit, and is not over-curvaceous. You fit a model, then use predict with a specified vector of values for the explanatory variable, then draw the curve using lines.

5.5 Shape and size of the graphics window

The default graphics window is a square, measuring 7 inches by 7 inches (I know it should be metric, but it is not). This is fine for most purposes, but it needs to be changed if you want to put two graphs side by side, using par(mfrow=c(1,2)).

    data <- read.table("c:\\temp\\pollute.txt",header=T)
    attach(data)

If you use the default window the graphs will come out looking far too narrow:

    par(mfrow=c(1,2))
    plot(Population,Pollution)
    plot(Temp,Pollution)

[Figure: two tall, narrow panels; Pollution against Population (left) and Pollution against Temp (right)]

The simplest solution is to use the mouse to drag up the base of the graphics window until you obtain a more pleasing shape. Alternatively, you can invoke the windows function, specifying first the width then

the height in inches. The best choice for this case is (7,4):

    windows(7,4)
    par(mfrow=c(1,2))
    plot(Population,Pollution)
    plot(Temp,Pollution)

[Figure: the same two panels in a 7 × 4 inch window; Pollution against Population (left) and Pollution against Temp (right)]

5.6 Plotting with a categorical explanatory variable

When the explanatory variable is categorical rather than continuous, we cannot produce a scatterplot. Instead, we choose between a barplot and a boxplot. I prefer box-and-whisker plots because they convey so much more information, and this is the default plot in R with a categorical explanatory variable.

Categorical variables are factors with two or more levels (see p. 20). Our first example uses the factor called month (with levels 1 to 12) to investigate weather patterns at Silwood Park:

    weather <- read.table("c:\\temp\\SilwoodWeather.txt",header=T)
    attach(weather)
    names(weather)
    [1] "upper" "lower" "rain" "month" "yr"

There is one bit of housekeeping we need to do before we can plot the data. We need to declare month to be a factor. At the moment, R thinks it is just a number:

    month <- factor(month)

Now we can plot using a categorical explanatory variable (month) and, because the first variable is a factor, we get a boxplot rather than a scatterplot:

    plot(month,upper)

Note that there are no axis labels in the default box-and-whisker plot, and to get informative labels we should need to type:

    plot(month,upper,ylab="daily maximum temperature",xlab="month")

[Figure: box-and-whisker plots of daily maximum temperature (0 to 30) for months 1 to 12]

The boxplot summarizes a great deal of information very clearly. The horizontal line shows the median upper daily temperature for each month. The bottom and top of the box show the 25th and 75th percentiles, respectively (i.e. the location of the middle 50% of the data, also called the first and third quartiles). The vertical dashed lines are called the ‘whiskers’. For the upper whisker, we see one of two things: either the maximum value or, when there are outliers present, the largest data point that is less than 1.5 times the interquartile range above the 75th percentile. The quantity ‘1.5 times the interquartile range of the data’ is roughly 2 standard deviations, and the interquartile range is the difference in the response variable between its first and third quartiles. Points more than 1.5 times the interquartile range above the third quartile and points more than 1.5 times the interquartile range below the first quartile are defined as outliers and plotted individually. Thus, when there are no outliers the whiskers simply show the maximum and minimum values (as shown here only in month 12). Boxplots not only show the location and spread of data but also indicate skewness (which shows up as asymmetry in the sizes of the upper and lower parts of the box). For example, in February the range of lower temperatures was much greater than the range of higher temperatures. Boxplots are also excellent for spotting errors in the data when the errors are represented by extreme outliers.

Note that the box-and-whisker plot is based entirely on the data points themselves; there are no estimated parameters like means or standard deviations. The whiskers always end at data points, so the upper and lower whiskers are typically asymmetric, even when there are outliers both above and below (e.g. in November).
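The whisker and outlier rules described above can be reproduced by hand and compared with boxplot.stats, the base function that computes the numbers a boxplot draws. This is a sketch on simulated data with one deliberately extreme value (an assumption for illustration). It uses fivenum because boxplot.stats works with Tukey's hinges, which can differ slightly from the quartiles returned by quantile.

```r
# Sample with one extreme value (simulated; assumed for illustration)
set.seed(7)
y <- c(rnorm(50, mean = 15, sd = 3), 40)

h   <- fivenum(y)        # minimum, lower hinge, median, upper hinge, maximum
iqr <- h[4] - h[2]

# Upper whisker: largest data point not more than 1.5 * IQR above the box
upper.fence   <- h[4] + 1.5 * iqr
upper.whisker <- max(y[y <= upper.fence])

# Outliers: points beyond 1.5 * IQR from either end of the box
is.outlier <- y > upper.fence | y < h[2] - 1.5 * iqr

b <- boxplot.stats(y)
b$stats                  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out                    # the points a boxplot would draw individually
upper.whisker
y[is.outlier]
```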
5.6.1 Boxplots with notches to indicate significant differences

Boxplots are very good at showing the distribution of the data points around the median, but they are not so good at indicating whether or not the median values are significantly different from one another. Tukey invented notches to get the best of both worlds. The notches are drawn as a ‘waist’ on either side of the median and are intended to give a rough impression of the significance of the differences between two medians. Boxes in which the notches do not overlap are likely to prove to have significantly different medians under an appropriate test. Boxes with overlapping notches probably do not have significantly different medians. The

size of the notch increases with the magnitude of the interquartile range and declines with the square root of the replication, like this:

$$\text{notch} = \pm 1.58\,\frac{\text{IQR}}{\sqrt{n}},$$

where IQR is the interquartile range and n is the replication per sample. Notches are based on assumptions of asymptotic normality of the median and roughly equal sample sizes for the two medians being compared, and are said to be rather insensitive to the underlying distributions of the samples. The idea is to give roughly a 95% confidence interval for the difference in two medians, but the theory behind this is somewhat vague.

Here are the Silwood Weather data (above) with the option notch=TRUE:

[Figure: notched box-and-whisker plots of daily maximum temperature (0 to 30) for months 1 to 12]

There is no significant difference in daily maximum temperature between July and August (the notches for months 7 and 8 overlap completely), but maxima in September are significantly lower than in August. If the boxes do not overlap (e.g. months 9 and 10) then the difference in their medians will be highly significant under the appropriate test.

When the sample sizes are small and/or the within-sample variance is high, the notches are not drawn as you might expect them (i.e. as a waist within the box). Instead, the notches are extended above the 75th percentile and/or below the 25th percentile. This looks odd, but it is an intentional feature, supposed to act as a warning of the likely invalidity of the test (see p. 217).

5.6.2 Barplots with error bars

Rather than use plot to produce a boxplot, an alternative is to use a barplot to show the heights of the mean values from the different treatments. We need to begin by calculating the heights of the bars, typically by using the function tapply to work out the mean values for each level of the categorical explanatory

variable. Data for this example come from an experiment on plant competition, with five factor levels in a single categorical variable called clipping: a control (unclipped), two root clipping treatments (r5 and r10) and two shoot clipping treatments (n25 and n50) in which the leaves of neighbouring plants were reduced by 25% and 50%. The response variable is yield at maturity (a dry weight) called biomass.

    trial <- read.table("c:\\temp\\compexpt.txt",header=T)
    attach(trial)
    names(trial)
    [1] "biomass" "clipping"

First, calculate the heights of the bars using tapply to compute the five mean values:

    means <- tapply(biomass,clipping,mean)

Then the barplot is produced very simply:

    barplot(means,xlab="treatment",ylab="mean yield",col="green")

[Figure: barplot of mean yield (y axis, 0 to 600) for the treatments control, n25, n50, r10 and r5 (x axis)]

Unless we add error bars to such a barplot, the graphic gives no indication of the extent of the uncertainty associated with each of the estimated treatment means, and hence is unsuitable for publication. There is no built-in function for drawing error bars on barplots, but it is easy to write a function to do this. One obvious issue is that the y axis as drawn by the previous call to barplot is likely to be too short to accommodate the error bar extending from the top of the tallest bar. Another issue is that it is not obvious where to centre each of the error bars (i.e. the x coordinates of the middles of the bars).

The next decision to make is what kind of bar to draw. Many journals prefer plus or minus one standard error of the mean. An old-fashioned approach is to use plus or minus the 95% confidence interval of the mean. Perhaps the most informative error bar is plus or minus one half of the least significant difference between two means (because then non-overlapping bars indicate a significant difference, and overlapping bars indicate non-significance; see p. 515).
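The three candidate bar lengths can be compared side by side. The following is a minimal sketch, assuming a balanced one-way design and using R's built-in PlantGrowth data as a stand-in for the competition experiment; the names s, n, se, ci and lsd are illustrative and do not come from the text:

```r
# Stand-in data: PlantGrowth (built in), 3 groups of 10 plants each
model <- lm(weight ~ group, data = PlantGrowth)
s  <- summary(model)$sigma              # pooled residual standard deviation
df <- df.residual(model)                # error degrees of freedom
n  <- nrow(PlantGrowth) / nlevels(PlantGrowth$group)  # replication per level

se  <- s / sqrt(n)                      # one standard error of the mean
ci  <- qt(0.975, df) * se               # half-width of the 95% confidence interval
lsd <- qt(0.975, df) * s * sqrt(2 / n)  # least significant difference

c(se = se, ci = ci, half.lsd = lsd / 2)
```

In a balanced design like this one, half the LSD is shorter than the 95% confidence half-width by a factor of the square root of 2, and both are longer than one standard error.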
On the assumption that you want to publish your work in Science or Nature , we shall use plus or minus one standard error of the mean, because this is their error bar of choice. First, work out the error variance from the ANOVA table of lm(y~x) where x is categorical. Now calculate

the replication per factor level, and use this to compute sem, the standard error of the mean. Work out the mean values that will be represented by the heights of the bars using tapply. To scale the top of the y axis, add the standard error to the largest of the means. Determine the labels for the bars from the factor levels of the explanatory variable using nn <- as.character(levels(x)). Find the locations of the centres of the bars along the x axis using xs <- barplot. Here is the function in full:

    seBars <- function(x,y) {
      model <- lm(y~factor(x))
      # replication per factor level (assumes a balanced design and that x is a factor)
      reps <- length(y)/length(levels(x))
      # standard error of a mean, from the pooled residual standard deviation
      sem <- summary(model)$sigma/sqrt(reps)
      m <- as.vector(tapply(y,x,mean))
      upper <- max(m)+sem
      nn <- as.character(levels(x))
      # barplot returns the x coordinates of the bar centres
      xs <- barplot(m,ylim=c(0,upper),names=nn,
        ylab=deparse(substitute(y)),xlab=deparse(substitute(x)))
      for (i in 1:length(xs)) {
        arrows(xs[i],m[i]+sem,xs[i],m[i]-sem,angle=90,code=3,length=0.1)
      }
    }

You run it like this, specifying the categorical variable first, then the continuous response variable:

    seBars(clipping,biomass)

[Figure: barplot of mean biomass (y axis, 0 to 600) with standard error bars for each level of clipping (x axis)]

For comparison, here are the box-and-whisker plots for the same data, without and with notches:

    windows(7,4)
    par(mfrow=c(1,2))
    plot(clipping,biomass)
    plot(clipping,biomass,notch=T)

[Figure: box-and-whisker plots of biomass against clipping, without notches (left) and with notches (right)]

The right-hand plot illustrates the curious behaviour of the notches when the sample sizes are small.

5.6.3 Plots for multiple comparisons

When there are many levels of a categorical explanatory variable, we need to be cautious about the statistical issues involved with multiple comparisons (see p. 531). Here we contrast two graphical techniques for displaying multiple comparisons: boxplots with notches, and Tukey’s ‘honest significant difference’. The data show the response of yield to a categorical variable (fact) with eight levels representing eight different genotypes of seed (cultivars) used in the trial:

    data <- read.table("c:\\temp\\box.txt",header=T)
    attach(data)
    names(data)
    [1] "fact" "response"

    plot(response~factor(fact),notch=TRUE)

[Figure: notched boxplots of response (y axis) against factor(fact) (x axis, levels 1 to 8)]
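As a side check on what the notches in such a plot represent, the notch limits can be computed by hand and compared with those that R calculates. This is a sketch on simulated data (not the box.txt data). Note that boxplot.stats measures spread as the distance between the hinges, which is close to, but not always identical to, the interquartile range:

```r
# Simulated sample (an assumption for illustration only)
set.seed(1)
x <- rnorm(20, mean = 10, sd = 3)

bs <- boxplot.stats(x)   # stats = lower whisker, lower hinge, median, upper hinge, upper whisker
hinge.spread <- bs$stats[4] - bs$stats[2]

# median plus or minus 1.58 * spread / sqrt(n)
by.hand <- bs$stats[3] + c(-1.58, 1.58) * hinge.spread / sqrt(bs$n)

by.hand    # notch limits computed from the formula
bs$conf    # notch limits as reported by boxplot.stats
```

Two boxes whose intervals from bs$conf do not overlap are the ones whose notches fail to overlap on the plot.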

Because the genotypes (factor levels) are unordered, it is hard to judge from the plot which levels might be significantly different from which others. We start, therefore, by calculating an index which will rank the mean values of response across the different factor levels:

    index <- order(tapply(response,fact,mean))
    ordered <- factor(rep(index,rep(20,8)))
    boxplot(response~ordered,notch=T,names=as.character(index),
      xlab="ranked treatments",ylab="response")

[Figure: notched boxplots of response (y axis) against ranked treatments (x axis), ordered from cultivar 6 on the left to cultivar 5 on the right]

There are several points to clarify here. We plot the response as a function of the factor called ordered (rather than fact) so that the boxes are ranked from lowest mean yield on the left (cultivar 6) to greatest mean on the right (cultivar 5). We change the names of the boxes to reflect the values of index (i.e. the original values of fact: otherwise they would read 1 to 8). Note that the vector called index is of length 8 (the number of boxes on the plot), but ordered is of length 160 (the number of values of response). Looking at the notches, no two adjacent pairs of medians appear to be significantly different, but the median of treatment 4 appears to be significantly greater than the median of treatment 6, and the median of treatment 5 appears to be significantly greater than the median of treatment 8 (but only just).

The statistical analysis of these data might involve user-specified contrasts (p. 434), once it is established that there are significant differences to be explained. This we assess with a one-way analysis of variance to test the hypothesis that at least one of the means is significantly different from the others (see p. 501):

    model <- aov(response~factor(fact))
    summary(model)
                  Df Sum Sq Mean Sq F value Pr(>F)
    factor(fact)   7  925.7  132.24   17.48 <2e-16 ***
    Residuals    152 1150.1    7.57

Indeed, there is compelling evidence (p < 0.0001) for accepting that there are significant differences between the mean yields of the eight different crop cultivars.

Alternatively, if you want to do multiple comparisons, then because there is no a priori way of specifying contrasts between the eight treatments, you might use Tukey’s honest significant difference (see p. 531):

    plot(TukeyHSD(model))

[Figure: TukeyHSD plot titled ‘95% family-wise confidence level’; x axis: differences in mean levels of factor(fact); y axis: pairwise comparisons from 2–1 to 8–7, only some of which are labelled]

Comparisons having intervals that do not overlap the vertical dashed line are significantly different. The vertical dashed line indicates no difference between the mean values for the factor-level comparisons indicated on the y axis. Thus, we can say that the contrast between cultivars 8 and 7 (8–7) falls just short of significance (despite the fact that their notches do not overlap; see above), but the comparisons 7–6 and 8–6 are both significant (their boxes do not overlap, let alone their notches). The missing comparison labels on the y axis of the HSD plot have to be inferred from a knowledge of the number of factor levels (8 in this example). So, since 8 vs. 7 is labelled, the next one up must be 8–6 and the one above that is 7–6, then we find 8–5 labelled, so it must be 7–5 above that and 6–5 above that, then 8–4 labelled, and so on.

5.6.4 Using colour palettes with categorical explanatory variables

You can create a vector of colours from a palette, then refer to the colours by their subscripts within the palette. The key is to create the right number of colours for your needs. Here, we use the built-in heat.colors to shade the temperature bars in Silwood Weather. We want the colours to grade from cold to hot then back

to cold again from January to December:

    data <- read.table("c:\\temp\\silwoodweather.txt",header=T)
    attach(data)
    month <- factor(month)
    season <- heat.colors(12)
    temp <- c(11,10,8,5,3,1,2,3,5,8,10,11)
    plot(month,upper,col=season[temp])

[Figure: boxplots of daily maximum temperature by month (1 to 12), with boxes shaded from cold to hot colours and back again]

Colouring the other parts of the box-and-whisker plot is explained on p. 918.

5.7 Plots for single samples

When we have just one variable, the choice of plots is more restricted:

• hist(y) histograms to show a frequency distribution
• plot(y) index plots to show the values of y in sequence
• plot.ts(y) time series plots
• pie(x) compositional plots like pie diagrams

5.7.1 Histograms and bar charts

A common mistake among beginners is to confuse histograms and bar charts. Histograms have the response variable on the x axis, and the y axis shows the frequency (or the probability density) of different values of the response. In contrast, a bar chart has the response variable on the y axis and a categorical explanatory variable on the x axis.

Let us look at an example: the response variable is the growth rate of daphnia in different water qualities; there are four different detergents and three different clones of daphnia.

    data<-read.table("c:\\temp\\daphnia.txt",header=T)
    attach(data)
    names(data)
    [1] "Growth.rate" "Water" "Detergent" "Daphnia"

The histogram shows the frequency with which each growth rate was observed over the experiment as a whole. There are many different bar charts we could draw: here are the mean growth rates cross-classified by clone and detergent:

    par(mfrow=c(1,2))
    hist(Growth.rate,seq(0,8,0.5),col="green",main="")
    y <- as.vector(tapply(Growth.rate,list(Daphnia,Detergent),mean))
    barplot(y,col="green",ylab="Growth rate",xlab="Treatment")

[Figure: histogram of Growth.rate with Frequency on the y axis (left) and barplot of mean Growth rate for each Treatment (right)]

There is a superficial similarity between the two plots in that both have numerous green vertical bars. But there the similarity ends. The histogram on the left has Growth.rate on the x axis, but the bar plot on the right has Growth.rate on the y axis. The y axis on the histogram shows the count (frequency) of the number of times that values from a given interval of growth rates were observed in the whole experiment. The y axis on the bar plot shows the arithmetic mean growth rate for that particular experimental treatment. There is no need to labour the point, but you must be absolutely sure that you understand the difference between a histogram and a bar plot, and try not to refer to a bar chart as a histogram or vice versa.

5.7.2 Histograms

The divisions of the x axis into which the values of the response variable are distributed and then counted are called bins. Histograms are profoundly tricky, because what you see depends on the subjective judgements of where exactly to put the bin margins. Wide bins produce one picture, narrow bins produce a different picture, unequal bins produce confusion.
    par(mfrow=c(2,2))
    hist(Growth.rate,seq(0,8,0.25),col="green",main="")
    hist(Growth.rate,seq(0,8,0.5),col="green",main="")
    hist(Growth.rate,seq(0,8,2),col="green",main="")
    hist(Growth.rate,c(0,3,4,8),col="green",main="")

[Figure: four histograms of Growth.rate with different bins; the y axis is Frequency in the top left, top right and bottom left panels, and Density in the bottom right panel]

The bins are 0.25 units wide in the top left-hand histogram, 0.5 wide in the top right, 2.0 wide in the bottom left, and there are three different widths (3, 1, then 4) in the bottom right. The narrower the bins, the lower the peak frequencies (note that the y scale changes: 7, 12, 40). Small bins produce multimodality (top left), broad bins unimodality (bottom right). When there are different bin widths (bottom right), the default in R is for hist to convert the counts (frequencies) into densities (so that the total green area is 1.0).

The convention adopted in R for showing bin boundaries is to employ square and round brackets, so that [a,b) means ‘greater than or equal to a but less than b’ [square then round), and (a,b] means ‘greater than a but less than or equal to b’ (round then square]. The point is that it must be unequivocal which bin gets a given number when that number falls exactly on a boundary between two bins. You need to take care that the bins can accommodate both your minimum and maximum values.

The function cut takes a continuous vector and cuts it up into bins which you can then use for counting. To show how it works, we shall use cut with the daphnia data to produce the density distribution shown above in the bottom right. First, we create a vector of bin edges. To do this, we need to know the range of the growth rates:

    range(Growth.rate)
    [1] 1.761603 6.918344

So a lower bound of 0 and an upper bound of 8 will encompass all of the data. We want edges at 3 and 4, so the vector of bin edges is:

    edges <- c(0,3,4,8)

The next bit is what can seem confusing at first. We create a new vector called bin which contains the names of the bins (the factor levels) into which each value of growth rate will be placed. Obviously, this new vector

is the same length as Growth.rate. It is a factor with as many levels as there are bins (three in this case). The names of the factor levels indicate the bin margins and the edge convention, indicated by round and square brackets, (0,3] in this default case:

    bin <- cut(Growth.rate,edges)
    bin
     [1] (0,3] (0,3] (3,4] (0,3] (3,4] (4,8] (4,8] (3,4] (4,8] (0,3] (3,4]
    [12] (0,3] (3,4] (4,8] (4,8] (4,8] (4,8] (4,8] (0,3] (3,4] (3,4] (3,4]
    [23] (3,4] (3,4] (3,4] (4,8] (4,8] (0,3] (0,3] (3,4] (3,4] (4,8] (4,8]
    [34] (0,3] (3,4] (4,8] (0,3] (0,3] (3,4] (3,4] (3,4] (4,8] (4,8] (4,8]
    [45] (4,8] (3,4] (0,3] (3,4] (4,8] (4,8] (4,8] (3,4] (4,8] (4,8] (0,3]
    [56] (3,4] (0,3] (4,8] (4,8] (4,8] (0,3] (3,4] (4,8] (0,3] (0,3] (0,3]
    [67] (4,8] (4,8] (4,8] (0,3] (0,3] (0,3]
    Levels: (0,3] (3,4] (4,8]

    is.factor(bin)
    [1] TRUE

As you can see, the default of the cut function is to produce bins with the round bracket on the left and the square bracket on the right: (0,3] (3,4] and (4,8]. This is the option right = TRUE (the right-hand value will be included in the bin (square bracket), and the left-hand value will appear in the next bin to the left, if one exists). If you want to include the left-hand value in the bin and exclude the right-hand value (as you might with a mapping study), then you need to specify the option right = FALSE in the cut function (see the example on p. 842).

Counting the number of cases in each bin could not be simpler:

    table(bin)
    bin
    (0,3] (3,4] (4,8]
       21    22    29

To get the heights of the bars for the density plot we need to allow for the areas of the rectangles. First, the total of the counts,

    sum(table(bin))
    [1] 72

and the relative widths of the bins,

    diff(edges)
    [1] 3 1 4

Dividing the proportion of cases in each bin by the width of that bin gives:

    (table(bin)/sum(table(bin)))/diff(edges)
    bin
         (0,3]      (3,4]      (4,8]
    0.09722222 0.30555556 0.10069444

These are the heights of the three bars in the density plot (bottom right, above). They do not add to 1 because the bars are of different widths. It is the total area of the three bars that is 1 under this convention.

5.7.3 Histograms of integers

Histograms are excellent for showing the mode, the spread and the symmetry (skew) of a set of data, but the R function hist is deceptively simple. Here is a histogram of 1000 random integers drawn from a Poisson distribution with a mean of 1.7. With the default ‘pretty’ scaling to produce eight bars, the histogram produces a graphic that does not clearly distinguish between the zeros and the ones:

    values <- rpois(1000,1.70)
    hist(values,main="",xlab="random numbers from a Poisson with mean 1.7")

[Figure: histogram with default breaks; y axis: Frequency (0 to 500); x axis: random numbers from a Poisson with mean 1.7 (0 to 8)]

With low-value integer data like this, it is much better to specify the bins explicitly, using the breaks argument. The most sensible breaks for count data are –0.5 to +0.5 to capture the zeros, 0.5 to 1.5 to capture the 1s, and so on; breaks=(-0.5:8.5) generates such a sequence automatically. Now the histogram makes clear that 1s are roughly twice as frequent as zeros:

    hist(values,breaks=(-0.5:8.5),main="",
      xlab="random numbers from a Poisson with mean 1.7")

[Figure: histogram with one bin per integer; y axis: Frequency (0 to 300); x axis: random numbers from a Poisson with mean 1.7 (0 to 8)]

That’s more like it. Now we can see that the mode is 1 (not 0), and that 2s are substantially more frequent than 0s. The distribution is said to be ‘skewed to the right’ (or ‘positively skewed’) because the long tail is on the right-hand side of the histogram.

5.7.4 Overlaying histograms with smooth density functions

If it is in any way important, then you should always specify the break points yourself. Unless you do this, the hist function may not take your advice about the number of bars or the width of bars. For small-integer data (less than 20, say), the best plan is to have one bin for each value. You create the breaks by starting at –0.5 to accommodate the zeros and going up to max(y) + 0.5 to accommodate the biggest count. Here are 158 random integers from a negative binomial distribution with μ = 1.5 and k = 1.0:

    y <- rnbinom(158,mu=1.5,size=1)
    bks <- -0.5:(max(y)+0.5)
    hist(y,bks,main="")

To get the best fit of a density function for this histogram we should estimate the parameters of our particular sample of negative binomially distributed counts:

    mean(y)
    [1] 1.772152
    var(y)
    [1] 4.228009
    mean(y)^2/(var(y)-mean(y))
    [1] 1.278789
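The last line is a method-of-moments estimate of k: for a negative binomial distribution the variance is μ + μ²/k, so rearranging gives k = μ²/(variance − μ). A small sketch of the calculation follows, in which the helper name nb.k is hypothetical:

```r
# Method-of-moments estimate of the negative binomial k (size) parameter.
# nb.k is a hypothetical helper name, not a function from any package.
nb.k <- function(m, v) {
  if (v <= m) return(NA_real_)   # no overdispersion: the estimate is undefined
  m^2 / (v - m)
}

nb.k(1.772152, 4.228009)         # the sample mean and variance quoted above

# Applied to a fresh simulated sample (values differ from run to run)
y.new <- rnbinom(158, mu = 1.5, size = 1)
nb.k(mean(y.new), var(y.new))
```

The guard for v <= m matters with real data: a sample whose variance does not exceed its mean gives a meaningless negative or infinite value of k.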

In R, the parameter k of the negative binomial distribution is known as size and the mean is known as mu. We want to generate the probability density for each count between 0 and 11, for which the R function is dnbinom:

    xs <- 0:11
    ys <- dnbinom(xs,size=1.2788,mu=1.772)
    lines(xs,ys*158)

[Figure: histogram of y (Frequency from 0 to 60) with the fitted negative binomial frequencies overlaid as a line]

Not surprisingly, since we generated the data, the negative binomial distribution is a very good description of the frequency distribution. The frequency of 1s is a bit low and of 0s is a bit high, but the other frequencies are very well described.

5.7.5 Density estimation for continuous variables

The problems associated with drawing histograms of continuous variables are much more challenging. The subject of density estimation is an important issue for statisticians, and whole books have been written about it (Silverman, 1986; Scott, 1992). You can get a feel for what is involved by browsing the ?density help window. The algorithm used in density.default disperses the mass of the empirical distribution function over a regular grid of at least 512 points, uses the fast Fourier transform to convolve this approximation with a discretized version of the kernel, and then uses linear approximation to evaluate the density at the specified points. The choice of bandwidth is a compromise between smoothing enough to rub out insignificant bumps, and smoothing too much so that real peaks are eliminated. The rule of thumb for bandwidth is

$$b = \frac{\max(x) - \min(x)}{2(1 + \log_2 n)}$$

(where n is the number of data points; for details see Venables and Ripley, 2002). We can compare hist with Venables and Ripley’s truehist for the Old Faithful eruptions data:

    library(MASS)
    attach(faithful)

The rule of thumb for bandwidth gives:

    (max(eruptions)-min(eruptions))/(2*(1+log(length(eruptions),base=2)))
    [1] 0.192573

but this produces much too bumpy a fit. A bandwidth of 0.6 looks much better:

    windows(7,4)
    par(mfrow=c(1,2))
    hist(eruptions,15,freq=FALSE,main="",col=27)
    lines(density(eruptions,width=0.6,n=200))
    truehist(eruptions,nbins=15,col=27)
    lines(density(eruptions,n=200))

[Figure: density histograms of eruptions drawn with hist (left) and truehist (right), each with a density curve overlaid; y axis: Density (0.0 to 0.6); x axis: eruptions (1.5 to 4.5)]

Note that although we asked for 15 bins, we actually got 18. Note also, that although both histograms have 18 bins, they differ substantially in the heights of several of the bars. The left hist has two peaks above density = 0.5 while truehist on the right has three. There is a sub-peak in the trough of hist at about 3.5 but not of truehist. And so on. Such are the problems with histograms. Note, also, that the default probability density curve (on the right) picks out the heights of the peaks and troughs much less well than our bandwidth of 0.6 (on the left).

5.7.6 Index plots

The other plot that is useful for single samples is the index plot. Here, plot takes a single argument which is a continuous variable and plots the values on the y axis, with the x coordinate determined by the position of the number in the vector (its ‘index’, which is 1 for the first number, 2 for the second, and so on up to length(y) for the last value). This kind of plot is especially useful for error checking. Here is a data set that has not yet been quality checked, with an index plot of response$y:

    response <- read.table("c:\\temp\\das.txt",header=T)
    plot(response$y)

[Figure: index plot of response$y (y axis, 5 to 20) against Index (x axis, 0 to 100), with one point far above the rest]

The error stands out like a sore thumb. We should check whether this might have been a data entry error, such as a decimal point in the wrong place. But which value is it, precisely, that is wrong? What is clear is that it is the only point for which y > 15, so we can use the which function to find out its index (the subscript within y):

    which(response$y > 15)
    [1] 50

We can then use this value as the subscript to see the precise value of the erroneous y:

    response$y[50]
    [1] 21.79386

Having checked in the lab notebook, it is obvious that this number should be 2.179 rather than 21.79, so we replace the 50th value of y with the correct value:

    response$y[50] <- 2.179386

Now we can repeat the index plot to see if there are any other obvious mistakes:

    plot(response$y)

That’s more like it.

5.7.7 Time series plots

When a time series is complete, the time series plot is straightforward, because it just amounts to joining the dots in an ordered set of y values. The issues arise when there are missing values in the time series, particularly groups of missing values for which periods we typically know nothing about the behaviour of the time series.
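The effect of a group of missing values can be seen by knocking a hole in a complete series. This is a sketch only: the built-in ldeaths series is complete, so the six-month gap is introduced artificially for illustration:

```r
x <- ldeaths            # built-in monthly time series (datasets package)
x[25:30] <- NA          # artificial gap of six months (an assumption for illustration)

plot(x, ylab = "deaths")   # the line is broken across the run of NAs

gaps <- which(is.na(x))    # positions of the missing values within the series
time(x)[gaps]              # the corresponding times
```

R does not join the points on either side of the gap, which is usually the honest choice: drawing a line across the gap would suggest knowledge of the series that we do not have.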

There are two functions in R for plotting time series data: ts.plot and plot.ts. Here is ts.plot in action, producing three time series on the same axes using different line types:

    data(UKLungDeaths)
    ts.plot(ldeaths, mdeaths, fdeaths, xlab="year", ylab="deaths", lty=c(1:3))

[Figure: three time series of deaths (y axis, 500 to 4000) against year (x axis, 1974 to 1980)]

The upper, solid line shows total deaths, the heavier dashed line shows male deaths and the faint dotted line shows female deaths. The difference between the sexes is clear, as is the pronounced seasonality, with deaths peaking in midwinter.

The alternative function plot.ts works for plotting objects inheriting from class=ts (rather than simple vectors of numbers in the case of ts.plot).

    data(sunspots)
    plot(sunspots)

[Figure: time series plot of sunspots (y axis, 0 to 250) against Time (x axis, 1750 to 1950)]

The simple statement plot(sunspots) works because sunspots inherits from the time series class, and has the dates for plotting on the x axis built into the object:

    class(sunspots)
    [1] "ts"
    is.ts(sunspots)
    [1] TRUE
    str(sunspots)
    Time-Series [1:2820] from 1749 to 1984: 58 62.6 70 55.7 85 83.5 94.8 ...

5.7.8 Pie charts

Statisticians do not like pie charts because they think that people should know what 50% looks like. Pie charts, however, can sometimes be useful to illustrate the proportional make-up of a sample in presentations. The function pie takes a vector of numbers, turns them into proportions, and divides up the circle on the basis of those proportions. It is essential to use a label to indicate which pie segment is which. The label is provided as a vector of character strings, here called data$names.

Because there are blank spaces in some of the names (‘oil shales’ and ‘methyl clathrates’) we cannot use read.table with a tab-delimited text file to enter the data. Instead, we save the file called piedata as a comma-delimited file, with a ‘.csv’ extension,

and input the data to R using read.csv in place of read.table, like this:

    data <- read.csv("c:\\temp\\piedata.csv")
    data
                  names amounts
    1              coal       4
    2               oil       2
    3               gas       1
    4        oil shales       3
    5 methyl clathrates       6

The pie chart is created like this:

    pie(data$amounts,labels=as.character(data$names))

[Figure: pie chart with segments labelled coal, oil, gas, oil shales and methyl clathrates]

You can change the colours of the segments if you want to (p. 910).

5.7.9 The stripchart function

For sample sizes that are too small to use box-and-whisker plots, an alternative plotting method is to use the stripchart function. The point of using stripchart is to look carefully at the location of individual values within the small sample, and to compare values across cases. The stripchart plot can be specified by a model formula y~factor and the strips can be specified to run vertically rather than horizontally. Here is an example from the built-in OrchardSprays data set where the response variable is called decrease and there is a single categorical variable called treatment (with eight levels A–H). Note the use of with instead of attach:

    data(OrchardSprays)
    with(OrchardSprays,
      stripchart(decrease ~ treatment,
        ylab = "decrease", vertical = TRUE, log = "y"))

[Figure: vertical strip charts of decrease (logarithmic y axis, 2 to 100) for treatments A to H]

This has the general layout of the box-and-whisker plot, but shows all the raw data values. Note the logarithmic y axis, log = "y", and the vertical alignment of the eight strip charts.

5.7.10 A plot to test for normality

Here is a simple function that plots a data set and compares it to a plot of normally distributed data with the same mean and standard deviation:

    normal.plot <- function(y) {
    s <- sd(y)
    plot(c(0,3),c(min(0,mean(y)-s*4*qnorm(0.75)),max(y)),xaxt="n",xlab="",type="n",ylab="")
    # for your data's boxes and whiskers, centred at x = 1
    top <- quantile(y,0.75)
    bottom <- quantile(y,0.25)
    w1u <- quantile(y,0.91)
    w2u <- quantile(y,0.98)
    w1d <- quantile(y,0.09)
    w2d <- quantile(y,0.02)
    rect(0.8,bottom,1.2,top)
    lines(c(0.8,1.2),c(mean(y),mean(y)),lty=3)
    lines(c(0.8,1.2),c(median(y),median(y)))
    lines(c(1,1),c(top,w1u))
    lines(c(0.9,1.1),c(w1u,w1u))
    lines(c(1,1),c(w2u,w1u),lty=3)
    lines(c(0.9,1.1),c(w2u,w2u),lty=3)

    nou <- length(y[y>w2u])
    points(rep(1,nou),jitter(y[y>w2u]))
    lines(c(1,1),c(bottom,w1d))
    lines(c(0.9,1.1),c(w1d,w1d))
    lines(c(1,1),c(w2d,w1d),lty=3)
    lines(c(0.9,1.1),c(w2d,w2d),lty=3)
    nod <- length(y[y

[Figure: box-and-whisker plot of the data (left) beside the plot expected for normally distributed data with the same mean and standard deviation (right)]

Our data (on the left) are non-normal in several obvious ways: the median is lower than the mean (the solid line is below the horizontal dotted line inside the box), the 75th percentile is rather low (the top of the normal box on the right is higher), and our data have two serious outliers (the open circles). Most obviously, however, our data have no negative values, which normally distributed data with a mean and standard deviation as specified would certainly be expected to have (the 9th and 2nd percentiles on the right-hand box are both well below zero, but our minimum value was 0).

5.8 Plots with multiple variables

Initial data inspection using plots is even more important when there are many variables, any one of which might contain mistakes or omissions. The principal plot functions when there are multiple variables are:

• pairs for a matrix of scatterplots of every variable against every other;
• coplot for conditioning plots where y is plotted against x for different values of z;
• xyplot where a set of panel plots is produced.

We illustrate these functions with the ozone data.

5.8.1 The pairs function

With two or more continuous explanatory variables (i.e. in a multiple regression; see p. 395) it is valuable to be able to check for subtle dependencies between the explanatory variables. The pairs function plots every variable in the dataframe on the y axis against every other variable on the x axis: you will see at once what this means from the following example:

    ozonedata <- read.table("c:\\temp\\ozone.data.txt",header=T)
    attach(ozonedata)
    names(ozonedata)
    [1] "rad" "temp" "wind" "ozone"

The pairs function needs only the name of the whole dataframe as its first argument. We exercise the option to add a non-parametric smoother to the scatterplots:

    pairs(ozonedata,panel=panel.smooth)

[Figure: matrix of scatterplots with smoothers for rad, temp, wind and ozone, the variable names appearing on the diagonal]

The response variables are named in the rows and the explanatory variables are named in the columns. Thus, in the upper row, labelled rad, the response variable (on the y axis) is solar radiation. In the bottom row the response variable, ozone, is on the y axis of all three panels. There appears to be a strong negative non-linear relationship between ozone and wind speed, a positive non-linear relationship between air temperature and ozone (middle panel in the bottom row) and an indistinct, perhaps humped, relationship between ozone and solar radiation (left-most panel in the bottom row). As to the explanatory variables, there appears to be a negative correlation between wind speed and temperature.
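The visual impressions from a pairs plot can be backed up with a correlation matrix. The sketch below assumes that R's built-in airquality data frame is an acceptable stand-in for ozone.data.txt (it records the same four kinds of measurement, but under different column names and with missing values that must be dropped first):

```r
# Stand-in data: built-in airquality, restricted to complete cases
aq <- na.omit(airquality[, c("Solar.R", "Temp", "Wind", "Ozone")])

pairs(aq, panel = panel.smooth)   # same style of plot as above

cors <- cor(aq)                   # pairwise Pearson correlations
round(cors, 2)
```

Bear in mind that a correlation coefficient summarizes only the linear part of a relationship, so it will understate curved relationships like the ones the smoothers pick out.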

5.8.2 The coplot function

A real difficulty with multivariate data is that the relationship between two variables may be obscured by the effects of other processes. When you draw a two-dimensional plot of y against x, then all of the effects of the other explanatory variables are squashed flat onto the plane of the paper. In the simplest case, we have one response variable (ozone) and just two explanatory variables (wind speed and air temperature). The function is written like this:

    coplot(ozone~wind|temp,panel = panel.smooth)

[Figure: coplot of ozone (y axis) against wind (x axis) in six panels, with the shingles for the conditioning variable shown in the upper margin, labelled ‘Given : temp’]

We have the response (ozone) on the left of the tilde and the explanatory variable on the x axis (wind) on the right, with the conditioning variable after the conditioning operator | (here read as ‘given temp’). An option employed here is to fit a non-parametric smoother through the scatterplot to emphasize the contrasting trends in each of the panels.

The coplot panels are ordered from lower left to upper right, associated with the values of the conditioning variable in the upper panel (temp) from left to right. Thus, the lower left-hand plot is for the lowest temperatures (56–72°F) and the upper right plot is for the highest temperatures (82–96°F). This coplot

highlights an interesting interaction. At the two lowest levels of the conditioning variable, temp, there is little or no relationship between ozone concentration and wind speed, but in the four remaining panels (at higher temperatures) there is a distinct negative relationship between wind speed and ozone. The hard thing to understand about coplot involves the ‘shingles’ that are shown in the upper margin (given temp in this case). The overlap between the shingles is intended to show how much overlap there is between one panel and the next in terms of the data points they have in common. In this default configuration, half of the data in a panel is shared with the panel to the left, and half of the data is shared with the panel to the right (overlap = 0.5). You can alter the shingle as far as the other extreme, when all the data points in a panel are unique to that panel (there is no overlap between adjacent shingles; overlap = -0.05).

5.8.3 Interaction plots

These are useful when the response to one factor depends upon the level of another factor. They are a particularly effective graphical means of interpreting the results of factorial experiments (p. 516). Here is an experiment with grain yields in response to irrigation and fertilizer application:

    yields <- read.table("c:\\temp\\splityield.txt",header=T)
    attach(yields)
    names(yields)
    [1] "yield" "block" "irrigation" "density" "fertilizer"

The interaction plot has a rather curious syntax, because the response variable (yield) comes last in the list of arguments. The factor listed first forms the x axis of the plot (three levels of fertilizer), and the factor listed second produces the family of lines (two levels of irrigation). The lines join the mean values of the response for each combination of factor levels:

    interaction.plot(fertilizer,irrigation,yield)

[Figure: interaction plot of mean of yield (y axis, 90 to 120) against fertilizer (x axis: N, NP, P), with separate lines for irrigation = irrigated and irrigation = control]

260 The interaction plot shows that the mean response to fertilizer depends upon the level of irrigation, as evidenced by the fact that the lines are not parallel.

5.9 Special plots

5.9.1 Design plots

An effective way of visualizing effect sizes in designed experiments is the plot.design function, which is used just like a model formula:

plot.design(Growth.rate~Water*Detergent*Daphnia)

[Figure: design plot of mean of Growth.rate for the levels of the factors Water (Tyne, Wear), Detergent (BrandA to BrandD) and Daphnia (Clone1 to Clone3)]

This shows the main effects of the three factors, drawing attention to the major differences between the daphnia clones and the small differences between the detergent brands A, B and C. The default (as here) is to plot means, but other functions can be specified such as median, var or sd. Alternatively, you can supply your own anonymous function. Here, for instance, are the standard errors for the different factor levels:

plot.design(Growth.rate~Water*Detergent*Daphnia, fun=function(x) sqrt(var(x)/3) )

261 [Figure: design plot of function(x) sqrt(var(x)/3) of Growth.rate for the levels of the factors Water, Detergent and Daphnia]

5.9.2 Bubble plots

The bubble plot is useful for illustrating variation in a third variable across different locations in the x–y plane. Here is a simple function for drawing bubble plots (see also p. 940):

bubble.plot <- function(xv,yv,rv,bs=0.1) {
  # scale the radii so that the largest bubble has radius 1
  r <- rv/max(rv)
  yscale <- max(yv)-min(yv)
  xscale <- max(xv)-min(xv)
  # draw empty axes, labelled with the names of the arguments supplied
  plot(xv,yv,type="n", xlab=deparse(substitute(xv)), ylab=deparse(substitute(yv)))
  for (i in 1:length(xv)) bubble(xv[i],yv[i],r[i],bs,xscale,yscale)
}

bubble <- function (x,y,r,bubble.size,xscale,yscale) {
  # trace one circle, stretched to the ranges of the two axes
  theta <- seq(0,2*pi,pi/200)
  yv <- r*sin(theta)*bubble.size*yscale
  xv <- r*cos(theta)*bubble.size*xscale
  lines(x+xv,y+yv)
}

The example data are on grass yields at different combinations of biomass and soil pH:

ddd <- read.table("c:\\temp\\pgr.txt",header=T)
attach(ddd)
names(ddd)
[1] "FR" "hay" "pH"
bubble.plot(hay,pH,FR)

262 [Figure: bubble plot of pH against hay, with the size of each bubble proportional to FR]

In the vicinity of hay = 6 and pH = 6, Festuca rubra shows one very high value, four intermediate values, two low values and one very low value. Evidently, hay crop and soil pH are not the only factors determining the abundance of F. rubra in this experiment.

5.9.3 Plots with many identical values

Sometimes, especially with count data, it happens that two or more points fall in exactly the same location in a scatterplot. In such a case, the repeated values of y are hidden, one buried beneath the other, and you might want to indicate the number of cases represented at each point on the scatterplot.

numbers <- read.table("c:\\temp\\longdata.txt",header=T)
attach(numbers)
names(numbers)
[1] "xlong" "ylong"

The first option is to 'jitter' the points within the plot function. This means to increase or decrease their x and/or y coordinates by a small random amount until each data point shows separately:

plot(jitter(xlong,amount=1),jitter(ylong,amount=1),xlab="input",ylab="count")

263 [Figure: jittered scatterplot of count against input]

You need to experiment with the amount argument to get the degree of scatter you require (this specifies the limit on the x or y axis of the amount of jitter on either side of the actual value).

An alternative function is called sunflowerplot, so called because it produces one 'petal' of a flower for each value of y (if there is more than one) that is located at that particular point. Here it is in action:

sunflowerplot(xlong,ylong)

[Figure: sunflower plot of ylong against xlong]

264 As you can see, the replication at each point increases as x increases from 1 on the left to 50 on the right. The petals stop being particularly informative once there are more than about 20 of them (about half way along the x axis). Single values (as on the extreme left) are shown without any petals, while two points in the same place have two petals. As an option, you can specify two vectors containing the unique values of x and y with a third vector containing the frequency of each combination (the number of repeats of each value).

5.10 Saving graphics to file

For publication-quality graphics, you are likely to want to save each of your plots as a PDF or PostScript file. You do this simply by specifying the 'device' before you start plotting, then turning the device off once you have finished. The default device is your computer screen, and you can obtain a rough and ready copy of the graph (press Ctrl + C) which you can then paste into a document outside R (press Ctrl + V).

data <- read.table("c:\\temp\\pollute.txt",header=T)
attach(data)

You are most likely to want to save to a PDF file. Here is how you do so:

pdf("c:\\temp\\pollution.pdf",width=7,height=4)
par(mfrow=c(1,2))
plot(Population,Pollution)
plot(Temp,Pollution)
dev.off()

Here is how you save to a PostScript file:

postscript("c:\\temp\\pollution.ps",width=7,height=4)
par(mfrow=c(1,2))
plot(Population,Pollution)
plot(Temp,Pollution)
dev.off()

There are numerous options for the pdf and postscript functions, but width and height are the ones you are likely to want to change most often. The sizes are in inches. You can specify any non-default arguments that you want to change (width, height, onefile, family, title, fonts, paper, encoding, pointsize, bg, fg, pagecentre, useDingbats, colormodel, fillOddEven and compress) using the functions pdf.options(..., reset = FALSE) and ps.options(..., reset = FALSE) before you invoke either pdf or postscript. The logical option reset = TRUE resets all the options to their default, 'factory-fresh' values. Don't forget to call dev.off() once you have finished.

5.11 Summary

It is worth restating the really important things about plotting.

• Plots: plot(x,y) gives a scatterplot if x is continuous, and a box-and-whisker plot if x is a factor. Some people prefer the alternative syntax plot(y~x) using 'tilde' as in a model formula; one advantage is that this has a subset option.

265 • Type of plot: Options include lines type="l" or null (axes only) type="n".
• Lines: lines(x,y) plots a smooth function of y against x using the x and y values provided. You might prefer lines(y~x).
• Line types: Useful dotted or dashed lines; lty=2 (an option in plot or lines).
• Points: points(x,y) adds another set of data points to a plot. You might prefer points(y~x).
• Plotting characters for different data sets: pch=16 or pch="*" (an option in points or plot).
• Axes: setting non-default limits to the x or y axis scales uses xlim=c(0,25) and/or ylim=c(0,1) as an option in plot.
• Labels: use xlab and ylab to label the x and y axes.
• Scales: use ylim and xlim to control the top and bottom values on your axes.
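The options in this summary can be combined in a single plot and written to file as described in Section 5.10. The following is a minimal sketch rather than an example from the text: the data are invented, and the output goes to a temporary file created with tempfile (a hypothetical destination chosen so that the code runs on any system).

```r
# Invented data, for illustration only
set.seed(42)
x <- seq(0, 25, 0.5)
curve.y <- 1 - exp(-0.15 * x)                     # a smooth function of x
obs <- curve.y + rnorm(length(x), sd = 0.05)      # noisy observations

out <- tempfile(fileext = ".pdf")                 # hypothetical output file
pdf(out, width = 7, height = 4)                   # open the device; sizes in inches
plot(x, obs, type = "n",                          # axes only
     xlim = c(0, 25), ylim = c(0, 1.2),           # non-default axis limits
     xlab = "x", ylab = "y")                      # axis labels
points(x, obs, pch = 16)                          # add the data points
lines(x, curve.y, lty = 2)                        # add a dashed line
dev.off()                                         # close the device

file.exists(out)
```

Opening the device, plotting, and calling dev.off() always come in that order; nothing is written to the file until the device is closed.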

266 6 Tables

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.

The alternative to using graphics is to summarize your data in tabular form. Broadly speaking, if you want to convey detail use a table, and if you want to show effects then use graphics. You are more likely to want to use a table to summarize data when your explanatory variables are categorical (such as people's names, or different commodities) than when they are continuous (in which case a scatterplot is likely to be more informative; see p. 189). There are two very important functions that you need to distinguish:

• table for counting things;
• tapply for averaging things, and applying other functions across factor levels.

6.1 Tables of counts

The table function is perhaps the most useful of all the simple vector functions, because it does so much work behind the scenes. We have a vector of objects (they could be numbers or character strings) and we want to know how many of each is present in the vector. Here are 1000 integers from a Poisson distribution with mean 0.6:

counts<-rpois(1000,0.6)

We want to count up all of the zeros, ones, twos, and so on. A big task, but here is the table function in action:

table(counts)
counts
  0   1   2   3   4   5
539 325 110  24   1   1

There were 539 zeros, 325 ones, 110 twos, 24 threes, 1 four, 1 five and nothing larger than 5. That is a lot of work (imagine tallying them for yourself). The function works for characters as well as for numbers, and for

267 multiple classifying variables:

infections<-read.table("c:\\temp\\disease.txt",header=T)
attach(infections)
head(infections)
  status gender
1  clear   male
2  clear   male
3  clear   male
4  clear   male
5  clear   male
6  clear   male

and so on for 1000 rows. You want to know how many males and females were infected and how many were clear of infection:

table(status,gender)
          gender
status     females male
  clear        284  515
  infected      53   68

If you want the genders as the rows rather than the columns, then put gender first in the argument list to table:

table(gender,status)
         status
gender    clear infected
  females   284       53
  male      515       68

The table function is likely to be one of the R functions you use most often in your own work.

6.2 Summary tables

The most important function in R for generating summary tables is the somewhat obscurely named tapply function. It is called tapply because it applies a named function (such as mean or variance) across specified margins (factor levels) to create a table. If you have used the PivotTable function in Excel you will be familiar with the concept. Here is tapply in action:

data<-read.table("c:\\temp\\Daphnia.txt",header=T)
attach(data)
names(data)
[1] "Growth.rate" "Water" "Detergent" "Daphnia"

The response variable is growth rate of the animals, and there are three categorical explanatory variables: the river from which the water was sampled, the kind of detergent experimentally added, and the clone of

268 daphnia employed in the experiment. In the simplest case we might want to tabulate the mean growth rates for the four brands of detergent tested,

tapply(Growth.rate,Detergent,mean)
  BrandA   BrandB   BrandC   BrandD
3.884832 4.010044 3.954512 3.558231

or for the two rivers,

tapply(Growth.rate,Water,mean)
    Tyne     Wear
3.685862 4.017948

or for the three daphnia clones,

tapply(Growth.rate,Daphnia,mean)
  Clone1   Clone2   Clone3
2.839875 4.577121 4.138719

Two-dimensional summary tables are created by replacing the single explanatory variable (the second argument in the function call) by a list indicating which variable is to be used for the rows of the summary table and which variable is to be used for creating the columns of the summary table. To get the daphnia clones as the rows and detergents as the columns, we write list(Daphnia,Detergent), rows first then columns, and use tapply to create the summary table as follows:

tapply(Growth.rate,list(Daphnia,Detergent),mean)
         BrandA   BrandB   BrandC   BrandD
Clone1 2.732227 2.929140 3.071335 2.626797
Clone2 3.919002 4.402931 4.772805 5.213745
Clone3 5.003268 4.698062 4.019397 2.834151

If we wanted the median values (rather than the means), then we would just alter the third argument of the tapply function like this:

tapply(Growth.rate,list(Daphnia,Detergent),median)
         BrandA   BrandB   BrandC   BrandD
Clone1 2.705995 3.012495 3.073964 2.503468
Clone2 3.924411 4.282181 4.612801 5.416785
Clone3 5.057594 4.627812 4.040108 2.573003

To obtain a table of the standard errors of the means (where each mean is based on six numbers: three replicates from each of two rivers) the function we want to apply is $\sqrt{s^2/n}$. There is no built-in function for the standard error of a mean, so we create what is known as an anonymous function inside the tapply function with function(x)sqrt(var(x)/length(x)) like this:

tapply(Growth.rate,list(Daphnia,Detergent), function(x) sqrt(var(x)/length(x)))
          BrandA    BrandB    BrandC    BrandD
Clone1 0.2163448 0.2319320 0.3055929 0.1905771
Clone2 0.4702855 0.3639819 0.5773096 0.5520220
Clone3 0.2688604 0.2683660 0.5395750 0.4260212
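The anonymous-function idea can be tried without the Daphnia data file. This is a minimal sketch on invented data: the names g, y and se are hypothetical, and the last line simply confirms that $\sqrt{s^2/n}$ is the same quantity as the standard deviation divided by $\sqrt{n}$.

```r
# Invented data: three groups of six values each
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 6))
y <- rnorm(18, mean = 10)

# Standard error of a mean, written once as a named function ...
se <- function(x) sqrt(var(x) / length(x))
ses <- tapply(y, g, se)

# ... or supplied anonymously, exactly as in the text
ses.anon <- tapply(y, g, function(x) sqrt(var(x) / length(x)))

ses
# Check one cell by hand: sd / sqrt(n) for group "a"
all.equal(as.numeric(ses["a"]), sd(y[g == "a"]) / sqrt(6))
```

Naming the function is worthwhile when the same summary is needed in several tables; the anonymous form is convenient for a one-off.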

269 When tapply is asked to produce a three-dimensional table, it produces a stack of two-dimensional tables, the number of stacked tables being determined by the number of levels of the categorical variable that comes third in the list (Water in this case):

tapply(Growth.rate,list(Daphnia,Detergent,Water),mean)
, , Tyne
         BrandA   BrandB   BrandC   BrandD
Clone1 2.811265 2.775903 3.287529 2.597192
Clone2 3.307634 4.191188 3.620532 4.105651
Clone3 4.866524 4.766258 4.534902 3.365766

, , Wear
         BrandA   BrandB   BrandC   BrandD
Clone1 2.653189 3.082377 2.855142 2.656403
Clone2 4.530371 4.614673 5.925078 6.321838
Clone3 5.140011 4.629867 3.503892 2.302537

In cases like this, the function ftable (which stands for 'flat table') often produces more pleasing output:

ftable(tapply(Growth.rate,list(Daphnia,Detergent,Water),mean))
                  Tyne     Wear
Clone1 BrandA 2.811265 2.653189
       BrandB 2.775903 3.082377
       BrandC 3.287529 2.855142
       BrandD 2.597192 2.656403
Clone2 BrandA 3.307634 4.530371
       BrandB 4.191188 4.614673
       BrandC 3.620532 5.925078
       BrandD 4.105651 6.321838
Clone3 BrandA 4.866524 5.140011
       BrandB 4.766258 4.629867
       BrandC 4.534902 3.503892
       BrandD 3.365766 2.302537

Notice that the order of the rows, columns or tables is determined by the alphabetical sequence of the factor levels (e.g. Tyne comes before Wear in the alphabet). If you want to override this, you must specify that the factor levels are ordered in a non-standard way:

water<-factor(Water,levels=c("Wear","Tyne"))

Now the summary statistics for the Wear appear in the left-hand column of output:

ftable(tapply(Growth.rate,list(Daphnia,Detergent,water),mean))
                  Wear     Tyne
Clone1 BrandA 2.653189 2.811265
       BrandB 3.082377 2.775903
       BrandC 2.855142 3.287529
       BrandD 2.656403 2.597192

270 Clone2 BrandA 4.530371 3.307634
       BrandB 4.614673 4.191188
       BrandC 5.925078 3.620532
       BrandD 6.321838 4.105651
Clone3 BrandA 5.140011 4.866524
       BrandB 4.629867 4.766258
       BrandC 3.503892 4.534902
       BrandD 2.302537 3.365766

The function to be applied in generating the table can be supplied with extra arguments:

tapply(Growth.rate,Detergent,mean,trim=0.1)
  BrandA   BrandB   BrandC   BrandD
3.874869 4.019206 3.890448 3.482322

The trim argument is part of the mean function, specifying the fraction (between 0 and 0.5) of the observations to be trimmed from each end of the sorted vector of values before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

An extra argument is essential if you want means when there are missing values:

tapply(Growth.rate,Detergent,mean,na.rm=T)

Without the argument specifying that you want to average over the non-missing values (na.rm=T means 'it is true that I want to remove the missing values'), the mean function will simply fail, producing NA as the answer.

You can use tapply to create new, abbreviated dataframes comprising summary parameters estimated from a larger dataframe. Here, for instance, we want a dataframe of mean growth rate classified by detergent and daphnia clone (i.e. averaged over river water and replicates). The trick is to convert the factors to numbers before using tapply, then use these numbers to extract the relevant levels from the original factors:

dets <- as.vector(tapply(as.numeric(Detergent),list(Detergent,Daphnia),mean))
levels(Detergent)[dets]
[1] "BrandA" "BrandB" "BrandC" "BrandD" "BrandA" "BrandB" "BrandC" "BrandD"
[9] "BrandA" "BrandB" "BrandC" "BrandD"

clones<-as.vector(tapply(as.numeric(Daphnia),list(Detergent,Daphnia),mean))
levels(Daphnia)[clones]
[1] "Clone1" "Clone1" "Clone1" "Clone1" "Clone2" "Clone2" "Clone2" "Clone2"
[9] "Clone3" "Clone3" "Clone3" "Clone3"

You will see that these vectors of factor levels are the correct length for the new reduced dataframe (12, rather than the original length 72). The 12 mean values that will form our response variable in the new, reduced dataframe are given by:

tapply(Growth.rate,list(Detergent,Daphnia),mean)
         Clone1   Clone2   Clone3
BrandA 2.732227 3.919002 5.003268
BrandB 2.929140 4.402931 4.698062
BrandC 3.071335 4.772805 4.019397
BrandD 2.626797 5.213745 2.834151

271 These can now be converted into a vector called means, and the three new vectors combined into a dataframe:

means <- as.vector(tapply(Growth.rate,list(Detergent,Daphnia),mean))
detergent <- levels(Detergent)[dets]
daphnia <- levels(Daphnia)[clones]
data.frame(means,detergent,daphnia)
      means detergent daphnia
1  2.732227    BrandA  Clone1
2  2.929140    BrandB  Clone1
3  3.071335    BrandC  Clone1
4  2.626797    BrandD  Clone1
5  3.919002    BrandA  Clone2
6  4.402931    BrandB  Clone2
7  4.772805    BrandC  Clone2
8  5.213745    BrandD  Clone2
9  5.003268    BrandA  Clone3
10 4.698062    BrandB  Clone3
11 4.019397    BrandC  Clone3
12 2.834151    BrandD  Clone3

The same result can be obtained using the as.data.frame.table function:

as.data.frame.table(tapply(Growth.rate,list(Detergent,Daphnia),mean))
     Var1   Var2     Freq
1  BrandA Clone1 2.732227
2  BrandB Clone1 2.929140
3  BrandC Clone1 3.071335
4  BrandD Clone1 2.626797
5  BrandA Clone2 3.919002
6  BrandB Clone2 4.402931
7  BrandC Clone2 4.772805
8  BrandD Clone2 5.213745
9  BrandA Clone3 5.003268
10 BrandB Clone3 4.698062
11 BrandC Clone3 4.019397
12 BrandD Clone3 2.834151

but you need to edit the variable names like this:

new<-as.data.frame.table(tapply(Growth.rate,list(Detergent,Daphnia),mean))
names(new)<-c("detergents","daphnia","means")
head(new)
  detergents daphnia    means
1     BrandA  Clone1 2.732227
2     BrandB  Clone1 2.929140
3     BrandC  Clone1 3.071335
4     BrandD  Clone1 2.626797
5     BrandA  Clone2 3.919002
6     BrandB  Clone2 4.402931
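The renaming step can be avoided by naming the elements of the list passed to tapply and using the responseName argument of as.data.frame.table. What follows is a sketch on invented data (the dataframe d and its values are hypothetical, not the Daphnia data), intended only to show where the column names come from.

```r
# Invented data: 2 detergents x 3 clones x 3 replicates
set.seed(2)
d <- expand.grid(detergent = c("BrandA", "BrandB"),
                 daphnia = c("Clone1", "Clone2", "Clone3"),
                 rep = 1:3)
d$growth <- rnorm(nrow(d), mean = 4)

# Naming the list elements names the dimensions of the summary table
tab <- tapply(d$growth,
              list(detergent = d$detergent, daphnia = d$daphnia),
              mean)

# responseName replaces the default column name "Freq"
new <- as.data.frame.table(tab, responseName = "means")
names(new)
new
```

The first two columns take their names from the named list, so only the response column needs to be specified.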

272 6.3 Expanding a table into a dataframe

For the purposes of model-fitting, we often want to expand a table of explanatory variables to create a dataframe with as many repeated rows as specified by a count. Here are the data:

count.table<-read.table("c:\\temp\\tabledata.txt",header=T)
attach(count.table)
head(count.table)
  count    sex   age   condition
1    12   male young     healthy
2     7   male   old     healthy
3     9 female young     healthy
4     8 female   old     healthy
5     6   male young parasitized
6     7   male   old parasitized

The idea is to create a new dataframe with a separate row for each case. That is to say, we want 12 copies of the first row (for healthy young males), seven copies of the second row (for healthy old males), and so on. The trick is to use lapply to apply the repeat function rep to each variable in count.table such that each row is repeated by the number of times specified in the vector called count:

lapply(count.table,function(x)rep(x, count.table$count))
$count
 [1] 12 12 12 12 12 12 12 12 12 12 12 12 7 7 7 7 7 7 7 9
[21] 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8 8 6 6 6 6
[41] 6 6 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 5 5 5
[61] 5 5

$sex
 [1] male male male male male male male male male male
[11] male male male male male male male male male female
[21] female female female female female female female female female female
[31] female female female female female female male male male male
[41] male male male male male male male male male female
[51] female female female female female female female female female female
[61] female female
Levels: female male

$age
 [1] young young young young young young young young young young
[11] young young old old old old old old old young
[21] young young young young young young young young old old
[31] old old old old old old young young young young
[41] young young old old old old old old old young
[51] young young young young young young young old old old
[61] old old
Levels: old young

273 $condition
 [1] healthy healthy healthy healthy healthy healthy healthy
 [8] healthy healthy healthy healthy healthy healthy healthy
[15] healthy healthy healthy healthy healthy healthy
[21] healthy healthy healthy healthy healthy healthy healthy
[28] healthy healthy healthy healthy healthy healthy healthy
[35] healthy healthy parasitized parasitized parasitized parasitized
[41] parasitized parasitized parasitized parasitized parasitized parasitized
[47] parasitized parasitized parasitized parasitized parasitized parasitized
[53] parasitized parasitized parasitized parasitized parasitized parasitized
[59] parasitized parasitized parasitized parasitized
Levels: healthy parasitized

Then we convert this object from a list to a data.frame using as.data.frame like this:

dbtable<-as.data.frame(lapply(count.table, function(x) rep(x, count.table$count)))
head(dbtable)
  count  sex   age condition
1    12 male young   healthy
2    12 male young   healthy
3    12 male young   healthy
4    12 male young   healthy
5    12 male young   healthy
6    12 male young   healthy

To tidy up, we probably want to remove the redundant vector of counts:

dbtable<-dbtable[,-1]
head(dbtable)
   sex   age condition
1 male young   healthy
2 male young   healthy
3 male young   healthy
4 male young   healthy
5 male young   healthy
6 male young   healthy

tail(dbtable)
      sex   age   condition
57 female young parasitized
58 female   old parasitized
59 female   old parasitized
60 female   old parasitized
61 female   old parasitized
62 female   old parasitized

Now we can use the contents of dbtable as explanatory variables in modelling other responses of each of the 62 cases (e.g. the animals' body weights). The alternative is to produce a long vector of row numbers

274 and use this as a subscript on the rows of the short dataframe to turn it into a long dataframe with the same column structure (this is illustrated on p. 255).

6.4 Converting from a dataframe to a table

The reverse procedure of creating a table from a dataframe is much more straightforward, and involves nothing more than the table function:

table(dbtable)
, , condition = healthy

        age
sex      old young
  female   8     9
  male     7    12

, , condition = parasitized

        age
sex      old young
  female   5     8
  male     7     6

You might want this tabulated object itself to be another dataframe, in which case use:

as.data.frame(table(dbtable))
     sex   age   condition Freq
1 female   old     healthy    8
2   male   old     healthy    7
3 female young     healthy    9
4   male young     healthy   12
5 female   old parasitized    5
6   male   old parasitized    7
7 female young parasitized    8
8   male young parasitized    6

You will see that R has invented the variable name Freq for the counts of the various contingencies. To change this to 'count' use names with the appropriate subscript [4]:

frame<-as.data.frame(table(dbtable))
names(frame)[4]<-"count"
frame
     sex   age   condition count
1 female   old     healthy     8
2   male   old     healthy     7
3 female young     healthy     9
4   male young     healthy    12
5 female   old parasitized     5
6   male   old parasitized     7
7 female young parasitized     8
8   male young parasitized     6
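The round trip of Sections 6.3 and 6.4 can be sketched in a few lines using the row-number alternative mentioned at the end of Section 6.3. The dataframe below is typed in by hand from the counts printed above (rows 7 and 8 are inferred from the final table of counts, because the data file itself is not reproduced here), and the object names rows, long and back are hypothetical.

```r
# Counts typed in from the printed output (an assumption, not the data file)
count.table <- data.frame(
  count = c(12, 7, 9, 8, 6, 7, 8, 5),
  sex = c("male", "male", "female", "female",
          "male", "male", "female", "female"),
  age = rep(c("young", "old"), times = 4),
  condition = rep(c("healthy", "parasitized"), each = 4))

# A long vector of row numbers: row i appears count[i] times
rows <- rep(seq_len(nrow(count.table)), times = count.table$count)

# Use it as a row subscript, dropping the count column
long <- count.table[rows, -1]
nrow(long)

# And back again: one row per combination, with the counts restored
back <- as.data.frame(table(long), responseName = "count")
back
```

Subscripting by row number repeats whole rows at once, so every column keeps its original type without any further conversion.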

275 6.5 Calculating tables of proportions with prop.table

The margins of a table (the row totals or the column totals) are often useful for calculating proportions instead of counts. Here is a data matrix called counts:

counts<-matrix(c(2,2,4,3,1,4,2,0,1,5,3,3),nrow=4)
counts
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    2    4    5
[3,]    4    2    3
[4,]    3    0    3

The proportions will be different when they are expressed as a fraction of the row totals or of the column totals. To find the proportions we use prop.table(counts,margin). You need to remember that the row subscripts come first, which is why margin=1 refers to the row totals:

prop.table(counts,1)
          [,1]      [,2]      [,3]
[1,] 0.5000000 0.2500000 0.2500000
[2,] 0.1818182 0.3636364 0.4545455
[3,] 0.4444444 0.2222222 0.3333333
[4,] 0.5000000 0.0000000 0.5000000

Use margin=2 to express the counts as proportions of the relevant column total:

prop.table(counts,2)
          [,1]      [,2]       [,3]
[1,] 0.1818182 0.1428571 0.08333333
[2,] 0.1818182 0.5714286 0.41666667
[3,] 0.3636364 0.2857143 0.25000000
[4,] 0.2727273 0.0000000 0.25000000

To check that the column proportions sum to 1, use colSums like this:

colSums(prop.table(counts,2))
[1] 1 1 1

If you want the proportions expressed as a fraction of the grand total sum(counts), then simply omit the margin number:

prop.table(counts)
           [,1]       [,2]       [,3]
[1,] 0.06666667 0.03333333 0.03333333
[2,] 0.06666667 0.13333333 0.16666667
[3,] 0.13333333 0.06666667 0.10000000
[4,] 0.10000000 0.00000000 0.10000000

sum(prop.table(counts))
[1] 1
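It can help to see what prop.table is doing by computing the same proportions by hand. This is a short sketch using the counts matrix from the text; the use of sweep here is one way of writing the division, offered as an illustration of the arithmetic.

```r
counts <- matrix(c(2, 2, 4, 3, 1, 4, 2, 0, 1, 5, 3, 3), nrow = 4)

# Divide every row by its own total (margin 1 = rows)
by.row <- sweep(counts, 1, rowSums(counts), "/")
all.equal(by.row, prop.table(counts, 1))

# Divide every column by its own total (margin 2 = columns)
by.col <- sweep(counts, 2, colSums(counts), "/")
all.equal(by.col, prop.table(counts, 2))

# Row proportions sum to 1 across each row
rowSums(prop.table(counts, 1))
```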

276 In any particular case, you need to think carefully whether it makes sense to express your counts as proportions of the row totals, the column totals or the grand total.

6.6 The scale function

For a numeric matrix, you might want to scale the values within a column so that they have a mean of 0. You might also want to know the standard deviation of the values within each column. These two actions are carried out simultaneously with the scale function:

scale(counts)
           [,1]      [,2]      [,3]
[1,] -0.7833495 -0.439155 -1.224745
[2,] -0.7833495  1.317465  1.224745
[3,]  1.3055824  0.146385  0.000000
[4,]  0.2611165 -1.024695  0.000000
attr(,"scaled:center")
[1] 2.75 1.75 3.00
attr(,"scaled:scale")
[1] 0.9574271 1.7078251 1.6329932

The values in the table are the counts minus the column means of the counts, divided by the column standard deviations (the default for scale). The means of the columns attr(,"scaled:center") are 2.75, 1.75 and 3.0, while the standard deviations of the columns attr(,"scaled:scale") are 0.96, 1.71 and 1.63. To check that the scales are the standard deviations (sd) of the counts within a column, you could use apply to the columns (margin = 2) like this:

apply(counts,2,sd)
[1] 0.9574271 1.7078251 1.6329932

6.7 The expand.grid function

This is a useful function for generating tables from factorial combinations of factor levels. Suppose we have three variables: height with five levels between 60 and 80 in steps of 5, weight with five levels between 100 and 300 in steps of 50, and two sexes. Then:

expand.grid(height = seq(60, 80, 5), weight = seq(100, 300, 50), sex = c("Male","Female"))
   height weight    sex
1      60    100   Male
2      65    100   Male
3      70    100   Male
4      75    100   Male
5      80    100   Male
...
48     70    300 Female
49     75    300 Female
50     80    300 Female
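The structure of the output is easier to see if the result is stored and inspected rather than printed in full. A brief sketch follows, in which the object name grid is hypothetical:

```r
grid <- expand.grid(height = seq(60, 80, 5),
                    weight = seq(100, 300, 50),
                    sex = c("Male", "Female"))

nrow(grid)               # 5 * 5 * 2 = 50 combinations
head(grid$height, 6)     # the first variable changes fastest
table(grid$sex)          # each sex appears in 25 rows
```

Because the first variable listed changes fastest and the last changes slowest, the order of the arguments determines the order of the rows.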

277 6.8 The model.matrix function

Creating tables of dummy variables for use in statistical modelling is extremely easy with the model.matrix function. You will see what the function does with a simple example. Suppose that our dataframe contains a factor called parasite indicating the identity of a gut parasite; this variable has five levels: vulgaris, kochii, splendens, viridis and knowlesii. Note that there was no header row in the data file, so the variable name parasite had to be added subsequently, using names:

data<-read.table("c:\\temp\\parasites.txt")
names(data)<-"parasite"
attach(data)
head(data)
   parasite
1  vulgaris
2 splendens
3 knowlesii
4  vulgaris
5 knowlesii
6   viridis

levels(parasite)
[1] "knowlesii" "kochii" "splendens" "viridis" "vulgaris"

In our modelling we want to create a two-level dummy variable (present or absent) for each parasite species (in five extra columns), so that we can ask questions such as whether the mean value of the response variable is significantly different in cases where each parasite was present and when it was absent. So for the first row of the dataframe, we want vulgaris=TRUE, knowlesii=FALSE, kochii=FALSE, splendens=FALSE and viridis=FALSE.

The long-winded way of doing this is to create a new factor for each species separately:

vulgaris<-factor(1*(parasite=="vulgaris"))
kochii<-factor(1*(parasite=="kochii"))
table(vulgaris)
vulgaris
 0  1
99 52

table(kochii)
kochii
  0   1
134  17

and so on, with 1 for TRUE (meaning present) and 0 for FALSE (meaning absent). This is how easy it is to do with model.matrix:

model.matrix(~parasite-1)
    parasiteknowlesii parasitekochii parasitesplendens parasiteviridis parasitevulgaris
1                   0              0                 0               0                1
2                   0              0                 1               0                0

278 3                   1              0                 0               0                0
4                   0              0                 0               0                1
5                   1              0                 0               0                0
6                   0              0                 0               1                0
... etc. down to ...
147                 1              0                 0               0                0
148                 0              0                 0               1                0
149                 0              0                 0               0                1
150                 0              1                 0               0                0
151                 0              0                 1               0                0
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$parasite
[1] "contr.treatment"

The -1 in the model formula ensures that we create a dummy variable for each of the five parasite species (technically, it suppresses the creation of an intercept). Now we can join these five columns of dummy variables to the dataframe containing the response variable and the other explanatory variables. Suppose we had an original.frame. We just join the new columns to it,

new.frame<-data.frame(original.frame, model.matrix(~parasite-1))
attach(new.frame)

after which we can use variable names like parasiteknowlesii in statistical modelling.

6.9 Comparing table and tabulate

You will often want to count how many times different values are represented in a vector. This simple example illustrates the difference between the two functions. Here is table in action:

table(c(2,2,2,7,7,11))

 2  7 11
 3  2  1

It produces names for each element in the vector (2, 7, 11), and counts only those elements that are present (e.g. there are no zeros or ones in the output vector). The tabulate function counts all of the integers (truncating real numbers to integers if necessary), starting at 1 and ending at the maximum (11 in this case), putting a zero in the resulting vector for every missing integer, like this:

tabulate(c(2,2,2,7,7,11))
[1] 0 3 0 0 0 0 2 0 0 0 1

Because there are no 1s in our example, a count of zero is returned for the first element. There are three 2s but then a long gap to two 7s, then another gap to the maximum 11. It is important that you understand that tabulate will ignore negative numbers and zeros without warning:

tabulate(c(2,0,-3,2,2,7,-1, 0,0,7,11))
[1] 0 3 0 0 0 0 2 0 0 0 1
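The two functions can be compared side by side on the same vector. The following sketch stores the results so that they can be inspected; the object names tab and tb are hypothetical, and the final line illustrates how tabulate coerces non-integer values with as.integer, which drops the fractional part.

```r
x <- c(2, 2, 2, 7, 7, 11)

tab <- table(x)        # one named count per value that is present
tb <- tabulate(x)      # one count per integer from 1 to max(x)

names(tab)             # the values themselves, as character strings
tb[c(2, 7, 11)]        # the same counts, found by position
length(tb)             # 11 bins, most of them empty

# Zeros and negative numbers are dropped silently
tabulate(c(2, 0, -3, 2, 2, 7, -1, 0, 0, 7, 11))

# Non-integers are coerced with as.integer, so 1.9 counts as a 1
tabulate(c(1.9, 2.1))
```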

279 For most applications, table is much more useful than tabulate, but there are occasions when you want the zero counts to be retained. The commonest case is where you are generating a set of vectors, and you want all the vectors to be the same length (e.g. so that you can bind them to a dataframe). Suppose, for instance, that you want to make a dataframe containing three different realizations of a negative binomial distribution of counts, where the rows contain the frequencies of 0, 1, 2, 3, . . . successes. Let us take an example where the negative binomial parameters are size = 1 and prob = 0.2 and generate 100 random numbers from it, repeated three times:

table(rnbinom(100,1,0.2))
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
24 10 12  6 13  4  3  5  2  5  4  3  5  2  2

table(rnbinom(100,1,0.2))
 0  1  2  3  4  5  6  7  8 10 11 12 13 15 16
23 17 10  8  8  9  1  4  7  2  7  1  1  1  1

table(rnbinom(100,1,0.2))
 0  1  2  3  4  5  6  7  8  9 10 11 12 15 16 19 20 23 24 25
15 15 13 12  6  9  6  3  3  3  4  2  2  1  1  1  1  1  1  1

The three realizations produce vectors of different lengths (15, 15 again (but with a 15 and a 16 but no 9 or 14), and 20 respectively). With tabulate, we can specify the length of the output vector (the number of bins, nbins), but we need to make sure it is long enough, because overruns will be ignored without warning. We also need to remember to add 1 to the random integers generated, so that the zeros are counted rather than ignored. From what we have seen (above), it looks as if 30 bins should work well enough, so here are six realizations with nbins=30:

tabulate(rnbinom(100,1,0.2)+1,30)
[1] 21 14 10 8 10 6 11 4 3 3 1 4 1 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
[1] 21 14 13 7 6 10 10 4 2 0 4 3 2 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
[1] 25 13 16 8 5 7 4 2 5 3 4 0 1 2 0 1 0 1 0 0 0 2 0 0 0 0 0 1 0 0
[1] 22 21 12 13 8 5 3 3 2 4 2 2 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[1] 18 10 13 11 12 6 5 4 4 5 2 1 0 3 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1
[1] 26 15 9 12 4 7 7 5 3 2 1 3 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0

This looks fine, until you check the fifth row. There are only 98 numbers here, so two unknown values of counts greater than 29 have been generated but ignored without warning. To see how often this might happen, we can run the test 1000 times and tally the number of numbers presented by tabulate:

totals<-numeric(1000)
for (i in 1:1000) totals[i] <- sum(tabulate(rnbinom(100,1,0.2)+1,30))
table(totals)
totals
 98  99 100
  5 114 881

As you see, we lost one or more numbers on 119 occasions out of 1000. So take care with tabulate, and remember that 1 was added to all the counts to accommodate the zeros.
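One way to avoid losing values is to generate all of the realizations first and only then choose the number of bins from the largest count observed. This is a sketch of that idea, not the approach taken in the text; the object names sims, top and freqs are hypothetical.

```r
set.seed(3)

# Three realizations, one per column of a 100 x 3 matrix
sims <- replicate(3, rnbinom(100, size = 1, prob = 0.2))

# The largest count in any realization fixes the number of bins
top <- max(sims)

# Add 1 so that zeros fall in bin 1; row i holds the frequency of i - 1
freqs <- apply(sims, 2, function(x) tabulate(x + 1, nbins = top + 1))

dim(freqs)         # top + 1 rows, 3 columns
colSums(freqs)     # every column accounts for all 100 values
```

Because the columns share one set of bins, they have the same length by construction and can be bound into a dataframe with the counts 0 to top as row labels.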

280 7 Mathematics

You can do a lot of maths in R. Here we concentrate on the kinds of mathematics that find most frequent application in scientific work and statistical modelling:

• functions;
• continuous distributions;
• discrete distributions;
• matrix algebra;
• calculus;
• differential equations.

7.1 Mathematical functions

For the kinds of functions you will meet in statistical computing there are only three mathematical rules that you need to learn: these are concerned with powers, exponents and logarithms. In the expression $x^b$ the explanatory variable is raised to the power b. In $e^x$ the explanatory variable appears as a power (in this special case, of e = 2.718 28, of which x is the exponent). The inverse of $e^x$ is the logarithm of x, denoted by log(x); note that all our logs are to the base e and that, for us, writing log(x) is the same as ln(x).

It is also useful to remember a handful of mathematical facts that are useful for working out behaviour at the limits. We would like to know what happens to y when x gets very large (e.g. $x \to \infty$) and what happens to y when x goes to 0 (i.e. what the intercept is, if there is one). These are the most important rules:

• Anything to the power zero is 1: $x^0 = 1$.
• One raised to any power is still 1: $1^x = 1$.
• Infinity plus 1 is infinity: $\infty + 1 = \infty$.
• One over infinity (the reciprocal of infinity, $\infty^{-1}$) is zero: $1/\infty = 0$.
• A number > 1 raised to the power infinity is infinity: $1.2^\infty = \infty$.
• A fraction (e.g. 0.99) raised to the power infinity is zero: $0.99^\infty = 0$.

- Negative powers are reciprocals: $x^{-b} = 1/x^{b}$.
- Fractional powers are roots: $x^{1/3} = \sqrt[3]{x}$.
- The base of natural logarithms, e, is 2.718 28, so $e^{\infty} = \infty$.
- Last, but perhaps most usefully: $e^{-\infty} = 1/e^{\infty} = 1/\infty = 0$.

There are built-in functions in R for logarithmic, probability and trigonometric functions (p. 17).

7.1.1 Logarithmic functions

The logarithmic function is given by

$$y = a \ln(bx).$$

Here the logarithm is to base e. The exponential function, in which the response y is the antilogarithm of the continuous explanatory variable x, is given by

$$y = a e^{bx}.$$

Both these functions are smooth functions, and to draw smooth functions in R you need to generate a series of 100 or more regularly spaced values between min(x) and max(x):

    x <- seq(0,10,0.1)

In R the exponential function is exp and the natural log function (ln) is log. Let a = b = 1. To plot the exponential and logarithmic functions with these values together in a row, write

    windows(7,4)
    par(mfrow=c(1,2))
    y <- exp(x)
    plot(y~x,type="l",main="Exponential")
    y <- log(x)
    plot(y~x,type="l",main="Logarithmic")

[Figure: two panels, "Exponential" and "Logarithmic", showing y against x.]

Note that the plot function can be used in an alternative way, specifying the Cartesian coordinates of the line using plot(x,y) rather than the formula plot(y~x) (see p. 190). These functions are most useful in modelling processes of exponential growth and decay.

7.1.2 Trigonometric functions

Here are the cosine (base/hypotenuse), sine (perpendicular/hypotenuse) and tangent (perpendicular/base) functions of x (measured in radians) over the range 0 to 2π. Recall that the full circle is 2π radians, so 1 radian = 360/2π = 57.295 78 degrees.

    windows(7,7)
    par(mfrow=c(2,2))
    x <- seq(0,2*pi,2*pi/100)
    y1 <- cos(x)
    y2 <- sin(x)
    y3 <- tan(x)
    plot(y1~x,type="l",main="cosine")
    plot(y2~x,type="l",main="sine")
    plot(y3~x,type="l",ylim=c(-3,3),main="tangent")

[Figure: three panels, "cosine", "sine" and "tangent", showing y1, y2 and y3 against x.]

The tangent of x has discontinuities, shooting off to positive infinity at x = π/2 and again at x = 3π/2. Restricting the range of values plotted on the y axis (here from –3 to +3) therefore gives a better picture of

the shape of the tan function. Note that R joins the plus infinity and minus infinity 'points' with a straight line at x = π/2 and at x = 3π/2 within the frame of the graph defined by ylim.

7.1.3 Power laws

There is an important family of two-parameter mathematical functions of the form

$$y = a x^{b},$$

known as power laws. Depending on the value of the power, b, the relationship can take one of five forms. In the trivial case of b = 0 the function is y = a (a horizontal straight line). The four more interesting shapes are as follows:

    x <- seq(0,1,0.01)
    y <- x^0.5
    plot(x,y,type="l",main="0<b<1")

[Figure: four panels of power-law shapes showing y against x; the surviving panel titles include "0<b<1" and "b<0".]
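The four shapes can be sketched in one go. This is only an illustration under stated assumptions: a = 1 throughout, and the exponents 0.5, 1, 2 and –1 are our own choices to represent the four cases, not necessarily the values behind the original figure. The sequence starts just above zero so that the negative power stays finite.

```r
# Sketch: the four non-trivial power-law shapes y = a * x^b with a = 1.
# The exponents below are illustrative assumptions, one per case.
x <- seq(0.01, 1, 0.01)   # start above 0 so that x^-1 stays finite
b_values <- c("0<b<1" = 0.5, "b=1" = 1, "b>1" = 2, "b<0" = -1)

op <- par(mfrow = c(2, 2))
for (label in names(b_values)) {
  y <- x^b_values[[label]]
  plot(x, y, type = "l", main = label)
}
par(op)   # restore the previous graphics settings
```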

These functions are useful in a wide range of disciplines. The parameters a and b are easy to estimate from data because the function is linearized by a log–log transformation,

$$\log(y) = \log(ax^{b}) = \log(a) + b\log(x),$$

so that on log–log axes the intercept is log(a) and the slope is b. These are often called allometric relationships because when b ≠ 1 the proportion of x that becomes y varies with x.

An important empirical relationship from ecological entomology that has applications in a wide range of statistical analysis is known as Taylor's power law. It has to do with the relationship between the variance and the mean of a sample. In elementary statistical models, the variance is assumed to be constant (i.e. the variance does not depend upon the mean). In field data, however, Taylor found that variance increased with the mean according to a power law, such that on log–log axes the data from most systems fell above a line through the origin with slope = 1 (the pattern shown by data that are Poisson distributed, where the variance is equal to the mean) and below a line through the origin with a slope of 2. Taylor's power law states that, for a particular system:

- log(variance) is a linear function of log(mean);
- the scatter about this straight line is small;
- the slope of the regression of log(variance) against log(mean) is greater than 1 and less than 2;
- the parameter values of the log–log regression are fundamental characteristics of the system.

7.1.4 Polynomial functions

Polynomial functions are functions in which x appears several times, each time raised to a different power. They are useful for describing curves with humps, inflections or local maxima like these:

[Figure: four panels, "decelerating", "humped", "inflection" and "local maximum", showing y against x.]

The top left-hand panel shows a decelerating positive function, modelled by the quadratic

    x <- seq(0,10,0.1)
    y1 <- 2+5*x-0.2*x^2

Making the negative coefficient of the $x^2$ term larger produces a curve with a hump as in the top right-hand panel:

    y2 <- 2+5*x-0.4*x^2

Cubic polynomials can show points of inflection, as in the lower left-hand panel:

    y3 <- 2+4*x-0.6*x^2+0.04*x^3

Finally, polynomials containing powers of 4 are capable of producing curves with local maxima, as in the lower right-hand panel:

    y4 <- 2+4*x+2*x^2-0.6*x^3+0.04*x^4
    par(mfrow=c(2,2))
    plot(x,y1,type="l",ylab="y",main="decelerating")
    plot(x,y2,type="l",ylab="y",main="humped")
    plot(x,y3,type="l",ylab="y",main="inflection")
    plot(x,y4,type="l",ylab="y",main="local maximum")

Inverse polynomials are an important class of functions which are suitable for setting up generalized linear models with gamma errors and inverse link functions:

$$\frac{1}{y} = a + bx + cx^{2} + dx^{3} + \dots + zx^{n}.$$

Various shapes of function are produced, depending on the order of the polynomial (the maximum power) and the signs of the parameters:

    par(mfrow=c(2,2))
    y1 <- x/(2+5*x)
    y2 <- 1/(x-2+4/x)
    y3 <- 1/(x^2-2+4/x)
    plot(x,y1,type="l",ylab="y",main="Michaelis-Menten")
    plot(x,y2,type="l",ylab="y",main="shallow hump")
    plot(x,y3,type="l",ylab="y",main="steep hump")

[Figure: three panels, "Michaelis-Menten", "shallow hump" and "steep hump", showing y against x.]

There are two ways of parameterizing the Michaelis–Menten equation:

$$y = \frac{ax}{1 + bx} \quad\text{and}\quad y = \frac{x}{c + dx}.$$

In the first case, the asymptotic value of y is a/b and in the second it is 1/d.

7.1.5 Gamma function

The gamma function Γ(t) is an extension of the factorial function, t!, to positive real numbers:

$$\Gamma(t) = \int_{0}^{\infty} x^{t-1} e^{-x}\,dx.$$

It looks like this:

    par(mfrow=c(1,1))
    t <- seq(0.2,4,0.01)
    plot(t,gamma(t),type="l")
    abline(h=1,lty=2)

[Figure: gamma(t) against t, with a dashed horizontal line at 1.]

Note that Γ(t) is equal to 1 at both t = 1 and t = 2. For integer values of t, Γ(t + 1) = t!.

7.1.6 Asymptotic functions

Much the most commonly used asymptotic function is

$$y = \frac{ax}{1 + bx},$$

which has a different name in almost every scientific discipline. For example, in biochemistry it is called Michaelis–Menten, and shows reaction rate as a function of substrate concentration; in ecology it is called Holling's disc equation and shows predator feeding rate as a function of prey density. The graph passes through the origin and rises with diminishing returns to an asymptotic value at which increasing the value of x does not lead to any further increase in y.

The other common function is the asymptotic exponential

$$y = a(1 - e^{-bx}).$$

This, too, is a two-parameter model, and in many cases the two functions would describe data equally well (see p. 719 for an example of this comparison).

Let us work out the behaviour at the limits of our two asymptotic functions, starting with the asymptotic exponential. For x = 0 we have

$$y = a(1 - e^{-b \times 0}) = a(1 - e^{0}) = a(1 - 1) = a \times 0 = 0,$$

so the graph goes through the origin. At the other extreme, for x = ∞, we have

$$y = a(1 - e^{-b \times \infty}) = a(1 - e^{-\infty}) = a(1 - 0) = a(1) = a,$$

which demonstrates that the relationship is asymptotic, and that the asymptotic value of y is a.
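The two limits just derived are easy to confirm numerically, because R evaluates exp(-Inf) as 0. In this minimal sketch the parameter values a = 10 and b = 0.5 and the function name `asym_exp` are assumptions chosen purely for illustration:

```r
# Illustrative parameter values (assumptions, not taken from the text)
a <- 10
b <- 0.5

# The asymptotic exponential y = a * (1 - exp(-b * x))
asym_exp <- function(x) a * (1 - exp(-b * x))

asym_exp(0)     # the curve passes through the origin
asym_exp(Inf)   # exp(-Inf) is 0, so the result is the asymptote a

# A quick picture of the approach to the asymptote
curve(asym_exp, 0, 20, ylab = "y")
abline(h = a, lty = 2)
```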

Turning to the Michaelis–Menten equation, for x = 0 the limit is easy:

$$y = \frac{a \times 0}{1 + b \times 0} = \frac{0}{1 + 0} = \frac{0}{1} = 0.$$

However, determining the behaviour at the limit x = ∞ is somewhat more difficult, because we end up with y = ∞/(1 + ∞) = ∞/∞, which you might imagine is always going to be 1 no matter what the values of a and b. In fact, there is a special mathematical rule for this case, called l'Hospital's rule: when you get a ratio of infinity to infinity, you work out the ratio of the derivatives to obtain the behaviour at the limit. The numerator is ax so its derivative with respect to x is a. The denominator is 1 + bx so its derivative with respect to x is 0 + b = b. So the ratio of the derivatives is a/b, and this is the asymptotic value of the Michaelis–Menten equation.

7.1.7 Parameter estimation in asymptotic functions

There is no way of linearizing the asymptotic exponential model, so we must resort to non-linear least squares (nls) to estimate parameter values for it (p. 715). One of the advantages of the Michaelis–Menten function is that it is easy to linearize. We use the reciprocal transformation

$$\frac{1}{y} = \frac{1 + bx}{ax}.$$

At first glance, this is no great help. But we can separate the terms on the right because they have a common denominator. Then we can cancel the xs, like this:

$$\frac{1}{y} = \frac{1}{ax} + \frac{bx}{ax} = \frac{1}{ax} + \frac{b}{a},$$

so if we put Y = 1/y, X = 1/x, A = 1/a, and C = b/a, we see that

$$Y = AX + C,$$

which is linear: C is the intercept and A is the slope. So to estimate the values of a and b from data, we would transform both x and y to reciprocals, plot a graph of 1/y against 1/x, carry out a linear regression, then back-transform, to get:

$$a = \frac{1}{A}, \qquad b = aC.$$

Suppose that we knew that the graph passed through the two points (0.2, 44.44) and (0.6, 70.59). How do we work out the values of the parameters a and b? First, we calculate the four reciprocals. The slope of the linearized function, A, is the change in 1/y divided by the change in 1/x:

    (1/44.44 - 1/70.59)/(1/0.2 - 1/0.6)
    [1] 0.002500781

so a = 1/A = 1/0.0025 = 400. Now we rearrange the equation and use one of the points (say x = 0.2, y = 44.44) to get the value of b:

$$b = \frac{1}{x}\left(\frac{ax}{y} - 1\right) = \frac{1}{0.2}\left(\frac{400 \times 0.2}{44.44} - 1\right) = 4.$$

7.1.8 Sigmoid (S-shaped) functions

The simplest S-shaped function is the two-parameter logistic where, for 0 ≤ y ≤ 1,

$$y = \frac{e^{a + bx}}{1 + e^{a + bx}},$$

which is central to the fitting of generalized linear models for proportion data (Chapter 16).

The three-parameter logistic function allows y to vary on any scale:

$$y = \frac{a}{1 + b e^{-cx}}.$$

The intercept is a/(1 + b), the asymptotic value is a and the initial slope is measured by c. Here is the curve with parameters 100, 90 and 1.0:

    par(mfrow=c(2,2))
    x <- seq(0,10,0.1)
    y <- 100/(1+90*exp(-1*x))
    plot(x,y,type="l",main="three-parameter logistic")

The four-parameter logistic function has asymptotes at the left- (a) and right-hand (b) ends of the x axis and scales (c) the response to x about the midpoint (d) where the curve has its inflexion:

$$y = a + \frac{b - a}{1 + e^{c(d - x)}}.$$

Letting a = 20, b = 120, c = 0.8 and d = 3, the function

$$y = 20 + \frac{100}{1 + e^{0.8 \times (3 - x)}}$$

looks like this:

    y <- 20+100/(1+exp(0.8*(3-x)))
    plot(x,y,ylim=c(0,140),type="l",main="four-parameter logistic")

Negative sigmoid curves have the parameter c < 0, as for the function

$$y = 20 + \frac{100}{1 + e^{-0.8 \times (3 - x)}}.$$
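To see the effect of the sign of c, the positive and negative versions can be drawn on the same axes. This is a sketch only: the helper name `logistic4` is our own, and the parameter values are those used in the text. Both curves pass through the same midpoint, y = a + (b – a)/2 = 70 at x = d = 3; with c < 0 the roles of the two asymptotes are swapped, so the curve falls from b towards a.

```r
# Four-parameter logistic; the argument names follow the symbols in the text
logistic4 <- function(x, a, b, c, d) a + (b - a) / (1 + exp(c * (d - x)))

x <- seq(0, 10, 0.1)
y_pos <- logistic4(x, a = 20, b = 120, c = 0.8, d = 3)    # rising curve
y_neg <- logistic4(x, a = 20, b = 120, c = -0.8, d = 3)   # falling curve

plot(x, y_pos, ylim = c(0, 140), type = "l", ylab = "y",
     main = "positive and negative sigmoids")
lines(x, y_neg, lty = 2)
abline(v = 3, lty = 3)   # the shared midpoint d

# Value at the midpoint, halfway between the two asymptotes
logistic4(3, a = 20, b = 120, c = 0.8, d = 3)
```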

An asymmetric S-shaped curve much used in demography and life insurance work is the Gompertz growth model,

$$y = a e^{b e^{cx}}.$$

The shape of the function depends on the signs of the parameters b and c. For a negative sigmoid, b is negative (here –1) and c is positive (here +0.02):

    x <- -200:100
    y <- 100*exp(-exp(0.02*x))
    plot(x,y,type="l",main="negative Gompertz")

For a positive sigmoid both parameters are negative:

    x <- 0:100
    y <- 50*exp(-5*exp(-0.08*x))
    plot(x,y,type="l",main="positive Gompertz")

[Figure: four panels, "three-parameter logistic", "four-parameter logistic", "negative Gompertz" and "positive Gompertz", showing y against x.]

7.1.9 Biexponential model

This is a useful four-parameter non-linear function, which is the sum of two exponential functions of x:

$$y = a e^{bx} + c e^{dx}.$$

Various shapes depend upon the signs of the parameters b, c and d (a is assumed to be positive): the upper left-hand panel shows c positive, b and d negative (it is the sum of two exponential decay curves, so the fast decomposing material disappears first, then the slow, to produce two different phases); the upper right-hand panel shows c and d positive, b negative (this produces an asymmetric U-shaped curve); the lower left-hand panel shows c negative, b and d positive (this can, but does not always, produce a curve with a hump); and the lower right panel shows b and c positive, d negative. When b, c and d are all negative (not illustrated), the function is known as the first-order compartment model in which a drug administered at time 0 passes through the system with its dynamics affected by three physiological processes: elimination, absorption and clearance.

[Figure: four panels titled "+ - + -", "+ - + +", "+ + - +" and "+ + + -", showing y against x.]

    # x values matching the 0 to 10 range of the panels
    x <- seq(0,10,0.1)

    # 1: c positive, b and d negative
    a <- 10
    b <- -0.8
    c <- 10
    d <- -0.05
    y <- a*exp(b*x)+c*exp(d*x)
    plot(x,y,main="+ - + -",type="l")

    # 2: c and d positive, b negative
    a <- 10
    b <- -0.8
    c <- 10
    d <- 0.05
    y <- a*exp(b*x)+c*exp(d*x)
    plot(x,y,main="+ - + +",type="l")

    # 3: c negative, b and d positive
    a <- 200
    b <- 0.2
    c <- -1
    d <- 0.7
    y <- a*exp(b*x)+c*exp(d*x)
    plot(x,y,main="+ + - +",type="l")

    # 4: b and c positive, d negative
    a <- 200
    b <- 0.05
    c <- 300
    d <- -0.5
    y <- a*exp(b*x)+c*exp(d*x)
    plot(x,y,main="+ + + -",type="l")

7.1.10 Transformations of the response and explanatory variables

We have seen the use of transformation to linearize the relationship between the response and the explanatory variables:

- log(y) against x for exponential relationships;
- log(y) against log(x) for power functions;
- exp(y) against x for logarithmic relationships;
- 1/y against 1/x for asymptotic relationships;
- log(p/(1 – p)) against x for proportion data.

Other transformations are useful for variance stabilization:

- $\sqrt{y}$ to stabilize the variance for count data;
- arcsin(y) to stabilize the variance of percentage data.
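As an illustration of the fourth transformation in the first list, here is a minimal sketch of the reciprocal linearization of the Michaelis–Menten function using lm. The data are generated without noise from y = ax/(1 + bx) with a = 400 and b = 4 (the values from the worked example in Section 7.1.7), so the back-transformed estimates should recover them; the x values and the variable names are assumptions made for the example. With real, noisy data the reciprocal transformation distorts the error structure, so treat this as a demonstration of the algebra rather than a recommended fitting method.

```r
# Noise-free data from a Michaelis-Menten curve (a = 400, b = 4)
a_true <- 400
b_true <- 4
x <- seq(0.2, 2, 0.2)                 # illustrative x values
y <- a_true * x / (1 + b_true * x)

# Linearized form: 1/y = A * (1/x) + C, with A = 1/a and C = b/a
fit <- lm(I(1/y) ~ I(1/x))
C <- coef(fit)[[1]]                   # intercept
A <- coef(fit)[[2]]                   # slope

# Back-transform to the original parameters
a_hat <- 1 / A
b_hat <- a_hat * C
c(a = a_hat, b = b_hat)
```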

7.2 Probability functions

There are many specific probability distributions in R (normal, Poisson, binomial, etc.), and these are discussed in detail later. Here we look at the base mathematical functions that deal with elementary probability. The factorial function gives the number of permutations of n items. How many ways can four items be arranged? The first position could have any one of the 4 items in it, but by the time we get to choosing the second item we shall already have specified the first item so there are just 4 – 1 = 3 ways of choosing the second item. There are only 4 – 2 = 2 ways of choosing the third item, and by the time we get to the last item we have no degrees of freedom at all: the last number must be the one item out of four that we have not used in positions 1, 2 or 3. So with 4 items the answer is 4 × (4 – 1) × (4 – 2) × (4 – 3) which is 4 × 3 × 2 × 1 = 24. In general, factorial(n) is given by

$$n! = n(n-1)(n-2) \cdots 3 \times 2.$$

The R function is factorial and we can plot it for values of x from 0 to 6 using the step option type="s" in plot, with a logarithmic scale on the y axis, log="y":

    par(mfrow=c(1,1))
    x <- 0:6
    plot(x,factorial(x),type="s",main="factorial x",log="y")

[Figure: "factorial x", a step plot of factorial(x) against x on a logarithmic y axis.]

The other important base function for probability calculations in R is the choose function which calculates binomial coefficients. These show the number of ways there are of selecting x items out of n items when the item can be one of just two types (e.g. either male or female, black or white, solvent or insolvent). Suppose we have 8 individuals and we want to know how many ways there are that 3 of them could be males (and

hence 5 of them females). The answer is given by

$$\binom{n}{x} = \frac{n!}{x!(n-x)!},$$

so with n = 8 and x = 3 we get

$$\binom{n}{x} = \frac{8!}{3!(8-3)!} = \frac{8 \times 7 \times 6}{3 \times 2} = 56,$$

and in R

    choose(8,3)
    [1] 56

Obviously there is only one way that all 8 individuals could be male or female, so there is only one way of getting 0 or 8 'successes'. One male could be the first individual you select, or the second, or the third, and so on. So there are 8 ways of selecting 1 out of 8. By the same reasoning, there must be 8 ways of selecting 7 males out of 8 individuals (the lone female could be in any one of the 8 positions). The following is a graph of the number of ways of selecting from 0 to 8 males out of 8 individuals:

    plot(0:8,choose(8,0:8),type="s",main="binomial coefficients")

[Figure: "binomial coefficients", a step plot of choose(8, 0:8) against 0:8.]

7.3 Continuous probability distributions

R has a wide range of built-in probability distributions, for each of which four functions are available: the probability density function (which has a d prefix); the cumulative probability (p); the quantiles of the distribution (q); and random numbers generated from the distribution (r). Each letter can be prefixed to the R function names in Table 7.1 (e.g. dbeta).
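As a brief illustration of the four prefixes, here they are applied to the normal distribution (norm). The particular argument values are arbitrary choices made for the example, and the seed is set only so that the random numbers are reproducible:

```r
# d: probability density at z = 0 for the standard normal
dnorm(0)

# p: cumulative probability of a value less than or equal to 1.96
p <- pnorm(1.96)
p

# q: the quantile function is the inverse of p, so this returns 1.96
qnorm(p)

# r: five random numbers from a normal with mean 170 and sd 8
set.seed(1)
r <- rnorm(5, mean = 170, sd = 8)
r
```

The same pattern applies to every distribution in the table: swap norm for the relevant function name and supply that distribution's parameters.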

Table 7.1. The probability distributions supported by R. The meanings of the parameters are explained in the text.

| R function | Distribution | Parameters |
|---|---|---|
| beta | beta | shape1, shape2 |
| binom | binomial | sample size, probability |
| cauchy | Cauchy | location, scale |
| exp | exponential | rate (optional) |
| chisq | chi-squared | degrees of freedom |
| f | Fisher's F | df1, df2 |
| gamma | gamma | shape |
| geom | geometric | probability |
| hyper | hypergeometric | m, n, k |
| lnorm | lognormal | mean, standard deviation |
| logis | logistic | location, scale |
| nbinom | negative binomial | size, probability |
| norm | normal | mean, standard deviation |
| pois | Poisson | mean |
| signrank | Wilcoxon signed rank statistic | sample size n |
| t | Student's t | degrees of freedom |
| unif | uniform | minimum, maximum (opt.) |
| weibull | Weibull | shape |
| wilcox | Wilcoxon rank sum | m, n |

The cumulative probability function is a straightforward notion: it is an S-shaped curve showing, for any value of x, the probability of obtaining a sample value that is less than or equal to x. Here is what it looks like for the normal distribution:

    curve(pnorm(x),-3,3)
    arrows(-1,0,-1,pnorm(-1),col="red")
    arrows(-1,pnorm(-1),-3,pnorm(-1),col="green")

[Figure: pnorm(x) against x from –3 to 3, with a red arrow rising from x = –1 to the curve and a green arrow running from the curve to the y axis.]

The value of x (–1) leads up to the cumulative probability (red arrow) and the probability associated with obtaining a value of this size (–1) or smaller is on the y axis (green arrow). The value on the y axis is 0.1586553:

    pnorm(-1)
    [1] 0.1586553

The probability density is the slope of this curve (its derivative). You can see at once that the slope is never negative. The slope starts out very shallow up to about x = –2, increases up to a peak (at x = 0 in this example) then gets shallower, and becomes very small indeed above about x = 2. Here is what the density function of the normal (dnorm) looks like:

    curve(dnorm(x),-3,3)

[Figure: dnorm(x) against x from –3 to 3.]

For a discrete random variable, like the Poisson or the binomial, the probability density function is straightforward: it is simply a histogram with the y axis scaled as probabilities rather than counts, and the discrete values of x (0, 1, 2, 3, ...) on the horizontal axis. But for a continuous random variable, the definition of the probability density function is more subtle: it does not have probabilities on the y axis, but rather the derivative (the slope) of the cumulative probability function at a given value of x.

7.3.1 Normal distribution

This distribution is central to the theory of parametric statistics. Consider the following simple exponential function:

$$y = \exp(-|x|^{m}).$$

As the power (m) in the exponent increases, the function becomes more and more like a step function. The following panels show the relationship between y and x for m = 1, 2, 3 and 8, respectively:

    par(mfrow=c(2,2))
    x <- seq(-3,3,0.01)
    y <- exp(-abs(x))
    plot(x,y,type="l",main="x")
    y <- exp(-abs(x)^2)
    plot(x,y,type="l",main="x^2")
    y <- exp(-abs(x)^3)
    plot(x,y,type="l",main="x^3")
    y <- exp(-abs(x)^8)
    plot(x,y,type="l",main="x^8")

[Figure: four panels titled "x", "x^2", "x^3" and "x^8", showing y against x from –3 to 3.]

The second of these panels (top right), where $y = \exp(-x^2)$, is the basis of an extremely important and famous probability density function. Once it has been scaled, so that the integral (the area under the curve from –∞ to +∞) is unity, this is the normal distribution. Unfortunately, the scaling constants are rather cumbersome. When the distribution has mean 0 and standard deviation 1 (the standard normal distribution) the equation becomes:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}.$$
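The effect of the scaling constant can be checked numerically with R's integrate function. In this sketch the function names `unscaled` and `f` are our own; the unscaled curve exp(–x²) has an area of √π rather than 1, while the scaled version integrates to unity and agrees with the built-in dnorm:

```r
# Area under the unscaled curve exp(-x^2): sqrt(pi), not 1
unscaled <- function(x) exp(-x^2)
area_unscaled <- integrate(unscaled, -Inf, Inf)$value
area_unscaled

# The standard normal density written out from the equation above
f <- function(z) exp(-z^2 / 2) / sqrt(2 * pi)
area_scaled <- integrate(f, -Inf, Inf)$value
area_scaled

# The hand-written density agrees with R's built-in dnorm
all.equal(f(c(-1, 0, 2)), dnorm(c(-1, 0, 2)))
```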

Suppose we have measured the heights of 100 people. The mean height was 170 cm and the standard deviation was 8 cm. We can ask three sorts of questions about data like these: what is the probability that a randomly selected individual will be:

- shorter than a particular height?
- taller than a particular height?
- between one specified height and another?

The area under the whole curve is exactly 1; everybody has a height between minus infinity and plus infinity. True, but not particularly helpful. Suppose we want to know the probability that one of our people, selected at random from the group, will be less than 160 cm tall. We need to convert this height into a value of z; that is to say, we need to convert 160 cm into a number of standard deviations from the mean. What do we know about the standard normal distribution? It has a mean of 0 and a standard deviation of 1. So we can convert any value y, from a distribution with mean $\bar{y}$ and standard deviation s, very simply by calculating

$$z = \frac{y - \bar{y}}{s}.$$

So we convert 160 cm into a number of standard deviations. It is less than the mean height (170 cm) so its value will be negative:

$$z = \frac{160 - 170}{8} = -1.25.$$

Now we need to find the probability of a value of the standard normal taking a value of –1.25 or smaller. This is the area under the left-hand tail (the integral) of the density function. The function we need for this is pnorm: we provide it with a value of z (or, more generally, with a quantile) and it provides us with the probability we want:

    pnorm(-1.25)
    [1] 0.1056498

So the answer to our first question (the shaded area, top left) is just over 10%.

Next, what is the probability of selecting one of our people and finding that they are taller than 185 cm (top right)? The first two parts of the exercise are exactly the same as before. First we convert our value of 185 cm into a number of standard deviations:

$$z = \frac{185 - 170}{8} = 1.875.$$

Then we ask what probability is associated with this, using pnorm:

    pnorm(1.875)
    [1] 0.9696036

But this is the answer to a different question. This is the probability that someone will be less than or equal to 185 cm tall (that is what the function pnorm has been written to provide). All we need to do is to work out the complement of this:

    1-pnorm(1.875)
    [1] 0.03039636

So the answer to the second question is about 3%.

Finally, we might want to know the probability of selecting a person between 165 cm and 180 cm. We have a bit more work to do here, because we need to calculate two z values:

$$z_1 = \frac{165 - 170}{8} = -0.625 \quad\text{and}\quad z_2 = \frac{180 - 170}{8} = 1.25.$$

The important point to grasp is this: we want the probability of selecting a person between these two z values, so we subtract the smaller probability from the larger probability:

    pnorm(1.25)-pnorm(-0.625)
    [1] 0.6283647

Thus we have a 63% chance of selecting a medium-sized person (taller than 165 cm and shorter than 180 cm) from this sample with a mean height of 170 cm and a standard deviation of 8 cm (bottom left).

[Figure: three panels showing probability density against height, with the area corresponding to each of the three questions shaded.]
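The same three answers can be obtained without converting to z by hand, because pnorm accepts the mean and standard deviation as optional arguments. A minimal sketch (the variable names are our own, and lower.tail = FALSE is used as an alternative to subtracting from 1):

```r
mean_height <- 170
sd_height <- 8

# Shorter than 160 cm
p_short <- pnorm(160, mean = mean_height, sd = sd_height)

# Taller than 185 cm: the upper tail, requested directly
p_tall <- pnorm(185, mean = mean_height, sd = sd_height, lower.tail = FALSE)

# Between 165 cm and 180 cm: larger cumulative probability minus smaller
p_mid <- pnorm(180, mean_height, sd_height) - pnorm(165, mean_height, sd_height)

round(c(shorter = p_short, taller = p_tall, between = p_mid), 4)
```

These agree with the values obtained from the z conversions, since pnorm carries out the same standardization internally.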

The trick with curved polygons like these is to finish off their closure properly. In the bottom left-hand panel, for instance, we want to return to the x axis at 180 cm then draw straight along the x axis to 165 cm. We do this by concatenating two extra points on the end of the vectors of z and p coordinates:

    # axis labels convert z = -3, ..., 3 to heights of 170 + 8*z cm
    x <- seq(-3,3,0.01)

    # shorter than 160 cm (z < -1.25)
    z <- seq(-3,-1.25,0.01)
    p <- dnorm(z)
    z <- c(z,-1.25,-3)
    p <- c(p,min(p),min(p))
    plot(x,dnorm(x),type="l",xaxt="n",ylab="probability density",xlab="height")
    axis(1,at=-3:3,labels=c("146","154","162","170","178","186","194"))
    polygon(z,p,col="red")

    # taller than 185 cm (z > 1.875)
    z <- seq(1.875,3,0.01)
    p <- dnorm(z)
    z <- c(z,3,1.875)
    p <- c(p,min(p),min(p))
    plot(x,dnorm(x),type="l",xaxt="n",ylab="probability density",xlab="height")
    axis(1,at=-3:3,labels=c("146","154","162","170","178","186","194"))
    polygon(z,p,col="red")

    # between 165 cm and 180 cm (-0.625 < z < 1.25)
    z <- seq(-0.625,1.25,0.01)
    p <- dnorm(z)
    z <- c(z,1.25,-0.625)
    p <- c(p,0,0)
    plot(x,dnorm(x),type="l",xaxt="n",ylab="probability density",xlab="height")
    axis(1,at=-3:3,labels=c("146","154","162","170","178","186","194"))
    polygon(z,p,col="red")

7.3.2 The central limit theorem

If you take repeated samples from a population with finite variance and calculate their averages, then the averages will be approximately normally distributed, with the approximation improving as the sample size increases. This is called the central limit theorem. Let us demonstrate it for ourselves. We can take five uniformly distributed random numbers between 0 and 10 and work out the average. The average will be low when we get, say, 2,3,1,2,1 and high when we get 9,8,9,6,8. Typically, of course, the average will be close to 5. Let us do this 10 000 times and look at the distribution of the 10 000 means. The data are rectangularly (uniformly) distributed on the interval 0 to 10, so the distribution of the raw data should be flat-topped:

    par(mfrow=c(1,1))
    hist(runif(10000)*10,main="")

[Figure: histogram of runif(10000)*10, showing frequency against values from 0 to 10.]

What about the distribution of sample means, based on taking just five uniformly distributed random numbers?

    means <- numeric(10000)
    for (i in 1:10000) {
      means[i] <- mean(runif(5)*10)
    }
    hist(means,ylim=c(0,1600),main="")

Nice, but how close is this to a normal distribution? One test is to draw a normal distribution with the same parameters on top of the histogram. But what are these parameters? The normal is a two-parameter distribution that is characterized by its mean and its standard deviation. We can estimate these two parameters from our sample of 10 000 means (your values will be slightly different because of the randomization):

    mean(means)
    [1] 4.998581
    sd(means)
    [1] 1.289960

Now we use these two parameters in the probability density function of the normal distribution (dnorm) to create a normal curve with our particular mean and standard deviation. To draw the smooth line of the normal curve, we need to generate a series of values for the x axis; inspection of the histogram suggests that sensible limits would be from 0 to 10 (the limits we chose for our uniformly distributed random numbers). A good rule of thumb is that for a smooth curve you need at least 100 values, so let us try this:

    xv <- seq(0,10,0.1)

There is just one thing left to do. The probability density function has an integral of 1.0 (that is the area beneath the normal curve), but we had 10 000 samples. How we scale the normal probability density function to our particular case depends on the height of the highest bar (about 1500 in this case). The height, in turn, depends on the chosen bin widths; if we doubled the width of the bin there would be roughly twice

as many numbers in the bin and the bar would be twice as high on the y axis. To get the height of the bars on our frequency scale, therefore, we multiply the total frequency, 10 000, by the bin width, 0.5, to get 5000. We multiply 5000 by the probability density to get the height of the curve. Finally, we use lines to overlay the smooth curve on our histogram:

    yv <- dnorm(xv,mean=4.998581,sd=1.28996)*5000
    lines(xv,yv)

[Figure: histogram of means with the scaled normal curve overlaid.]

The fit is excellent. The central limit theorem really works. Almost any distribution, even a 'badly behaved' one like the uniform distribution we worked with here, will produce an approximately normal distribution of sample means taken from it.

A simple example of the operation of the central limit theorem involves the use of dice. Throw one die lots of times and each of the six numbers should come up equally often: this is an example of a uniform distribution:

    par(mfrow=c(2,2))
    hist(sample(1:6,replace=TRUE,10000),breaks=0.5:6.5,main="",xlab="one die")

Now throw two dice and add the scores together: this is the ancient game of craps. There are 11 possible scores from a minimum of 2 to a maximum of 12. The most likely score is 7 because there are six ways that this could come about:

1,6; 6,1; 2,5; 5,2; 3,4; 4,3

For many throws of craps we get a triangular distribution of scores, centred on 7:

    a <- sample(1:6,replace=TRUE,10000)
    b <- sample(1:6,replace=TRUE,10000)
    hist(a+b,breaks=1.5:12.5,main="",xlab="two dice")

There is already a clear indication of central tendency and spread. For three dice we get

    c <- sample(1:6,replace=TRUE,10000)
    hist(a+b+c,breaks=2.5:18.5,main="",xlab="three dice")

and the bell shape of the normal distribution is starting to emerge. By the time we get to five dice, the distribution of the total score is virtually indistinguishable from the normal:

    d <- sample(1:6,replace=TRUE,10000)
    e <- sample(1:6,replace=TRUE,10000)
    hist(a+b+c+d+e,breaks=4.5:30.5,main="",xlab="five dice")

[Figure: four histograms of frequency against total score, for "one die", "two dice", "three dice" and "five dice".]

The smooth curve is given by a normal distribution with the same mean and standard deviation:

    mean(a+b+c+d+e)
    [1] 17.5937
    sd(a+b+c+d+e)
    [1] 3.837668
    lines(seq(1,30,0.1),dnorm(seq(1,30,0.1),17.5937,3.837668)*10000)
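The simulated mean and standard deviation can be compared with their theoretical values, which follow from the fact that the means and variances of independent dice add. This is a sketch in which the variable names are our own:

```r
faces <- 1:6

# Mean and (population) variance of the score on a single fair die
die_mean <- mean(faces)                    # 3.5
die_var <- mean((faces - die_mean)^2)      # 35/12

# For the total of five independent dice, means and variances add
n_dice <- 5
total_mean <- n_dice * die_mean
total_sd <- sqrt(n_dice * die_var)

c(mean = total_mean, sd = total_sd)
```

The theoretical mean is 17.5 and the theoretical standard deviation is about 3.82, both close to the simulated values of 17.5937 and 3.837668 reported above.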

7.3.3 Maximum likelihood with the normal distribution

The probability density of the normal is

$$f(y \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y - \mu)^{2}}{2\sigma^{2}}\right],$$

which is read as saying the probability density for a data value y, given (|) a mean of μ and a variance of σ², is calculated from this rather complicated-looking two-parameter exponential function. For any given combination of μ and σ², it gives a non-negative value (a density rather than a probability, so it can exceed 1 when σ is small). Recall that likelihood is the product of the probability densities, for each of the values of the response variable, y. So if we have n values of y in our experiment, the likelihood function is

$$L(\mu, \sigma) = \prod_{i=1}^{n} \left( \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y_i - \mu)^{2}}{2\sigma^{2}}\right] \right),$$

where the only change is that y has been replaced by $y_i$ and we multiply together the probability densities for each of the n data points. There is a little bit of algebra we can do to simplify this: we can get rid of the product operator, ∏, in two steps. First, the constant term, multiplied by itself n times, can just be written as $1/(\sigma\sqrt{2\pi})^{n}$. Second, remember that the product of a set of antilogs (exp) can be written as the antilog of a sum of the values of $x_i$ like this: $\prod \exp(x_i) = \exp\left(\sum x_i\right)$. This means that the product of the right-hand part of the expression can be written as

$$\exp\left[-\frac{\sum_{i=1}^{n} (y_i - \mu)^{2}}{2\sigma^{2}}\right],$$

so we can rewrite the likelihood of the normal distribution as

$$L(\mu, \sigma) = \frac{1}{(\sigma\sqrt{2\pi})^{n}} \exp\left[-\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (y_i - \mu)^{2}\right].$$

The two parameters μ and σ are unknown, and the purpose of the exercise is to use statistical modelling to determine their maximum likelihood values from the data (the n different values of y). So how do we find the values of μ and σ that maximize this likelihood? The answer involves calculus: first we find the derivative of the function with respect to the parameters, then set it to zero, and solve.
It turns out that because of the exp function in the equation, it is easier to work out the log of the likelihood, ∑ n 2 2 ) − n log( σ ) − log(2 ( y π − μ ) , / 2 σ μ, σ ) =− l ( i 2 l μ, σ ) = and maximize this instead. Obviously, the values of the parameters that maximize the log-likelihood ( L ( log( )) will be the same as those that maximize the likelihood. From now on, we shall assume that μ, σ summation is over the index i from 1 to n . Now for the calculus. We start with the mean, μ . The derivative of the log-likelihood with respect to μ is ∑ d l 2 = y ( − μ ) /σ . i μ d

305 MATHEMATICS 283

Set the derivative to zero and solve for μ:

$$\sum (y_i-\mu)/\sigma^2 = 0 \quad\text{so}\quad \sum (y_i-\mu) = 0.$$

Taking the summation through the bracket, and noting that ∑μ = nμ,

$$\sum y_i - n\mu = 0 \quad\text{so}\quad \sum y_i = n\mu \quad\text{and}\quad \mu = \frac{\sum y_i}{n}.$$

The maximum likelihood estimate of μ is the arithmetic mean.

Next we find the derivative of the log-likelihood with respect to σ:

$$\frac{dl}{d\sigma} = -\frac{n}{\sigma} + \frac{\sum (y_i-\mu)^2}{\sigma^3},$$

recalling that the derivative of log(x) is 1/x and the derivative of –1/x² is 2/x³. Solving, we get

$$-\frac{n}{\sigma} + \frac{\sum (y_i-\mu)^2}{\sigma^3} = 0 \quad\text{so}\quad \sum (y_i-\mu)^2 = \sigma^3\left(\frac{n}{\sigma}\right) = \sigma^2 n,$$

$$\sigma^2 = \frac{\sum (y_i-\mu)^2}{n}.$$

The maximum likelihood estimate of the variance σ² is the mean squared deviation of the y values from the mean. This is a biased estimate of the variance, however, because it does not take account of the fact that we estimated the value of μ from the data. To unbias the estimate, we need to lose 1 degree of freedom to reflect this fact, and divide the sum of squares by n – 1 rather than by n (see p. 119 and restricted maximum likelihood estimators in Chapter 19).

Here, we illustrate R's built-in probability functions in the context of the normal distribution. The density function dnorm has a value of z (a quantile) as its argument. Optional arguments specify the mean and standard deviation (the default is the standard normal with mean 0 and standard deviation 1). Values of z outside the range –3.5 to +3.5 are very unlikely.

    par(mfrow=c(2,2))
    curve(dnorm,-3,3,xlab="z",ylab="Probability density",main="Density")

The probability function pnorm also has a value of z (a quantile) as its argument. Optional arguments specify the mean and standard deviation (default is the standard normal with mean 0 and standard deviation 1). It shows the cumulative probability of a value of z less than or equal to the value specified, and is an S-shaped curve:

    curve(pnorm,-3,3,xlab="z",ylab="Probability",main="Probability")

Quantiles of the normal distribution qnorm have a cumulative probability as their argument. They perform the opposite function of pnorm, returning a value of z when provided with a probability.

    curve(qnorm,0,1,xlab="p",ylab="Quantile (z)",main="Quantiles")

The normal distribution random number generator rnorm produces random real numbers from a distribution with specified mean and standard deviation. The first argument is the number of numbers that you want to be generated: here are 1000 random numbers with mean 0 and standard deviation 1:

306 284 THE R BOOK

    y <- rnorm(1000)
    hist(y,xlab="z",ylab="frequency",main="Random numbers")

[Figure: four panels titled Density (probability density against z), Probability (cumulative probability against z), Quantiles (quantile z against p) and Random numbers (histogram of frequency against z).]

The four functions (d, p, q and r) work in similar ways with all the other probability distributions.

7.3.4 Generating random numbers with exact mean and standard deviation

If you use a random number generator like rnorm then, naturally, the sample you generate will not have exactly the mean and standard deviation that you specify, and two runs will produce vectors with different means and standard deviations. Suppose we want 100 normal random numbers with a mean of exactly 24 and a standard deviation of precisely 4:

    yvals <- rnorm(100,24,4)
    mean(yvals)
    [1] 24.2958
    sd(yvals)
    [1] 3.5725

307 MATHEMATICS 285

Close, but not spot on. If you want to generate random numbers with an exact mean and standard deviation, then do the following:

    ydevs <- rnorm(100,0,1)

Now compensate for the fact that the mean is not exactly 0 and the standard deviation is not exactly 1 by expressing all the values as departures from the sample mean, scaled in units of the sample's standard deviation:

    ydevs <- (ydevs-mean(ydevs))/sd(ydevs)

Check that the mean is 0 and the standard deviation is exactly 1:

    mean(ydevs)
    [1] -2.449430e-17
    sd(ydevs)
    [1] 1

The mean is as close to 0 as makes no difference, and the standard deviation is 1. Now multiply this vector by your desired standard deviation and add your desired mean value to get a sample with exactly the mean and standard deviation required:

    yvals <- 24 + ydevs*4
    mean(yvals)
    [1] 24
    sd(yvals)
    [1] 4

7.3.5 Comparing data with a normal distribution

Various tests for normality are described on p. 346. Here we are concerned with the task of comparing a histogram of real data with a smooth normal distribution with the same mean and standard deviation, in order to look for evidence of non-normality (e.g. skew or kurtosis).

    par(mfrow=c(1,1))
    fishes <- read.table("c:\\temp\\fishes.txt",header=T)
    attach(fishes)
    names(fishes)
    [1] "mass"
    mean(mass)
    [1] 4.194275
    max(mass)
    [1] 15.53216

Now the histogram of the mass of the fish is produced, specifying integer bins that are 1 gram in width, up to a maximum of 16.5 g:

    hist(mass,breaks=-0.5:16.5,col="green",main="")

308 286 THE R BOOK

For the purposes of demonstration, we generate everything we need inside the lines function: the sequence of x values for plotting (0 to 16), and the height of the density function (the number of fish, length(mass), times the probability density for each member of this sequence, for a normal distribution with mean(mass) and standard deviation sqrt(var(mass)) as its parameters), like this:

    lines(seq(0,16,0.1),length(mass)*dnorm(seq(0,16,0.1),mean(mass),sqrt(var(mass))))

[Figure: histogram of Frequency against mass (0 to 15) with the fitted normal curve overlaid.]

The distribution of fish sizes is clearly not normal. There are far too many fishes of 3 and 4 grams, too few of 6 or 7 grams, and too many really big fish (more than 8 grams). This kind of skewed distribution is probably better described by a gamma distribution (see Section 7.3.10) than a normal distribution.

7.3.6 Other distributions used in hypothesis testing

The main distributions used in hypothesis testing are: chi-squared, for testing hypotheses involving count data; Fisher's F, in analysis of variance (ANOVA) for comparing two variances; and Student's t, in small-sample work for comparing two parameter estimates. These distributions tell us the size of the test statistic that could be expected by chance alone when nothing was happening (i.e. when the null hypothesis was true). Given the rule that a big value of the test statistic tells us that something is happening, and hence that the null hypothesis is false, these distributions define what constitutes a big value of the test statistic (its critical value).

For instance, if we are doing a chi-squared test, and our test statistic is 14.3 on 9 degrees of freedom (d.f.), we need to know whether this is a large value (meaning the null hypothesis is probably false) or a small value (meaning that the null hypothesis cannot be rejected). In the old days we would have looked up the value in chi-squared tables. We would have looked in the row labelled 9 (the degrees of freedom row) and the column headed by α = 0.05. This is the conventional value for the acceptable probability of committing a Type I error: that is to say, we allow a 1 in 20 chance of rejecting the null hypothesis when it is actually true (see p. 358). Nowadays, we just type:

    1-pchisq(14.3,9)
    [1] 0.1120467

309 MATHEMATICS 287

This indicates that 14.3 is actually a relatively small number when we have 9 d.f. We would conclude that nothing is happening, because a value of chi-squared as large as 14.3 has a greater than 11% probability of arising by chance alone when the null hypothesis is true. We would want the probability to be less than 5% before we rejected the null hypothesis. So how large would the test statistic need to be, before we would reject the null hypothesis (i.e. what is the critical value of chi-squared)? We use qchisq to answer this. Its two arguments are 1 – α and the number of degrees of freedom:

    qchisq(0.95,9)
    [1] 16.91898

So the test statistic would need to be larger than 16.92 in order for us to reject the null hypothesis when there were 9 d.f.

We could use pf and qf in an exactly analogous manner for Fisher's F. Thus, the probability of getting a variance ratio of 2.85 by chance alone when the null hypothesis is true, given that we have 8 d.f. in the numerator and 12 d.f. in the denominator, is just under 5% (i.e. the value is just large enough to allow us to reject the null hypothesis):

    1-pf(2.85,8,12)
    [1] 0.04992133

Note that with pf, degrees of freedom in the numerator (8) come first in the list of arguments, followed by d.f. in the denominator (12).

Similarly, with Student's t statistic and pt and qt. For instance, the value of t in tables for a two-tailed test at α/2 = 0.025 with 10 d.f. is

    qt(0.975,10)
    [1] 2.228139

7.3.7 The chi-squared distribution

This is perhaps the second-best known of all the statistical distributions, introduced to generations of school children in their geography lessons and comprehensively misunderstood thereafter. It is a special case of the gamma distribution (p. 293) characterized by a single parameter, the number of degrees of freedom. The mean is equal to the degrees of freedom ν ('nu', pronounced 'new'), and the variance is equal to 2ν. The density function looks like this:

$$f(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2-1} e^{-x/2},$$

where Γ is the gamma function (see p. 17). The chi-squared distribution is important because many quadratic forms follow it under the assumption that the data follow the normal distribution. In particular, the sample variance is a scaled chi-squared variable. Likelihood ratio statistics are also approximately distributed as a chi-squared (see the F distribution, below).

When the cumulative probability is used, an optional third argument can be provided to describe non-centrality. If the non-central chi-squared is the sum of the squares of ν independent normal random variables with unit variance, then the non-centrality parameter is equal to the sum of the squared means of the normal variables. Here are the cumulative probability plots for a non-centrality parameter (ncp) based on three normal means (of 1, 1.5 and 2) and another with 4 means and ncp = 10:

310 288 THE R BOOK

    windows(7,4)
    par(mfrow=c(1,2))
    x <- seq(0,30,.25)
    plot(x,pchisq(x,3,7.25),type="l",ylab="p(x)",xlab="x")
    plot(x,pchisq(x,4,10),type="l",ylab="p(x)",xlab="x")

[Figure: two cumulative probability curves, p(x) against x from 0 to 30.]

The cumulative probability on the left has 3 d.f. and non-centrality parameter 1² + 1.5² + 2² = 7.25, while the distribution on the right has 4 d.f. and non-centrality parameter 10 (note the longer left-hand tail at low probabilities).

Chi-squared is also used to establish confidence intervals for sample variances. The quantity

$$\frac{(n-1)s^2}{\sigma^2}$$

is the degrees of freedom (n – 1) multiplied by the ratio of the sample variance s² to the unknown population variance σ². This follows a chi-squared distribution, so we can establish a 95% confidence interval for σ² as follows:

$$\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{\alpha/2}}.$$

Suppose the sample variance is s² = 10.2 on 8 d.f. Then the interval on σ² is given by

    8*10.2/qchisq(.975,8)
    [1] 4.65367
    8*10.2/qchisq(.025,8)
    [1] 37.43582

which means that we can be 95% confident that the population variance lies in the range 4.65 ≤ σ² ≤ 37.44.
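The interval calculation can be wrapped in a small function so that it can be reused for other sample variances and confidence levels. This is a minimal sketch; the name variance.ci and its arguments are our own and are not part of base R.

```r
# Hypothetical helper: confidence interval for a population variance,
# given the sample variance s2 and its degrees of freedom.
variance.ci <- function(s2, df, level = 0.95) {
  alpha <- 1 - level
  c(lower = df * s2 / qchisq(1 - alpha / 2, df),   # divide by the upper quantile
    upper = df * s2 / qchisq(alpha / 2, df))       # divide by the lower quantile
}

ci <- variance.ci(10.2, 8)   # the example above: s2 = 10.2 on 8 d.f.
ci
```

Note that the interval is not symmetric about the sample variance, because the chi-squared distribution is skewed.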

311 MATHEMATICS 289

7.3.8 Fisher's F distribution

This is the famous variance ratio test that occupies the penultimate column of every ANOVA table. The ratio of treatment variance to error variance follows the F distribution, and you will often want to use the quantile qf to look up critical values of F. You specify, in order, the probability of your one-tailed test (this will usually be 0.95), then the two degrees of freedom: numerator first, then denominator. So the 95% value of F with 2 and 18 d.f. is

    qf(.95,2,18)
    [1] 3.554557

This is what the density function of F looks like for 2 and 18 d.f. (left) and 6 and 18 d.f. (right):

    x <- seq(0.05,4,0.05)
    plot(x,df(x,2,18),type="l",ylab="f(x)",xlab="x")
    plot(x,df(x,6,18),type="l",ylab="f(x)",xlab="x")

[Figure: two density curves, f(x) against x from 0 to 4.]

The F distribution is a two-parameter distribution defined by the density function

$$f(x) = \frac{r\,\Gamma\left(\tfrac{1}{2}(r+s)\right)}{s\,\Gamma\left(\tfrac{1}{2}r\right)\Gamma\left(\tfrac{1}{2}s\right)} \cdot \frac{(rx/s)^{(r/2)-1}}{\left[1 + (rx/s)\right]^{(r+s)/2}},$$

where r is the degrees of freedom in the numerator and s is the degrees of freedom in the denominator. The distribution is named after R.A. Fisher, the father of analysis of variance, and principal developer of quantitative genetics. It is central to hypothesis testing, because of its use in assessing the significance of the differences between two variances. The test statistic is calculated by dividing the larger variance by the smaller variance. The two variances are significantly different when this ratio is larger than the critical value of Fisher's F. The degrees of freedom in the numerator and in the denominator allow the calculation of the critical value of the test statistic. When there is a single degree of freedom in the numerator, the distribution is equal to the square of Student's t: F = t². Thus, since the rule of thumb for the critical value of t is 2, the rule of thumb for F = t² = 4. To see how well the rule of thumb works, we can plot critical F against d.f. in the numerator:

312 290 THE R BOOK

    windows(7,7)
    par(mfrow=c(1,1))
    df <- seq(1,30,.1)
    plot(df,qf(.95,df,30),type="l",ylab="Critical F")
    lines(df,qf(.95,df,10),lty=2)

[Figure: critical F (2.0 to 4.0) against df (0 to 30), with a solid line and a dashed line.]

You see that the rule of thumb (critical F = 4) quickly becomes much too large once the d.f. in the numerator (on the x axis) is larger than 2. The lower (solid) line shows the critical values of F when the denominator has 30 d.f. and the upper (dashed) line shows the case in which the denominator has 10 d.f.

The shape of the density function of the F distribution depends on the degrees of freedom in the numerator.

    x <- seq(0.01,3,0.01)
    plot(x,df(x,1,10),type="l",ylim=c(0,1),ylab="f(x)")
    lines(x,df(x,2,10),lty=6,col="red")
    lines(x,df(x,5,10),lty=2,col="green")
    lines(x,df(x,30,10),lty=3,col="blue")
    legend(2,0.9,c("1","2","5","30"),col=(1:4),lty=c(1,6,2,3), title="numerator d.f.")
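The relationship F = t² for a single numerator degree of freedom, which underlies the rule of thumb above, can be checked numerically. This is a minimal sketch, assuming a two-tailed 5% test for t and the corresponding one-tailed 5% test for F; the variable names dof, crit.t and crit.F are our own.

```r
dof <- 1:30                      # degrees of freedom (denominator d.f. for F)

crit.t <- qt(0.975, dof)         # two-tailed 5% critical values of Student's t
crit.F <- qf(0.95, 1, dof)       # critical F with 1 d.f. in the numerator

# the squared t values and the F values should agree to within rounding error
head(cbind(dof, t.squared = crit.t^2, F = crit.F))

all.equal(crit.t^2, crit.F)
```

The agreement holds at every value of the degrees of freedom, not just where the rule of thumb t ≈ 2 is a good approximation.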

313 MATHEMATICS 291

[Figure: density f(x) against x (0 to 3) for numerator d.f. of 1, 2, 5 and 30.]

The probability density f(x) declines monotonically when the numerator has 1 or 2 d.f., but rises to a maximum for 3 d.f. or more (5 and 30 are shown here): all the graphs have 10 d.f. in the denominator.

7.3.9 Student's t distribution

This famous distribution was first published by W.S. Gosset in 1908 under the pseudonym of 'Student' because his then employer, the Guinness brewing company in Dublin, would not permit employees to publish under their own names. It is a model with one parameter, r, with density function

$$f(x) = \frac{\Gamma\left(\tfrac{1}{2}(r+1)\right)}{(\pi r)^{1/2}\,\Gamma\left(\tfrac{1}{2}r\right)} \left(1 + \frac{x^2}{r}\right)^{-(r+1)/2},$$

where –∞ < x < +∞. This looks very complicated, but if all the constants are stripped away, you can see just how simple the underlying structure really is:

$$f(x) = \left(1 + x^2\right)^{-1/2}.$$

We can plot this for values of x from –3 to +3 as follows:

    curve( (1+x^2)^(-0.5), -3, 3,ylab="t(x)",col="red")

The main thing to notice is how fat the tails of the distribution are, compared with the normal distribution. The plethora of constants is necessary to scale the density function so that its integral is 1. If we define U as

$$U = \frac{(n-1)s^2}{\sigma^2},$$

then this is chi-squared distributed on n – 1 d.f. (see above).

314 292 THE R BOOK

Now define V as

$$V = \frac{n^{1/2}(\bar{y} - \mu)}{\sigma}$$

and note that this is normally distributed with mean 0 and standard deviation 1 (the standard normal distribution), so

$$\frac{V}{\left(U/(n-1)\right)^{1/2}}$$

is the ratio of a standard normal variable to the square root of a chi-squared variable divided by its degrees of freedom. You might like to compare this with the F distribution (above), which is the ratio of two chi-squared distributed random variables.

At what point does the rule of thumb for Student's t = 2 break down so seriously that it is actually misleading? To find this out, we need to plot the value of Student's t against sample size (actually against degrees of freedom) for small samples. We use qt (quantile of t) and fix the probability at the two-tailed value of 0.975:

    plot(1:30,qt(0.975,1:30), ylim=c(0,12),type="l", ylab="Students t value",xlab="d.f.",col="red")
    abline(h=2,lty=2,col="green")

[Figure: Student's t value (0 to 12) against d.f. (0 to 30), with a horizontal dashed line at t = 2.]

As you see, the rule of thumb only becomes really hopeless for degrees of freedom less than about 5 or so. For most practical purposes t ≈ 2 really is a good working rule of thumb. So what does the t distribution look like, compared to a normal? Let us redraw the standard normal as a dashed line (lty=2):

    xvs <- seq(-4,4,0.01)
    plot(xvs,dnorm(xvs),type="l",lty=2, ylab="Probability density",xlab="Deviates")

315 MATHEMATICS 293

Now we can overlay Student's t with d.f. = 5 as a solid line to see the difference:

    lines(xvs,dt(xvs,df=5),col="red")

[Figure: Probability density (0 to 0.4) against Deviates (–4 to 4) for the standard normal (dashed line) and Student's t with 5 d.f. (solid red line).]

The difference between the normal (dashed line) and Student's t distributions (solid red line) is that the t distribution has 'fatter tails'. This means that extreme values are more likely with a t distribution than with a normal, and the confidence intervals are correspondingly broader. So instead of a 95% interval of ±1.96 with a normal distribution we should have a 95% interval of ±2.57 for a Student's t distribution with 5 degrees of freedom:

    qt(0.975,5)
    [1] 2.570582

In hypothesis testing we generally use two-tailed tests because typically we do not know the direction of the response in advance. This means that we put 0.025 in each of two tails, rather than 0.05 in one tail.

7.3.10 The gamma distribution

The gamma distribution is useful for describing a wide range of processes where the data are positively skewed (i.e. non-normal, with a long tail on the right). It is a two-parameter distribution, where the parameters are traditionally known as shape and rate. Its density function is:

$$f(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta},$$

where α is the shape parameter and β⁻¹ is the rate parameter (alternatively, β is known as the scale parameter). Special cases of the gamma distribution are the exponential (α = 1) and chi-squared (α = ν/2, β = 2).

To see the effect of the shape parameter on the probability density, we can plot the gamma distribution for different values of shape and rate over the range 0.01 to 4:

    x <- seq(0.01,4,.01)
    par(mfrow=c(2,2))

316 294 THE R BOOK

    y <- dgamma(x,.5,.5)
    plot(x,y,type="l",col="red",main="alpha = 0.5")
    y <- dgamma(x,.8,.8)
    plot(x,y,type="l",col="red", main="alpha = 0.8")
    y <- dgamma(x,2,2)
    plot(x,y,type="l",col="red", main="alpha = 2")
    y <- dgamma(x,10,10)
    plot(x,y,type="l",col="red", main="alpha = 10")

[Figure: four panels of y against x (0 to 4), titled alpha = 0.5, alpha = 0.8, alpha = 2 and alpha = 10.]

The graphs from top left to bottom right show different values of α: 0.5, 0.8, 2 and 10. Note how α < 1 produces monotonic declining functions and α > 1 produces humped curves that pass through the origin, with the degree of skew declining as α increases.

The mean of the distribution is αβ, the variance is αβ², the skewness is 2/√α and the kurtosis is 6/α. Thus, for the exponential distribution we have a mean of β, a variance of β², a skewness of 2 and a kurtosis of 6, while for the chi-squared distribution we have a mean of ν, a variance of 2ν, a skewness of 2√(2/ν) and a kurtosis of 12/ν.

317 MATHEMATICS 295

Observe also that

$$\frac{1}{\beta} = \frac{\text{mean}}{\text{variance}}, \qquad \text{shape} = \frac{1}{\beta} \times \text{mean}.$$

We can now answer questions like this: what is the value of the 95% quantile expected from a gamma distribution with mean = 2 and variance = 3? This implies that rate is 2/3 and shape is 4/3. Because qgamma takes the shape before the rate, we name the arguments:

    qgamma(0.95,shape=4/3,rate=2/3)

An important use of the gamma distribution is in describing continuous measurement data that are not normally distributed. Here is an example where body mass data for 200 fishes are plotted as a histogram and a gamma distribution with the same mean and variance is overlaid as a smooth curve:

    fishes <- read.table("c:\\temp\\fishes.txt",header=T)
    attach(fishes)
    names(fishes)
    [1] "mass"

First, we calculate the two parameter values for the gamma distribution:

    rate <- mean(mass)/var(mass)
    shape <- rate*mean(mass)
    rate
    [1] 0.8775119
    shape
    [1] 3.680526

We need to know the largest value of mass, in order to make the bins for the histogram:

    max(mass)
    [1] 15.53216

Now we can plot the histogram, using break points at 0.5 to get integer-centred bars up to a maximum of 16.5 to accommodate our biggest fish:

    par(mfrow=c(1,1))
    hist(mass,breaks=-0.5:16.5,col="green",main="")

The density function of the gamma distribution is overlaid using lines like this:

    lines(seq(0.01,15,0.01),length(mass)*dgamma(seq(0.01,15,0.01),shape,rate))
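The two relations rate = mean/variance and shape = rate × mean are used twice in this section, so it may help to see them as a function. This is only a sketch: the helper name gamma.pars is hypothetical, and the example uses the mean of 2 and variance of 3 from the quantile question above rather than the fish data.

```r
# Hypothetical helper: gamma shape and rate from a given mean and variance,
# using rate = mean/variance and shape = rate * mean.
gamma.pars <- function(mean, variance) {
  rate <- mean / variance
  shape <- rate * mean
  c(shape = shape, rate = rate)
}

p <- gamma.pars(2, 3)      # shape = 4/3, rate = 2/3
p

# qgamma takes shape before rate, so naming the arguments avoids mix-ups
q95 <- qgamma(0.95, shape = p[["shape"]], rate = p[["rate"]])

# sanity check: the cumulative probability at q95 should be 0.95
pgamma(q95, shape = p[["shape"]], rate = p[["rate"]])
```

Naming the shape and rate arguments is a deliberate choice here: passed by position, the two values are easily swapped, and the call still runs without any warning.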

318 296 THE R BOOK

[Figure: histogram of Frequency against mass (0 to 15) with the fitted gamma curve overlaid.]

The fit is much better than when we tried to fit a normal distribution to these same data earlier (see p. 286).

7.3.11 The exponential distribution

This is a one-parameter distribution that is a special case of the gamma distribution. Much used in survival analysis, its density function is given on p. 874 and its use in survival analysis is explained on p. 884. The random number generator of the exponential is useful for Monte Carlo simulations of time to death when the hazard (the instantaneous risk of death) is constant with age. You specify the hazard, which is the reciprocal of the mean age at death:

    rexp(15,0.1)
    [1] 8.4679954 19.4649828 16.3599100 31.6182943 1.9592625 6.3877954
    [7] 26.4725498 18.7831597 34.9983158 18.0820563 2.1303369 0.1319956
    [13] 35.3649667 3.5672353 4.8672067

These are 15 random lifetimes with an expected value of 1/0.1 = 10 years; they give a sample mean of 15.24 years.

7.3.12 The beta distribution

This has two positive constants, a and b, and x is bounded in the range 0 ≤ x ≤ 1:

$$f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}.$$

In R we generate a family of density functions like this:

    par(mfrow=c(2,2))

319 MATHEMATICS 297

    x <- seq(0,1,0.01)
    fx <- dbeta(x,2,3)
    plot(x,fx,type="l",main="a=2 b=3",col="red")
    fx <- dbeta(x,0.5,2)
    plot(x,fx,type="l",main="a=0.5 b=2",col="red")
    fx <- dbeta(x,2,0.5)
    plot(x,fx,type="l",main="a=2 b=0.5",col="red")
    fx <- dbeta(x,0.5,0.5)
    plot(x,fx,type="l",main="a=0.5 b=0.5",col="red")

[Figure: four panels of fx against x (0 to 1): a = 2 b = 3 (top left), a = 0.5 b = 2 (top right), a = 2 b = 0.5 (bottom left) and a = 0.5 b = 0.5 (bottom right).]

The important point is whether the parameters are greater or less than 1. When both are greater than 1 we get an n-shaped curve which becomes more skew as b > a (top left). If 0 < a < 1 and b > 1 then the slope of the density is negative (top right), while for a > 1 and 0 < b < 1 the slope of the density is positive (bottom left). The function is U-shaped when both a and b are positive fractions (bottom right). If a = b = 1, then we obtain the uniform distribution on [0,1].
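As a check on the density formula, we can write it out directly and compare it with the built-in dbeta, and confirm that a = b = 1 gives the uniform distribution. This is a minimal sketch; beta.density is a hypothetical name of our own, and the x values stay strictly inside (0,1) to avoid the infinite densities that arise at the boundaries when a or b is less than 1.

```r
# Beta density written out from the formula in the text
# (beta.density is a hypothetical name, not a base R function)
beta.density <- function(x, a, b) {
  gamma(a + b) / (gamma(a) * gamma(b)) * x^(a - 1) * (1 - x)^(b - 1)
}

x <- seq(0.01, 0.99, 0.01)   # stay inside (0,1) to avoid infinite values

# agrees with the built-in dbeta, for humped and for U-shaped cases
all.equal(beta.density(x, 2, 3), dbeta(x, 2, 3))
all.equal(beta.density(x, 0.5, 0.5), dbeta(x, 0.5, 0.5))

# a = b = 1 gives the uniform density on [0,1]
all.equal(dbeta(x, 1, 1), dunif(x))
```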

320 298 THE R BOOK

Here are 10 random numbers from the beta distribution with shape parameters 2 and 3:

    rbeta(10,2,3)
    [1] 0.2908066 0.1115131 0.5217944 0.1691430 0.4456099
    [6] 0.3917639 0.6534021 0.3633334 0.2342860 0.6927753

7.3.13 The Cauchy distribution

This is a long-tailed two-parameter distribution, characterized by a location parameter a and a scale parameter b. It is real-valued, symmetric about a (which is also its median), and is a curiosity in that it has long enough tails that the expectation does not exist; indeed, it has no moments at all (it often appears in counter-examples in maths books). The harmonic mean of a variable with positive density at 0 is typically distributed as Cauchy, and the Cauchy distribution also appears in the theory of Brownian motion (e.g. random walks). The general form of the distribution is

$$f(x) = \frac{1}{\pi b \left(1 + \left((x-a)/b\right)^2\right)},$$

for –∞ < x < ∞. There is also a one-parameter version, with a = 0 and b = 1, which is known as the standard Cauchy distribution and is the same as Student's t distribution with one degree of freedom:

$$f(x) = \frac{1}{\pi\left(1 + x^2\right)},$$

for –∞ < x < ∞.

    windows(7,4)
    par(mfrow=c(1,2))
    plot(-200:200,dcauchy(-200:200,0,10),type="l",ylab="p(x)",xlab="x", col="red")
    plot(-200:200,dcauchy(-200:200,0,50),type="l",ylab="p(x)",xlab="x", col="red")

[Figure: two Cauchy density curves, p(x) against x from –200 to 200.]

321 MATHEMATICS 299

Note the very long, fat tail of the Cauchy distribution. The left-hand density function has scale = 10 and the right-hand plot has scale = 50; both have location = 0.

7.3.14 The lognormal distribution

The lognormal distribution takes values on the positive real line. If the logarithm of a lognormal deviate is taken, the result is a normal deviate, hence the name. Applications for the lognormal include the distribution of particle sizes in aggregates, flood flows, concentrations of air contaminants, and failure times. The hazard function of the lognormal is increasing for small values and then decreasing. A mixture of heterogeneous items that individually have monotone hazards can create such a hazard function.

Density, cumulative probability, quantiles and random generation for the lognormal distribution employ the functions dlnorm, plnorm, qlnorm and rlnorm. Here, for instance, is the density function: dlnorm(x, meanlog=0, sdlog=1). The mean and standard deviation are optional, with default meanlog = 0 and sdlog = 1. Note that these are not the mean and standard deviation; the lognormal distribution has

$$\text{mean } e^{\mu + \sigma^2/2}, \quad \text{variance } \left(e^{\sigma^2} - 1\right)e^{2\mu + \sigma^2}, \quad \text{skewness } \left(e^{\sigma^2} + 2\right)\sqrt{e^{\sigma^2} - 1}$$

and

$$\text{kurtosis } e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} - 6.$$

    windows(7,7)
    plot(seq(0,10,0.05),dlnorm(seq(0,10,0.05)), type="l",xlab="x",ylab="LogNormal f(x)",col="red")

[Figure: LogNormal f(x) (0 to 0.6) against x (0 to 10).]

The extremely long tail and exaggerated positive skew are characteristic of the lognormal distribution. Logarithmic transformation followed by analysis with normal errors is often appropriate for data such as these.
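Because meanlog and sdlog are so easily mistaken for the mean and standard deviation, it can be useful to compute the actual moments from the formulas above. The sketch below does this for the mean and variance and checks the mean numerically, using the fact that the log of a lognormal deviate is a normal deviate, so the mean is the expectation of exp(z) for normal z. The helper name lnorm.moments is our own, not part of base R.

```r
# Hypothetical helper: mean and variance of the lognormal
# from the formulas in the text.
lnorm.moments <- function(meanlog = 0, sdlog = 1) {
  c(mean = exp(meanlog + sdlog^2 / 2),
    variance = (exp(sdlog^2) - 1) * exp(2 * meanlog + sdlog^2))
}

m <- lnorm.moments()       # defaults: meanlog = 0, sdlog = 1
m

# numerical check of the mean: E[exp(z)] with z standard normal
num.mean <- integrate(function(z) exp(z) * dnorm(z), -Inf, Inf)$value
c(formula = m[["mean"]], numeric = num.mean)
```

With the default parameters the mean is exp(0.5), which is well above the median of exp(0) = 1, reflecting the long right-hand tail.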

322 300 THE R BOOK

7.3.15 The logistic distribution

The logistic is the canonical link function in generalized linear models with binomial errors and is described in detail in Chapter 16 on the analysis of proportion data. The cumulative probability is a symmetrical S-shaped distribution that is bounded above by 1 and below by 0. There are two ways of writing the cumulative probability equation:

$$p(x) = \frac{e^{a+bx}}{1 + e^{a+bx}}$$

and

$$p(x) = \frac{1}{1 + \beta e^{-\alpha x}}.$$

The great advantage of the first form is that it linearizes under the log-odds transformation (see p. 630) so that

$$\ln\left(\frac{p}{q}\right) = a + bx,$$

where p is the probability of success and q = 1 – p is the probability of failure. The logistic is a unimodal, symmetric distribution on the real line with tails that are longer than the normal distribution. It is often used to model growth curves, but has also been used in bioassay studies and other applications. A motivation for using the logistic with growth curves is that the logistic distribution function f(x) has the property that the derivative of f(x) with respect to x is proportional to [f(x) – A][B – f(x)] with A < B. The interpretation is that the rate of growth is proportional to the amount already grown, times the amount of growth that is still expected.

    windows(7,4)
    par(mfrow=c(1,2))
    plot(seq(-5,5,0.02),dlogis(seq(-5,5,.02)), type="l",main="Logistic",col="red",xlab="x",ylab="p(x)")
    plot(seq(-5,5,0.02),dnorm(seq(-5,5,.02)), type="l",main="Normal",col="red",xlab="x",ylab="p(x)")

[Figure: two panels of p(x) against x (–4 to 4), titled Logistic and Normal.]

323 MATHEMATICS 301

Here, the logistic density function dlogis (left) is compared with an equivalent normal density function dnorm (right) using the defaults in both cases (location 0 and scale 1 for the logistic, mean 0 and standard deviation 1 for the normal). Note the much fatter tails of the logistic (there is still substantial probability at x = ±4). Note also the difference in the scales of the two y axes (0.25 for the logistic, 0.4 for the normal).

7.3.16 The log-logistic distribution

The log-logistic is a very flexible four-parameter model for describing growth or decay processes:

$$y = a + b\left[\frac{\exp(c(\log(x) - d))}{1 + \exp(c(\log(x) - d))}\right].$$

Here are two cases. The first is a negative sigmoid with c = –1.59 and a = –1.4:

    windows(7,4)
    par(mfrow=c(1,2))
    x <- seq(0.1,1,0.01)
    y <- -1.4+2.1*(exp(-1.59*log(x)-1.53)/(1+exp(-1.59*log(x)-1.53)))
    plot(log(x),y,type="l", main="c = -1.59", col="red")

For the second we have c = 1.59 and a = 0.1:

    y <- 0.1+2.1*(exp(1.59*log(x)-1.53)/(1+exp(1.59*log(x)-1.53)))
    plot(log(x),y,type="l",main="c = 1.59",col="red")

[Figure: two panels of y against log(x), titled c = –1.59 and c = 1.59.]

7.3.17 The Weibull distribution

The origin of the Weibull distribution is in weakest link analysis. If there are r links in a chain, and the strengths of each link Z_i are independently distributed on (0, ∞), then the distribution of weakest link V = min(Z_j) approaches the Weibull distribution as the number of links increases.

The Weibull is a two-parameter model that has the exponential distribution as a special case. Its value in demographic studies and survival analysis is that it allows for the death rate to increase or to decrease with age, so that all three types of survivorship curve can be analysed (as explained on p. 872). The density, survival and hazard functions with λ = μ^(–α) are:

324 302 THE R BOOK

$$f(t) = \alpha\lambda t^{\alpha-1} e^{-\lambda t^{\alpha}},$$

$$S(t) = e^{-\lambda t^{\alpha}},$$

$$h(t) = \frac{f(t)}{S(t)} = \alpha\lambda t^{\alpha-1}.$$

The mean of the Weibull distribution is Γ(1 + α⁻¹)μ and the variance is μ²(Γ(1 + 2/α) − (Γ(1 + 1/α))²), and the parameter α describes the shape of the hazard function (the background to determining the likelihood equations is given by Aitkin et al., 2009). For α = 1 (the exponential distribution) the hazard is constant, while for α > 1 the hazard increases with age and for α < 1 the hazard decreases with age.

Because the Weibull, lognormal and log-logistic all have positive skewness, it is difficult to discriminate between them with small samples. This is an important problem, because each distribution has differently shaped hazard functions, and it will be hard, therefore, to discriminate between different assumptions about the age-specificity of death rates. In survival studies, parsimony requires that we fit the exponential rather than the Weibull unless the shape parameter α is significantly different from 1.

Here is a family of three Weibull distributions with α = 1, 2 and 3 (red, green and blue lines, respectively):

    windows(7,7)
    a<-3
    l<-1
    t <- seq(0,1.8,.05)
    ft <- a*l*t^(a-1)*exp(-l*t^a)
    plot(t,ft,type="l",col="blue",ylab="f(t) ")
    a<-1
    ft <- a*l*t^(a-1)*exp(-l*t^a)
    lines(t,ft,type="l",col="red")
    a<-2
    ft <- a*l*t^(a-1)*exp(-l*t^a)
    lines(t,ft,type="l",col="green")
    legend(1.4,1.1,c("1","2","3"),title="alpha",lty=c(1,1,1),col=c(2,3,4))

[Figure: Weibull densities f(t) (0 to 1.2) against t (0 to 1.5) for alpha = 1, 2 and 3.]
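The density in the text is parameterized by α and λ, whereas R's built-in dweibull uses shape and scale. Since λ = μ^(–α), the scale is μ = λ^(–1/α). The sketch below checks that the two forms agree and that the formula for the mean matches a numerical integration; the function name weibull.density and the particular values of alpha and lambda are our own illustrative choices.

```r
# Density from the text: f(t) = alpha * lambda * t^(alpha-1) * exp(-lambda * t^alpha)
# (weibull.density is a hypothetical name used only for this sketch)
weibull.density <- function(t, alpha, lambda) {
  alpha * lambda * t^(alpha - 1) * exp(-lambda * t^alpha)
}

t <- seq(0.05, 1.8, 0.05)
alpha <- 3
lambda <- 2

# dweibull uses shape and scale; lambda = scale^(-alpha), so scale = lambda^(-1/alpha)
mu <- lambda^(-1 / alpha)
all.equal(weibull.density(t, alpha, lambda),
          dweibull(t, shape = alpha, scale = mu))

# mean from the text, gamma(1 + 1/alpha) * mu, against numerical integration
mean.formula <- gamma(1 + 1 / alpha) * mu
mean.numeric <- integrate(function(t) t * dweibull(t, alpha, mu), 0, Inf)$value
c(formula = mean.formula, numeric = mean.numeric)
```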

325 MATHEMATICS 303

Note that for large values of α the distribution becomes symmetrical, while for α ≤ 1 the distribution has its mode at t = 0.

7.3.18 Multivariate normal distribution

If you want to generate two (or more) vectors of normally distributed random numbers that are correlated with one another to a specified degree, then you need the mvrnorm function from the MASS library:

    library(MASS)

Suppose we want two vectors of 1000 random numbers each. The first vector has a mean of 50 and the second has a mean of 60. The difference from rnorm is that we need to specify their covariance as well as the variance of each separate variable. This is achieved with a positive-definite symmetric matrix specifying the covariance matrix of the variables.

    xy <- mvrnorm(1000,mu=c(50,60),matrix(c(4,3.7,3.7,9),2))

We can check how close the variances are to our specified values:

    var(xy)
             [,1]     [,2]
    [1,] 3.849190 3.611124
    [2,] 3.611124 8.730798

Not bad: we said the covariance should be 3.70 and the simulated data are 3.611124. We extract the two separate vectors x and y and plot them to look at the correlation:

    x <- xy[,1]
    y <- xy[,2]
    plot(x,y,pch=16,ylab="y",xlab="x",col="blue")

[Figure: scatterplot of y (50 to 70) against x (44 to 56).]

It is worth looking at the variances of x and y in more detail:

    var(x)
    [1] 3.84919
    var(y)
    [1] 8.730798

If the two samples were independent, then the variance of the sum of the two variables would be equal to the sum of the two variances. Is this the case here?

    var(x+y)
    [1] 19.80224
    var(x)+var(y)
    [1] 12.57999

No it is not. The variance of the sum (19.80) is much greater than the sum of the variances (12.58). This is because x and y are positively correlated; big values of x tend to be associated with big values of y and vice versa. This being so, we would expect the variance of the difference between x and y to be less than the sum of the two variances:

    var(x-y)
    [1] 5.357741

As predicted, the variance of the difference (5.36) is much less than the sum of the variances (12.58). We conclude that the variance of a sum of two variables is only equal to the variance of the difference of two variables when the two variables are independent. What about the covariance of x and y? We found this already by applying the var function to the matrix xy (above). We specified that the covariance should be 3.70 in calling the multivariate normal distribution, and the difference between 3.70 and 3.611 124 is simply due to the random selection of points. The covariance is related to the separate variances through the correlation coefficient ρ as follows (see p. 373):

$$\mathrm{cov}(x, y) = \rho \sqrt{s_x^{2}\, s_y^{2}}.$$

For our example, this checks out as follows, where the sample value of ρ is cor(x,y):

    cor(x,y)*sqrt(var(x)*var(y))
    [1] 3.611124

which is our observed covariance between x and y with ρ = 0.622 917 8.

7.3.19 The uniform distribution

This is the distribution that the random number generator in your calculator hopes to emulate. The idea is to generate numbers between 0 and 1 where every possible real number on this interval has exactly the same probability of being produced. If you have thought about this, it will have occurred to you that there is something wrong here.
Computers produce numbers by following recipes. If you are following a recipe then the outcome is predictable. If the outcome is predictable, then how can it be random? As John von Neumann

once said: ‘Anyone who uses arithmetic methods to produce random numbers is in a state of sin.’ This raises the question as to what, exactly, a computer-generated random number is. The answer turns out to be scientifically very interesting and very important to the study of encryption (for instance, any pseudorandom number sequence generated by a linear recursion is insecure, since, from a sufficiently long subsequence of the outputs, one can predict the rest of the outputs). If you are interested, look up the Mersenne twister online. Here we are only concerned with how well the modern pseudorandom number generator performs. Here is the outcome of the R function runif simulating the throwing of a six-sided die 10 000 times: the histogram ought to be flat:

    x <- ceiling(runif(10000)*6)
    table(x)
    x
       1    2    3    4    5    6
    1680 1668 1654 1622 1644 1732

    hist(x,breaks=0.5:6.5,main="")

[Figure: histogram of x, Frequency against x]

This is remarkably close to theoretical expectation, reflecting the very high efficiency of R’s random-number generator. Try mapping 1 000 000 points to look for gaps:

    x <- runif(1000000)
    y <- runif(1000000)
    plot(x,y,pch=".",col="blue")

[Figure: scatterplot of y against x for 1 000 000 uniform random points]

The scatter of unfilled space (white dots amongst the sea produced by 1 000 000 blue dots with pch=".") shows no evidence of clustering. For a more thorough check we can count the frequency of combinations of numbers: with 36 cells, the expected frequency is 1 000 000/36 = 27 777.78 numbers per cell. We use the cut function to produce 36 bins:

    table(cut(x,6),cut(y,6))
                        (-0.001,0.166] (0.166,0.333] (0.333,0.5] (0.5,0.667] (0.667,0.834] (0.834,1]
      (-0.000997,0.166]          27667         28224       27814       27601         27592     27659
      (0.166,0.333]              27604         27790       27922       27687         27990     27701
      (0.333,0.5]                27951         27668       27683       27773         27999     27959
      (0.5,0.667]                27550         27767       27951       27912         27619     27577
      (0.667,0.834]              27527         28106       27868       28262         27804     27460
      (0.834,1]                  27617         27662       27863       27867         27727     27577

As you can see the observed frequencies are remarkably close to expectation:

    range(table(cut(x,6),cut(y,6)))
    [1] 27460 28262

None of the cells contained fewer than 27 460 random points, and none more than 28 262.

7.3.20 Plotting empirical cumulative distribution functions

The function ecdf is used to compute or plot an empirical cumulative distribution function. Here it is in action for the fishes data (pp. 286 and 296):

    fishes <- read.table("c:\\temp\\fishes.txt",header=T)
    attach(fishes)
    names(fishes)
    [1] "mass"
    plot(ecdf(mass))

[Figure: ecdf(mass), Fn(x) against x]

The pronounced positive skew in the data is evident from the fact that the left-hand side of the cumulative distribution is much steeper than the right-hand side (and see p. 350).

7.4 Discrete probability distributions

7.4.1 The Bernoulli distribution

This is the distribution underlying tests with a binary response variable. The response takes one of only two values: it is 1 with probability p (a ‘success’) and is 0 with probability 1 – p (a ‘failure’). The density function is given by:

$$p(X) = p^{x}(1-p)^{1-x}.$$

The statistician’s definition of variance is the expectation of $x^2$ minus the square of the expectation of x:

$$\sigma^{2} = E(X^{2}) - [E(X)]^{2}.$$

We can see how this works with a simple distribution like the Bernoulli. There are just two outcomes in f(x): a success, where x = 1 with probability p and a failure, where x = 0 with probability 1 – p. Thus, the expectation of x is

$$E(X) = \sum x f(x) = 0 \times (1-p) + 1 \times p = 0 + p = p$$

and the expectation of $x^2$ is

$$E(X^{2}) = \sum x^{2} f(x) = 0^{2} \times (1-p) + 1^{2} \times p = 0 + p = p,$$

so the variance of the Bernoulli distribution is

$$\mathrm{var}(X) = E(X^{2}) - [E(X)]^{2} = p - p^{2} = p(1-p) = pq.$$

7.4.2 The binomial distribution

This is a one-parameter distribution in which p describes the probability of success in a binary trial. The probability of x successes out of n attempts is given by multiplying together the probability of obtaining one specific realization and the number of ways of getting that realization.

We need a way of generalizing the number of ways of getting x items out of n items. The answer is the combinatorial formula

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!},$$

where the ‘exclamation mark’ means ‘factorial’. For instance, 5! = 5 × 4 × 3 × 2 × 1 = 120. This formula has immense practical utility. It shows you at once, for example, how unlikely you are to win the National Lottery in which you are invited to select six numbers between 1 and 49. We can use the built-in factorial function for this,

    factorial(49)/(factorial(6)*factorial(49-6))
    [1] 13983816

which is roughly a 1 in 14 million chance of winning the jackpot. You are more likely to die between buying your ticket and hearing the outcome of the draw. As we have seen (p. 17), there is a built-in R function for the combinatorial function,

    choose(49,6)
    [1] 13983816

and we use the choose function from here on.

The general form of the binomial distribution is given by

$$p(x) = \binom{n}{x} p^{x} (1-p)^{n-x},$$

using the combinatorial formula above. The mean of the binomial distribution is np and the variance is np(1 – p). Since 1 – p is less than 1 it is obvious that the variance is less than the mean for the binomial distribution (except, of course, in the trivial case when p = 0 and the variance is 0). It is easy to visualize the distribution for particular values of n and p.

    p <- 0.1
    n <- 4
    x <- 0:n
    px <- choose(n,x)*p^x*(1-p)^(n-x)
    barplot(px,names=x,xlab="outcome",ylab="probability",col="green")

[Figure: barplot of binomial probability against outcome (0 to 4)]

The four distribution functions available for the binomial in R (density, cumulative probability, quantiles and random generation) are used like this. The density function dbinom(x, size, prob) shows the probability for the specified count x (e.g. the number of parasitized fish) out of a sample of n = size, with probability of success = prob. So if we catch four fish when 10% are parasitized in the parent population, we have size = 4 and prob = 0.1 (as illustrated above). Much the most likely number of parasitized fish in our sample is 0.

The cumulative probability shows the sum of the probability densities up to and including p(x), plotting cumulative probability against the number of successes, for a sample of n = size and probability = prob. Our fishy plot looks like this:

    barplot(pbinom(0:4,4,0.1),names=0:4,xlab="parasitized fish",
            ylab="probability",col="red")

[Figure: barplot of cumulative probability against number of parasitized fish (0 to 4)]

This shows that the probability of getting 2 or fewer parasitized fish out of a sample of 4 is very close to 1. Note that you can generate the series inside the density function (0:4).

To obtain a confidence interval for the expected number of fish to be caught in a sample of n = size and a probability = prob, we need qbinom, the quantile function for the binomial. The lower and upper limits of the 95% confidence interval are

    qbinom(.025,4,0.1)
    [1] 0
    qbinom(.975,4,0.1)
    [1] 2

This means that with 95% certainty we shall catch between 0 and 2 parasitized fish out of 4 if we repeat the sampling exercise. We are very unlikely to get 3 or more parasitized fish out of a sample of 4 if the proportion parasitized really is 0.1.

This kind of calculation is very important in power calculations in which we are interested in determining whether or not our chosen sample size (n = 4 in this case) is capable of doing the job we ask of it. Suppose that the fundamental question of our survey is whether or not the parasite is present in a given lake. If we find one or more parasitized fish then the answer is clearly ‘yes’. But how likely are we to miss out on catching any parasitized fish and hence of concluding, wrongly, that the parasites are not present in the lake? With our sample size of n = 4 and p = 0.1 we have a probability of missing the parasite of 0.9 for each fish caught and hence a probability of $0.9^{4} = 0.6561$ of missing out altogether on finding the parasite. This is obviously unsatisfactory. We need to think again about the sample size. What is the smallest sample, n, that makes the probability of missing the parasite altogether less than 0.05? We need to solve

$$0.05 = 0.9^{n}.$$

Taking logs,

$$\log(0.05) = n \log(0.9),$$

so

$$n = \frac{\log(0.05)}{\log(0.9)} = 28.433\,16,$$

which means that to make our journey worthwhile we should keep fishing until we have found more than 28 unparasitized fishes, before we reject the hypothesis that parasitism is present at a rate of 10%. Of course, it would take a much bigger sample to reject a hypothesis of presence at a substantially lower rate.

Random numbers are generated from the binomial distribution like this. The first argument is the number of random numbers we want. The second argument is the sample size (n = 4) and the third is the probability of success (p = 0.1).

    rbinom(10,4,0.1)
    [1] 0 0 0 0 0 1 0 1 0 1

Here we repeated the sampling of 4 fish ten times. We got 1 parasitized fish out of 4 on three occasions, and 0 parasitized fish on the remaining seven occasions. We never caught 2 or more parasitized fish in any of these samples of 4.
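The sample-size argument above can be sketched in a few lines of R. This is only an illustration: the variable names (p, alpha, n_exact, n_min) are assumptions introduced here, and the check uses dbinom to evaluate the probability of catching no parasitized fish at all.

```r
p <- 0.1       # proportion of parasitized fish in the lake
alpha <- 0.05  # acceptable probability of missing the parasite altogether

# Solve (1 - p)^n = alpha for n, then round up to a whole number of fish
n_exact <- log(alpha) / log(1 - p)
n_min <- ceiling(n_exact)
n_exact
n_min

# Probability of zero parasitized fish, either side of the threshold
dbinom(0, size = n_min - 1, prob = p)   # still above alpha
dbinom(0, size = n_min, prob = p)       # now below alpha
```

Changing p to a smaller value shows how quickly the required sample size grows when the parasite is rare.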

7.4.3 The geometric distribution

Suppose that a series of independent Bernoulli trials with probability p are carried out at times 1, 2, 3, . . . . Now let W be the waiting time until the first success occurs. So

$$P(W > x) = (1-p)^{x},$$

which means that

$$P(W = x) = P(W > x-1) - P(W > x).$$

The density function, therefore, is

$$f(x) = p(1-p)^{x-1}.$$

    fx <- dgeom(0:20,0.2)
    barplot(fx,names=0:20,xlab="outcome",ylab="probability",col="cyan")

[Figure: barplot of geometric probability against outcome (0 to 20)]

Note that R's dgeom counts the number of failures before the first success, so its outcomes start at 0 and dgeom(x, p) is $p(1-p)^{x}$; the mean quoted here refers to this count of failures (the mean of the waiting time W itself is 1/p). For the geometric distribution,

- the mean is $\dfrac{1-p}{p}$,
- the variance is $\dfrac{1-p}{p^{2}}$.
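The mean and variance of the geometric distribution can be checked numerically from the density itself. The sketch below assumes R's parameterization, in which dgeom(x, p) gives the probability of x failures before the first success; the upper limit of 1000 is an arbitrary choice that makes the truncated tail negligible for p = 0.2.

```r
p <- 0.2
x <- 0:1000          # long enough that the omitted tail is negligible
fx <- dgeom(x, p)    # probability of x failures before the first success

# dgeom agrees with the closed form p * (1 - p)^x
all.equal(fx, p * (1 - p)^x)

# Mean and variance computed directly from the density
m <- sum(x * fx)
v <- sum((x - m)^2 * fx)
c(mean = m, theory = (1 - p) / p)
c(variance = v, theory = (1 - p) / p^2)
```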

The geometric has a very long tail. Here are 100 random numbers from a geometric distribution with p = 0.1: the modes are 0 and 1, but outlying values as large as 33 and 44 have been generated:

    table(rgeom(100,0.1))
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17 18 21 22 24 28 29 31 33 44
    14 14  8  5 11  3  3  5  3  5  2  5  3  2  3  1  1  2  1  1  2  2  1  1  1  1

7.4.4 The hypergeometric distribution

‘Balls in urns’ are the classic sort of problem solved by this distribution. The density function of the hypergeometric is

$$f(x) = \frac{\dbinom{b}{x}\dbinom{N-b}{n-x}}{\dbinom{N}{n}}.$$

Suppose that there are N coloured balls in the statistician’s famous urn: b of them are blue and r = N – b of them are red. Now a sample of n balls is removed from the urn; this is sampling without replacement. Now f(x) gives the probability that x of these n balls are blue.

The built-in functions for the hypergeometric are used like this: dhyper(q,b,r,n) and rhyper(m,b,r,n). Here

- q is a vector of values of a random variable representing the number of blue balls out of a sample of size n drawn from an urn containing b blue balls and r red ones.
- b is the number of blue balls in the urn. This could be a vector with non-negative integer elements.
- r is the number of red balls in the urn = N – b. This could also be a vector with non-negative integer elements.
- n is the number of balls drawn from an urn with b blue and r red balls. This can be a vector like b and r.
- p is a vector of probabilities with values between 0 and 1.
- m is the number of hypergeometrically distributed random numbers to be generated.

Let the urn contain N = 20 balls, of which 6 are blue and 14 are red. We take a sample of n = 5 balls so x could be 0, 1, 2, 3, 4 or 5 of them blue, but since the proportion blue is only 6/20 the higher frequencies are most unlikely. Our example is evaluated like this:

    ph <- dhyper(0:5,6,14,5)
    barplot(ph,names=(0:5),col="red",xlab="outcome",ylab="probability")

[Figure: barplot of hypergeometric probability against outcome (0 to 5)]

We are very unlikely to get more than 3 blue balls out of 5. The most likely outcome is that we get 1 or 2 blue balls out of 5. We can simulate a set of Monte Carlo trials of size 5. Here are the numbers of blue balls obtained in 20 realizations of our example:

    rhyper(20,6,14,5)
    [1] 1 1 1 2 1 2 0 1 3 2 3 0 2 0 1 1 2 1 1 2

The binomial distribution is a limiting case of the hypergeometric which arises as N, b and r approach infinity in such a way that b/N approaches p, and r/N approaches 1 – p (see p. 308). This is because as the numbers get large, the fact that we are sampling without replacement becomes irrelevant. The binomial distribution assumes sampling with replacement from a finite population, or sampling without replacement from an infinite population.

7.4.5 The multinomial distribution

Suppose that there are t possible outcomes from an experimental trial, and the outcome i has probability $p_i$. Now allow n independent trials where $n = n_1 + n_2 + \dots + n_t$ and ask what is the probability of obtaining the vector of $N_i$ occurrences of the ith outcome:

$$P(N_i = n_i) = \frac{n!}{n_1!\, n_2!\, n_3! \cdots n_t!}\; p_1^{n_1} p_2^{n_2} p_3^{n_3} \cdots p_t^{n_t},$$

where i goes from 1 to t. Take an example with three outcomes (say black, red and blue, so t = 3), where the first outcome is twice as likely as the other two ($p_1 = 0.5$, $p_2 = 0.25$, $p_3 = 0.25$), noting that the probabilities sum to 1. It is sensible to start by writing a function called multi to carry out the calculations for any numbers of successes a, b and c (black, red and blue, respectively) given our three probabilities (above):

    multi <- function(a,b,c) {
      factorial(a+b+c)/(factorial(a)*factorial(b)*factorial(c))*0.5^a*0.25^b*0.25^c
    }
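A hand-written function like multi can be checked against R's built-in multinomial density, dmultinom, from the stats package. The sketch below repeats the definition of multi so that it runs on its own; the particular counts (13, 7, 4) and the three-trial enumeration are illustrative choices, not taken from the text.

```r
# multi() as defined in the text, with probabilities fixed at 0.5, 0.25 and 0.25
multi <- function(a, b, c) {
  factorial(a + b + c) / (factorial(a) * factorial(b) * factorial(c)) *
    0.5^a * 0.25^b * 0.25^c
}

# The built-in equivalent takes the vector of counts and the vector of probabilities
multi(13, 7, 4)
dmultinom(c(13, 7, 4), prob = c(0.5, 0.25, 0.25))

# The probabilities of all possible outcomes of 3 trials should sum to 1
outcomes <- expand.grid(a = 0:3, b = 0:3, c = 0:3)
outcomes <- outcomes[rowSums(outcomes) == 3, ]
total <- sum(mapply(multi, outcomes$a, outcomes$b, outcomes$c))
total
```

Using dmultinom avoids hard-coding the probabilities inside the function, at the cost of passing them on every call.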

We illustrate just one case, in which the third outcome (blue) is fixed at four successes out of 24 trials. This means that the first and second outcomes must add to 24 – 4 = 20. We plot the probability of obtaining different numbers of blacks from 0 to 20:

    barplot(sapply(0:20,function (i) multi(i,20-i,4)),names=0:20,cex.names=0.7,
            xlab="outcome",ylab="probability",col="yellow")

The most likely outcome for this example is that we would get 13 or 14 successes of type 1 (black) in a trial of size = 24 with probabilities 0.5, 0.25 and 0.25 for the three types of outcome, when the number of successes of the third case was 4 out of 24. Note the use of cex.names=0.7 to make the labels sufficiently small that all of the bars are given outcome names.

7.4.6 The Poisson distribution

This is one of the most useful and important of the discrete probability distributions for describing count data. We know how many times something happened (e.g. kicks from cavalry horses, lightning strikes, bomb hits), but we have no way of knowing how many times it did not happen. The Poisson is a one-parameter distribution with the interesting property that its variance is equal to its mean. A great many processes show variance increasing with the mean, often faster than linearly (see the negative binomial distribution below). The density function of the Poisson shows the probability of obtaining a count of x when the mean count per unit is λ:

$$p(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}.$$

The zero term of the Poisson (the probability of obtaining a count of zero) is obtained by setting x = 0:

$$p(0) = e^{-\lambda},$$

which is simply the antilog of minus the mean. Given p(0), it is clear that p(1) is just

$$p(1) = p(0)\lambda = \lambda e^{-\lambda},$$

and any subsequent probability is readily obtained by multiplying the previous probability by the mean and dividing by the count,

$$p(x) = p(x-1)\frac{\lambda}{x}.$$

Functions for the density, cumulative distribution, quantiles and random number generation of the Poisson distribution are obtained by dpois(x, lambda), ppois(q, lambda), qpois(p, lambda) and rpois(n, lambda), where lambda is the mean count per sample.

The Poisson distribution holds a central position in three quite separate areas of statistics:

- in the description of random spatial point patterns (see p. 838);
- as the frequency distribution of counts of rare but independent events (see p. 314);
- as the error distribution in GLMs for count data (see p. 579).
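The recursive calculation of Poisson probabilities described above can be sketched as follows and compared with the built-in dpois. The mean of 0.9 and the upper limits (10 and 100) are illustrative assumptions; the second part checks numerically that the mean and variance of the distribution are both equal to lambda.

```r
lambda <- 0.9    # illustrative mean count per sample
xmax <- 10

px <- numeric(xmax + 1)
px[1] <- exp(-lambda)                # zero term p(0)
for (x in 1:xmax) {
  px[x + 1] <- px[x] * lambda / x    # p(x) = p(x - 1) * lambda / x
}

# The recursion reproduces the built-in density
all.equal(px, dpois(0:xmax, lambda))

# Mean and variance computed from the density are both equal to lambda
xs <- 0:100
m <- sum(xs * dpois(xs, lambda))
v <- sum((xs - m)^2 * dpois(xs, lambda))
c(mean = m, variance = v)
```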

If we wanted 600 simulated counts from a Poisson distribution with a mean of, say, 0.90 blood cells per slide, we just type:

    count <- rpois(600,0.9)

We can use table to see the frequencies of each count generated:

    table(count)
    count
      0   1   2   3   4   5
    244 212 104  33   6   1

or hist to see a histogram of the counts:

    hist(count,breaks = -0.5:6.5,main="")

[Figure: histogram of count, Frequency against count]

Note the use of the vector of break points on integer increments from –0.5 to create integer bins for the histogram bars.

7.4.7 The negative binomial distribution

This discrete, two-parameter distribution is useful for describing the distribution of count data, where the variance is often much greater than the mean. The two parameters are the mean μ and the clumping parameter k, given by

$$k = \frac{\mu^{2}}{\sigma^{2} - \mu}.$$

The smaller the value of k, the greater the degree of clumping. The density function is

$$p(x) = \left(1 + \frac{\mu}{k}\right)^{-k} \frac{(k+x-1)!}{x!\,(k-1)!} \left(\frac{\mu}{\mu+k}\right)^{x}.$$

The zero term is found by setting x = 0 and simplifying:

$$p(0) = \left(1 + \frac{\mu}{k}\right)^{-k}.$$

Successive terms in the distribution can then be computed iteratively from

$$p(x) = p(x-1)\left(\frac{k+x-1}{x}\right)\left(\frac{\mu}{\mu+k}\right).$$

An initial estimate of the value of k can be obtained from the sample mean and variance,

$$k \approx \frac{\bar{x}^{2}}{s^{2} - \bar{x}}.$$

Since k cannot be negative, it is clear that the negative binomial distribution should not be fitted to data where the variance is less than the mean.

The maximum likelihood estimate of k is found numerically, by iterating progressively more fine-tuned values of k until the left- and right-hand sides of the following equation are equal:

$$n \ln\left(1 + \frac{\mu}{k}\right) = \sum_{x=0}^{\max} \left(\frac{A(x)}{k+x}\right),$$

where the vector A(x) contains the total frequency of values greater than x. You could write a function to work out the probability densities like this:

    negbin <- function(x,u,k)
      (1+u/k)^(-k)*(u/(u+k))^x*gamma(k+x)/(factorial(x)*gamma(k))

then use the function to produce a barplot of probability densities for a range of x values (say 0 to 10), for a distribution with specified mean and aggregation parameter (say μ = 0.8, k = 0.2) like this:

    xf <- sapply(0:10, function(i) negbin(i,0.8,0.2))
    barplot(xf,names=0:10,xlab="count",ylab="probability density",col="green")

[Figure: barplot of negative binomial probability density against count (0 to 10)]

There is another, quite different way of looking at the negative binomial distribution. Here, the response variable is the waiting time $W_r$ for the rth success:

$$f(x) = \binom{x-1}{r-1} p^{r} (1-p)^{x-r}.$$

It is important to realize that x starts at r and increases from there (obviously, the rth success cannot occur before the rth attempt). The density function dnbinom(x, size, prob) represents the number of failures x (e.g. tails in coin tossing) before size successes (or heads in coin tossing) are achieved, when the probability of a success (a head) is prob.

Suppose we are interested in the distribution of waiting times until the fifth success occurs in a negative binomial process with p = 0.1. We start the sequence of x values at 5:

    plot(5:100,dnbinom(5:100,5,0.1),type="s",xlab="x",ylab="f(x)")

[Figure: f(x) against x for x from 5 to 100]

This shows that the most likely number of failures before the 5th success, when the probability of a success is 1/10, is about 36 (recall that dnbinom counts failures rather than the total number of trials). Note that the negative binomial distribution is quite strongly skew to the right.

It is easy to generate negative binomial data using the random number generator: rnbinom(n, size, prob). The number of random numbers required is n. When the second parameter, size, is set to 1 the distribution becomes the geometric (see above). The final parameter, prob, is the probability of success per trial, p. Here we generate 100 counts with prob = 0.6, for which the expected mean is (1 – p)/p = 0.667:

    count <- rnbinom(100,1,0.6)

We can use table to see the frequency of the different counts:

    table(count)
    count
     0  1  2  3  5  6
    65 18 13  2  1  1

It is sensible to check that the sample mean is reasonably close to expectation:

    mean(count)
    [1] 0.61

The variance will be substantially greater than the mean:

    var(count)
    [1] 1.129192

This gives an estimate of k of

$$\frac{0.61^{2}}{1.129 - 0.61} = 0.717.$$

The following data show the number of spores counted on 238 buried glass slides. We are interested in whether these data are well described by a negative binomial distribution. If they are we would like to find the maximum likelihood estimate of the aggregation parameter k.

    x <- 0:12
    freq <- c(131,55,21,14,6,6,2,0,0,0,0,2,1)
    barplot(freq,names=x,ylab="frequency",xlab="spores",col="purple")

[Figure: barplot of frequency against spores (0 to 12)]

We start by looking at the variance–mean ratio of the counts. We cannot use mean and var directly, because our data are frequencies of counts, rather than counts themselves. This is easy to rectify: we use rep to create a vector of counts y in which each count (x) is repeated the relevant number of times (freq). Now we can use mean and var directly:

    y <- rep(x,freq)
    mean(y)
    [1] 1.004202
    var(y)
    [1] 3.075932

This shows that the data are highly aggregated (the variance–mean ratio is roughly 3, recalling that it would be 1 if the data were Poisson distributed). Our rough estimate of k is therefore

    mean(y)^2/(var(y)-mean(y))
    [1] 0.4867531

Here is a function that takes a vector of frequencies of counts x (between 0 and length(x) − 1) and computes the maximum likelihood estimate of k, the aggregation parameter:

    kfit <- function(x) {
      lhs <- numeric()
      rhs <- numeric()
      y <- 0:(length(x) - 1)
      j <- 0:(length(x)-2)
      m <- sum(x * y)/(sum(x))
      s2 <- (sum(x * y^2) - sum(x * y)^2/sum(x))/(sum(x)- 1)
      k1 <- m^2/(s2 - m)
      a <- numeric(length(x)-1)
      for(i in 1:(length(x) - 1)) a[i] <- sum(x [- c(1:i)])
      i <- 0
      for (k in seq(k1/1.2,2*k1,0.001)) {
        i <- i+1
        lhs[i] <- sum(x) * log(1 + m/k)
        rhs[i] <- sum(a/(k + j))
      }
      k <- seq(k1/1.2,2*k1,0.001)
      plot(k, abs(lhs-rhs),xlab="k",ylab="Difference",type="l",col="red")
      d <- min(abs(lhs-rhs))
      sdd <- which(abs(lhs-rhs)==d)
      k[sdd]
    }

We can try it out with our spore count data.

    kfit(freq)
    [1] 0.5826276

[Figure: Difference against k, for k from 0.4 to 0.9]

The minimum difference is close to zero and occurs at about k = 0.58. The printout shows that the maximum likelihood estimate of k is 0.582 (to the 3 decimal places we simulated; the last 4 decimals (6276) are meaningless and would not be printed in a more polished function).

How would a negative binomial distribution with a mean of 1.0042 and a k value of 0.582 describe our count data? The expected frequencies are obtained by multiplying the probability density (above) by the total sample size (238 slides in this case).

    nb <- 238*(1+1.0042/0.582)^(-0.582)*factorial(.582+(0:12)-1)/
      (factorial(0:12)*factorial(0.582-1))*(1.0042/(1.0042+0.582))^(0:12)

We shall compare the observed and expected frequencies using barplot. We intend to alternate the observed and expected frequencies. There are three steps to the procedure:

- Concatenate the observed and expected frequencies in an alternating sequence.

- Create a list of labels to name the bars (alternating blanks and counts).
- Produce a legend to describe the different bar colours.

The concatenated list of frequencies (called both) is made like this, putting the 13 observed counts (freq) in the odd-numbered bars and the 13 expected counts (nb) in the even-numbered bars (note the use of modulo %% to do this):

    both <- numeric(26)
    both[1:26 %% 2 != 0] <- freq
    both[1:26 %% 2 == 0] <- nb

Because adjacent green and blue bars refer to the same count (the observed and expected frequencies) we do not want to use barplot’s built-in names argument for labelling the bars (it would want to write a label on every bar, 26 labels in all). Instead, we want to write the count just once for each pair of bars, located beneath the observed (green) bars, using as.character(0:12). The trick is to produce a vector of length 26 containing the repeated bar labels, then replace the even-numbered entries with blanks like this (using modulo to pick out the even numbers):

    labs <- as.character(rep(0:12,each=2))
    labs[1:26%%2==0] <- ""

Now we can draw the combined barplot specifying cex.names=0.8 to ensure that all the bar labels are small enough to be printed:

    barplot(both,col=rep(c(3,4),13),ylab="frequency",names=labs,cex.names=0.8)

The legend function creates a legend to show which bars represent the observed frequencies (green in this case) and which represent the expected, negative binomial frequencies (blue bars). Just click when the cursor is in the position where you want the top left-hand corner of the legend box to be:

    legend(locator(1),c("observed","expected"),fill=c(3,4))

[Figure: barplot of observed and expected frequency for counts 0 to 12, with legend]

The fit is very close, so we can be reasonably confident in describing the observed counts as negative binomially distributed. The tail of the observed distribution is rather fatter than the expected negative binomial tail, so we might want to measure the lack of fit between observed and expected distributions. A simple way to do this is to use Pearson’s chi-squared, taking care to use only those cases where the expected frequency nb is greater than 5:

    sum(((freq-nb)^2/nb)[nb > 5])
    [1] 1.634975

This is based on five legitimate comparisons,

    sum(nb>5)
    [1] 5

and hence on 5 – p – 1 = 2 d.f. because we have estimated p = 2 parameters from the data in estimating the expected distribution (the mean and k of the negative binomial) and lost one degree of freedom for contingency (the total number of counts must add up to 238). Our calculated value of chi-squared = 1.63 is much less than the value in tables:

    qchisq(0.95,2)
    [1] 5.991465

so we accept the hypothesis that our data are not significantly different from a negative binomial with mean = 1.0042 and k = 0.582.

7.4.8 The Wilcoxon rank-sum statistic

This function calculates the distribution of the Wilcoxon rank-sum statistic (also known as Mann–Whitney), and returns values for the exact probability at discrete values of q: dwilcox(q, m, n). Here q is a vector of quantiles, m is the number of observations in sample x (a positive integer not greater than 50), and n is the number of observations in sample y (also a positive integer not greater than 50). The Wilcoxon rank-sum statistic is the sum of the ranks of x in the combined sample c(x,y). The Wilcoxon rank-sum statistic takes on values W between the limits

$$\frac{m(m+1)}{2} \le W \le \frac{m(m+2n+1)}{2}.$$

This statistic can be used for a non-parametric test of location shift between the parent populations x and y.

7.5 Matrix algebra

There is a comprehensive set of functions for handling matrices in R.
We begin with a matrix called a that has three rows and two columns. Data are typically entered into matrices columnwise, so the first three numbers (1, 0, 4) go in column 1 and the second three numbers (2, –1, 1) go in column 2:

    a <- matrix(c(1,0,4,2,-1,1),nrow=3)
    a
         [,1] [,2]
    [1,]    1    2
    [2,]    0   -1
    [3,]    4    1

Our second matrix, called b, has the same number of columns as a has rows (i.e. three in this case). Entered columnwise, the first two numbers (1, –1) go in column 1, the second two numbers (2, 1) go in column 2, and the last two numbers (1, 0) go in column 3:

    b <- matrix(c(1,-1,2,1,1,0),nrow=2)
    b
         [,1] [,2] [,3]
    [1,]    1    2    1
    [2,]   -1    1    0

7.5.1 Matrix multiplication

To multiply one matrix by another matrix you take the rows of the first matrix and the columns of the second matrix. Put the first row of a side by side with the first column of b:

    a[1,]
    [1] 1 2
    b[,1]
    [1]  1 -1

and work out the point products:

    a[1,]*b[,1]
    [1]  1 -2

then add up the point products

    sum(a[1,]*b[,1])
    [1] -1

The sum of the point products is –1 and this is the first element of the product matrix. Next, put the first row of a with the second column of b:

    a[1,]
    [1] 1 2
    b[,2]
    [1] 2 1
    a[1,]*b[,2]
    [1] 2 2
    sum(a[1,]*b[,2])
    [1] 4

so the point products are 2, 2 and the sum of the point products is 2 + 2 = 4. So 4 goes in row 1 and column 2 of the answer. Then take the last column of b and match it against the first row of a:

    a[1,]*b[,3]
    [1] 1 0
    sum(a[1,]*b[,3])
    [1] 1

so the sum of the point products is 1 + 0 = 1. This goes in row 1, column 3 of the answer. And so on. We repeat these steps for row 2 of matrix a (0, –1) and then again for row 3 of matrix a (4, 1) to obtain the complete matrix of the answer. In R, the symbol for matrix multiplication is %*%. Here is the full answer:

    a %*% b
         [,1] [,2] [,3]
    [1,]   -1    4    1
    [2,]    1   -1    0
    [3,]    3    9    4

where you see the values we calculated by hand (–1, 4, 1) in the first row.

It is important to understand that with matrices a times b is not the same as b times a. The matrix resulting from a multiplication has the number of rows of the matrix on the left (a has 3 rows in the case above). But b has just two rows, so multiplication

    b %*% a
         [,1] [,2]
    [1,]    5    1
    [2,]   -1   -3

produces a matrix with 2 rows. The value 5 in row 1 column 1 of the answer is the sum of the point products (1 × 1) + (2 × 0) + (1 × 4) = 1 + 0 + 4 = 5.

7.5.2 Diagonals of matrices

To create a diagonal matrix of 3 rows and 3 columns, with 1s on the diagonal use the diag function like this:

    (ym <- diag(1,3,3))
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    1    0
    [3,]    0    0    1

You can alter the values of the diagonal elements of a matrix like this:

    diag(ym) <- 1:3
    ym
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    2    0
    [3,]    0    0    3

or extract a vector containing the diagonal elements of a matrix like this:

    diag(ym)
    [1] 1 2 3

You might want to extract the diagonal of a variance–covariance matrix:

    M <- cbind(X=1:5, Y=rnorm(5))
    var(M)
               X          Y
    X 2.50000000 0.04346324
    Y 0.04346324 0.88056034
    diag(var(M))
            X         Y
    2.5000000 0.8805603

7.5.3 Determinant

The determinant of the square (2 × 2) array

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

is defined for any numbers a, b, c and d as

$$\begin{vmatrix} a & b \\ c & d \end{vmatrix} \equiv ad - bc.$$

Suppose that A is a square matrix of order (3 × 3):

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}.$$

Then the third-order determinant of A is defined to be the number

$$\det A = a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}.$$

Applying the rule $\begin{vmatrix} a & b \\ c & d \end{vmatrix} \equiv ad - bc$ to this equation gives

$$\det A = a_{11}a_{22}a_{33} - a_{11}a_{23}a_{32} + a_{12}a_{23}a_{31} - a_{12}a_{21}a_{33} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31}.$$
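The expansion of a third-order determinant along its first row translates directly into R. The function below is a hypothetical helper written for illustration (the name det3 and the test matrix A3 are assumptions, not part of the text); it is a sketch for 3 × 3 matrices only, and the built-in det function should be preferred in practice.

```r
# Expansion of a 3 x 3 determinant along the first row (illustrative helper)
det3 <- function(A) {
  stopifnot(is.matrix(A), all(dim(A) == c(3, 3)))
  det2 <- function(M) M[1, 1] * M[2, 2] - M[1, 2] * M[2, 1]   # ad - bc
  # Each 2 x 2 array is A with row 1 and one column deleted
  A[1, 1] * det2(A[-1, -1]) -
    A[1, 2] * det2(A[-1, -2]) +
    A[1, 3] * det2(A[-1, -3])
}

# An arbitrary test matrix, entered columnwise
A3 <- matrix(c(2, 0, 1, 1, 3, -1, 0, 4, 5), nrow = 3)
A3
det3(A3)
det(A3)    # built-in determinant, for comparison
all.equal(det3(A3), det(A3))
```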

348 Take a numerical example:

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 1 \\ 4 & 1 & 2 \end{bmatrix}.$$

This has determinant

$$\det A = (1 \times 1 \times 2) - (1 \times 1 \times 1) + (2 \times 1 \times 4) - (2 \times 2 \times 2) + (3 \times 2 \times 1) - (3 \times 1 \times 4) = 2 - 1 + 8 - 8 + 6 - 12 = -5.$$

Here is the example in R using the determinant function det:

    A <- matrix(c(1,2,4,2,1,1,3,1,2),nrow=3)
    A
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    2    1    1
    [3,]    4    1    2
    det(A)
    [1] -5

The great thing about determinants is that if any row or column of a determinant is multiplied by a scalar λ, then the value of the determinant is multiplied by λ (since a factor λ will appear in each of the products). Also, if all the elements of a row or a column are zero then the determinant |A| = 0. Again, if all the corresponding elements of two rows or columns of |A| are equal then |A| = 0.

For instance, here is the bottom row of A multiplied by 3:

    B <- A
    B[3,] <- 3*B[3,]
    B
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    2    1    1
    [3,]   12    3    6

and here is the determinant:

    det(B)
    [1] -15

Here is an example when all the elements of column 2 are zero, so det C = 0:

    C <- A
    C[,2] <- 0
    C
         [,1] [,2] [,3]
    [1,]    1    0    3
    [2,]    2    0    1
    [3,]    4    0    2
    det(C)
    [1] 0

349 If det A ≠ 0 then the rows and columns of A must be linearly independent. This important concept is expanded in terms of contrast coefficients on p. 434.

7.5.4 Inverse of a matrix

The operation of division is not defined for matrices. However, for a square matrix that has |A| ≠ 0 a multiplicative inverse matrix denoted by $A^{-1}$ can be defined. This multiplicative inverse $A^{-1}$ is unique and has the property that

$$AA^{-1} = A^{-1}A = I,$$

where I is the unit matrix. So if A is a square matrix for which |A| ≠ 0 the matrix inverse is defined by the relationship

$$A^{-1} = \frac{\operatorname{adj} A}{|A|},$$

where the adjoint matrix of A (adj A) is the transpose of the matrix of cofactors of A. The cofactors of A are computed as $A_{ij} = (-1)^{i+j} M_{ij}$, where $M_{ij}$ are the 'minors' of the elements $a_{ij}$ (these are the determinants of the matrices of A from which row i and column j have been deleted). The properties of the inverse matrix can be laid out for two non-singular square matrices, A and B, of the same order as follows:

$$AA^{-1} = A^{-1}A = I,$$
$$(AB)^{-1} = B^{-1}A^{-1},$$
$$(A^{-1})' = (A')^{-1},$$
$$(A^{-1})^{-1} = A,$$
$$|A| = \frac{1}{|A^{-1}|}.$$

Here is R's version of the inverse of the 3 × 3 matrix A (above) using the ginv function from the MASS library:

    library(MASS)
    ginv(A)
                  [,1] [,2] [,3]
    [1,] -2.000000e-01  0.2  0.2
    [2,] -2.224918e-16  2.0 -1.0
    [3,]  4.000000e-01 -1.4  0.6

where the number in row 2 column 1 is zero (except for rounding error). Here is the penultimate rule, $(A^{-1})^{-1} = A$, evaluated by R:

350

    ginv(ginv(A))
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    2    1    1
    [3,]    4    1    2

Here is the last rule, $|A| = 1/|A^{-1}|$:

    1/det(ginv(A))
    [1] -5

7.5.5 Eigenvalues and eigenvectors

We have a square matrix A and two column vectors X and K, where

$$AX = K,$$

and we want to discover the scalar multiplier λ such that

$$AX = \lambda X.$$

This is equivalent to $(A - \lambda I)X = 0$, where I is the unit matrix. This has a non-trivial solution only when the determinant associated with the coefficient matrix A vanishes, so we must have

$$|A - \lambda I| = 0.$$

When expanded, this determinant gives rise to an algebraic equation of degree n in λ called the characteristic equation. It has n roots $\lambda_1, \lambda_2, \ldots, \lambda_n$, each of which is called an eigenvalue. The corresponding solution vector $X_i$ is called an eigenvector of A corresponding to $\lambda_i$.

Here is an example from population ecology. The matrix A shows the demography of different age classes: the top row shows fecundity (the number of females born per female of each age) and the sub-diagonals show survival rates (the fraction of one age class that survives to the next age class). When these numbers are constants the matrix is known as the Leslie matrix. In the absence of density dependence the constant parameter values in A will lead either to exponential increase in total population size (if $\lambda_1 > 1$) or exponential decline (if $\lambda_1 < 1$) once the initial transients in age structure have damped away. Once exponential growth has been achieved, then the age structure, as reflected by the proportion of individuals in each age class, will be a constant. This is known as the first eigenvector.

Consider the Leslie matrix, L, which is to be multiplied by a column matrix of age-structured population sizes, n:

    L <- c(0,0.7,0,0,6,0,0.5,0,3,0,0,0.3,1,0,0,0)
    L <- matrix(L,nrow=4)

Note that the elements of the matrix are entered in columnwise, not row-wise sequence.
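For readers who find it easier to type a matrix as it appears on the page, the same Leslie matrix can be entered row by row with the byrow argument of matrix. This is a small sketch rather than part of the original text; the names L_col and L_row are hypothetical and are used only to keep the two versions apart.

```r
# Columnwise entry, as in the text
L_col <- matrix(c(0, 0.7, 0, 0,  6, 0, 0.5, 0,  3, 0, 0, 0.3,  1, 0, 0, 0),
                nrow = 4)

# Row-wise entry of the same numbers: fecundities in row 1,
# survivorships on the sub-diagonal
L_row <- matrix(c(0,   6,   3,   1,
                  0.7, 0,   0,   0,
                  0,   0.5, 0,   0,
                  0,   0,   0.3, 0),
                nrow = 4, byrow = TRUE)

identical(L_col, L_row)   # TRUE when both layouts describe the same matrix
```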
We make sure that the Leslie matrix is properly conformed:

    L
         [,1] [,2] [,3] [,4]
    [1,]  0.0  6.0  3.0    1
    [2,]  0.7  0.0  0.0    0
    [3,]  0.0  0.5  0.0    0
    [4,]  0.0  0.0  0.3    0

351 The top row contains the age-specific fecundities (e.g. 2-year-olds produce six female offspring per year), and the sub-diagonal contains the survivorships (70% of 1-year-olds become 2-year-olds, etc.). Now the population sizes at each age go in a column vector, n:

    n <- c(45,20,17,3)
    n <- matrix(n,ncol=1)
    n
         [,1]
    [1,]   45
    [2,]   20
    [3,]   17
    [4,]    3

Population sizes next year in each of the four age classes are obtained by matrix multiplication:

    L %*% n
          [,1]
    [1,] 174.0
    [2,]  31.5
    [3,]  10.0
    [4,]   5.1

We can check this the long way. The number of juveniles next year (the first element of n) is the sum of all the babies born last year:

    45*0+20*6+17*3+3*1
    [1] 174

We write a function to carry out the matrix multiplication, giving next year's population vector as a function of this year's:

    fun <- function(x) L %*% x

Now we can simulate the population dynamics over a period long enough (say, 40 generations) for the age structure to approach stability. So long as the population growth rate λ > 1 the population will increase exponentially, once the age structure has stabilized:

    n <- c(45,20,17,3)
    n <- matrix(n,ncol=1)
    structure <- numeric(160)
    dim(structure) <- c(40,4)
    for (i in 1:40) {
      n <- fun(n)
      structure[i,] <- n
    }
    matplot(1:40,log(structure),type="l")

352 [Figure: log(structure) plotted against 1:40, one line per age class]

You can see that after some initial transient fluctuations, the age structure has more or less stabilized by year 20 (the lines for log population size of juveniles (top line), 1-, 2- and 3-year-olds are parallel). By year 40 the population is growing exponentially in size, multiplying by a constant of λ each year.

The population growth rate (the per-year multiplication rate, λ) is approximated by the ratio of total population sizes in the 40th and 39th years:

    sum(structure[40,])/sum(structure[39,])
    [1] 2.164035

and the approximate stable age structure is obtained from the 40th value of n:

    structure[40,]/sum(structure[40,])
    [1] 0.709769309 0.230139847 0.052750539 0.007340305

The exact values of the population growth rate and the stable age distribution are obtained by matrix algebra: they are the dominant eigenvalue and dominant eigenvector, respectively. Use the function eigen applied to the Leslie matrix, L, like this:

    eigen(L)
    $values
    [1]  2.1694041+0.0000000i -1.9186627+0.0000000i -0.1253707+0.0975105i
    [4] -0.1253707-0.0975105i

    $vectors
                   [,1]           [,2]                    [,3]                    [,4]
    [1,] 0.949264118+0i -0.93561508+0i -0.01336028-0.03054433i -0.01336028+0.03054433i
    [2,] 0.306298338+0i  0.34134741+0i -0.03616819+0.14241169i -0.03616819-0.14241169i
    [3,] 0.070595039+0i -0.08895451+0i  0.36511901-0.28398118i  0.36511901+0.28398118i
    [4,] 0.009762363+0i  0.01390883+0i -0.87369452+0.00000000i -0.87369452+0.00000000i
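Because two of the eigenvalues are complex, eigen reports everything as complex numbers, which makes the output harder to read than it needs to be. The sketch below (an addition, not from the original text) picks out the dominant eigenvalue by its modulus, discards the zero imaginary parts with Re, and confirms that the chosen vector really satisfies Lv = λv. The object names e, i, lambda, v and stable are illustrative choices.

```r
L <- matrix(c(0, 0.7, 0, 0,  6, 0, 0.5, 0,  3, 0, 0, 0.3,  1, 0, 0, 0),
            nrow = 4)

e <- eigen(L)
i <- which.max(Mod(e$values))   # position of the eigenvalue with largest modulus
lambda <- Re(e$values[i])       # dominant eigenvalue as an ordinary real number
v <- Re(e$vectors[, i])         # matching eigenvector, imaginary parts dropped
stable <- v / sum(v)            # rescale so the proportions sum to 1

lambda
stable
max(abs(L %*% v - lambda * v))  # should be zero apart from rounding error
```

Selecting by modulus rather than by position avoids relying on the order in which eigen happens to return the eigenvalues.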

353 The dominant eigenvalue is 2.1694 (compared with our empirical approximation of 2.1640 after 40 years). The stable age distribution is given by the first eigenvector (column 1, above), which we need to turn into proportions:

    eigen(L)$vectors[,1]/sum(eigen(L)$vectors[,1])
    [1] 0.710569659+0i 0.229278977+0i 0.052843768+0i 0.007307597+0i

This compares with our approximation (above) in which the proportion in the first age class was 0.709 77 after 40 years (rather than 0.710 57).

7.5.6 Matrices in statistical models

Perhaps the main use of matrices in R is in statistical calculations, in generalizing the calculation of sums of squares and sums of products (see p. 450 for background). Here are the data used in Chapter 10 to introduce the calculation of sums of squares in linear regression:

    numbers <- read.table("c:\\temp\\tannin.txt",header=T)
    attach(numbers)
    names(numbers)
    [1] "growth" "tannin"

The response variable is growth (y) and the explanatory variable is tannin concentration (x) in the diet of a group of insect larvae. We need the famous five (see p. 453): the sum of the y values,

    growth
    [1] 12 10  8 11  6  7  2  3  3
    sum(growth)
    [1] 62

the sum of the squares of the y values,

    growth^2
    [1] 144 100  64 121  36  49   4   9   9
    sum(growth^2)
    [1] 536

the sum of the x values,

    tannin
    [1] 0 1 2 3 4 5 6 7 8
    sum(tannin)
    [1] 36

the sum of the squares of the x values,

    tannin^2
    [1]  0  1  4  9 16 25 36 49 64

354

    sum(tannin^2)
    [1] 204

and finally, to measure the covariation between x and y, we need the sum of the products,

    growth*tannin
    [1]  0 10 16 33 24 35 12 21 24
    sum(growth*tannin)
    [1] 175

You can see at once that for more complicated models (such as multiple regression) it is essential to be able to generalize and streamline this procedure. This is where matrices come in. Matrix multiplication involves the calculation of sums of products where a row vector is multiplied by a column vector of the same length to obtain a single value. Thus, we should be able to obtain the required sum of products, 175, by using matrix multiplication symbol %*% in place of the regular multiplication symbol:

    growth %*% tannin
         [,1]
    [1,]  175

That works fine. But what about sums of squares? Surely if we use matrix multiplication on the same vector we will get an object with many rows (nine in this case). Not so.

    growth %*% growth
         [,1]
    [1,]  536

R has coerced the left-hand vector of growth into a row vector in order to obtain the desired result. You can override this, if for some reason you wanted the answer to have nine rows, by specifying the transpose t() of the right-hand growth vector,

    growth %*% t(growth)
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
     [1,]  144  120   96  132   72   84   24   36   36
     [2,]  120  100   80  110   60   70   20   30   30
     [3,]   96   80   64   88   48   56   16   24   24
     [4,]  132  110   88  121   66   77   22   33   33
     [5,]   72   60   48   66   36   42   12   18   18
     [6,]   84   70   56   77   42   49   14   21   21
     [7,]   24   20   16   22   12   14    4    6    6
     [8,]   36   30   24   33   18   21    6    9    9
     [9,]   36   30   24   33   18   21    6    9    9

but, of course, that is not what we want. R's default is what we need. So this should also work in obtaining the sum of squares of the explanatory variable:

    tannin %*% tannin
         [,1]
    [1,]  204

355 So far, so good. But how do we obtain the sums using matrix multiplication? The trick here is to matrix multiply the vector by a vector of 1s: here are the sum of the y values:

    growth %*% rep(1,9)
         [,1]
    [1,]   62

and the sum of the x values,

    tannin %*% rep(1,9)
         [,1]
    [1,]   36

Finally, can we use matrix multiplication to arrive at the sample size, n? We do this by matrix multiplying a row vector of 1s by a column vector of 1s. This rather curious operation produces the right result, by adding up the nine 1s that result from the nine repeats of the calculation 1 × 1:

    rep(1,9) %*% rep(1,9)
         [,1]
    [1,]    9

But how do we get all of the famous five in a single matrix? The thing to understand is the dimensionality of such a matrix. It needs to contain sums as well as sums of products. We have two variables (growth and tannin) and their matrix multiplication produces a single scalar value (see above). In order to get to the sums of squares as well as the sums of products we use cbind to create a 9 × 2 matrix like this:

    a <- cbind(growth,tannin)
    a
          growth tannin
     [1,]     12      0
     [2,]     10      1
     [3,]      8      2
     [4,]     11      3
     [5,]      6      4
     [6,]      7      5
     [7,]      2      6
     [8,]      3      7
     [9,]      3      8

To obtain a results table with 2 rows rather than 9 rows we need to multiply the transpose of matrix a by matrix a:

    t(a) %*% a
           growth tannin
    growth    536    175
    tannin    175    204

356 That's OK as far as it goes, but it has only given us the sums of squares (536 and 204) and the sum of products (175). How do we get the sums as well? The trick is to bind a column of 1s onto the left of matrix a:

    b <- cbind(1,growth,tannin)
    b
            growth tannin
     [1,] 1     12      0
     [2,] 1     10      1
     [3,] 1      8      2
     [4,] 1     11      3
     [5,] 1      6      4
     [6,] 1      7      5
     [7,] 1      2      6
     [8,] 1      3      7
     [9,] 1      3      8

It would look better if the first column had a variable name: let's call it sample:

    dimnames(b)[[2]][1] <- "sample"

Now to get a summary table of sums as well as sums of products, we matrix multiply b by itself. We want the answer to have three rows (rather than nine) so we matrix multiply the transpose of b (which has three rows) by b (which has nine rows):

    t(b) %*% b
           sample growth tannin
    sample      9     62     36
    growth     62    536    175
    tannin     36    175    204

So there you have it. All of the famous five, plus the sample size, in a single matrix multiplication.

7.5.7 Statistical models in matrix notation

We continue this example to show how matrix algebra is used to generalize the procedures used in linear modelling (such as regression or analysis of variance) based on the values of the famous five. We want to be able to determine the parameter estimates (such as the intercept and slope of a linear regression) and to apportion the total sum of squares between variation explained by the model (SSR) and unexplained variation (SSE). Expressed in matrix terms, the linear regression model is

$$Y = Xb + e,$$

and we want to determine the least-squares estimate of b, given by

$$b = (X'X)^{-1}X'Y,$$

and then carry out the analysis of variance

$$b'X'Y.$$

357 We look at each of these in turn. The response variable Y, 1 and the errors e are simple n × 1 column vectors, X is an n × 2 matrix and β is a 2 × 1 vector of coefficients, as follows:

$$Y = \begin{bmatrix} 12 \\ 10 \\ 8 \\ 11 \\ 6 \\ 7 \\ 2 \\ 3 \\ 3 \end{bmatrix}, \quad X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \\ 1 & 7 \\ 1 & 8 \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \\ e_8 \\ e_9 \end{bmatrix}, \quad \mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}.$$

The y vector and the 1 vector are created like this:

    Y <- growth
    one <- rep(1,9)

The sample size is given by 1′1 (transpose of vector 1 times vector 1):

    t(one) %*% one
         [,1]
    [1,]    9

The vector of explanatory variable(s) X is created by binding a column of ones to the left:

    X <- cbind(1,tannin)
    X
            tannin
     [1,] 1      0
     [2,] 1      1
     [3,] 1      2
     [4,] 1      3
     [5,] 1      4
     [6,] 1      5
     [7,] 1      6
     [8,] 1      7
     [9,] 1      8

In this notation

$$Y'Y = y_1^2 + y_2^2 + \cdots + y_n^2 = \sum y^2,$$

    t(Y) %*% Y
         [,1]
    [1,]  536

358

$$\mathbf{1}'Y = y_1 + y_2 + \cdots + y_n = \sum y = n\bar{y},$$

    t(one) %*% Y
         [,1]
    [1,]   62

$$Y'\mathbf{1}\mathbf{1}'Y = \left(\sum y\right)^2.$$

    t(Y) %*% one %*% t(one) %*% Y
         [,1]
    [1,] 3844

For the matrix of explanatory variables, we see that X′X gives a 2 × 2 matrix containing n, $\sum x$ and $\sum x^2$. The numerical values are easy to find using matrix multiplication:

    t(X) %*% X
              tannin
            9     36
    tannin 36    204

Note that X′X (a 2 × 2 matrix) is completely different from XX′ (a 9 × 9 matrix). The matrix X′Y gives a 2 × 1 matrix containing $\sum y$ and the sum of products $\sum xy$:

    t(X) %*% Y
           [,1]
             62
    tannin  175

Now, using the beautiful symmetry of the normal equations,

$$b_0 n + b_1 \sum x = \sum y,$$
$$b_0 \sum x + b_1 \sum x^2 = \sum xy,$$

we can write the regression directly in matrix form as

$$X'Xb = X'Y$$

because we already have the necessary matrices to form the left- and right-hand sides. To find the least-squares parameter values b we need to divide both sides by X′X. This involves calculating the inverse of the X′X matrix. The inverse exists only when the matrix is square and when its determinant is non-zero (i.e. the matrix is non-singular). The inverse contains $-\bar{x}$ and $\sum x^2$ as its terms, with $SSX = \sum (x - \bar{x})^2$, the sum of the squared differences between the x values and mean x, or n.SSX as the denominator:

$$(X'X)^{-1} = \begin{bmatrix} \dfrac{\sum x^2}{n\sum (x - \bar{x})^2} & \dfrac{-\bar{x}}{\sum (x - \bar{x})^2} \\[2ex] \dfrac{-\bar{x}}{\sum (x - \bar{x})^2} & \dfrac{1}{\sum (x - \bar{x})^2} \end{bmatrix}.$$

359 When every element of a matrix has a common factor, it can be taken outside the matrix. Here, the term 1/(n.SSX) can be taken outside to give

$$(X'X)^{-1} = \frac{1}{n\sum (x - \bar{x})^2} \begin{bmatrix} \sum x^2 & -\sum x \\ -\sum x & n \end{bmatrix}.$$

Computing the numerical value of this is easy using the matrix function ginv:

    library(MASS)
    ginv(t(X) %*% X)
                [,1]        [,2]
    [1,]  0.37777778 -0.06666667
    [2,] -0.06666667  0.01666667

Now we can solve the normal equations

$$(X'X)^{-1}X'Xb = (X'X)^{-1}X'Y,$$

using the fact that $(X'X)^{-1}X'X = I$ to obtain the important general result:

$$b = (X'X)^{-1}X'Y.$$

    ginv(t(X) %*% X) %*% t(X) %*% Y
              [,1]
    [1,] 11.755556
    [2,] -1.216667

which you will recognize from our hand calculations as the intercept and slope respectively (see p. 455).

The ANOVA computations are as follows. The correction factor is

$$CF = Y'\mathbf{1}\mathbf{1}'Y / n.$$

    CF <- t(Y) %*% one %*% t(one) %*% Y/9
    CF
             [,1]
    [1,] 427.1111
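Before going on to the sums of squares, the parameter estimates can be cross-checked without the tannin.txt file by typing in the nine pairs of values printed earlier. The sketch below is an addition to the text and makes two substitutions that should be read as assumptions rather than as the author's method: it uses base R's solve and crossprod to solve the normal equations directly instead of forming the inverse with ginv, and it compares the result with lm.

```r
# Data as printed in Section 7.5.6
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
tannin <- 0:8

X <- cbind(1, tannin)   # column of 1s bound to the left, as in the text
Y <- growth

# Solve (X'X) b = X'Y without explicitly inverting X'X
b <- solve(crossprod(X), crossprod(X, Y))

# Least-squares fit by the usual route, for comparison
fit <- lm(growth ~ tannin)

cbind(matrix_algebra = drop(b), lm = coef(fit))
```

Solving the linear system directly is generally preferred to inverting and then multiplying, because it does less arithmetic and accumulates less rounding error; for a 2 × 2 problem like this one the difference is negligible.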

360 The total sum of squares, SSY, is Y′Y − CF:

    t(Y) %*% Y - CF
             [,1]
    [1,] 108.8889

The regression sum of squares, SSR, is b′X′Y − CF:

    b <- ginv(t(X) %*% X) %*% t(X) %*% Y
    t(b) %*% t(X) %*% Y - CF
             [,1]
    [1,] 88.81667

The error sum of squares, SSE, is Y′Y − b′X′Y:

    t(Y) %*% Y - t(b) %*% t(X) %*% Y
             [,1]
    [1,] 20.07222

You should check these figures against the hand calculations on p. 457. Obviously, this is not a sensible way to carry out a single linear regression, but it demonstrates how to generalize the calculations for cases that have two or more continuous explanatory variables.

7.6 Solving systems of linear equations using matrices

Suppose we have two equations containing two unknown variables:

$$3x + 4y = 12,$$
$$x + 2y = 8.$$

We can use the function solve to find the values of the variables if we provide it with two matrices:

• a square matrix A containing the coefficients (3, 1, 4 and 2, columnwise);
• a column vector kv containing the known values (12 and 8).

We set the two matrices up like this (columnwise, as usual):

    A <- matrix(c(3,1,4,2),nrow=2)
    A
         [,1] [,2]
    [1,]    3    4
    [2,]    1    2
    kv <- matrix(c(12,8),nrow=2)
    kv
         [,1]
    [1,]   12
    [2,]    8

361 Now we can solve the simultaneous equations, using the function solve like this:

    solve(A,kv)
         [,1]
    [1,]   -4
    [2,]    6

to give x = −4 and y = 6 (which you can easily verify by hand). The function is most useful when there are many simultaneous equations to be solved.

7.7 Calculus

The rules of differentiation and integration are known to R. You will use them in modelling (e.g. in calculating starting values in non-linear regression) and for numeric minimization using optim. Read the help files ?D and ?integrate to understand the limitations of these functions.

7.7.1 Derivatives

The R function for symbolic and algorithmic derivatives of simple expressions is D. Here are some simple examples to give you the idea. See also ?deriv.

    D(expression(2*x^3),"x")
    2 * (3 * x^2)
    D(expression(log(x)),"x")
    1/x
    D(expression(a*exp(-b * x)),"x")
    -(a * (exp(-b * x) * b))
    D(expression(a/(1+b*exp(-c * x))),"x")
    a * (b * (exp(-c * x) * c))/(1 + b * exp(-c * x))^2
    trig.exp <- expression(sin(cos(x + y^2)))
    D(trig.exp, "x")
    -(cos(cos(x + y^2)) * sin(x + y^2))

7.7.2 Integrals

The R function is integrate. Here are some simple examples to give you the idea:

    integrate(dnorm,0,Inf)
    0.5 with absolute error < 4.7e-05
    integrate(dnorm,-Inf,Inf)
    1 with absolute error < 9.4e-05
    integrate(function(x) rep(2, length(x)), 0, 1)
    2 with absolute error < 2.2e-14
    integrand <- function(x) {1/((x+1)*sqrt(x))}
    integrate(integrand, lower = 0, upper = Inf)
    3.141593 with absolute error < 2.7e-05
    xv <- seq(0,10,0.1)
    plot(xv,integrand(xv),type="l")

[Figure: integrand(xv) plotted against xv]

362 The area under the curve is π = 3.141 593.

7.7.3 Differential equations

We need to solve a system of ordinary differential equations (ODEs) using classical Runge–Kutta integration from the deSolve package (Soetaert et al., 2012):

    install.packages("deSolve")
    library(deSolve)

The example involves a simple resource-limited plant–herbivore interaction where V = vegetation and N = herbivore population. We need to specify two differential equations: one for the vegetation (dV/dt) and one for the herbivore population (dN/dt):

$$\frac{dV}{dt} = rV\left(\frac{K - V}{K}\right) - bVN,$$
$$\frac{dN}{dt} = cVN - dN.$$

363 The steps involved in solving these ODEs in R are as follows:

• Define a function (called phmodel in this case) containing the equations.
• Write the vegetation equation as dv using with.
• Write the herbivore equation as dn using with.
• Combine these vectors into a list called result.
• Generate a time series over which to solve the equations in times.
• Set the parameter values in parameters.
• Set the starting values for V and N in initial.
• Use ode to create a matrix with the time series of V and N in output.

None of this is at all complicated, but there are lots of steps, so it looks a bit daunting. First we write a function called phmodel (for plant–herbivore model) which tells R the structure of the two equations, showing how change in each population is related to the functional and numerical responses, and then puts the results into a list:

    phmodel <- function(t,state,parameters) {
      with(as.list(c(state,parameters)), {
        dv <- r*v*(K-v)/K-b*v*n
        dn <- c*v*n-d*n
        result <- c(dv,dn)
        list(result)
      } ) }

The rightmost curly bracket ends the function, the plain right bracket closes the with function and the leftmost curly bracket ends the definition of the equations. To run the model we need to create a vector of times over which to calculate the population dynamics,

    times <- seq(0,500,length=501)

then define the numeric values of the five parameters (these values will determine the behaviour of the two populations)

    parameters <- c(r=0.4,K=1000,b=0.02,c=0.01,d=0.3)

and set the initial conditions (plant = 50 and herbivores = 10 at the start):

    initial <- c(v=50,n=10)

That is the end of the preliminaries.
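To make concrete what a classical fourth-order Runge–Kutta integrator does with these equations, here is a minimal sketch in base R that needs no package. It is an illustration only and is not a description of how deSolve is implemented: the function names rk4_step and ph_rhs and the step size h = 0.1 are all assumptions introduced for this example, and the right-hand side is rewritten to return a plain numeric vector rather than the list that ode expects.

```r
# Hypothetical helper: one classical fourth-order Runge-Kutta step of size h
rk4_step <- function(f, t, y, h, ...) {
  k1 <- f(t,         y,              ...)
  k2 <- f(t + h / 2, y + h / 2 * k1, ...)
  k3 <- f(t + h / 2, y + h / 2 * k2, ...)
  k4 <- f(t + h,     y + h * k3,     ...)
  y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)   # weighted average of the four slopes
}

# Right-hand side of the plant-herbivore model: returns c(dV/dt, dN/dt)
ph_rhs <- function(t, y, p) {
  v <- y[["v"]]
  n <- y[["n"]]
  c(v = p[["r"]] * v * (p[["K"]] - v) / p[["K"]] - p[["b"]] * v * n,
    n = p[["c"]] * v * n - p[["d"]] * n)
}

p <- c(r = 0.4, K = 1000, b = 0.02, c = 0.01, d = 0.3)
y <- c(v = 50, n = 10)
h <- 0.1                                    # assumed step size

# Ten steps of size 0.1 advance the system by one time unit
for (i in 1:10) y <- rk4_step(ph_rhs, (i - 1) * h, y, h, p = p)
y
```

A fixed-step scheme like this is easy to read but has no error control; in practice a tested solver with adaptive step size is the safer choice for anything other than teaching.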

364 Solving the equations could not be easier. The important function is ode (ordinary differential equation solver). The function takes four arguments: the starting values, the vector of times, the function containing the equations, and the named vector containing the parameter values:

    output <- ode(y=initial,times=times,func=phmodel,parms=parameters)

The output object is a matrix with three columns: time, plant abundance (v) and herbivore abundance (n):

    head(output)
         time        v        n
    [1,]    0 50.00000 10.00000
    [2,]    1 58.29220 12.75106
    [3,]    2 62.99695 17.40172
    [4,]    3 60.70065 24.09264
    [5,]    4 50.79407 31.32860
    [6,]    5 37.68312 36.12636

Plotting the two time series is done like this:

    plot(output[,1],output[,2],
         ylim=c(0,60),type="n",ylab="abundance",xlab="time")
    lines(output[,1],output[,2],col="green")
    lines(output[,1],output[,3],col="red")

The graph shows plant abundance as a green line against time and herbivore abundance as a red line:

[Figure: abundance plotted against time from 0 to 500]

The system exhibits damped oscillations to a stable point equilibrium at which dV/dt and dN/dt are both equal to zero, so equilibrium plant abundance V* = d/c = 0.3/0.01 = 30 and equilibrium herbivore abundance N* = r(K − V*)/bK = 19.4.
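The equilibrium values can be confirmed by substituting them back into the two differential equations, which should both then evaluate to zero. This short check is an addition to the text; the object names V_star, N_star, dV and dN are illustrative.

```r
p <- list(r = 0.4, K = 1000, b = 0.02, c = 0.01, d = 0.3)

V_star <- p$d / p$c                              # from dN/dt = 0 with N > 0
N_star <- p$r * (p$K - V_star) / (p$b * p$K)     # from dV/dt = 0 with V > 0

# Substitute back into the right-hand sides of the two equations
dV <- p$r * V_star * (p$K - V_star) / p$K - p$b * V_star * N_star
dN <- p$c * V_star * N_star - p$d * N_star

c(V_star = V_star, N_star = N_star, dV = dV, dN = dN)
```

Note that the plant equilibrium depends only on the herbivore parameters d and c, while the herbivore equilibrium depends on the plant's growth parameters r and K.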

365 An alternative is to plot the output as a phase plane, with herbivore abundance on the x axis and plant abundance on the y axis:

    plot(output[,3],output[,2],
         ylim=c(0,70),xlim=c(0,70),type="n",ylab="plant",xlab="herbivore")
    lines(output[,3],output[,2],col="red")

[Figure: phase plane of plant abundance against herbivore abundance]

366 8 Classical Tests

There is absolutely no point in carrying out an analysis that is more complicated than it needs to be. Occam's razor applies to the choice of statistical model just as strongly as to anything else: simplest is best. The so-called classical tests deal with some of the most frequently used kinds of analysis for single-sample and two-sample problems.

8.1 Single samples

Suppose we have a single sample. The questions we might want to answer are these:

• What is the mean value?
• Is the mean value significantly different from current expectation or theory?
• What is the level of uncertainty associated with our estimate of the mean value?

In order to be reasonably confident that our inferences are correct, we need to establish some facts about the distribution of the data:

• Are the values normally distributed or not?
• Are there outliers in the data?
• If data were collected over a period of time, is there evidence for serial correlation?

Non-normality, outliers and serial correlation can all invalidate inferences made by standard parametric tests like Student's t test. It is much better in cases with non-normality and/or outliers to use a non-parametric technique such as Wilcoxon's signed-rank test. If there is serial correlation in the data, then you need to use time series analysis or mixed-effects models.

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.

367 8.1.1 Data summary

To see what is involved in summarizing a single sample, read the data called y from the file called classic.txt:

    data <- read.table("c:\\temp\\classic.txt",header=T)
    names(data)
    [1] "y"
    attach(data)

As usual, we begin with a set of single sample plots: an index plot (scatterplot with a single argument, in which data are plotted in the order in which they appear in the dataframe), a box-and-whisker plot (see p. 212) and a frequency plot (a histogram with bin widths chosen by R):

    par(mfrow=c(2,2))
    plot(y)
    boxplot(y)
    hist(y,main="")
    y2 <- y
    y2[52] <- 21.75
    plot(y2)

[Figure: four panels: index plot of y, box-and-whisker plot of y, histogram of y, and index plot of y2]

368 The index plot (bottom right) is particularly valuable for drawing attention to mistakes in the dataframe. Suppose that the 52nd value had been entered as 21.75 instead of 2.175: the mistake stands out like a sore thumb in the plot.

Summarizing the data could not be simpler. We use the built-in function called summary like this:

    summary(y)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1.904   2.241   2.414   2.419   2.568   2.984

This gives us six pieces of information about the vector y. The smallest value is 1.904 (labelled Min. for minimum) and the largest value is 2.984 (labelled Max. for maximum). There are two measures of central tendency: the median is 2.414 and the arithmetic mean is 2.419. The other two figures (labelled 1st Qu. and 3rd Qu.) are the first and third quartiles (the 25th and 75th percentiles; see p. 115).

An alternative is Tukey's 'five-number summary' which comprises minimum, lower hinge, median, upper hinge and maximum for the input data. Hinges are close to the first and third quartiles (compare with summary, above), but different for small samples (see below):

    fivenum(y)
    [1] 1.903978 2.240931 2.414137 2.569583 2.984053

Notice that in this case Tukey's 'hinges' are not exactly the same as the 25th and 75th percentiles produced by summary. In our sample of 100 numbers the hinges are the average of the 25th and 26th sorted numbers and the average of the 75th and 76th sorted numbers. This is how the fivenum summary is produced: x takes the sorted values of y, and n is the length of y. Then five numbers, d, are calculated to use as subscripts to extract five averaged values from x like this:

    x <- sort(y)
    n <- length(y)
    d <- c(1, 0.5 * floor(0.5 * (n + 3)), 0.5 * (n + 1),
           n + 1 - 0.5 * floor(0.5 * (n + 3)), n)
    0.5 * (x[floor(d)] + x[ceiling(d)])
    [1] 1.903978 2.240931 2.414137 2.569583 2.984053

where the d values are

    [1]   1.0  25.5  50.5  75.5 100.0

with floor and ceiling providing the lower and upper subscripts for averaging (25 with 26 and 75 with 76 for the lower and upper hinges, respectively).

8.1.2 Plots for testing normality

The simplest test of normality (and in many ways the best) is the 'quantile–quantile plot'. This plots the ranked samples from our distribution against a similar number of ranked quantiles taken from a normal distribution. If our sample is normally distributed then the line will be straight. Departures from normality show up as various sorts of non-linearity (e.g. S-shapes or banana shapes). The functions you need are qqnorm and qqline (quantile–quantile plot against a normal distribution):

    par(mfrow=c(1,1))
    qqnorm(y)
    qqline(y,lty=2)
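What qqnorm draws can be reproduced by hand, which makes the phrase 'ranked samples against ranked quantiles' concrete. The sketch below is an addition to the text. Because classic.txt is not available here, it uses simulated values in a vector called y_sim, which is purely a hypothetical stand-in for y; the seed, mean and standard deviation are arbitrary choices.

```r
set.seed(1)                                   # arbitrary, for reproducibility only
y_sim <- rnorm(100, mean = 2.4, sd = 0.25)    # hypothetical stand-in for y

n <- length(y_sim)
theoretical <- qnorm(ppoints(n))   # normal quantiles at n evenly spread probabilities

# The quantile-quantile plot by hand: sorted data against sorted normal quantiles
plot(theoretical, sort(y_sim),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# Compare with the coordinates that qqnorm itself computes
qq <- qqnorm(y_sim, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```

The function ppoints supplies the probabilities at which the normal quantiles are evaluated; qqnorm returns its coordinates in the original data order, which is why they are sorted before the comparison.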

369 [Figure: Normal Q-Q Plot of Sample Quantiles against Theoretical Quantiles]

This shows a slight S-shape, but there is no compelling evidence of non-normality (our distribution is somewhat skew to the left; see the histogram, above). A novel plot for illustrating non-normality is shown on p. 232.

8.1.3 Testing for normality

We might use shapiro.test for testing whether the data in a vector come from a normal distribution. The null hypothesis is that the sample data are normally distributed. Let us generate some data that are log-normally distributed, in the hope that they will fail the normality test:

    x <- exp(rnorm(30))
    shapiro.test(x)

            Shapiro-Wilk normality test

    data:  x
    W = 0.5701, p-value = 3.215e-08

They certainly do fail: p < 0.000 001. A p value is not the probability that the null hypothesis is true (this is a common misunderstanding). On the contrary, the p value is based on the assumption that the null hypothesis is true. A p value is an estimate of the probability that a particular result (W = 0.5701 in this case), or a result more extreme than the result observed, could have occurred by chance, if the null hypothesis were true. In short, the p value is a measure of the credibility of the null hypothesis. A large p value (say, p = 0.23) means that there is no compelling evidence on which to reject the null hypothesis. Of course, saying 'we do not reject the null hypothesis' and 'the null hypothesis is true' are two quite different things. For instance, we may have failed to reject a false null hypothesis because our sample size was too low, or because our measurement error was too large. Thus, p values are interesting, but they do not tell the whole story: effect sizes and sample sizes are equally important in drawing conclusions.
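The statement that a p value is calculated on the assumption that the null hypothesis is true can be illustrated by simulation. In the sketch below (an addition to the text, with an arbitrary seed and the hypothetical object name p_vals) the null hypothesis really is true, because every sample is drawn from a normal distribution, and we record how often the test nevertheless gives p < 0.05.

```r
set.seed(42)   # arbitrary seed, for reproducibility only

# 1000 samples of size 30 from a normal distribution: the null is true each time
p_vals <- replicate(1000, shapiro.test(rnorm(30))$p.value)

# Proportion of samples in which the (true) null hypothesis would be rejected
mean(p_vals < 0.05)

hist(p_vals, main = "", xlab = "p value")
```

When the null hypothesis is true, the proportion of rejections should be in the region of the chosen significance level, and the histogram of p values should look roughly flat; small p values are rare under the null but they do occur.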

370 348 THE R BOOK 8.1.4 An example of single-sample data We can investigate the issues involved by examining the data from Michelson’s famous experiment in 1879 contains his results to measure the speed of light (see Michelson, 1880). The dataframe called light –1 (km s ), but with 299 000 subtracted. light <- read.table("t: data \\ light.txt",header=T) \\ attach(light) hist(speed,main="",col="green") Frequency 02468 1000 1100 900 800 700 speed We get a summary of the non-parametric descriptors of the sample like this: summary(speed) Min. 1st Qu. Median Mean 3rd Qu. Max. 650 850 940 909 980 1070 From this, you see at once that the median (940) is substantially bigger than the mean (909), as a consequence of the strong negative skew in the data seen in the histogram. The interquartile range is the difference between the first and third quartiles: 980 − = 130. This is useful in the detection of outliers: a good 850 outlier rule of thumb is that an is a value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile (130 × 1.5 = 195). In this case, therefore, outliers would be measurements of speed that were less than 850 − 195 = 655 or greater than 980 + 195 = 1175. You will see that there are no large outliers in this data set, but one or more small outliers (the minimum is 650). We want to test the hypothesis that Michelson’s estimate of the speed of light is significantly different from the value of 299 990 thought to prevail at the time. Since the data have all had 299 000 subtracted from them, the test value is 990. Because of the non-normality, the use of Student’s t test in this case is ill advised. The correct test is Wilcoxon’s signed-rank test. wilcox.test(speed,mu=990) Wilcoxon signed rank test with continuity correction

data: speed
V = 22.5, p-value = 0.00213
alternative hypothesis: true location is not equal to 990
Warning message:
In wilcox.test.default(speed, mu = 990) : cannot compute exact p-value with ties

We reject the null hypothesis and accept the alternative hypothesis because p = 0.002 13 (i.e. much less than 0.05). The speed of light is significantly less than 299 990.

8.2 Bootstrap in hypothesis testing

You have probably heard the old phrase about ‘pulling yourself up by your own bootlaces’. That is where the term ‘bootstrap’ comes from. It is used in the sense of getting ‘something for nothing’. The idea is very simple. You have a single sample of n measurements, but you can sample from this in very many ways, so long as you allow some values to appear more than once, and others to be left out (i.e. sampling with replacement). All you do is calculate the sample mean lots of times, once for each sampling from your data, then obtain the confidence interval by looking at the extreme highs and lows of the estimated means using a quantile function to extract the interval you want (e.g. a 95% interval is specified using c(0.025, 0.975) to locate the lower and upper bounds).

Our sample mean value of y is 909 (see the previous example). The question we have been asked to address is this: how likely is it that the population mean that we are trying to estimate with our random sample of 100 values is as big as 990? We take 10 000 random samples with replacement using n = 100 from the 100 values of light and calculate 10 000 values of the mean. Then we ask: what is the probability of obtaining a mean as large as 990 by inspecting the right-hand tail of the cumulative probability distribution of our 10 000 bootstrapped mean values? This is not as hard as it sounds:

a <- numeric(10000)
for(i in 1:10000) a[i] <- mean(sample(speed,replace=T))
hist(a,main="",col="blue")

[Figure: histogram of the bootstrapped means; x-axis: a, y-axis: Frequency]
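The bootstrap recipe above can be sketched end to end as follows. Because the light data file is not available here, the example uses a simulated stand-in (speed_sim); that vector, the seed and the variable names are assumptions for illustration, so the printed numbers will not match Michelson's data.

```r
# Sketch only: 'speed_sim' is simulated stand-in data, not Michelson's measurements
set.seed(1)
speed_sim <- round(rnorm(100, mean = 909, sd = 80))

# Resample with replacement many times, storing the mean of each resample
n_boot <- 10000
boot_means <- numeric(n_boot)
for (i in seq_len(n_boot)) {
  boot_means[i] <- mean(sample(speed_sim, replace = TRUE))
}

# 95% bootstrap interval for the mean (2.5% and 97.5% quantiles)
ci <- quantile(boot_means, c(0.025, 0.975))

# Proportion of bootstrapped means at least as large as the test value
test_value <- 990
p_boot <- mean(boot_means >= test_value)

print(ci)
print(p_boot)
```

Taking the proportion of resampled means at or beyond the test value is one simple way to read off the right-hand tail; when that proportion is zero, all that can be said is that the probability is below 1 in the number of resamples.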

The test value of 990 is way off the scale to the right, so a mean of 990 is clearly most unlikely, given the data with max(a) = 979. In our 10 000 samples of the data, we never obtained a mean value greater than 979, so the probability that the mean is 990 is clearly p < 0.0001.

8.3 Skew and kurtosis

So far, and without saying so explicitly, we have encountered the first two moments of a sample distribution. The quantity $\sum y$ was used in the context of defining the arithmetic mean of a single sample: this is the first moment, $\bar{y} = \sum y / n$. The quantity $\sum (y - \bar{y})^2$, the sum of squares, was used in calculating sample variance, and this is the second moment of the distribution, $s^2 = \sum (y - \bar{y})^2 / (n - 1)$. Higher-order moments involve powers of the difference greater than 2 such as $\sum (y - \bar{y})^3$ and $\sum (y - \bar{y})^4$.

8.3.1 Skew

Skew (or skewness) is the dimensionless version of the third moment about the mean,

$$m_3 = \frac{\sum (y - \bar{y})^3}{n},$$

which is rendered dimensionless by dividing by the cube of the standard deviation of y (because this is also measured in units of $y^3$),

$$s_3 = \text{sd}(y)^3 = \left(\sqrt{s^2}\right)^3.$$

The skew is then given by

$$\text{skew} = \gamma_1 = \frac{m_3}{s_3}.$$

It measures the extent to which a distribution has long, drawn-out tails on one side or the other. A normal distribution is symmetrical and has $\gamma_1 = 0$. Negative values of $\gamma_1$ mean skew to the left (negative skew) and positive values mean skew to the right.

[Figure: two panels of f(x) against x, labelled "positive skew" and "negative skew"]

windows(7,4)
par(mfrow=c(1,2))
x <- seq(0,4,0.01)
plot(x,dgamma(x,2,2),type="l",ylab="f(x)",xlab="x",col="red")

text(2.7,0.5,"positive skew")
plot(4-x,dgamma(x,2,2),type="l",ylab="f(x)",xlab="x",col="red")
text(1.3,0.5,"negative skew")

To test whether a particular value of skew is significantly different from 0 (and hence the distribution from which it was calculated is significantly non-normal) we divide the estimate of skew by its approximate standard error:

$$se_{\gamma_1} = \sqrt{\frac{6}{n}}.$$

It is straightforward to write an R function to calculate the degree of skew for any vector of numbers, x, like this:

skew <- function(x) {
m3 <- sum((x-mean(x))^3)/length(x)
s3 <- sqrt(var(x))^3
m3/s3 }

Note the use of the length(x) function to work out the sample size, n, whatever the size of the vector x. The last expression inside the function is not assigned a variable name, and is returned as the value of skew(x) when this is executed from the command line. Let us test the following data set:

data <- read.table("c:\\temp\\skewdata.txt",header=T)
attach(data)
names(data)
[1] "values"
hist(values)

[Figure: histogram of values; x-axis: values, y-axis: Frequency]
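As a quick check of the skew function defined above, it can be applied to simulated data whose direction of skew is known in advance. The gamma sample, its mirror image, the seed and the variable names are assumptions for this sketch and are not part of the original data set.

```r
# Skew as defined in the text: third moment about the mean over sd cubed
skew <- function(x) {
  m3 <- sum((x - mean(x))^3) / length(x)
  s3 <- sqrt(var(x))^3
  m3 / s3
}

# Sketch only: a simulated right-skewed sample (gamma) and its mirror image
set.seed(7)
right_skewed <- rgamma(500, shape = 2, rate = 2)
left_skewed <- -right_skewed

skew_right <- skew(right_skewed)
skew_left <- skew(left_skewed)

# Dividing by the approximate standard error sqrt(6/n) gives a rough t statistic
t_right <- skew_right / sqrt(6 / length(right_skewed))

print(c(skew_right = skew_right, skew_left = skew_left, t = t_right))
```

Reflecting the sample about zero flips the sign of the skew and leaves its magnitude unchanged, which is a convenient sanity check on the function.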

The data appear to be positively skew (i.e. to have a longer tail on the right than on the left). We use the new function skew to quantify the degree of skewness:

skew(values)
[1] 1.318905

Now we need to know whether a skew of 1.319 is significantly different from zero. We do a t test, dividing the observed value of skew by its standard error $\sqrt{6/n}$:

skew(values)/sqrt(6/length(values))
[1] 2.949161

Finally, we ask what is the probability of getting a t value of 2.949 by chance alone, when the skew value really is zero:

1-pt(2.949,28)
[1] 0.003185136

We conclude that these data show significant non-normality (p < 0.0032).

The next step might be to look for a transformation that normalizes the data by reducing the skewness. One way of drawing in the larger values is to take square roots, so let us try this to begin with:

skew(sqrt(values))/sqrt(6/length(values))
[1] 1.474851

This is not significantly skew. Alternatively, we might take the logs of the values:

skew(log(values))/sqrt(6/length(values))
[1] -0.6600605

This is now slightly skew to the left (negative skew), but the value of Student’s t is smaller than with a square root transformation, so we might prefer a log transformation in this case.

8.3.2 Kurtosis

This is a measure of non-normality that has to do with the peakyness, or flat-toppedness, of a distribution. The normal distribution is bell-shaped, whereas a kurtotic distribution is other than bell-shaped. In particular, a more flat-topped distribution is said to be platykurtic, and a more pointy distribution is said to be leptokurtic. Kurtosis is the dimensionless version of the fourth moment about the mean,

$$m_4 = \frac{\sum (y - \bar{y})^4}{n},$$

which is rendered dimensionless by dividing by the square of the variance of y (because this is also measured in units of $y^4$),

$$s_4 = (\text{var}(y))^2 = (s^2)^2.$$

Kurtosis is then given by

$$\text{kurtosis} = \gamma_2 = \frac{m_4}{s_4} - 3.$$

The −3 is included because a normal distribution has $m_4/s_4 = 3$. This formulation therefore has the desirable property of giving zero kurtosis for a normal distribution, while a flat-topped (platykurtic) distribution has a negative value of kurtosis, and a pointy (leptokurtic) distribution has a positive value of kurtosis. The approximate standard error of kurtosis is

$$se_{\gamma_2} = \sqrt{\frac{24}{n}}.$$

[Figure: two panels of f(x), titled "leptokurtosis" and "platykurtosis"]

plot(-200:200,dcauchy(-200:200,0,10),type="l",ylab="f(x)",xlab="",yaxt="n",
xaxt="n",main="leptokurtosis",col="red")
xv <- seq(-2,2,0.01)
plot(xv,exp(-abs(xv)^6),type="l",ylab="f(x)",xlab="",yaxt="n",
xaxt="n",main="platykurtosis",col="red")

An R function to calculate kurtosis might look like this:

kurtosis <- function(x) {
m4 <- sum((x-mean(x))^4)/length(x)
s4 <- var(x)^2
m4/s4 - 3 }

For our present data, we find that kurtosis is not significantly different from normal:

kurtosis(values)
[1] 1.297751
kurtosis(values)/sqrt(24/length(values))
[1] 1.45093

8.4 Two samples

The classical tests for two samples include:

- comparing two variances (Fisher’s F test, var.test);
- comparing two sample means with normal errors (Student’s t test, t.test);
- comparing two means with non-normal errors (Wilcoxon’s rank test, wilcox.test);
- comparing two proportions (the binomial test, prop.test);

- correlating two variables (Pearson’s or Spearman’s rank correlation, cor.test);
- testing for independence of two variables in a contingency table (chi-squared, chisq.test, or Fisher’s exact test, fisher.test).

8.4.1 Comparing two variances

Before we can carry out a test to compare two sample means (see below), we need to test whether the sample variances are significantly different (see p. 356). The test could not be simpler. It is called Fisher’s F test after the famous statistician and geneticist R.A. Fisher, who worked at Rothamsted in south-east England. To compare two variances, all you do is divide the larger variance by the smaller variance. Obviously, if the variances are the same, the ratio will be 1. In order to be significantly different, the ratio will need to be significantly bigger than 1 (because the larger variance goes on top, in the numerator). How will we know a significant value of the variance ratio from a non-significant one? The answer, as always, is to look up the critical value of the variance ratio. In this case, we want critical values of Fisher’s F. The R function for this is qf, which stands for ‘quantiles of the F distribution’.

For our example of ozone levels in market gardens (see p. 354) there were 10 replicates in each garden, so there were 10 − 1 = 9 degrees of freedom for each garden. In comparing two gardens, therefore, we have 9 d.f. in the numerator and 9 d.f. in the denominator. Although F tests in analysis of variance are typically one-tailed (the treatment variance is expected to be larger than the error variance if the means are significantly different; see p. 501), in this case we have no expectation as to which garden was likely to have the higher variance, so we carry out a two-tailed test (p = 1 − α/2). Suppose we work at the traditional α = 0.05, then we find the critical value of F like this:

qf(0.975,9,9)
[1] 4.025994

This means that a calculated variance ratio will need to be greater than or equal to 4.02 in order for us to conclude that the two variances are significantly different at α = 0.05.

To see the test in action, we can compare the variances in ozone concentration for market gardens B and C:

f.test.data <- read.table("c:\\temp\\f.test.data.txt",header = T)
attach(f.test.data)
names(f.test.data)
[1] "gardenB" "gardenC"

First, we compute the two variances:

var(gardenB)
[1] 1.333333
var(gardenC)
[1] 14.22222

The larger variance is clearly in garden C, so we compute the F ratio like this:

F.ratio <- var(gardenC)/var(gardenB)
F.ratio
[1] 10.66667

The variance in garden C is more than 10 times as big as the variance in garden B. The critical value of F for this test (with 9 d.f. in both the numerator and the denominator) is 4.026 (see qf, above), so, since the calculated value is larger than the critical value we reject the null hypothesis. The null hypothesis was that the two variances were not significantly different, so we accept the alternative hypothesis that the two variances are significantly different. In fact, it is better practice to present the p value associated with the calculated F ratio rather than just to reject the null hypothesis; to do this we use pf rather than qf. We double the resulting probability to allow for the two-tailed nature of the test:

2*(1-pf(F.ratio,9,9))
[1] 0.001624199

so the probability of obtaining a variance ratio at least this extreme, if the two variances really were the same, is p < 0.002. Because the variances are significantly different, it would be wrong to compare the two sample means using Student’s t test.

There is a built-in function called var.test for speeding up the procedure. All we provide are the names of the two variables containing the raw data whose variances are to be compared (we do not need to work out the variances first):

var.test(gardenB,gardenC)
F test to compare two variances
data: gardenB and gardenC
F = 0.0938, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.02328617 0.37743695
sample estimates:
ratio of variances
0.09375

Note that the variance ratio, F, is given as roughly 1/10 rather than roughly 10 because var.test puts the variable named first in the call (gardenB) on top (i.e. in the numerator) instead of the bigger of the two variances. But the p value of 0.0016 is the same as we calculated by hand (above), and we reject the null hypothesis. These two variances are highly significantly different. This test is highly sensitive to outliers, so use it with care.
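The agreement between the hand calculation (F of about 10.67) and the var.test output (F of about 0.094) can be checked directly from the two variances quoted above. This is a minimal sketch; the variable names are chosen for the example, and the variances are the rounded values printed in the text rather than the raw data.

```r
# Variances reported in the text for gardens B and C (10 replicates each)
var_B <- 1.333333
var_C <- 14.22222
df_each <- 9

F_large <- var_C / var_B   # larger variance on top, as in the hand calculation
F_small <- var_B / var_C   # gardenB on top, as in the var.test output

# Two-tailed p values taken from the upper and the lower tail respectively
p_from_large <- 2 * (1 - pf(F_large, df_each, df_each))
p_from_small <- 2 * pf(F_small, df_each, df_each)

print(c(F_large = F_large, F_small = F_small))
print(c(p_from_large = p_from_large, p_from_small = p_from_small))
```

Because both samples have the same degrees of freedom, inverting the ratio simply swaps the tail of the F distribution that is used, so the two-tailed p value does not depend on which variance goes in the numerator.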
It is important to know whether variance differs significantly from sample to sample. Constancy of variance (homoscedasticity) is the most important assumption underlying regression and analysis of variance (p. 490). For comparing the variances of two samples, Fisher’s F test is appropriate (p. 354). For multiple samples you can choose between the Bartlett test and the Fligner–Killeen test. Here are both tests in action:

refs <- read.table("c:\\temp\\refuge.txt",header=T)
attach(refs)
names(refs)
[1] "B" "T"

where T is an ordered factor with nine levels. Each level produces 30 estimates of yields except for level 9 which is a single zero. We begin by looking at the variances:

tapply(B,T,var)
1 2 3 4 5 6 7 8 9
1354.024 2025.431 3125.292 1077.030 2542.599 2221.982 1445.490 1459.955 NA

When it comes to the variance tests we shall have to leave out level 9 of T because the tests require at least two replicates at each factor level. We need to know which data point refers to treatment T = 9:

which(T==9)
[1] 31

So we shall omit the 31st data point using negative subscripts. First Bartlett:

bartlett.test(B[-31],T[-31])
Bartlett test of homogeneity of variances
data: B[-31] and T[-31]
Bartlett's K-squared = 13.1986, df = 7, p-value = 0.06741

So there is no significant difference between the eight variances (p = 0.067). Now Fligner:

fligner.test(B[-31],T[-31])
Fligner-Killeen test of homogeneity of variances
data: B[-31] and T[-31]
Fligner-Killeen:med chi-squared = 14.3863, df = 7, p-value = 0.04472

Hmm. This test says that there are significant differences between the variances (p < 0.05). What you do next depends on your outlook. There are obviously some close-to-significant differences between these eight variances, but if you simply look at a plot of the data, plot(T,B), the variances appear to be very well behaved. A linear model shows some slight pattern in the residuals and some evidence of non-normality:

model <- lm(B~T)
plot(model)

[Figure: four diagnostic panels for the linear model: Residuals vs Fitted, Normal Q–Q, Scale–Location, Residuals vs Leverage]

The various tests can give wildly different interpretations. Here are the ozone data from three market gardens:

ozone <- read.table("c:\\temp\\gardens.txt",header=T)
attach(ozone)
names(ozone)
[1] "gardenA" "gardenB" "gardenC"
y <- c(gardenA,gardenB,gardenC)
garden <- factor(rep(c("A","B","C"),c(10,10,10)))

The question is whether the variance in ozone concentration differs from garden to garden or not. Fisher’s F test comparing gardens B and C says that variance is significantly greater in garden C:

var.test(gardenB,gardenC)
F test to compare two variances
data: gardenB and gardenC
F = 0.0938, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.02328617 0.37743695
sample estimates:
ratio of variances
0.09375

Bartlett’s test, likewise, says there is a highly significant difference in variance across gardens:

bartlett.test(y~garden)
Bartlett test of homogeneity of variances
data: y by garden
Bartlett's K-squared = 16.7581, df = 2, p-value = 0.0002296

In contrast, the Fligner–Killeen test (preferred over Bartlett’s test by many statisticians) says there is no compelling evidence for non-constancy of variance (heteroscedasticity) in these data:

fligner.test(y~garden)
Fligner-Killeen test of homogeneity of variances
data: y by garden
Fligner-Killeen: med chi-squared = 1.8061, df = 2, p-value = 0.4053

The reason for the difference is that Fisher and Bartlett are sensitive to outliers, whereas Fligner–Killeen is not (it is a non-parametric test which uses the ranks of the absolute values of the centred samples, and weights a(i) = qnorm((1 + i/(n+1))/2)). Of the many tests for homogeneity of variances, this is the most robust against departures from normality (Conover et al., 1981).
In this particular case, I think the Fligner test is too forgiving: gardens B and C both had a mean of 5 parts per hundred million (pphm; well below the damage threshold of 8 pphm), but garden B never suffered damaging levels of ozone whereas garden C experienced damaging ozone levels on 30% of days. That difference is scientifically important, and deserves to be statistically significant.
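The differing sensitivity of the two tests to outliers can be explored with simulated data. In the sketch below the three groups, the seed, the planted outlier value of 15 and all variable names are assumptions made for illustration; they are not the market-garden data.

```r
# Sketch only: simulated data; group sizes, seed and the outlier value are assumptions
set.seed(3)
y_sim <- c(rnorm(10, mean = 5, sd = 1),
           rnorm(10, mean = 5, sd = 1),
           rnorm(10, mean = 5, sd = 1))
group <- factor(rep(c("A", "B", "C"), each = 10))

# Same data with one extreme value planted in group C (observation 30)
y_outlier <- y_sim
y_outlier[30] <- 15

# Collect the p values of both tests, with and without the outlier
p_values <- c(
  bartlett_clean = bartlett.test(y_sim ~ group)$p.value,
  fligner_clean = fligner.test(y_sim ~ group)$p.value,
  bartlett_outlier = bartlett.test(y_outlier ~ group)$p.value,
  fligner_outlier = fligner.test(y_outlier ~ group)$p.value
)
print(round(p_values, 4))
```

One would typically expect Bartlett's p value to react more strongly to the single planted outlier than the rank-based Fligner–Killeen p value, although the exact figures depend on the simulated sample.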

8.4.2 Comparing two means

Given what we know about the variation from replicate to replicate within each sample (the within-sample variance), how likely is it that our two sample means were drawn from populations with the same average? If it is highly likely, then we shall say that our two sample means are not significantly different. If it is rather unlikely, then we shall say that our sample means are significantly different. But perhaps a better way to proceed is to work out the probability that the two samples were indeed drawn from populations with the same mean. If this probability is very low (say, less than 5% or less than 1%) then we can be reasonably certain (95% or 99% in these two examples) that the means really are different from one another. Note, however, that we can never be 100% certain; the apparent difference might just be due to random sampling: we just happened to get a lot of low values in one sample, and a lot of high values in the other.

There are two classical tests for comparing two sample means:

- Student’s t test when the samples are independent, the variances constant, and the errors are normally distributed;
- Wilcoxon’s rank-sum test when the samples are independent but the errors are not normally distributed (e.g. they are ranks or scores of some sort).

What to do when these assumptions are violated (e.g. when the variances are different) is discussed later on.

8.4.3 Student’s t test

Student was the pseudonym of W.S. Gosset who published his influential paper in Biometrika in 1908. The archaic employment laws in place at the time allowed his employer, the Guinness Brewing Company, to prevent him publishing independent work under his own name. Student’s t distribution, later perfected by R.A. Fisher, revolutionized the study of small-sample statistics where inferences need to be made on the basis of the sample variance $s^2$ with the population variance $\sigma^2$ unknown (indeed, usually unknowable). The test statistic is the number of standard errors of the difference by which the two sample means are separated:

$$t = \frac{\text{difference between the two means}}{\text{standard error of the difference}} = \frac{\bar{y}_A - \bar{y}_B}{se_{\text{diff}}}.$$

We know the standard error of the mean (see p. 43) but we have not yet met the standard error of the difference between two means. For two independent (i.e. non-correlated) variables, the variance of a difference is the sum of the separate variances. This important result allows us to write down the formula for the standard error of the difference between two sample means:

$$se_{\text{diff}} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}.$$

We now have everything we need to carry out Student’s t test. Our null hypothesis is that the two population means are the same, and we shall accept this unless the value of Student’s t is so large that it is unlikely that such a difference could have arisen by chance alone. Everything varies, so in real studies our two sample means will never be exactly the same, no matter what the parent population means. For the ozone example introduced on p. 354, each sample has 9 degrees of freedom, so we have 18 d.f. in total. Another way of thinking of this is to reason that the complete sample size is 20, and we have estimated two parameters from the data, $\bar{y}_A$ and $\bar{y}_B$, so we have 20 − 2 = 18 d.f. We typically use 5% as the chance of rejecting the null hypothesis when it is true (this is the Type I error rate). Because we did not know in advance which of the two

gardens was going to have the higher mean ozone concentration (and we usually do not), this is a two-tailed test, so the critical value of Student’s t is:

qt(0.975,18)
[1] 2.100922

This means that our test statistic needs to be bigger than 2.1 in order to reject the null hypothesis, and hence to conclude that the two means are significantly different at α = 0.05.

The dataframe is attached like this:

t.test.data <- read.table("c:\\temp\\t.test.data.txt",header=T)
attach(t.test.data)
par(mfrow=c(1,1))
names(t.test.data)
[1] "gardenA" "gardenB"

A useful graphical test for two samples employs the notch option of boxplot:

ozone <- c(gardenA,gardenB)
label <- factor(c(rep("A",10),rep("B",10)))
boxplot(ozone~label,notch=T,xlab="Garden",ylab="Ozone")

[Figure: notched boxplots of Ozone for Garden A and Garden B]

Because the notches of two plots do not overlap, we conclude that the medians are significantly different at the 5% level. Note that the variability is similar in both gardens, both in terms of the range (the whiskers) and the interquartile range (the boxes).

To carry out a t test long-hand, we begin by calculating the variances of the two samples:

s2A <- var(gardenA)
s2B <- var(gardenB)

The value of the test statistic for Student’s t is the difference divided by the standard error of the difference. The numerator is the difference between the two means, and the denominator is the square root of the sum of the two variances divided by their sample sizes:

(mean(gardenA)-mean(gardenB))/sqrt(s2A/10+s2B/10)

which gives the value of Student’s t as

[1] -3.872983

With t tests you can ignore the minus sign; it is only the absolute value of the difference between the two sample means that concerns us. So the calculated value of the test statistic is 3.87 and the critical value is 2.10 (qt(0.975,18), above). Since the calculated value of the test statistic is larger than the critical value, we reject the null hypothesis. Notice that the wording is exactly the same as it was for the F test (above). Indeed, the wording is always the same for all kinds of tests, and you should try to memorize it. The abbreviated form is easier to remember: ‘larger reject, smaller accept’. The null hypothesis was that the two population means are not significantly different, so we reject this and accept the alternative hypothesis that the two means are significantly different. Again, rather than merely rejecting the null hypothesis, it is better to state the probability that data as extreme as this (or more extreme) would be observed if the population mean values really were the same. For this we use pt rather than qt, and in this instance 2*pt because we are doing a two-tailed test:

2*pt(-3.872983,18)
[1] 0.001114540

We conclude that p < 0.005.

You will not be surprised to learn that there is a built-in function to do all the work for us. It is called, helpfully, t.test and is used simply by providing the names of the two vectors containing the samples on which the test is to be carried out (gardenA and gardenB in our case).

t.test(gardenA,gardenB)

There is rather a lot of output.
You often find this: the simpler the statistical test, the more voluminous the output.

Welch Two Sample t-test
data: gardenA and gardenB
t = -3.873, df = 18, p-value = 0.001115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.0849115 -0.9150885
sample estimates:
mean of x mean of y
3 5

The result is exactly the same as we obtained the long way. The value of t is -3.873 and, since the sign is irrelevant in a t test, we reject the null hypothesis because the test statistic is larger than the critical value of 2.1. The mean ozone concentration is significantly higher in garden B than in garden A. The computer output also gives a p value and a confidence interval. Note that, because the means are significantly different, the confidence interval on the difference does not include zero (in fact, it goes from -3.085 up to -0.915). You might present the result like this: ‘Ozone concentration was significantly higher in garden B (mean = 5.0 pphm) than in garden A (mean = 3.0 pphm; t = 3.873, p = 0.0011 (2-tailed), d.f. = 18).’
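The long-hand calculation and the built-in test can be run side by side. In this sketch the garden A and garden B values are typed in from the combined ozone vector printed in Section 8.4.4 (first ten values for A, last ten for B); the variable names for the intermediate results are assumptions made for the example.

```r
# Ozone values for gardens A and B, taken from the combined vector in Section 8.4.4
gardenA <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
gardenB <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)

# Long-hand: difference in means over the standard error of the difference
se_diff <- sqrt(var(gardenA) / length(gardenA) + var(gardenB) / length(gardenB))
t_by_hand <- (mean(gardenA) - mean(gardenB)) / se_diff
p_by_hand <- 2 * pt(-abs(t_by_hand), df = 18)   # two-tailed, 18 d.f.

# Built-in version, and the formula-based call with a two-level factor
result <- t.test(gardenA, gardenB)
ozone <- c(gardenA, gardenB)
label <- factor(rep(c("A", "B"), each = 10))
result_formula <- t.test(ozone ~ label)

print(c(t_by_hand = t_by_hand, p_by_hand = p_by_hand))
print(result)
```

Using -abs(t) inside pt gives the two-tailed p value whichever way round the two samples are subtracted.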

There is a formula-based version of t.test that you can use when your explanatory variable consists of a two-level factor (see ?t.test).

8.4.4 Wilcoxon rank-sum test

This is a non-parametric alternative to Student’s t test, which we could use if the errors were non-normal. The Wilcoxon rank-sum test statistic, W, is calculated as follows. Both samples are put into a single array with their sample names clearly attached (A and B in this case, as explained below). Then the aggregate list is sorted, taking care to keep the sample labels with their respective values. A rank is assigned to each value, with ties getting the appropriate average rank (two-way ties get (rank i + (rank i + 1))/2, three-way ties get (rank i + (rank i + 1) + (rank i + 2))/3, and so on). Finally the ranks are added up for each of the two samples, and significance is assessed on the size of the smaller sum of ranks.

First we make a combined vector of the samples:

ozone <- c(gardenA,gardenB)
ozone
[1] 3 4 4 3 2 3 1 3 5 2 5 5 6 7 4 4 3 5 6 5

Then we make a list of the sample names:

label <- c(rep("A",10),rep("B",10))
label
[1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"

Now use the built-in function rank to get a vector containing the ranks, smallest to largest, within the combined vector:

combined.ranks <- rank(ozone)
combined.ranks
[1] 6.0 10.5 10.5 6.0 2.5 6.0 1.0 6.0 15.0 2.5 15.0 15.0
[13] 18.5 20.0 10.5 10.5 6.0 15.0 18.5 15.0

Notice that the ties have been dealt with by averaging the appropriate ranks. Now all we need to do is calculate the sum of the ranks for each garden. We use tapply with sum as the required operation:

tapply(combined.ranks,label,sum)
A B
66 144

Finally, we compare the smaller of the two values (66) with values in tables of Wilcoxon rank sums (e.g. Snedecor and Cochran, 1980, p. 555), and reject the null hypothesis if our value of 66 is smaller than the value in tables.
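The ranking steps above can be collected into one short script, using the garden values from the combined vector. The sketch also relates the rank sum to the statistic that R's wilcox.test reports, which is the rank sum of the first sample minus n(n + 1)/2; the intermediate variable names are assumptions for the example.

```r
gardenA <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
gardenB <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
ozone <- c(gardenA, gardenB)
label <- rep(c("A", "B"), each = 10)

# Rank the pooled values (ties get average ranks) and sum the ranks per garden
rank_sums <- tapply(rank(ozone), label, sum)

# wilcox.test reports the rank sum of the first sample minus n1 * (n1 + 1) / 2
n_A <- length(gardenA)
W_by_hand <- rank_sums[["A"]] - n_A * (n_A + 1) / 2

# suppressWarnings() hides the "cannot compute exact p-value with ties" warning
W_builtin <- suppressWarnings(wilcox.test(gardenA, gardenB))$statistic

print(rank_sums)
print(c(W_by_hand = W_by_hand, W_builtin = unname(W_builtin)))
```

The tabulated critical values refer to the raw rank sum, whereas the built-in function works with the shifted statistic, so the two numbers differ by a constant that depends only on the sample size.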
For samples of size 10 and 10 like ours, the 5% value in tables is 78. Our value of 66 is smaller than this, so we reject the null hypothesis. The two sample means are significantly different (in agreement with our earlier t test). We can carry out the whole procedure automatically, and avoid the need to use tables of critical values of Wilcoxon rank sums, by using the built-in function wilcox.test : wilcox.test(gardenA,gardenB) Wilcoxon rank sum test with continuity correction data: gardenA and gardenB

W = 11, p-value = 0.002988
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(gardenA, gardenB) : cannot compute exact p-value with ties

The function uses a normal approximation algorithm to work out a z value, and from this a p value to assess the hypothesis that the two means are the same. This p value of 0.002 988 is much less than 0.05, so we reject the null hypothesis, and conclude that the mean ozone concentrations in gardens A and B are significantly different. The warning message at the end draws attention to the fact that there are ties in the data (repeats of the same ozone measurement), and this means that the p value cannot be calculated exactly (this is seldom a real worry).

It is interesting to compare the p values of the t test and the Wilcoxon test with the same data: p = 0.001 115 and 0.002 988, respectively. The non-parametric test is much more appropriate than the t test when the errors are not normal, and the non-parametric test is about 95% as powerful with normal errors, and can be more powerful than the t test if the distribution is strongly skewed by the presence of outliers. Typically, as here, the t test will give the lower p value, so the Wilcoxon test is said to be conservative: if a difference is significant under a Wilcoxon test it would be even more significant under a t test.

8.5 Tests on paired samples

Sometimes, two-sample data come from paired observations. In this case, we might expect a correlation between the two measurements, because they were either made on the same individual, or taken from the same location. You might recall that the variance of a difference is the average of

$$(y_A - \mu_A)^2 + (y_B - \mu_B)^2 - 2(y_A - \mu_A)(y_B - \mu_B),$$

which is the variance of sample A, plus the variance of sample B, minus twice the covariance of A and B.
When the covariance of A and B is positive, this is a great help because it reduces the variance of the difference, which makes it easier to detect significant differences between the means. Pairing is not always effective, because the correlation between $y_A$ and $y_B$ may be weak.

The following data are a composite biodiversity score based on a kick sample of aquatic invertebrates:

streams <- read.table("c:\\temp\\streams.txt",header=T)
attach(streams)
names(streams)
[1] "down" "up"

The elements are paired because the two samples were taken on the same river, one upstream and one downstream from the same sewage outfall. If we ignore the fact that the samples are paired, it appears that the sewage outfall has no impact on biodiversity score (p = 0.6856):

t.test(down,up)
Welch Two Sample t-test
data: down and up
t = -0.4088, df = 29.755, p-value = 0.6856

alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.248256 3.498256
sample estimates:
mean of x mean of y
12.500 13.375

However, if we allow that the samples are paired (simply by specifying the option paired=TRUE), the picture is completely different:

t.test(down,up,paired=TRUE)
Paired t-test
data: down and up
t = -3.0502, df = 15, p-value = 0.0081
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.4864388 -0.2635612
sample estimates:
mean of the differences
-0.875

This is a good example of the benefit of writing TRUE rather than T. Because we have a variable called T (p. 22) the test would fail if we typed t.test(down,up,paired=T).

Now the difference between the means is highly significant (p = 0.0081). The moral is clear. If you can do a paired t test, then you should always do the paired test. It can never do any harm, and sometimes (as here) it can do a huge amount of good. In general, if you have information on blocking or spatial correlation (in this case, the fact that the two samples came from the same river), then you should always use it in the analysis.

Here is the same paired test carried out as a one-sample t test based on the differences between the pairs (upstream diversity minus downstream diversity):

difference <- up - down
t.test(difference)
One Sample t-test
data: difference
t = 3.0502, df = 15, p-value = 0.0081
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.2635612 1.4864388
sample estimates:
mean of x
0.875

As you see, the result is identical to the two-sample t test with paired=TRUE (p = 0.0081). The upstream values of the biodiversity score were greater by 0.875 on average, and this difference is highly significant.
Working with the differences has halved the number of degrees of freedom (from 30 to 15), but it has more than compensated for this by reducing the error variance, because there is such a strong positive correlation between $y_A$ and $y_B$.
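Both points made in this section, that the variance of a difference shrinks when the covariance is positive and that the paired test is a one-sample test on the differences, can be checked numerically. The streams data file is not available here, so the sketch below simulates paired scores; the data, seed, effect size and variable names are all assumptions for illustration.

```r
# Sketch only: simulated paired scores; names, seed and effect size are assumptions
set.seed(11)
n_pairs <- 16
site_effect <- rnorm(n_pairs, mean = 13, sd = 6)   # shared by both members of a pair
up_sim <- site_effect + rnorm(n_pairs, sd = 1)
down_sim <- site_effect - 0.9 + rnorm(n_pairs, sd = 1)

# Variance of a difference: var(A) + var(B) - 2 * cov(A, B)
var_diff_direct <- var(up_sim - down_sim)
var_diff_identity <- var(up_sim) + var(down_sim) - 2 * cov(up_sim, down_sim)

# A paired t test is a one-sample t test on the differences
paired <- t.test(up_sim, down_sim, paired = TRUE)
one_sample <- t.test(up_sim - down_sim)
unpaired <- t.test(up_sim, down_sim)

print(c(direct = var_diff_direct, identity = var_diff_identity))
print(c(paired_p = paired$p.value, one_sample_p = one_sample$p.value,
        unpaired_p = unpaired$p.value))
```

The shared site effect is what creates the positive covariance between the two members of each pair; making it smaller relative to the within-pair noise would weaken the advantage of pairing.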

386

8.6 The sign test

This is one of the simplest of all statistical tests. Suppose that you cannot measure a difference, but you can see it (e.g. in judging a diving contest). For example, nine springboard divers were scored as better or worse, having trained under a new regime and under the conventional regime (the regimes were allocated in a randomized sequence to each athlete: new then conventional, or conventional then new). Divers were judged twice: one diver was worse on the new regime, and 8 were better. What is the evidence that the new regime produces significantly better scores in competition? The answer comes from a two-tailed binomial test. How likely is a response of 1/9 (or 8/9 or more extreme than this, i.e. 0/9 or 9/9) if the populations are actually the same (i.e. p = 0.5)? We use a binomial test for this, specifying the number of ‘failures’ (1) and the total sample size (9):

    binom.test(1,9)

            Exact binomial test
    data:  1 and 9
    number of successes = 1, number of trials = 9, p-value = 0.03906
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
     0.002809137 0.482496515
    sample estimates:
    probability of success
                 0.1111111

We would conclude that the new training regime is significantly better than the traditional method, because p < 0.05.

It is easy to write a function to carry out a sign test to compare two samples, x and y:

    sign.test <- function(x, y) {
      if(length(x) != length(y)) stop("The two variables must be the same length")
      d <- x - y
      binom.test(sum(d > 0), length(d))
    }

The function starts by checking that the two vectors are the same length, then works out the vector of the differences, d. The binomial test is then applied to the number of positive differences (sum(d > 0)) and the total number of differences (length(d)). If there was no difference between the samples, then on average, the sum would be about half of length(d). Here is the sign test used to compare the ozone levels in gardens A and B (see above):

    sign.test(gardenA,gardenB)

            Exact binomial test
    data:  sum(d > 0) and length(d)
    number of successes = 0, number of trials = 10, p-value = 0.001953
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
     0.0000000 0.3084971
    sample estimates:
    probability of success
                         0
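The function above counts a pair with no difference as a trial that was not a success. A common convention, which the text does not state and which we adopt here only as an assumption, is to discard such ties before counting. The following sketch shows that variant; the function name sign_test_no_ties and the diving scores are invented for the illustration.

```r
# Sketch of a sign test that discards ties before counting.
# sign_test_no_ties is a hypothetical name, not a function from base R.
sign_test_no_ties <- function(x, y) {
  if (length(x) != length(y)) stop("The two variables must be the same length")
  d <- x - y
  d <- d[d != 0]                      # assumption: zero differences carry no sign
  binom.test(sum(d > 0), length(d))   # two-tailed test against p = 0.5
}

# Hypothetical scores for nine divers: eight better on the new regime, one worse
conventional <- c(7.0, 6.5, 8.0, 7.5, 6.0, 7.0, 8.5, 6.5, 7.5)
new_regime   <- c(7.5, 7.0, 8.5, 8.0, 6.5, 7.5, 9.0, 7.0, 7.0)

result <- sign_test_no_ties(new_regime, conventional)
result$p.value
```

With eight positive differences out of nine, the two-tailed p value is the same as for binom.test(1,9), because the binomial distribution with p = 0.5 is symmetric.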

387

Note that the p value (0.002) from the sign test is larger than in the equivalent t test (p = 0.0011) that we carried out earlier. This will generally be the case: other things being equal, the parametric test will be more powerful than the non-parametric equivalent.

8.7 Binomial test to compare two proportions

Suppose that only four females were promoted, compared to 196 men. Is this an example of blatant sexism, as it might appear at first glance? Before we can judge, of course, we need to know the number of male and female candidates. It turns out that 196 men were promoted out of 3270 candidates, compared with 4 promotions out of only 40 candidates for the women. Now, if anything, it looks like the females did better than males in the promotion round (10% success for women versus 6% success for men).

The question then arises as to whether the apparent positive discrimination in favour of women is statistically significant, or whether this sort of difference could arise through chance alone. This is easy in R using the built-in binomial proportions test prop.test in which we specify two vectors, the first containing the number of successes for females and males c(4,196) and second containing the total number of female and male candidates c(40,3270):

    prop.test(c(4,196),c(40,3270))

            2-sample test for equality of proportions with continuity correction
    data:  c(4, 196) out of c(40, 3270)
    X-squared = 0.5229, df = 1, p-value = 0.4696
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.06591631  0.14603864
    sample estimates:
        prop 1     prop 2
    0.10000000 0.05993884
    Warning message:
    In prop.test(c(4, 196), c(40, 3270)) :
      Chi-squared approximation may be incorrect

There is no evidence in favour of positive discrimination (p = 0.4696). A result like this will occur more than 45% of the time by chance alone. Just think what would have happened if one of the successful female candidates had not applied.
Then the same promotion system would have produced a female success rate of 3/39 instead of 4/40 (7.7% instead of 10%). In small samples, small changes have big effects. 8.8 Chi-squared contingency tables A great deal of statistical information comes in the form of counts (whole numbers or integers): the number of animals that died, the number of branches on a tree, the number of days of frost, the number of companies that failed, the number of patients who died. With count data, the number 0 is often the value of a response variable (consider, for example, what a 0 would mean in the context of the examples just listed). The analysis of count data is discussed in more detail in Chapters 14 and 15. The dictionary definition of contingency is: ‘A possible or uncertain event on which other things depend or are conditional’ (OED, 2012). In statistics, however, the contingencies are all the events that could possibly

388

happen. A contingency table shows the counts of how many times each of the contingencies actually happened in a particular sample. Consider the following example that has to do with the relationship between hair colour and eye colour in white people. For simplicity, we just chose two contingencies for hair colour: ‘fair’ and ‘dark’. Likewise we just chose two contingencies for eye colour: ‘blue’ and ‘brown’. Each of these two categorical variables, eye colour and hair colour, has two levels (‘blue’ and ‘brown’, and ‘fair’ and ‘dark’, respectively). Between them, they define four possible outcomes (the contingencies): fair hair and blue eyes, fair hair and brown eyes, dark hair and blue eyes, and dark hair and brown eyes. We take a random sample of white people and count how many of them fall into each of these four categories. Then we fill in the 2 × 2 contingency table like this:

|           | Blue eyes | Brown eyes |
|-----------|-----------|------------|
| Fair hair | 38        | 11         |
| Dark hair | 14        | 51         |

These are our observed frequencies (or counts). The next step is very important. In order to make any progress in the analysis of these data we need a model which predicts the expected frequencies. What would be a sensible model in a case like this? There are all sorts of complicated models that you might select, but the simplest model (Occam’s razor, or the principle of parsimony) is that hair colour and eye colour are independent. We may not believe that this is actually true, but the hypothesis has the great virtue of being falsifiable. It is also a very sensible model to choose because it makes it possible to predict the expected frequencies based on the assumption that the model is true. We need to do some simple probability work. What is the probability of getting a random individual from this sample whose hair was fair? A total of 49 people (38 + 11) had fair hair out of a total sample of 114 people. So the probability of fair hair is 49/114 and the probability of dark hair is 65/114.
Notice that because we have only two levels of hair colour, these two probabilities add up to 1 ((49 + 65)/114). What about eye colour? What is the probability of selecting someone at random from this sample with blue eyes? A total of 52 people had blue eyes (38 + 14) out of the sample of 114, so the probability of blue eyes is 52/114 and the probability of brown eyes is 62/114. As before, these sum to 1 ((52 + 62)/114). It helps to add the subtotals to the margins of the contingency table like this:

|               | Blue eyes | Brown eyes | Row totals |
|---------------|-----------|------------|------------|
| Fair hair     | 38        | 11         | 49         |
| Dark hair     | 14        | 51         | 65         |
| Column totals | 52        | 62         | 114        |

Now comes the important bit. We want to know the expected frequency of people with fair hair and blue eyes, to compare with our observed frequency of 38. Our model says that the two are independent. This is essential information, because it allows us to calculate the expected probability of fair hair and blue eyes. If, and only if, the two traits are independent, then the probability of having fair hair and blue eyes is the product of the two probabilities. So, following our earlier calculations, the probability of fair hair and blue eyes is 49/114 × 52/114. We can do exactly equivalent things for the other three cells of the contingency table:

|                            | Blue eyes       | Brown eyes      | Total count in each row |
|----------------------------|-----------------|-----------------|-------------------------|
| Fair hair                  | 49/114 × 52/114 | 49/114 × 62/114 | 49                      |
| Dark hair                  | 65/114 × 52/114 | 65/114 × 62/114 | 65                      |
| Total count in each column | 52              | 62              | 114                     |

Now we need to know how to calculate the expected frequency. It couldn’t be simpler. It is just the probability multiplied by the total sample (n = 114). So the expected frequency of blue eyes and fair hair is

389

114 × 49/114 × 52/114 = 22.35, which is much less than our observed frequency of 38. It is beginning to look as if our hypothesis of independence of hair and eye colour is false.

You might have noticed something useful in the last calculation: two of the sample sizes cancel out. Therefore, the expected frequency in each cell is just the row total (R) times the column total (C) divided by the grand total (G) like this:

$$E = \frac{R \times C}{G}.$$

We can now work out the four expected frequencies:

|               | Blue eyes | Brown eyes | Row totals |
|---------------|-----------|------------|------------|
| Fair hair     | 22.35     | 26.65      | 49         |
| Dark hair     | 29.65     | 35.35      | 65         |
| Column totals | 52        | 62         | 114        |

Notice that the row and column totals (the so-called ‘marginal totals’) are retained under the model. It is clear that the observed frequencies and the expected frequencies are different. But in sampling, everything always varies, so this is no surprise. The important question is whether the expected frequencies are significantly different from the observed frequencies.

We can assess the significance of the differences between observed and expected frequencies in a variety of ways:

- Pearson’s chi-squared;
- G test;
- Fisher’s exact test.

8.8.1 Pearson’s chi-squared

We begin with Pearson’s chi-squared test. The test statistic $\chi^2$ is

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where O is the observed frequency and E is the expected frequency. It makes the calculations easier if we write the observed and expected frequencies in parallel columns, so that we can work out the corrected squared differences more easily.

|                          | O  | E     | (O − E)² | (O − E)²/E |
|--------------------------|----|-------|----------|------------|
| Fair hair and blue eyes  | 38 | 22.35 | 244.92   | 10.96      |
| Fair hair and brown eyes | 11 | 26.65 | 244.92   | 9.19       |
| Dark hair and blue eyes  | 14 | 29.65 | 244.92   | 8.26       |
| Dark hair and brown eyes | 51 | 35.35 | 244.92   | 6.93       |

All we need to do now is to add up the four components of chi-squared to get $\chi^2$ = 35.33. The question now arises: is this a big value of chi-squared or not?
This is important, because if it is a bigger value of chi-squared than we would expect by chance, then we should reject the null hypothesis. If, on

390

the other hand, it is within the range of values that we would expect by chance alone, then we should accept the null hypothesis.

We always proceed in the same way at this stage. We have a calculated value of the test statistic: $\chi^2$ = 35.33. We compare this value of the test statistic with the relevant critical value. To work out the critical value of chi-squared we need two things:

- the number of degrees of freedom, and
- the degree of certainty with which to work.

In general, a contingency table has a number of rows (r) and a number of columns (c), and the degrees of freedom is given by

$$\text{d.f.} = (r - 1) \times (c - 1).$$

So we have (2 − 1) × (2 − 1) = 1 degree of freedom for a 2 × 2 contingency table. You can see why there is only one degree of freedom by working through our example. Take the ‘fair hair, brown eyes’ box (the top right in the table) and ask how many values this could possibly take. The first thing to note is that the count could not be more than 49, otherwise the row total would be wrong. But in principle, the number in this box is free to take any value between 0 and 49. We have one degree of freedom for this box. But when we have fixed this box to be, say, 11,

|               | Blue eyes | Brown eyes | Row totals |
|---------------|-----------|------------|------------|
| Fair hair     |           | 11         | 49         |
| Dark hair     |           |            | 65         |
| Column totals | 52        | 62         | 114        |

you will see that we have no freedom at all for any of the other three boxes. The top left box has to be 49 − 11 = 38 because the row total is fixed at 49. Once the top left box is defined as 38 then the bottom left box has to be 52 − 38 = 14 because the column total is fixed (the total number of people with blue eyes was 52). This means that the bottom right box has to be 65 − 14 = 51. Thus, because the marginal totals are constrained, a 2 × 2 contingency table has just one degree of freedom.

The next thing we need to do is say how certain we want to be about the falseness of the null hypothesis.
The more certain we want to be, the larger the value of chi-squared we would need to reject the null hypothesis. It is conventional to work at the 95% level. That is our certainty level, so our uncertainty level is 100 − 95 = 5%. Expressed as a decimal, this is called alpha (α = 0.05). Technically, alpha is the probability of rejecting the null hypothesis when it is true. This is called a Type I error. A Type II error is accepting the null hypothesis when it is false.

Critical values in R are obtained by use of quantiles (q) of the appropriate statistical distribution. For the chi-squared distribution, this function is called qchisq. The function has two arguments: the certainty level (p = 0.95), and the degrees of freedom (d.f. = 1):

    qchisq(0.95,1)
    [1] 3.841459

The critical value of chi-squared is 3.841. Since the calculated value of the test statistic is greater than the critical value we reject the null hypothesis.

What have we learned so far? We have rejected the null hypothesis that eye colour and hair colour are independent. But that is not the end of the story, because we have not established the way in which they are related (e.g. is the correlation between them positive or negative?). To do this we need to look carefully

391

at the data, and compare the observed and expected frequencies. If fair hair and blue eyes were positively correlated, would the observed frequency be greater or less than the expected frequency? A moment’s thought should convince you that the observed frequency will be greater than the expected frequency when the traits are positively correlated (and less when they are negatively correlated). In our case we expected only 22.35 but we observed 38 people (nearly twice as many) to have both fair hair and blue eyes. So it is clear that fair hair and blue eyes are positively associated.

In R the procedure is very straightforward. We start by defining the counts as a 2 × 2 matrix like this:

    count <- matrix(c(38,14,11,51),nrow=2)
    count
         [,1] [,2]
    [1,]   38   11
    [2,]   14   51

Notice that you enter the data columnwise (not row-wise) into the matrix. Then the test uses the chisq.test function, with the matrix of counts as its only argument:

    chisq.test(count)

            Pearson's Chi-squared test with Yates' continuity correction
    data:  count
    X-squared = 33.112, df = 1, p-value = 8.7e-09

The calculated value of chi-squared is slightly different from ours, because Yates’ correction has been applied as the default (see Sokal and Rohlf, 1995, p. 736). If you switch the correction off (correct=F), you get the value we calculated by hand:

    chisq.test(count,correct=F)

            Pearson's Chi-squared test
    data:  count
    X-squared = 35.3338, df = 1, p-value = 2.778e-09

It makes no difference at all to the interpretation: there is a highly significant positive association between fair hair and blue eyes for this group of people. If you need to extract the frequencies expected under the null hypothesis of independence then use:

    chisq.test(count,correct=F)$expected
             [,1]     [,2]
    [1,] 22.35088 26.64912
    [2,] 29.64912 35.35088

8.8.2 G test of contingency

The idea is exactly the same. We are looking for evidence of non-independence of hair colour and eye colour.
Even the distribution of the critical value is the same: chi-squared. The difference is in the test statistic. Instead of computing Pearson’s chi-squared $\sum (O - E)^2 / E$, we compute the deviance from a log-linear model (see p. 562):

$$G = 2 \sum O \ln\left(\frac{O}{E}\right).$$
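Both test statistics can be computed directly from the observed counts. The following is a minimal sketch using the counts from the hair and eye colour example; the object names expected, pearson and G are chosen for the illustration, and the expected frequencies are kept at full precision rather than rounded to two decimal places.

```r
# Observed counts from the hair/eye colour example, entered columnwise
count <- matrix(c(38, 14, 11, 51), nrow = 2,
                dimnames = list(hair = c("fair", "dark"),
                                eyes = c("blue", "brown")))

# Expected frequencies under independence: row total x column total / grand total
expected <- outer(rowSums(count), colSums(count)) / sum(count)

pearson <- sum((count - expected)^2 / expected)      # Pearson's chi-squared
G       <- 2 * sum(count * log(count / expected))    # G statistic (deviance)

c(pearson = pearson, G = G)
qchisq(0.95, df = 1)                                 # critical value with 1 d.f.
pchisq(c(pearson, G), df = 1, lower.tail = FALSE)    # p values for both statistics
```

Because the expected frequencies are not rounded here, G can differ from a hand calculation in the third decimal place. Note also that the formula for G assumes that no observed count is zero, since log(0) is not finite.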

392

Here are the calculations:

|                          | O  | E     | ln(O/E)    | O ln(O/E)  |
|--------------------------|----|-------|------------|------------|
| Fair hair and blue eyes  | 38 | 22.35 | 0.5307598  | 20.168874  |
| Fair hair and brown eyes | 11 | 26.65 | −0.8848939 | −9.733833  |
| Dark hair and blue eyes  | 14 | 29.65 | −0.7504048 | −10.505667 |
| Dark hair and brown eyes | 51 | 35.35 | 0.3665272  | 18.692889  |

The test statistic G is twice the sum of the right-hand column: 2 × 18.62226 = 37.24453. This value is compared with chi-squared in tables with 1 d.f. as before. The calculated value of the test statistic is much greater than the critical value (3.841) so we reject the null hypothesis of independence. Hair colour and eye colour are correlated in this group of people. We need to look at the data to see which way the correlation goes. We see far more people with fair hair and blue eyes (38) than expected under the null hypothesis of independence (22.35) so the correlation is positive. Pearson’s chi-squared was $\chi^2$ = 35.33 (above) so the test statistic values are slightly different ($\chi^2$ = 37.24 in the G test) but the interpretation is identical.

8.8.3 Unequal probabilities in the null hypothesis

So far we have assumed equal probabilities, but chisq.test can deal with cases with unequal probabilities. This example has 21 individuals distributed over four categories:

    chisq.test(c(10,3,2,6))

            Chi-squared test for given probabilities
    data:  c(10, 3, 2, 6)
    X-squared = 7.381, df = 3, p-value = 0.0607

The four counts are not significantly different if the probability of appearing in each of the four cells is 0.25 (the calculated p-value is greater than 0.05). However, if the null hypothesis was that the third and fourth cells had 1.5 times the probability of the first two cells, then these counts differ significantly from expectation (p = 0.0102):

    chisq.test(c(10,3,2,6),p=c(0.2,0.2,0.3,0.3))

            Chi-squared test for given probabilities
    data:  c(10, 3, 2, 6)
    X-squared = 11.3016, df = 3, p-value = 0.0102
    Warning message:
    In chisq.test(c(10, 3, 2, 6), p = c(0.2, 0.2, 0.3, 0.3)) :
      Chi-squared approximation may be incorrect

Note the warning message associated with the low expected frequencies in cells 1 and 2.

8.8.4 Chi-squared tests on table objects

You can use the chisq.test function with table objects as well as vectors. To test the random number generator as a simulator of the throws of a six-sided die we could simulate 100 throws like this, then use table to count the number of times each number appeared:

    die <- ceiling(runif(100,0,6))
    table(die)

393

    die
     1  2  3  4  5  6
    23 15 20 14 12 16

So we observed only 12 fives in this trial and 23 ones. But is this a significant departure from fairness of the die? chisq.test will answer this:

    chisq.test(table(die))

            Chi-squared test for given probabilities
    data:  table(die)
    X-squared = 5, df = 5, p-value = 0.4159

No. This is a fair die (p = 0.4159). Note that the syntax is chisq.test(table(die)) not chisq.test(die) and that there are 5 degrees of freedom in this case.

8.8.5 Contingency tables with small expected frequencies: Fisher’s exact test

When one or more of the expected frequencies is less than 4 (or 5 depending on the rule of thumb you follow) then it is wrong to use Pearson’s chi-squared or log-linear models (G tests) for your contingency table. This is because small expected values inflate the value of the test statistic, and it no longer can be assumed to follow the chi-squared distribution. The individual counts are a, b, c and d like this:

|               | Column 1 | Column 2 | Row totals |
|---------------|----------|----------|------------|
| Row 1         | a        | b        | a + b      |
| Row 2         | c        | d        | c + d      |
| Column totals | a + c    | b + d    | n          |

The probability of any one particular outcome is given by

$$p = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{a!\,b!\,c!\,d!\,n!}$$

where n is the grand total.

Our data concern the distribution of eight ants’ nests over 10 trees of each of two species of tree (A and B). There are two categorical explanatory variables (ants and trees), and four contingencies, ants (present or absent) and trees (A or B). The response variable is the vector of four counts c(6,4,2,8) entered columnwise:

|               | Tree A | Tree B | Row totals |
|---------------|--------|--------|------------|
| With ants     | 6      | 2      | 8          |
| Without ants  | 4      | 8      | 12         |
| Column totals | 10     | 10     | 20         |

We can calculate the probability for this particular outcome:

    factorial(8)*factorial(12)*factorial(10)*factorial(10)/
    (factorial(6)*factorial(2)*factorial(4)*factorial(8)*factorial(20))
    [1] 0.07501786
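The factorial formula can be wrapped in a small function so that it can be reused for other tables. The following is a minimal sketch; table_prob is a name invented for this illustration, and the use of log-factorials is our own choice to avoid overflow with larger counts. The result is cross-checked against the hypergeometric density dhyper, which describes the same probability.

```r
# Probability of one particular 2 x 2 table with fixed margins.
# table_prob is a hypothetical helper name used only for this sketch.
table_prob <- function(a, b, c, d) {
  n <- a + b + c + d
  # work on the log scale so that large counts do not overflow factorial()
  log_p <- lfactorial(a + b) + lfactorial(c + d) +
    lfactorial(a + c) + lfactorial(b + d) -
    (lfactorial(a) + lfactorial(b) + lfactorial(c) + lfactorial(d) + lfactorial(n))
  exp(log_p)
}

# Ants example: a = 6, b = 2 (with ants on trees A, B); c = 4, d = 8 (without ants)
p_obs <- table_prob(6, 2, 4, 8)
p_obs

# The same probability from the hypergeometric distribution:
# 6 of the 8 nests fall on species A, given 10 A trees and 10 B trees
dhyper(6, m = 10, n = 10, k = 8)
```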

394

But this is only part of the story. We need to compute the probability of outcomes that are more extreme than this. There are two of them. Suppose only 1 ant colony had been found on tree B. Then the table values would be 7, 1, 3, 9 but the row and column totals would be exactly the same (the marginal totals are constrained). The numerator always stays the same, so this case has probability

    factorial(8)*factorial(12)*factorial(10)*factorial(10)/
    (factorial(7)*factorial(3)*factorial(1)*factorial(9)*factorial(20))
    [1] 0.009526078

There is an even more extreme case if no ant colonies at all were found on tree B. Now the table elements become 8, 0, 2, 10 with probability

    factorial(8)*factorial(12)*factorial(10)*factorial(10)/
    (factorial(8)*factorial(2)*factorial(0)*factorial(10)*factorial(20))
    [1] 0.0003572279

and we need to add these three probabilities together:

    0.07501786 + 0.009526078 + 0.0003572279
    [1] 0.08490117

But there was no a priori reason for expecting that the result would be in this direction. It might have been tree A that happened to have relatively few ant colonies. We need to allow for extreme counts in the opposite direction by doubling this probability (Fisher’s exact test is two-tailed by default in R, and doubling is exact here because the two column totals are equal):

    2*(0.07501786 + 0.009526078 + 0.0003572279)
    [1] 0.1698023

This shows that there is no evidence of any correlation between tree and ant colonies. The observed pattern, or a more extreme one, could have arisen by chance alone with probability p = 0.17.

There is a built-in function called fisher.test, which saves us all this tedious computation. It takes as its argument a 2 × 2 matrix containing the counts of the four contingencies.
We make the matrix like this (compare with the alternative method of making a matrix, above):

    x <- as.matrix(c(6,4,2,8))
    dim(x) <- c(2,2)
    x
         [,1] [,2]
    [1,]    6    2
    [2,]    4    8

We then run the test like this:

    fisher.test(x)

            Fisher's Exact Test for Count Data
    data:  x
    p-value = 0.1698
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
      0.6026805 79.8309210

395

    sample estimates:
    odds ratio
      5.430473

You see the same non-significant p-value that we calculated by hand. Another way of using the function is to provide it with two vectors containing factor levels, instead of a two-dimensional matrix of counts. This saves you the trouble of counting up how many combinations of each factor level there are:

    table <- read.table("c:\\temp\\fisher.txt",header=TRUE)
    head(table)
      tree nests
    1    A  ants
    2    B  ants
    3    A  none
    4    A  ants
    5    B  none
    6    A  none

    attach(table)
    fisher.test(tree,nests)

            Fisher's Exact Test for Count Data
    data:  tree and nests
    p-value = 0.1698
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
      0.6026805 79.8309210
    sample estimates:
    odds ratio
      5.430473

The fisher.test procedure can be used with matrices much bigger than 2 × 2.

8.9 Correlation and covariance

With two continuous variables, x and y, the question naturally arises as to whether their values are correlated with each other. Correlation is defined in terms of the variance of x, the variance of y, and the covariance of x and y (the way the two vary together, which is to say the way they covary) on the assumption that both variables are normally distributed. We have symbols already for the two variances: $s_x^2$ and $s_y^2$. We denote the covariance of x and y by cov(x, y), so the correlation coefficient r is defined as

$$r = \frac{\operatorname{cov}(x, y)}{\sqrt{s_x^2 s_y^2}}.$$

We know how to calculate variances, so it remains only to work out the value of the covariance of x and y. Covariance is defined in terms of the expectation of the product xy: the covariance of x and y is the expectation of the product minus the product of the two expectations, $\operatorname{cov}(x, y) = E[xy] - E[x]E[y]$. Note that when x and y are independent (and therefore not correlated) then the covariance between x and y is 0, so E[xy] = E[x].E[y] (i.e. the product of their mean values).
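The definition of covariance as the expectation of the product minus the product of the expectations can be checked numerically. The following is a minimal sketch on simulated data (x_sim and y_sim are invented for the illustration). One detail is worth flagging: the sample covariance returned by R divides by n − 1 rather than n, so the expectation form has to be rescaled before the two agree.

```r
# Simulated data for illustration only (not taken from any data file in the text)
set.seed(42)
x_sim <- rnorm(50, mean = 30, sd = 10)
y_sim <- 3 * x_sim + rnorm(50, mean = 0, sd = 10)
n <- length(x_sim)

# Expectation of the product minus the product of the expectations
cov_population <- mean(x_sim * y_sim) - mean(x_sim) * mean(y_sim)

# R's var(x, y) and cov(x, y) divide by n - 1 rather than n
cov_sample <- cov_population * n / (n - 1)
c(by_hand = cov_sample, built_in = cov(x_sim, y_sim))

# Correlation: covariance over the square root of the product of the variances
r_by_hand <- cov(x_sim, y_sim) / sqrt(var(x_sim) * var(y_sim))
c(by_hand = r_by_hand, built_in = cor(x_sim, y_sim))
```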

396

Let us work through a numerical example:

    data <- read.table("c:\\temp\\twosample.txt",header=T)
    attach(data)
    plot(x,y,pch=21,col="red",bg="orange")

[Figure: scatterplot of y against x]

There is clearly a strong positive correlation between the two variables. First, we need the variance of x and the variance of y:

    var(x)
    [1] 199.9837
    var(y)
    [1] 977.0153

The covariance of x and y, cov(x, y), is given by the var function when we supply it with two vectors like this:

    var(x,y)
    [1] 414.9603

Thus, the correlation coefficient should be $414.96 / \sqrt{199.98 \times 977.02}$:

    var(x,y)/sqrt(var(x)*var(y))
    [1] 0.9387684

Let us see if this checks out:

    cor(x,y)
    [1] 0.9387684

397

So now you know the definition of the correlation coefficient: it is the covariance divided by the geometric mean of the two variances.

8.9.1 Data dredging

The R function cor returns the correlation matrix of a data matrix, or a single value showing the correlation between one vector and another (as above):

    pollute <- read.table("c:\\temp\\Pollute.txt",header=T)
    attach(pollute)
    cor(pollute)
                 Pollution        Temp    Industry  Population        Wind        Rain    Wet.days
    Pollution   1.00000000 -0.43360020  0.64516550  0.49377958  0.09509921  0.05428389  0.36956363
    Temp       -0.43360020  1.00000000 -0.18788200 -0.06267813 -0.35112340  0.38628047 -0.43024212
    Industry    0.64516550 -0.18788200  1.00000000  0.95545769  0.23650590 -0.03121727  0.13073780
    Population  0.49377958 -0.06267813  0.95545769  1.00000000  0.21177156 -0.02606884  0.04208319
    Wind        0.09509921 -0.35112340  0.23650590  0.21177156  1.00000000 -0.01246601  0.16694974
    Rain        0.05428389  0.38628047 -0.03121727 -0.02606884 -0.01246601  1.00000000  0.49605834
    Wet.days    0.36956363 -0.43024212  0.13073780  0.04208319  0.16694974  0.49605834  1.00000000

The phrase ‘data dredging’ is used disparagingly to describe the act of trawling through a table like this, desperately looking for big values which might suggest relationships that you can publish. This behaviour is not to be encouraged. The raw correlation suggests that there is a very strong positive relationship between Industry and Population (r = 0.9555). The correct approach is model simplification (see p. 391), which indicates that people live in places with less, not more, polluted air. Note that the correlations are identical in opposite halves of the matrix (in contrast to regression, where regression of y on x would give different parameter values and standard errors than regression of x on y). The correlation between two vectors produces a single value:

    cor(Pollution,Wet.days)
    [1] 0.3695636

Correlations with single explanatory variables can be highly misleading if (as is typical) there is substantial correlation amongst the explanatory variables (collinearity; see p. 490).

8.9.2 Partial correlation

With more than two variables, you often want to know the correlation between x and y when a third variable, say, z, is held constant. The partial correlation coefficient measures this. It enables correlation due to a shared common cause to be distinguished from direct correlation. It is given by

$$r_{xy.z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}.$$

Suppose we had four variables and we wanted to look at the correlation between x and y holding the other two, z and w, constant. Then

$$r_{xy.zw} = \frac{r_{xy.z} - r_{xw.z} r_{yw.z}}{\sqrt{(1 - r_{xw.z}^2)(1 - r_{yw.z}^2)}}.$$
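The first-order formula is easy to write as a function. The following is a minimal sketch on simulated data in which z is a common cause of both x and y; partial_cor and the simulated vectors are names invented for this illustration. As a cross-check, the same quantity is obtained by correlating the residuals left after regressing each variable on z.

```r
# First-order partial correlation of x and y with z held constant.
# partial_cor is a hypothetical helper name, not a base R function.
partial_cor <- function(x, y, z) {
  r_xy <- cor(x, y)
  r_xz <- cor(x, z)
  r_yz <- cor(y, z)
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Simulated data in which z is a common cause of both x and y
set.seed(7)
z_sim <- rnorm(200)
x_sim <- z_sim + rnorm(200)
y_sim <- z_sim + rnorm(200)

cor(x_sim, y_sim)                   # raw correlation, driven by the shared cause
partial_cor(x_sim, y_sim, z_sim)    # correlation once z is held constant

# Cross-check: correlate the residuals after regressing each variable on z
cor(resid(lm(x_sim ~ z_sim)), resid(lm(y_sim ~ z_sim)))
```

In this construction x and y are related only through z, so we would expect the partial correlation to be much closer to zero than the raw correlation.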

398

You will need partial correlation coefficients if you want to do path analysis. R has a package called sem for carrying out structural equation modelling (including the production of path.diagram) and another called corpcor for converting correlations into partial correlations using the cor2pcor function (or vice versa with pcor2cor).

8.9.3 Correlation and the variance of differences between variables

Samples often exhibit positive correlations that result from pairing, as in the upstream and downstream invertebrate biodiversity data that we investigated earlier. There is an important general question about the effect of correlation on the variance of differences between variables. In the extreme, when two variables are so perfectly correlated that they are identical, then the difference between one variable and the other is zero. So it is clear that the variance of a difference will decline as the strength of positive correlation increases.

The following data show the depth of the water table (in centimetres below the surface) in winter and summer at 10 locations:

    data <- read.table("c:\\temp\\wtable.txt",header=T)
    attach(data)
    names(data)
    [1] "summer" "winter"

We begin by asking whether there is a correlation between summer and winter water table depths across locations:

    cor(summer, winter)
    [1] 0.6596923

There is a reasonably strong positive correlation (p = 0.03795, which is marginally significant; see below). Not surprisingly, places where the water table is high in summer tend to have a high water table in winter as well. If you want to determine the significance of a correlation (i.e. the p value associated with the calculated value of r) then use cor.test rather than cor. This test has non-parametric options for Kendall’s tau or Spearman’s rank, depending on the method you specify (method="k" or method="s"), but the default method is Pearson’s product-moment correlation (method="p"):

    cor.test(summer, winter)

            Pearson's product-moment correlation
    data:  summer and winter
    t = 2.4828, df = 8, p-value = 0.03795
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.05142655 0.91094772
    sample estimates:
          cor
    0.6596923

Now, let us investigate the relationship between the correlation coefficient and the three variances: the summer variance, the winter variance, and the variance of the differences (winter minus summer water

399

table depth):

    varS <- var(summer)
    varW <- var(winter)
    varD <- var(winter-summer)
    varS;varW;varD
    [1] 15.13203
    [1] 7.541641
    [1] 8.579066

The correlation coefficient ρ is related to these three variances by:

$$\rho = \frac{\sigma_y^2 + \sigma_z^2 - \sigma_{y-z}^2}{2 \sigma_y \sigma_z}$$

So, using the values we have just calculated, we get the correlation coefficient to be

    (varS+varW-varD)/(2*sqrt(varS)*sqrt(varW))
    [1] 0.6596923

which checks out. We can also see whether the variance of the difference is equal to the sum of the component variances (see p. 362):

    varD
    [1] 8.579066
    varS+varW
    [1] 22.67367

No, it is not. They would be equal only if the two samples were independent. In fact, we know that the two variables are positively correlated, so the variance of the difference should be less than the sum of the variances by an amount equal to $2 \times r \times s_1 \times s_2$:

    varS+varW-varD
    [1] 14.09461
    2 * cor(summer,winter) * sqrt(varS) * sqrt(varW)
    [1] 14.09461

That’s more like it.

8.9.4 Scale-dependent correlations

Another major difficulty with correlations is that scatterplots can give a highly misleading impression of what is going on. The moral of this exercise is very important: things are not always as they seem. The following data show the number of species of mammals (y) in forests of differing productivity (x):

    productivity <- read.table("c:\\temp\\productivity.txt",header=T)
    attach(productivity)
    head(productivity)

400

      x y f
    1 1 3 a
    2 2 4 a
    3 3 2 a
    4 4 1 a
    5 5 3 a
    6 6 1 a

    plot(x,y,pch=21,col="blue",bg="green",
    xlab="Productivity",ylab="Mammal species")

[Figure: scatterplot of Mammal species against Productivity]

    cor.test(x,y)

            Pearson's product-moment correlation
    data:  x and y
    t = 7.5229, df = 52, p-value = 7.268e-10
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.5629686 0.8293555
    sample estimates:
          cor
    0.7219081

There is evidently a significant positive correlation (p < 0.000001) between mammal species and productivity: increasing productivity is associated with increasing species richness. However, when we look at the relationship for each region (f) separately using coplot, we see exactly the opposite relationship:

401 [Figure: coplot of y against x, given f, with one panel for each of the regions a to g]

The pattern is obvious. In every single case, increasing productivity is associated with reduced mammal species richness within each region (regions are labelled a–g from bottom left). The lesson is clear: you need to be extremely careful when looking at correlations across different scales. Things that are positively correlated over short time scales may turn out to be negatively correlated in the long term. Things that appear to be positively correlated at large spatial scales may turn out (as in this example) to be negatively correlated at small scales.

8.10 Kolmogorov–Smirnov test

People know this test for its wonderful name, rather than for what it actually does. It is an extremely simple test for asking one of two different questions:

• Are two sample distributions the same, or are they significantly different from one another in one or more (unspecified) ways?
• Does a particular sample distribution arise from a particular hypothesized distribution?

The two-sample problem is the one most often used. The apparently simple question is actually very broad. It is obvious that two distributions could be different because their means were different. But two distributions

402 with exactly the same mean could be significantly different if they differed in variance, or in skew or kurtosis (see p. 350).

The Kolmogorov–Smirnov test works on cumulative distribution functions. These give the probability that a randomly selected value of X is less than or equal to x:

$$F(x) = P[X \le x].$$

This sounds somewhat abstract. Suppose we had insect wing sizes (y) for two geographically separated populations (A and B) and we wanted to test whether the distribution of wing lengths was the same in the two places:

    data <- read.table("c:\\temp\\ksdata.txt",header=T)
    attach(data)
    names(data)
    [1] "y" "site"

We start by extracting the data for the two populations, and describing the samples:

    table(site)
    site
     A  B
    10 12

There are 10 samples from site A and 12 from site B.

    tapply(y,site,mean)
            A         B
     4.355266 11.665089
    tapply(y,site,var)
           A        B
    27.32573 90.30233

Their means are quite different, but the size of the difference in their variances precludes using a t test. We start by plotting the cumulative probabilities for the two samples on the same axes, bearing in mind that there are 10 values of A and 12 values of B:

    # A and B are not defined in the text as extracted; taken here to be the y values for each site
    A <- y[site=="A"]
    B <- y[site=="B"]
    plot(seq(0,1,length=12),cumsum(sort(B)/sum(B)),type="l",
    ylab="Cumulative probability",xlab="Index",col="red")
    lines(seq(0,1,length=10),cumsum(sort(A)/sum(A)),col="blue")

403 [Figure: cumulative probability (y axis) against index (x axis) for site A (blue line) and site B (red line)]

It certainly looks as if population A (the blue line) is different. We test the significance of the difference between the two distributions with ks.test like this:

    ks.test(y[site=="A"],y[site=="B"])
    Two-sample Kolmogorov-Smirnov test
    data: y[site == "A"] and y[site == "B"]
    D = 0.55, p-value = 0.04889
    alternative hypothesis: two-sided

The test works despite the difference in length of the two vectors, and shows a marginally significant difference between the two sites (p = 0.049).

The other test involves comparing one sample with the probability function of a named distribution. Let us test whether the larger sample from site B is normally distributed, using pnorm as the probability function, with specified mean and standard deviation:

    ks.test(y[site=="B"],"pnorm",mean(y[site=="B"]),sd(y[site=="B"]))
    One-sample Kolmogorov-Smirnov test
    data: y[site == "B"]
    D = 0.1844, p-value = 0.7446
    alternative hypothesis: two-sided

There is no evidence that the samples from site B depart significantly from normality. Note, however, that the Shapiro–Wilk test

    shapiro.test(y[site=="B"])
    Shapiro-Wilk normality test

404

    data: y[site == "B"]
    W = 0.876, p-value = 0.0779

comes much closer to suggesting significant non-normality (above), while the standard model-checking quantile–quantile plot looks suspiciously non-normal:

    qqnorm(y[site=="B"],pch=16,col="blue")
    qqline(y[site=="B"],col="green",lty=2)

[Figure: Normal Q–Q Plot; sample quantiles (y axis) against theoretical quantiles (x axis)]

8.11 Power analysis

The power of a test is the probability of rejecting the null hypothesis when it is false. It has to do with Type II errors: β is the probability of accepting the null hypothesis when it is false. In an ideal world, we would obviously make β as small as possible. But there is a snag. The smaller we make the probability of committing a Type II error, the greater we make the probability of committing a Type I error, and rejecting the null hypothesis when, in fact, it is correct. This is a classic trade-off. A compromise is called for. Most statisticians work with α = 0.05 and β = 0.2. The power of a test is defined as 1 − β = 0.8 under the standard assumptions.

The issues involved are your choices of alpha and beta (the trade-off between Type I and Type II errors), the size of the effect you want to detect as being significant, the variance of the samples, and the sample size. If we are doing a two-sample t test, the value of the test statistic is the difference between the two means, d, divided by the standard error of the difference between two means (assuming equal variances, s², and equal sample sizes, n):

$$t = \frac{d}{\sqrt{\dfrac{2s^2}{n}}}$$

405 Let us rearrange this expression to find the sample size as a function of the other variables:

$$\sqrt{\frac{2s^2}{n}} = \frac{d}{t} \;\Rightarrow\; \frac{2s^2}{n} = \frac{d^2}{t^2}$$

so

$$n = \frac{2t^2s^2}{d^2}.$$

The value of t depends on our choice of power (1 − β = 0.8) and significance level (α = 0.025). Roughly speaking, the quantile associated with the 0.025 tail of a normal distribution is 1.96, and the quantile associated with 0.8 is 0.84. We add these quantiles to estimate t = 2.8, so t² is roughly 7.8. To get our rule of thumb, we round this up to 8. Now the formula for n is 2 × 8 × variance/the square of the difference:

$$n = \frac{16s^2}{d^2}.$$

The smaller the effect size that we want to be able to detect as being significant, the larger the sample size will need to be. Suppose that the control value of our response variable is known from the literature to have a mean of 20 and a standard deviation of 2 (so the variance is 4). The rule of thumb would give the following relationship:

[Figure: sample size per treatment (y axis) against difference to be significant (x axis); mean = 20, variance = 4.0]

So if you want to be able to detect an effect size of 1.0 you will need at least 60 samples per treatment. The standard idea of a ‘big-enough’ sample (n = 30) would enable you to detect an effect size of about 1.5 in this example. If you could only afford 10 replicates per treatment, you should not expect to be able to detect effects smaller than about 2.5.
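The rule of thumb can be checked numerically. The following is a minimal sketch, not code from the book: the helper name rule_of_thumb_n is hypothetical, and the variance of 4 and the differences of 1.0 to 2.5 are taken from the worked example above.

```r
# Normal quantiles behind the rule of thumb (0.025 tail and power of 0.8)
t_approx <- qnorm(0.975) + qnorm(0.8)   # roughly 1.96 + 0.84
t_approx^2                              # roughly 7.8, rounded up to 8 in the text

# Rule of thumb: n = 16 * s^2 / d^2 replicates per treatment
# (rule_of_thumb_n is a hypothetical helper name, not a built-in function)
rule_of_thumb_n <- function(variance, difference) {
  16 * variance / difference^2
}

s2 <- 4                        # variance from the worked example (sd = 2)
d  <- c(1.0, 1.5, 2.0, 2.5)    # differences we might want to detect
n_rule <- rule_of_thumb_n(s2, d)

# Round up, because replicates come in whole numbers
data.frame(difference = d, n_per_treatment = ceiling(n_rule))
```

For a difference of 1.0 the rule gives 16 × 4/1 = 64 replicates per treatment, in line with the ‘at least 60’ read from the graph.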

406 There are built-in functions in R for carrying out power analyses for ANOVA, proportion data and t tests:

• power.t.test: power calculations for one- and two-sample t tests;
• power.prop.test: power calculations for two-sample tests for proportions;
• power.anova.test: power calculations for balanced one-way ANOVA tests.

The arguments to the power.t.test function are n (the number of observations per group), delta (the difference in means we want to be able to detect; you will need to think hard about this value), sd (the standard deviation of the sample), sig.level (the significance level, i.e. Type I error probability, where you will often accept the default value of 5%), power (the power you want the test to have, where you will often accept the default value of 80%), type (the type of t test you want to carry out: two-sample, one-sample or paired) and alternative (whether you want to do a one- or a two-tailed test, where you will typically want to do the default, two-tailed test). One of the parameters n, delta, power, sd and sig.level must be passed as NULL, and that parameter will be calculated from the others. This sounds like a lot of work, but you will typically use all of the defaults so you only need to specify the difference, delta, and the standard deviation, sd, to work out the sample size n that will give you the power you want.

So how many replicates do we need in each of two samples to detect a difference of 10% with power = 80% when the mean is 20 (i.e. delta = 2.0) and the standard deviation is about 3.5?

    power.t.test(delta=2,sd=3.5,power=0.8)
    Two-sample t test power calculation
    n = 49.05349
    delta = 2
    sd = 3.5
    sig.level = 0.05
    power = 0.8
    alternative = two.sided
    NOTE: n is number in *each* group

The (perhaps rather shocking) answer is that you need at least 50 replicates from each sample (100 data points in all). If you had been working with a rule of thumb like ‘30 is a big enough sample’ then you would be severely disappointed in this case. You simply could not have detected a difference of 10% with this experimental design. You need 50 replicates in each sample (100 replicates in all) to achieve a power of 80%. You can work out what size of difference your sample of 30 would allow you to detect, by specifying n (15 in each treatment) and omitting delta:

    power.t.test(n=15,sd=3.5,power=0.8)
    Two-sample t test power calculation
    n = 15
    delta = 3.709303
    sd = 3.5
    sig.level = 0.05
    power = 0.8
    alternative = two.sided

This shows that you could have detected an 18.5% change (100 × 3.709/20), which is roughly double the effect size you hoped to be able to detect (10% = 2.0). The work you need to do before carrying out a power analysis, and before designing your experiment, is to find values for the standard deviation (from the literature or

407 by carrying out a pilot experiment) and the size of the difference you want to detect (from discussions with your sponsor and your colleagues). Experiments in ecology are often planned to be able to detect 50% effects. Aspiring to estimate effects as small as 10% would lead to impossibly large sample sizes (see the discussion in Perry et al., 2003).

8.12 Bootstrap

We want to use bootstrapping to obtain a 95% confidence interval for the mean of a vector of numbers called values:

    data <- read.table("c:\\temp\\skewdata.txt",header=T)
    attach(data)
    names(data)
    [1] "values"

We shall sample with replacement from values using sample(values,replace=T), then work out the mean, repeating this operation 10 000 times, and storing the 10 000 different mean values in a vector called ms:

    ms <- numeric(10000)
    for (i in 1:10000) {
    ms[i] <- mean(sample(values,replace=T)) }

The answer to our problem is provided by the quantile function applied to ms: we want to know the values of ms associated with its 0.025 and 0.975 tails:

    quantile(ms,c(0.025,0.975))
        2.5%    97.5%
    24.97918 37.62932

Thus the intervals below and above the mean are

    mean(values)-quantile(ms,c(0.025,0.975))
        2.5%     97.5%
    5.989472 -6.660659

How does this compare with the parametric confidence interval, $\mathrm{CI} = 1.96 \times \sqrt{s^2/n}$?

    1.96*sqrt(var(values)/length(values))
    [1] 6.569802

Close, but not identical. Our bootstrapped intervals are skew because the data are skewed, but the parametric interval, of course, is symmetric.

Now let us see how to do the same thing using the boot function from the library called boot:

    install.packages("boot")
    library(boot)

The syntax of boot is very simple:

    boot(data, statistic, R)

408 The trick to using boot lies in understanding how to write the statistic function. R is the number of resamplings you want to do (R=10000 in this example), and data is the name of the data object to be resampled (values in this case). The attribute we want to estimate repeatedly is the mean value of values. Thus, the first argument to our function must be values. The second argument is an index (a vector of subscripts) that is used within boot to select random assortments of values. Our statistic function can use the built-in function mean to calculate the mean value of the sample of values.

    mymean <- function(values,i) mean(values[i])

The key point is that we write mean(values[i]) not mean(values). Now we can run the bootstrap for 10 000 iterations:

    myboot <- boot(values,mymean,R=10000)
    myboot
    ORDINARY NONPARAMETRIC BOOTSTRAP
    Call:
    boot(data = values, statistic = mymean, R = 10000)
    Bootstrap Statistics :
        original        Bias  std. error
    t1* 30.96866 -0.08155796    3.266455

The output is interpreted as follows. The original is the mean of the whole sample:

    mean(values)
    [1] 30.96866

while bias is the difference between the arithmetic mean and the mean of the bootstrapped samples which are in the variable called myboot$t:

    mean(myboot$t)-mean(values)
    [1] -0.08155796

and std. error is the standard deviation of the simulated values in myboot$t:

    sqrt(var(myboot$t))
             [,1]
    [1,] 3.266455

The components of myboot can be used to do other things. For instance, we can compare our homemade vector (ms above) with a histogram of myboot$t:

    windows(7,4)
    par(mfrow=c(2,1))
    hist(ms)
    hist(myboot$t)

They differ in detail because they were generated with different series of random numbers. Here are the 95% intervals for comparison with ours, calculated from the quantiles of myboot$t:

    mean(values)-quantile(myboot$t,c(0.025,0.975))
        2.5%     97.5%
    6.126120 -6.599232

409 There is a function boot.ci for calculating confidence intervals from the boot object:

    boot.ci(myboot)
    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    Based on 10000 bootstrap replicates
    CALL :
    boot.ci(boot.out = myboot)
    Intervals :
    Level      Normal             Basic
    95%   (24.65, 37.45)   (24.37, 37.10)
    Level    Percentile            BCa
    95%   (24.84, 37.57)   (25.63, 38.91)
    Calculations and Intervals on Original Scale
    Warning message:
    bootstrap variances needed for studentized intervals in: boot.ci(myboot)

Normal is the parametric CI based on the standard error of the mean and the sample size (p. 514). The Percentile interval is the quantile from the bootstrapped estimates:

    quantile(myboot$t,c(0.025,0.975))
        2.5%    97.5%
    24.84254 37.56789

which, as we saw earlier, was close to our home-made values (above). The BCa interval is the bias-corrected accelerated percentile. It is not greatly different in this case, but it is the interval preferred by statisticians. A more complex example of the use of bootstrapping involving a generalized linear model is explained on p. 570. For other examples see ?boot, and for more depth read the Davison and Hinkley (1997) book from which the boot package was developed (as programmed by A.J. Canty).
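The home-made bootstrap described in this section can be reproduced without the data file or the boot package. This is a minimal sketch under stated assumptions: the skewed sample is simulated with rgamma as a stand-in for skewdata.txt (so the numbers will not match those in the text), the seed is arbitrary, and the names percentile_ci, basic_ci and parametric_ci are illustrative only.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
# Hypothetical skewed sample standing in for the values in skewdata.txt
values <- rgamma(30, shape = 2, scale = 15)

# Resample with replacement 10 000 times, storing the mean of each resample
ms <- replicate(10000, mean(sample(values, replace = TRUE)))

# Percentile interval: the 0.025 and 0.975 quantiles of the bootstrapped means
percentile_ci <- quantile(ms, c(0.025, 0.975))

# Basic interval: the same quantiles reflected about the observed mean
basic_ci <- 2 * mean(values) - rev(percentile_ci)

# Parametric interval from the formula in the text, symmetric by construction
half_width <- 1.96 * sqrt(var(values) / length(values))
parametric_ci <- mean(values) + c(-1, 1) * half_width

rbind(percentile = unname(percentile_ci),
      basic      = unname(basic_ci),
      parametric = parametric_ci)
```

With skewed data the percentile and basic intervals are generally asymmetric about the mean, whereas the parametric interval is not; comparing the three rows makes this visible.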

410 9 Statistical Modelling

The hardest part of any statistical work is getting started. And one of the hardest things about getting started is choosing the right kind of statistical analysis. The choice depends on the nature of your data and on the particular question you are trying to answer. The key is to understand what kind of response variable you have, and to know the nature of your explanatory variables. The response variable is the thing you are working on: it is the variable whose variation you are attempting to understand. This is the variable that goes on the y axis of the graph. The explanatory variable goes on the x axis of the graph; you are interested in the extent to which variation in the response variable is associated with variation in the explanatory variable. You also need to consider the way that the variables in your analysis measure what they purport to measure. A continuous measurement is a variable such as height or weight that can take any real numbered value. A categorical variable is a factor with two or more levels: sex is a factor with two levels (male and female), and colour might be a factor with seven levels (red, orange, yellow, green, blue, indigo, violet).

It is essential, therefore, that you can answer the following questions:

• Which of your variables is the response variable?
• Which are the explanatory variables?
• Are the explanatory variables continuous or categorical, or a mixture of both?
• What kind of response variable do you have: is it a continuous measurement, a count, a proportion, a time at death, or a category?

These simple keys will lead you to the appropriate statistical method:

The explanatory variables
(a) All explanatory variables continuous: Regression
(b) All explanatory variables categorical: Analysis of variance (ANOVA)
(c) Explanatory variables both continuous and categorical: Analysis of covariance (ANCOVA)

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.

412
• Think about model choice (p. 1)
  ◦ Which explanatory variables should be included?
  ◦ What transformation of the response is most appropriate?
  ◦ Which interactions should be included?
  ◦ Which non-linear terms should be included?
  ◦ Is there pseudoreplication, and if so, how should it be dealt with?
  ◦ Should the explanatory variables be transformed?
• Try to use the simplest kind of analysis that is appropriate to your data and the question you are trying to answer (e.g. do a one-way ANOVA rather than a mixed-effects model) (p. 344).
• Fit a maximal model and simplify it by stepwise deletion (p. 391).
• Check the minimal adequate model for constancy of variance and normality of errors using plot(model) (p. 405).
• Emphasize the effect sizes and standard errors (summary.lm), and play down the analysis of deviance table (summary.aov) (p. 382).
• Document carefully what you have done, and explain all the steps you took. That way, you should be able to understand what you did and why you did it, when you return to the analysis in 6 months’ time.

9.2 Maximum likelihood

What, exactly, do we mean when we say that the parameter values should afford the ‘best fit of the model to the data’? The convention we adopt is that our techniques should lead to unbiased, variance-minimizing estimators. We define ‘best’ in terms of maximum likelihood. This notion may be unfamiliar, so it is worth investing some time to get a feel for it. This is how it works:

• given the data,
• and given our choice of model,
• what values of the parameters of that model
• make the observed data most likely?

We judge the model on the basis of how likely the data would be if the model were correct.

9.3 The principle of parsimony (Occam’s razor)

One of the most important themes running through this book concerns model simplification. The principle of parsimony is attributed to the early fourteenth-century English nominalist philosopher, William of Occam, who insisted that, given a set of equally good explanations for a given phenomenon, the correct explanation is the simplest explanation. It is called Occam’s razor because he ‘shaved’ his explanations down to the bare minimum: his point was that in explaining something, assumptions must not be needlessly multiplied.

413 In particular, for the purposes of explanation, things not known to exist should not, unless it is absolutely necessary, be postulated as existing. For statistical modelling, the principle of parsimony means that:

• models should have as few parameters as possible;
• linear models should be preferred to non-linear models;
• experiments relying on few assumptions should be preferred to those relying on many;
• models should be pared down until they are minimal adequate;
• simple explanations should be preferred to complex explanations.

The process of model simplification is an integral part of hypothesis testing in R. In general, a variable is retained in the model only if it causes a significant increase in deviance when it is removed from the current model.

Seek simplicity, then distrust it.

In our zeal for model simplification, however, we must be careful not to throw the baby out with the bathwater. Einstein made a characteristically subtle modification to Occam’s razor. He said: ‘A model should be as simple as possible. But no simpler.’ Remember, too, what Oscar Wilde said: ‘Truth is rarely pure, and never simple.’

9.4 Types of statistical model

Fitting models to data is the central function of R. The process is essentially one of exploration; there are no fixed rules and no absolutes. The object is to determine a minimal adequate model (see Table 9.1) from the large set of potential models that might be used to describe the given set of data.

Table 9.1. Statistical modelling involves the selection of a minimal adequate model from a potentially large set of more complex models, using stepwise model simplification.

| Model | Interpretation |
| --- | --- |
| Saturated model | One parameter for every data point. Fit: perfect. Degrees of freedom: none. Explanatory power of the model: none. |
| Maximal model | Contains all (p) factors, interactions and covariates that might be of any interest. Many of the model’s terms are likely to be insignificant. Degrees of freedom: n − p − 1. Explanatory power of the model: it depends. |
| Minimal adequate model | A simplified model with 1 ≤ p′ ≤ p parameters. Fit: less than the maximal model, but not significantly so. Degrees of freedom: n − p′ − 1. Explanatory power of the model: r² = SSR/SSY. |
| Null model | Just one parameter, the overall mean ȳ. Fit: none; SSE = SSY. Degrees of freedom: n − 1. Explanatory power of the model: none. |

In this book we discuss five

414 types of model:

• the null model;
• the minimal adequate model;
• the current model;
• the maximal model; and
• the saturated model.

The stepwise progression from the saturated model (or the maximal model, whichever is appropriate) through a series of simplifications to the minimal adequate model is made on the basis of deletion tests. These are F tests or chi-squared tests that assess the significance of the increase in deviance that results when a given term is removed from the current model.

Models are representations of reality that should be both accurate and convenient. However, it is impossible to maximize a model’s realism, generality and holism simultaneously, and the principle of parsimony is a vital tool in helping to choose one model over another. Thus, we would only include an explanatory variable in a model if it significantly improved the fit of the model. The fact that we went to the trouble of measuring something does not mean we have to have it in our model. Parsimony says that, other things being equal, we prefer:

• a model with n − 1 parameters to a model with n parameters;
• a model with k − 1 explanatory variables to a model with k explanatory variables;
• a linear model to a model which is curved;
• a model without a hump to a model with a hump;
• a model without interactions to a model containing interactions between factors.

Other considerations include a preference for models containing explanatory variables that are easy to measure over variables that are difficult or expensive to measure. Also, we prefer models that are based on a sound mechanistic understanding of the process over purely empirical functions. Some variables are so important that we retain them in the model even though their parameters are not significantly different from zero (e.g. density dependence in population models).

Parsimony requires that the model should be as simple as possible. This means that the model should not contain any redundant parameters or factor levels. We achieve this by fitting a maximal model and then simplifying it by following one or more of these steps:

• remove non-significant interaction terms;
• remove non-significant quadratic or other non-linear terms;
• remove non-significant explanatory variables;
• group together factor levels that do not differ from one another;
• in ANCOVA, set non-significant slopes of continuous explanatory variables to zero.

All the above are subject, of course, to the caveats that the simplifications make good scientific sense and do not lead to significant reductions in explanatory power.

Just as there is no perfect model, so there may be no optimal scale of measurement for a model. Suppose, for example, we had a process that had Poisson errors with multiplicative effects amongst the explanatory

415 variables. Then, we must choose between three different scales, each of which optimizes one of three different properties:

• the scale of $\sqrt{y}$ would give constancy of variance;
• the scale of $y^{2/3}$ would give approximately normal errors;
• the scale of ln(y) would give additivity.

Thus, any measurement scale is always going to be a compromise, and we should choose the scale that gives the best overall performance of the model.

9.5 Steps involved in model simplification

There are no hard-and-fast rules, but the procedure laid out in Table 9.2 works well in practice. With large numbers of explanatory variables, and many interactions and non-linear terms, the process of model simplification can take a very long time. But this is time well spent because it reduces the risk of overlooking an important aspect of the data. It is important to realize that there is no guaranteed way of finding all the important structures in a complex dataframe.

Table 9.2. Model simplification process.

| Step | Procedure | Explanation |
| --- | --- | --- |
| 1 | Fit the maximal model | Fit all the factors, interactions and covariates of interest. Note the residual deviance. If you are using Poisson or binomial errors, check for overdispersion and rescale if necessary. |
| 2 | Begin model simplification | Inspect the parameter estimates using the R function summary. Remove the least significant terms first, using update -, starting with the highest-order interactions. |
| 3 | If the deletion causes an insignificant increase in deviance | Leave that term out of the model. Inspect the parameter values again. Remove the least significant term remaining. |
| 4 | If the deletion causes a significant increase in deviance | Put the term back in the model using update +. These are the statistically significant terms as assessed by deletion from the maximal model. |
| 5 | Keep removing terms from the model | Repeat steps 3 or 4 until the model contains nothing but significant terms. This is the minimal adequate model. If none of the parameters is significant, then the minimal adequate model is the null model. |

9.5.1 Caveats

Model simplification is an important process but it should not be taken to extremes. For example, care should be taken with the interpretation of deviances and standard errors produced with fixed parameters that have been estimated from the data. Again, the search for ‘nice numbers’ should not be pursued uncritically. Sometimes there are good scientific reasons for using a particular number (e.g. a power of 0.66 in an allometric

416 relationship between respiration and body mass). It is much more straightforward, for example, to say that yield increases by 2 kg per hectare for every extra unit of fertilizer, than to say that it increases by 1.947 kg. Similarly, it may be preferable to say that the odds of infection increase 10-fold under a given treatment, than to say that the logits increase by 2.321; without model simplification this is equivalent to saying that there is a 10.186-fold increase in the odds. It would be absurd, however, to fix on an estimate of 6 rather than 6.1 just because 6 is a whole number.

9.5.2 Order of deletion

The data in this book fall into two distinct categories. In the case of planned experiments, all of the treatment combinations are equally represented and, barring accidents, there are no missing values. Such experiments are said to be orthogonal. In the case of observational studies, however, we have no control over the number of individuals for which we have data, or over the combinations of circumstances that are observed. Many of the explanatory variables are likely to be correlated with one another, as well as with the response variable. Missing treatment combinations are commonplace, and the data are said to be non-orthogonal. This makes an important difference to our statistical modelling because, in orthogonal designs, the variation that is attributed to a given factor is constant, and does not depend upon the order in which factors are removed from the model. In contrast, with non-orthogonal data, we find that the variation attributable to a given factor does depend upon the order in which factors are removed from the model. We must be careful, therefore, to judge the significance of factors in non-orthogonal studies, when they are removed from the maximal model (i.e. from the model including all the other factors and interactions with which they might be confounded). Remember that, for non-orthogonal data, order matters.

Also, if your explanatory variables are correlated with each other, then the significance you attach to a given explanatory variable will depend upon whether you delete it from a maximal model or add it to the null model. If you always test by model simplification then you will not fall into this trap.

The fact that you have laboured long and hard to include a particular experimental treatment does not justify the retention of that factor in the model if the analysis shows it to have no explanatory power. ANOVA tables are often published containing a mixture of significant and non-significant effects. This is not a problem in orthogonal designs, because sums of squares can be unequivocally attributed to each factor and interaction term. But as soon as there are missing values or unequal weights, then it is impossible to tell how the parameter estimates and standard errors of the significant terms would have been altered if the non-significant terms had been deleted. The best practice is as follows:

• Say whether your data are orthogonal or not.
• Explain any correlations amongst your explanatory variables.
• Present a minimal adequate model.
• Give a list of the non-significant terms that were omitted, and the deviance changes that resulted from their deletion.

If you do this, then readers can judge for themselves the relative magnitude of the non-significant factors, and the importance of correlations between the explanatory variables.

The temptation to retain terms in the model that are ‘close to significance’ should be resisted. The best way to proceed is this. If a result would have been important if it had been statistically significant, then it is worth repeating the experiment with higher replication and/or more efficient blocking, in order to demonstrate the importance of the factor in a convincing and statistically acceptable way.
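The stepwise procedure of Table 9.2 can be sketched with update and anova. This is an illustration under stated assumptions rather than an analysis from the book: the data are simulated so that y depends on x but not on z, the seed is arbitrary, and the object names model1 to model4 are hypothetical.

```r
set.seed(42)                     # arbitrary seed, for reproducibility only
n <- 60
x <- runif(n, 0, 10)             # continuous explanatory variable that affects y
z <- runif(n, 0, 10)             # continuous explanatory variable unrelated to y
y <- 2 + 1.5 * x + rnorm(n)      # simulated response

# Step 1: fit the maximal model (both main effects and their interaction)
model1 <- lm(y ~ x * z)
summary(model1)

# Step 2: remove the highest-order term first, using update with a minus sign
model2 <- update(model1, ~ . - x:z)

# Steps 3 and 4: deletion test (an F test for lm) comparing the two models
anova(model1, model2)

# Remove z from the current model and test again
model3 <- update(model2, ~ . - z)
anova(model2, model3)

# Deleting x is expected to cause a large, significant increase in the
# residual sum of squares, so x would be retained (update with a plus sign
# puts a term back if it has been removed)
model4 <- update(model3, ~ . - x)
anova(model3, model4)
```

Each call to anova compares the current model with the simpler one; whether a given deletion is judged non-significant depends on the simulated sample, so the p values should be inspected rather than assumed.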

9.6 Model formulae in R

The structure of the model is specified in the model formula like this:

    response variable ~ explanatory variable(s)

where the tilde symbol ~ reads 'is modelled as a function of' (see Table 9.3 for examples). So a simple linear regression of y on x would be written as

    y~x

and a one-way ANOVA where sex is a two-level factor would be written as

    y~sex

Table 9.3. Examples of R model formulae. In a model formula, the function I (upper case 'I') stands for 'as is' and is used for generating sequences, I(1:10), or calculating quadratic terms, I(x^2).

| Model | Model formula | Comments |
|---|---|---|
| Null | y~1 | 1 is the intercept in regression models, but here it is the overall mean y |
| Regression | y~x | x is a continuous explanatory variable |
| Regression through origin | y~x-1 | Do not fit an intercept |
| One-way ANOVA | y~sex | sex is a two-level categorical variable |
| One-way ANOVA | y~sex-1 | As above, but do not fit an intercept (gives two means rather than a mean and a difference) |
| Two-way ANOVA | y~sex + genotype | genotype is a four-level categorical variable |
| Factorial ANOVA | y~N*P*K | N, P and K are two-level factors to be fitted along with all their interactions |
| Three-way ANOVA | y~N*P*K - N:P:K | As above, but do not fit the three-way interaction |
| Analysis of covariance | y~x + sex | A common slope for y against x but with two intercepts, one for each sex |
| Analysis of covariance | y~x * sex | Two slopes and two intercepts |
| Nested ANOVA | y~a/b/c | Factor c nested within factor b within factor a |
| Split-plot ANOVA | y~a*b*c+Error(a/b/c) | A factorial experiment but with three plot sizes and three different error variances, one for each plot size |
| Multiple regression | y~x + z | Two continuous explanatory variables, flat surface fit |
| Multiple regression | y~x * z | Fit an interaction term as well (x+z+x:z) |
| Multiple regression | y~x + I(x^2) + z + I(z^2) | Fit a quadratic term for both x and z |
| Multiple regression | y~poly(x,2) + z | Fit a quadratic polynomial for x and linear z |
| Multiple regression | y~(x + z + w)^2 | Fit three variables plus all their interactions up to two-way |
| Non-parametric model | y~s(x) + s(z) | y is a function of smoothed x and z in a generalized additive model |
| Transformed response and explanatory variables | log(y)~I(1/x) + sqrt(z) | All three variables are transformed in the model |
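The way R expands the operators in Table 9.3 can be inspected without fitting anything, because the terms function works on a bare formula. The sketch below uses a small helper called expand, which is a hypothetical name introduced here for illustration and not part of R.

```r
# Term labels that R derives from a formula; no data are needed for this.
# expand is a hypothetical helper name.
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ x * z)              # "x" "z" "x:z"
expand(y ~ (x + z + w)^2)      # main effects plus all two-way interactions
expand(y ~ N * P * K - N:P:K)  # factorial without the three-way interaction
expand(y ~ a / b / c)          # "a" "a:b" "a:b:c"

# The intercept flag is 1 by default and 0 when "-1" appears in the formula
attr(terms(y ~ x), "intercept")
attr(terms(y ~ x - 1), "intercept")
```

Inspecting the term labels in this way is a quick check that a complicated formula means what you intended before any model is fitted.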

The right-hand side of the model formula shows:

- the number of explanatory variables and their identities – their attributes (e.g. continuous or categorical) are usually defined prior to the model fit;
- the interactions between the explanatory variables (if any);
- non-linear terms in the explanatory variables.

On the right of the tilde, one also has the option to specify offsets or error terms in some special cases. As with the response variable, the explanatory variables can appear as transformations, or as powers or polynomials.

It is very important to note that symbols are used differently in model formulae than in arithmetic expressions. In particular:

    +   indicates inclusion of an explanatory variable in the model (not addition);
    -   indicates deletion of an explanatory variable from the model (not subtraction);
    *   indicates inclusion of explanatory variables and interactions (not multiplication);
    /   indicates nesting of explanatory variables in the model (not division);
    |   indicates conditioning (not 'or'), so that y~x | z is read as 'y as a function of x given z'.

There are several other symbols that have special meaning in model formulae. A colon denotes an interaction, so that A:B means the two-way interaction between A and B, and N:P:K:Mg means the four-way interaction between N, P, K and Mg.

Some terms can be written in an expanded form. Thus:

    A*B*C       is the same as   A+B+C+A:B+A:C+B:C+A:B:C;
    A/B/C       is the same as   A+B%in%A+C%in%B%in%A;
    (A+B+C)^3   is the same as   A*B*C;
    (A+B+C)^2   is the same as   A*B*C - A:B:C.

9.6.1 Interactions between explanatory variables

Interactions between two two-level categorical variables of the form A*B mean that two main effect means and one interaction mean are evaluated. On the other hand, if factor A has three levels and factor B has four levels, then seven parameters are estimated for the main effects (three means for A and four means for B).
The number of interaction terms is $(a - 1)(b - 1)$, where a and b are the numbers of levels of the factors A and B, respectively. So in this case, R would estimate $(3 - 1)(4 - 1) = 6$ parameters for the interaction.

Interactions between two continuous variables are fitted differently. If x and z are two continuous explanatory variables, then x*z means fit x+z+x:z and the interaction term x:z behaves as if a new variable had been computed that was the pointwise product of the two vectors x and z. The same effect could be obtained by calculating the product explicitly,

    product.xz <- x * z

then using the model formula y~x + z + product.xz. Note that the representation of the interaction by the product of the two continuous variables is an assumption, not a fact. The real interaction might be of an altogether different functional form (e.g. x * z^2).

Interactions between a categorical variable and a continuous variable are interpreted as an analysis of covariance; a separate slope and intercept are fitted for each level of the categorical variable. So

    y~A*x

would fit three regression equations if the factor A had three levels; this would estimate six parameters from the data – three slopes and three intercepts.

The slash operator is used to denote nesting. Thus, with categorical variables A and B,

    y ~ A/B

means fit 'A plus B within A'. This could be written in two other equivalent ways:

    y~A+A:B
    y~A+B %in% A

both of which alternatives emphasize that there is no point in attempting to estimate a main effect for B (it is probably just a factor label like 'tree number 1' that is of no scientific interest; see p. 681).

Some functions for specifying non-linear terms and higher-order interactions are useful. To fit a polynomial regression in x and z, we could write

    y ~ poly(x,3) + poly(z,2)

to fit a cubic polynomial in x and a quadratic polynomial in z. To fit interactions, but only up to a certain level, the ^ operator is useful. The formula

    y ~ (A + B + C)^2

fits all the main effects and two-way interactions (i.e. it excludes the three-way interaction that A*B*C would have included).

The I function (upper-case letter 'i') stands for 'as is'. It overrides the interpretation of a model symbol as a formula operator when the intention is to use it as an arithmetic operator. Suppose you wanted to fit 1/x as an explanatory variable in a regression. You might try

    y ~ 1/x

but this actually does something very peculiar. It fits x nested within the intercept (whatever that might represent). When it appears in a model formula, the slash operator is assumed to imply nesting.
To obtain the effect we want, we use I ('as is') to write

    y ~ I(1/x)

We also need to use I when we want * to represent multiplication and ^ to mean 'to the power' rather than an interaction model expansion: thus to fit x and $x^2$ in a quadratic regression we would write

    y~x+I(x^2)

9.6.2 Creating formula objects

You can speed up the creation of complicated model formulae using paste to create series of variable names and collapse to join the variable names together by symbols. Here, for instance, is a multiple regression formula with 25 continuous explanatory variables created using the as.formula function:

    xnames <- paste("x", 1:25, sep="")
    (model.formula <- as.formula(paste("y~", paste(xnames, collapse= "+"))))
    y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
        x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
        x22 + x23 + x24 + x25
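As an alternative sketch, base R also provides reformulate (in the stats package), which builds a formula directly from a character vector of term labels. The object names by.paste and by.reformulate below are hypothetical and used only for this comparison.

```r
# Build the same 25-term formula in two ways
xnames <- paste0("x", 1:25)

by.paste       <- as.formula(paste("y ~", paste(xnames, collapse = " + ")))
by.reformulate <- reformulate(xnames, response = "y")

# Both formulae contain the same terms
identical(attr(terms(by.paste), "term.labels"),
          attr(terms(by.reformulate), "term.labels"))
```

Which route to take is largely a matter of taste: paste with as.formula makes the string manipulation explicit, while reformulate hides it.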

9.7 Multiple error terms

When there is nesting (e.g. split plots in a designed experiment; see p. 685) or temporal pseudoreplication (see p. 695) you can include an Error function as part of the model formula. Suppose you had a three-factor factorial experiment with categorical variables A, B and C. The twist is that each treatment is applied to plots of different sizes: A is applied to replicated whole fields, B is applied at random to half fields and C is applied to smaller split–split plots within each half-field. This is shown in a model formula like this:

    y ~ A*B*C + Error(A/B/C)

Note that the terms within the model formula are separated by asterisks to show that it is a full factorial with all interaction terms included, whereas the terms are separated by slashes in the Error statement. There are as many terms in the Error statement as there are different sizes of plots – three in this case, although the smallest plot size (C in this example) can be omitted from the list – and the terms are listed left to right from the largest to the smallest plots; see p. 686 for details and examples.

For the more modern mixed-effects model using lmer (in package lme4), the preferred method is to create unique factor level names, rather than use slashes to indicate nesting. The colon operator is useful for this, when both of the arguments are factors. So if our nested factors A, B and C are numbers (say 1:2, 1:4 and 1:3, respectively)

    A <- rep(1:2,each=12)
    B <- rep(1:4,each=3,length=24)
    C <- rep(1:3,length=24)

then we compute new factors a, b, and c

    a <- factor(A)
    b <- factor(A):factor(B)
    c <- factor(A):factor(B):factor(C)

and then fit each as a separate random effect (see p. 692)

    lmer(y ~ x + (1|a)+(1|b)+(1|c) )

9.8 The intercept as parameter 1

The simple command

    y~1

causes the null model to be fitted.
This works out the grand mean (the overall average) of all the data, and the total deviance (or the total sum of squares, SSY, in models with normal errors and the identity link). In some cases, this may be the minimal adequate model; it is possible that none of the explanatory variables we have measured contribute anything significant to our understanding of the variation in the response variable. This is normally what you do not want to happen at the end of your three-year research project.

To remove the intercept (parameter 1) from a regression model (i.e. to force the regression line through the origin) you fit '-1' like this:

    y~x-1

You should not do this unless you know exactly what you are doing, and exactly why you are doing it. Removing the intercept from an ANOVA model where all the variables are categorical has a different effect:

    y~sex-1

This gives the mean for males and the mean for females in the summary table, rather than the mean for females and the difference in mean for males.

9.9 The update function in model simplification

In the update function used during model simplification, the dot '.' is used to specify 'what is there already' on either side of the tilde. So if your original model said

    model <- lm(y~A*B)

then the update function to remove the interaction term A:B could be written like this:

    model2 <- update(model,~.- A:B)

Note that there is no need to repeat the name of the response variable, and the punctuation 'tilde dot' means take model as it is, and remove from it the interaction term A:B.

9.10 Model formulae for regression

The important point to grasp is that model formulae look very like equations but there are important differences. Our simplest useful equation looks like this:

$$y = a + bx.$$

It is a two-parameter model with one parameter for the intercept, a, and another for the slope, b, of the graph of the continuous response variable y against a continuous explanatory variable x. The model formula for the same relationship looks like this:

    y~x

The equals sign is replaced by a tilde, and all of the parameters are left out. If we had a multiple regression with two continuous explanatory variables x and z, the equation would be

$$y = a + bx + cz,$$

but the model formula is

    y~x + z

It is all wonderfully simple. But just a minute. How does R know what parameters we want to estimate from the data? We have only told it the names of the explanatory variables. We have said nothing about how to fit them, or what sort of equation we want to fit to the data. The key to this is to understand what kind of explanatory variable is being fitted to the data.
If the explanatory variable x specified on the right of the tilde is a continuous variable, then R assumes that you want to do a regression, and hence that you want to estimate two parameters in a linear regression whose equation is y = a + bx .
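How the type of explanatory variable determines the parameters can be seen by looking at the names of the estimated coefficients. The sketch below uses simulated data with hypothetical names (x, z, sex, y); only the pattern of parameters matters, not their values.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
x   <- runif(20, 0, 10)
z   <- runif(20, 0, 10)
sex <- factor(rep(c("female", "male"), each = 10))
y   <- 2 + 0.5 * x + 0.2 * z + rnorm(20)

names(coef(lm(y ~ 1)))        # "(Intercept)": the overall mean
names(coef(lm(y ~ x)))        # intercept a and slope b
names(coef(lm(y ~ x + z)))    # a, b and c
names(coef(lm(y ~ x - 1)))    # slope only
names(coef(lm(y ~ sex)))      # mean for females, difference for males
names(coef(lm(y ~ sex - 1)))  # one mean per sex
```

A continuous variable contributes a slope, whereas a factor contributes one parameter per level (or one fewer when an intercept is present).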

A common misconception is that linear models involve a straight-line relationship between the response variable and the explanatory variables. This is not the case, as you can see from these two linear models:

    windows(7,4)
    par(mfrow=c(1,2))
    x <- seq(0,10,0.1)
    plot(x,1+x-x^2/15,type="l",col="red")
    plot(x,3+0.1*exp(x),type="l",col="red")

[Figure: two panels plotted against x. Left: 1 + x – x^2/15. Right: 3 + 0.1 * exp(x).]

The definition of a linear model is an equation that contains mathematical variables, parameters and random variables and that is linear in the parameters and in the random variables. What this means is that if a, b and c are parameters then obviously

$$y = a + bx$$

is a linear model, but so is

$$y = a + bx - cx^2$$

because $x^2$ can be replaced by z which gives a linear relationship

$$y = a + bx - cz,$$

and so is

$$y = a + be^x$$

because we can create a new variable $z = \exp(x)$, so that

$$y = a + bz.$$

Some models are non-linear but can be readily linearized by transformation. For example,

$$y = \exp(a + bx)$$

is non-linear, but on taking logs of both sides, it becomes

$$\ln(y) = a + bx.$$

If the equation you want to fit is more complicated than this, then you need to specify the form of the equation, and use non-linear methods (nls or nlme) to fit the model to the data (see p. 715).

9.11 Box–Cox transformations

Sometimes it is not clear from theory what the optimal transformation of the response variable should be. In these circumstances, the Box–Cox transformation offers a simple empirical solution. The idea is to find the power transformation, λ (lambda), that maximizes the likelihood when a specified set of explanatory variables is fitted to

$$\frac{y^{\lambda} - 1}{\lambda}$$

as the response. The value of lambda can be positive or negative, but it cannot be zero (you would get a zero-divide error when the formula was applied to the response variable, y). For the case λ = 0 the Box–Cox transformation is defined as log(y). Suppose that λ = –1. The formula now becomes

$$\frac{y^{-1} - 1}{-1} = \frac{1/y - 1}{-1} = 1 - \frac{1}{y},$$

and this quantity is regressed against the explanatory variables and the log-likelihood computed.

In this example, we want to find the optimal transformation of the response variable, which is timber volume:

    data <- read.delim("c:\\temp\\timber.txt")
    attach(data)
    names(data)
    [1] "volume" "girth" "height"

We start by loading the MASS library of Venables and Ripley:

    library(MASS)

The boxcox function is very easy to use: just specify the model formula, and the default options take care of everything else.

    windows(7,7)
    boxcox(volume~log(girth)+log(height))
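The transformation itself is easy to write down as a small function, which makes the two special cases discussed above concrete. This is a sketch only: box.cox is a hypothetical helper name, not a function from MASS, and the y values are made up for illustration.

```r
# Box-Cox power transformation of a positive response y.
# box.cox is a hypothetical helper name, not a function from MASS.
box.cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- c(0.5, 1, 2, 4)

box.cox(y, -1)     # equals 1 - 1/y
1 - 1 / y

box.cox(y, 0)      # the log transformation
box.cox(y, 1e-6)   # very close to log(y): the lambda = 0 case is the limit
```

The last two lines show why log(y) is the natural definition at λ = 0: the power form approaches it as λ shrinks towards zero.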

[Figure: profile log-likelihood plotted against λ from –2 to 2, with the 95% confidence interval marked.]

It is clear that the optimal value of lambda is close to zero (i.e. the log transformation). We can zoom in to get a more accurate estimate by specifying our own, non-default, range of lambda values. It looks as if it would be sensible to plot from –0.5 to 0.5:

    boxcox(volume~log(girth)+log(height),lambda=seq(-0.5,0.5,0.01))

[Figure: profile log-likelihood plotted against λ from –0.5 to 0.5, with the 95% confidence interval marked.]

The likelihood is maximized at λ ≈ −0.08, but the log-likelihood for λ = 0 is very close to the maximum. This also gives a much more straightforward interpretation, so we would go with that, and model log(volume) as a function of log(girth) and log(height) (see p. 262).

What if we had not log-transformed the explanatory variables? What would have been the optimal transformation of volume in that case? To find out, we rerun the boxcox function, simply changing the model formula like this:

    boxcox(volume~girth+height)

We can zoom in from 0.1 to 0.6 like this:

    boxcox(volume~girth+height,lambda=seq(0.1,0.6,0.01))

[Figure: profile log-likelihood plotted against λ from 0.1 to 0.6, with the 95% confidence interval marked.]

This suggests that the cube root transformation would be best (λ = 1/3). Again, this accords with dimensional arguments, since the response and explanatory variables would all have dimensions of length in this case.

9.12 Model criticism

There is a temptation to become personally attached to a particular model. Statisticians call this 'falling in love with your model'. It is as well to remember the following truths about models:

- All models are wrong.
- Some models are better than others.
- The correct model can never be known with certainty.
- The simpler the model, the better it is.

There are several ways that we can improve things if it turns out that our present model is inadequate:

- Transform the response variable.
- Transform one or more of the explanatory variables.
- Try fitting different explanatory variables if you have any.
- Use a different error structure.
- Use non-parametric smoothers instead of parametric functions.
- Use different weights for different y values.

All of these are investigated in the coming chapters. In essence, you need a set of tools to establish whether, and how, your model is inadequate. For example, the model might:

- predict some of the y values poorly;
- show non-constant variance;
- show non-normal errors;
- be strongly influenced by a small number of influential data points;
- show some sort of systematic pattern in the residuals;
- exhibit overdispersion.

9.13 Model checking

After fitting a model to data we need to investigate how well the model describes the data. In particular, we should look to see if there are any systematic trends in the goodness of fit. For example, does the goodness of fit increase with the observation number, or is it a function of one or more of the explanatory variables? We can work with the raw residuals:

$$\text{residuals} = y - \text{fitted values}.$$

For instance, we should routinely plot the residuals against:

- the fitted values (to look for heteroscedasticity);
- the explanatory variables (to look for evidence of curvature);
- the sequence of data collection (to look for temporal correlation);
- standard normal deviates (to look for non-normality of errors).

9.13.1 Heteroscedasticity

A good model must also account for the variance–mean relationship adequately and produce additive effects on the appropriate scale (as defined by the link function). A plot of standardized residuals against fitted values

should look like the sky at night (points scattered at random over the whole plotting region), with no trend in the size or degree of scatter of the residuals. A common problem is that the variance increases with the mean, so that we obtain an expanding, fan-shaped pattern of residuals (right-hand panel):

[Figure: two panels of residuals plotted against fitted values.]

The plot on the left is what we want to see: no trend in the residuals with the fitted values. The plot on the right is a problem. There is a clear pattern of increasing residuals as the fitted values get larger. This is a picture of what heteroscedasticity looks like.

9.13.2 Non-normality of errors

Errors may be non-normal for several reasons. They may be skew, with long tails to the left or right. Or they may be kurtotic, with a flatter or more pointy top to their distribution. In any case, the theory is based on the assumption of normal errors, and if the errors are not normally distributed, then we shall not know how this affects our interpretation of the data or the inferences we make from it.

It takes considerable experience to interpret normal error plots. Here we generate a series of data sets where we introduce different but known kinds of non-normal errors. Then we plot them using a simple home-made function called mcheck (first developed by John Nelder in the original GLIM language; the name stands for 'model checking'). The idea is to see what patterns are generated in normal plots by the different kinds of non-normality. In real applications we would use the generic plot(model) rather than mcheck (see below). First, we write the function mcheck. The idea is to produce two plots, side by side: a plot of the residuals against the fitted values on the left, and a plot of the ordered residuals against the quantiles of the normal distribution on the right.

    mcheck <- function (obj,...) {
        rs <- obj$resid
        fv <- obj$fitted
        windows(7,4)
        par(mfrow=c(1,2))
        plot(fv,rs,xlab="Fitted values",ylab="Residuals",pch=16,col="red")
        abline(h=0, lty=2)
        qqnorm(rs,xlab="Normal scores",ylab="Ordered residuals",main="",pch=16)
        qqline(rs,lty=2,col="green")
        par(mfrow=c(1,1))
        invisible(NULL)
    }
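Because windows() is only available on the Windows build of R, a platform-neutral variant may be convenient on other systems. The following is a sketch under that assumption, not the author's function: mcheck2 is a hypothetical name, it draws on whatever graphics device is current, and it uses on.exit to restore the previous graphics settings even if plotting fails.

```r
# A hedged variant of mcheck: mcheck2 is a hypothetical name.
# It draws on the current graphics device instead of calling windows(),
# and restores the previous par() settings on exit.
mcheck2 <- function(obj, ...) {
  rs <- residuals(obj)
  fv <- fitted(obj)
  old.par <- par(mfrow = c(1, 2))
  on.exit(par(old.par))
  plot(fv, rs, xlab = "Fitted values", ylab = "Residuals", pch = 16, col = "red")
  abline(h = 0, lty = 2)
  qqnorm(rs, xlab = "Normal scores", ylab = "Ordered residuals", main = "", pch = 16)
  qqline(rs, lty = 2, col = "green")
  invisible(NULL)
}

# Usage with simulated data
set.seed(1)
x <- 0:30
y <- 10 + x + rnorm(31, sd = 5)
mcheck2(lm(y ~ x))
```

The extractor functions residuals and fitted are used here in place of component selection, so the same code should also work for model objects that store these quantities under different names.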

Note the use of \$ (component selection) to extract the residuals and fitted values from the model object which is passed to the function as obj (the expression x\$name is the name component of x). The functions qqnorm and qqline are built-in functions to produce normal probability plots. It is good programming practice to set the graphics parameters back to their default settings before leaving the function.

The aim is to create a catalogue of some of the commonest problems that arise in model checking. We need a vector of x values for the following regression models:

    x <- 0:30

Now we manufacture the response variables according to the equation

$$y = 10 + x + \varepsilon$$

where the errors, ε, have zero mean but are taken from different probability distributions in each case.

Normal errors

    e <- rnorm(31,mean=0,sd=5)
    yn <- 10+x+e
    mn <- lm(yn~x)
    mcheck(mn)

[Figure: residuals against fitted values (left) and ordered residuals against normal scores (right) for normal errors.]

There is no suggestion of non-constant variance (left plot) and the normal plot (right) is reasonably straight. The judgement as to what constitutes an important departure from normality takes experience, and this is the reason for looking at some distinctly non-normal, but known, error structures next.

Uniform errors

    eu <- 20*(runif(31)-0.5)
    yu <- 10+x+eu
    mu <- lm(yu~x)
    mcheck(mu)

[Figure: residuals against fitted values (left) and ordered residuals against normal scores (right) for uniform errors.]

Uniform errors show up as a distinctly S-shaped pattern in the quantile–quantile plot on the right. The fit in the centre is fine, but the largest and smallest residuals are too small (they are constrained in this example to be ±10).

Negative binomial errors

    enb <- rnbinom(31,2,.3)
    ynb <- 10+x+enb
    mnb <- lm(ynb~x)
    mcheck(mnb)

[Figure: residuals against fitted values (left) and ordered residuals against normal scores (right) for negative binomial errors.]

The large negative residuals are all above the line, but the most obvious feature of the plot is the single, very large positive residual (in the top right-hand corner). In general, negative binomial errors will produce a J-shape on the quantile–quantile plot. The biggest positive residuals are much too large to have come from a normal distribution. These values may turn out to be highly influential (see below).

Gamma errors and increasing variance

Here the shape parameter is set to 1 and the rate parameter to 1/x, and the variance increases with the square of the mean:

    eg <- rgamma(31,1,1/x)
    yg <- 10+x+eg

    mg <- lm(yg~x)
    mcheck(mg)

[Figure: residuals against fitted values (left) and ordered residuals against normal scores (right) for gamma errors.]

The left-hand plot shows the residuals increasing steeply with the fitted values, and illustrates an asymmetry between the size of the positive and negative residuals. The right-hand plot shows the highly non-normal distribution of errors.

9.14 Influence

One of the commonest reasons for a lack of fit is through the existence of outliers in the data. It is important to understand, however, that a point may appear to be an outlier because of misspecification of the model, and not because there is anything wrong with the data.

It is important to understand that analysis of residuals is a very poor way of looking for influence. Precisely because a point is highly influential, it forces the regression line close to it, and hence the influential point may have a very small residual. Take this circle of data that shows absolutely no relationship between y and x:

    x <- c(2,3,3,3,4)
    y <- c(2,3,2,1,2)

We want to draw two graphs side by side, and we want them to have the same axis scales:

    windows(7,4)
    par(mfrow=c(1,2))
    plot(x,y,xlim=c(0,8),ylim=c(0,8))

Obviously, there is no relationship between y and x in the original data. But let us add an outlier at the point (7, 6) using concatenation c and see what happens:

    x1 <- c(x,7)
    y1 <- c(y,6)
    plot(x1,y1,xlim=c(0,8),ylim=c(0,8))
    abline(lm(y1~x1),col="blue")

[Figure: y against x (left) and y1 against x1 with the fitted regression line (right), both on axes from 0 to 8.]

Now, there is a significant regression of y on x. The outlier is said to be highly influential. This makes our write-up much more complicated. We need to own up and show that the entire edifice depends upon the single point at (7, 6). This requires an explanation of two models rather than one. We cannot pretend that the point (7, 6) does not exist (that would be a scientific scandal), but we must describe just how influential it is.

Testing for the presence of influential points is an important part of statistical modelling. You cannot rely on analysis of the residuals, because by their very influence, these points force the regression line close to them:

    reg <- lm(y1~x1)
    summary(reg)

    Call:
    lm(formula = y1 ~ x1)

    Residuals:
           1        2        3        4        5        6
     0.78261  0.91304 -0.08696 -1.08696 -0.95652  0.43478

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -0.5217     0.9876  -0.528   0.6253
    x1            0.8696     0.2469   3.522   0.0244 *

As you can see, the influential point (no. 6) has the second smallest residual (0.434 78). Instead, we look at the most extreme values of the explanatory variable, both to the left (extreme low values) and the right (extreme high values, as with point no. 6), as judged by $(x - \bar{x})^2$:

    influence.measures(reg)

    Influence measures of
             lm(formula = y1 ~ x1) :
       dfb.1   dfb.x1   dffit  cov.r  cook.d   hat inf
    1  0.687  -0.5287  0.7326  1.529 0.26791 0.348
    2  0.382  -0.2036  0.5290  1.155 0.13485 0.196
    3 -0.031   0.0165 -0.0429  2.199 0.00122 0.196
    4 -0.496   0.2645 -0.6871  0.815 0.19111 0.196
    5 -0.105  -0.1052 -0.5156  1.066 0.12472 0.174
    6 -3.023   4.1703  4.6251  4.679 7.62791 0.891   *

You can see that point no. 6 is highlighted by an asterisk, drawing attention to its high influence. To extract the subscripts of the influential points, use the is.inf attribute like this:

    influence.measures(reg)$is.inf

      dfb.1_ dfb.x1 dffit cov.r cook.d   hat
    1  FALSE  FALSE FALSE FALSE  FALSE FALSE
    2  FALSE  FALSE FALSE FALSE  FALSE FALSE
    3  FALSE  FALSE FALSE FALSE  FALSE FALSE
    4  FALSE  FALSE FALSE FALSE  FALSE FALSE
    5  FALSE  FALSE FALSE FALSE  FALSE FALSE
    6   TRUE   TRUE  TRUE  TRUE   TRUE FALSE

As you see, all of the influence measures (with the exception of hat) pick out point no. 6. For more detail, use lm.influence(reg):

    lm.influence(reg)

    $hat
            1         2         3         4         5         6
    0.3478261 0.1956522 0.1956522 0.1956522 0.1739130 0.8913043

    $coefficients
      (Intercept)           x1
    1  0.67826087 -0.130434783
    2  0.37015276 -0.049353702
    3 -0.03525264  0.004700353
    4 -0.44065805  0.058754407
    5 -0.10068650 -0.025171625
    6 -2.52173913  0.869565217

    $sigma
            1         2         3         4         5         6
    0.9660918 0.9491580 1.1150082 0.8699177 0.9365858 0.8164966

    $wt.res
              1           2           3           4           5           6
     0.78260870  0.91304348 -0.08695652 -1.08695652 -0.95652174  0.43478261

The first component, \$hat, is a vector containing the diagonal of the hat matrix. This is the orthogonal projector matrix onto the model space. Large values of elements of this vector mean that changing $y_i$ will have a big impact on the fitted values (i.e. the hat diagonals are measures of the leverage of $y_i$).

Most interesting in the present context are the coefficients affecting the two parameters of the model (intercept and slope). The rows contain the change in the estimated coefficients which results when the i th case is dropped from the regression. Data in row 6 have much the biggest effect on both slope and intercept.
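The individual measures can also be pulled out one at a time with the extractor functions hatvalues and cooks.distance from the stats package. The sketch below re-enters the six data points from the text so that it can be run on its own, and refits the model without point no. 6 to show what the row 6 coefficient changes mean.

```r
# The six data points from the text
x1 <- c(2, 3, 3, 3, 4, 7)
y1 <- c(2, 3, 2, 1, 2, 6)
reg <- lm(y1 ~ x1)

# Leverage and Cook's distance for every point
round(hatvalues(reg), 3)
round(cooks.distance(reg), 3)

# Effect of dropping point no. 6 on the two parameters
coef(reg)
coef(lm(y1[-6] ~ x1[-6]))   # the slope collapses to zero
```

The difference between the two sets of coefficients is what appears in row 6 of the coefficients component of lm.influence.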
The third component, \$sigma, is a vector whose i th element contains the estimate of the residual standard error obtained when the i th case is dropped from the regression; thus 0.816 496 6 is the residual standard error when point no. 6 is dropped, lm(y1[-6]~x1[-6]), and the error variance in this case is $0.816\,496\,6^2 = 0.666\,666$, as you can see below:

    summary.aov(lm(y1[-6]~x1[-6]))

                Df Sum Sq Mean Sq F value Pr(>F)
    x1[-6]       1      0  0.0000       0      1
    Residuals    3      2  0.6667

Finally, \$wt.res is a vector of weighted residuals (or deviance residuals in a generalized linear model) or raw residuals if weights are not set (as in this example).

9.15 Summary of statistical models in R

Models are fitted using one of the following model-fitting functions:

- lm fits a linear model with normal errors and constant variance; generally this is used for regression analysis using continuous explanatory variables.
- aov fits analysis of variance with normal errors, constant variance and the identity link; generally used for categorical explanatory variables or ANCOVA with a mix of categorical and continuous explanatory variables.
- glm fits generalized linear models to data using categorical or continuous explanatory variables, by specifying one of a family of error structures (e.g. Poisson for count data or binomial for proportion data) and a particular link function.
- gam fits generalized additive models to data with one of a family of error structures (e.g. Poisson for count data or binomial for proportion data) in which the continuous explanatory variables can (optionally) be fitted as arbitrary smoothed functions using non-parametric smoothers rather than specific parametric functions.
- lme and lmer fit linear mixed-effects models with specified mixtures of fixed effects and random effects and allow for the specification of correlation structure among the explanatory variables and autocorrelation of the response variable (e.g. time series effects with repeated measures). lmer allows for non-normal errors and non-constant variance with the same error families as a GLM.
- nls fits a non-linear regression model via least squares, estimating the parameters of a specified non-linear function.
- nlme fits a specified non-linear function in a mixed-effects model where the parameters of the non-linear function are assumed to be random effects; it allows for the specification of correlation structure among the explanatory variables and autocorrelation of the response variable (e.g. time series effects with repeated measures).
- loess fits a local regression model with one or more continuous explanatory variables using non-parametric techniques to produce a smoothed model surface.
- tree and rpart fit a regression tree model using binary recursive partitioning whereby the data are successively split along coordinate axes of the explanatory variables so that at any node the split is chosen that maximally distinguishes the response variable in the left and right branches. With a categorical response variable, the tree is called a classification tree, and the model used for classification assumes that the response variable follows a multinomial distribution.

For most of these models, a range of generic functions can be used to obtain information about the model. The most important and most frequently used are as follows:

- summary produces parameter estimates and standard errors from lm, and ANOVA tables from aov; this will often determine your choice between lm and aov. For either lm or aov you can choose summary.aov or summary.lm to get the alternative form of output (an ANOVA table or a table of parameter estimates and standard errors; see p. 517).
- plot produces diagnostic plots for model checking, including residuals against fitted values, normality checks, influence tests, etc.
- anova is a wonderfully useful function for comparing different models and producing ANOVA tables.
- update is used to modify the last model fit; it saves both typing effort and computing time.

Other useful generic functions include the following:

- coef gives the coefficients (estimated parameters) from the model.
- fitted gives the fitted values, predicted by the model for the values of the explanatory variables included.
- resid gives the residuals (the differences between measured and predicted values of y).
- predict uses information from the fitted model to produce smooth functions for plotting a line through the scatterplot of your data. Make sure you provide a list or a dataframe containing all of the necessary information on each of the explanatory variables in your model to enable the prediction to be made.

9.16 Optional arguments in model-fitting functions

Unless you argue to the contrary, all of the rows in the dataframe will be used in the model fitting, there will be no offsets, and all values of the response variable will be given equal weight. Variables named in the model formula will come from the defined dataframe (data=mydata), the with function (p. 113) or from the attached dataframe (if there is one).
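The extractor functions listed above, and the data argument just mentioned, can be tried out on a small simulated dataframe. This is only a sketch: the dataframe mydata, its columns and the values in new.x are hypothetical.

```r
# Simulated dataframe (hypothetical names and values)
set.seed(2)
mydata <- data.frame(x = seq(0, 10, length.out = 25))
mydata$y <- 3 + 1.5 * mydata$x + rnorm(25)

# data= tells lm where to find the variables named in the formula
model <- lm(y ~ x, data = mydata)

coef(model)            # estimated intercept and slope
head(fitted(model))    # fitted values for the observed x
head(resid(model))     # observed y minus fitted values

# predict needs a dataframe whose column names match the explanatory variables
new.x <- data.frame(x = c(2.5, 5, 7.5))
predict(model, newdata = new.x)
```

Supplying data= avoids the need to attach the dataframe, and the column name in new.x must be exactly x for predict to find it.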
Here we illustrate the following options:

- subset
- weights
- data
- offset
- na.action

We shall work with an example involving analysis of covariance (see p. 538 for details) where we have a mix of both continuous and categorical explanatory variables:

    data <- read.table("c:\\temp\\ipomopsis.txt",header=T)
    attach(data)
    names(data)
    [1] "Root" "Fruit" "Grazing"

The response is seed production (Fruit) with a continuous explanatory variable (Root, root diameter) and a two-level factor (Grazing, with levels Grazed and Ungrazed).

9.16.1 Subsets

Perhaps the most commonly used modelling option is to fit the model to a subset of the data (e.g. fit the model to data from just the grazed plants). You could do this using subscripts on the response variable and all the explanatory variables:

    model <- lm(Fruit[Grazing=="Grazed"]~Root[Grazing=="Grazed"])

but it is much more straightforward to use the subset argument, especially when there are lots of explanatory variables:

    model <- lm(Fruit~Root,subset=(Grazing=="Grazed"))

The answer, of course, is the same in both cases, but the summary.lm and summary.aov tables are neater with subset. Note the round brackets used with the subset option (not the square brackets used with subscripts in the first example).

9.16.2 Weights

The default is for all the values of the response to have equal weights (all equal to 1)

    weights = rep(1, n.observations)

Where data points are to be weighted unequally, the classical approach is to weight each value by the inverse of the variance of the distribution from which that point is drawn. This downplays the influence of highly variable data. Instead of using initial root size as a covariate (as above) you could use Root as a weight in fitting a model with Grazing as the sole categorical explanatory variable:

    model <- lm(Fruit~Grazing,weights=Root)
    summary(model)

    Call:
    lm(formula = Fruit~Grazing, weights = Root)

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)       70.725      4.849   14.59   <2e-16 ***
    GrazingUngrazed  -16.953      7.469   -2.27    0.029 *

    Residual standard error: 62.51 on 38 degrees of freedom
    Multiple R-Squared: 0.1194, Adjusted R-squared: 0.0962
    F-statistic: 5.151 on 1 and 38 DF, p-value: 0.02899

When weights (w) are specified the model is fitted using weighted least squares, in which the quantity to be minimized is $\sum w \times d^2$ (rather than $\sum d^2$), where d is the difference between the response variable and the fitted values predicted by the model.
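The weighted least squares criterion can be checked directly by solving the weighted normal equations by hand and comparing the answer with lm. This sketch uses simulated data with hypothetical weights, not the ipomopsis data.

```r
# Simulated data with known, unequal weights (hypothetical values)
set.seed(3)
n <- 30
x <- runif(n, 1, 10)
w <- runif(n, 0.5, 2)
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))

fit <- lm(y ~ x, weights = w)

# Weighted normal equations: (X'WX) b = X'Wy
X <- cbind(1, x)
b <- solve(t(X) %*% (w * X), t(X) %*% (w * y))

cbind(lm = coef(fit), by.hand = as.vector(b))

# Weighted residual sum of squares at the fitted coefficients
sum(w * resid(fit)^2)
```

The two columns of coefficients agree, and any other choice of intercept and slope gives a larger weighted sum of squares.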
Needless to say, the use of weights alters the parameter estimates and their standard errors:

    model <- lm(Fruit~Grazing)
    summary(model)

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)       67.941      5.236  12.976 1.54e-15 ***
    GrazingUngrazed  -17.060      7.404  -2.304   0.0268 *

    Residual standard error: 23.41 on 38 degrees of freedom
    Multiple R-Squared: 0.1226, Adjusted R-squared: 0.09949
    F-statistic: 5.309 on 1 and 38 DF, p-value: 0.02678

Fitting root size as a statistical weight is scientifically wrong in this case: why should values from larger plants be given greater influence? Also, this analysis gives entirely the wrong interpretation of the data (ungrazed plants come out as being less fecund than the grazed plants). Analysis of covariance reverses this interpretation, showing that for a given root size, the grazed plants produced 36.013 fewer fruits than the ungrazed plants; the problem was that the big plants were almost all in the grazed treatment (see p. 538).

9.16.3 Missing values

What to do about missing values in the dataframe is an important issue (p. 172). Ideally, of course, there are no missing values, so you do not need to worry about what action to take (na.action). If there are missing values, you have two choices:

- leave out any row of the dataframe in which one or more variables are missing, then na.action = na.omit; or
- fail the fitting process, so na.action = na.fail.

If in doubt, you should specify na.action = na.fail because you will not get nasty surprises if unsuspected NAs in the dataframe cause strange (but unwarned) behaviour in the model. Let us introduce a missing value into the initial root weights:

    Root[37] <- NA
    model <- lm(Fruit~Grazing*Root)

The model is fitted without comment, and the only thing you might notice is that the residual degrees of freedom is reduced from 36 to 35.
If you want to be warned about missing values, then use the na.action option:

    model <- lm(Fruit~Grazing*Root,na.action=na.fail)
    Error in na.fail.default(list(Fruit = c(59.77, 60.98, 14.73, 19.28, 34.25, :
      missing values in object

If you are carrying out regression with time series data that include missing values then you should use na.action = NULL so that residuals and fitted values are time series as well (if the missing values were omitted, then the resulting vector would not be a time series of the correct length).
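The two choices can be compared side by side. This is a minimal sketch on simulated data (x, y and the dataframe d are hypothetical); it shows the silent loss of one residual degree of freedom under na.omit and captures the error raised under na.fail rather than letting it stop the script.

```r
# Simulated data with one missing explanatory value (hypothetical example)
set.seed(2)
x <- as.numeric(1:20)
y <- 5 + 0.5 * x + rnorm(20)
x[7] <- NA
d <- data.frame(x = x, y = y)

# na.omit: the incomplete row is dropped without comment
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
df.residual(fit_omit)       # 17: 19 complete rows minus 2 parameters
length(resid(fit_omit))     # 19, not 20

# na.fail: the fit stops with an error, captured here as a message
msg <- tryCatch(
  lm(y ~ x, data = d, na.action = na.fail),
  error = function(e) conditionMessage(e)
)
msg
```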

9.16.4 Offsets

You would not use offsets with a linear model (you could simply subtract the offset from the value of the response variable, and work with the transformed values). But with generalized linear models you may want to specify part of the variation in the response using an offset (see p. 566 for details and examples).

9.16.5 Dataframes containing the same variable names

If you have several different dataframes containing the same variable names (say, x and y) then the simplest way to ensure that the correct variables are used in the modelling is to name the dataframe in the function call:

    model <- lm(y~x,data=correct.frame)

The alternative is much more cumbersome to type:

    model <- lm(correct.frame\$y~correct.frame\$x)

9.17 Akaike's information criterion

Akaike's information criterion (AIC) is known in the statistics trade as a penalized log-likelihood. If you have a model for which a log-likelihood value can be obtained (see Section 7.3.3), then

$$\text{AIC} = -2 \times \text{log-likelihood} + 2(p + 1),$$

where $p$ is the number of parameters in the model, and 1 is added for the estimated variance (you could call this another parameter if you wanted to). To demystify AIC let us calculate it by hand. These data show the relationship between growth and dietary tannin for caterpillars in a feeding experiment:

    data <- read.table("c:\\temp\\regression.txt",header=T)
    attach(data)
    names(data)
    [1] "growth" "tannin"

The regression model for these data is worked out, one term at a time, by hand in Chapter 10.

    model <- lm(growth~tannin)

To calculate the log-likelihood we need three quantities (p. 282): the sample size, n; the error variance, s2 = $\sigma^2$; and the sum of the squares of the residuals, sse = $\sum (y - \mu)^2$:

    n <- length(growth)
    sse <- sum((growth-fitted(model))^2)
    s2 <- sse/(n-2)
    s <- sqrt(s2)

Now we can compute the log-likelihood:

    -(n/2)*log(2*pi)-n*log(s)-sse/(2*s2)
    [1] -16.51087
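R's built-in logLik and AIC functions give a check on this arithmetic. The sketch below is written under two assumptions: the growth values are reconstructed from the fitted values and residuals printed in Section 9.21.1, and tannin = 0:8 is inferred from the evenly spaced fitted values. It computes the log-likelihood twice, once with the variance estimate sse/(n - 2) used above and once with the maximum likelihood estimate sse/n, which is the one that logLik uses.

```r
# Data reconstructed from printed output (an assumption, not read from file)
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
model <- lm(growth ~ tannin)

n   <- length(growth)
sse <- sum(resid(model)^2)

# Log-likelihood with the unbiased variance estimate sse/(n - 2)
s2 <- sse / (n - 2)
ll_unbiased <- -(n / 2) * log(2 * pi) - (n / 2) * log(s2) - sse / (2 * s2)

# Log-likelihood with the maximum likelihood estimate sse/n
s2_ml <- sse / n
ll_ml <- -(n / 2) * log(2 * pi) - (n / 2) * log(s2_ml) - sse / (2 * s2_ml)

ll_unbiased                  # about -16.51
ll_ml                        # about -16.38
as.numeric(logLik(model))    # agrees with ll_ml

# AIC = -2 * log-likelihood + 2 * (p + 1), with p = 2 here
-2 * ll_ml + 2 * (2 + 1)
AIC(model)
```

The two log-likelihoods differ only in the divisor used for the error variance.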

There is an R function logLik to calculate the log-likelihood from any appropriate model object directly:

    logLik(model)
    'log Lik.' -16.37995 (df=3)

The three degrees of freedom (df) refer to the slope, the intercept and the variance. The difference between the two estimates arises because logLik uses the maximum likelihood estimate of the error variance, sse/n, rather than the estimate sse/(n-2) used in the calculation by hand. Now we can compute AIC:

    -2 * -16.37995 + 6
    [1] 38.7599

Again, not surprisingly, there is an R function called AIC to compute the information criterion directly from the model object:

    AIC(model)
    [1] 38.7599

9.17.1 AIC as a measure of the fit of a model

The more parameters there are in the model, the better the fit. You could obtain a perfect fit if you had a separate parameter for every data point, but this model would have absolutely no explanatory power. There is always going to be a trade-off between the goodness of fit and the number of parameters required by parsimony. AIC is useful because it explicitly penalizes any superfluous parameters in the model, by adding $2(p + 1)$ to the deviance.

When comparing two models, the smaller the AIC, the better the fit. This is the basis of automated model simplification using step. You can use the function AIC to compare two models, in exactly the same way as you can use anova (as explained on p. 415). Here we develop an analysis of covariance that is introduced on p. 538.

    model.1 <- lm(Fruit~Grazing*Root)
    model.2 <- lm(Fruit~Grazing+Root)
    AIC(model.1, model.2)
            df      AIC
    model.1  5 263.6269
    model.2  4 261.7835

Because model.2 has the lower AIC, we prefer it to model.1. The log-likelihood was penalized by 2 × (4 + 1) = 10 in model.1 because that model contained 4 parameters (2 slopes and 2 intercepts) and by 2 × (3 + 1) = 8 in model.2 because that model had 3 parameters (two intercepts and a common slope). You can see where the two values of AIC come from by calculation:

    -2*logLik(model.1)+2*(4+1)
    [1] 263.6269
    -2*logLik(model.2)+2*(3+1)
    [1] 261.7835

If you want to compare many models, you can combine the models into a list,

    models <- list(model1, model2, model3, model4, model5, model6)

then extract the AIC of each of them using lapply like this:

    aic <- unlist(lapply(models, AIC))

where aic will be a vector of numbers in which you can search for the minimum.

9.18 Leverage

Points increase in influence to the extent that they lie on their own, a long way from the mean value of $x$ (to either the left or right). To account for this, measures of leverage for a given data point $y$ are proportional to $(x - \bar{x})^2$. Here are the x data from our earlier example:

    x <- c(2,3,3,3,4,7)

The commonest measure of leverage is

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2},$$

where the denominator is SSX. A good rule of thumb is that a point is highly influential if its

$$h_i > \frac{2p}{n},$$

where $p$ is the number of parameters in the model. We could easily calculate the leverage value of each point in our vector. It is more efficient, perhaps, to write a general function that could carry out the calculation of the $h$ values for any vector of $x$ values,

    leverage <- function(x) {
      1/length(x)+(x-mean(x))^2/sum((x-mean(x))^2)
    }

and then use this function with our vector of x values to produce a leverage plot:

    plot(leverage(x),type="h",ylim=c(0,1),col="blue")
    abline(h=4/6,lty=2,col="green")

[Figure: leverage(x) plotted against Index as vertical lines, with a horizontal dashed line at 4/6.]
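The leverage function above can be cross-checked against R's built-in hatvalues. This is a minimal sketch; hatvalues needs a fitted model, so the response y used here is an arbitrary simulated vector (an assumption for illustration only), which is harmless because leverage depends only on the x values.

```r
x <- c(2, 3, 3, 3, 4, 7)

leverage <- function(x) {
  1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
}

# Arbitrary (hypothetical) response: leverage does not depend on y
set.seed(3)
y <- rnorm(length(x))
h_builtin <- hatvalues(lm(y ~ x))

all.equal(unname(h_builtin), leverage(x))   # TRUE
which(leverage(x) > 2 * 2 / length(x))      # 6: only the sixth point exceeds 2p/n
sum(leverage(x))                            # 2, the number of parameters
```

The last line illustrates a useful property: the leverages always sum to the number of parameters in the model.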

As you can see, only the sixth point shows more leverage than is reasonable (the horizontal green dashed line shows $2p/n = 4/6$ in this example). For built-in functions for checking influence, see p. 463.

9.19 Misspecified model

The model may have the wrong terms in it, or the terms may be included in the model in the wrong way. We deal with the selection of terms for inclusion in the minimal adequate model in Chapter 10. Here we simply note that transformation of the explanatory variables often produces improvements in model performance. The most frequently used transformations are logs, powers and reciprocals. When both the error distribution and functional form of the relationship are unknown, there is no single specific rationale for choosing any given transformation in preference to another. The aim is pragmatic, namely to find a transformation that gives:

- constant error variance;
- approximately normal errors;
- additivity;
- a linear relationship between the response variables and the explanatory variables;
- straightforward scientific interpretation.

The choice is bound to be a compromise and, as such, is best resolved by quantitative comparison of the deviance produced under different model forms. Again, in testing for non-linearity in the relationship between $y$ and $x$ we might add a term in $x^2$ to the model; a significant parameter in the $x^2$ term indicates curvilinearity in the relationship between $y$ and $x$.

A further element of misspecification can occur because of structural non-linearity. Suppose, for example, that we were fitting a model of the form

$$y = a + \frac{b}{x},$$

but the underlying process was really of the form

$$y = a + \frac{b}{c + x};$$

then the fit is going to be poor. Of course if we knew that the model structure was of this form, then we could fit it as a non-linear model (p. 715) or as a non-linear mixed-effects model (p. 722), but in practice this is seldom the case.
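The test for curvature by adding a term in $x^2$ can be sketched as follows. The data are simulated (hypothetical) from a curved relationship of the second structural form above, and the object names linear, quadratic and p_quad are illustrative choices.

```r
# Simulated curved relationship (hypothetical data)
set.seed(4)
x <- seq(1, 10, length.out = 50)
y <- 2 + 10 / (1 + x) + rnorm(50, sd = 0.2)

linear    <- lm(y ~ x)
quadratic <- lm(y ~ x + I(x^2))

# p value of the quadratic term: a small value indicates curvature
p_quad <- summary(quadratic)[["coefficients"]]["I(x^2)", "Pr(>|t|)"]
p_quad

# The same question asked as a comparison of the two models
anova(linear, quadratic)
AIC(linear, quadratic)
```

A significant quadratic term shows that the straight line is inadequate; it does not show that the quadratic is the right structural form.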
9.20 Model checking in R

The data we examine in this section are on the decay of a biodegradable plastic in soil: the response, y, is the mass of plastic remaining and the explanatory variable, x, is duration of burial:

    Decay <- read.table("c:\\temp\\Decay.txt",header=T)
    attach(Decay)

    names(Decay)
    [1] "time" "amount"

For the purposes of illustration we shall fit a linear regression to these data and then use model-checking plots to investigate the adequacy of that model:

    model <- lm(amount~time)

The basic model checking could not be simpler:

    par(mfrow=c(2,2))
    plot(model)

[Figure: the four model-checking plots produced by plot(model): Residuals vs Fitted (top left), Normal Q-Q (top right), Scale-Location (bottom left) and Residuals vs Leverage with Cook's distance contours (bottom right). Points 1, 5 and 30 are labelled.]

This one command produces a series of graphs, spread over four pages (here compressed to a single page by specifying par(mfrow=c(2,2))). The upper two graphs are the most important. First, you get a plot of the residuals against the fitted values (top left) which shows very pronounced curvature; most of the residuals for intermediate fitted values are negative, and the positive residuals are concentrated at the smallest and largest fitted values. Remember, this plot should look like the sky at night, with no pattern of any sort. This suggests systematic inadequacy in the structure of the model. Perhaps the relationship between y and x is non-linear rather than linear as we assumed here? Second (top right), you get a quantile–quantile plot (p. 463) which indicates pronounced non-normality in the residuals (the line should be straight, not banana-shaped as here).

The third graph is like a positive-valued version of the first graph; it is good for detecting non-constancy of variance (heteroscedasticity), which shows up as a triangular scatter (like a wedge of cheese) with an increasing red line through it. The fourth graph shows a pronounced pattern in the standardized residuals as a function of the leverage. The graph also shows Cook's distance, highlighting the identity of particularly influential data points.

Cook's distance is an attempt to combine leverage and residuals in a single measure. The absolute values of the deletion residuals $|r_i^*|$ are weighted as follows:

$$C_i = |r_i^*| \left( \frac{n - p}{p} \cdot \frac{h_i}{1 - h_i} \right)^{1/2}.$$

Data points 1, 5 and 30 are singled out as being influential, with point 1 especially so. When we were happier with other aspects of the model, we would repeat the modelling, leaving out each of these points in turn. Alternatively, we could jackknife the data, which involves leaving every data point out, one at a time, in turn. In any event, this is clearly not a good model for these data. The analysis is completed on p. 469, when we fit an exponential rather than a linear model to the data.

9.21 Extracting information from model objects

We often want to extract material from fitted models (e.g. slopes, residuals or p values) and there are three different ways of doing this:

- by name, e.g. coef(model);
- with list subscripts, e.g. summary(model)[[3]];
- using \$ to name the component, e.g. model\$resid.

The model object we use to demonstrate these techniques is the simple linear regression that was analysed in full by hand on p. 450.

    data <- read.table("c:\\temp\\regression.txt",header=T)
    attach(data)
    names(data)
    [1] "growth" "tannin"
    model <- lm(growth~tannin)
    summary(model)

    Call:
    lm(formula = growth ~ tannin)

    Residuals:
        Min      1Q  Median      3Q     Max
    -2.4556 -0.8889 -0.2389  0.9778  2.8944

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
    tannin       -1.2167     0.2186  -5.565 0.000846 ***

    Residual standard error: 1.693 on 7 degrees of freedom

    Multiple R-squared: 0.8157, Adjusted R-squared: 0.7893
    F-statistic: 30.97 on 1 and 7 DF, p-value: 0.0008461

9.21.1 Extracting information by name

You can extract the coefficients of the model, the fitted values, the residuals, the effect sizes and the variance–covariance matrix by name, as follows:

    coef(model)
    (Intercept)      tannin
      11.755556   -1.216667

gives the parameter estimates ('coefficients') for the intercept ($a$) and slope ($b$);

    fitted(model)
            1         2         3         4         5         6
    11.755556 10.538889  9.322222  8.105556  6.888889  5.672222
            7         8         9
     4.455556  3.238889  2.022222

gives the fitted values ($\hat{y} = a + bx$) of the model (its predictions) for each value of the explanatory variable(s);

    resid(model)
             1          2          3          4          5          6
     0.2444444 -0.5388889 -1.3222222  2.8944444 -0.8888889  1.3277778
             7          8          9
    -2.4555556 -0.2388889  0.9777778

gives the residuals ($y$ minus fitted values) for the nine data points.

For a linear model fitted by lm or aov the effects are the uncorrelated single-degree-of-freedom values obtained by projecting the data onto the successive orthogonal subspaces generated by the QR decomposition during the fitting process. The first $r$ (= 2 in this case; the rank of the model) are associated with coefficients and the remainder span the space of residuals but are not associated with particular residuals. The name effects produces a numeric vector of the same length as residuals of class coef. The first two rows are labelled by the corresponding coefficients (intercept and slope), and the remaining seven rows are unlabelled.

    vcov(model)
                (Intercept)      tannin
    (Intercept)    1.083263 -0.19116402
    tannin        -0.191164  0.04779101

This extracts the variance–covariance matrix of the model's parameters.

9.21.2 Extracting information by list subscripts

The two model summary objects summary.aov(model) and summary.lm(model) are lists with many components. Here each of them is investigated in turn.
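The extractor functions of Section 9.21.1 can be tied together with a few identities. This is a minimal sketch under the assumption that the data are growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3) and tannin = 0:8, values reconstructed from the fitted values and residuals printed above rather than read from the original file.

```r
# Data reconstructed from the printed output (an assumption)
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
model <- lm(growth ~ tannin)

# Fitted values plus residuals recover the response
all.equal(unname(fitted(model) + resid(model)), growth)   # TRUE

# Standard errors are the square roots of the diagonal of vcov(model)
sqrt(diag(vcov(model)))

# effects(): the first two elements carry the coefficient names;
# the squared remaining elements sum to the residual sum of squares
e <- effects(model)
names(e)[1:2]
sum(e[-(1:2)]^2)     # residual sum of squares, about 20.07
unname(e[2])^2       # sum of squares for tannin, about 88.82
```

The last two values are the sums of squares that appear in the ANOVA table for this model.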

Here is summary.aov:

    summary.aov(model)
                Df Sum Sq Mean Sq F value   Pr(>F)
    tannin       1  88.82   88.82   30.97 0.000846 ***
    Residuals    7  20.07    2.87

The columns of the ANOVA table can be extracted one at a time:

    summary.aov(model)[[1]][1]
              Df
    tannin     1
    Residuals  7

    summary.aov(model)[[1]][2]
              Sum Sq
    tannin    88.817
    Residuals 20.072

    summary.aov(model)[[1]][3]
              Mean Sq
    tannin     88.817
    Residuals   2.867

    summary.aov(model)[[1]][4]
              F value
    tannin     30.974
    Residuals

    summary.aov(model)[[1]][5]
                 Pr(>F)
    tannin    0.0008461 ***
    Residuals
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be quite involved to extract the numerical values that you might want to use in subsequent work. For instance, to get the F ratio (30.974) out of the fourth element of the list, we need to unlist the object, then use as.numeric, and then add a further subscript:

    as.numeric(unlist(summary.aov(model)[[1]][4]))[1]
    [1] 30.97398

Here is summary.lm:

    summary(model)

    Call:
    lm(formula = growth ~ tannin)

    Residuals:
        Min      1Q  Median      3Q     Max
    -2.4556 -0.8889 -0.2389  0.9778  2.8944

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
    tannin       -1.2167     0.2186  -5.565 0.000846 ***

    Residual standard error: 1.693 on 7 degrees of freedom
    Multiple R-squared: 0.8157, Adjusted R-squared: 0.7893
    F-statistic: 30.97 on 1 and 7 DF, p-value: 0.0008461

The first element of the list is the model formula (or Call) showing the response variable (growth) and the explanatory variable(s) (tannin):

    summary(model)[[1]]
    lm(formula = growth ~ tannin)

The second describes the attributes of the object called summary(model):

    summary(model)[[2]]
    growth ~ tannin
    attr(,"variables")
    list(growth, tannin)
    attr(,"factors")
           tannin
    growth      0
    tannin      1
    attr(,"term.labels")
    [1] "tannin"
    attr(,"order")
    [1] 1
    attr(,"intercept")
    [1] 1
    attr(,"response")
    [1] 1
    attr(,".Environment")
    attr(,"predvars")
    list(growth, tannin)
    attr(,"dataClasses")
       growth    tannin
    "numeric" "numeric"

The third gives the residuals for the nine data points:

    summary(model)[[3]]

as shown above. The fourth gives the parameter table, including standard errors of the parameters, t values and p values. This is the really important information:

    summary(model)[[4]]
                 Estimate Std. Error   t value     Pr(>|t|)
    (Intercept) 11.755556  1.0407991 11.294740 9.537315e-06
    tannin      -1.216667  0.2186115 -5.565427 8.460738e-04

You will often want to extract information from parts of this table, using extra subscripts

    summary(model)[[4]][1]
    [1] 11.75556
    summary(model)[[4]][2]
    [1] -1.216667
    summary(model)[[4]][3]
    [1] 1.040799
    summary(model)[[4]][4]
    [1] 0.2186115

to extract the individual value of the intercept, slope, standard error of the intercept and standard error of the slope respectively. The p value for the slope is in [[4]][8], for instance:

    summary(model)[[4]][8]
    [1] 0.0008460738

The fifth is a logical vector showing whether each coefficient is aliased (that is, could not be estimated because it is a linear combination of other terms). Neither parameter is aliased here, so both entries are FALSE:

    summary(model)[[5]]
    (Intercept)      tannin
          FALSE       FALSE

The sixth is the residual standard error: the square root of the error variance from the summary.aov table ($s^2$ = 2.867; see above):

    summary(model)[[6]]
    [1] 1.693358

The seventh shows the number of rows in the summary.lm table (showing two parameters to have been estimated from the data with this model), and the residual degrees of freedom (d.f. = 7):

    summary(model)[[7]]
    [1] 2 7 2

The eighth is $r^2 = SSR/SST$, the fraction of the total variation in the response variable that is explained by the model (see p. 456 for details):

    summary(model)[[8]]
    [1] 0.8156633

The ninth is the adjusted $R^2$, explained on p. 461 but seldom used in practice:

    summary(model)[[9]]
    [1] 0.7893294

The tenth gives F ratio information: the three values given here are the F ratio (30.97398), the number of degrees of freedom in the model (i.e. in the numerator, numdf) and the residual degrees of freedom (i.e. in the denominator, dendf):

    summary(model)[[10]]
       value    numdf    dendf
    30.97398  1.00000  7.00000

The eleventh component is the unscaled variance–covariance matrix of the parameter estimates (multiplying it by the error variance, $s^2$ = 2.867, gives the matrix returned by vcov):

    summary(model)[[11]]
                (Intercept)      tannin
    (Intercept)  0.37777778 -0.06666667
    tannin      -0.06666667  0.01666667

9.21.3 Extracting components of the model using \$

Another way to extract model components is to use the \$ symbol. To get the intercept ($a$) and the slope ($b$) of the regression, type

    model\$coef
    (Intercept)      tannin
      11.755556   -1.216667

Finally, the residual degrees of freedom (9 points – 2 estimated parameters = 7 d.f.) are

    model\$df
    [1] 7

9.21.4 Using lists with models

You might want to extract the coefficients from a series of related statistical models, and you want to avoid the use of a loop. One solution is to create a list and then employ lapply to do the work. Here are the data with y as a function of x:

    x <- 0:100
    y <- 17+0.2*x+3*rnorm(101)

Now create three linear models of increasing complexity:

    model0 <- lm(y~1)
    model1 <- lm(y~x)
    model2 <- lm(y~x+I(x^2))

Make a list containing the three model objects:

    models <- list(model0,model1,model2)

To obtain the coefficients from the three models, it is simple to use lapply on the list to apply the function coef to each element of the list:

    lapply(models,coef)
    [[1]]
    (Intercept)
       26.90530

    [[2]]
    (Intercept)           x
     15.8267899   0.2215701

    [[3]]
     (Intercept)            x       I(x^2)
    1.593695e+01 2.148935e-01 6.676673e-05

To get a vector (rather than a list) as output, and to select only the three intercepts, we use subscripts [c(1,2,4)] with unlist and as.vector like this:

    as.vector(unlist(lapply(models,coef)))[c(1,2,4)]
    [1] 26.90530 15.82679 15.93695

This protocol can be useful in model selection. Here we extract the AIC of each model:

    lapply(models,AIC)
    [[1]]
    [1] 672.7502

    [[2]]
    [1] 510.787

    [[3]]
    [1] 512.5231

Other things being equal, we would choose the model with the lowest AIC (the linear regression (model1) has AIC = 510.787).

9.22 The summary tables for continuous and categorical explanatory variables

It is important to understand the difference between summary.lm and summary.aov for the same model. Here is a one-way analysis of variance of the plant competition experiment (p. 511):

    comp <- read.table("c:\\temp\\competition.txt",header=T)
    attach(comp)
    names(comp)
    [1] "biomass" "clipping"

The categorical explanatory variable is clipping and it has five levels as follows:

    levels(clipping)
    [1] "control" "n25" "n50" "r10" "r5"

The analysis of variance model is fitted like this:

    model <- lm(biomass~clipping)

and we can obtain two different summaries of it. Here is summary.aov:

    summary.aov(model)
                Df Sum Sq Mean Sq F value  Pr(>F)
    clipping     4  85356   21339   4.302 0.00875 **
    Residuals   25 124020    4961

showing one row for the treatment and one row for the residuals (the row for the total sum of squares is not printed in R), each row with degrees of freedom, sum of squares, variance (labelled 'Mean Square') and the F ratio, testing the null hypothesis of no significant differences between the treatment means. The only interesting things in summary.aov are the error variance ($s^2$ = 4961) which we use in calculating measures of unreliability, and the F ratio (4.302) showing that there are significant differences amongst the means to be explained. Here is summary.lm:

    summary.lm(model)

    Call:
    lm(formula = biomass ~ clipping)

    Residuals:
         Min       1Q   Median       3Q      Max
    -103.333  -49.667    3.417   43.375  177.667

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   465.17      28.75  16.177  9.4e-15 ***
    clippingn25    88.17      40.66   2.168  0.03987 *
    clippingn50   104.17      40.66   2.562  0.01683 *
    clippingr10   145.50      40.66   3.578  0.00145 **
    clippingr5    145.33      40.66   3.574  0.00147 **

    Residual standard error: 70.43 on 25 degrees of freedom
    Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129
    F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752

The residuals are summarized first by their 'five numbers'. The coefficients table has as many rows as there are parameters in the model (five in this case, one for each factor level mean). The top row, labelled (Intercept), is the only mean value in the table: it is the mean for the factor level that comes first in the alphabet (control in this example). The other four rows are differences between means (each mean compared to the control mean in this example). The second column contains the unreliability estimates. The first row contains the standard error of a mean (28.75). The other four rows contain the standard error of the difference between two means (40.66). The significance stars are highly misleading in this example,

suggesting wrongly that there are four significant contrasts for this model. The problem arises because the default 'Treatment contrasts' in R are not orthogonal. The four lower rows are being compared with the first row. As we shall see later, there is only one significant orthogonal contrast in this experiment (the control versus the other four treatments).

So where do the effect sizes come from? What is 465.17 and what is 88.17? To understand the answers to these questions, we need to know how the equation for the explanatory variables is structured in a linear model when the explanatory variable, as here, is categorical. To recap, the linear regression model is written as

    lm(y~x)

which R interprets as the two-parameter linear equation. R knows this because x is a continuous variable, so the equation it invokes is

$$y = a + bx,$$

in which the values of the parameters $a$ and $b$ are to be estimated from the data. But what about our analysis of variance? We have one explanatory variable, x = clipping, but it is a categorical variable with five levels, control, n25, n50, r10 and r5. The aov model is exactly analogous to the regression model

    lm(y~x)

but what is the associated equation? Let us look at the equation first, and try to understand it:

$$y = a + bx_1 + cx_2 + dx_3 + ex_4 + fx_5.$$

This looks just like a multiple regression, with five explanatory variables, $x_1, \ldots, x_5$. The key point to understand is that $x_1, \ldots, x_5$ are dummy variables representing the levels of the factor called $x$. The intercept, $a$, is the overall (or grand) mean for the whole experiment. The parameters $b, \ldots, f$ are differences between the grand mean and the mean for a given factor level. You will need to concentrate to understand this.

With a categorical explanatory variable, all the variables are coded as $x = 0$ except for the factor level that is associated with the $y$ value in question, when $x$ is coded as $x = 1$. You will find this hard to understand without a good deal of practice. Let us look at the first row of data in our dataframe:

    comp[1,]
      biomass clipping
    1     551      n25

So the first biomass value (551) in the dataframe comes from clipping treatment n25 which, out of all the factor levels (above), comes second in the alphabet. This means that for this row of the dataframe $x_1 = 0$, $x_2 = 1$, $x_3 = 0$, $x_4 = 0$, $x_5 = 0$. The equation for the first row therefore looks like this:

$$y = a + b \times 0 + c \times 1 + d \times 0 + e \times 0 + f \times 0,$$

so the model for the fitted value at n25 is

$$\hat{y} = a + c;$$

and similarly for the other factor levels. The fitted value $\hat{y}$ is the sum of two parameters, $a$ and $c$. The equation apparently does not contain an explanatory variable (there is no $x$ in the equation as there would be in a

regression equation, above). Note, too, how many parameters the full model contains: they are represented by the letters $a$ to $f$ and there are six of them. But we can only estimate five parameters in this experiment (one mean for each of the five factor levels). Our model contains one redundant parameter, and we need to deal with this. There are several sensible ways of doing this, and people differ in their opinions about what is the best way. The writers of R agree that treatment contrasts represent the best solution. This method does away with parameter $a$, the overall mean (in the jargon, this overall mean is intentionally aliased). The mean of the factor level that comes first in the alphabet (control, in our example) is promoted to pole position, and the other effects are shown as differences (contrasts) between this mean and the other four factor level means.

An example might help make this clearer. Here are our five means:

    means <- tapply(biomass,clipping,mean)
    means
     control      n25      n50      r10       r5
    465.1667 553.3333 569.3333 610.6667 610.5000

In treatment contrasts, the control mean (465.1667) becomes the first parameter of the model (known as the intercept). The second parameter is the difference between the second mean (n25 = 553.3333) and the intercept:

    means[2]-means[1]
         n25
    88.16667

The third parameter is the difference between the third mean (n50 = 569.3333) and the intercept:

    means[3]-means[1]
         n50
    104.1667

The fourth parameter is the difference between the fourth mean (r10 = 610.6667) and the intercept:

    means[4]-means[1]
      r10
    145.5

The fifth parameter is the difference between the fifth mean (r5 = 610.5) and the intercept:

    means[5]-means[1]
          r5
    145.3333

So much for the effect sizes. What about their standard errors? The first row is a mean, so we need the standard error of one factor-level mean. This mean is based on six numbers in this example, so the standard error of the mean is $\sqrt{s^2/n}$ where the error variance, $s^2 = 4961$, is obtained from summary.aov(model) above:

    sqrt(4961/6)
    [1] 28.75471

All the other rows have the same standard error, but it is bigger than this. That is because the effects on the second and subsequent rows are not means, but differences between means. That means that the appropriate

standard error is not the standard error of a mean, but rather the standard error of the difference between two means. When two samples are independent, the variance of their difference is the sum of their two variances. Thus, the formula for the standard error of a difference between two means is

$$se_{\text{diff}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$

When the two variances and the two sample sizes are the same (as here, because our design is balanced and we are using the pooled error variance (4961) from the summary.aov table) the formula simplifies to $\sqrt{2 \times s^2/n}$:

    sqrt(2*4961/6)
    [1] 40.6653

With some practice, that should demystify the origin of the numbers in the summary.lm table. But it does take lots of practice, and people do find this very difficult at first, so do not feel bad about it.

9.23 Contrasts

Contrasts are the essence of hypothesis testing and model simplification in analysis of variance and analysis of covariance. They are used to compare means or groups of means with other means or groups of means, in what are known as single-degree-of-freedom comparisons. There are two sorts of contrasts we might be interested in:

- contrasts we had planned to examine at the experimental design stage (these are referred to as a priori contrasts);
- contrasts that look interesting after we have seen the results (these are referred to as a posteriori contrasts).

Some people are very snooty about a posteriori contrasts, on the grounds that they were unplanned. You are not supposed to decide what comparisons to make after you have seen the analysis, but scientists do this all the time. The key point is that you should only do contrasts after the ANOVA has established that there really are significant differences to be investigated. It is not good practice to carry out tests to compare the largest mean with the smallest mean, if the ANOVA has failed to reject the null hypothesis (tempting though this may be).
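Returning briefly to the summary.lm table of Section 9.22, its structure can be reproduced on any balanced one-way design. The following is a minimal sketch on simulated data (the biomass values are hypothetical, generated with means close to those in the text, and are not the competition data): the intercept is the first factor-level mean, the other coefficients are differences from it, and the two kinds of standard error follow from the pooled error variance.

```r
# Simulated balanced one-way layout: 5 levels, 6 replicates each (hypothetical)
set.seed(5)
clipping <- gl(5, 6, labels = c("control", "n25", "n50", "r10", "r5"))
biomass  <- rnorm(30, mean = rep(c(465, 553, 569, 611, 610), each = 6), sd = 70)

model <- lm(biomass ~ clipping)
means <- as.vector(tapply(biomass, clipping, mean))

# Intercept = first mean; the other coefficients = differences from it
all.equal(unname(coef(model)), c(means[1], means[-1] - means[1]))   # TRUE

# Standard errors from the pooled error variance s2, with n = 6 per level
s2 <- sum(resid(model)^2) / df.residual(model)
se <- unname(summary(model)[["coefficients"]][, "Std. Error"])
all.equal(se[1],  sqrt(s2 / 6))                  # standard error of a mean
all.equal(se[-1], rep(sqrt(2 * s2 / 6), 4))      # s.e. of a difference
```

The equal standard errors in rows two to five depend on the design being balanced; with unequal replication each difference would have its own standard error.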
There are two important points to understand about contrasts:

- there is a huge number of possible contrasts, and
- there are only $k − 1$ orthogonal contrasts, where $k$ is the number of factor levels.

Two contrasts are said to be orthogonal to one another if the comparisons are statistically independent. Technically, two contrasts are orthogonal if the products of their contrast coefficients sum to zero (we shall see what this means in a moment).

Let us take a simple example. Suppose we have one factor with five levels and the factor levels are called a, b, c, d, e. Let us start writing down the possible contrasts. Obviously we could compare each mean singly with every other:

a vs. b, a vs. c, a vs. d, a vs. e, b vs. c, b vs. d, b vs. e, c vs. d, c vs. e, d vs. e.

But we could also compare pairs of means:

{a, b} vs. {c, d}, {a, b} vs. {c, e}, {a, b} vs. {d, e}, {a, c} vs. {b, d}, {a, c} vs. {b, e}, ...

or triplets of means:

{a, b, c} vs. d, {a, b, c} vs. e, {a, b, d} vs. c, {a, b, d} vs. e, {a, c, d} vs. b, ...

or groups of four means:

{a, b, c, d} vs. e, {a, b, c, e} vs. d, {a, b, d, e} vs. c, {a, c, d, e} vs. b, {b, c, d, e} vs. a.

You doubtless get the idea. There are absolutely masses of possible contrasts. In practice, however, we should only compare things once, either directly or implicitly. So the two contrasts a vs. b and a vs. c implicitly contrast b vs. c. This means that if we have carried out the two contrasts a vs. b and a vs. c then the third contrast b vs. c is not an orthogonal contrast because you have already carried it out, implicitly. Which particular contrasts are orthogonal depends very much on your choice of the first contrast to make. Suppose there were good reasons for comparing {a, b, c, e} vs. d. For example, d might be the placebo and the other four might be different kinds of drug treatment, so we make this our first contrast. Because $k − 1 = 4$ we only have three possible contrasts that are orthogonal to this. There may be a priori reasons to group {a, b} and {c, e}, so we make this our second orthogonal contrast. This means that we have no degrees of freedom in choosing the last two orthogonal contrasts: they have to be a vs. b and c vs. e.
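The rule that the products of contrast coefficients must sum to zero can be checked numerically for the four contrasts just chosen. This is a minimal sketch; the coefficient values follow the coding rules set out in Section 9.23.1 (grouped levels share a sign, excluded levels get 0, each row sums to 0), and the object name contrast_mat is an illustrative choice.

```r
# Contrast coefficients for levels a, b, c, d, e (one row per contrast)
contrast_mat <- rbind(
  "abce vs d" = c(-1, -1, -1,  4, -1),
  "ab vs ce"  = c( 1,  1, -1,  0, -1),
  "a vs b"    = c( 1, -1,  0,  0,  0),
  "c vs e"    = c( 0,  0,  1,  0, -1)
)
colnames(contrast_mat) <- c("a", "b", "c", "d", "e")

rowSums(contrast_mat)                    # all 0: each contrast sums to zero

# Sums of products for every pair of contrasts (off-diagonal elements)
cp <- contrast_mat %*% t(contrast_mat)
cp[upper.tri(cp)]                        # all 0: every pair is orthogonal

# By comparison, a vs. b and a vs. c are not orthogonal to one another
a_vs_b <- c(1, -1,  0, 0, 0)
a_vs_c <- c(1,  0, -1, 0, 0)
sum(a_vs_b * a_vs_c)                     # 1, not 0
```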
Just remember that with orthogonal contrasts you only compare things once.

9.23.1 Contrast coefficients

Contrast coefficients are a numerical way of embodying the hypothesis we want to test. The rules for constructing contrast coefficients are straightforward:

- Treatments to be lumped together get the same sign (plus or minus).
- Groups of means to be contrasted get opposite signs.
- Factor levels to be excluded get a contrast coefficient of 0.
- The contrast coefficients, c, must add up to 0.

Suppose that with our five-level factor {a, b, c, d, e} we want to begin by comparing the four levels {a, b, c, e} with the single level d. All levels enter the contrast, so none of the coefficients is 0. The four terms {a, b, c, e} are grouped together so they all get the same sign (minus, for example, although it makes no difference which sign is chosen). They are to be compared to d, so it gets the opposite sign (plus, in this case). The choice of what numeric values to give the contrast coefficients is entirely up to you. Most people use whole numbers rather than fractions, but it really does not matter. All that matters is that the coefficients sum to 0, so the positive and the negative coefficients have to add up to the same absolute value. In our example, comparing

four means with one mean, a natural choice of coefficients would be −1 for each of {a, b, c, e} and +4 for d. Alternatively, we could select +0.25 for each of {a, b, c, e} and −1 for d.

    Factor level:              a    b    c    d    e
    Contrast 1 coefficients:  −1   −1   −1    4   −1

Suppose the second contrast is to compare {a, b} with {c, e}. Because this contrast excludes d, we set its contrast coefficient to 0. {a, b} get the same sign (say, plus) and {c, e} get the opposite sign. Because the number of levels on each side of the contrast is equal (2 in both cases) we can use the same numeric value for all the coefficients. The value 1 is the most obvious choice (but you could use 13.7 if you wanted to be perverse):

    Factor level:              a    b    c    d    e
    Contrast 2 coefficients:   1    1   −1    0   −1

There are only two possibilities for the remaining orthogonal contrasts: a vs. b and c vs. e:

    Factor level:              a    b    c    d    e
    Contrast 3 coefficients:   1   −1    0    0    0
    Contrast 4 coefficients:   0    0    1    0   −1

The variation in y attributable to a particular contrast is called the contrast sum of squares, SSC. The sums of squares of the k − 1 orthogonal contrasts add up to the total treatment sum of squares, SSA ($\sum_{i=1}^{k-1} SSC_i = SSA$). The contrast sum of squares is computed like this:

$$SSC = \frac{\left(\sum c_i T_i / n_i\right)^2}{\sum \left(c_i^2 / n_i\right)},$$

where the $c_i$ are the contrast coefficients (above), $n_i$ are the sample sizes within each factor level and $T_i$ are the totals of the y values within each factor level (often called the treatment totals). The significance of a contrast is judged by an F test, dividing the contrast sum of squares by the error variance. The F test has 1 degree of freedom in the numerator (because a contrast is a comparison of two means, and 2 − 1 = 1) and k(n − 1) degrees of freedom in the denominator (the error variance degrees of freedom).

9.23.2 An example of contrasts in R

The following example comes from the competition experiment we analysed on p. 
511, in which the biomass of control plants is compared to the biomass of plants grown in conditions where competition was reduced in one of four different ways. There are two treatments in which the roots of neighbouring plants were cut (to 5 cm or 10 cm depth) and two treatments in which the shoots of neighbouring plants were clipped (25% or 50% of the neighbours were cut back to ground level).

    comp <- read.table("c:\\temp\\competition.txt",header=T)
    attach(comp)
    names(comp)
    [1] "biomass"  "clipping"

We start with the one-way analysis of variance:

    model1 <- aov(biomass~clipping)
    summary(model1)
                Df Sum Sq Mean Sq F value  Pr(>F)
    clipping     4  85356   21339   4.302 0.00875 **
    Residuals   25 124020    4961

Clipping treatment has a highly significant effect on biomass. But have we fully understood the result of this experiment? Probably not. For example, which factor levels had the biggest effect on biomass, and were all of the competition treatments significantly different from the controls? To answer these questions, we need to use summary.lm:

    summary.lm(model1)
    Call:
    aov(formula = biomass ~ clipping)
    Residuals:
         Min       1Q   Median       3Q      Max
    -103.333  -49.667    3.417   43.375  177.667
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   465.17      28.75  16.177  9.4e-15 ***
    clippingn25    88.17      40.66   2.168  0.03987 *
    clippingn50   104.17      40.66   2.562  0.01683 *
    clippingr10   145.50      40.66   3.578  0.00145 **
    clippingr5    145.33      40.66   3.574  0.00147 **
    Residual standard error: 70.43 on 25 degrees of freedom
    Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129
    F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752

This looks as if we need to keep all five parameters, because all five rows of the summary table have one or more significance stars. In fact, this is not the case. This example highlights the major shortcoming of treatment contrasts: they do not show how many significant factor levels we need to retain in the minimal adequate model because all of the rows are being compared with the intercept (with the controls in this case, simply because the factor level name for 'controls' comes first in the alphabet):

    levels(clipping)
    [1] "control" "n25" "n50" "r10" "r5"

9.23.3 A priori contrasts

In this experiment, there are several planned comparisons we should like to make. The obvious place to start is by comparing the control plants, exposed to the full rigours of competition, with all of the other treatments. 
That is to say, we want to contrast the first level of clipping with the other four levels. The contrast coefficients, therefore, would be 4, −1, −1, −1, −1. The next planned comparison might contrast the shoot-pruned treatments (n25 and n50) with the root-pruned treatments (r10 and r5). Suitable contrast coefficients for this would be 0, 1, 1, −1, −1 (because we are ignoring the control in this contrast). A third contrast might compare the

two depths of root pruning; 0, 0, 0, 1, −1. The last orthogonal contrast would therefore have to compare the two intensities of shoot pruning: 0, 1, −1, 0, 0. Because the factor called clipping has five levels there are only 5 − 1 = 4 orthogonal contrasts.

R is outstandingly good at dealing with contrasts, and we can associate these four user-specified a priori contrasts with the categorical variable called clipping like this:

    contrasts(clipping) <- cbind(c(4,-1,-1,-1,-1),c(0,1,1,-1,-1),c(0,0,0,1,-1),c(0,1,-1,0,0))

We can check that this has done what we wanted by typing:

    clipping
    attr(,"contrasts")
            [,1] [,2] [,3] [,4]
    control    4    0    0    0
    n25       -1    1    0    1
    n50       -1    1    0   -1
    r10       -1   -1    1    0
    r5        -1   -1   -1    0
    Levels: control n25 n50 r10 r5

which produces the matrix of contrast coefficients that we specified. One contrast is contained in each column. Note that all the columns add to zero (i.e. each set of contrast coefficients is correctly specified). Note also that the products of any two of the columns sum to zero (this shows that all the contrasts are orthogonal, as intended): for example, comparing contrasts 1 and 2 gives products 0 + (−1) + (−1) + 1 + 1 = 0. Now we can refit the model and inspect the results of our specified contrasts, rather than the default treatment contrasts:

    model2 <- aov(biomass~clipping)
    summary.lm(model2)
    Call:
    aov(formula = biomass ~ clipping)
    Residuals:
         Min       1Q   Median       3Q      Max
    -103.333  -49.667    3.417   43.375  177.667
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 561.80000   12.85926  43.688  < 2e-16 ***
    clipping1   -24.15833    6.42963  -3.757 0.000921 ***
    clipping2   -24.62500   14.37708  -1.713 0.099128 .
    clipping3     0.08333   20.33227   0.004 0.996762
    clipping4    -8.00000   20.33227  -0.393 0.697313
    Residual standard error: 70.43 on 25 degrees of freedom
    Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129
    F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752
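The column checks described above (every column sums to zero, and the products of any two columns sum to zero) can be done for the whole matrix in one step. This is a minimal sketch that rebuilds the contrast matrix by hand rather than reading it from the factor; the names `C` and `off_diag` are illustrative:

```r
# Contrast matrix as specified in the text (rows: control, n25, n50, r10, r5)
C <- cbind(c(4,-1,-1,-1,-1), c(0,1,1,-1,-1), c(0,0,0,1,-1), c(0,1,-1,0,0))
rownames(C) <- c("control", "n25", "n50", "r10", "r5")

colSums(C)      # every column should sum to zero
crossprod(C)    # t(C) %*% C: off-diagonal entries are the sums of products

# The contrasts are mutually orthogonal if all off-diagonal entries are zero
off_diag <- crossprod(C)[upper.tri(crossprod(C))]
all(off_diag == 0)
```

The diagonal of `crossprod(C)` holds the sum of squared coefficients for each contrast (20, 4, 2 and 2 here).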

Instead of requiring five parameters (as suggested by our initial treatment contrasts), this analysis shows that we need only two parameters: the overall mean (561.8) and the contrast between the controls and the four competition treatments (p = 0.000 921). All the other contrasts are non-significant.

When we specify the contrasts, the intercept is the overall (grand) mean:

    mean(biomass)
    [1] 561.8

The second row, labelled clipping1, estimates, like all contrasts, the difference between two means. But which two means, exactly? The means for the different factor levels are:

    tapply(biomass,clipping,mean)
     control      n25      n50      r10       r5
    465.1667 553.3333 569.3333 610.6667 610.5000

Thus this first contrast compares the controls (with mean 465.1667) with the mean of the other four treatments. The simplest way to get this other mean is to create a new factor, c1, that has value 1 for the controls and 2 for the rest:

    c1 <- factor(1+(clipping!="control"))
    tapply(biomass,c1,mean)
           1        2
    465.1667 585.9583

The estimate reflecting the first contrast is the difference between the overall mean (561.8) and the mean of the four non-control treatments (585.9583):

    mean(biomass) - tapply(biomass,c1,mean)[2]
            2
    -24.15833

and you see the estimate in row 2 is −24.15833. What about the second contrast? This compares the root- and shoot-pruned treatments, and c2 is a factor that lumps together the two root and two shoot treatments:

    c2 <- factor(2*(clipping=="n25")+2*(clipping=="n50")+(clipping=="r10")+(clipping=="r5"))

We can compute the mean biomass for the two treatments using tapply, then subtract the means from one another, then halve the differences:

    (tapply(biomass,c2,mean)[3]- tapply(biomass,c2,mean)[2])/2
          2
    -24.625

So the second contrast (−24.625) is half the difference between the root- and shoot-pruned treatments. What about the third contrast? This is between the two root-pruned treatments. We know their values already from tapply, above:

         r10       r5
    610.6667 610.5000

The two means differ by 0.166666 so the third contrast is half the difference between the two means:

    (610.666666-610.5)/2
    [1] 0.083333

The final contrast compares the two shoot-pruning treatments, and the contrast is half the difference between these two means:

    (553.3333-569.3333)/2
    [1] -8

To recap: the first contrast compares the overall mean with the mean of the four non-control treatments, the second contrast is half the difference between the root and shoot-pruned treatment means, the third contrast is half the difference between the two root-pruned treatments, and the fourth contrast is half the difference between the two shoot-pruned treatments.

It is important to note that the first four standard errors in the summary.lm table are all different from one another. As we have just seen, the estimate in the first row of the table is a mean, while all the other rows contain estimates that are differences between means. The overall mean on the top row is based on 30 numbers so the standard error of the mean is $se = \sqrt{s^2/30}$, where $s^2$ comes from the ANOVA table:

    sqrt(4961/30)
    [1] 12.85950

The small difference in the fourth decimal place is due to rounding errors in calling the variance 4961.0. The next row compares two means so we need the standard error of the difference between two means. The complexity comes from the fact that the two means are each based on different numbers of numbers. The overall mean is based on all five factor levels (30 numbers) while the non-control mean with which it is compared is based on four means (24 numbers). Each factor level has n = 6 replicates, so the denominator in the standard error formula is 5 × 4 × 6 = 120. Thus, the standard error of the difference between these two means is $se = \sqrt{s^2/(5 \times 4 \times 6)}$:

    sqrt(4961/(5*4*6))
    [1] 6.429749

For the second contrast, each of the means is based on 12 numbers so the standard error is $se = \sqrt{2 \times (s^2/12)}$ so the standard error of half the difference is:

    sqrt(2*(4961/12))/2
    [1] 14.37735

The last two contrasts are both between means based on six numbers, so the standard error of the difference is $se = \sqrt{2 \times (s^2/6)}$ and the standard error of half the difference is:

    sqrt(2*(4961/6))/2
    [1] 20.33265

The complexity of these calculations is another reason for preferring treatment contrasts rather than user-specified contrasts as the default. The advantage of orthogonal contrasts, however, is that the summary.lm table gives us a much better idea of the number of parameters required in the minimal adequate model (two in this case). Treatment contrasts had significance stars on all five rows (see below) because all the non-control treatments were compared to the controls (the intercept).
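The hand calculations above follow one pattern, which can be sketched in a few lines. This sketch assumes a balanced design (n replicates in every level) and mutually orthogonal contrasts whose coefficients sum to zero. It uses only the means, n = 6 and the error variance 4961 reported in the text; the variable names are illustrative. Under those assumptions each estimate is Σ(c × mean)/Σc² and its standard error is √(s²/(n Σc²)):

```r
# Values reported in the text: treatment means, replicates per level, error variance
means <- c(control = 465.1667, n25 = 553.3333, n50 = 569.3333,
           r10 = 610.6667, r5 = 610.5000)
n  <- 6       # replicates per factor level
s2 <- 4961    # error mean square from the ANOVA table

# The four user-specified orthogonal contrasts (one per column)
C <- cbind(c(4,-1,-1,-1,-1), c(0,1,1,-1,-1), c(0,0,0,1,-1), c(0,1,-1,0,0))

ssq      <- colSums(C^2)                        # sum of squared coefficients
estimate <- as.vector(t(C) %*% means) / ssq     # sum(c * mean) / sum(c^2)
se       <- sqrt(s2 / (n * ssq))                # standard error of each estimate

round(cbind(estimate, se), 4)
```

The results agree with the clipping1 to clipping4 rows of the summary.lm table to the precision of the rounded inputs.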

9.24 Model simplification by stepwise deletion

An alternative to specifying the contrasts ourselves (as above) is to aggregate non-significant factor levels in a stepwise a posteriori procedure. To demonstrate this, we revert to treatment contrasts. First, we switch off our user-defined contrasts:

    contrasts(clipping) <- NULL
    options(contrasts=c("contr.treatment","contr.poly"))

Now we fit the model with all five factor levels as a starting point:

    model3 <- aov(biomass~clipping)
    summary.lm(model3)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   465.17      28.75  16.177  9.4e-15 ***
    clippingn25    88.17      40.66   2.168  0.03987 *
    clippingn50   104.17      40.66   2.562  0.01683 *
    clippingr10   145.50      40.66   3.578  0.00145 **
    clippingr5    145.33      40.66   3.574  0.00147 **

Looking down the list of parameter estimates, we see that the most similar are the effects of root pruning to 10 and 5 cm (145.5 vs. 145.33). We shall begin by simplifying these to a single root-pruning treatment called root. The trick is to use the gets arrow <- to change the names of the appropriate factor levels. Start by copying the original factor name:

    clip2 <- clipping

Now inspect the level numbers of the various factor level names:

    levels(clip2)
    [1] "control" "n25" "n50" "r10" "r5"

The plan is to lump together r10 and r5 under the same name, root. These are the fourth and fifth levels of clip2, so we write:

    levels(clip2)[4:5] <- "root"

Now if we type

    levels(clip2)
    [1] "control" "n25" "n50" "root"

we see that r10 and r5 have indeed been replaced by root. The next step is to fit a new model with clip2 in place of clipping, and to test whether the new simpler model is significantly worse as a description of the data using anova:

    model4 <- aov(biomass~clip2)
    anova(model3,model4)

    Analysis of Variance Table
    Model 1: biomass ~ clipping
    Model 2: biomass ~ clip2
      Res.Df    RSS Df Sum of Sq F Pr(>F)
    1     25 124020
    2     26 124020 -1 -0.083333 0 0.9968

As we expected, this model simplification was completely justified. The next step is to investigate the effects using summary.lm:

    summary.lm(model4)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   465.17      28.20  16.498 2.72e-15 ***
    clip2n25       88.17      39.87   2.211 0.036029 *
    clip2n50      104.17      39.87   2.612 0.014744 *
    clip2root     145.42      34.53   4.211 0.000269 ***

It looks as if the two shoot-clipping treatments (n25 and n50) are not significantly different from one another (they differ by just 104.17 − 88.17 = 16.0 with a standard error of 39.87). We can lump these together into a single shoot-pruning treatment as follows:

    clip3 <- clip2
    levels(clip3)[2:3] <- "shoot"
    levels(clip3)
    [1] "control" "shoot" "root"

Then we fit a new model with clip3 in place of clip2:

    model5 <- aov(biomass~clip3)
    anova(model4,model5)
    Analysis of Variance Table
    Model 1: biomass ~ clip2
    Model 2: biomass ~ clip3
      Res.Df    RSS Df Sum of Sq     F Pr(>F)
    1     26 124020
    2     27 124788 -1      -768 0.161 0.6915

Again, this simplification was fully justified. Do the root and shoot competition treatments differ?

    clip4 <- clip3
    levels(clip4)[2:3] <- "pruned"
    levels(clip4)
    [1] "control" "pruned"

Now fit a new model with clip4 in place of clip3:

    model6 <- aov(biomass~clip4)
    anova(model5,model6)

    Analysis of Variance Table
    Model 1: biomass ~ clip3
    Model 2: biomass ~ clip4
      Res.Df    RSS Df Sum of Sq      F  Pr(>F)
    1     27 124788
    2     28 139342 -1    -14553 3.1489 0.08726 .

This simplification was close to significant, but we are ruthless (p > 0.05), so we accept the simplification. Now we have the minimal adequate model:

    summary.lm(model6)
    Call:
    aov(formula = biomass ~ clip4)
    Residuals:
         Min       1Q   Median       3Q      Max
    -135.958  -49.667   -4.458   50.635  145.042
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    465.2       28.8  16.152 1.01e-15 ***
    clip4pruned    120.8       32.2   3.751 0.000815 ***
    Residual standard error: 70.54 on 28 degrees of freedom
    Multiple R-squared: 0.3345, Adjusted R-squared: 0.3107
    F-statistic: 14.07 on 1 and 28 DF, p-value: 0.0008149

It has just two parameters: the mean for the controls (465.2) and the difference between the control mean and the mean of the four pruned treatments (465.2 + 120.8 = 586.0):

    tapply(biomass,clip4,mean)
     control   pruned
    465.1667 585.9583

We know that these two means are significantly different because of the p value of 0.000 815, but just to show how it is done, we can make a final model7 that has no explanatory variable at all (it fits only the overall mean). This is achieved by writing y ~ 1 in the model formula:

    model7 <- aov(biomass~1)
    anova(model6,model7)
    Analysis of Variance Table
    Model 1: biomass ~ clip4
    Model 2: biomass ~ 1
      Res.Df    RSS Df Sum of Sq      F    Pr(>F)
    1     28 139342
    2     29 209377 -1    -70035 14.073 0.0008149 ***

Note that the p value is exactly the same as in model6. The p values in R are calculated such that they avoid the need for this final step in model simplification: they are 'deletion p values'.
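The four 'Sum of Sq' values from the deletion tests above (0.0833, 768, 14553 and 70035) can be compared with the contrast sums of squares from Section 9.23.1. The sketch below applies the SSC formula to the four a priori contrasts. The raw data are not reproduced here, so the treatment totals are an assumption: they are reconstructed as the reported mean times the six replicates per level. The names `ssc` and `contrasts_list` are illustrative:

```r
# Treatment totals reconstructed (assumption) as reported mean x 6 replicates
means  <- c(control = 465.1667, n25 = 553.3333, n50 = 569.3333,
            r10 = 610.6667, r5 = 610.5000)
n      <- rep(6, 5)
totals <- means * n

# Contrast sum of squares: (sum(c * T / n))^2 / sum(c^2 / n)
ssc <- function(cc) sum(cc * totals / n)^2 / sum(cc^2 / n)

contrasts_list <- list(
  control_vs_rest = c(4, -1, -1, -1, -1),
  shoot_vs_root   = c(0,  1,  1, -1, -1),
  r10_vs_r5       = c(0,  0,  0,  1, -1),
  n25_vs_n50      = c(0,  1, -1,  0,  0))

SSC <- sapply(contrasts_list, ssc)
round(SSC, 3)
sum(SSC)    # compare with the treatment sum of squares (85356) in the ANOVA table
```

Because the contrasts are orthogonal, the four values add up (to rounding) to the treatment sum of squares, and each one matches the sum of squares lost at the corresponding deletion step.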

9.25 Comparison of the three kinds of contrasts

In order to show the differences between treatment, Helmert and sum contrasts, we shall reanalyse the competition experiment using each in turn. Contrasts are explained on p. 430. For present purposes, you need only know that R provides three types of contrasts that summarize the differences between parameter estimates in different ways. Treatment contrasts (Section 9.25.1) are more intuitive than Helmert (Section 9.25.2) or sum (Section 9.25.3) contrasts.

9.25.1 Treatment contrasts

This is the default in R. These are the contrasts you get, unless you explicitly choose otherwise.

    options(contrasts=c("contr.treatment","contr.poly"))

Here are the contrast coefficients as set under treatment contrasts:

    contrasts(clipping)
            n25 n50 r10 r5
    control   0   0   0  0
    n25       1   0   0  0
    n50       0   1   0  0
    r10       0   0   1  0
    r5        0   0   0  1

Notice that these are not orthogonal contrasts in the sense defined earlier: the coefficients in each column do not sum to zero.

    output.treatment <- lm(biomass~clipping)
    summary(output.treatment)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   465.17      28.75  16.177  9.4e-15 ***
    clippingn25    88.17      40.66   2.168  0.03987 *
    clippingn50   104.17      40.66   2.562  0.01683 *
    clippingr10   145.50      40.66   3.578  0.00145 **
    clippingr5    145.33      40.66   3.574  0.00147 **

With treatment contrasts, the factor levels are arranged in alphabetical sequence, and the level that comes first in the alphabet is made into the intercept. In our example this is control, so we can read off the control mean as 465.17, and the standard error of the mean as 28.75. The remaining four rows are differences between means, and the standard errors are standard errors of differences. Thus, clipping 25% of the neighbours back to ground level increases biomass by 88.17 over the controls and this difference is significant at p = 0.039 87. And so on. The downside of treatment contrasts is that all the rows appear to be significant despite the fact that rows 2–5 are actually not significantly different from one another, as we saw earlier.

9.25.2 Helmert contrasts

This is the default in S-PLUS, so beware if you are switching back and forth between the two languages.

    options(contrasts=c("contr.helmert","contr.poly"))
    contrasts(clipping)

            [,1] [,2] [,3] [,4]
    control   -1   -1   -1   -1
    n25        1   -1   -1   -1
    n50        0    2   -1   -1
    r10        0    0    3   -1
    r5         0    0    0    4

Notice that the contrasts are orthogonal (the products sum to zero) and their coefficients sum to zero, unlike treatment contrasts, above:

    output.helmert <- lm(biomass~clipping)
    summary(output.helmert)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  561.800     12.859  43.688   <2e-16 ***
    clipping1     44.083     20.332   2.168   0.0399 *
    clipping2     20.028     11.739   1.706   0.1004
    clipping3     20.347      8.301   2.451   0.0216 *
    clipping4     12.175      6.430   1.894   0.0699 .

With Helmert contrasts, the intercept is the overall mean (561.8). The first contrast (labelled clipping1) compares the first mean in alphabetical sequence with the average of the first and second factor levels in alphabetical sequence (control plus n25; see above): its parameter value is the mean of the first two factor levels, minus the mean of the first factor level:

    (465.16667+553.33333)/2-465.166667
    [1] 44.08332

The second contrast (clipping2) compares the third factor level (n50) and the two levels already compared (control and n25): its value is the difference between the average of the first three factor levels and the average of the first two factor levels:

    (465.16667+553.33333+569.333333)/3-(465.166667+553.3333)/2
    [1] 20.02779

The third contrast (clipping3) compares the fourth factor level (r10) and the three levels already compared (control, n25 and n50): its value is the difference between the average of the first four factor levels and the average of the first three factor levels:

    (465.16667+553.33333+569.333333+610.66667)/4-(553.3333+465.166667+569.3333)/3
    [1] 20.34725

The fourth contrast (clipping4) compares the fifth factor level (r5) and the four levels already compared (control, n25, n50 and r10): its value is the difference between the average of the first five factor levels (the grand mean), and the average of the first four factor levels:

    mean(biomass)-(465.16667+553.33333+569.333333+610.66667)/4
    [1] 12.175
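The four Helmert estimates worked out by hand above can also be recovered in one step from the five treatment means. This is a minimal sketch that assumes only the means reported earlier and R's built-in `contr.helmert`; it solves the five equations 'mean = intercept + contrast row × estimates' directly, rather than refitting the model to the raw data:

```r
# Treatment means reported in the text
means <- c(control = 465.1667, n25 = 553.3333, n50 = 569.3333,
           r10 = 610.6667, r5 = 610.5000)

H <- contr.helmert(5)    # built-in Helmert contrast matrix for five levels
X <- cbind(1, H)         # add the intercept column

# Solve X %*% beta = means for the intercept and the four contrast estimates
beta <- solve(X, means)
names(beta) <- c("(Intercept)", paste0("clipping", 1:4))
round(beta, 3)
```

The values agree with the Estimate column of summary(output.helmert) to the precision of the rounded means.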

So much for the parameter estimates. Now look at the standard errors. We have seen rather few of these values in any of the analyses we have done to date. The standard error in row 1 is the standard error of the overall mean, $\sqrt{s^2/(kn)}$, with $s^2$ taken from the overall ANOVA table:

    sqrt(4961/30)
    [1] 12.85950

The standard error in row 2 is a comparison of a group of two means with a single mean (2 × 1 = 2). Thus 2 is multiplied by the sample size n in the denominator, $\sqrt{s^2/(2n)}$:

    sqrt(4961/(2*6))
    [1] 20.33265

The standard error in row 3 is a comparison of a group of three means with a group of two means (so 3 × 2 = 6 in the denominator), $\sqrt{s^2/(6n)}$:

    sqrt(4961/(3*2*6))
    [1] 11.73906

The standard error in row 4 is a comparison of a group of four means with a group of three means (so 4 × 3 = 12 in the denominator), $\sqrt{s^2/(12n)}$:

    sqrt(4961/(4*3*6))
    [1] 8.30077

The standard error in row 5 is a comparison of a group of five means with a group of four means (so 5 × 4 = 20 in the denominator), $\sqrt{s^2/(20n)}$:

    sqrt(4961/(5*4*6))
    [1] 6.429749

It is true that the parameter estimates and their standard errors are much more difficult to understand in Helmert than in treatment contrasts. But the advantage of Helmert contrasts is that they give you proper orthogonal contrasts, and hence give a much clearer picture of which factor levels need to be retained in the minimal adequate model. They do not eliminate the need for careful model simplification, however. As we saw earlier, this example requires only two parameters in the minimal adequate model, but Helmert contrasts suggest the need for three (albeit only marginally significant) parameters.

9.25.3 Sum contrasts

Sum contrasts are the third option:

    options(contrasts=c("contr.sum","contr.poly"))
    output.sum <- lm(biomass~clipping)
    summary(output.sum)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  561.800     12.859  43.688  < 2e-16 ***
    clipping1    -96.633     25.719  -3.757 0.000921 ***
    clipping2     -8.467     25.719  -0.329 0.744743
    clipping3      7.533     25.719   0.293 0.772005
    clipping4     48.867     25.719   1.900 0.069019 .

As with Helmert contrasts, the first row contains the overall mean and the standard error of the overall mean. The remaining four rows are different: they are the differences between the grand mean and the first four factor means (control, n25, n50 and r10):

    tapply(biomass,clipping,mean) - 561.8
       control        n25        n50        r10         r5
    -96.633333  -8.466667   7.533333  48.866667  48.700000

The standard errors are all the same (25.719) for all four contrasts. The contrasts compare the grand mean (based on 30 numbers) with a single treatment mean:

    sqrt(4961/30+4961/10)
    [1] 25.71899

9.26 Aliasing

Aliasing occurs when there is no information available on which to base an estimate of a parameter value. Parameters can be aliased for one of two reasons:

- there are no data in the dataframe from which to estimate the parameter (e.g. missing values, partial designs or correlation among the explanatory variables), or
- the model is structured in such a way that the parameter value cannot be estimated (e.g. over-specified models with more parameters than necessary).

Intrinsic aliasing occurs when it is due to the structure of the model. Extrinsic aliasing occurs when it is due to the nature of the data.

Suppose that in a factorial experiment all of the animals receiving level 2 of diet (factor A) and level 3 of temperature (factor B) have died accidentally as a result of attack by a fungal pathogen. This particular combination of diet and temperature contributes no data to the response variable, so the interaction term A(2):B(3) cannot be estimated. It is extrinsically aliased, and its parameter estimate is set to zero.
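The missing-cell situation just described can be sketched with made-up data. Everything below is hypothetical: the response is simulated with `rnorm`, and the factor names A and B simply mirror the diet and temperature example. Note that current versions of R report a coefficient that cannot be estimated as `NA` in the output of `lm`, rather than printing a zero:

```r
set.seed(1)   # hypothetical data, for illustration only

# 2 x 3 factorial with four replicates per cell
d <- expand.grid(rep = 1:4, A = factor(1:2), B = factor(1:3))
d$y <- rnorm(nrow(d), mean = 10)

# Remove every observation with A at level 2 and B at level 3
d <- subset(d, !(A == "2" & B == "3"))

fit <- lm(y ~ A * B, data = d)
coef(fit)           # the A2:B3 coefficient cannot be estimated
is.na(coef(fit))    # TRUE only for the aliased interaction term
```

The function `alias(fit)` in the stats package gives a fuller description of which terms are aliased.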
If one continuous variable is perfectly correlated with another variable that has already been fitted to the data (perhaps because it is a constant multiple of the first variable), then the second term is aliased and adds nothing to the model. Suppose that x2 = 0.5x1; then fitting a model with x1 + x2 will lead to x2 being intrinsically aliased and given a zero parameter estimate.

If all the values of a particular explanatory variable are set to zero for a given level of a particular factor, then that level is intentionally aliased. This sort of aliasing is a useful programming trick in ANCOVA when we wish a covariate to be fitted to some levels of a factor but not to others.

9.27 Orthogonal polynomial contrasts: contr.poly

Here are the data from a randomized experiment with four levels of dietary supplement:

    data <- read.table("c:\\temp\\poly.txt",header=T)
    attach(data)

    names(data)
    [1] "treatment" "response"

We begin by noting that the factor levels are in alphabetical order (not in the ranked sequence none, low, medium, high that we might prefer):

    tapply(response,treatment,mean)
      high    low medium   none
      4.50   5.25   7.00   2.50

The summary.lm table from the one-way analysis of variance looks like this:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   4.8125     0.1875  25.667 7.45e-12 ***
    treatment1   -0.3125     0.3248  -0.962    0.355
    treatment2    0.4375     0.3248   1.347    0.203
    treatment3    2.1875     0.3248   6.736 2.09e-05 ***

and the summary.aov table looks like this:

    summary.aov(model)
                Df Sum Sq Mean Sq F value   Pr(>F)
    treatment    3  41.69  13.896    24.7 2.02e-05 ***
    Residuals   12   6.75   0.563

We can see that treatment is a factor but it is not ordered:

    is.factor(treatment)
    [1] TRUE
    is.ordered(treatment)
    [1] FALSE

To convert it into an ordered factor, we use the ordered function like this:

    treatment <- ordered(treatment,levels=c("none","low","medium","high"))
    levels(treatment)
    [1] "none" "low" "medium" "high"

Now the factor levels appear in their ordered sequence, rather than in alphabetical order. Fitting the ordered factor makes no difference to the summary.aov table:

    model2 <- lm(response~treatment)
    summary.aov(model2)
                Df Sum Sq Mean Sq F value   Pr(>F)
    treatment    3  41.69  13.896    24.7 2.02e-05 ***
    Residuals   12   6.75   0.562

but the summary.lm table is fundamentally different when the factors are ordered. Now the contrasts are not contr.treatment but contr.poly (which stands for 'orthogonal polynomial contrasts'):

    summary.lm(model2)
    Call:
    lm(formula = response ~ treatment)
    Residuals:
      Min    1Q Median    3Q   Max
    -1.25 -0.50   0.00  0.50  1.00
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   4.8125     0.1875  25.667 7.45e-12 ***
    treatment.L   1.7330     0.3750   4.621 0.000589 ***
    treatment.Q  -2.6250     0.3750  -7.000 1.43e-05 ***
    treatment.C  -0.7267     0.3750  -1.938 0.076520 .
    Residual standard error: 0.75 on 12 degrees of freedom
    Multiple R-squared: 0.8606, Adjusted R-squared: 0.8258
    F-statistic: 24.7 on 3 and 12 DF, p-value: 2.015e-05

The levels of the factor called treatment are no longer labelled low, medium, none as with treatment contrasts (above). Instead they are labelled L, Q and C, which stand for 'linear', 'quadratic' and 'cubic' polynomial terms, respectively. But what are the coefficients, and why are they so difficult to interpret? The first thing you notice is that the intercept 4.8125 is no longer one of the treatment means:

    tapply(response,treatment,mean)
      none    low medium   high
      2.50   5.25   7.00   4.50

You could fit a polynomial regression model to the mean values of the response with the four ordered levels of treatment represented by a continuous (dummy) explanatory variable (say, x <- c(1,2,3,4)), then fit terms for x, x² and x³ independently (using the 'as is' function I in the model formula). This is what it would look like:

    yv <- as.vector(tapply(response,treatment,mean))
    x <- 1:4
    model <- lm(yv~x+I(x^2)+I(x^3))
    summary(model)
    Call:
    lm(formula = yv ~ x + I(x^2) + I(x^3))
    Residuals:
    ALL 4 residuals are 0: no residual degrees of freedom!
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   2.0000         NA      NA       NA
    x            -1.7083         NA      NA       NA
    I(x^2)        2.7500         NA      NA       NA
    I(x^3)       -0.5417         NA      NA       NA
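The fitted cubic above passes exactly through the four treatment means. The following sketch checks this using only the means reported in the text (so it does not need the original dataframe); the name `fitted_means` is illustrative:

```r
yv <- c(2.50, 5.25, 7.00, 4.50)   # means for none, low, medium, high (from the text)
x  <- 1:4

model <- lm(yv ~ x + I(x^2) + I(x^3))
b <- coef(model)

# Evaluate b0 + b1*x + b2*x^2 + b3*x^3 at the four levels
fitted_means <- as.vector(cbind(1, x, x^2, x^3) %*% b)
round(b, 4)
fitted_means
```

With four points and four parameters the curve is an interpolating polynomial, which is why there are no residual degrees of freedom.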

There are four data points and four estimated parameters, so there are no residual degrees of freedom. Thus the equation for y as a function of treatment (x) could be written

$$y = 2 - 1.7083x + 2.75x^2 - 0.5417x^3.$$

Notice that the intercept is not one of the factor-level means. To find the mean for factor level 1 (none), the equation is evaluated for x = 1 (namely 2 − 1.7083 + 2.75 − 0.5417 = 2.5; exactly the correct answer, as we can see above). So why does R not do it this way? There are two main reasons: orthogonality and computational accuracy. If the linear, quadratic and cubic contrasts are orthogonal and fitted stepwise, then we can see whether adding an extra term produces significantly improved explanatory power in the model. In this case, for instance, there is no justification for retaining the cubic term (p = 0.076 52). Computational accuracy can become a major problem when fitting many polynomial terms, because these terms are necessarily so highly correlated:

    x <- 1:4
    x2 <- x^2
    x3 <- x^3
    cor(cbind(x,x2,x3))
               x        x2        x3
    x  1.0000000 0.9843740 0.9513699
    x2 0.9843740 1.0000000 0.9905329
    x3 0.9513699 0.9905329 1.0000000

Orthogonal polynomial contrasts fix both these problems simultaneously. Here is one way to obtain orthogonal polynomial contrasts for a factor with four levels. The contrasts (in the rows) will go up to polynomials of degree = k − 1 = 4 − 1 = 3.

    term        x1   x2   x3   x4
    linear      −3   −1    1    3
    quadratic    1   −1   −1    1
    cubic       −1    3   −3    1

Note that the linear x terms are equally spaced, and have a mean of zero (i.e. each point on the x axis is separated by 2). Also, note that all the rows sum to zero. The key point is that the pointwise products of the terms in any two rows also sum to zero: thus for the linear and quadratic terms we have products of (−3, 1, −1, 3), for the linear and cubic terms (3, −3, −3, 3) and for the quadratic and cubic terms (−1, −3, 3, 1). 
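The properties of the integer contrasts in the table above can be checked numerically, along with their relation to R's built-in `contr.poly`. In this sketch the matrix `P` stores one contrast per column (the transpose of the table), and the names `P` and `P_unit` are illustrative:

```r
# Integer contrasts from the table above, one contrast per column
P <- cbind(linear    = c(-3, -1,  1, 3),
           quadratic = c( 1, -1, -1, 1),
           cubic     = c(-1,  3, -3, 1))

colSums(P)      # each contrast sums to zero
crossprod(P)    # off-diagonal zeros: the contrasts are mutually orthogonal

# Rescale each column to unit length and compare with R's built-in version
P_unit <- sweep(P, 2, sqrt(colSums(P^2)), "/")
all.equal(P_unit, contr.poly(4), check.attributes = FALSE)
```

So the built-in contrasts for four levels are the same three vectors, each divided by the square root of its sum of squares (√20, 2 and √20).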
In R, the orthogonal polynomial contrasts have different numerical values, but the same properties:

    t(contrasts(treatment))
             [,1]       [,2]       [,3]      [,4]
    .L -0.6708204 -0.2236068  0.2236068 0.6708204
    .Q  0.5000000 -0.5000000 -0.5000000 0.5000000
    .C -0.2236068  0.6708204 -0.6708204 0.2236068

If you wanted to be especially perverse, you could reconstruct the four estimated mean values from these polynomial contrasts and the treatment effects shown in summary.lm (above). The means for none, low, medium and high are respectively

    4.8125 - 0.6708204*1.733 - 0.5*2.6250 + 0.2236068*0.7267

469 447

    [1] 2.499963
    4.8125 - 0.2236068*1.733 + 0.5*2.6250 - 0.6708204*0.7267
    [1] 5.250004
    4.8125 + 0.2236068*1.733 + 0.5*2.6250 + 0.6708204*0.7267
    [1] 6.999996
    4.8125 + 0.6708204*1.733 - 0.5*2.6250 - 0.2236068*0.7267
    [1] 4.500037

in agreement (to 3 decimal places) with the four mean values (above). Thus, the parameters can be interpreted as the coefficients in a polynomial model of degree 3 (= k − 1 because there are k = 4 levels of the factor called treatment), but only so long as the factor levels are equally spaced (and we do not know whether that is true from the information in the current dataframe, because we know only the ranking) and the class sizes are equal (that is true in the present case where n = 4).

Because we have four data points (the treatment means) and four parameters, the fit of the model to the means is perfect (there are no residual degrees of freedom and no unexplained variation). We can see what the polynomial function looks like by drawing the smooth curve on top of a barplot for the means:

    y <- as.vector(tapply(response,treatment,mean))
    model <- lm(y~poly(x,3))
    summary(model)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   4.8125         NA      NA       NA
    poly(x, 3)1   1.7330         NA      NA       NA
    poly(x, 3)2  -2.6250         NA      NA       NA
    poly(x, 3)3  -0.7267         NA      NA       NA

Now we can generate a smooth series of x values between 1 and 4 from which to predict the smooth polynomial function:

    xv <- seq(1,4,0.1)
    yv <- predict(model,list(x=xv))

The only slight difficulty is that the x axis values on the barplot do not scale exactly one-to-one with our x values, so we need to adjust the x-location of our smooth line from xv to xs = −0.5 + 1.2xv. The parameters −0.5 and 1.2 come from noting that the centres of the four bars are at 0.7, 1.9, 3.1 and 4.3:

    (bar.x <- barplot(y))
         [,1]
    [1,]  0.7
    [2,]  1.9
    [3,]  3.1
    [4,]  4.3
    barplot(y,names=levels(treatment))
    xs <- -0.5 + 1.2 * xv
    lines(xs,yv,col="red")

470 448

[Figure: barplot of the four treatment means (none, low, medium, high; y axis from 0 to 7) with the fitted polynomial curve.]

9.28 Summary of statistical modelling

The steps in the statistical analysis of data are always the same, and should always be done in the following order:

(1) data inspection (plots and tabular summaries, identifying errors and outliers);
(2) model specification (picking an appropriate model from many possibilities);
(3) ensure that there is no pseudoreplication, or specify appropriate random effects;
(4) fit a maximal model with an appropriate error structure;
(5) model simplification (by deletion from a complex initial model);
(6) model criticism (using diagnostic plots, influence tests, etc.);
(7) repeat steps 2 to 6 as often as necessary.

471

10 Regression

Regression analysis is the statistical method you use when both the response variable and the explanatory variable are continuous variables (i.e. real numbers with decimal places – things like heights, weights, volumes, or temperatures). Perhaps the easiest way of knowing when regression is the appropriate analysis is to see that a scatterplot is the appropriate graphic (in contrast to analysis of variance, say, where it would have been a box-and-whisker plot or a bar chart). We cover seven important kinds of regression analysis in this book:

• linear regression (the simplest, and much the most frequently used);
• polynomial regression (often used to test for non-linearity in a relationship);
• piecewise regression (two or more adjacent straight lines);
• robust regression (models that are less sensitive to outliers);
• multiple regression (where there are numerous explanatory variables);
• non-linear regression (to fit a specified non-linear model to data);
• non-parametric regression (used when there is no obvious functional form).

The first five cases are covered here, non-linear regression in Chapter 20 and non-parametric regression in Chapter 18 (where we deal with generalized additive models and non-parametric smoothing).

The essence of regression analysis is using sample data to estimate parameter values and their standard errors. First, however, we need to select a model which describes the relationship between the response variable and the explanatory variable(s). The simplest of all is the linear model

$$y = a + bx.$$

There are two variables and two parameters. The response variable is y, and x is a single continuous explanatory variable. The parameters are a and b: the intercept is a (the value of y when x = 0); and the slope is b (the change in y divided by the change in x which brought it about).

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.

472 450

10.1 Linear regression

Let us start with an example which shows the growth of caterpillars fed on experimental diets differing in their tannin content:

    reg.data <- read.table("c:\\temp\\regression.txt",header=T)
    attach(reg.data)
    names(reg.data)
    [1] "growth" "tannin"
    plot(tannin,growth,pch=21,col="blue",bg="red")

[Figure: scatterplot of growth (y axis, 2 to 12) against tannin (x axis, 0 to 8).]

The higher the percentage of tannin in the diet, the more slowly the caterpillars grew. You can get a crude estimate of the parameter values by eye. Tannin content increased by 8 units, in response to which growth declined from about 12 units to about 2 units, a change of –10 units of growth. The slope, b, is the change in y divided by the change in x, so

$$b \approx \frac{-10}{8} = -1.25.$$

The intercept, a, is the value of y when x = 0, and we see by inspection of the scatterplot that growth was close to 12 units when tannin was zero. Thus, our rough parameter estimates allow us to write the regression equation as

$$y \approx 12.0 - 1.25x.$$

Of course, different people would get different parameter estimates by eye. What we want is an objective method of computing parameter estimates from the data that are in some sense the ‘best’ estimates of the parameters for these data and this particular model. The convention in modern statistics is to use the maximum

473 451

likelihood estimates of the parameters as providing the ‘best’ estimates. That is to say that, given the data, and having selected a linear model, we want to find the values of the slope and intercept that make the data most likely. Keep re-reading this sentence until you understand what it is saying.

For the simple kinds of regression models with which we begin, we make several important assumptions:

• The variance in y is constant (i.e. the variance does not change as y gets bigger).
• The explanatory variable, x, is measured without error.
• The difference between a measured value of y and the value predicted by the model for the same value of x is called a residual.
• Residuals are measured on the scale of y (i.e. parallel to the y axis).
• The residuals are normally distributed.

[Figure: scatterplot of growth against tannin with the fitted regression line (red) and the residuals drawn as vertical green lines.]

    model <- lm(growth~tannin)
    abline(model,col="red")
    yhat <- predict(model,tannin=tannin)
    join <- function(i) lines(c(tannin[i],tannin[i]),c(growth[i],yhat[i]),col="green")
    sapply(1:9,join)

Under these assumptions, the maximum likelihood is given by the method of least squares. The phrase ‘least squares’ refers to the residuals, as shown in the figure. The residuals are the vertical differences between the data (solid circles) and the fitted model (the straight line). Each of the residuals is a distance, d, between a data point, y, and the value predicted by the fitted model, $\hat{y}$, evaluated at the appropriate value of the explanatory variable, x:

$$d = y - \hat{y}.$$
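The definition of a residual can be tried out directly. The sketch below is self-contained and therefore types the data in by hand; the nine (tannin, growth) pairs are an assumption on my part, since the text reads them from regression.txt, although they do agree with the sums and the fitted values quoted in this chapter.

```r
# Assumed data: nine tannin levels and the corresponding caterpillar growth
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)

model <- lm(growth ~ tannin)

# Residual = observed value minus value predicted by the fitted line
yhat <- fitted(model)
d <- growth - yhat

# The hand-computed residuals agree with those stored in the model
all.equal(unname(d), unname(resid(model)))

# Positive and negative residuals cancel (zero up to rounding error)
sum(d)

# The sum of their squares is the quantity that least squares minimizes
sum(d^2)
```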

474 452

Now we replace the predicted value $\hat{y}$ by its formula $\hat{y} = a + bx$, noting the change in sign:

$$d = y - a - bx.$$

Finally, our measure of lack of fit is the sum of the squares of these distances:

$$\sum d^2 = \sum (y - a - bx)^2.$$

The sum of the residuals will always be zero, because the positive and negative residuals cancel out, so $\sum d$ is no good as a measure of lack of fit (although $\sum |d|$ is useful in computationally intensive statistics; see p. 65). The best fit line is defined as passing through the point defined by the mean value of x ($\bar{x}$) and the mean value of y ($\bar{y}$). The large open circle marks the point $(\bar{x}, \bar{y})$. You can think of maximum likelihood as working as follows. Imagine that the straight line is pivoted, so that it can rotate around the point $(\bar{x}, \bar{y})$. When the line is too steep, some of the residuals are going to be very large. Likewise, if the line is too shallow, some of the residuals will again be very large. Now ask yourself what happens to the sum of the squares of the residuals as the slope is rotated from too shallow, through just right, to too steep. The sum of squares will be big at first, then decline to a minimum value, then increase again. A graph of the sum of squares against the value of the slope used in estimating it would look like this:

[Figure: sum of squared residuals (y axis, 0 to 140) against slope b (x axis, −2.0 to −0.5).]

    bs <- seq(-2,-0.5,0.01)
    SSE <- function(i) sum((growth - 12 - bs[i]*tannin)^2)
    plot(bs,sapply(1:length(bs),SSE),type="l",ylim=c(0,140),
         xlab="slope b",ylab="sum of squared residuals",col="blue")

The maximum likelihood estimate of the slope is the value of b associated with the minimum value of the sum of the squares of the residuals (i.e. close to –1.25). Ideally we want an analytic solution that gives the maximum likelihood of the slope directly (this is done using calculus in Box 10.1).
It turns out, however, that the least-squares estimate of b can be calculated very simply from the covariance of x and y (which we met on p. 304).
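As a quick illustration of the covariance route, the slope can be obtained as the covariance of x and y divided by the variance of x. This is a minimal sketch; the data vector is typed in by hand as an assumption (the text reads it from regression.txt), and the names `b_cov` and `b_lm` are illustrative.

```r
# Assumed data, consistent with the sums quoted in the text
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)

# cov() and var() both divide by n - 1, so the divisor cancels in the ratio
b_cov <- cov(tannin, growth) / var(tannin)
b_cov

# The same value comes out of the fitted linear model
b_lm <- unname(coef(lm(growth ~ tannin))["tannin"])
all.equal(b_cov, b_lm)
```

Because the n − 1 divisors cancel, this ratio is the same as the ratio of the corrected sum of products to the corrected sum of squares of x.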

475 453

10.1.1 The famous five in R

We want to find the minimum value of $\sum d^2$. To work this out we need the ‘famous five’: these are $\sum y^2$ and $\sum y$, $\sum x^2$ and $\sum x$, and the sum of products, $\sum xy$ (introduced on p. 331). The sum of products is worked out pointwise. You can calculate the numbers from the data the long way:

    sum(tannin);sum(tannin^2);sum(growth);sum(growth^2);sum(tannin*growth)
    [1] 36
    [1] 204
    [1] 62
    [1] 536
    [1] 175

Alternatively, as we saw on p. 332, you can create a matrix and use matrix multiplication:

    XY <- cbind(1,growth,tannin)
    t(XY) %*% XY
              growth tannin
            9     62     36
    growth 62    536    175
    tannin 36    175    204

10.1.2 Corrected sums of squares and sums of products

The next thing is to use the famous five to work out three essential ‘corrected sums’. We are already familiar with corrected sums of squares, because these are used in calculating variance: $s^2$ is calculated as the corrected sum of squares divided by the degrees of freedom (p. 333). We shall need the corrected sums of squares of both the explanatory variable, SSX, and the response variable, SSY:

$$SSX = \sum x^2 - \frac{\left(\sum x\right)^2}{n},$$

$$SSY = \sum y^2 - \frac{\left(\sum y\right)^2}{n}.$$

The third term is the corrected sum of products, SSXY. The covariance of x and y is the expectation of the vector product $E[(x - \bar{x})(y - \bar{y})]$, and this depends on the value of the corrected sum of products (p. 334), which is given by

$$SSXY = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}.$$

If you look carefully you will see that the corrected sum of products has exactly the same kind of structure as SSY and SSX. For SSY, the first term is the sum of y times y and the second term contains the sum of y times the sum of y (and similarly for SSX). For SSXY, the first term contains the sum of x times y and the second term contains the sum of x times the sum of y.
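The three corrected sums can be computed from the famous five in a few lines. The sketch below assumes the nine data pairs shown (typed in by hand so that the block runs on its own; they reproduce the five sums quoted above).

```r
# Assumed data, consistent with the famous five quoted in the text
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
n <- length(tannin)

# Shortcut formulae built from the famous five
SSX  <- sum(tannin^2) - sum(tannin)^2 / n
SSY  <- sum(growth^2) - sum(growth)^2 / n
SSXY <- sum(tannin * growth) - sum(tannin) * sum(growth) / n
c(SSX = SSX, SSY = SSY, SSXY = SSXY)

# Cross-check: var() and cov() are the corrected sums divided by n - 1
c(var(tannin) * (n - 1), var(growth) * (n - 1), cov(tannin, growth) * (n - 1))
```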
Note that for accuracy within a computer program it is best not to use these shortcut formulae, because they involve differences (minus) between potentially very large numbers (sums of squares) and

476 454

hence are potentially subject to rounding errors. Instead, when programming, use the following equivalent formulae:

$$SSY = \sum (y - \bar{y})^2,$$

$$SSX = \sum (x - \bar{x})^2,$$

$$SSXY = \sum (x - \bar{x})(y - \bar{y}).$$

The three key quantities SSY, SSX and SSXY can be computed the long way, substituting the values of the famous five:

$$SSY = 536 - \frac{62^2}{9} = 108.8889,$$

$$SSX = 204 - \frac{36^2}{9} = 60,$$

$$SSXY = 175 - \frac{36 \times 62}{9} = -73.$$

Alternatively, the matrix can be used (see p. 334).

The next question is how we use SSX, SSY and SSXY to find the maximum likelihood estimates of the parameters and their associated standard errors. It turns out that this step is much simpler than what has gone before. The maximum likelihood estimate of the slope, b, is just

$$b = \frac{SSXY}{SSX}$$

(the detailed derivation of this is in Box 10.1). So, for our example,

$$b = \frac{-73}{60} = -1.216667.$$

Compare this with our by-eye estimate of –1.25. Now that we know the value of the slope, we can use any point that we know to lie on the fitted straight line to work out the maximum likelihood estimate of the intercept, a. One part of the definition of the best-fit straight line is that it passes through the point $(\bar{x}, \bar{y})$ determined by the mean values of x and y. Since we know that y = a + bx, it must be the case that $\bar{y} = a + b\bar{x}$, and so

$$a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$

and, using R as a calculator, we get the value of the intercept as

    mean(growth)+1.216667*mean(tannin)
    [1] 11.75556

noting the change of sign. This is reasonably close to our original estimate by eye (a ≈ 12).

The function for carrying out linear regression in R is lm (which stands for ‘linear model’). The response variable comes first (growth in our example), then the tilde ~, then the name of the continuous explanatory

477 455

variable (tannin). R prints the values of the intercept and slope like this:

    lm(growth~tannin)
    Coefficients:
    (Intercept)       tannin
         11.756       -1.217

We can now write the maximum likelihood equation like this:

growth = 11.75556 – 1.216667 × tannin.

Box 10.1 The least-squares estimate of the regression slope, b

The best fit slope is found by rotating the line until the error sum of squares, SSE, is minimized, so we want to find the minimum of $\sum (y - a - bx)^2$. We start by finding the derivative of SSE with respect to b:

$$\frac{dSSE}{db} = -2 \sum x\,(y - a - bx).$$

Now, multiplying through the bracketed term by x gives

$$\frac{dSSE}{db} = -2 \sum \left(xy - ax - bx^2\right).$$

Apply summation to each term separately, set the derivative to zero, and divide both sides by –2 to remove the unnecessary constant:

$$\sum xy - \sum ax - \sum bx^2 = 0.$$

We cannot solve the equation as it stands because there are two unknowns, a and b. However, we know that the value of a is $\bar{y} - b\bar{x}$. Also, note that $\sum ax$ can be written as $a \sum x$, so replacing a and taking both a and b outside their summations gives:

$$\sum xy - \left[\frac{\sum y}{n} - b\,\frac{\sum x}{n}\right] \sum x - b \sum x^2 = 0.$$

Now multiply out the bracketed term by $\sum x$ to get:

$$\sum xy - \frac{\sum x \sum y}{n} + b\,\frac{\left(\sum x\right)^2}{n} - b \sum x^2 = 0.$$

Next, take the two terms containing b to the right-hand side, and note their change of sign:

$$\sum xy - \frac{\sum x \sum y}{n} = b \sum x^2 - b\,\frac{\left(\sum x\right)^2}{n}.$$

478 456

Finally, divide both sides by $\sum x^2 - \left(\sum x\right)^2 / n$ to obtain the required estimate b:

$$b = \frac{\sum xy - \sum x \sum y / n}{\sum x^2 - \left(\sum x\right)^2 / n}.$$

Thus, the value of b that minimizes the sum of squares of the departures is given simply by:

$$b = \frac{SSXY}{SSX}.$$

This is the maximum likelihood estimate of the slope of the linear regression.

10.1.3 Degree of scatter

There is another very important issue that needs to be considered, because two data sets with exactly the same slope and intercept could look quite different:

[Figure: two scatterplots of y (0 to 15) against x (0 to 20) with the same slope and intercept but different degrees of scatter.]

We need a way to quantify the degree of fit, so that the graph on the left has a high value and the graph on the right has a low value. It turns out that we already have the appropriate quantity: it is the sum of squares of the residuals (p. 338). This is referred to as the error sum of squares, SSE. Here, error does not mean ‘mistake’, but refers to residual variation or unexplained variation:

$$SSE = \sum (y - a - bx)^2.$$

Graphically, you can think of SSE as the sum of the squares of the lengths of the vertical residuals (the green lines) in the plot on p. 452. By tradition, however, when talking about the degree of scatter we actually quantify the lack of scatter, so the graph on the left, with a perfect fit (zero scatter) gets a value of 1, and the graph on the right, which shows no relationship at all between y and x (100% scatter), gets a value of 0. This quantity used to measure the lack of scatter is officially called the ‘coefficient of determination’, but everybody refers to it as ‘r squared’. This is an important definition that you should try to memorize: $r^2$ is the fraction of the total variation in y that is explained by variation in x. We have already defined the total variation

479 457

in the response variable as SSY (p. 454). The unexplained variation in the response variable is defined above as SSE (the error sum of squares) so the explained variation is simply SSY – SSE. Thus,

$$r^2 = \frac{SSY - SSE}{SSY}.$$

A value of $r^2$ = 1 means that all of the variation in the response variable is explained by variation in the explanatory variable (the left-hand graph below) while a value of $r^2$ = 0 means none of the variation in the response variable is explained by variation in the explanatory variable (the right-hand graph).

[Figure: two panels titled ‘r squared = 1’ (left) and ‘r squared = 0’ (right), each plotting y (0 to 15) against x (0 to 20).]

    # assumes x already holds 30 values between 0 and 20 (its definition is not shown here)
    y <- 5+0.5*x
    plot(x,y,pch=16,xlim=c(0,20),ylim=c(0,15),col="red",main="r squared = 1")
    abline(5,0.5,col="blue")
    y <- 5+runif(30)*10
    plot(x,y,pch=16,xlim=c(0,20),ylim=c(0,15),col="red",main="r squared = 0")
    abline(h=10,col="blue")

You can get the value of SSY the long way as on p. 454 (SSY = 108.8889), or using R to fit the null model in which growth is described by a single parameter, the intercept a. In R, the intercept is called parameter 1, so the null model is expressed as lm(growth~1). There is a function called deviance that can be applied to a linear model which returns the sum of the squares of the residuals (in this null case, it returns $\sum (y - \bar{y})^2$, which is SSY as we require):

    deviance(lm(growth~1))
    [1] 108.8889

The value of SSE is worked out longhand from $\sum (y - a - bx)^2$ but this is a pain, and the value can be extracted very simply from the regression model using deviance like this:

    deviance(lm(growth~tannin))
    [1] 20.07222

Now we can calculate the value of $r^2$:

$$r^2 = \frac{SSY - SSE}{SSY} = \frac{108.8889 - 20.07222}{108.8889} = 0.8156633.$$

480 458

You will not be surprised that the value of $r^2$ can be extracted from the model:

    summary(lm(growth~tannin))[[8]]
    [1] 0.8156633

The correlation coefficient, r, introduced on p. 373, is given by

$$r = \frac{SSXY}{\sqrt{SSX \times SSY}}.$$

Of course r is the square root of $r^2$, but we use the formula above so that we retain the sign of the correlation: SSXY is positive for positive correlations between y and x and negative for negative correlations between y and x. For our example, the correlation coefficient is

$$r = \frac{-73}{\sqrt{60 \times 108.8889}} = -0.9031407.$$

10.1.4 Analysis of variance in regression: SSY = SSR + SSE

The idea is simple: we take the total variation in y, SSY, and partition it into components that tell us about the explanatory power of our model. The variation that is explained by the model is called the regression sum of squares (denoted by SSR), and the unexplained variation is called the error sum of squares (denoted by SSE). Then SSY = SSR + SSE. Now, in principle, we could compute SSE because we know that it is the sum of the squares of the deviations of the data points from the fitted model, $\sum d^2 = \sum (y - a - bx)^2$. Since we know the values of a and b, we are in a position to work this out. The formula is fiddly, however, because of all those subtractions, squarings and addings-up. Fortunately, there is a very simple shortcut that involves computing SSR, the explained variation, rather than SSE. This is because

$$SSR = b \times SSXY = \frac{SSXY^2}{SSX},$$

so we can immediately work out SSR = –1.21667 × –73 = 88.81667. And since SSY = SSR + SSE, we can get SSE by subtraction:

$$SSE = SSY - SSR = 108.8889 - 88.81667 = 20.07222.$$

Using R to do the calculations, we get:

    (sse <- deviance(lm(growth~tannin)))
    [1] 20.07222
    (ssy <- deviance(lm(growth~1)))
    [1] 108.8889
    (ssr <- ssy-sse)
    [1] 88.81667

We now have all of the sums of squares, and all that remains is to think about the degrees of freedom.
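The correlation coefficient and the shortcut for SSR described above can be checked against R's built-in functions. This is a sketch under the assumption that the data are the nine pairs typed in below (they reproduce SSX = 60, SSY = 108.8889 and SSXY = −73); the variable names are illustrative.

```r
# Assumed data, consistent with the corrected sums quoted in the text
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)

SSX  <- sum((tannin - mean(tannin))^2)
SSY  <- sum((growth - mean(growth))^2)
SSXY <- sum((tannin - mean(tannin)) * (growth - mean(growth)))

# Correlation coefficient from the corrected sums, keeping its sign
r <- SSXY / sqrt(SSX * SSY)
all.equal(r, cor(tannin, growth))

# Squaring r gives the coefficient of determination
r^2

# Shortcut for the regression sum of squares, then SSE by subtraction
b   <- SSXY / SSX
SSR <- b * SSXY
SSE <- SSY - SSR
all.equal(SSE, deviance(lm(growth ~ tannin)))
```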
We had to estimate one parameter, the overall mean, $\bar{y}$, before we could calculate $SSY = \sum (y - \bar{y})^2$, so the

481 459

total degrees of freedom are n – 1. The error sum of squares was calculated only after two parameters had been estimated from the data (the intercept and the slope) since $SSE = \sum (y - a - bx)^2$, so the error degrees of freedom are n – 2. Finally, the regression model added just one parameter, the slope b, compared with the null model, so there is one regression degree of freedom. Thus, the ANOVA table looks like this:

    Source       Sum of squares   Degrees of freedom   Mean squares     F ratio
    Regression           88.817                    1   88.817            30.974
    Error                20.072                    7   s^2 = 2.86746
    Total               108.889                    8

Notice that the component degrees of freedom add up to the total degrees of freedom (this is always true, in any ANOVA table, and is a good check on your understanding of the design of the experiment). The third column, headed ‘Mean squares’, contains the variances obtained by dividing the sums of squares by the degrees of freedom in the same row. In the row labelled ‘Error’ we obtain the very important quantity called the error variance, denoted by $s^2$, by dividing the error sum of squares by the error degrees of freedom. Obtaining the value of the error variance is the main reason for drawing up the ANOVA table. Traditionally, one does not fill in the bottom box (it would be the overall variance in y, SSY/(n – 1), although this is the basis of the adjusted $r^2$ value; see p. 461). Finally, the ANOVA table is completed by working out the F ratio, which is a ratio between two variances. In most simple ANOVA tables, you divide the treatment variance in the numerator (the regression variance in this case) by the error variance $s^2$ in the denominator. The null hypothesis under test in a linear regression is that the slope of the regression line is zero (i.e. that there is no dependence of y on x). The two-tailed alternative hypothesis is that the slope is significantly different from zero (either positive or negative).
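The entries of the ANOVA table can be rebuilt step by step. The sketch below assumes the nine data pairs shown (typed in by hand; they reproduce the sums of squares in the table), and the names `error_variance` and `F_ratio` are illustrative.

```r
# Assumed data, consistent with the sums of squares quoted in the text
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
n <- length(growth)

model <- lm(growth ~ tannin)

# Sums of squares: total, error, and regression by subtraction
SSY <- deviance(lm(growth ~ 1))
SSE <- deviance(model)
SSR <- SSY - SSE

# Mean squares are sums of squares divided by their degrees of freedom
regression_variance <- SSR / 1
error_variance      <- SSE / (n - 2)

# F ratio: regression variance over error variance
F_ratio <- regression_variance / error_variance

# Probability of an F this large or larger if the slope were really zero
p_value <- pf(F_ratio, 1, n - 2, lower.tail = FALSE)

c(SSR = SSR, SSE = SSE, s2 = error_variance, F = F_ratio, p = p_value)
```

Using `lower.tail = FALSE` is equivalent to subtracting `pf()` from 1, but avoids a subtraction of nearly equal numbers when the p value is very small.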
In many applications it is not particularly interesting to reject the null hypothesis, because we are interested in the estimates of the slope and its standard error (we often know from the outset that the null hypothesis is false). To test whether the F ratio is sufficiently large to reject the null hypothesis, we compare the calculated value of F in the final column of the ANOVA table with the critical value of F, expected by chance alone (this is found from quantiles of the F distribution qf, with 1 d.f. in the numerator and n – 2 d.f. in the denominator, as described below). The table can be produced directly from the fitted model in R by using the anova function:

    anova(lm(growth~tannin))
    Analysis of Variance Table
    Response: growth
              Df Sum Sq Mean Sq F value    Pr(>F)
    tannin     1 88.817  88.817  30.974 0.0008461 ***
    Residuals  7 20.072   2.867

The same output can be obtained using summary.aov(lm(growth~tannin)). The extra column given by R is the p value associated with the computed value of F.

There are two ways to assess our F ratio of 30.974. One way is to compare it with the critical value of F, with 1 d.f. in the numerator and 7 d.f. in the denominator. We have to decide on the level of uncertainty that we are willing to put up with; the traditional value for work like this is 5%, so our certainty is 0.95. Now we can use quantiles of the F distribution, qf, to find the critical value of F:

    qf(0.95,1,7)
    [1] 5.591448

482 460

Because our calculated value of F is much larger than this critical value, we can be confident in rejecting the null hypothesis. The other way, which is perhaps better than working rigidly at the 5% uncertainty level, is to ask what is the probability of getting a value for F as big as 30.974 or larger if the null hypothesis is true. For this we use 1-pf rather than qf:

    1-pf(30.974,1,7)
    [1] 0.0008460725

It is very unlikely indeed (p < 0.001). This value is in the last column of the R output. Note that the p value is not the probability that the null hypothesis is true. On the contrary, it is the probability, given that the null hypothesis is true, of obtaining a value of F this large or larger by chance alone.

10.1.5 Unreliability estimates for the parameters

Finding the least-squares values of slope and intercept is only half of the story, however. In addition to the parameter estimates, a = 11.756 and b = –1.2167, we need to measure the unreliability associated with each of the estimated parameters. In other words, we need to calculate the standard error of the intercept and the standard error of the slope. We have already met the standard error of the mean, and we used it in calculating confidence intervals (p. 122) and in doing Student’s t test (p. 358). Standard errors of regression parameters are similar in so far as they are enclosed inside a big square root term (so that the units of the standard error are the same as the units of the parameter), and they have the error variance, $s^2$, from the ANOVA table (above) in the numerator. There are extra components, however, which are specific to the unreliability of a slope or an intercept (see Boxes 10.2 and 10.3 for details).

Box 10.2 Standard error of the slope

The uncertainty of the estimated slope increases with increasing variance and declines with increasing number of points on the graph.
In addition, however, the uncertainty is greater when the range of x values (as measured by SSX) is small:

$$se_b = \sqrt{\frac{s^2}{SSX}}.$$

Box 10.3 Standard error of the intercept

The uncertainty of the estimated intercept increases with increasing variance and declines with increasing number of points on the graph. As with the slope, uncertainty is greater when the range of x values (as measured by SSX) is small. Uncertainty in the estimate of the intercept also increases with the square of the distance between the origin and the mean value of x (as measured by $\sum x^2$):

$$se_a = \sqrt{\frac{s^2 \sum x^2}{n \times SSX}}.$$
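The two formulae in Boxes 10.2 and 10.3 translate directly into R. The following is a minimal sketch assuming the nine data pairs typed in below (consistent with the sums quoted in the text); the names `se_b` and `se_a` are illustrative.

```r
# Assumed data, consistent with the sums quoted in the text
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
n <- length(tannin)

model <- lm(growth ~ tannin)

# Error variance: error sum of squares over error degrees of freedom
s2  <- deviance(model) / (n - 2)
SSX <- sum((tannin - mean(tannin))^2)

# Standard error of the slope (Box 10.2)
se_b <- sqrt(s2 / SSX)

# Standard error of the intercept (Box 10.3)
se_a <- sqrt(s2 * sum(tannin^2) / (n * SSX))

c(se_a = se_a, se_b = se_b)

# Cross-check against the standard errors reported by summary()
summary(model)$coefficients[, "Std. Error"]
```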

483 461

Longhand calculation shows that the standard error of the slope is

$$se_b = \sqrt{\frac{s^2}{SSX}} = \sqrt{\frac{2.867}{60}} = 0.2186,$$

and the standard error of the intercept is

$$se_a = \sqrt{\frac{s^2 \sum x^2}{n \times SSX}} = \sqrt{\frac{2.867 \times 204}{9 \times 60}} = 1.0408.$$

However, in practice you would always use the summary.lm function applied to the fitted linear model like this:

    summary(lm(growth~tannin))
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
    tannin       -1.2167     0.2186  -5.565 0.000846 ***

I have stripped out the details about the residuals and the explanation of the significance stars in order to highlight the parameter estimates and their standard errors (as calculated above). The residual standard error is the square root of the error variance from the ANOVA table ($1.693 = \sqrt{2.867}$). Multiple R-squared is the fraction of the total variance explained by the model (SSR/SSY = 0.8157). The adjusted R-squared is close to, but different from, the value of $r^2$ we have just calculated. Instead of being based on the explained sum of squares, SSR, and the total sum of squares, SSY, it is based on the overall variance (a quantity we do not typically calculate), $s_T^2 = SSY/(n - 1) = 13.611$, and the error variance $s^2$ (from the ANOVA table, $s^2$ = 2.867) and is worked out like this:

$$\text{adjusted } R\text{-squared} = \frac{s_T^2 - s^2}{s_T^2}.$$

So in this example, adjusted R-squared = (13.611 – 2.867)/13.611 = 0.7893. We discussed the F statistic and p value in the previous section.
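The adjusted R-squared calculation above can be reproduced from the two variances. This sketch assumes the nine data pairs typed in below (consistent with SSY = 108.8889 and the error variance of 2.867 quoted in the text); `s2_total` and `adj_r2` are illustrative names.

```r
# Assumed data, consistent with the values quoted in the text
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
n <- length(growth)

model <- lm(growth ~ tannin)

# Overall variance of the response and the error variance of the model
s2_total <- var(growth)                 # SSY / (n - 1)
s2_error <- deviance(model) / (n - 2)   # SSE / (n - 2)

# Adjusted R-squared: the proportional reduction in variance
adj_r2 <- (s2_total - s2_error) / s2_total
adj_r2

# Cross-check against the value stored in the model summary
summary(model)$adj.r.squared
```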
The summary.lm table shows everything you need to know about the parameters and their standard errors, but there is a built-in function, confint, which produces 95% confidence intervals for the estimated parameters from the model directly like this:

    confint(model)
                    2.5 %     97.5 %
    (Intercept)  9.294457 14.2166544
    tannin      -1.733601 -0.6997325

These values are obtained by subtracting from, and adding to, each parameter estimate an interval which is the standard error times Student’s t with 7 degrees of freedom (the appropriate value of t is given by qt(.975,7) = 2.364624). The fact that neither interval includes 0 indicates that both parameter values are significantly different from zero, as established by the earlier F tests.

Of the two sorts of summary table, summary.lm is by far the more informative, because it shows the effect sizes (in this case the slope of the graph) and their unreliability estimates (the standard error of the

484 462

slope). Generally, you should resist the temptation to put ANOVA tables in your written work. The important information such as the p value and the error variance can be put in the text, or in figure legends, much more efficiently. ANOVA tables put far too much emphasis on hypothesis testing, and show nothing directly about effect sizes.

Box 10.4 Standard error for a predicted value

The standard error of a predicted value $\hat{y}$ is given by:

$$se_{\hat{y}} = \sqrt{s^2 \left[\frac{1}{n} + \frac{(x - \bar{x})^2}{SSX}\right]}.$$

It increases with the square of the difference between mean x and the value of x at which the prediction is made. As with the standard error of the slope, the wider the range of x values, SSX, the lower the uncertainty. The bigger the sample size, n, the lower the uncertainty. Note that the formula for the standard error of the intercept is just the special case of this for x = 0 (you should check the algebra of this result as an exercise). For predictions made on the basis of the regression equation we need to know the standard error for a predicted single sample of y,

$$se_y = \sqrt{s^2 \left[1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SSX}\right]},$$

while the standard error for a predicted mean for k items at a given level of $x_i$ is

$$se_{\bar{y}_i} = \sqrt{s^2 \left[\frac{1}{k} + \frac{1}{n} + \frac{(x - \bar{x})^2}{SSX}\right]}.$$

10.1.6 Prediction using the fitted model

It is good practice to save the results of fitting the model in a named object. Naming models is very much a matter of personal taste: some people like the name of the model to describe its structure, other people like the name of the model to be simple and to rely on the formula (which is part of the structure of the model) to describe what the model does. I like the second approach, so I might write

    model <- lm(growth~tannin)

The object called model can now be used for all sorts of things. For instance, we can use the predict function to work out values for the response at values of the explanatory variable that we did not measure.
Thus, we can ask for the predicted growth if tannin concentration was 5.5%. The value or values of the explanatory variable to be used for prediction are specified in a list like this:

    predict(model,list(tannin=5.5))
    [1] 5.063889

485 463

indicating a predicted growth rate of 5.06 if a tannin concentration of 5.5% had been applied. To predict growth at more than one level of tannin, the list of values for the explanatory variable is specified as a vector. Here are the predicted growth rates at 3.3, 4.4, 5.5 and 6.6% tannin:

    predict(model,list(tannin=c(3.3,4.4,5.5,6.6)))
           1        2        3        4
    7.740556 6.402222 5.063889 3.725556

For drawing smooth curves through a scatterplot we use predict with a vector of 100 or so closely-spaced x values, as illustrated on p. 207.

10.1.7 Model checking

The final thing you will want to do is to expose the model to critical appraisal. The assumptions we really want to be sure about are constancy of variance and normality of errors. The simplest way to do this is with model-checking plots. Six plots (selectable by which) are currently available: a plot of residuals against fitted values; a scale–location plot of $\sqrt{|\text{residuals}|}$ against fitted values; a normal quantile–quantile plot; a plot of Cook’s distances versus row labels; a plot of residuals against leverages; and a plot of Cook’s distances against leverage/(1 – leverage). By default four plots are provided (the first three plus the fifth):

    windows(7,7)
    par(mfrow=c(2,2))
    plot(model)

[Figure: four model-checking plots: Residuals vs Fitted (top left), Normal Q-Q (top right), Scale-Location (bottom left) and Residuals vs Leverage with Cook’s distance contours (bottom right).]

The first graph (top left) shows residuals on the y axis against fitted values on the x axis. It takes experience to interpret these plots, but what you do not want to see is lots of structure or pattern in the plot. Ideally, as here, the points should look like the sky at night. It is a major problem if the scatter increases as the fitted values get bigger; this would look like a wedge of cheese on its side (see p. 405). But in our present case, everything is OK on the constancy of variance front.

The next plot (top right) shows the normal qqnorm plot (p. 406) which should be a straight line if the errors are normally distributed. Again, the present example looks fine. If the pattern were S-shaped or banana-shaped, we would need to fit a different model to the data.

The third plot (bottom left) is a repeat of the first, but on a different scale; it shows the square root of the standardized residuals (where all the values are positive) against the fitted values. If there was a problem, such as the variance increasing with the mean, then the points would be distributed inside a triangular shape, with the scatter of the residuals increasing as the fitted values increase. The red line would then show a pronounced upward trend. But there is no such pattern here, which is good.

The fourth and final plot (bottom right) shows standardized residuals as a function of leverage, along with Cook's distance (p. 419) for each of the observed values of the response variable. The point of this plot is to highlight those y values that have the biggest effect on the parameter estimates (high influence; p. 409). You can see that point 9 has the highest leverage, but point 7 is quite influential (it is closest to the Cook's distance contour). You might like to investigate how much this influential point (6, 2) affected the parameter estimates and their standard errors.
To do this, we repeat the statistical modelling but leave out the point in question, using subset like this (recall that != means 'not equal to'):

    model2 <- update(model,subset=(tannin != 6))
    summary(model2)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  11.6892     0.8963  13.042 1.25e-05 ***
    tannin       -1.1171     0.1956  -5.712  0.00125 **

First of all, notice that we have lost one degree of freedom, because there are now eight values of y rather than nine. The estimate of the slope has changed from –1.2167 to –1.1171 (a difference of about 9%) and the standard error of the slope has changed from 0.2186 to 0.1956 (a difference of about 12%). What you do in response to this information depends on the circumstances. Here, we would simply note that point (6, 2) was influential and stick with our first model, using all the data. In other circumstances, a data point might be so influential that the structure of the model is changed completely by leaving it out. In that case, we might gather more data or, if the study was already finished, we might publish both results (with and without the influential point) so that the reader could make up their own mind about the interpretation. The important point is that we always do model checking; the summary.lm(model) table is not the end of the process of regression analysis.

You might also want to check for lack of serial correlation in the residuals (e.g. time series effects) using the durbinWatsonTest function from the car package (see p. 484), but there are too few data to use it with this example.
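The leverage and influence values behind the fourth diagnostic plot can also be extracted as numbers, which makes the comparison of the two fits easy to reproduce. This is a minimal sketch: the data values are an assumption standing in for the tannin dataframe, and lev, cook, most.influential and change are names made up for the example.

```r
# Assumed data (illustration only): substitute your own growth and tannin values
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
model  <- lm(growth ~ tannin)

lev  <- hatvalues(model)         # leverage of each point
cook <- cooks.distance(model)    # Cook's distance of each point
most.influential <- unname(which.max(cook))
round(lev, 3)
round(cook, 3)

# Refit without the most influential point and compare the estimates
keep   <- seq_along(tannin) != most.influential
model2 <- lm(growth[keep] ~ tannin[keep])

# Percentage change in intercept and slope, relative to the reduced model
change <- 100 * (unname(coef(model)) - unname(coef(model2))) / unname(coef(model2))
round(change, 1)
```

With these assumed data the point with the largest Cook's distance is row 7, and the second element of change corresponds to the roughly 9% difference in slope quoted above.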

10.2 Polynomial approximations to elementary functions

Elementary functions such as sin(x), log(x + 1) and exp(x) can be expressed as Maclaurin series:

$$ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots, $$

$$ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \dots, $$

$$ \exp(x) = \frac{x^0}{0!} + \frac{x^1}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots, $$

$$ \log(x + 1) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \frac{x^5}{5} - \dots $$

In fact, we can approximate any smooth continuous single-valued function by a polynomial of sufficiently high degree. To see this in action, consider the graph of sin(x) against x in the range 0 < x < π (where x is an angle measured in radians):

    x <- seq(0,pi,0.01)
    y <- sin(x)
    plot(x,y,type="l",ylab="sin(x)")

Up to about x = 0.3 the very crude approximation sin(x) = x works reasonably well. The first approximation, including a single extra term for –x^3/3!, extends the reasonable fit up to about x = 0.8:

    a1 <- x-x^3/factorial(3)
    lines(x,a1,col="green")

Adding the term in x^5/5! captures the first peak in sin(x) quite well. And so on.

    a2 <- x-x^3/factorial(3)+x^5/factorial(5)
    lines(x,a2,col="red")

[Figure: sin(x) against x for 0 to π, with the polynomial approximations a1 (green) and a2 (red) overlaid.]
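To see how quickly the approximation improves as terms are added, the partial sums of the sine series can be wrapped in a small function. This is only a sketch; sin.series and max.err are names invented for the illustration and are not part of R.

```r
# Partial sum of the Maclaurin series for sin(x), using the first `terms` terms
sin.series <- function(x, terms) {
  k <- 0:(terms - 1)     # term index 0, 1, 2, ...
  sapply(x, function(xi) sum((-1)^k * xi^(2 * k + 1) / factorial(2 * k + 1)))
}

x <- seq(0, pi, 0.01)

# Largest absolute error over 0 <= x <= pi when using 1, 2, ..., 5 terms
max.err <- sapply(1:5, function(m) max(abs(sin(x) - sin.series(x, m))))
max.err
```

Here sin.series(x, 2) reproduces a1 and sin.series(x, 3) reproduces a2 from the code above; the maximum error over the plotted range shrinks with every extra term.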

10.3 Polynomial regression

The relationship between y and x often turns out not to be a straight line. However, Occam's razor requires that we fit a straight-line model unless a non-linear relationship is significantly better at describing the data. So this begs the question: how do we assess the significance of departures from linearity? One of the simplest ways is to use polynomial regression.

The idea of polynomial regression is straightforward. As before, we have just one continuous explanatory variable, x, but we can fit higher powers of x, such as x^2 and x^3, to the model in addition to x to explain curvature in the relationship between y and x. It is useful to experiment with the kinds of curves that can be generated with very simple models. Even if we restrict ourselves to the inclusion of a quadratic term, x^2, there are many curves we can describe, depending upon the signs of the linear and quadratic terms:

    par(mfrow=c(2,2))
    x <- seq(0,10,0.1)
    y1 <- 4 + 2 * x - 0.1 * x^2
    y2 <- 4 + 2 * x - 0.2 * x^2
    y3 <- 12 - 4 * x + 0.35 * x^2
    y4 <- 4 + 0.5 * x + 0.1 * x^2
    plot(x,y1,type="l",ylim=c(0,15),ylab="y",col="red")
    plot(x,y2,type="l",ylim=c(0,15),ylab="y",col="red")
    plot(x,y3,type="l",ylim=c(0,15),ylab="y",col="red")
    plot(x,y4,type="l",ylim=c(0,15),ylab="y",col="red")

[Figure: four panels of y against x (0 to 10), one for each of the quadratic curves y1 to y4.]

In the top left-hand panel, there is a curve with positive but declining slope, with no hint of a hump (y = 4 + 2x – 0.1x^2). The top right-hand graph shows a curve with a clear maximum (y = 4 + 2x – 0.2x^2), and at bottom left we have a curve with a clear minimum (y = 12 – 4x + 0.35x^2). The bottom right-hand curve shows a positive association between y and x with the slope increasing as x increases (y = 4 + 0.5x + 0.1x^2). So you can see that a simple quadratic model with three parameters (an intercept, a slope for x,

and a slope for x^2) is capable of describing a wide range of functional relationships between y and x. It is very important to understand that the quadratic model describes the relationship between y and x; it does not pretend to explain the mechanistic (or causal) relationship between y and x.

We can see how polynomial regression works by analysing an example where diminishing returns in output (yv) are suspected as inputs (xv) are increased:

    poly <- read.table("c:\\temp\\diminish.txt",header=T)
    attach(poly)
    names(poly)
    [1] "xv" "yv"

We begin by fitting a straight-line model to the data:

    windows(7,4)
    par(mfrow=c(1,2))
    model1 <- lm(yv~xv)
    plot(xv,yv,pch=21,col="brown",bg="yellow")
    abline(model1,col="navy")

This is not a bad fit to the data (r^2 = 0.8725), but there is a distinct hint of curvature (diminishing returns in this case). Next, we fit a second explanatory variable which is the square of the x value (the so-called 'quadratic term'). Note the use of I (for 'as is') in the model formula; see p. 210.

    model2 <- lm(yv~xv+I(xv^2))

Now we use model2 to predict the fitted values for a smooth range of x values between 0 and 90:

    plot(xv,yv,pch=21,col="brown",bg="yellow")
    x <- 0:90
    y <- predict(model2,list(xv=x))
    lines(x,y,col="navy")

[Figure: two panels of yv against xv; left with the fitted straight line, right with the fitted quadratic curve.]

This looks like a slightly better fit than the straight line (r^2 = 0.9046), but we shall choose between the two models on the basis of an F test using anova:

    anova(model1,model2)
    Analysis of Variance Table
    Model 1: yv ~ xv
    Model 2: yv ~ xv + I(xv^2)

      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     16 91.057
    2     15 68.143  1    22.915 5.0441 0.0402 *

The more complicated curved model is a significant improvement over the linear model (p = 0.04) so we accept that there is evidence of curvature in these data.

10.4 Fitting a mechanistic model to data

Rather than fitting some arbitrary model for curvature (as above, with a quadratic term for inputs), we sometimes have a mechanistic model relating the value of the response variable to the explanatory variable (e.g. a mathematical model of a physical process). In the following example we are interested in the decay of organic material in soil, and our mechanistic model is based on the assumption that the fraction of dry matter lost per year is a constant. This leads to a two-parameter model of exponential decay in which the amount of material remaining (y) is a function of time (t):

$$ y = y_0 e^{-bt}. $$

Here $y_0$ is the initial dry mass (at time t = 0) and b is the decay rate (the parameter we want to estimate by linear regression). Taking logs of both sides, we get

$$ \log(y) = \log(y_0) - bt. $$

Now you can see that we can estimate the parameter of interest, b, as the slope of a linear regression of log(y) on t (i.e. we log-transform the y axis but not the x axis) and the value of $y_0$ as the antilog of the intercept. We begin by plotting our data:

    data <- read.table("c:\\temp\\Decay.txt",header=T)
    names(data)
    [1] "time" "amount"
    attach(data)
    plot(time,amount,pch=21,col="blue",bg="brown")
    abline(lm(amount~time),col="green")

The curvature in the relationship is clearly evident from the poor fit of the straight-line (green) model through the scatterplot (there are groups of positive residuals for low and high values of time, and a large group of negative residuals at intermediate times).
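The log-linearization described above can be tried out on simulated data, where the true parameters are known in advance. This is a sketch under stated assumptions: the values y0 = 100 and b = 0.07, the noise level and the seed are all invented for the illustration, and years, sim.amount, b.hat and y0.hat are hypothetical names.

```r
set.seed(1)                 # arbitrary seed so that the sketch is repeatable
years <- 0:30               # time
y0 <- 100                   # assumed initial dry mass
b  <- 0.07                  # assumed decay rate

# Multiplicative (log-normal) noise keeps the simulated amounts positive
sim.amount <- y0 * exp(-b * years) * exp(rnorm(length(years), sd = 0.05))

fit <- lm(log(sim.amount) ~ years)

b.hat  <- -coef(fit)[["years"]]              # decay rate: minus the slope
y0.hat <- exp(coef(fit)[["(Intercept)"]])    # initial mass: antilog of the intercept
c(b.hat = b.hat, y0.hat = y0.hat)
```

The estimates should land close to the values used to generate the data, which is a useful check that the slope and intercept are being read the right way round.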
Now we fit the linear model of log(amount) as a function of time:

    model <- lm(log(amount)~time)
    summary(model)

    Call:
    lm(formula = log(amount) ~ time)

    Residuals:
        Min      1Q  Median      3Q     Max
    -0.5935 -0.2043  0.0067  0.2198  0.6297

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.547386   0.100295   45.34  < 2e-16 ***
    time        -0.068528   0.005743  -11.93 1.04e-12 ***

    Residual standard error: 0.286 on 29 degrees of freedom
    Multiple R-squared: 0.8308, Adjusted R-squared: 0.825
    F-statistic: 142.4 on 1 and 29 DF, p-value: 1.038e-12

Thus, the slope is –0.068528 and $y_0$ is the antilog of the intercept: $y_0$ = exp(4.547386) = 94.38536. The equation can now be parameterized (with standard errors in brackets) as

$$ y = e^{4.5474(\pm 0.1003) - 0.0685(\pm 0.00574)t}, $$

or written in its original form, without the uncertainty estimates, as

$$ y = 94.385 e^{-0.0685t}, $$

and we can draw the fitted line through the data, remembering to take the antilogs of the predicted values (the model predicts log(amount) and we want amount), like this:

    ts <- seq(0,30,0.02)
    left <- exp(predict(model,list(time=ts)))
    plot(time,amount,pch=21,col="blue",bg="brown")
    lines(ts,left,col="blue")

[Figure: two panels of amount against time (0 to 30); left with the straight-line fit, right with the fitted exponential decay curve.]

10.5 Linear regression after transformation

Many mathematical functions that are non-linear in their parameters can be linearized by transformation (see p. 258). The most frequent transformations (in order of frequency of use), are logarithms, antilogs and reciprocals. Here is an example of linear regression associated with a power law (p. 261):

$$ y = a x^b. $$

This is a two-parameter function, where the parameter a describes the slope of the function for low values of x and b is the shape parameter. For b = 0 we have a horizontal relationship y = a, for b = 1 we have a straight line through the origin y = ax with slope a, for b > 1 the slope is positive and increases with increasing x, for

0 < b < 1 the slope is positive but decreases with increasing x, while for b < 0 (negative powers) the curve is a negative hyperbola that is asymptotic to infinity as x approaches 0 and asymptotic to zero as x approaches infinity.

Let us load a new dataframe and plot the data:

    power <- read.table("c:\\temp\\power.txt",header=T)
    attach(power)
    names(power)
    [1] "area" "response"
    plot(area,response,pch=21,col="green",bg="orange")
    abline(lm(response~area),col="blue")
    plot(log(area),log(response),pch=21,col="green",bg="orange")
    abline(lm(log(response)~log(area)),col="blue")

[Figure: two panels; left, response against area; right, log(response) against log(area), each with its fitted straight line.]

The two plots look very similar (this is not always the case), but we need to compare the two models:

    model1 <- lm(response~area)
    model2 <- lm(log(response)~log(area))
    summary(model2)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.75378    0.02613  28.843  < 2e-16 ***
    log(area)    0.24818    0.04083   6.079 1.48e-06 ***

We need to do a t test to see whether the estimated shape parameter, b = 0.24818, is significantly less than b = 1 (a straight line):

$$ t = \frac{|0.24818 - 1.0|}{0.04083} = 18.41342. $$

This is highly significant (p < 0.0001), so we conclude that there is a non-linear relationship between response and area. Let us get a visual comparison of the two models:

    windows(7,7)
    plot(area,response,pch=21,col="green",bg="orange")

    abline(lm(response~area),col="blue")
    xv <- seq(1,2.7,0.01)
    yv <- exp(0.75378)*xv^0.24818
    lines(xv,yv,col="red")

This is a nice example of the distinction between statistical significance and scientific importance. The power law transformation shows that the curvature is highly significant (b < 1 with p < 0.0001) but over the range of the data, and given the high variance in y, the effect of the curvature is very small; the straight line and the power function are very close to one another. However, the choice of model makes an enormous difference if the function is to be used for prediction. Here are the two functions over an extended range of values for x:

    plot(area,response,xlim=c(0,5),ylim=c(0,4),pch=21,col="green",bg="orange")
    abline(lm(response~area),col="blue")
    xv <- seq(0,5,0.01)
    yv <- exp(0.75378)*xv^0.24818
    lines(xv,yv,col="red")

[Figure: response against area over the extended range 0 to 5, with the linear fit (blue) and the power-law fit (red).]

The moral is clear: you need to be extremely careful when using regression models for prediction. If you know that response must be zero when area is zero (the graph has to pass through the origin) then obviously the power function is likely to be better for extrapolation to the left of the data. But if we have no information on non-linearity other than that contained within the data, then parsimony suggests that errors will be smaller using the simpler, linear model for prediction. Both models are equally good at describing the data (the linear model has r^2 = 0.574 and the power law model has r^2 = 0.569), but extrapolation beyond the range of the data is always fraught with difficulties. Targeted collection of new data for response at values of area close to 0 and close to 5 might resolve the issue.
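The t test of the shape parameter against b = 1 used in this section needs only the estimate and its standard error from the summary table. The sketch below recomputes it; note that the residual degrees of freedom are not shown in the output reproduced above, so df.assumed is a placeholder value chosen purely for illustration, and b.hat, se.b, t.val and p.val are names introduced here.

```r
b.hat <- 0.24818      # estimated shape parameter (slope of the log-log regression)
se.b  <- 0.04083      # its standard error, from the summary table

# t statistic for the null hypothesis b = 1 (a straight line through the origin)
t.val <- abs(b.hat - 1) / se.b
t.val

# Two-tailed p value. The degrees of freedom are NOT given in the text:
# 28 is an assumed value; with a fitted model use df.residual(model2) instead.
df.assumed <- 28
p.val <- 2 * pt(-t.val, df.assumed)
p.val
```

With a t value this large, the conclusion of p < 0.0001 holds for any plausible number of degrees of freedom.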

10.6 Prediction following regression

The popular notion is that predicting the future is impossible, and that attempts at prediction are nothing more than crystal-gazing. However, all branches of applied science rely upon prediction. These predictions may be based on extensive experimentation (as in engineering or agriculture) or they may be based on detailed, long-term observations (as in astronomy or meteorology). In all cases, however, the main issue to be confronted in prediction is how to deal with uncertainty: uncertainty about the suitability of the fitted model, uncertainty about the representativeness of the data used to parameterize the model, and uncertainty about future conditions (in particular, uncertainty about the future values of the explanatory variables).

There are two kinds of prediction, and these are subject to very different levels of uncertainty. Interpolation, which is prediction within the measured range of the data, can often be very accurate and is not greatly affected by model choice. Extrapolation, which is prediction beyond the measured range of the data, is far more problematical, and model choice is a major issue. Choice of the wrong model can lead to wildly different predictions (see p. 471).

Here are two kinds of plots involved in prediction following regression: the first illustrates uncertainty in the parameter estimates; the second indicates uncertainty about predicted values of the response. We continue with the tannin example:

    reg.data <- read.table("c:\\temp\\regression.txt",header=T)
    attach(reg.data)
    names(reg.data)
    [1] "growth" "tannin"
    plot(tannin,growth,pch=21,col="blue",bg="red")

[Figure: scatterplot of growth against tannin.]

    model <- lm(growth~tannin)
    abline(model,col="blue")

The first plot is intended to show the uncertainty associated with the estimate of the slope. It is easy to extract the slope from the vector of coefficients:

    coef(model)[2]
       tannin
    -1.216667

The standard error of the slope is a little trickier to find. After some experimentation, you will discover that it is in the fourth element of the list that is summary(model):

    summary(model)[[4]][4]
    [1] 0.2186115

Here is a function that will add dotted lines showing two extra regression lines to our existing plot: the estimated slope plus and minus one standard error of the slope:

    se.lines <- function(model) {
      b1 <- coef(model)[2]+ summary(model)[[4]][4]
      b2 <- coef(model)[2]- summary(model)[[4]][4]
      xm <- sapply(model[[12]][2],mean)
      ym <- sapply(model[[12]][1],mean)
      a1 <- ym-b1*xm
      a2 <- ym-b2*xm
      abline(a1,b1,lty=2,col="blue")
      abline(a2,b2,lty=2,col="blue")
    }
    se.lines(model)

[Figure: growth against tannin with the fitted line and dotted lines for the slope plus and minus one standard error.]
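The positional indexing in se.lines (element 12 of the model, element 4 of the summary) works, but it is easy to break. The following variant is a sketch of the same idea using named accessors; se.line.coefs and se.lines2 are hypothetical names, and the data values are an assumption standing in for the tannin dataframe.

```r
# Intercepts and slopes of the two lines with slope b +/- one standard error,
# both forced through the point (mean x, mean y)
se.line.coefs <- function(model) {
  b    <- unname(coef(model)[2])                      # estimated slope
  se.b <- coef(summary(model))[2, "Std. Error"]       # its standard error
  mf   <- model.frame(model)                          # response first, then x
  xm   <- mean(mf[[2]])
  ym   <- mean(mf[[1]])
  slopes <- c(b + se.b, b - se.b)
  cbind(intercept = ym - slopes * xm, slope = slopes)
}

# Add the two dotted lines to an existing scatterplot
se.lines2 <- function(model) {
  ab <- se.line.coefs(model)
  for (i in 1:2) abline(ab[i, "intercept"], ab[i, "slope"], lty = 2, col = "blue")
}

# Assumed data (illustration only)
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
model <- lm(growth ~ tannin)
se.line.coefs(model)
```

Separating the calculation from the drawing means the numbers can be inspected without opening a graphics window; call se.lines2(model) after plot(tannin,growth) to add the lines.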

More often, however, we are interested in the uncertainty about predicted values (rather than uncertainty of parameter estimates, as above). We might want to draw the 95% confidence intervals associated with predictions of y at different values of x. As we saw on p. 460, uncertainty increases with the square of the difference between the mean value of x and the value of x at which the value of y is to be predicted. Before we can draw these lines we need to calculate a vector of x values; you need 100 or so values to make an attractively smooth curve. Then we need the value of Student's t (p. 122). Finally, we multiply Student's t by the standard error of the predicted value of y (p. 462) to get the confidence interval. This is added to the fitted values of y to get the upper limit and subtracted from the fitted values of y to get the lower limit. Here is the function:

    ci.lines <- function(model) {
      xm <- sapply(model[[12]][2],mean)
      n <- sapply(model[[12]][2],length)
      ssx <- sum(model[[12]][2]^2)-sum(model[[12]][2])^2/n
      s.t <- qt(0.975,(n-2))
      xv <- seq(min(model[[12]][2]),max(model[[12]][2]),length=100)
      yv <- coef(model)[1]+coef(model)[2]*xv
      se <- sqrt(summary(model)[[6]]^2*(1/n+(xv-xm)^2/ssx))
      ci <- s.t*se
      uyv <- yv+ci
      lyv <- yv-ci
      lines(xv,uyv,lty=2,col="blue")
      lines(xv,lyv,lty=2,col="blue")
    }

We replot the linear regression, then overlay the confidence intervals (Box 10.4):

    plot(tannin,growth,pch=21,col="blue",bg="red")
    abline(model, col="blue")
    ci.lines(model)

[Figure: growth against tannin with the fitted line and dotted 95% confidence limits.]

This draws attention to the points at tannin = 3 and tannin = 6 that fall outside the 95% confidence limits of our fitted values.

You can speed up this procedure by using the built-in ability to generate confidence intervals coupled with matlines. The familiar 95% confidence intervals are int="c", while prediction intervals (fitted values plus or minus 2 standard deviations) are int="p".

    plot(tannin,growth,pch=16,ylim=c(0,15))
    model <- lm(growth~tannin)

As usual, start by generating a series of x values for generating the curves, then create the scatterplot. The y values are predicted from the model, specifying int="c", then matlines is used to draw the regression line (solid) and the two confidence intervals (dotted), producing exactly the same graph as our last plot (above) without writing a special function:

    xv <- seq(0,8,0.1)
    yv <- predict(model,list(tannin=xv),int="c")
    matlines(xv,yv,lty=c(1,2,2),col="black")

A similar plot can be obtained using the effects library (see p. 968).

10.7 Testing for lack of fit in a regression

The unreliability estimates of the parameters explained in Boxes 10.2 and 10.3 draw attention to the important issues in optimizing the efficiency of regression designs. We want to make the error variance as small as possible (as always), but in addition, we want to make SSX as large as possible, by placing as many points as possible at the extreme ends of the x axis. Efficient regression designs allow for:

• replication of at least some of the levels of x;
• a preponderance of replicates at the extremes (to maximize SSX);
• sufficient levels of x to allow testing for non-linearity;
• sufficient different values of x to allow accurate location of thresholds.

Here is an example where replication allows estimation of pure sampling error, and this in turn allows a test of the significance of the data's departure from linearity.
As the concentration of an inhibitor is increased, the reaction rate declines:

    data <- read.delim("c:\\temp\\lackoffit.txt")
    attach(data)
    names(data)
    [1] "conc" "rate"
    plot(conc,jitter(rate),pch=16,col="red",ylim=c(0,8),ylab="rate")
    abline(lm(rate~conc),col="blue")

[Figure: rate (jittered) against conc (0 to 6) with the fitted regression line.]

The linear regression does not look too bad, and the slope is highly significantly different from zero:

    model.reg <- lm(rate~conc)
    summary(model.reg)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   6.7262     0.4559  14.755 7.35e-12 ***
    conc         -0.9405     0.1264  -7.439 4.85e-07 ***

    Residual standard error: 1.159 on 19 degrees of freedom
    Multiple R-squared: 0.7444, Adjusted R-squared: 0.7309
    F-statistic: 55.33 on 1 and 19 DF, p-value: 4.853e-07

Because there is replication at each level of x we can do something extra, compared with a typical regression analysis. We can estimate what is called the pure error variance. This is the sum of the squares of the differences between the y values and the mean values of y for the relevant level of x. This should sound somewhat familiar. In fact, it is the definition of SSE from a one-way analysis of variance (see p. 501). By creating a factor to represent the seven levels of x, we can estimate this SSE simply by fitting a one-way ANOVA:

    fac.conc <- factor(conc)
    model.aov <- aov(rate~fac.conc)
    summary(model.aov)

                Df Sum Sq Mean Sq F value   Pr(>F)
    fac.conc     6  87.81  14.635   17.07 1.05e-05 ***
    Residuals   14  12.00   0.857

This shows that the pure error sum of squares is 12.0 on 14 degrees of freedom (three replicates, and hence 2 d.f., at each of seven levels of x). See if you can figure out why this sum of squares is less than the residual sum of squares observed in the model.reg regression (25.512). If the means from the seven different concentrations all fell exactly on the same straight line then the two sums of squares would be identical. It is the fact that the means do not fall on the regression line that causes the difference. The difference between these two sums of squares

(25.512 – 12.0 = 13.512) is a measure of lack of fit of the rate data to the straight-line model. We can compare the two models to see if they differ in their explanatory powers:

    anova(model.reg,model.aov)

    Analysis of Variance Table
    Model 1: rate ~ conc
    Model 2: rate ~ fac.conc
      Res.Df    RSS Df Sum of Sq      F  Pr(>F)
    1     19 25.512
    2     14 12.000  5    13.512 3.1528 0.04106 *

A single ANOVA table showing the lack-of-fit sum of squares on a separate line is obtained by fitting both the regression line (1 d.f.) and the lack of fit (5 d.f.) in the same model:

    anova(lm(rate~conc+fac.conc))

    Analysis of Variance Table
    Response: rate
              Df Sum Sq Mean Sq F value    Pr(>F)
    conc       1 74.298  74.298 86.6806 2.247e-07 ***
    fac.conc   5 13.512   2.702  3.1528   0.04106 *
    Residuals 14 12.000   0.857

To get a visual impression of this lack of fit we can draw vertical lines from the mean values to the fitted values of the linear regression for each level of x:

    my <- as.vector(tapply(rate,fac.conc,mean))
    for (i in 0:6)
      lines(c(i,i),c(my[i+1],predict(model.reg,list(conc=0:6))[i+1]),col="green")
    points(0:6,my,pch=16,col="green")

[Figure: rate against conc with the regression line, the seven group means (green points) and vertical green lines joining each mean to the fitted line.]
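The arithmetic behind the lack-of-fit test can be done by hand, which shows exactly where each sum of squares and its degrees of freedom come from. The sketch below uses simulated data with the same layout as the example (seven levels of conc with three replicates each); the simulated rate values, the seed and the names pure.ss, lof.ss, F.lof and p.lof are all assumptions made for the illustration.

```r
set.seed(42)                                   # arbitrary seed
conc <- rep(0:6, each = 3)                     # seven levels, three replicates
# Made-up curved response with noise (not the lackoffit.txt data)
rate <- 7 - conc + 0.15 * (conc - 3)^2 + rnorm(21, sd = 0.8)

model.reg <- lm(rate ~ conc)

# Pure error: squared deviations of each y from the mean of its own x level
pure.ss <- sum((rate - ave(rate, conc))^2)
pure.df <- length(rate) - length(unique(conc))     # 21 - 7 = 14

# Lack of fit: the regression residual SS minus the pure error SS
lof.ss <- deviance(model.reg) - pure.ss
lof.df <- df.residual(model.reg) - pure.df         # 19 - 14 = 5

F.lof <- (lof.ss / lof.df) / (pure.ss / pure.df)
p.lof <- pf(F.lof, lof.df, pure.df, lower.tail = FALSE)
c(F = F.lof, p = p.lof)
```

The F ratio computed this way is the same one that anova(model.reg, lm(rate ~ factor(conc))) reports on its second line.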

This significant lack of fit indicates that the straight-line model is not an adequate description of these data (p < 0.05). A negative S-shaped function is likely to fit the data better (see p. 301).

There is an R package called lmtest on CRAN, which is full of tests for linear models.

10.8 Bootstrap with regression

An alternative to estimating confidence intervals on the regression parameters from the pooled error variance in the ANOVA table (p. 459) is to use bootstrapping. There are two ways of doing this:

• sample cases with replacement, so that some points are left off the graph while others appear more than once in the dataframe;
• calculate the residuals from the fitted regression model, and randomize which fitted y values get which residuals.

In both cases, the randomization is carried out many times, the model fitted and the parameters estimated. The confidence interval is obtained from the quantiles of the distribution of parameter values (see p. 41).

The following dataframe contains a response variable (profit from the cultivation of a crop of carrots for a supermarket) and a single explanatory variable (the cost of inputs, including fertilizers, pesticides, energy and labour):

    regdat <- read.table("c:\\temp\\regdat.txt",header=T)
    attach(regdat)
    names(regdat)
    [1] "explanatory" "response"
    plot(explanatory,response,pch=21,col="green",bg="red")
    model <- lm(response~explanatory)
    abline(model,col="blue")

[Figure: response against explanatory with the fitted regression line.]

The response is a reasonably linear function of the explanatory variable, but the variance in the response is quite large. For instance, when the explanatory variable is about 12, the response variable ranges between less than 20 and more than 24.

    model

    Coefficients:
    (Intercept)  explanatory
          9.630        1.051

Theory suggests that the slope should be 1.0, and our estimated slope is very close to this (1.051). We want to establish a 95% confidence interval on the estimate. Here is a home-made bootstrap which resamples the data points 10 000 times and gives a bootstrapped estimate of the slope:

    b.boot <- numeric(10000)
    for (i in 1:10000) {
      indices <- sample(1:35,replace=T)
      xv <- explanatory[indices]
      yv <- response[indices]
      model <- lm(yv~xv)
      b.boot[i] <- coef(model)[2]
    }
    hist(b.boot,main="",col="green")

[Figure: histogram of b.boot, the bootstrapped slope estimates.]

Here is the 95% interval for the bootstrapped estimate of the slope:

    quantile(b.boot,c(0.025,0.975))
         2.5%     97.5%
    0.8137637 1.1964226

Evidently, the bootstrapped data provide no support for the hypothesis that the slope is significantly greater than 1.0.

We now repeat the exercise, using the boot function from the boot package:

    library(boot)

The first step is to write what is known as the 'statistic' function. This shows boot how to calculate the statistic we want from the resampled data (the slope in this case). The resampling of the data is achieved by a subscript provided by boot (here called index). The point is that every time the model is fitted within the bootstrap it uses a different data set (yv and xv): we need to describe how these data are constructed and how they are to be used in the model fitting:

    reg.boot <- function(regdat, index) {
      xv <- regdat$explanatory[index]
      yv <- regdat$response[index]
      model <- lm(yv~xv)
      coef(model)
    }

Now we can run the boot function, then extract the intervals with the boot.ci function:

    reg.model <- boot(regdat,reg.boot,R=10000)
    boot.ci(reg.model,index=2)

    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    Based on 10000 bootstrap replicates

    CALL :
    boot.ci(boot.out = reg.model, index = 2)

    Intervals :
    Level      Normal              Basic
    95%   ( 0.870,  1.254 )   ( 0.903,  1.287 )

    Level     Percentile            BCa
    95%   ( 0.815,  1.198 )   ( 0.821,  1.202 )
    Calculations and Intervals on Original Scale
    Warning message:
    In boot.ci(reg.model, index = 2) :
      bootstrap variances needed for studentized intervals

All the intervals are reasonably similar: statisticians typically prefer the bias-corrected, accelerated (BCa) intervals. These indicate that if we were to repeat the data-collection exercise we can be 95% confident that the regression slope for those new data would be between 0.821 and 1.202.

The other way of bootstrapping with a model is to randomize the allocation of the residuals to fitted y values estimated from the original regression model.
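Residual resampling of this kind can be sketched in a few lines of base R, in the same home-made style as the case-resampling loop earlier in this section. The data here are simulated (a true slope of 1 with noise), since only the idea is being illustrated; x, y, m, b.res and the seed are assumptions for the example, not the carrot data.

```r
set.seed(123)                          # arbitrary seed
x <- runif(35, 6, 16)                  # hypothetical explanatory variable
y <- 10 + x + rnorm(35, sd = 2)        # hypothetical response, true slope 1

m   <- lm(y ~ x)
fit <- fitted(m)                       # fitted values stay fixed
res <- resid(m)                        # residuals are reshuffled

R <- 2000                              # number of bootstrap replicates
b.res <- numeric(R)
for (i in 1:R) {
  # attach a random draw of the residuals (with replacement) to the fitted values
  y.star   <- fit + sample(res, replace = TRUE)
  b.res[i] <- coef(lm(y.star ~ x))[2]
}

quantile(b.res, c(0.025, 0.975))       # percentile interval for the slope
```

Because the x values are never resampled, this version keeps the design fixed and varies only the errors, which is the essential difference from resampling whole cases.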
We start by calculating the residuals and the fitted values:

    model <- lm(response~explanatory)
    fit <- fitted(model)
    res <- resid(model)

What we intend to do is to randomize which of the res values is added to the fit values to get a reconstructed response variable, y, which we regress as a function of the original explanatory variable. Here is the statistic

function to do this:

    residual.boot <- function(res, index) {
      y <- fit+res[index]
      model <- lm(y~explanatory)
      coef(model)
    }

Note that the data passed to the statistic function are res in this case (rather than the original dataframe regdat as in the first example, above). Now use the boot function and the boot.ci function to obtain the 95% confidence intervals on the slope (this is index=2; the intercept is index=1):

    res.model <- boot(res,residual.boot,R=10000)
    boot.ci(res.model,index=2)

    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    Based on 10000 bootstrap replicates

    CALL :
    boot.ci(boot.out = res.model, index = 2)

    Intervals :
    Level      Normal              Basic
    95%   ( 0.878,  1.224 )   ( 0.884,  1.225 )

    Level     Percentile            BCa
    95%   ( 0.876,  1.218 )   ( 0.872,  1.215 )
    Calculations and Intervals on Original Scale
    Warning message:
    In boot.ci(res.model, index = 2) :
      bootstrap variances needed for studentized intervals

The BCa from randomizing the residuals is from 0.872 to 1.215, while from selecting random x and y points with replacement it was from 0.821 to 1.202 (above). The two rather different approaches to bootstrapping produce reassuringly similar estimates of the same parameter.

10.9 Jackknife with regression

A second alternative to estimating confidence intervals on regression parameters is to jackknife the data. Each point in the data set is left out, one at a time, and the parameter of interest is re-estimated. The regdat dataframe (above) has length(response) data points:

    names(regdat)
    [1] "explanatory" "response"
    length(response)
    [1] 35

We create a vector to contain the 35 different estimates of the slope:

    jack.reg <- numeric(35)

Now carry out the regression 35 times, leaving out a different x, y pair each time:

    for (i in 1:35) {
      model <- lm(response[-i]~explanatory[-i])
      jack.reg[i] <- coef(model)[2]
    }

Here is a histogram of the different estimates of the slope of the regression:

    hist(jack.reg,main="",col="pink")

[Figure: histogram of jack.reg, the 35 leave-one-out slope estimates.]

As you can see, the distribution is strongly skew to the left. The quantiles of jack.reg are not particularly informative because the sample is so small (just 35). However, the jackknife does draw attention to one particularly influential point (the extreme left-hand bar) which, when omitted from the dataframe, causes the estimated slope to fall below 1.0. We say the point is influential because it is the only one of the 35 points whose omission causes the estimated slope to fall below 1.0. But which data point is this? We extract Cook's distance ($infmat[,5]) from the influence matrix from the model (influence.measures(model)$infmat) and ask which data point has the maximum value of this influence measure:

    model <- lm(response~explanatory)
    which(influence.measures(model)$infmat[,5] == max(influence.measures(model)$infmat[,5]))
    22

Now we can draw regression lines for the full data set (blue line) and for the model with the influential point number 22 omitted (red line) to see just how influential (or not) this point really is for the location of the line:

    plot(explanatory,response,pch=21,col="green",bg="red")
    abline(model,col="blue")
    abline(lm(response[-22]~explanatory[-22]),col="red")
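The jackknife loop and the search for the most influential point can be written more compactly with sapply, which.max and cooks.distance. The sketch below uses simulated data rather than regdat, so the point numbers it reports mean nothing beyond the illustration; x, y, jack, biggest.shift and biggest.cook are hypothetical names.

```r
set.seed(7)                              # arbitrary seed
x <- runif(35, 6, 16)                    # hypothetical explanatory variable
y <- 10 + x + rnorm(35, sd = 2)          # hypothetical response
n <- length(y)

# Leave-one-out slopes, one per omitted point
jack <- unname(sapply(1:n, function(i) coef(lm(y[-i] ~ x[-i]))[2]))

m <- lm(y ~ x)

# Point whose omission moves the slope furthest from the full-data estimate
biggest.shift <- which.max(abs(jack - coef(m)[2]))

# Point with the largest Cook's distance
biggest.cook <- unname(which.max(cooks.distance(m)))

c(biggest.shift = biggest.shift, biggest.cook = biggest.cook)
```

The two indices need not coincide: Cook's distance summarizes the change in all the fitted values (intercept and slope together), whereas the jackknife comparison here looks at the slope alone.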

505 [Figure: scatterplot of response against explanatory, with the regression line for the full data set (blue) and for the data with point 22 omitted (red)]

Neither model describes at all well the location of the response for the two lowest values of the explanatory variable (and the fit is worse with the most influential point removed).

10.10 Jackknife after bootstrap

The jack.after.boot function calculates the jackknife influence values from a bootstrap output object, and plots the corresponding jackknife-after-bootstrap plot. We illustrate its use with the boot object calculated earlier called reg.model. We are interested in the slope, which is index=2:

    jack.after.boot(reg.model,index=2)

[Figure: jackknife-after-bootstrap plot; x axis standardized jackknife value, y axis 5, 10, 16, 50, 84, 90, 95%-iles of (T*-t); observation numbers are printed along the bottom]

506 The centred jackknife quantiles for each observation are estimated from those bootstrap samples in which the particular observation did not appear. These are then plotted against the influence values. From the top downwards, the horizontal dotted lines show the 95th, 90th, 84th, 50th, 16th, 10th and 5th percentiles. The numbers at the bottom identify the 35 points by their index values within regdat. Again, the influence of point no. 22 shows up clearly (this time on the right-hand side), indicating that it has a strong positive influence on the slope, and the two left-hand outliers are identified as points nos 34 and 30.

10.11 Serial correlation in the residuals

The Durbin–Watson function is used for testing whether there is autocorrelation in the residuals from a linear model or a generalized linear model, and is implemented as part of the car package (see Fox, 2002):

    library("car")
    durbinWatsonTest(model)
     lag Autocorrelation D-W Statistic p-value
       1     -0.07946739      2.049899   0.874
     Alternative hypothesis: rho != 0

There is no evidence of serial correlation in these residuals (p = 0.874).

The car package also contains functions for drawing ellipses, including data ellipses, and confidence ellipses for linear and generalized linear models. Here is the dataEllipse function for the present example: by default, the ellipses are drawn at 50% and 90%:

    dataEllipse(explanatory,response)

[Figure: data ellipses drawn over the scatterplot of response against explanatory]
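To see what the two numbers in the durbinWatsonTest output measure, the lag-1 residual autocorrelation and the Durbin–Watson statistic can be computed directly from the residuals. This is a sketch in base R on simulated data (x, y and the seed are assumptions for illustration); it does not reproduce the p-value, and the autocorrelation used here is the plain uncentred version, which may differ slightly from the one car reports.

```r
# Hypothetical regression with independent errors
set.seed(2)
x <- 1:50
y <- 3 + 0.5 * x + rnorm(50)
e <- residuals(lm(y ~ x))

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals (uncentred; residuals have mean zero)
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

# DW is close to 2 * (1 - r1); a value near 2 means little serial correlation
c(dw = dw, approx = 2 * (1 - r1))
```

The statistic always lies between 0 and 4. Values well below 2 point to positive serial correlation and values well above 2 to negative serial correlation.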

507 10.12 Piecewise regression

This kind of regression fits different functions over different ranges of the explanatory variable. For example, it might fit different linear regressions to the left- and right-hand halves of a scatterplot. Two important questions arise in piecewise regression:

• how many segments to divide the line into;
• where to position the break points on the x axis.

Suppose we want to do the simplest piecewise regression, using just two linear segments. Where do we break up the x values? A simple, pragmatic view is to divide the x values at the point where the piecewise regression best fits the response variable. Let us take an example using a linear model where the response is the log of a count (the number of species recorded) and the explanatory variable is the log of the size of the area searched for the species:

    data <- read.table("c:\\temp\\sasilwood.txt",header=T)
    attach(data)
    names(data)
    [1] "Species" "Area"

A quick scatterplot suggests that the relationship between log(Species) and log(Area) is not linear:

    plot(log(Species)~log(Area),pch=21,col="red",bg="yellow")

[Figure: scatterplot of log(Species) against log(Area)]

The slope appears to be shallower at small scales than at large. The overall regression highlights this at the model-checking stage:

    model1 <- lm(log(Species)~log(Area))
    par(mfrow=c(2,2))
    plot(model1)

508 [Figure: model-checking plots for model1 in four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

The residuals are very strongly U-shaped (this plot should look like the sky at night) and the errors are profoundly non-normal (the top right-hand line should be straight).

If we are to use piecewise regression, then we need to work out how many straight-line segments to use and where to put the breaks. Visual inspection of the scatterplot suggests that two segments would be an improvement over a single straight line and that the break point should be about log(Area) = 5. The choice of break point is made more objective by choosing a range of values for the break point and selecting the break that produces the minimum deviance. We should have a minimum of two x values for each of the pieces of the regression, so the areas associated with the first and last breaks can be obtained by examination of the table of x values:

    table(Area)
    Area
      0.01    0.1      1     10    100   1000  10000  40000  90000 160000 250000  1e+06
       346    345    259    239     88     67    110     18      7      4      3      1

The leftmost break could be between areas 0.1 and 1, and the rightmost between 160 000 and 250 000 (i.e. between indices 2 and 3 and 10 and 11).

Piecewise regression is extremely simple in R: we just include a logical statement as part of the model formula, with as many logical statements as we want straight-line segments in the fit. In the present example with two linear segments, the two logical statements are Area<Break to define the left-hand regression and Area>=Break to define the right-hand regression.

We want to fit the model for all values of Break between 1 and 250 000, so we create a vector of breaks like this:

    Break <- sort(unique(Area))[3:11]

Now we use a loop to fit the two-segment piecewise model nine times and to store the value of the residual standard error in a vector called d. This quantity is the sixth element of the list that is the model summary object, d[i] <- summary(model)[[6]]:

509

    d <- numeric(9)
    for (i in 1:9) {
      model <- lm(log(Species)~(Area<Break[i])*log(Area)+(Area>=Break[i])*log(Area))
      d[i] <- summary(model)[[6]]
    }

A plot shows where the minimum value of d occurs:

    windows(7,4)
    par(mfrow=c(1,2))
    plot(log(Break),d,type="l",col="red")

Where exactly does the minimum of d occur? We use the which function for this:

    Break[which(d==min(d))]
    [1] 100

The best piecewise regression will fit one line for Area < 100 and a different line for Area >= 100. The model formula looks like this:

    model2 <- lm(log(Species)~log(Area)*(Area<100)+log(Area)*(Area>=100))

The piecewise regression is a massive improvement over the linear model:

    anova(model1,model2)
    Analysis of Variance Table
    Model 1: log(Species) ~ log(Area)
    Model 2: log(Species) ~ log(Area) * (Area < 100) + log(Area) * (Area >= 100)
      Res.Df    RSS Df Sum of Sq      F    Pr(>F)
    1   1485 731.98
    2   1483 631.36  2    100.62 118.17 < 2.2e-16 ***

The summary of the piecewise regression takes some getting used to. We have fitted two linear regressions, so there are four parameters. Like an analysis of covariance, the table of coefficients contains one slope and one intercept, along with one difference between slopes and one difference between intercepts. The table has six rows because of the intentional aliasing, which we contrived by providing zeros for the explanatory variables where the two logical expressions evaluate to FALSE:

    summary(model2)
    Coefficients: (2 not defined because of singularities)
                              Estimate Std. Error t value Pr(>|t|)
    (Intercept)                0.61682    0.13059   4.723 2.54e-06 ***
    log(Area)                  0.41019    0.01655  24.787  < 2e-16 ***
    Area < 100TRUE             1.07854    0.13246   8.143 8.12e-16 ***
    Area >= 100TRUE                 NA         NA      NA       NA
    log(Area):Area < 100TRUE  -0.25611    0.01816 -14.100  < 2e-16 ***
    log(Area):Area >= 100TRUE       NA         NA      NA       NA
    Residual standard error: 0.6525 on 1483 degrees of freedom

510

    Multiple R-squared: 0.724, Adjusted R-squared: 0.7235
    F-statistic: 1297 on 3 and 1483 DF, p-value: < 2.2e-16

The intercept is for the factor level that comes first in the alphabet: this is the right-hand part of the graph (where Area < 100 is FALSE), and the slope (labelled log(Area), i.e. log(Area):Area < 100FALSE) is for this factor level too. The difference between the two intercepts is labelled Area < 100TRUE. The next row is labelled Area >= 100TRUE and contains NAs because there were no x values (they were all zeros because logical FALSE was coerced to numeric zero by the multiplication). The difference between two slopes is labelled log(Area):Area < 100TRUE while the last row labelled log(Area):Area >= 100TRUE contains NAs because there were no x values.

We cannot use abline, because we want two separate lines through different parts of the scatterplot (not two lines across the whole plotting area; try it and see). To make plotting the two lines easier it is a good idea to calculate the two slopes and two intercepts in advance. The parameters are in the fourth element of the list that makes up summary(model2). It is worth looking at this separately:

    summary(model2)[[4]]
                               Estimate Std. Error    t value      Pr(>|t|)
    (Intercept)               0.6168168 0.13058554   4.723469  2.537983e-06
    log(Area)                 0.4101943 0.01654883  24.786903 9.081618e-114
    Area < 100TRUE            1.0785395 0.13245572   8.142642  8.117406e-16
    log(Area):Area < 100TRUE -0.2561147 0.01816373 -14.100333  1.834740e-42

Note that the two rows with NAs have been excluded. Using subscripts, we extract the parameter estimates and calculate two intercepts (a1 and a2) and two slopes (b1 and b2) for the left- and right-hand pieces of the regression:

    a1 <- summary(model2)[[4]][1]+summary(model2)[[4]][3]
    a2 <- summary(model2)[[4]][1]
    b1 <- summary(model2)[[4]][2]+summary(model2)[[4]][4]
    b2 <- summary(model2)[[4]][2]

Finally, we need to decide on the x values between which to draw the two lines. Inspection of the scatterplot indicates that –5 would be a good minimum value and 15 would be a good maximum. The break point (4.6 = log(100)) is the obvious point at which to stop the first line and start the second line.

    plot(log(Area),log(Species),col="blue")
    lines(c(-5,4.6),c(a1+b1*-5,a1+b1*4.6),col="red")
    lines(c(4.6,15),c(a2+b2*4.6,a2+b2*15),col="red")

[Figure: two panels. Left: d against log(Break). Right: log(Species) against log(Area) with the two fitted line segments]
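The whole break-point search can be rehearsed without sasilwood.txt. The sketch below uses simulated data with a known break at x = 5; the data, the candidate breaks 0 to 10 and the helper variable left are all assumptions made for illustration. It stores the logical condition in a named variable so that the coefficients can be picked out by name rather than by position, and it reads the residual standard error as summary(...)$sigma, which is the sixth element of the summary list.

```r
# Hypothetical two-segment data: shallow slope below x = 5, steeper above
set.seed(3)
x <- sort(runif(200, -5, 15))
y <- ifelse(x < 5, 1 + 0.1 * x, -1 + 0.5 * x) + rnorm(200, sd = 0.3)

# Fit the two-segment model at each candidate break and keep the
# residual standard error
candidates <- 0:10
rse <- sapply(candidates, function(b) {
  left <- x < b
  summary(lm(y ~ x * left))$sigma
})
best <- candidates[which.min(rse)]

# Refit at the best break and recover an intercept and slope for each piece
left <- x < best
fit <- lm(y ~ x * left)
cf <- coef(fit)
a2 <- cf[["(Intercept)"]]         # right-hand piece (left is FALSE)
b2 <- cf[["x"]]
a1 <- a2 + cf[["leftTRUE"]]       # left-hand piece adds the two differences
b1 <- b2 + cf[["x:leftTRUE"]]

# Draw each line only over its own range of x
plot(x, y, col = "blue")
lines(c(min(x), best), a1 + b1 * c(min(x), best), col = "red")
lines(c(best, max(x)), a2 + b2 * c(best, max(x)), col = "red")
```

Using a single logical term avoids the aliased NA rows, at the cost of looking slightly different from the two-condition formula used in the text.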

511 Of course, for spatial scales even smaller than studied here, the slope of the plot must go asymptotically to zero, because once the plot is so small that it can contain only one individual, making the plot even smaller is bound to contain that same individual (thus, species richness will be one for all subsequent smaller spatial scales).

10.13 Multiple regression

A multiple regression is a statistical model with two or more continuous explanatory variables. We contrast multiple regression with analysis of variance, where all the explanatory variables are categorical (Chapter 11) and analysis of covariance, where the explanatory variables are a mixture of continuous and categorical (Chapter 12).

Multiple regression models provide some of the most profound challenges faced by the analyst because of some crucial issues:

• over-fitting (we often have more explanatory variables than data points);
• parameter proliferation (we might want to fit parameters for curvature and interaction);
• correlation between explanatory variables (called collinearity);
• choice between contrasting models of roughly equal explanatory power.

The principle of parsimony (Occam’s razor), discussed in Section 9.2, is again relevant here. It requires that the model should be as simple as possible. This means that the model should not contain any redundant parameters. Ideally, we achieve this by fitting a maximal model and then simplifying it by following one or more of these steps:

• Remove non-significant interaction terms.
• Remove non-significant quadratic or other non-linear terms.
• Remove non-significant explanatory variables.
• Amalgamate explanatory variables that have similar parameter values.

Of course, such simplifications must make good scientific sense, and must not lead to significant reductions in explanatory power. It is likely that many of the explanatory variables are correlated with each other, and so the order in which variables are deleted from the model will influence the explanatory power attributed to them.

The thing to remember about multiple regression is that, in principle, there is no end to it. The number of combinations of interaction terms and curvature terms is endless. There are some simple rules (like parsimony) and some automated functions (like step) to help. But, in principle, you could spend a very great deal of time in modelling a single dataframe. There are no hard-and-fast rules about the best way to proceed, but we shall typically carry out simplification of a complex model by stepwise deletion: non-significant terms are left out, and significant terms are added back (see Chapter 9).

At the data inspection stage, there are many more kinds of plots we could do:

• Plot the response against each of the explanatory variables separately.
• Plot the explanatory variables against one another (e.g. pairs; see Section 10.13.1).
• Plot the response against pairs of explanatory variables in three-dimensional plots.

512

• Plot the response against explanatory variables for different combinations of other explanatory variables (e.g. conditioning plots, coplot; see p. 236).
• Fit non-parametric smoothing functions (e.g. using generalized additive models, to look for evidence of curvature).
• Fit tree models to investigate whether interaction effects are simple or complex.

10.13.1 The multiple regression model

There are several important issues involved in carrying out a multiple regression:

• which explanatory variables to include;
• curvature in the response to the explanatory variables;
• interactions between explanatory variables;
• correlation between explanatory variables;
• the risk of overparameterization.

The assumptions about the response variable are the same as with simple linear regression: the errors are normally distributed, the errors are confined to the response variable, and the variance is constant. The explanatory variables are assumed to be measured without error. The model for a multiple regression with two explanatory variables ($x_1$ and $x_2$) looks like this:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i .$$

The $i$th data point, $y_i$, is determined by the levels of the two continuous explanatory variables $x_{1i}$ and $x_{2i}$, by the model’s three parameters (the intercept $\beta_0$ and the two slopes $\beta_1$ and $\beta_2$), and by the residual $\varepsilon_i$ of point $i$ from the fitted surface. For each of the $i$ rows of the dataframe, there are $k + 1$ parameters, $\beta_j$, so that

$$y_i = \sum_{j=0}^{k} \beta_j x_{ji} + \varepsilon_i ,$$

where $x_{0i} = 1$.

Let us begin with an example from air pollution studies. How is ozone concentration related to wind speed, air temperature and the intensity of solar radiation?

    ozone.pollution <- read.table("c:\\temp\\ozone.data.txt",header=T)
    attach(ozone.pollution)
    names(ozone.pollution)
    [1] "rad" "temp" "wind" "ozone"
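Before the ozone data are explored, it may help to see the model equation above in matrix form. The sketch below is an illustration on simulated data (x1, x2, the true coefficients and the seed are assumptions): the design matrix that lm builds has a first column of ones, which plays the part of the constant explanatory variable multiplying the intercept, and the fitted values are that matrix multiplied by the vector of estimated parameters.

```r
# Hypothetical data with two continuous explanatory variables
set.seed(4)
n <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Design matrix: a column of 1s for the intercept, then x1 and x2
X <- model.matrix(fit)
head(X, 3)

# Fitted values are the sum over j of beta_j * x_ji for each row i
beta <- coef(fit)
all.equal(unname(drop(X %*% beta)), unname(fitted(fit)))
```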

513 In multiple regression, it is always a good idea to use pairs to look at all the correlations:

    pairs(ozone.pollution,panel=panel.smooth)

[Figure: pairs plot of rad, temp, wind and ozone, with a smoother in each panel]

The response variable, ozone concentration, is shown on the y axis of the bottom row of panels: there is a strong negative relationship with wind speed, a positive correlation with temperature and a rather unclear, humped relationship with radiation.

A good way to tackle a multiple regression problem is using non-parametric smoothers in a generalized additive model like this:

    library(mgcv)
    par(mfrow=c(2,2))
    model <- gam(ozone~s(rad)+s(temp)+s(wind))
    plot(model)

514 [Figure: fitted smooth functions from the generalized additive model in three panels, against rad, temp and wind, each with confidence bands]

The confidence intervals are sufficiently narrow to suggest that the curvature in the relationships between ozone and temperature and ozone and wind are real, but the curvature of the relationship with solar radiation is marginal. The plots lead us to anticipate that quadratic terms for temperature and wind should be included in our initial model. What about interactions? This is where tree models can help:

    library(tree)
    model <- tree(ozone~.,data=ozone.pollution)
    par(mfrow=c(1,1))
    plot(model)
    text(model)

[Figure: regression tree for ozone; the first split is temp < 82.5, with further splits on wind, rad and temp]

516

    rad          2.628e-02  2.142e-01   0.123   0.9026
    temp        -1.021e+01  4.209e+00  -2.427   0.0170 *
    wind        -2.802e+01  9.645e+00  -2.906   0.0045 **
    t2           5.953e-02  2.382e-02   2.499   0.0141 *
    w2           6.173e-01  1.461e-01   4.225 5.25e-05 ***
    r2          -3.388e-04  2.541e-04  -1.333   0.1855
    wr          -1.127e-02  6.277e-03  -1.795   0.0756 .
    tr           3.750e-03  2.459e-03   1.525   0.1303
    tw           1.734e-01  9.497e-02   1.825   0.0709 .

The least significant term is the quadratic term for radiation, so we remove that:

    model3 <- update(model2,~.-r2)
    summary(model3)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 486.346603 194.333075   2.503  0.01392 *
    rad          -0.043163   0.208535  -0.207  0.83644
    temp         -9.446780   4.185240  -2.257  0.02613 *
    wind        -26.471461   9.610816  -2.754  0.00697 **
    t2            0.056966   0.023835   2.390  0.01868 *
    w2            0.599709   0.146069   4.106 8.14e-05 ***
    wr           -0.011359   0.006300  -1.803  0.07435 .
    tr            0.003160   0.002428   1.302  0.19600
    tw            0.157637   0.094595   1.666  0.09869 .

The temperature by radiation interaction is not significant, so it goes next:

    model4 <- update(model3,~.-tr)
    summary(model4)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 514.401470 193.783580   2.655  0.00920 **
    rad           0.212945   0.069283   3.074  0.00271 **
    temp        -10.654041   4.094889  -2.602  0.01064 *
    wind        -27.391965   9.616998  -2.848  0.00531 **
    t2            0.067805   0.022408   3.026  0.00313 **
    w2            0.619396   0.145773   4.249 4.72e-05 ***
    wr           -0.013561   0.006089  -2.227  0.02813 *
    tw            0.169674   0.094458   1.796  0.07538 .

The temperature by wind interaction is the next to go (it is marginally significant but we are ruthless):

    model5 <- update(model4,~.-tw)
    summary(model5)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 223.573855 107.618223   2.077 0.040221 *
    rad           0.173431   0.066398   2.612 0.010333 *

517

    temp         -5.197139   2.775039  -1.873 0.063902 .
    wind        -10.816032   2.736757  -3.952 0.000141 ***
    t2            0.043640   0.018112   2.410 0.017731 *
    w2            0.430059   0.101767   4.226 5.12e-05 ***
    wr           -0.009819   0.005783  -1.698 0.092507 .

There is no place for the wind by radiation interaction:

    model6 <- update(model5,~.-wr)
    summary(model6)
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 291.16758  100.87723   2.886  0.00473 **
    rad           0.06586    0.02005   3.285  0.00139 **
    temp         -6.33955    2.71627  -2.334  0.02150 *
    wind        -13.39674    2.29623  -5.834 6.05e-08 ***
    t2            0.05102    0.01774   2.876  0.00488 **
    w2            0.46464    0.10060   4.619 1.10e-05 ***

The next job is to subject model6 to criticism:

    par(mfrow=c(2,2))
    plot(model6)

[Figure: model-checking plots for model6 in four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage; point 77 stands out in each panel]

This is quite seriously badly behaved. The residuals increase with the fitted values (non-constant variance) and the errors are not normal. Let us try transforming the response variable. Having done this we need to

518 start the modelling from scratch with all of the original explanatory variables included. Having transformed the response variable, we should expect that the curvature has been altered:

    model7 <- lm(log(ozone) ~ rad+temp+wind+t2+w2+r2+wr+tr+tw+wtr)
    summary(model7)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.803e+00  5.676e+00   0.494   0.6225
    rad          2.771e-02  1.529e-02   1.812   0.0729 .
    temp        -3.018e-02  1.178e-01  -0.256   0.7983
    wind        -9.812e-02  3.211e-01  -0.306   0.7605
    t2           6.034e-04  6.559e-04   0.920   0.3598
    w2           8.732e-03  4.021e-03   2.172   0.0322 *
    r2          -1.489e-05  7.043e-06  -2.114   0.0370 *
    wr          -2.001e-03  1.339e-03  -1.494   0.1382
    tr          -2.507e-04  2.056e-04  -1.219   0.2256
    tw          -1.985e-03  3.742e-03  -0.530   0.5971
    wtr          2.535e-05  1.805e-05   1.404   0.1634

    model8 <- update(model7,~.-wtr)
    summary(model8)
    model9 <- update(model8,~.-tr)
    summary(model9)
    model10 <- update(model9,~.-tw)
    summary(model10)
    model11 <- update(model10,~.-t2)
    summary(model11)
    model12 <- update(model11,~.-wr)
    summary(model12)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  7.724e-01  6.350e-01   1.216 0.226543
    rad          7.466e-03  2.323e-03   3.215 0.001736 **
    temp         4.193e-02  6.237e-03   6.723 9.52e-10 ***
    wind        -2.211e-01  5.874e-02  -3.765 0.000275 ***
    w2           7.390e-03  2.585e-03   2.859 0.005126 **
    r2          -1.470e-05  6.734e-06  -2.183 0.031246 *
    Residual standard error: 0.4851 on 105 degrees of freedom
    Multiple R-squared: 0.7004, Adjusted R-squared: 0.6861
    F-statistic: 49.1 on 5 and 105 DF, p-value: < 2.2e-16

    plot(model12)

This is the minimum adequate model. It has five consequential parameters (the intercept of a multiple regression model is usually meaningless; it is the value of the response when every one of the explanatory variables is zero). As predicted by our initial plots, none of the interactions survived the model simplification.
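Each deletion step above has the same shape: update the model to drop one term, then check whether the simpler model is significantly worse. The sketch below shows one such step on R's built-in airquality data frame, which records the same four kinds of measurement (Ozone, Solar.R, Temp, Wind); whether it matches ozone.data.txt is an assumption that is not checked here, and the squared terms are written with I() rather than as the precomputed variables w2 and r2 that the text uses.

```r
# Built-in data, assumed (not verified) to be comparable to ozone.data.txt
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Temp", "Wind")])

# Same terms as model12, with the quadratic terms written inline
full <- lm(log(Ozone) ~ Solar.R + Temp + Wind + I(Wind^2) + I(Solar.R^2),
           data = aq)

# Delete one term, then compare the two nested models with an F test
reduced <- update(full, ~ . - I(Solar.R^2))
anova(reduced, full)

# For a single-term deletion the F statistic is the square of the
# t value shown for that term in the summary table of the larger model
t_val <- summary(full)$coefficients["I(Solar.R^2)", "t value"]
c(F = anova(reduced, full)$F[2], t_squared = t_val^2)
```

This is why reading p values off the summary table and running a deletion test with anova give the same answer when one parameter is removed at a time.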

519 The curvature on the scale of log(ozone) is different, of course (we plotted ozone against the explanatory variables, not log(ozone)). Log transformation of the response improved both the non-constancy of variance and the non-normality of errors. The model explains just over 70% of the variation in log(ozone concentration).

10.13.2 Common problems arising in multiple regression

The following are some of the problems and difficulties that crop up when we do multiple regression:

• differences in the measurement scales of the explanatory variables, leading to large variation in the sums of squares and hence to an ill-conditioned matrix;
• multicollinearity, in which there is a near-linear relation between two of the explanatory variables, leading to unstable parameter estimates;
• parameter proliferation where quadratic and interaction terms soak up more degrees of freedom than our data can afford;
• rounding errors during the fitting procedure;
• non-independence of groups of measurements;
• temporal or spatial correlation amongst the explanatory variables;
• pseudoreplication.

Wetherill et al. (1986) give a detailed discussion of these problems. We shall encounter other examples of multiple regressions in the context of generalized linear models (Chapter 13), generalized additive models (Chapter 18), survival models (Chapter 27) and mixed-effects models (Chapter 19).
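The instability caused by multicollinearity is easy to demonstrate by simulation. In the sketch below (all variable names, coefficients and the seed are assumptions for illustration) the same response model is fitted twice: once with a second explanatory variable that is independent of x1, and once with one that is almost a copy of x1. The standard error of the x1 slope is then compared.

```r
set.seed(6)
n <- 100
x1 <- rnorm(n)
x2_indep <- rnorm(n)                    # unrelated to x1
x2_coll <- x1 + rnorm(n, sd = 0.05)     # near-linear function of x1

y_indep <- 1 + 2 * x1 + 3 * x2_indep + rnorm(n)
y_coll  <- 1 + 2 * x1 + 3 * x2_coll  + rnorm(n)

# Helper (hypothetical name): standard error of the x1 slope
se_x1 <- function(fit) summary(fit)$coefficients["x1", "Std. Error"]

se_indep <- se_x1(lm(y_indep ~ x1 + x2_indep))
se_coll  <- se_x1(lm(y_coll ~ x1 + x2_coll))

c(independent = se_indep, collinear = se_coll,
  correlation = cor(x1, x2_coll))
```

With the near-collinear pair, the data contain very little information about the separate effect of x1, so its standard error is many times larger even though the error variance is the same in both cases.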

520 11 Analysis of Variance

Instead of fitting continuous, measured variables to data (as in regression), many experiments involve exposing experimental material to a range of discrete levels of one or more categorical variables known as factors. Thus, a factor might be drug treatment for a particular cancer, with five levels corresponding to a placebo plus four new pharmaceuticals. Alternatively, a factor might be mineral fertilizer, where the four levels represent four different mixtures of nitrogen, phosphorus and potassium. Factors are often used in experimental designs to represent statistical blocks; these are internally homogeneous units in which each of the experimental treatments is repeated. Blocks may be different fields in an agricultural trial, different genotypes in a plant physiology experiment, or different growth chambers in a study of insect photoperiodism.

It is important to understand that regression and analysis of variance (ANOVA) are identical approaches except for the nature of the explanatory variables. For example, it is a small step from having three levels of a shade factor (say light, medium and heavy shade cloths) then carrying out a one-way ANOVA, to measuring the light intensity in the three treatments and carrying out a regression with light intensity as the explanatory variable. As we shall see later on, some experiments combine regression and ANOVA by fitting a series of regression lines, one in each of several levels of a given factor (this is called analysis of covariance; see Chapter 12).

The emphasis in ANOVA was traditionally on hypothesis testing. Nowadays, the aim of an analysis of variance in R is to estimate means and standard errors of differences between means. Comparing two means by a t test involved calculating the difference between the two means, dividing by the standard error of the difference, and then comparing the resulting statistic with the value of Student’s t from tables (or better still, using qt to calculate the critical value; see p. 287). The means are said to be significantly different when the calculated value of t is larger than the critical value. For large samples (n > 30) a useful rule of thumb is that a t value greater than 2 is significant. In ANOVA, we are concerned with cases where we want to compare three or more means. For the two-sample case, the t test and the ANOVA are identical, and the t test is to be preferred because it is simpler.

11.1 One-way ANOVA

There is a real paradox about analysis of variance, which often stands in the way of a clear understanding of exactly what is going on. The idea of ANOVA is to compare several means, but it does this by comparing variances. How can that work?

The R Book, Second Edition. Michael J. Crawley. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.

521 A visual example should make this clear. To keep things simple, suppose we have just two levels of a single factor. We plot the data in the order in which they were measured: first for the first level of the factor and then for the second level. Draw the overall mean as a horizontal line through the data, and indicate the departures of each data point from the overall mean with a set of vertical lines:

[Figure: response plotted against Index, with a vertical line from each point to the overall mean; panel title SST]

The green lines illustrate the total variation in the response. We shall call this quantity SST (the ‘total sum of squares’). It is the sum of the squares of the differences between the data, $y$, and the overall mean. In symbols,

$$SST = \sum (y - \bar{\bar{y}})^2 ,$$

where $\bar{\bar{y}}$ (‘y double bar’) is the overall mean. Next we can fit each of the separate means, $\bar{y}_A$ (through the red points) and $\bar{y}_B$ (through the blue points), and consider the sum of squares of the differences between each $y$ value and its own treatment mean (either the red line or the blue line). We call this SSE (the ‘error sum of squares’), and calculate it like this:

$$SSE = \sum (y_A - \bar{y}_A)^2 + \sum (y_B - \bar{y}_B)^2 .$$

522 On the graph, the differences from which SSE is calculated look like this:

[Figure: response plotted against Index, with a vertical line from each point to its own treatment mean; panel title SSE]

SSE is the sum of the squares of the green lines (the ‘residuals’, as they are known).

Now ask yourself this question. If the treatment means are different from the overall mean, what will be the relationship between SST and SSE? After a moment’s thought you should have been able to convince yourself that if the means are the same, then SSE is the same as SST, because the two horizontal lines in the last plot would be in the same position as the single line in the earlier plot.

Now what if the means were significantly different from one another? What would be the relationship between SSE and SST in this case? Which would be the larger? Again, it should not take long for you to see that if the means are different, then SSE will be less than SST. Indeed, in the limit, SSE could be zero if the replicates from each treatment fell exactly on their respective means, like this:

[Figure: three panels of response against Index. Top left: SST = big. Top right: SSE = 0. Bottom: SST = SSE]

523 In the top row, there is a highly significant difference between the two means: SST is big but SSE is zero (all the replicates are identical). In the bottom row, the means are identical. SST is still big, but now SSE = SST. Once you have understood these three plots, you will see why you can investigate differences between means by looking at variances. This is how analysis of variance works.

We can calculate the difference between SST and SSE, and use this as a measure of the difference between the treatment means; this is traditionally called the treatment sum of squares, and is denoted by SSA:

$$SSA = SST - SSE .$$

When differences between means are significant, then SSA will be large relative to SSE. When differences between means are not significant, then SSA will be small relative to SSE. In the limit, SSE could be zero (top right in the last figure), so all of the variation in $y$ is explained by differences between the means (SSA = SST). At the other extreme, when there is no difference between the means (bottom left), SSA = 0 and so SSE = SST.

The technique we are interested in, however, is analysis of variance, not analysis of sums of squares. We convert the sums of squares into variances by dividing by their degrees of freedom. In our example, there are two levels of the factor and so there is 2 − 1 = 1 degree of freedom for SSA. In general, we might have $k$ levels of any factor and hence $k - 1$ d.f. for treatments. If each factor level were replicated $n$ times, then there would be $n - 1$ d.f. for error within each level (we lose one degree of freedom for each individual treatment mean estimated from the data). Since there are $k$ levels, there would be $k(n - 1)$ d.f. for error in the whole experiment. The total number of numbers in the whole experiment is $kn$, so total d.f. is $kn - 1$ (the single degree is lost for our estimating the overall mean, $\bar{\bar{y}}$). As a check in more complicated designs, it is useful to make sure that the individual component degrees of freedom add up to the correct total:

$$kn - 1 = k - 1 + k(n - 1) = k - 1 + kn - k .$$

The divisions for turning the sums of squares into variances are conveniently carried out in an ANOVA table:

| Source | SS | d.f. | MS | F | Critical F |
|---|---|---|---|---|---|
| Treatment | SSA | $k - 1$ | $MSA = \dfrac{SSA}{k - 1}$ | $F = \dfrac{MSA}{s^2}$ | qf(0.95, k-1, k(n-1)) |
| Error | SSE | $k(n - 1)$ | $s^2 = \dfrac{SSE}{k(n - 1)}$ | | |
| Total | SST | $kn - 1$ | | | |

Each element in the sums of squares column is divided by the number in the adjacent degrees of freedom column to give the variances in the mean square column (headed MS). The significance of the difference between the means is then assessed using an F test (a variance ratio test). The treatment variance MSA is divided by the error variance, $s^2$, and the value of this test statistic is compared with the critical value of F using qf (the quantiles of the F distribution, with p = 0.95, $k - 1$ degrees of freedom in the numerator, and $k(n - 1)$ degrees of freedom in the denominator). If you need to look up the critical value of F in tables, remember that you look up the numerator degrees of freedom (on top of the division) across the top of the table, and the denominator degrees of freedom down the rows. The null hypothesis, traditionally denoted as $H_0$, is stated as

$H_0$: nothing’s happening.

This does not imply that the sample means are exactly the same (the means will always differ from one another, simply because everything varies). In fact, the null hypothesis assumes that the means are not significantly different from one another. What this implies is that the differences between the sample means could have arisen by chance alone, through random sampling effects, despite the fact that the different factor levels have identical means. If the test statistic is larger than the critical value we reject the null hypothesis and accept the alternative:

$H_1$: at least one of the means is significantly different from the others.

If the test statistic is less than the critical value, then it could have arisen due to chance alone, and so we accept the null hypothesis.

Another way of visualizing the process of ANOVA is to think of the relative amounts of sampling variation between replicates receiving the same treatment (i.e. between individual samples in the same level), and between different treatments (i.e. between-level variation). When the variation between replicates within a treatment is large compared to the variation between treatments, we are likely to conclude that the difference between the treatment means is not significant. Only if the variation between replicates within treatments is relatively small compared to the differences between treatments will we be justified in concluding that the treatment means are significantly different.

11.1.1 Calculations in one-way ANOVA

The definitions of the various sums of squares can now be formalized, and ways found of calculating their values from samples. The total sum of squares, SST, is defined as:

$$SST = \sum y^2 - \frac{\left(\sum y\right)^2}{kn},$$

just as in regression (see Chapter 10). Note that we divide by the total number of numbers we added together to get $\sum y$ (the grand total of all the ys), which is $kn$.

It turns out that the formula that we used to define SSE is rather difficult to calculate (see above), so we calculate the treatment sums of squares, SSA, and obtain SSE by difference. The treatment sum of squares, SSA, is calculated as:

$$SSA = \frac{\sum C^2}{n} - \frac{\left(\sum y\right)^2}{kn},$$

where the new term is $C$, the treatment total. This is the sum of all the $n$ replicates within a given level. Each of the $k$ different treatment totals is squared, added up, and then divided by $n$ (the number of numbers added together to get the treatment total). The formula is slightly different if there is unequal replication in different treatments, as we shall see below. The meaning of $C$ will become clear when we work through the example later on. Notice the symmetry of the equation. The second term on the right-hand side is also divided by the number of numbers that were added together ($kn$) to get the total ($\sum y$) which is squared in the numerator. Finally,

$$SSE = SST - SSA,$$

to give all the elements required for completion of the ANOVA table.
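The shortcut formulas can be checked numerically. The following is a minimal sketch, assuming the crop-yield data from the worked example in Section 11.1.3 are typed in by hand; the object names (y, totals, correction) are illustrative choices, not part of the original text.

```r
# Yields from the worked example in Section 11.1.3 (k = 3 soils, n = 10 fields each)
sand <- c(6, 10, 8, 6, 14, 17, 9, 11, 7, 11)
clay <- c(17, 15, 3, 11, 14, 12, 12, 8, 10, 13)
loam <- c(13, 16, 9, 12, 15, 16, 17, 13, 18, 14)

y <- c(sand, clay, loam)   # all kn = 30 responses
k <- 3                     # number of treatments
n <- 10                    # replicates per treatment

# Correction term shared by both formulas: (sum of y)^2 / kn
correction <- sum(y)^2 / (k * n)

# Total sum of squares
SST <- sum(y^2) - correction

# Treatment totals (the C of the formula), one per soil type
totals <- c(sum(sand), sum(clay), sum(loam))

# Treatment sum of squares, then the error sum of squares by difference
SSA <- sum(totals^2) / n - correction
SSE <- SST - SSA

round(c(SST = SST, SSA = SSA, SSE = SSE), 1)
```

These agree with the values obtained longhand in the worked example (414.7, 99.2 and 315.5).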

11.1.2 Assumptions of ANOVA

You should be aware of the assumptions underlying the analysis of variance. They are all important, but some are more important than others:

- random sampling;
- equal variances;
- independence of errors;
- normal distribution of errors;
- additivity of treatment effects.

11.1.3 A worked example of one-way ANOVA

To draw this background material together, we shall work through an example by hand. In so doing, it will become clear what R is doing during its analysis of the data. We have an experiment in which crop yields per unit area were measured from 10 randomly selected fields on each of three soil types. All fields were sown with the same variety of seed and provided with the same fertilizer and pest control inputs. The question is whether soil type significantly affects crop yield, and if so, to what extent.

    results <- read.table("c:\\temp\\yields.txt",header=T)
    attach(results)
    names(results)
    [1] "sand" "clay" "loam"

Here are the data:

    results
       sand clay loam
    1     6   17   13
    2    10   15   16
    3     8    3    9
    4     6   11   12
    5    14   14   15
    6    17   12   16
    7     9   12   17
    8    11    8   13
    9     7   10   18
    10   11   13   14

The function sapply is used to calculate the mean yields for the three soils (contrast this with tapply, below, where the response and explanatory variables are in adjacent columns in a dataframe):

    sapply(list(sand,clay,loam),mean)
    [1] 9.9 11.5 14.3

Mean yield was highest on loam (14.3) and lowest on sand (9.9).
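If the file yields.txt is not to hand, the same dataframe can be entered directly from the printed values. This is only a sketch under that assumption; it also applies sapply to the dataframe itself (a dataframe is a list of columns), which avoids the need for attach. The name soil_means is illustrative.

```r
# Re-create the yields dataframe from the ten printed rows (no file needed)
results <- data.frame(
  sand = c(6, 10, 8, 6, 14, 17, 9, 11, 7, 11),
  clay = c(17, 15, 3, 11, 14, 12, 12, 8, 10, 13),
  loam = c(13, 16, 9, 12, 15, 16, 17, 13, 18, 14)
)

# A dataframe is a list of columns, so sapply applies mean to each soil type
soil_means <- sapply(results, mean)
soil_means
```

Because the result is a named vector, each mean stays labelled with its soil type.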

It will be useful to have all of the yield data in a single vector. To create a dataframe from a spreadsheet like results, where the values of the response are in multiple columns, we use the function called stack like this:

    (frame <- stack(results))
       values  ind
    1       6 sand
    2      10 sand
    3       8 sand
    4       6 sand
    ...
    27     17 loam
    28     13 loam
    29     18 loam
    30     14 loam

You can see that the stack function has invented names for the response variable (values) and the explanatory variable (ind). We will always want to change these:

    names(frame) <- c("yield","soil")
    attach(frame)
    head(frame)
      yield soil
    1     6 sand
    2    10 sand
    3     8 sand
    4     6 sand
    5    14 sand
    6    17 sand

That's more like it. Before carrying out analysis of variance, we should check for constancy of variance (see p. 354) across the three soil types:

    tapply(yield,soil,var)
         clay      loam      sand
    15.388889  7.122222 12.544444

The variances differ by more than a factor of 2. But is this significant? We test for heteroscedasticity using the Fligner–Killeen test of homogeneity of variances:

    fligner.test(yield~soil)
    Fligner-Killeen test of homogeneity of variances
    data: yield by soil
    Fligner-Killeen:med chi-squared = 0.3651, df = 2, p-value = 0.8332

We could have used bartlett.test(yield~soil), which gives p = 0.5283 (but this is more a test of non-normality than of equality of variances). Either way, there is no evidence of any significant difference in variance across the three samples, so it is legitimate to continue with our one-way analysis of variance.
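The reshaping and the variance check can also be run without reading a file or attaching anything. The sketch below assumes the data are entered by hand from the printed table and passes the dataframe through the data argument of fligner.test; the names v and fk are illustrative. The variances are looked up by name because the order of the factor levels produced by stack may not be alphabetical in every version of R.

```r
results <- data.frame(
  sand = c(6, 10, 8, 6, 14, 17, 9, 11, 7, 11),
  clay = c(17, 15, 3, 11, 14, 12, 12, 8, 10, 13),
  loam = c(13, 16, 9, 12, 15, 16, 17, 13, 18, 14)
)

# stack() turns the three columns into one response column plus a factor
frame <- stack(results)
names(frame) <- c("yield", "soil")

# Variance of yield within each soil type (a named array, one entry per level)
v <- tapply(frame$yield, frame$soil, var)
v[c("clay", "loam", "sand")]

# Ratio of the largest to the smallest variance
max(v) / min(v)

# Fligner-Killeen test, supplying the dataframe instead of attaching it
fk <- fligner.test(yield ~ soil, data = frame)
fk$p.value
```

A large p value here is read in the same way as in the text: no evidence that the variances differ.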

Because the explanatory variable is categorical (three levels of soil type), initial data inspection involves a box-and-whisker plot of yield against soil like this:

    plot(yield~soil,col="green")

[Figure: box-and-whisker plot of yield (y axis) against soil (x axis) for the three soil types]

Median yield is lowest on sand and highest on loam, but there is considerable variation from replicate to replicate within each soil type (there is even a low outlier on clay). It looks as if yield on loam will turn out to be significantly higher than on sand (their boxes do not overlap) but it is not clear whether yield on clay is significantly greater than on sand or significantly lower than on loam. The analysis of variance will answer these questions.

The analysis of variance involves calculating the total variation in the response variable (yield in this case) and partitioning it ('analysing it') into informative components. In the simplest case, we partition the total variation into just two components, explained variation and unexplained variation:

$$SSY = SSA + SSE.$$

Here SSY is the total sum of squares (written SST in Section 11.1.1). Explained variation is called the treatment sum of squares (SSA) and unexplained variation is called the error sum of squares (SSE, also known as the residual sum of squares), as defined earlier. Let us work through the numbers in R. From the formula for SSY, we can obtain the total sum of squares by finding the differences between the data and the overall mean:

    sum((yield-mean(yield))^2)
    [1] 414.7

The unexplained variation, SSE, is calculated from the differences between the yields and the mean yields for that soil type:

    sand-mean(sand)
    [1] -3.9 0.1 -1.9 -3.9 4.1 7.1 -0.9 1.1 -2.9 1.1
    clay-mean(clay)
    [1] 5.5 3.5 -8.5 -0.5 2.5 0.5 0.5 -3.5 -1.5 1.5
    loam-mean(loam)
    [1] -1.3 1.7 -5.3 -2.3 0.7 1.7 2.7 -1.3 3.7 -0.3

We need the sums of the squares of these differences:

    sum((sand-mean(sand))^2)
    [1] 112.9
    sum((clay-mean(clay))^2)
    [1] 138.5
    sum((loam-mean(loam))^2)
    [1] 64.1

To get the sum of these totals across all soil types, we can use sapply like this:

    sum(sapply(list(sand,clay,loam),function (x) sum((x-mean(x))^2) ))
    [1] 315.5

So SSE, the unexplained (or residual, or error) sum of squares, is 315.5. The extent to which SSE is less than SSY is a reflection of the magnitude of the differences between the means. The greater the difference between the mean yields on the different soil types, the greater will be the difference between SSE and SSY.

The treatment sum of squares, SSA, is the amount of the variation in yield that is explained by differences between the treatment means. In our example,

$$SSA = SSY - SSE = 414.7 - 315.5 = 99.2.$$

Now we can draw up the ANOVA table. There are six columns indicating, from left to right, the source of variation, the sum of squares attributable to that source, the degrees of freedom for that source, the variance for that source (traditionally called the mean square rather than the variance), the F ratio (testing the null hypothesis that this source of variation is not significantly different from zero) and the p value associated with that F value (if p < 0.05 then we reject the null hypothesis). We can fill in the sums of squares just calculated, then think about the degrees of freedom:

| Source | Sum of squares | Degrees of freedom | Mean square | F ratio | p value |
|---|---|---|---|---|---|
| Soil type | 99.2 | 2 | 49.6 | 4.24 | 0.025 |
| Error | 315.5 | 27 | $s^2 = 11.685$ | | |
| Total | 414.7 | 29 | | | |
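As a cross-check on the subtraction, SSA can also be computed directly from the treatment means, because with equal replication $SSA = n \sum_i (\bar{y}_i - \bar{\bar{y}})^2$. The sketch below assumes the yields are typed in by hand and uses ave to attach each observation's own soil mean; the object names are illustrative.

```r
sand <- c(6, 10, 8, 6, 14, 17, 9, 11, 7, 11)
clay <- c(17, 15, 3, 11, 14, 12, 12, 8, 10, 13)
loam <- c(13, 16, 9, 12, 15, 16, 17, 13, 18, 14)

yield <- c(sand, clay, loam)
soil  <- factor(rep(c("sand", "clay", "loam"), each = 10))
n     <- 10                                  # replicates per soil type

grand_mean  <- mean(yield)
group_means <- tapply(yield, soil, mean)     # one mean per soil type

# Total variation: deviations from the overall mean
SSY <- sum((yield - grand_mean)^2)

# Unexplained variation: deviations from each observation's own soil mean
SSE <- sum((yield - ave(yield, soil))^2)

# Explained variation, computed directly from the treatment means
SSA <- n * sum((group_means - grand_mean)^2)

round(c(SSY = SSY, SSA = SSA, SSE = SSE), 1)
```

The three quantities satisfy SSY = SSA + SSE, so the direct calculation and the calculation by difference give the same treatment sum of squares.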

There are 30 data points in all, so the total degrees of freedom are 30 – 1 = 29. We lose 1 d.f. because, in calculating SSY, we had to estimate one parameter from the data in advance, namely the overall mean, $\bar{\bar{y}}$, before we could calculate $SST = \sum (y - \bar{\bar{y}})^2$. Each soil type has n = 10 replications, so each soil type has 10 – 1 = 9 d.f. for error, because we estimated one parameter from the data for each soil type, namely the treatment means $\bar{y}_i$, in calculating SSE. Overall, therefore, the error has 3 × 9 = 27 d.f. There were three soil types, so there are 3 – 1 = 2 d.f. for soil type.

The mean squares are obtained simply by dividing each sum of squares by its respective degrees of freedom (in the same row). The error variance, $s^2$, is the residual mean square (the mean square for the unexplained variation); this is sometimes called the 'pooled error variance' because it is calculated across all the treatments. The alternative would be to have three separate variances, one for each treatment:

    tapply(yield,soil,var)
         clay      loam      sand
    15.388889  7.122222 12.544444
    mean(tapply(yield,soil,var))
    [1] 11.68519

You will see that the pooled error variance $s^2 = 11.685$ is simply the mean of the three separate variances, because (in this case) there is equal replication in each soil type (n = 10).

By tradition, we do not calculate the total mean square, so the bottom cell of the fourth column of the ANOVA table is empty. The F ratio is the treatment variance divided by the error variance, testing the null hypothesis that the treatment means are not significantly different. If we reject this null hypothesis, we accept the alternative hypothesis that at least one of the means is significantly different from the others. The question naturally arises at this point as to whether 4.24 is a big number or not. If it is a big number then we reject the null hypothesis. If it is not a big number, then we accept the null hypothesis.
As ever, we decide whether the test statistic F = 4.24 is big or small by comparing it with the critical value of F, given that there are 2 d.f. in the numerator and 27 d.f. in the denominator. Critical values in R are found from the function qf, which gives us quantiles of the F distribution:

    qf(.95,2,27)
    [1] 3.354131

Our calculated test statistic of 4.24 is larger than the critical value of 3.35, so we reject the null hypothesis. At least one of the soils has a mean yield that is significantly different from the others. The modern approach is not to work slavishly at the 5% level but rather to calculate the p value associated with our test statistic of 4.24. Instead of using the function for quantiles of the F distribution, we use the function pf for cumulative probabilities of the F distribution like this:

    1-pf(4.24,2,27)
    [1] 0.02503987

The p value is 0.025, which means that a value of F = 4.24 or bigger would arise by chance alone when the null hypothesis was true about 25 times in 1000. This is a sufficiently small probability (i.e. it is less than 5%) for us to conclude that there is a significant difference between the mean yields (i.e. we reject the null hypothesis).
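The whole test can be assembled from the sums of squares and degrees of freedom in the ANOVA table. This is a sketch that starts from the rounded values quoted in the text; it uses pf with lower.tail = FALSE, which is an equivalent way of writing 1 - pf(...), and the unrounded F ratio rather than 4.24, so the p value differs slightly in the fourth decimal place from the one above. The object names are illustrative.

```r
SSA <- 99.2;  df_treat <- 2    # soil type
SSE <- 315.5; df_error <- 27   # error

MSA <- SSA / df_treat          # treatment mean square
s2  <- SSE / df_error          # pooled error variance
F_ratio <- MSA / s2            # variance ratio

# Critical value at the 5% level and the p value for the observed ratio
F_crit  <- qf(0.95, df_treat, df_error)
p_value <- pf(F_ratio, df_treat, df_error, lower.tail = FALSE)

c(MSA = MSA, s2 = s2, F = F_ratio, critical = F_crit, p = p_value)
```

Comparing F_ratio with F_crit and comparing p_value with 0.05 are two views of the same decision, so they always agree.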

That was a lot of work. R can do the whole thing in a single line:

    summary(aov(yield~soil))
                Df Sum Sq Mean Sq F value Pr(>F)
    soil         2   99.2   49.60   4.245  0.025 *
    Residuals   27  315.5   11.69

Here you see all the values that we calculated longhand. The error row is labelled Residuals. In the second and subsequent columns you see the degrees of freedom for treatment and error (2 and 27), the treatment and error sums of squares (99.2 and 315.5), the treatment mean square of 49.6, the error variance $s^2 = 11.685$, the F ratio and the p value (labelled Pr(>F)). The single asterisk next to the p value indicates that the difference between the soil means is significant at 5% (but not at 1%, which would have merited two asterisks). Notice that R does not print the bottom row of the ANOVA table showing the total sum of squares and total degrees of freedom.

The next thing we would do is to check the assumptions of the aov model. This is done using plot like this (see p. 419):

    par(mfrow=c(2,2))
    plot(aov(yield~soil))

[Figure: four diagnostic panels. Top left: Residuals vs Fitted (residuals against fitted values). Top right: Normal Q–Q (standardized residuals against theoretical quantiles). Bottom left: Scale–Location (standardized residuals against fitted values). Bottom right: Constant Leverage: Residuals vs Factor Levels (standardized residuals against factor level combinations for soil). Points 6, 11 and 13 are labelled in each panel.]

The first plot (top left) checks the most important assumption (constancy of variance); there should be no pattern in the residuals against the fitted values (the three treatment means) – and, indeed, there is none. The second plot (top right) tests the assumption of normality of errors: there should be a straight-line relationship between our standardized residuals and theoretical quantiles derived from a normal distribution. Points 6, 11 and 13 lie a little off the straight line, but this is nothing to worry about (see p. 405). The residuals are well behaved (bottom left) and there are no highly influential values that might be distorting the parameter estimates (bottom right).

11.1.4 Effect sizes

The best way to view the effect sizes graphically is to use plot.design (which takes a formula rather than a model object), but our current model with just one factor is perhaps too simple to get full value from this (plot.design(yield~soil)). To see the effect sizes in tabular form use model.tables (which takes a model object as its argument) like this:

    model <- aov(yield~soil)
    model.tables(model,se=T)
    Tables of effects
     soil
    soil
    clay loam sand
    -0.4  2.4 -2.0
    Standard errors of effects
             soil
            1.081
    replic.    10

The effects are shown as departures from the overall mean: sand has a mean yield that is 2.0 below the overall mean, and loam has a mean that is 2.4 above the overall mean. The standard error of effects is 1.081 on a replication of n = 10 (this is the standard error of a mean). You should note that this is not the appropriate standard error for comparing two means (see below). If you specify "means" you get:

    model.tables(model,"means",se=T)
    Tables of means
    Grand mean
    11.9
     soil
    soil
    clay loam sand
    11.5 14.3  9.9
    Standard errors for differences of means
             soil
            1.529
    replic.    10

Now the three means are printed (rather than the effects) and the standard error of the difference of means is given (this is what you need for doing a t test to compare any two means).
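The two standard errors reported by model.tables both come from the pooled error variance. As a rough sketch, assuming the error sum of squares (315.5), the error degrees of freedom (27) and n = 10 replicates per soil type from the ANOVA table, they can be reproduced by hand; the names se_mean and se_diff are illustrative.

```r
s2 <- 315.5 / 27   # pooled error variance (error SS / error d.f.)
n  <- 10           # replicates per soil type

se_mean <- sqrt(s2 / n)       # standard error of a single mean (an effect)
se_diff <- sqrt(2 * s2 / n)   # standard error of a difference between two means

round(c(se_mean = se_mean, se_diff = se_diff), 3)
```

The standard error of a difference is larger than the standard error of a mean by a factor of the square root of 2, which is why the two tables report different values from the same model.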

Another way of looking at effect sizes is to use the summary.lm option for viewing the model, rather than summary.aov (as we used above):

    summary.lm(model)
    Call: aov(formula = yield ~ soil)
    Residuals:
     Min   1Q Median   3Q  Max
    -8.5 -1.8    0.3  1.7  7.1
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   11.500      1.081  10.638  3.7e-11 ***
    soilloam       2.800      1.529   1.832   0.0781 .
    soilsand      -1.600      1.529  -1.047   0.3046
    Residual standard error: 3.418 on 27 degrees of freedom
    Multiple R-squared: 0.2392, Adjusted R-squared: 0.1829
    F-statistic: 4.245 on 2 and 27 DF, p-value: 0.02495

In regression analysis (p. 461) the summary.lm output was easy to understand because it gave us the intercept and the slope (the two parameters estimated by the model) and their standard errors. But this table has three rows. Why is that? What is an intercept in the context of analysis of variance? And why are the standard errors different for the intercept and for soilsand? It will take a while before you feel at ease with summary.lm tables for analysis of variance. The details are explained on p. 424, but the central point is that all summary.lm tables have as many rows as there are parameters estimated from the data. There are three rows in this case because our aov model estimates three parameters: a mean yield for each of the three soil types. In the context of aov, an intercept is a mean value; in this case it is the mean yield for clay because this factor-level name comes first in the alphabet. So if Intercept is the mean yield for clay, what are the other two rows labelled soilloam and soilsand? This is the hardest thing to understand. All other rows in the summary.lm table for aov are differences between means. Thus row 2, labelled soilloam, is the difference between the mean yields on loam and clay, and row 3, labelled soilsand, is the difference between the mean yields of sand and clay.

The first row (Intercept) is a mean, so the standard error column in row 1 contains the standard error of a mean. Rows 2 and 3 are differences between means, so their standard error columns contain the standard error of the difference between two means (and this is a bigger number; see p. 358). The standard error of a mean is

$$se_{mean} = \sqrt{\frac{s^2}{n}} = \sqrt{\frac{11.685}{10}} = 1.081,$$

whereas the standard error of the difference between two means is

$$se_{diff} = \sqrt{2\,\frac{s^2}{n}} = \sqrt{2 \times \frac{11.685}{10}} = 1.529.$$

The summary.lm table shows that neither loam nor sand differs significantly in yield from clay (neither of the p values in rows 2 and 3 is less than 0.05, despite the fact that the ANOVA table showed p = 0.025). But what about the contrast in the yields from loam and sand? To assess this we need to do some arithmetic of our own. The two parameters differ by 2.8 + 1.6 = 4.4 (take care with the signs). The standard error of the difference is 1.529, so the t value is 4.4/1.529 = 2.88. This is much greater than 2 (our rule of thumb for t) so the mean yields of loam and sand are significantly different. To find the precise value of Student's t with 10 replicates in each treatment, the critical value of t is given by the function qt with 18 d.f. (we have lost two degrees of freedom for the two treatment means we have estimated from the data):

    qt(0.975,18)
    [1] 2.100922

Alternatively we can work out the p value associated with our calculated t = 2.88:

    2*(1 - pt(2.88, df = 18))
    [1] 0.009966426

We multiply by 2 because this is a two-tailed test (see p. 293); we did not know in advance that loam would outyield sand under the particular circumstances of this experiment.

The residual standard error in the summary.lm output is the square root of the error variance from the ANOVA table: $\sqrt{11.685} = 3.418$. R-squared is the fraction of the total variation in yield that is explained by the model (adjusted R-squared is explained on p. 461). The F statistic and the p value come from the last two columns of the ANOVA table.

So there it is. That is how analysis of variance works. When the means are significantly different, then the sum of squares computed from the individual treatment means will be significantly smaller than the sum of squares computed from the overall mean. We judge the significance of the difference between the two sums of squares using analysis of variance.

11.1.5 Plots for interpreting one-way ANOVA

There are two traditional ways of plotting the results of ANOVA:

- box-and-whisker plots;
- barplots with error bars.

Here is an example to compare the two approaches.
We have an experiment on plant competition with one factor and five levels. The factor is called clipping and the five levels consist of control (i.e. unclipped), two intensities of shoot pruning and two intensities of root pruning:

    comp <- read.table("c:\\temp\\competition.txt",header=T)
    attach(comp)
    names(comp)
    [1] "biomass" "clipping"
    plot(clipping,biomass,xlab="Competition treatment",
    ylab="Biomass",col="yellow")

[Figure: box-and-whisker plot of Biomass (y axis) against Competition treatment (x axis: control, n25, n50, r10, r5)]

The box-and-whisker plot is good at showing the nature of the variation within each treatment, and also whether there is skew within each treatment (e.g. for the control plots, there is a wider range of values between the median and third quartile than between the median and first quartile). No outliers are shown above the whiskers, so the tops and bottoms of the bars are the maxima and minima within each treatment. The medians for the competition treatments are all higher than the third quartile of the controls, suggesting that they may be significantly different from the controls, but there is little to suggest that any of the competition treatments are significantly different from one another (see below for the analysis).

Barplots with error bars are preferred by many journal editors, and some people think that they make hypothesis testing easier. We shall see. Unlike S-PLUS, R does not have a built-in function called error.bar, so we shall have to write our own. Here is a very simple version without any bells or whistles. We shall call it error.bars to distinguish it from the much more general S-PLUS function:

    error.bars <- function(yv,z,nn) {
    # yv = bar heights, z = bar half-lengths, nn = labels for the bars
    xv <- barplot(yv,ylim=c(0,(max(yv)+max(z))),
    col="green",names=nn,ylab=deparse(substitute(yv)))
    # barplot returns the x coordinates of the bar midpoints
    for (i in 1:length(xv)) {
    arrows(xv[i],yv[i]+z[i],xv[i],yv[i]-z[i],angle=90,code=3,length=0.15)
    }}

To use this function we need to decide what kind of values (z) to use for the lengths of the bars. Let us use the standard error of a mean based on the pooled error variance from the ANOVA, then return to a discussion of the pros and cons of different kinds of error bars later. Here is the one-way analysis of variance:

    model <- aov(biomass~clipping)
    summary(model)
                Df Sum Sq Mean Sq F value  Pr(>F)
    clipping     4  85356   21339   4.302 0.00875 **
    Residuals   25 124020    4961

From the ANOVA table we can see that the pooled error variance $s^2 = 4961$. Now we need to know how many numbers were used in the calculation of each of the five means:

    table(clipping)
    clipping
    control     n25     n50     r10      r5
          6       6       6       6       6

There was equal replication (which makes life easier), and each mean was based on six replicates, so the standard error of a mean is $\sqrt{s^2/n} = \sqrt{4961/6} = 28.75$. We shall draw an error bar up 28.75 from each mean and down by the same distance, so we need five values for z, one for each bar, each of 28.75:

    se <- rep(28.75,5)

We need to provide labels for the five different bars – the factor levels should be good for this:

    labels <- levels(clipping)

Now we work out the five mean values which will be the heights of the bars, and save them as a vector called ybar:

    ybar <- tapply(biomass,clipping,mean)

Finally, we can create the barplot with error bars (the function is defined above):

    error.bars(ybar,se,labels)

[Figure: barplot of ybar for the five treatments (control, n25, n50, r10, r5) with error bars of ±1 standard error of the mean]

We do not get the same feel for the distribution of the values within each treatment as was obtained by the box-and-whisker plot, but we can certainly see clearly which means are not significantly different. If, as here, we use ±1 standard error of the mean as the length of the error bars, then when the bars overlap this implies that the two means are not significantly different. Remember the rule of thumb for t: significance requires 2 or more standard errors, and if the bars overlap it means that the difference between the means is less than 2 standard errors. There is another issue, too. For comparing means, we should use the standard error of the difference between two means (not the standard error of one mean) in our tests (see p. 358); these bars would be about 1.4 times as long as the bars we have drawn here. So while we can be sure that the two root-pruning treatments are not significantly different from one another, and that the two shoot-pruning treatments are not significantly different from one another (because their bars overlap), we cannot conclude from this plot (although we do know it from the ANOVA table above; p = 0.00875) that the controls have significantly lower biomass than the rest (because the error bars are not the correct length for testing differences between means).

An alternative graphical method is to use 95% confidence intervals for the lengths of the bars, rather than standard errors of means. This is easy to do: we multiply our standard errors by Student's t, qt(.975,5) = 2.570582, to get the lengths of the confidence intervals:

    error.bars(ybar,2.570582*se,labels)

[Figure: barplot of ybar for the five treatments (control, n25, n50, r10, r5) with 95% confidence interval bars]

Now, all of the error bars overlap, implying visually that there are no significant differences between the means. But we know that this is not true from our analysis of variance, in which we rejected the null hypothesis that all the means were the same at p = 0.00875. If it were the case that the bars did not overlap when we are using confidence intervals (as here), then that would imply that the means differed by more than 4 standard errors, and this is a much greater difference than is required to conclude that the means are significantly different. So this is not perfect either. With standard errors we could be sure that the means were not significantly different when the bars did overlap. And with confidence intervals we can be sure that the means are significantly different when the bars do not overlap. But the alternative cases are not clear-cut for either type of bar. Can we somehow get the best of both worlds, so that the means are significantly different when the bars do not overlap, and the means are not significantly different when the bars do overlap?

The answer is yes, we can, if we use least significant difference (LSD) bars. Let us revisit the formula for Student's t test:

$$t = \frac{\text{a difference}}{\text{standard error of the difference}}.$$

We say that the difference is significant when t > 2 (by the rule of thumb, or t > qt(0.975,df) if we want to be more precise). We can rearrange this formula to find the smallest difference that we would regard as being significant. We can call this the least significant difference:

$$LSD = \texttt{qt(0.975,df)} \times \text{standard error of a difference} \approx 2 \times se_{diff}.$$

In our present example this is

    qt(0.975,10)*sqrt(2*4961/6)
    [1] 90.60794

because a difference is based on 12 – 2 = 10 degrees of freedom. What we are saying is the two means would be significantly different if they differed by 90.61 or more. How can we show this graphically? We want overlapping bars to indicate a difference less than 90.61, and non-overlapping bars to represent a difference greater than 90.61. With a bit of thought you will realize that we need to draw bars that are LSD/2 in length, up and down from each mean. Let us try it with our current example:

    lsd <- qt(0.975,10)*sqrt(2*4961/6)
    lsdbars <- rep(lsd,5)/2
    error.bars(ybar,lsdbars,labels)

[Figure: barplot of ybar for the five treatments (control, n25, n50, r10, r5) with LSD bars of length LSD/2 above and below each mean]

Now we can interpret the significant differences visually. The control biomass is significantly lower than any of the four treatments, but none of the four treatments is significantly different from any other. The statistical analysis of this contrast is explained in detail in Section 9.23 (p. 430). Sadly, most journal editors insist on error bars of 1 standard error of the mean. It is true that there are complicating issues to do with LSD bars (not least the vexed question of multiple comparisons; see p. 531), but at least they do what was intended by the error plot (i.e. overlapping bars means non-significance and non-overlapping bars means significance); neither standard errors nor confidence intervals can say that. A better option might be to use box-and-whisker plots with the notch=T option to indicate significance (see p. 213).
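The three kinds of bar discussed in this section differ only in the multiplier applied to the pooled error variance. The sketch below puts the three half-lengths side by side, assuming $s^2 = 4961$ and n = 6 replicates per treatment from the competition example; the object names are illustrative.

```r
s2 <- 4961   # pooled error variance from the ANOVA table
n  <- 6      # replicates per treatment

se_mean <- sqrt(s2 / n)        # standard error of one mean
se_diff <- sqrt(2 * s2 / n)    # standard error of a difference between two means

# Half-lengths of the three kinds of bar (drawn up and down from each mean)
bar_se  <- se_mean                           # 1 standard error of the mean
bar_ci  <- qt(0.975, n - 1) * se_mean        # 95% confidence interval, 5 d.f.
lsd     <- qt(0.975, 2 * n - 2) * se_diff    # least significant difference, 10 d.f.
bar_lsd <- lsd / 2                           # LSD bars extend LSD/2 each way

round(c(se = bar_se, ci = bar_ci, lsd = lsd, lsd_half = bar_lsd), 2)
```

The LSD bars fall between the other two in length, which is consistent with the text: standard error bars are too short and confidence interval bars too long for overlap to be read as a test of the difference between two means.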

11.2 Factorial experiments

A factorial experiment has two or more factors, each with two or more levels, plus replication for each combination of factor levels. This means that we can investigate statistical interactions, in which the response to one factor depends on the level of another factor. Our example comes from a farm-scale trial of animal diets. There are two factors: diet and supplement. Diet is a factor with three levels: barley, oats and wheat. Supplement is a factor with four levels: agrimore, control, supergain and supersupp. The response variable is weight gain after 6 weeks.

    weights <- read.table("c:\\temp\\growth.txt",header=T)
    attach(weights)

Data inspection is carried out using barplot (note the use of beside=T to get the bars in adjacent clusters rather than vertical stacks):

    barplot(tapply(gain,list(diet,supplement),mean),
    beside=T,ylim=c(0,30),col=c("orange","yellow","cornsilk"))

Note that the second factor in the list (supplement) appears as groups of bars from left to right in alphabetical order by factor level, from agrimore to supersupp. The first factor (diet) appears as three levels within each group of bars: orange = barley, yellow = oats, cornsilk = wheat, again in alphabetical order by factor level. We should really add a key to explain the levels of diet. Use locator(1) to find the coordinates for the top left corner of the box around the legend. You need to increase the default scale on the y axis to make enough room for the legend box.

    labs <- c("Barley","Oats","Wheat")
    legend(locator(1),labs,fill= c("orange","yellow","cornsilk"))

[Figure: clustered barplot of mean weight gain for the four supplements (agrimore, control, supergain, supersupp), with a legend for diet: Barley, Oats, Wheat]

We inspect the mean values using tapply as usual:

    tapply(gain,list(diet,supplement),mean)
           agrimore  control supergain supersupp
    barley 26.34848 23.29665  22.46612  25.57530
    oats   23.29838 20.49366  19.66300  21.86023
    wheat  19.63907 17.40552  17.01243  19.66834

Now we use aov or lm to fit a factorial analysis of variance (the choice affects only whether we get an ANOVA table or a list of parameter estimates as the default output from summary). We estimate parameters for the main effects of each level of diet and each level of supplement, plus terms for the interaction between diet and supplement. Interaction degrees of freedom are the product of the degrees of freedom of the component terms (i.e. (3 – 1) × (4 – 1) = 6). The model is gain~diet+supplement+diet:supplement, but this can be simplified using the asterisk notation like this:

    model <- aov(gain~diet*supplement)
    summary(model)
                    Df Sum Sq Mean Sq F value   Pr(>F)
    diet             2 287.17  143.59   83.52 3.00e-14 ***
    supplement       3  91.88   30.63   17.82 2.95e-07 ***
    diet:supplement  6   3.41    0.57    0.33    0.917
    Residuals       36  61.89    1.72

The ANOVA table shows that there is no hint of any interaction between the two explanatory variables (p = 0.917); evidently the effects of diet and supplement are additive. The disadvantage of the ANOVA table is that it does not show us the effect sizes, and does not allow us to work out how many levels of each of the two factors are significantly different. As a preliminary to model simplification, summary.lm is often more useful than summary.aov:

    summary.lm(model)
    Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)                    26.3485     0.6556  40.191  < 2e-16 ***
    dietoats                       -3.0501     0.9271  -3.290 0.002248 **
    dietwheat                      -6.7094     0.9271  -7.237 1.61e-08 ***
    supplementcontrol              -3.0518     0.9271  -3.292 0.002237 **
    supplementsupergain            -3.8824     0.9271  -4.187 0.000174 ***
    supplementsupersupp            -0.7732     0.9271  -0.834 0.409816
    dietoats:supplementcontrol      0.2471     1.3112   0.188 0.851571
    dietwheat:supplementcontrol     0.8183     1.3112   0.624 0.536512
    dietoats:supplementsupergain    0.2470     1.3112   0.188 0.851652
    dietwheat:supplementsupergain   1.2557     1.3112   0.958 0.344601
    dietoats:supplementsupersupp   -0.6650     1.3112  -0.507 0.615135
    dietwheat:supplementsupersupp   0.8024     1.3112   0.612 0.544381
    Residual standard error: 1.311 on 36 degrees of freedom
    Multiple R-squared: 0.8607, Adjusted R-squared: 0.8182
    F-statistic: 20.22 on 11 and 36 DF, p-value: 3.295e-12

This is a rather complex model, because there are 12 estimated parameters (the number of rows in the table): six main effects and six interactions. Remember that the parameter labelled Intercept is the mean with both factor levels set to their first in the alphabet (diet=barley and supplement=agrimore). All other rows are differences between means. The output re-emphasizes that none of the interaction terms is even close to significant, but it suggests that the minimal adequate model will require five parameters: an intercept, a difference due to oats, a difference due to wheat, a difference due to control and a difference due to supergain (these are the five rows with significance stars). This draws attention to the main shortcoming of using treatment contrasts as the default. If you look carefully at the table, you will see that the effect sizes of two of the supplements, control and supergain, are not significantly different from one another. You need lots of practice at doing t tests in your head, to be able to do this quickly. Ignoring the signs (because the signs are negative for both of them), we have 3.05 vs. 3.88, a difference of 0.83. But look at the associated standard errors (both 0.927); the difference is less than 1 standard error of a difference between two means. For significance, we would need roughly 2 standard errors (remember the rule of thumb, in which t ≥ 2 is significant; see p. 292). The rows get starred in the significance column because treatment contrasts compare all the main effects in the rows with the intercept (where each factor is set to its first level in the alphabet, namely agrimore and barley in this case). When, as here, several factor levels are different from the intercept, but not different from one another, they all get significance stars. This means that you cannot count up the number of rows with stars in order to determine the number of significantly different factor levels.
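The mental arithmetic for comparing control with supergain can be written out explicitly. This is a sketch using the estimates and standard error read from the summary.lm table; taking the critical value from the 36 residual degrees of freedom is an assumption made here for illustration, and the object names are hypothetical.

```r
# Estimates and standard error read from the summary.lm table
control   <- -3.0518
supergain <- -3.8824
se_diff   <- 0.9271   # standard error of a difference between two means
df_resid  <- 36       # residual degrees of freedom of the factorial model

# t for the contrast between the two supplements (not against the intercept)
t_value <- abs(control - supergain) / se_diff

# Precise critical value in place of the rule of thumb t >= 2
t_crit <- qt(0.975, df_resid)

round(c(t = t_value, critical = t_crit), 3)
t_value > t_crit   # FALSE: the two supplements do not differ significantly
```

Both rows are starred in the table because each differs from the intercept, yet the contrast between them gives a t value below 1.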
We first simplify the model by leaving out the interaction terms:

    model <- aov(gain~diet+supplement)
    summary.lm(model)
    Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
    (Intercept)          26.1230     0.4408  59.258  < 2e-16 ***
    dietoats             -3.0928     0.4408  -7.016 1.38e-08 ***
    dietwheat            -5.9903     0.4408 -13.589  < 2e-16 ***
    supplementcontrol    -2.6967     0.5090  -5.298 4.03e-06 ***
    supplementsupergain  -3.3815     0.5090  -6.643 4.72e-08 ***
    supplementsupersupp  -0.7274     0.5090  -1.429     0.16

It is clear that we need to retain all three levels of diet (oats differs from wheat by 5.99 – 3.09 = 2.90 with a standard error of 0.44). But it is not clear that we need four levels of supplement: supersupp is not obviously different from agrimore (0.727 with standard error 0.509). Nor is supergain obviously different from the unsupplemented control animals (3.38 – 2.70 = 0.68). We shall try a new two-level factor to replace the four-level supplement, and see if this significantly reduces the model’s explanatory power: agrimore and supersupp are recoded as ‘best’ and control and supergain as ‘worst’:

    supp2 <- factor(supplement)
    levels(supp2)
    [1] "agrimore" "control" "supergain" "supersupp"
    levels(supp2)[c(1,4)] <- "best"
    levels(supp2)[c(2,3)] <- "worst"
    levels(supp2)
    [1] "best" "worst"
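The recoding works because assigning the same label to two levels merges them, which also shifts the positions of the remaining levels. The sketch below replays the two assignments on a toy factor (one value per supplement name, invented for illustration) so that the intermediate state is visible.

```r
# A toy factor with the same four supplement names (one value per level)
supp2 <- factor(c("agrimore", "control", "supergain", "supersupp"))

# Step 1: give levels 1 and 4 the same label; R merges them into one level
levels(supp2)[c(1, 4)] <- "best"
after_first <- levels(supp2)    # "best" "control" "supergain"

# Step 2: positions 2 and 3 now refer to control and supergain
levels(supp2)[c(2, 3)] <- "worst"
after_second <- levels(supp2)   # "best" "worst"

# Each original value keeps its place in the vector under its new label
as.character(supp2)             # "best" "worst" "worst" "best"
```

After the first assignment only three levels remain, which is why the second assignment can use positions 2 and 3.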

Now we can compare the two models:

    model2 <- aov(gain~diet+supp2)
    anova(model,model2)
    Analysis of Variance Table
    Model 1: gain ~ diet + supplement
    Model 2: gain ~ diet + supp2
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     42 65.296
    2     44 71.284 -2   -5.9876 1.9257 0.1584

The simpler model2 has saved two degrees of freedom and is not significantly worse than the more complex model (p = 0.1584). This is the minimal adequate model: all of the parameters are significantly different from zero and from one another:

    summary.lm(model2)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  25.7593     0.3674  70.106  < 2e-16 ***
    dietoats     -3.0928     0.4500  -6.873 1.76e-08 ***
    dietwheat    -5.9903     0.4500 -13.311  < 2e-16 ***
    supp2worst   -2.6754     0.3674  -7.281 4.43e-09 ***

Model simplification has reduced our initial 12-parameter model to a four-parameter model.

11.3 Pseudoreplication: Nested designs and split plots

The model-fitting functions aov, lme and lmer have the facility to deal with complicated error structures, and it is important that you can recognize such error structures, and hence avoid the pitfalls of pseudoreplication. There are two general cases:

- nested sampling, as when repeated measurements are taken from the same individual, or observational studies are conducted at several different spatial scales (mostly random effects);
- split-plot analysis, as when designed experiments have different treatments applied to plots of different sizes (mostly fixed effects).

11.3.1 Split-plot experiments

In a split-plot experiment, different treatments are applied to plots of different sizes. Each different plot size is associated with its own error variance, so instead of having one error variance (as in all the ANOVA tables up to this point), we have as many error terms as there are different plot sizes.
The analysis is presented as a series of component ANOVA tables, one for each plot size, in a hierarchy from the largest plot size with the lowest replication at the top, down to the smallest plot size with the greatest replication at the bottom. The following example refers to a designed field experiment on crop yield with three treatments: irrigation (with two levels, irrigated or not), sowing density (with three levels, low, medium and high), and fertilizer application (with three levels, N, P, or N and P together).
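The hierarchy of plot sizes can be made concrete by counting plots at each scale. The sketch below builds a design grid for this kind of layout, assuming four blocks (the number of whole fields in this experiment); the level labels are placeholders invented for illustration, not values read from the data file.

```r
# One row per smallest plot; the labels are illustrative placeholders
design <- expand.grid(
  fertilizer = c("F1", "F2", "F3"),
  density    = c("low", "medium", "high"),
  irrigation = c("control", "irrigated"),
  block      = c("A", "B", "C", "D")
)

# Number of plots at each spatial scale, from largest to smallest
n_blocks        <- length(unique(design$block))
n_half_fields   <- nrow(unique(design[c("block", "irrigation")]))
n_density_plots <- nrow(unique(design[c("block", "irrigation", "density")]))
n_fert_plots    <- nrow(design)

c(n_blocks, n_half_fields, n_density_plots, n_fert_plots)
```

Replication rises from 4 to 8 to 24 to 72 as the plots get smaller, which is why each scale needs its own error term.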

    yields <- read.table("c:\\temp\\splityield.txt",header=T)
    attach(yields)
    names(yields)
    [1] "yield" "block" "irrigation" "density" "fertilizer"

The largest plots were the four whole fields (block), each of which was split in half, and irrigation was allocated at random to one half of the field. Each irrigation plot was split into three, and one of three different seed-sowing densities (low, medium or high) was allocated at random (independently for each level of irrigation and each block). Finally, each density plot was divided into three, and one of three fertilizer nutrient treatments (N, P, or N and P together) was allocated at random.

The issue with split-plot experiments is pseudoreplication. Think about the irrigation experiment. There were four blocks, each split in half, with one half irrigated and the other as a control. The dataframe for an analysis of this experiment should therefore contain just 8 rows (not 72 rows as in the present case). There would be seven degrees of freedom in total, three for blocks, one for irrigation and just 7 − 3 − 1 = 3 d.f. for error. If you did not spot this, the model could be run with 51 d.f. representing massive pseudoreplication (the correct p value for the irrigation treatment is 0.0247, but for the pseudoreplicated mistaken analysis $p = 6.16 \times 10^{-10}$).

The model formula is specified as a factorial, using the asterisk notation. The error structure is defined in the Error term, with the plot sizes listed from left to right, from largest to smallest, with each variable separated by the slash operator /. Note that the smallest plot size, fertilizer, does not need to appear in the Error term:

    model <- aov(yield~irrigation*density*fertilizer+Error(block/irrigation/density))
    summary(model)

    Error: block
              Df Sum Sq Mean Sq F value Pr(>F)
    Residuals  3  194.4   64.81

    Error: block:irrigation
               Df Sum Sq Mean Sq F value Pr(>F)
    irrigation  1   8278    8278   17.59 0.0247 *
    Residuals   3   1412     471

    Error: block:irrigation:density
                       Df Sum Sq Mean Sq F value Pr(>F)
    density             2   1758   879.2   3.784 0.0532 .
    irrigation:density  2   2747  1373.5   5.912 0.0163 *
    Residuals          12   2788   232.3

    Error: Within
                                  Df Sum Sq Mean Sq F value   Pr(>F)
    fertilizer                     2 1977.4   988.7  11.449 0.000142 ***
    irrigation:fertilizer          2  953.4   476.7   5.520 0.008108 **
    density:fertilizer             4  304.9    76.2   0.883 0.484053
    irrigation:density:fertilizer  4  234.7    58.7   0.680 0.610667
    Residuals                     36 3108.8    86.4
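Each treatment in this output is tested against the residual mean square of its own stratum. As a minimal sketch, the irrigation test can be reproduced by hand from the rounded mean squares printed in the block:irrigation table; because the inputs are rounded, the F ratio agrees with the printed 17.59 only approximately.

```r
# Degrees of freedom at the irrigation scale: 4 blocks x 2 halves = 8 plots
n_plots  <- 4 * 2
df_total <- n_plots - 1                      # 7
df_block <- 4 - 1                            # 3
df_irrig <- 2 - 1                            # 1
df_error <- df_total - df_block - df_irrig   # 3

# Mean squares copied (rounded) from the 'Error: block:irrigation' table
ms_irrigation <- 8278
ms_residual   <- 471

# The treatment is tested against the residual of its own stratum
f_irrigation <- ms_irrigation / ms_residual
p_irrigation <- pf(f_irrigation, df_irrig, df_error, lower.tail = FALSE)

c(F = f_irrigation, p = p_irrigation)
```

With only 3 error degrees of freedom, even an F ratio of about 17.6 gives a p value between 0.02 and 0.03, in line with the 0.0247 reported above.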

Here you see the four ANOVA tables, one for each plot size: blocks are the biggest plots, half blocks get the irrigation treatment, one third of each half block gets a sowing density treatment, and one third of a sowing density treatment gets each fertilizer treatment. Note that the non-significant main effect for density (p = 0.053) does not mean that density is unimportant, because density appears in a significant interaction with irrigation (the density terms cancel out, when averaged over the two irrigation treatments; see below). The best way to understand the two significant interaction terms is to plot them using interaction.plot like this:

    interaction.plot(fertilizer,irrigation,yield)

[Figure: interaction plot; x axis: fertilizer; y axis: mean of yield; separate lines for irrigation (irrigated, control)]

Irrigation increases yield proportionately more on the N-fertilized plots than on the P-fertilized plots. The irrigation–density interaction is more complicated:

    interaction.plot(density,irrigation,yield)

[Figure: interaction plot; x axis: density; y axis: mean of yield; separate lines for irrigation (irrigated, control)]

On the irrigated plots, yield is lowest on the low-density plots, but on control plots yield is lowest on the high-density plots. Alternatively, you could use the effects package which takes a model object (a linear model or a generalized linear model) and provides attractive trellis plots of specified interaction effects (p. 968).

When there are one or more missing values (NA), then factors have effects in more than one stratum and the same main effect turns up in more than one ANOVA table. In such a case, use lme or lmer rather than aov. The output of aov is not to be trusted under these circumstances.

11.3.2 Mixed-effects models

Mixed-effects models are so called because the explanatory variables are a mixture of fixed effects and random effects:

- fixed effects influence only the mean of y;
- random effects influence only the variance of y.

A random effect should be thought of as coming from a population of effects: the existence of this population is an extra assumption. We speak of prediction of random effects, rather than estimation: we estimate fixed effects from data, but we intend to make predictions about the population from which our random effects were sampled. Fixed effects are unknown constants to be estimated from the data. Random effects govern the variance–covariance structure of the response variable. The fixed effects are often experimental treatments that were applied under our direction, and the random effects are either categorical or continuous variables that are distinguished by the fact that we are typically not interested in the parameter values, but only in the variance they explain.

One or more of the explanatory variables might represent grouping in time or in space. Random effects that come from the same group will be correlated, and this contravenes one of the fundamental assumptions of standard statistical models: independence of errors.
Mixed-effects models take care of this non-independence of errors by modelling the covariance structure introduced by the grouping of the data. A major benefit of random-effects models is that they economize on the number of degrees of freedom used up by the factor levels. Instead of estimating a mean for every single factor level, the random-effects model estimates the distribution of the means (usually as the standard deviation of the differences of the factor-level means around an overall mean). Mixed-effects models are particularly useful in cases where there is temporal pseudoreplication (repeated measurements) and/or spatial pseudoreplication (e.g. nested designs or split-plot experiments). These models can allow for:

- spatial autocorrelation between neighbours;
- temporal autocorrelation across repeated measures on the same individuals;
- differences in the mean response between blocks in a field experiment;
- differences between subjects in a medical trial involving repeated measures.

The point is that we really do not want to waste precious degrees of freedom in estimating parameters for each of the separate levels of the categorical random variables. On the other hand, we do want to make use of all the measurements we have taken, but because of the pseudoreplication we want to take account of both the

- correlation structure, used to model within-group correlation associated with temporal and spatial dependencies, using correlation, and
- variance function, used to model non-constant variance in the within-group errors using weights.

11.3.3 Fixed effect or random effect?

It is difficult without lots of experience to know when to use a categorical explanatory variable as a fixed effect or as a random effect. Some guidelines are given below.

- Am I interested in the effect sizes? Yes means fixed effects.
- Is it reasonable to suppose that the factor levels come from a population of levels? Yes means random effects.
- Are there enough levels of the factor in the dataframe on which to base an estimate of the variance of the population of effects? No means fixed effects.
- Are the factor levels informative? Yes means fixed effects.
- Are the factor levels just numeric labels? Yes means random effects.
- Am I mostly interested in making inferences about the distribution of effects, based on the random sample of effects represented in the dataframe? Yes means random effects.
- Is there hierarchical structure? Yes means we need to ask whether the data are experimental or observational.
- Is it a hierarchical experiment, where the factor levels are experimental manipulations? Yes means fixed effects in a split-plot design (see p. 519).
- Is it a hierarchical observational study? Yes means random effects, perhaps in a variance components analysis (see p. 524).
- When the model contains both fixed and random effects, use mixed-effects models.
- If the model structure is linear, use linear mixed effects, lme or lmer.
- Otherwise, specify the model equation and use non-linear mixed effects, nlme.

11.3.4 Removing the pseudoreplication

If you are principally interested in the fixed effects, then the best response to pseudoreplication in a data set is simply to eliminate it. Spatial pseudoreplication can be averaged away. You will always get the correct effect size and p value from the reduced, non-pseudoreplicated dataframe. Note also that you should not use anova to compare different models for the fixed effects when using lme or lmer with REML (see p. 688).
Temporal pseudoreplication can be dealt with by carrying out separate ANOVAs, one at each time (or just one at the end of the experiment). This approach, however, has two weaknesses:

- It cannot address questions about treatment effects that relate to the longitudinal development of the mean response profiles (e.g. differences in growth rates between successive times).
- Inferences made with each of the separate analyses are not independent, and it is not always clear how they should be combined.

The key feature of longitudinal data is that the same individuals are measured repeatedly through time. This would represent temporal pseudoreplication if the data were used uncritically in regression or ANOVA. The set of observations on one individual subject will tend to be positively correlated, and this correlation needs to be taken into account in carrying out the analysis. The alternative is a cross-sectional study, with all the data gathered at a single point in time, in which each individual contributes a single data point. The

advantage of longitudinal studies is that they are capable of separating age effects from cohort effects; these are inextricably confounded in cross-sectional studies. This is particularly important when differences between years mean that cohorts originating at different times experience different conditions, so that individuals of the same age in different cohorts would be expected to differ.

There are two extreme cases in longitudinal studies:

- a few measurements on a large number of individuals;
- a large number of measurements on a few individuals.

In the first case it is difficult to fit an accurate model for change within individuals, but treatment effects are likely to be tested effectively. In the second case, it is possible to get an accurate model of the way that individuals change through time, but there is less power for testing the significance of treatment effects, especially if variation from individual to individual is large. In the first case, less attention will be paid to estimating the correlation structure, while in the second case the covariance model will be the principal focus of attention. The aims are:

- to estimate the average time course of a process;
- to characterize the degree of heterogeneity from individual to individual in the rate of the process;
- to identify the factors associated with both of these, including possible cohort effects.

The response is not the individual measurement, but the sequence of measurements on an individual subject. This enables us to distinguish between age effects and year effects; see Diggle et al. (1994) for details.

11.3.5 Derived variable analysis

The idea here is to get rid of the pseudoreplication by reducing the repeated measures into a set of summary statistics (slopes, intercepts or means), then analyse these summary statistics using standard parametric techniques such as ANOVA or regression.
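A minimal sketch of derived variable analysis follows. The data are simulated purely for illustration (six hypothetical individuals, two treatments, five measurement times), and every name in the code is an assumption rather than part of any dataset used in this chapter. Each individual's time series is reduced to a single slope, and the slopes are then analysed by one-way ANOVA.

```r
set.seed(42)  # simulated data, for illustration only

# Hypothetical design: 6 individuals, 2 treatments, 5 measurement times
growth <- expand.grid(time = 1:5, individual = factor(1:6))
growth$treatment <- factor(ifelse(as.integer(growth$individual) <= 3, "A", "B"))
growth$size <- 10 + ifelse(growth$treatment == "A", 1.0, 1.5) * growth$time +
  rnorm(nrow(growth), sd = 0.5)

# Step 1: one summary statistic (here the slope) per individual
slopes <- sapply(split(growth, growth$individual),
                 function(d) coef(lm(size ~ time, data = d))[["time"]])

# Step 2: one row per individual, so the replicates are now independent
derived <- data.frame(
  slope     = as.vector(slopes),
  treatment = factor(rep(c("A", "B"), each = 3))
)

# Step 3: ordinary one-way ANOVA on the derived variable
model <- aov(slope ~ treatment, data = derived)
summary(model)   # error d.f. = 6 individuals - 2 treatments = 4
```

The 30 raw measurements contribute only 6 data points to the test, so the error degrees of freedom reflect the number of individuals rather than the number of measurements.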
The technique is weak when the values of the explanatory variables change through time. Derived variable analysis makes most sense when it is based on the parameters of scientifically interpretable non-linear models from each time sequence. However, the best model from a theoretical perspective may not be the best model from the statistical point of view.

There are three qualitatively different sources of random variation:

- random effects, where experimental units differ (e.g. genotype, history, size, physiological condition) so that there are intrinsically high responders and other low responders;
- serial correlation, where there may be time-varying stochastic variation within a unit (e.g. market forces, physiology, ecological succession, immunity) so that correlation depends on the time separation of pairs of measurements on the same individual, with correlation weakening with the passage of time;
- measurement error, where the assay technique may introduce an element of correlation (e.g. shared bioassay of closely spaced samples; different assay of later specimens).

11.4 Variance components analysis

For random effects we are often more interested in the question of how much of the variation in the response variable can be attributed to a given factor, than we are in estimating means or assessing the significance of differences between means. This procedure is called variance components analysis.

The following classic example of spatial pseudoreplication comes from Snedecor and Cochran (1980):

    rats <- read.table("c:\\temp\\rats.txt",header=T)
    attach(rats)
    names(rats)
    [1] "Glycogen" "Treatment" "Rat" "Liver"

Three experimental treatments were administered to rats, and the glycogen content of the rats’ livers was analysed as the response variable. There were two rats per treatment, so the total sample was n = 3 × 2 = 6. The tricky bit was that after each rat was killed, its liver was cut up into three pieces: a left-hand bit, a central bit and a right-hand bit. So now there are six rats each producing three bits of liver, for a total of 6 × 3 = 18 numbers. Finally, two separate preparations were made from each macerated bit of liver, to assess the measurement error associated with the analytical machinery. At this point there are 2 × 18 = 36 numbers in the dataframe as a whole. The factor levels are numbers, so we need to declare the explanatory variables to be categorical before we begin:

    Treatment <- factor(Treatment)
    Rat <- factor(Rat)
    Liver <- factor(Liver)

Here is the analysis done the wrong way:

    model <- aov(Glycogen~Treatment)
    summary(model)
                Df Sum Sq Mean Sq F value   Pr(>F)
    Treatment    2   1558   778.8    14.5 3.03e-05 ***
    Residuals   33   1773    53.7

A massively significant effect of treatment, right? Wrong. This result is due entirely to pseudoreplication. With just six rats in the whole experiment, there should be just three degrees of freedom for error, not 33.

The simplest way to do the analysis properly is to average away the pseudoreplication. Here are the mean glycogen values for the six rats:

    (means <- tapply(Glycogen,list(Treatment,Rat),mean))
             1        2
    1 132.5000 148.5000
    2 149.6667 152.3333
    3 134.3333 136.0000

We need a new variable to represent the treatments associated with each of these rats.
The ‘generate levels’ function gl is useful here:

    treat <- gl(3,1,length=6)

Now we can fit the non-pseudoreplicated model with the correct error degrees of freedom (3 d.f., not 33):

    model <- aov(as.vector(means)~treat)
    summary(model)
                Df Sum Sq Mean Sq F value Pr(>F)
    treat        2  259.6  129.80   2.929  0.197
    Residuals    3  132.9   44.31
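The call gl(3,1,length=6) lines up with as.vector(means) because a matrix is unrolled column by column, so the treatments cycle 1, 2, 3 within each rat column. The sketch below rebuilds the table of means from the six printed values (rather than from the data file) and refits the model, so that the pairing of means with treatments can be inspected directly.

```r
# Rat means copied from the tapply() output: rows = treatments, columns = rats
means <- matrix(c(132.5000, 149.6667, 134.3333,    # rat 1 in treatments 1, 2, 3
                  148.5000, 152.3333, 136.0000),   # rat 2 in treatments 1, 2, 3
                nrow = 3,
                dimnames = list(c("1", "2", "3"), c("1", "2")))

# as.vector() reads the matrix column by column, so treatments cycle 1, 2, 3
treat <- gl(3, 1, length = 6)
data.frame(mean = as.vector(means), treat)

# Refit the non-pseudoreplicated model and extract the F ratio
model   <- aov(as.vector(means) ~ treat)
f_value <- summary(model)[[1]][["F value"]][1]
round(f_value, 3)
```

The F ratio from these six means matches the 2.929 in the table above to the precision of the printed values.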

As you can see, the treatment effect falls well short of significance (p = 0.197).

There are two different ways of doing the analysis properly in R: ANOVA with multiple error terms (aov) or linear mixed-effects models (lmer). The problem is that the bits of the same liver are pseudoreplicates because they are spatially correlated (they come from the same rat); they are not independent, as required if they are to be true replicates. Likewise, the two preparations from each liver bit are very highly correlated (the livers were macerated before the preparations were taken, so they are essentially the same sample, and certainly not independent replicates of the experimental treatments).

Here is the correct analysis using aov with multiple error terms. In the Error term we start with the largest scale (treatment), then rats within treatments, then liver bits within rats within treatments. Finally, there were replicated measurements (two preparations) made for each bit of liver.

    model2 <- aov(Glycogen~Treatment+Error(Treatment/Rat/Liver))
    summary(model2)

    Error: Treatment
              Df Sum Sq Mean Sq
    Treatment  2   1558   778.8

    Error: Treatment:Rat
              Df Sum Sq Mean Sq F value Pr(>F)
    Residuals  3  797.7   265.9

    Error: Treatment:Rat:Liver
              Df Sum Sq Mean Sq F value Pr(>F)
    Residuals 12    594    49.5

    Error: Within
              Df Sum Sq Mean Sq F value Pr(>F)
    Residuals 18    381   21.17

You can do the correct, non-pseudoreplicated analysis of variance from this output (Box 11.1).

Box 11.1 Sums of squares in hierarchical designs

The trick to understanding these sums of squares is to appreciate that with nested categorical explanatory variables (random effects) the correction factor, which is subtracted from the sum of squared subtotals, is not the conventional $(\sum y)^2 / kn$. Instead, the correction factor is the uncorrected sum of squared subtotals from the level in the hierarchy immediately above the level in question. This is very hard to see without lots of practice.
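The rule stated in the paragraph above can be checked numerically. The sketch below uses simulated responses (not the rats data) in a balanced hierarchy of the same shape, computes the uncorrected sum of squared subtotals at each level, and forms each sum of squares by subtracting the term from the level immediately above. All object names are illustrative.

```r
set.seed(1)  # simulated responses, for illustration only

# Balanced hierarchy with the same shape as the rats experiment:
# 3 treatments x 2 rats x 3 liver bits x 2 preparations = 36 values
d <- expand.grid(prep = 1:2, liver = 1:3, rat = 1:2, treat = 1:3)
y <- rnorm(nrow(d), mean = 140, sd = 10)

# Uncorrected sums of squared subtotals, each divided by the number of
# values that make up one subtotal at that level
u_grand <- sum(y)^2 / 36
u_treat <- sum(tapply(y, d$treat, sum)^2) / 12
u_rat   <- sum(tapply(y, list(d$treat, d$rat), sum)^2) / 6
u_liver <- sum(tapply(y, list(d$treat, d$rat, d$liver), sum)^2) / 2
u_obs   <- sum(y^2) / 1

# Each sum of squares subtracts the term from the level immediately above
ss_treat <- u_treat - u_grand
ss_rat   <- u_rat   - u_treat
ss_liver <- u_liver - u_rat
ss_prep  <- u_obs   - u_liver

# The four pieces add up to the total sum of squares
ss_total <- sum((y - mean(y))^2)
all.equal(ss_treat + ss_rat + ss_liver + ss_prep, ss_total)
```

Because each correction factor cancels against the subtotal term of the level above, the four sums of squares telescope to the total sum of squares whatever the data.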
The total sum of squares, SSY, and the treatment sum of squares, SSA, are computed in the usual way (see p. 499), where kn is the total number of observations and n is the number of observations in each treatment total:

$$SSY = \sum y^2 - \frac{\left(\sum y\right)^2}{kn},$$

$$SSA = \frac{\sum_{i=1}^{k} C_i^2}{n} - \frac{\left(\sum y\right)^2}{kn}.$$

The analysis is easiest to understand in the context of an example. For the rats data, the treatment totals were based on 12 numbers (two rats, three liver bits per rat and two preparations per liver bit). In this case, in the formula for SSA, above, n = 12 and kn = 36. We need to calculate sums of squares for rats within treatments, $SS_{Rats}$, liver bits within rats within treatments, $SS_{Liverbits}$, and preparations within liver bits within rats within treatments, $SS_{Preparations}$:

$$SS_{Rats} = \frac{\sum R^2}{6} - \frac{\sum C^2}{12},$$

$$SS_{Liverbits} = \frac{\sum L^2}{2} - \frac{\sum R^2}{6},$$

$$SS_{Preparations} = \frac{\sum y^2}{1} - \frac{\sum L^2}{2}.$$

The correction factor at any level is the uncorrected sum of squares from the level above. The last sum of squares could have been computed by difference:

$$SS_{Preparations} = SSY - SSA - SS_{Rats} - SS_{Liverbits}.$$

The F test for equality of the treatment means is the treatment variance divided by the ‘rats within treatment variance’ from the row immediately beneath: F = 778.78/265.89 = 2.928 956, with 2 d.f. in the numerator and 3 d.f. in the denominator (as we obtained in the correct ANOVA, above).

To turn this into a variance components analysis we need to do a little work. The mean squares are converted into variance components like this:

- residuals = preparations within liver bits: unchanged = 21.17,
- liver bits within rats within treatments: (49.5 – 21.17)/2 = 14.165,
- rats within treatments: (265.89 – 49.5)/6 = 36.065.

You divide the difference in variance in going from one spatial scale to the next, by the number of numbers in the level below (i.e. two preparations per liver bit, and six preparations per rat, in this case).

Variance components analysis typically expresses these variances as percentages of the total:

    varcomps <- c(21.17,14.165,36.065)
    100*varcomps/sum(varcomps)
    [1] 29.64986 19.83894 50.51120

illustrating that more than 50% of the random variation is accounted for by differences between the rats. Repeating the experiment using more than six rats would make much more sense than repeating it by cutting up the livers into more pieces. Analysis of the rats data using lmer is explained on p. 703.

11.5 Effect sizes in ANOVA: aov or lm?
The difference between lm and aov is mainly in the form of the output: the summary table with aov is in the traditional form for analysis of variance, with one row for each categorical variable and each interaction term. On the other hand, the summary table for lm produces one row per estimated parameter (i.e. one

row for each factor level and one row for each interaction level). If you have multiple error terms (spatial pseudoreplication) then you must use aov because lm does not support the Error term. Here is a three-way analysis of variance fitted first using aov then using lm:

    daphnia <- read.table("c:\\temp\\Daphnia.txt",header=T)
    attach(daphnia)
    names(daphnia)
    [1] "Growth.rate" "Water" "Detergent" "Daphnia"

    model1 <- aov(Growth.rate~Water*Detergent*Daphnia)
    summary(model1)
                            Df Sum Sq Mean Sq F value   Pr(>F)
    Water                    1   1.99   1.985   2.850 0.097838 .
    Detergent                3   2.21   0.737   1.059 0.375478
    Daphnia                  2  39.18  19.589  28.128 8.23e-09 ***
    Water:Detergent          3   0.17   0.058   0.084 0.968608
    Water:Daphnia            2  13.73   6.866   9.859 0.000259 ***
    Detergent:Daphnia        6  20.60   3.433   4.930 0.000532 ***
    Water:Detergent:Daphnia  6   5.85   0.975   1.399 0.234324
    Residuals               48  33.43   0.696

All three factors are likely to stay in the model because each is involved in at least one significant interaction. We must not be misled by the apparently non-significant main effect for detergent. The three-way interaction is clearly non-significant and can be deleted (p = 0.234). Here is the output from the same example analysed using the linear model function:

    model2 <- lm(Growth.rate~Water*Detergent*Daphnia)
    summary(model2)
    Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
    (Intercept)                              2.81126    0.48181   5.835 4.48e-07 ***
    WaterWear                               -0.15808    0.68138  -0.232  0.81753
    DetergentBrandB                         -0.03536    0.68138  -0.052  0.95883
    DetergentBrandC                          0.47626    0.68138   0.699  0.48794
    DetergentBrandD                         -0.21407    0.68138  -0.314  0.75475
    DaphniaClone2                            0.49637    0.68138   0.728  0.46986
    DaphniaClone3                            2.05526    0.68138   3.016  0.00408 **
    WaterWear:DetergentBrandB                0.46455    0.96361   0.482  0.63193
    WaterWear:DetergentBrandC               -0.27431    0.96361  -0.285  0.77712
    WaterWear:DetergentBrandD                0.21729    0.96361   0.225  0.82255
    WaterWear:DaphniaClone2                  1.38081    0.96361   1.433  0.15835
    WaterWear:DaphniaClone3                  0.43156    0.96361   0.448  0.65627
    DetergentBrandB:DaphniaClone2            0.91892    0.96361   0.954  0.34506
    DetergentBrandC:DaphniaClone2           -0.16337    0.96361  -0.170  0.86609
    DetergentBrandD:DaphniaClone2            1.01209    0.96361   1.050  0.29884
    DetergentBrandB:DaphniaClone3           -0.06490    0.96361  -0.067  0.94658
    DetergentBrandC:DaphniaClone3           -0.80789    0.96361  -0.838  0.40597
    DetergentBrandD:DaphniaClone3           -1.28669    0.96361  -1.335  0.18809
    WaterWear:DetergentBrandB:DaphniaClone2 -1.26380    1.36275  -0.927  0.35837
    WaterWear:DetergentBrandC:DaphniaClone2  1.35612    1.36275   0.995  0.32466
    WaterWear:DetergentBrandD:DaphniaClone2  0.77616    1.36275   0.570  0.57164
    WaterWear:DetergentBrandB:DaphniaClone3 -0.87443    1.36275  -0.642  0.52414
    WaterWear:DetergentBrandC:DaphniaClone3 -1.03019    1.36275  -0.756  0.45337
    WaterWear:DetergentBrandD:DaphniaClone3 -1.55400    1.36275  -1.140  0.25980

    Residual standard error: 0.8345 on 48 degrees of freedom
    Multiple R-squared: 0.7147, Adjusted R-squared: 0.578
    F-statistic: 5.227 on 23 and 48 DF, p-value: 7.019e-07

Note that the two significant interactions from the aov table do not show up in the summary.lm table (Water–Daphnia and Detergent–Daphnia). This is because summary.lm shows treatment contrasts, comparing everything to the Intercept, rather than orthogonal contrasts (see p. 430). This draws attention to the importance of model simplification rather than per-row t tests in assessing statistical significance (i.e. removing the non-significant three-way interaction term in this case). In the aov table, the p values are ‘on deletion’ p values, which is a big advantage.

The main difference is that there are eight rows in the summary.aov table (three main effects, three two-way interactions, one three-way interaction and an error term) but there are 24 rows in the summary.lm table (four levels of detergent by three levels of daphnia clone by two levels of water). You can easily view the output of model1 in linear model layout, or model2 as an ANOVA table using the opposite summary options:

    summary.lm(model1)
    summary.aov(model2)

In complicated designed experiments, it is easiest to summarize the effect sizes with plot.design and model.tables functions. For main effects, use plot.design:

    plot.design(Growth.rate~Water*Detergent*Daphnia)

[Figure: design plot; x axis: Factors (Water, Detergent, Daphnia); y axis: mean of Growth.rate; each factor level is marked at its mean]

This simple graphical device provides a very clear summary of the three sets of main effects. It is no good, however, at illustrating the interactions. The model.tables function takes the name of the fitted model object as its first argument, and you can specify whether you want the standard errors (as you typically would):

    model.tables(model1, "means", se = TRUE)
    Tables of means
    Grand mean
    3.851905

     Water
    Water
     Tyne  Wear
    3.686 4.018

     Detergent
    Detergent
    BrandA BrandB BrandC BrandD
     3.885  4.010  3.955  3.558

     Daphnia
    Daphnia
    Clone1 Clone2 Clone3
     2.840  4.577  4.139

     Water:Detergent
          Detergent
    Water  BrandA BrandB BrandC BrandD
      Tyne  3.662  3.911  3.814  3.356
      Wear  4.108  4.109  4.095  3.760

     Water:Daphnia
          Daphnia
    Water  Clone1 Clone2 Clone3
      Tyne  2.868  3.806  4.383
      Wear  2.812  5.348  3.894

     Detergent:Daphnia
             Daphnia
    Detergent Clone1 Clone2 Clone3
       BrandA  2.732  3.919  5.003
       BrandB  2.929  4.403  4.698
       BrandC  3.071  4.773  4.019
       BrandD  2.627  5.214  2.834

     Water:Detergent:Daphnia
    , , Daphnia = Clone1
          Detergent
    Water  BrandA BrandB BrandC BrandD
      Tyne  2.811  2.776  3.288  2.597
      Wear  2.653  3.082  2.855  2.656

    , , Daphnia = Clone2
          Detergent
    Water  BrandA BrandB BrandC BrandD
      Tyne  3.308  4.191  3.621  4.106
      Wear  4.530  4.615  5.925  6.322

    , , Daphnia = Clone3
          Detergent
    Water  BrandA BrandB BrandC BrandD
      Tyne  4.867  4.766  4.535  3.366
      Wear  5.140  4.630  3.504  2.303

    Standard errors for differences of means
             Water Detergent Daphnia Water:Detergent Water:Daphnia Detergent:Daphnia Water:Detergent:Daphnia
            0.1967    0.2782  0.2409          0.3934        0.3407            0.4818                  0.6814
    replic.     36        18      24               9            12                 6                       3

Note how the standard errors of the differences between two means increase as the replication declines. All the standard errors use the same pooled error variance $s^2 = 0.696$ (see above). For instance, the three-way interactions have $se = \sqrt{2 \times 0.696/3} = 0.681$ and the daphnia main effects have $se = \sqrt{2 \times 0.696/24} = 0.2409$.

Attractive plots of effect sizes can be obtained using the effects library (p. 968).

11.6 Multiple comparisons

One of the cardinal sins is to take a set of samples, search for the sample with the largest mean and the sample with the smallest mean, and then do a t test to compare them. You should not carry out contrasts until the analysis of variance, calculated over the whole set of samples, has indicated that there are significant differences present (i.e. until after the null hypothesis has been rejected). Also, bear in mind that there are just k − 1 orthogonal contrasts when you have a categorical explanatory variable with k levels, so do not carry out more than k − 1 comparisons of means (see p. 430 for discussion of these ideas).

When comparing the multiple means across the levels of a factor, a simple comparison using multiple t tests will inflate the probability of declaring a significant difference when there is none. This is because the intervals are calculated with a given coverage probability for each interval but the interpretation of the coverage is usually with respect to the entire family of intervals (i.e. for the factor as a whole).
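The inflation described above can be illustrated with a little arithmetic. The sketch below makes the simplifying assumption that the tests are independent (pairwise t tests on shared means are not, so the figures are only a rough guide); the function names are invented for this example.

```r
# Probability of at least one false positive among m tests, each at level
# alpha, under the simplifying assumption that the tests are independent
familywise_error <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

# Number of pairwise comparisons among k factor levels
n_pairs <- function(k) choose(k, 2)

k <- c(2, 3, 4, 6)
data.frame(levels      = k,
           comparisons = n_pairs(k),
           familywise  = round(familywise_error(n_pairs(k)), 3))

# Bonferroni-style per-test level that keeps the family at about 0.05
0.05 / n_pairs(6)
```

With only two levels there is a single comparison and the family-wise rate equals the nominal 5%; it climbs quickly as the number of levels, and hence of pairwise comparisons, grows.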
If you follow the protocol of model simplification recommended in this book, then issues of multiple comparisons will not arise very often. An occasional significant t test amongst a bunch of non-significant interaction terms is not likely to survive a deletion test (see p. 437). Again, if you have factors with large numbers of levels you might consider using mixed-effects models rather than ANOVA (i.e. treating the factors as random effects rather than fixed effects; see p. 681). John Tukey introduced intervals based on the range of the sample means rather than the individual differences; nowadays, these are called Tukey’s honest significant differences. The intervals returned by the TukeyHSD function are based on Studentized range statistics. Technically the intervals constructed in this way would only apply to balanced designs where the same number of observations is made at each level of the factor. This function incorporates an adjustment for sample size that produces sensible intervals for mildly unbalanced designs.
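As a minimal sketch of how the TukeyHSD function is used, the example below fits a one-way ANOVA to PlantGrowth, a small dataset that ships with R (it is not one of this book's data files), and extracts the comparisons whose adjusted p value is below 0.05.

```r
# PlantGrowth is a small dataset shipped with R: weight ~ group (3 levels)
fit <- aov(weight ~ group, data = PlantGrowth)

# TukeyHSD() returns a list with one matrix per factor;
# the columns are diff, lwr, upr and 'p adj'
tukey <- TukeyHSD(fit)$group

# One row per pairwise comparison: choose(k, 2) rows for k levels
nrow(tukey) == choose(nlevels(PlantGrowth$group), 2)

# Keep only the pairs whose adjusted p value is below 0.05
significant <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]
rownames(significant)
```

Filtering on the "p adj" column is equivalent to keeping the pairs whose family-wise interval (lwr to upr) excludes zero.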

The following example concerns the yield of fungi gathered from 16 different habitats:

    data <- read.table("c:\\temp\\Fungi.txt",header=T)
    attach(data)
    names(data)
    [1] "Habitat"      "Fungus.yield"

First we establish whether there is any variation in fungus yield to explain:

    model <- aov(Fungus.yield~Habitat)
    summary(model)
                 Df Sum Sq Mean Sq F value Pr(>F)
    Habitat      15   7527   501.8   72.14 <2e-16 ***
    Residuals   144   1002     7.0

Yes, there is (p < 0.000 001). But this is not of much real interest, because it just shows that some habitats produce more fungi than others. We are likely to be interested in which habitats produce significantly more fungi than others. Multiple comparisons are an issue because there are 16 habitats and so there are (16 × 15)/2 = 120 possible pairwise comparisons. There are two options:

- apply the function TukeyHSD to the model to get Tukey’s honest significant differences;
- use the function pairwise.t.test to get adjusted p values for all comparisons.

Here is Tukey’s test in action: it produces a table of p values by default:

    TukeyHSD(model)
      Tukey multiple comparisons of means
        95% family-wise confidence level

    Fit: aov(formula = Fungus.yield ~ Habitat)

    $Habitat
                            diff        lwr        upr     p adj
    Ash-Alder         3.53292777 -0.5808096  7.6466651 0.1844088
    Aspen-Alder      12.78574402  8.6720067 16.8994814 0.0000000
    Beech-Alder      12.32365349  8.2099161 16.4373908 0.0000000
    Birch-Alder      14.11348150  9.9997441 18.2272189 0.0000000
    ...
    ...
    Willow-Rowan     -3.51860059 -7.6323379  0.5951368 0.1896363
    Sycamore-Spruce   4.96019563  0.8464583  9.0739330 0.0044944
    Willow-Spruce     4.92754623  0.8138089  9.0412836 0.0049788
    Willow-Sycamore  -0.03264941 -4.1463868  4.0810879 1.0000000

You can plot the confidence intervals if you prefer (or do both, of course):

    plot(TukeyHSD(model))
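With many habitats the printed table becomes long, so it can help to keep only the significant pairs. The sketch below is one way this might be done. Because Fungi.txt is not available here, it uses the built-in chickwts data as a stand-in (an assumption for illustration), with six feed types and mildly unequal replication.

```r
# Stand-in data (not the fungus data): chick weights under six feed types.
fit <- aov(weight ~ feed, data = chickwts)

k <- nlevels(chickwts$feed)
n_pairs <- choose(k, 2)      # number of pairwise comparisons, k(k - 1)/2

# TukeyHSD returns one matrix per factor: columns diff, lwr, upr, p adj
tuk <- TukeyHSD(fit)$feed

# Keep the pairs with adjusted p < 0.05, smallest p value first
sig <- tuk[tuk[, "p adj"] < 0.05, , drop = FALSE]
sig <- sig[order(sig[, "p adj"]), , drop = FALSE]
print(sig)

# Equivalent criterion: the family-wise interval excludes zero
excludes_zero <- tuk[, "lwr"] > 0 | tuk[, "upr"] < 0
```

The two criteria pick out the same pairs because the adjusted p values and the interval limits are derived from the same Studentized range distribution.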

[Figure: Tukey intervals for each pair of habitats, titled "95% family-wise confidence level"; x-axis: differences in mean levels of Habitat; y-axis: habitat pairs.]

A pair of habitats is significantly different when its interval lies entirely on one side of the dotted line at zero and does not overlap it. Alternatively, you can use the pairwise.t.test function in which you specify the response variable, and then the categorical explanatory variable containing the factor levels you want to be compared, separated by a comma (not a tilde):

    pairwise.t.test(Fungus.yield,Habitat)

        Pairwise comparisons using t tests with pooled SD

    data:  Fungus.yield and Habitat

             Alder    Ash      Aspen    Beech    Birch    Cherry   Chestnut Holmoak  Hornbeam Lime     Oak      Pine
    Ash      0.10011  -        -        -        -        -        -        -        -        -        -        -
    Aspen    < 2e-16  6.3e-11  -        -        -        -        -        -        -        -        -        -
    Beech    < 2e-16  5.4e-10  1.00000  -        -        -        -        -        -        -        -        -
    Birch    < 2e-16  1.2e-13  1.00000  1.00000  -        -        -        -        -        -        -        -
    Cherry   4.7e-13  2.9e-06  0.87474  1.00000  0.04943  -        -        -        -        -        -        -
    Chestnut < 2e-16  7.8e-10  1.00000  1.00000  1.00000  1.00000  -        -        -        -        -        -
    Holmoak  1.00000  0.00181  < 2e-16  < 2e-16  < 2e-16  3.9e-16  < 2e-16  -        -        -        -        -
    Hornbeam 1.1e-13  8.6e-07  1.00000  1.00000  0.10057  1.00000  1.00000  < 2e-16  -        -        -        -
    Lime     < 2e-16  < 2e-16  1.1e-05  1.9e-06  0.00131  3.3e-10  1.4e-06  < 2e-16  1.3e-09  -        -        -
    Oak      < 2e-16  < 2e-16  1.4e-07  2.0e-08  2.7e-