tsa4

Transcript

Springer Texts in Statistics
Robert H. Shumway and David S. Stoffer
Time Series Analysis and Its Applications: With R Examples, Fourth Edition

live free or bark

Preface to the Fourth Edition

The fourth edition follows the general layout of the third edition but includes some modernization of topics as well as the coverage of additional topics. The preface to the third edition—which follows—still applies, so we concentrate on the differences between the two editions here.

As in the third edition, R code for each example is given in the text, even if the code is excruciatingly long. Most of the examples with seemingly endless coding are in the latter chapters. The R package for the text, astsa, is still supported and details may be found in Appendix R. A number of data sets have been updated. For example, the global temperature deviation series have been updated to 2015 and are included in the newest version of the package; the corresponding examples and problems have been updated accordingly.

Chapter 1 of this edition is similar to the previous edition, but we have included the definition of trend stationarity and the concept of prewhitening when using cross-correlation. The New York Stock Exchange data set, which focused on an old financial crisis, was replaced with a more current series of the Dow Jones Industrial Average, which focuses on a newer financial crisis. In Chapter 2, we rewrote some of the regression review and changed the smoothing examples from the mortality data example to the Southern Oscillation Index and finding El Niño. We also expanded the discussion of lagged regression to Chapter 3 to include the possibility of autocorrelated errors.

In Chapter 3, we removed normality from the definition of ARMA models; while the assumption is not necessary for the definition, it is essential for inference and prediction. We added a section on regression with ARMA errors and the corresponding problems; this section was previously in Chapter 5. Some of the examples have been modified and we added some examples in the seasonal ARMA section. In Chapter 4, we improved and added some examples. The idea of modulated series is discussed using the classic star magnitude data set. We moved some of the filtering section forward for easier access to information when needed. We removed the reliance on spec.pgram (from the stats package) in favor of mvspec (from the astsa package) so we can avoid having to spend pages explaining the quirks of spec.pgram, which tended to take over the narrative. The section on wavelets was removed because

there are so many accessible texts available. The spectral representation theorems are discussed in a little more detail using examples based on simple harmonic processes.

The general layout of Chapter 5 and of Chapter 7 is the same, although we have revised some of the examples. As previously mentioned, we moved regression with ARMA errors to Chapter 3.

Chapter 6 sees the biggest change in this edition. We have added a section on smoothing splines, and a section on hidden Markov models and switching autoregressions. The Bayesian section is completely rewritten and is on linear Gaussian state space models only. The nonlinear material in the previous edition is removed because it was old, and the newer material is in Douc, Moulines, and Stoffer (2014). Many of the examples have been rewritten to make the chapter more accessible. Our goal was to be able to have a course on state space models based primarily on the material in Chapter 6.

The Appendices are similar, with some minor changes to Appendix A and Appendix B. We added material to Appendix C, including a discussion of Riemann–Stieltjes and stochastic integration, a proof of the fact that the spectra of autoregressive processes are dense in the space of spectral densities, and a proof of the fact that spectra are approximately the eigenvalues of the covariance matrix of a stationary process.

We tweaked, rewrote, improved, or revised some of the exercises, but the overall ordering and coverage is roughly the same. And, of course, we moved regression with ARMA errors problems to Chapter 3 and removed the Chapter 4 wavelet problems. The exercises for Chapter 6 have been updated accordingly to reflect the new and improved version of the chapter.

Robert H. Shumway, Davis, CA
David S. Stoffer, Pittsburgh, PA
September 2016

Preface to the Third Edition

The goals of this book are to develop an appreciation for the richness and versatility of modern time series analysis as a tool for analyzing data, and still maintain a commitment to theoretical integrity, as exemplified by the seminal works of Brillinger (1975) and Hannan (1970) and the texts by Brockwell and Davis (1991) and Fuller (1995). The advent of inexpensive powerful computing has provided both real data and new software that can take one considerably beyond the fitting of simple time domain models, such as have been elegantly described in the landmark work of Box and Jenkins (1970). This book is designed to be useful as a text for courses in time series on several different levels and as a reference work for practitioners facing the analysis of time-correlated data in the physical, biological, and social sciences.

We have used earlier versions of the text at both the undergraduate and graduate levels over the past decade. Our experience is that an undergraduate course can be accessible to students with a background in regression analysis and may include Section 1.1–Section 1.5, Section 2.1–Section 2.3, the results and numerical parts of Section 3.1–Section 3.9, and briefly the results and numerical parts of Section 4.1–Section 4.4. At the advanced undergraduate or master's level, where the students have some mathematical statistics background, more detailed coverage of the same sections, with the inclusion of extra topics from Chapter 5 or Chapter 6, can be used as a one-semester course. Often, the extra topics are chosen by the students according to their interests. Finally, a two-semester upper-level graduate course for mathematics, statistics, and engineering graduate students can be crafted by adding selected theoretical appendices. For the upper-level graduate course, we should mention that we are striving for a broader but less rigorous level of coverage than that which is attained by Brockwell and Davis (1991), the classic entry at this level.

The major difference between this third edition of the text and the second edition is that we provide R code for almost all of the numerical examples. An R package called astsa is provided for use with the text; see Section R.2 for details. R code is provided simply to enhance the exposition by making the numerical examples reproducible. We have tried, where possible, to keep the problem sets in order so that an instructor may have an easy time moving from the second edition to the third edition.

However, some of the old problems have been revised and there are some new problems. Also, some of the data sets have been updated. We added one section in Chapter 5 on unit roots and enhanced some of the presentations throughout the text. The exposition on state-space modeling, ARMAX models, and (multivariate) regression with autocorrelated errors in Chapter 6 has been expanded. In this edition, we use standard R functions as much as possible, but we use our own scripts (included in astsa) when we feel it is necessary to avoid problems with a particular R function; these problems are discussed in detail on the website for the text under R Issues.

We thank John Kimmel, Executive Editor, Springer Statistics, for his guidance in the preparation and production of this edition of the text. We are grateful to Don Percival, University of Washington, for numerous suggestions that led to substantial improvement to the presentation in the second edition, and consequently in this edition. We thank Doug Wiens, University of Alberta, for help with some of the R code in Chapter 4 and Chapter 7, and for his many suggestions for improvement of the exposition. We are grateful for the continued help and advice of Pierre Duchesne, University of Montreal, and Alexander Aue, University of California, Davis. We also thank the many students and other readers who took the time to mention typographical errors and other corrections to the first and second editions. Finally, work on this edition was supported by the National Science Foundation while one of us (D.S.S.) was working at the Foundation under the Intergovernmental Personnel Act.

Robert H. Shumway, Davis, CA
David S. Stoffer, Pittsburgh, PA
September 2010

Contents

Preface to the Fourth Edition
Preface to the Third Edition

1 Characteristics of Time Series
  1.1 The Nature of Time Series Data
  1.2 Time Series Statistical Models
  1.3 Measures of Dependence
  1.4 Stationary Time Series
  1.5 Estimation of Correlation
  1.6 Vector-Valued and Multidimensional Series
  Problems

2 Time Series Regression and Exploratory Data Analysis
  2.1 Classical Regression in the Time Series Context
  2.2 Exploratory Data Analysis
  2.3 Smoothing in the Time Series Context
  Problems

3 ARIMA Models
  3.1 Autoregressive Moving Average Models
  3.2 Difference Equations
  3.3 Autocorrelation and Partial Autocorrelation
  3.4 Forecasting
  3.5 Estimation
  3.6 Integrated Models for Nonstationary Data
  3.7 Building ARIMA Models
  3.8 Regression with Autocorrelated Errors
  3.9 Multiplicative Seasonal ARIMA Models
  Problems

4 Spectral Analysis and Filtering
  4.1 Cyclical Behavior and Periodicity
  4.2 The Spectral Density
  4.3 Periodogram and Discrete Fourier Transform
  4.4 Nonparametric Spectral Estimation
  4.5 Parametric Spectral Estimation
  4.6 Multiple Series and Cross-Spectra
  4.7 Linear Filters
  4.8 Lagged Regression Models
  4.9 Signal Extraction and Optimum Filtering
  4.10 Spectral Analysis of Multidimensional Series
  Problems

5 Additional Time Domain Topics
  5.1 Long Memory ARMA and Fractional Differencing
  5.2 Unit Root Testing
  5.3 GARCH Models
  5.4 Threshold Models
  5.5 Lagged Regression and Transfer Function Modeling
  5.6 Multivariate ARMAX Models
  Problems

6 State Space Models
  6.1 Linear Gaussian Model
  6.2 Filtering, Smoothing, and Forecasting
  6.3 Maximum Likelihood Estimation
  6.4 Missing Data Modifications
  6.5 Structural Models: Signal Extraction and Forecasting
  6.6 State-Space Models with Correlated Errors
    6.6.1 ARMAX Models
    6.6.2 Multivariate Regression with Autocorrelated Errors
  6.7 Bootstrapping State Space Models
  6.8 Smoothing Splines and the Kalman Smoother
  6.9 Hidden Markov Models and Switching Autoregression
  6.10 Dynamic Linear Models with Switching
  6.11 Stochastic Volatility
  6.12 Bayesian Analysis of State Space Models
  Problems

7 Statistical Methods in the Frequency Domain
  7.1 Introduction
  7.2 Spectral Matrices and Likelihood Functions
  7.3 Regression for Jointly Stationary Series
  7.4 Regression with Deterministic Inputs
  7.5 Random Coefficient Regression
  7.6 Analysis of Designed Experiments
  7.7 Discriminant and Cluster Analysis
  7.8 Principal Components and Factor Analysis
  7.9 The Spectral Envelope
  Problems

Appendix A Large Sample Theory
  A.1 Convergence Modes
  A.2 Central Limit Theorems
  A.3 The Mean and Autocorrelation Functions

Appendix B Time Domain Theory
  B.1 Hilbert Spaces and the Projection Theorem
  B.2 Causal Conditions for ARMA Models
  B.3 Large Sample Distribution of the AR Conditional Least Squares Estimators
  B.4 The Wold Decomposition

Appendix C Spectral Domain Theory
  C.1 Spectral Representation Theorems
  C.2 Large Sample Distribution of the Smoothed Periodogram
  C.3 The Complex Multivariate Normal Distribution
  C.4 Integration
    C.4.1 Riemann–Stieltjes Integration
    C.4.2 Stochastic Integration
  C.5 Spectral Analysis as Principal Component Analysis
  C.6 Parametric Spectral Estimation

Appendix R R Supplement
  R.1 First Things First
  R.2 astsa
  R.3 Getting Started
  R.4 Time Series Primer
    R.4.1 Graphics

References
Index


Chapter 1
Characteristics of Time Series

The analysis of experimental data that have been observed at different points in time leads to new and unique problems in statistical modeling and inference. The obvious correlation introduced by the sampling of adjacent points in time can severely restrict the applicability of the many conventional statistical methods traditionally dependent on the assumption that these adjacent observations are independent and identically distributed. The systematic approach by which one goes about answering the mathematical and statistical questions posed by these time correlations is commonly referred to as time series analysis.

The impact of time series analysis on scientific applications can be partially documented by producing an abbreviated listing of the diverse fields in which important time series problems may arise. For example, many familiar time series occur in the field of economics, where we are continually exposed to daily stock market quotations or monthly unemployment figures. Social scientists follow population series, such as birthrates or school enrollments. An epidemiologist might be interested in the number of influenza cases observed over some time period. In medicine, blood pressure measurements traced over time could be useful for evaluating drugs used in treating hypertension. Functional magnetic resonance imaging of brain-wave time series patterns might be used to study how the brain reacts to certain stimuli under various experimental conditions.

In our view, the first step in any time series investigation always involves careful examination of the recorded data plotted over time. This scrutiny often suggests the method of analysis as well as statistics that will be of use in summarizing the information in the data. Before looking more closely at the particular statistical methods, it is appropriate to mention that two separate, but not necessarily mutually exclusive, approaches to time series analysis exist, commonly identified as the time domain approach and the frequency domain approach. The time domain approach views the investigation of lagged relationships as most important (e.g., how does what happened today affect what will happen tomorrow), whereas the frequency domain approach views the investigation of cycles as most important (e.g., what is the economic cycle through periods of expansion and recession). We will explore both types of approaches in the following sections.

Fig. 1.1. Johnson & Johnson quarterly earnings per share, 84 quarters, 1960-I to 1980-IV.

1.1 The Nature of Time Series Data

Some of the problems and questions of interest to the prospective time series analyst can best be exposed by considering real experimental data taken from different subject areas. The following cases illustrate some of the common kinds of experimental time series data as well as some of the statistical questions that might be asked about such data.

Example 1.1 Johnson & Johnson Quarterly Earnings
Figure 1.1 shows quarterly earnings per share for the U.S. company Johnson & Johnson, furnished by Professor Paul Griffin (personal communication) of the Graduate School of Management, University of California, Davis. There are 84 quarters (21 years) measured from the first quarter of 1960 to the last quarter of 1980. Modeling such series begins by observing the primary patterns in the time history. In this case, note the gradually increasing underlying trend and the rather regular variation superimposed on the trend that seems to repeat over quarters. Methods for analyzing data such as these are explored in Chapter 2 and Chapter 6. To plot the data using the R statistical package, type the following:
library(astsa)   # SEE THE FOOTNOTE
plot(jj, type="o", ylab="Quarterly Earnings per Share")

Footnote 1.1: Throughout the text, we assume that the R package for the book, astsa, has been installed and loaded. See Section R.2 for further details.

Example 1.2 Global Warming
Consider the global temperature series record shown in Figure 1.2. The data are the global mean land–ocean temperature index from 1880 to 2015, with the base period 1951–1980. In particular, the data are deviations, measured in degrees centigrade, from the 1951–1980 average, and are an update of Hansen et al. (2006). We note an apparent upward trend in the series during the latter part of the twentieth century that has been used as an argument for the global warming hypothesis. Note also the leveling off at about 1935 and then another rather sharp upward trend at about 1970.

Fig. 1.2. Yearly average global temperature deviations (1880–2015) in degrees centigrade.

Fig. 1.3. Speech recording of the syllable aaa···hhh with n = 1020 points, sampled at 10,000 points per second.

The question of interest for global warming proponents and opponents is whether the overall trend is natural or whether it is caused by some human-induced interference. Problem 2.8 examines 634 years of glacial sediment data that might be taken as a long-term temperature proxy. Such percentage changes in temperature do not seem to be unusual over a time period of 100 years. Again, the question of trend is of more interest than particular periodicities. The R code for this example is similar to the code in Example 1.1:
plot(globtemp, type="o", ylab="Global Temperature Deviations")

Example 1.3 Speech Data
Figure 1.3 shows a small .1 second (1000 point) sample of recorded speech for the phrase aaa···hhh, and we note the repetitive nature of the signal and the rather regular periodicities. One current problem of great interest is computer recognition of speech, which would require converting this particular signal into the recorded phrase aaa···hhh. Spectral analysis can be used in this context to produce a signature of this phrase that can be compared with signatures of various library syllables to look for a match.

Fig. 1.4. The daily returns of the Dow Jones Industrial Average (DJIA) from April 20, 2006 to April 20, 2016.

One can immediately notice the rather regular repetition of small wavelets. The separation between the packets is known as the pitch period and represents the response of the vocal tract filter to a periodic sequence of pulses stimulated by the opening and closing of the glottis. In R, you can reproduce Figure 1.3 using plot(speech).

Example 1.4 Dow Jones Industrial Average
As an example of financial time series data, Figure 1.4 shows the daily returns (or percent change) of the Dow Jones Industrial Average (DJIA) from April 20, 2006 to April 20, 2016. It is easy to spot the financial crisis of 2008 in the figure. The data shown in Figure 1.4 are typical of return data. The mean of the series appears to be stable with an average return of approximately zero; however, highly volatile (variable) periods tend to be clustered together. A problem in the analysis of these types of financial data is to forecast the volatility of future returns. Models such as ARCH and GARCH models (Engle, 1982; Bollerslev, 1986) and stochastic volatility models (Harvey, Ruiz and Shephard, 1994) have been developed to handle these problems. We will discuss these models and the analysis of financial data in Chapter 5 and Chapter 6. The data were obtained using the Technical Trading Rules (TTR) package to download the data from Yahoo™ and then plot it. We then used the fact that if x_t is the actual value of the DJIA and r_t = (x_t − x_{t−1})/x_{t−1} is the return, then 1 + r_t = x_t/x_{t−1} and

  log(1 + r_t) = log(x_t/x_{t−1}) = log(x_t) − log(x_{t−1}) ≈ r_t.

Footnote 1.2: log(1 + p) = p − p²/2 + p³/3 − ··· for −1 < p ≤ 1. If p is near zero, the higher-order terms in the expansion are negligible.

The data set is also available in astsa, but xts must be loaded.
# library(TTR)
# djia = getYahooData("^DJI", start=20060420, end=20160420, freq="daily")
library(xts)
djiar = diff(log(djia$Close))[-1]   # approximate returns
plot(djiar, main="DJIA Returns", type="n")
lines(djiar)
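As a quick numerical aside (not from the text), the approximation log(1 + r_t) ≈ r_t used in Example 1.4 can be checked on a small made-up price series; the prices below are hypothetical and only illustrate that actual returns and log-differenced prices agree closely when returns are small:
p  = c(100, 101.5, 100.8, 102.3, 102.0)   # hypothetical prices
r  = diff(p)/p[-length(p)]                # actual returns (x_t - x_{t-1}) / x_{t-1}
lr = diff(log(p))                         # log returns, log(x_t) - log(x_{t-1})
cbind(r, lr, difference = r - lr)         # differences are on the order of 1e-4 here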

Fig. 1.5. Monthly SOI and Recruitment (estimated new fish), 1950–1987.

Example 1.5 El Niño and Fish Population
We may also be interested in analyzing several time series at once. Figure 1.5 shows monthly values of an environmental series called the Southern Oscillation Index (SOI) and associated Recruitment (number of new fish) furnished by Dr. Roy Mendelssohn of the Pacific Environmental Fisheries Group (personal communication). Both series are for a period of 453 months ranging over the years 1950–1987. The SOI measures changes in air pressure related to sea surface temperatures in the central Pacific Ocean. The central Pacific warms every three to seven years due to the El Niño effect, which has been blamed for various global extreme weather events. Both series in Figure 1.5 exhibit repetitive behavior, with regularly repeating cycles that are easily visible. This periodic behavior is of interest because underlying processes of interest may be regular and the rate or frequency of oscillation characterizing the behavior of the underlying series would help to identify them. The series show two basic oscillation types, an obvious annual cycle (hot in the summer, cold in the winter), and a slower frequency that seems to repeat about every 4 years. The study of the kinds of cycles and their strengths is the subject of Chapter 4. The two series are also related; it is easy to imagine the fish population is dependent on the ocean temperature. This possibility suggests trying some version of regression analysis as a procedure for relating the two series.

Fig. 1.6. fMRI data from various locations in the cortex, thalamus, and cerebellum; n = 128 points, one observation taken every 2 seconds.

Transfer function modeling, as considered in Chapter 5, can also be applied in this case. The following R code will reproduce Figure 1.5:
par(mfrow = c(2,1))   # set up the graphics
plot(soi, ylab="", xlab="", main="Southern Oscillation Index")
plot(rec, ylab="", xlab="", main="Recruitment")

Example 1.6 fMRI Imaging
A fundamental problem in classical statistics occurs when we are given a collection of independent series or vectors of series, generated under varying experimental conditions or treatment configurations. Such a set of series is shown in Figure 1.6, where we observe data collected from various locations in the brain via functional magnetic resonance imaging (fMRI). In this example, five subjects were given periodic brushing on the hand. The stimulus was applied for 32 seconds and then stopped for 32 seconds; thus, the signal period is 64 seconds. The sampling rate was one observation every 2 seconds for 256 seconds (n = 128). For this example, we averaged the results over subjects (these were evoked responses, and all subjects were in phase). The series shown in Figure 1.6 are consecutive measures of blood oxygenation-level dependent (BOLD) signal intensity, which measures areas of activation in the brain. Notice that the periodicities appear strongly in the motor cortex series and less strongly in the thalamus and cerebellum.

Fig. 1.7. Arrival phases from an earthquake (top) and explosion (bottom) at 40 points per second.

The fact that one has series from different areas of the brain suggests testing whether the areas are responding differently to the brush stimulus. Analysis of variance techniques accomplish this in classical statistics, and we show in Chapter 7 how these classical techniques extend to the time series case, leading to a spectral analysis of variance. The following R commands can be used to plot the data:
par(mfrow=c(2,1))
ts.plot(fmri1[,2:5], col=1:4, ylab="BOLD", main="Cortex")
ts.plot(fmri1[,6:9], col=1:4, ylab="BOLD", main="Thalamus & Cerebellum")

Example 1.7 Earthquakes and Explosions
As a final example, the series in Figure 1.7 represent two phases or arrivals along the surface, denoted by P (t = 1, ..., 1024) and S (t = 1025, ..., 2048), at a seismic recording station. The recording instruments in Scandinavia are observing earthquakes and mining explosions with one of each shown in Figure 1.7. The general problem of interest is in distinguishing or discriminating between waveforms generated by earthquakes and those generated by explosions. Features that may be important are the rough amplitude ratios of the first phase P to the second phase S, which tend to be smaller for earthquakes than for explosions.

In the case of the two events in Figure 1.7, the ratio of maximum amplitudes appears to be somewhat less than .5 for the earthquake and about 1 for the explosion. Otherwise, note that a subtle difference exists in the periodic nature of the S phase for the earthquake. We can again think about spectral analysis of variance for testing the equality of the periodic components of earthquakes and explosions. We would also like to be able to classify future P and S components from events of unknown origin, leading to the time series discriminant analysis developed in Chapter 7. To plot the data as in this example, use the following commands in R:
par(mfrow=c(2,1))
plot(EQ5, main="Earthquake")
plot(EXP6, main="Explosion")

1.2 Time Series Statistical Models

The primary objective of time series analysis is to develop mathematical models that provide plausible descriptions for sample data, like that encountered in the previous section. In order to provide a statistical setting for describing the character of data that seemingly fluctuate in a random fashion over time, we assume a time series can be defined as a collection of random variables indexed according to the order they are obtained in time. For example, we may consider a time series as a sequence of random variables, x_1, x_2, x_3, ..., where the random variable x_1 denotes the value taken by the series at the first time point, the variable x_2 denotes the value for the second time period, x_3 denotes the value for the third time period, and so on. In general, a collection of random variables, {x_t}, indexed by t, is referred to as a stochastic process. In this text, t will typically be discrete and vary over the integers t = 0, ±1, ±2, ..., or some subset of the integers. The observed values of a stochastic process are referred to as a realization of the stochastic process. Because it will be clear from the context of our discussions, we use the term time series whether we are referring generically to the process or to a particular realization and make no notational distinction between the two concepts.

It is conventional to display a sample time series graphically by plotting the values of the random variables on the vertical axis, or ordinate, with the time scale as the abscissa. It is usually convenient to connect the values at adjacent time periods to reconstruct visually some original hypothetical continuous time series that might have produced these values as a discrete sample. Many of the series discussed in the previous section, for example, could have been observed at any continuous point in time and are conceptually more properly treated as continuous time series. The approximation of these series by discrete time parameter series sampled at equally spaced points in time is simply an acknowledgment that sampled data will, for the most part, be discrete because of restrictions inherent in the method of collection. Furthermore, the analysis techniques are then feasible using computers, which are limited to digital computations. Theoretical developments also rest on the idea that a continuous parameter time series should be specified in terms of finite-dimensional distribution functions defined over a finite number of points in time.

This is not to say that the selection of the sampling interval or rate is not an extremely important consideration. The appearance of data can be changed completely by adopting an insufficient sampling rate. We have all seen wheels in movies appear to be turning backwards because of the insufficient number of frames sampled by the camera. This phenomenon leads to a distortion called aliasing (see Section 4.1).

The fundamental visual characteristic distinguishing the different series shown in Example 1.1–Example 1.7 is their differing degrees of smoothness. One possible explanation for this smoothness is that it is being induced by the supposition that adjacent points in time are correlated, so the value of the series at time t, say, x_t, depends in some way on the past values x_{t−1}, x_{t−2}, .... This model expresses a fundamental way in which we might think about generating realistic-looking time series. To begin to develop an approach to using collections of random variables to model time series, consider Example 1.8.

Example 1.8 White Noise (3 flavors)
A simple kind of generated series might be a collection of uncorrelated random variables, w_t, with mean 0 and finite variance σ_w². The time series generated from uncorrelated variables is used as a model for noise in engineering applications, where it is called white noise; we shall denote this process as w_t ∼ wn(0, σ_w²). The designation white originates from the analogy with white light and indicates that all possible periodic oscillations are present with equal strength.

We will sometimes require the noise to be independent and identically distributed (iid) random variables with mean 0 and variance σ_w². We distinguish this by writing w_t ∼ iid(0, σ_w²) or by saying white independent noise or iid noise. A particularly useful white noise series is Gaussian white noise, wherein the w_t are independent normal random variables, with mean 0 and variance σ_w²; or more succinctly, w_t ∼ iid N(0, σ_w²). Figure 1.8 shows in the upper panel a collection of 500 such random variables, with σ_w² = 1, plotted in the order in which they were drawn. The resulting series bears a slight resemblance to the explosion in Figure 1.7 but is not smooth enough to serve as a plausible model for any of the other experimental series. The plot tends to show visually a mixture of many different kinds of oscillations in the white noise series.

If the stochastic behavior of all time series could be explained in terms of the white noise model, classical statistical methods would suffice. Two ways of introducing serial correlation and more smoothness into time series models are given in Example 1.9 and Example 1.10.

Example 1.9 Moving Averages and Filtering
We might replace the white noise series w_t by a moving average that smooths the series. For example, consider replacing w_t in Example 1.8 by an average of its current value and its immediate neighbors in the past and future. That is, let

  v_t = (1/3)(w_{t−1} + w_t + w_{t+1}),    (1.1)

which leads to the series shown in the lower panel of Figure 1.8.

Fig. 1.8. Gaussian white noise series (top) and three-point moving average of the Gaussian white noise series (bottom).

Inspecting the series shows a smoother version of the first series, reflecting the fact that the slower oscillations are more apparent and some of the faster oscillations are taken out. We begin to notice a similarity to the SOI in Figure 1.5, or perhaps, to some of the fMRI series in Figure 1.6.

A linear combination of values in a time series such as in (1.1) is referred to, generically, as a filtered series; hence the filter command in the following code for Figure 1.8.
w = rnorm(500,0,1)                          # 500 N(0,1) variates
v = filter(w, sides=2, filter=rep(1/3,3))   # moving average
par(mfrow=c(2,1))
plot.ts(w, main="white noise")
plot.ts(v, ylim=c(-3,3), main="moving average")

The speech series in Figure 1.3 and the Recruitment series in Figure 1.5, as well as some of the fMRI series in Figure 1.6, differ from the moving average series because one particular kind of oscillatory behavior seems to predominate, producing a sinusoidal type of behavior. A number of methods exist for generating series with this quasi-periodic behavior; we illustrate a popular one based on the autoregressive model considered in Chapter 3.

Fig. 1.9. Autoregressive series generated from model (1.2).

Example 1.10 Autoregressions
Suppose we consider the white noise series w_t of Example 1.8 as input and calculate the output using the second-order equation

  x_t = x_{t−1} − .9 x_{t−2} + w_t    (1.2)

successively for t = 1, 2, ..., 500. Equation (1.2) represents a regression or prediction of the current value x_t of a time series as a function of the past two values of the series, and, hence, the term autoregression is suggested for this model. A problem with startup values exists here because (1.2) also depends on the initial conditions x_0 and x_{−1}, but assuming we have the values, we generate the succeeding values by substituting into (1.2). The resulting output series is shown in Figure 1.9, and we note the periodic behavior of the series, which is similar to that displayed by the speech series in Figure 1.3. The autoregressive model above and its generalizations can be used as an underlying model for many observed series and will be studied in detail in Chapter 3.

As in the previous example, the data are obtained by a filter of white noise. The function filter uses zeros for the initial values. In this case, x_1 = w_1, and x_2 = x_1 + w_2 = w_1 + w_2, and so on, so that the values do not satisfy (1.2). An easy fix is to run the filter for longer than needed and remove the initial values.
w = rnorm(550,0,1)                                            # 50 extra to avoid startup problems
x = filter(w, filter=c(1,-.9), method="recursive")[-(1:50)]   # remove first 50
plot.ts(x, main="autoregression")

Example 1.11 Random Walk with Drift
A model for analyzing trend such as seen in the global temperature data in Figure 1.2, is the random walk with drift model given by

  x_t = δ + x_{t−1} + w_t    (1.3)

for t = 1, 2, ..., with initial condition x_0 = 0, and where w_t is white noise. The constant δ is called the drift, and when δ = 0, (1.3) is called simply a random walk.

Fig. 1.10. Random walk, σ_w = 1, with drift δ = .2 (upper jagged line), without drift, δ = 0 (lower jagged line), and straight (dashed) lines with slope δ.

The term random walk comes from the fact that, when δ = 0, the value of the time series at time t is the value of the series at time t − 1 plus a completely random movement determined by w_t. Note that we may rewrite (1.3) as a cumulative sum of white noise variates. That is,

  x_t = δ t + Σ_{j=1}^t w_j    (1.4)

for t = 1, 2, ...; either use induction, or plug (1.4) into (1.3) to verify this statement. Figure 1.10 shows 200 observations generated from the model with δ = 0 and .2, and with σ_w = 1. For comparison, we also superimposed the straight line .2t on the graph. To reproduce Figure 1.10 in R use the following code (notice the use of multiple commands per line using a semicolon).
set.seed(154)                    # so you can reproduce the results
w = rnorm(200); x = cumsum(w)    # two commands in one line
wd = w + .2; xd = cumsum(wd)
plot.ts(xd, ylim=c(-5,55), main="random walk", ylab='')
lines(x, col=4); abline(h=0, col=4, lty=2); abline(a=0, b=.2, lty=2)

Example 1.12 Signal in Noise
Many realistic models for generating time series assume an underlying signal with some consistent periodic variation, contaminated by adding a random noise. For example, it is easy to detect the regular cycle in the fMRI series displayed on the top of Figure 1.6. Consider the model

  x_t = 2 cos(2π (t + 15)/50) + w_t    (1.5)

for t = 1, 2, ..., 500, where the first term is regarded as the signal, shown in the upper panel of Figure 1.11. We note that a sinusoidal waveform can be written as

  A cos(2πω t + φ),    (1.6)

Fig. 1.11. Cosine wave with period 50 points (top panel) compared with the cosine wave contaminated with additive white Gaussian noise, σ_w = 1 (middle panel) and σ_w = 5 (bottom panel); see (1.5).

where A is the amplitude, ω is the frequency of oscillation, and φ is a phase shift. In (1.5), A = 2, ω = 1/50 (one cycle every 50 time points), and φ = 2π (15)/50 = .6π.

An additive noise term was taken to be white noise with σ_w = 1 (middle panel) and σ_w = 5 (bottom panel), drawn from a normal distribution. Adding the two together obscures the signal, as shown in the lower panels of Figure 1.11. Of course, the degree to which the signal is obscured depends on the amplitude of the signal and the size of σ_w. The ratio of the amplitude of the signal to σ_w (or some function of the ratio) is sometimes called the signal-to-noise ratio (SNR); the larger the SNR, the easier it is to detect the signal. Note that the signal is easily discernible in the middle panel of Figure 1.11, whereas the signal is obscured in the bottom panel. Typically, we will not observe the signal but the signal obscured by noise. To reproduce Figure 1.11 in R, use the following commands:
cs = 2*cos(2*pi*1:500/50 + .6*pi); w = rnorm(500,0,1)
par(mfrow=c(3,1), mar=c(3,2,2,1), cex.main=1.5)
plot.ts(cs, main=expression(2*cos(2*pi*t/50+.6*pi)))
plot.ts(cs+w, main=expression(2*cos(2*pi*t/50+.6*pi) + N(0,1)))
plot.ts(cs+5*w, main=expression(2*cos(2*pi*t/50+.6*pi) + N(0,25)))

In Chapter 4, we will study the use of spectral analysis as a possible technique for detecting regular or periodic signals, such as the one described in Example 1.12. In general, we would emphasize the importance of simple additive models such as given above in the form

  x_t = s_t + v_t,    (1.7)

where s_t denotes some unknown signal and v_t denotes a time series that may be white or correlated over time. The problems of detecting a signal and then estimating or extracting the waveform of s_t are of great interest in many areas of engineering and the physical and biological sciences. In economics, the underlying signal may be a trend or it may be a seasonal component of a series. Models such as (1.7), where the signal has an autoregressive structure, form the motivation for the state-space model of Chapter 6.

In the above examples, we have tried to motivate the use of various combinations of random variables emulating real time series data. Smoothness characteristics of observed time series were introduced by combining the random variables in various ways. Averaging independent random variables over adjacent time points, as in Example 1.9, or looking at the output of difference equations that respond to white noise inputs, as in Example 1.10, are common ways of generating correlated data. In the next section, we introduce various theoretical measures used for describing how time series behave. As is usual in statistics, the complete description involves the multivariate distribution function of the jointly sampled values x_1, x_2, ..., x_n, whereas more economical descriptions can be had in terms of the mean and autocorrelation functions. Because correlation is an essential feature of time series analysis, the most useful descriptive measures are those expressed in terms of covariance and correlation functions.

1.3 Measures of Dependence

A complete description of a time series, observed as a collection of n random variables at arbitrary time points t_1, t_2, ..., t_n, for any positive integer n, is provided by the joint distribution function, evaluated as the probability that the values of the series are jointly less than the n constants, c_1, c_2, ..., c_n; i.e.,

  F_{t_1, t_2, ..., t_n}(c_1, c_2, ..., c_n) = Pr(x_{t_1} ≤ c_1, x_{t_2} ≤ c_2, ..., x_{t_n} ≤ c_n).    (1.8)

Unfortunately, these multidimensional distribution functions cannot usually be written easily unless the random variables are jointly normal, in which case the joint density has the well-known form displayed in (1.33).

Although the joint distribution function describes the data completely, it is an unwieldy tool for displaying and analyzing time series data. The distribution function (1.8) must be evaluated as a function of n arguments, so any plotting of the corresponding multivariate density functions is virtually impossible. The marginal distribution functions

  F_t(x) = P{x_t ≤ x}

or the corresponding marginal density functions

  f_t(x) = ∂F_t(x)/∂x,

when they exist, are often informative for examining the marginal behavior of a series. Another informative marginal descriptive measure is the mean function.

Footnote 1.3: If x_t is Gaussian with mean μ_t and variance σ_t², abbreviated as x_t ∼ N(μ_t, σ_t²), the marginal density is given by f_t(x) = (1/(σ_t √(2π))) exp{ −(x − μ_t)²/(2σ_t²) }, x ∈ R.

Definition 1.1 The mean function is defined as

  μ_xt = E(x_t) = ∫_{−∞}^{∞} x f_t(x) dx,    (1.9)

provided it exists, where E denotes the usual expected value operator. When no confusion exists about which time series we are referring to, we will drop a subscript and write μ_xt as μ_t.

Example 1.13 Mean Function of a Moving Average Series
If w_t denotes a white noise series, then μ_wt = E(w_t) = 0 for all t. The top series in Figure 1.8 reflects this, as the series clearly fluctuates around a mean value of zero. Smoothing the series as in Example 1.9 does not change the mean because we can write

  μ_vt = E(v_t) = (1/3)[E(w_{t−1}) + E(w_t) + E(w_{t+1})] = 0.

Example 1.14 Mean Function of a Random Walk with Drift
Consider the random walk with drift model given in (1.4),

  x_t = δ t + Σ_{j=1}^t w_j,    t = 1, 2, ....

Because E(w_t) = 0 for all t, and δ is a constant, we have

  μ_xt = E(x_t) = δ t + Σ_{j=1}^t E(w_j) = δ t,

which is a straight line with slope δ. A realization of a random walk with drift can be compared to its mean function in Figure 1.10.
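As a quick numerical illustration (not from the text), the mean function in Example 1.14 can be checked by simulation: averaging many independent realizations of a random walk with drift at each time point should track μ_t = δt. The settings below (500 realizations, δ = .2, σ_w = 1, n = 200) are arbitrary choices for the sketch:
set.seed(1)
nsim = 500; n = 200; delta = .2
x = replicate(nsim, cumsum(delta + rnorm(n)))   # each column is one realization of (1.4)
plot.ts(rowMeans(x), ylab="average of realizations")
lines(delta*(1:n), lty=2)                       # the mean function mu_t = delta * t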

Example 1.15 Mean Function of Signal Plus Noise
A great many practical applications depend on assuming the observed data have been generated by a fixed signal waveform superimposed on a zero-mean noise process, leading to an additive signal model of the form (1.5). It is clear, because the signal in (1.5) is a fixed function of time, we will have

  μ_xt = E(x_t) = E[2 cos(2π (t + 15)/50) + w_t]
       = 2 cos(2π (t + 15)/50) + E(w_t)
       = 2 cos(2π (t + 15)/50),

and the mean function is just the cosine wave.

The lack of independence between two adjacent values x_s and x_t can be assessed numerically, as in classical statistics, using the notions of covariance and correlation. Assuming the variance of x_t is finite, we have the following definition.

Definition 1.2 The autocovariance function is defined as the second moment product

  γ_x(s, t) = cov(x_s, x_t) = E[(x_s − μ_s)(x_t − μ_t)],    (1.10)

for all s and t. When no possible confusion exists about which time series we are referring to, we will drop the subscript and write γ_x(s, t) as γ(s, t). Note that γ_x(s, t) = γ_x(t, s) for all time points s and t.

The autocovariance measures the linear dependence between two points on the same series observed at different times. Very smooth series exhibit autocovariance functions that stay large even when the t and s are far apart, whereas choppy series tend to have autocovariance functions that are nearly zero for large separations. Recall from classical statistics that if γ_x(s, t) = 0, x_s and x_t are not linearly related, but there still may be some dependence structure between them. If, however, x_s and x_t are bivariate normal, γ_x(s, t) = 0 ensures their independence. It is clear that, for s = t, the autocovariance reduces to the (assumed finite) variance, because

  γ_x(t, t) = E[(x_t − μ_t)²] = var(x_t).    (1.11)

Example 1.16 Autocovariance of White Noise
The white noise series w_t has E(w_t) = 0 and

  γ_w(s, t) = cov(w_s, w_t) = σ_w²  if s = t,  and  0  if s ≠ t.    (1.12)

A realization of white noise with σ_w² = 1 is shown in the top panel of Figure 1.8.

We often have to calculate the autocovariance between filtered series. A useful result is given in the following proposition.

Property 1.1 Covariance of Linear Combinations
If the random variables

  U = Σ_{j=1}^m a_j X_j   and   V = Σ_{k=1}^r b_k Y_k

are linear combinations of (finite variance) random variables {X_j} and {Y_k}, respectively, then

  cov(U, V) = Σ_{j=1}^m Σ_{k=1}^r a_j b_k cov(X_j, Y_k).    (1.13)

Furthermore, var(U) = cov(U, U).

Example 1.17 Autocovariance of a Moving Average
Consider applying a three-point moving average to the white noise series w_t of the previous example as in Example 1.9. In this case,

  γ_v(s, t) = cov(v_s, v_t) = cov{ (1/3)(w_{s−1} + w_s + w_{s+1}), (1/3)(w_{t−1} + w_t + w_{t+1}) }.

When s = t we have

  γ_v(t, t) = (1/9) cov{ (w_{t−1} + w_t + w_{t+1}), (w_{t−1} + w_t + w_{t+1}) }
            = (1/9)[cov(w_{t−1}, w_{t−1}) + cov(w_t, w_t) + cov(w_{t+1}, w_{t+1})]
            = (3/9) σ_w².

When s = t + 1,

  γ_v(t + 1, t) = (1/9) cov{ (w_t + w_{t+1} + w_{t+2}), (w_{t−1} + w_t + w_{t+1}) }
                = (1/9)[cov(w_t, w_t) + cov(w_{t+1}, w_{t+1})]
                = (2/9) σ_w²,

using (1.12). Similar computations give γ_v(t − 1, t) = 2σ_w²/9, γ_v(t + 2, t) = γ_v(t − 2, t) = σ_w²/9, and 0 when |t − s| > 2. We summarize the values for all s and t as

  γ_v(s, t) = (3/9) σ_w²   if s = t,
              (2/9) σ_w²   if |s − t| = 1,
              (1/9) σ_w²   if |s − t| = 2,
              0            if |s − t| > 2.    (1.14)

Example 1.17 shows clearly that the smoothing operation introduces a covariance function that decreases as the separation between the two time points increases and disappears completely when the time points are separated by three or more time points. This particular autocovariance is interesting because it only depends on the time separation or lag and not on the absolute location of the points along the series. We shall see later that this dependence suggests a mathematical model for the concept of weak stationarity.
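As a quick numerical check (not from the text), the sample autocovariances of a simulated three-point moving average should be close to the theoretical values in (1.14), namely 3/9, 2/9, 1/9, and 0 at lags 0, 1, 2, and 3. The series length below is an arbitrary choice, made large so the estimates are stable:
set.seed(90210)
w = rnorm(100000)                           # white noise with sigma_w^2 = 1
v = filter(w, sides=2, filter=rep(1/3,3))   # three-point moving average, as in (1.1)
acf(na.omit(v), lag.max=4, type="covariance", plot=FALSE)   # compare with (1.14)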

Example 1.18 Autocovariance of a Random Walk
For the random walk model, x_t = Σ_{j=1}^t w_j, we have

  γ_x(s, t) = cov(x_s, x_t) = cov( Σ_{j=1}^s w_j, Σ_{k=1}^t w_k ) = min{s, t} σ_w²,

because the w_t are uncorrelated random variables. Note that, as opposed to the previous examples, the autocovariance function of a random walk depends on the particular time values s and t, and not on the time separation or lag. Also, notice that the variance of the random walk, var(x_t) = γ_x(t, t) = t σ_w², increases without bound as time t increases. The effect of this variance increase can be seen in Figure 1.10 where the processes start to move away from their mean functions δt (note that δ = 0 and .2 in that example).

As in classical statistics, it is more convenient to deal with a measure of association between −1 and 1, and this leads to the following definition.

Definition 1.3 The autocorrelation function (ACF) is defined as

  ρ(s, t) = γ(s, t) / √( γ(s, s) γ(t, t) ).    (1.15)

The ACF measures the linear predictability of the series at time t, say x_t, using only the value x_s. We can show easily that −1 ≤ ρ(s, t) ≤ 1 using the Cauchy–Schwarz inequality. If we can predict x_t perfectly from x_s through a linear relationship, x_t = β_0 + β_1 x_s, then the correlation will be +1 when β_1 > 0, and −1 when β_1 < 0. Hence, we have a rough measure of the ability to forecast the series at time t from the value at time s.

Footnote 1.4: The Cauchy–Schwarz inequality implies |γ(s, t)|² ≤ γ(s, s) γ(t, t).

Often, we would like to measure the predictability of another series y_t from the series x_s. Assuming both series have finite variances, we have the following definition.

Definition 1.4 The cross-covariance function between two series, x_t and y_t, is

  γ_xy(s, t) = cov(x_s, y_t) = E[(x_s − μ_xs)(y_t − μ_yt)].    (1.16)

There is also a scaled version of the cross-covariance function.

Definition 1.5 The cross-correlation function (CCF) is given by

  ρ_xy(s, t) = γ_xy(s, t) / √( γ_x(s, s) γ_y(t, t) ).    (1.17)
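As a quick simulation sketch (not from the text) tied to Example 1.18, the sample variance of many independent random walks at time t should grow roughly like t σ_w²; the number of replications below is an arbitrary choice:
set.seed(314)
nsim = 1000; n = 200; sigw = 1
xs = replicate(nsim, cumsum(rnorm(n, 0, sigw)))   # each column is one random walk
plot(1:n, apply(xs, 1, var), type="l", xlab="t", ylab="sample var of x_t")
abline(a=0, b=sigw^2, lty=2)                      # theoretical variance t * sigma_w^2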

We may easily extend the above ideas to the case of more than two series, say, x_{t1}, x_{t2}, ..., x_{tr}; that is, multivariate time series with r components. For example, the extension of (1.10) in this case is
\[
\gamma_{jk}(s,t) = E[(x_{sj} - \mu_{sj})(x_{tk} - \mu_{tk})], \qquad j,k = 1,2,\ldots,r. \tag{1.18}
\]
In the definitions above, the autocovariance and cross-covariance functions may change as one moves along the series because the values depend on both s and t, the locations of the points in time. In Example 1.17, the autocovariance function depends on the separation of x_s and x_t, say, h = |s − t|, and not on where the points are located in time. As long as the points are separated by h units, the location of the two points does not matter. This notion, called weak stationarity, when the mean is constant, is fundamental in allowing us to analyze sample time series data when only a single series is available.

1.4 Stationary Time Series

The preceding definitions of the mean and autocovariance functions are completely general. Although we have not made any special assumptions about the behavior of the time series, many of the preceding examples have hinted that a sort of regularity may exist over time in the behavior of a time series. We introduce the notion of regularity using a concept called stationarity.

Definition 1.6 A strictly stationary time series is one for which the probabilistic behavior of every collection of values
{x_{t_1}, x_{t_2}, ..., x_{t_k}}
is identical to that of the time shifted set
{x_{t_1+h}, x_{t_2+h}, ..., x_{t_k+h}}.
That is,
\[
\Pr\{x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k\} = \Pr\{x_{t_1+h} \le c_1, \ldots, x_{t_k+h} \le c_k\} \tag{1.19}
\]
for all k = 1, 2, ..., all time points t_1, t_2, ..., t_k, all numbers c_1, c_2, ..., c_k, and all time shifts h = 0, ±1, ±2, ....

If a time series is strictly stationary, then all of the multivariate distribution functions for subsets of variables must agree with their counterparts in the shifted set for all values of the shift parameter h. For example, when k = 1, (1.19) implies that
\[
\Pr\{x_s \le c\} = \Pr\{x_t \le c\} \tag{1.20}
\]
for any time points s and t. This statement implies, for example, that the probability the value of a time series sampled hourly is negative at 1 am is the same as at 10 am.
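To make (1.20) concrete, the following sketch, which is our illustration and not from the text, simulates many realizations of an iid (hence strictly stationary) Gaussian series and compares the empirical distribution of the series at two different time points; the quantiles should be essentially the same. The number of realizations and the series length are arbitrary.
set.seed(90210)
x = replicate(5000, rnorm(24))                   # 5000 realizations of a length-24 iid series
quantile(x[1, ], c(.25, .5, .75))                # empirical quantiles of x_1
quantile(x[10, ], c(.25, .5, .75))               # nearly the same quantiles for x_10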

In addition, if the mean function, μ_t, of the series exists, (1.20) implies that μ_s = μ_t for all s and t, and hence μ_t must be constant. Note, for example, that a random walk process with drift is not strictly stationary because its mean function changes with time; see Example 1.14.

When k = 2, we can write (1.19) as
\[
\Pr\{x_s \le c_1,\ x_t \le c_2\} = \Pr\{x_{s+h} \le c_1,\ x_{t+h} \le c_2\} \tag{1.21}
\]
for any time points s and t and shift h. Thus, if the variance function of the process exists, (1.20)–(1.21) imply that the autocovariance function of the series x_t satisfies
\[ \gamma(s,t) = \gamma(s+h,\ t+h) \]
for all s and t and h. We may interpret this result by saying the autocovariance function of the process depends only on the time difference between s and t, and not on the actual times.

The version of stationarity in Definition 1.6 is too strong for most applications. Moreover, it is difficult to assess strict stationarity from a single data set. Rather than imposing conditions on all possible distributions of a time series, we will use a milder version that imposes conditions only on the first two moments of the series. We now have the following definition.

Definition 1.7 A weakly stationary time series, x_t, is a finite variance process such that
(i) the mean value function, μ_t, defined in (1.9), is constant and does not depend on time t, and
(ii) the autocovariance function, γ(s,t), defined in (1.10), depends on s and t only through their difference |s − t|.

Henceforth, we will use the term stationary to mean weakly stationary; if a process is stationary in the strict sense, we will use the term strictly stationary.

Stationarity requires regularity in the mean and autocorrelation functions so that these quantities (at least) may be estimated by averaging. It should be clear from the discussion of strict stationarity following Definition 1.6 that a strictly stationary, finite variance, time series is also stationary. The converse is not true unless there are further conditions. One important case where stationarity implies strict stationarity is if the time series is Gaussian [meaning all finite distributions, (1.19), of the series are Gaussian]. We will make this concept more precise at the end of this section.

Because the mean function, E(x_t) = μ_t, of a stationary time series is independent of time t, we will write
\[ \mu_t = \mu. \tag{1.22} \]
Also, because the autocovariance function, γ(s,t), of a stationary time series, x_t, depends on s and t only through their difference |s − t|, we may simplify the notation. Let s = t + h, where h represents the time shift or lag. Then
\[ \gamma(t+h, t) = \mathrm{cov}(x_{t+h}, x_t) = \mathrm{cov}(x_h, x_0) = \gamma(h, 0) \]

because the time difference between times t + h and t is the same as the time difference between times h and 0. Thus, the autocovariance function of a stationary time series does not depend on the time argument t. Henceforth, for convenience, we will drop the second argument of γ(h, 0).

Definition 1.8 The autocovariance function of a stationary time series will be written as
\[
\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) = E[(x_{t+h} - \mu)(x_t - \mu)]. \tag{1.23}
\]

Definition 1.9 The autocorrelation function (ACF) of a stationary time series will be written using (1.15) as
\[
\rho(h) = \frac{\gamma(t+h,\, t)}{\sqrt{\gamma(t+h,\, t+h)\,\gamma(t,\, t)}} = \frac{\gamma(h)}{\gamma(0)}. \tag{1.24}
\]
The Cauchy–Schwarz inequality shows again that −1 ≤ ρ(h) ≤ 1 for all h, enabling one to assess the relative importance of a given autocorrelation value by comparing with the extreme values −1 and 1.

Example 1.19 Stationarity of White Noise
The mean and autocovariance functions of the white noise series discussed in Example 1.8 and Example 1.16 are easily evaluated as μ_{wt} = 0 and
\[
\gamma_w(h) = \mathrm{cov}(w_{t+h}, w_t) =
\begin{cases}
\sigma_w^2 & h = 0, \\
0 & h \ne 0.
\end{cases}
\]
Thus, white noise satisfies the conditions of Definition 1.7 and is weakly stationary or stationary. If the white noise variates are also normally distributed or Gaussian, the series is also strictly stationary, as can be seen by evaluating (1.19) using the fact that the noise would also be iid. The autocorrelation function is given by ρ_w(0) = 1 and ρ_w(h) = 0 for h ≠ 0.

Example 1.20 Stationarity of a Moving Average
The three-point moving average process of Example 1.9 is stationary because, from Example 1.13 and Example 1.17, the mean and autocovariance functions μ_{vt} = 0, and
\[
\gamma_v(h) =
\begin{cases}
\tfrac{3}{9}\sigma_w^2 & h = 0, \\
\tfrac{2}{9}\sigma_w^2 & h = \pm 1, \\
\tfrac{1}{9}\sigma_w^2 & h = \pm 2, \\
0 & |h| > 2
\end{cases}
\]
are independent of time t, satisfying the conditions of Definition 1.7. The autocorrelation function is given by

Fig. 1.12. Autocorrelation function of a three-point moving average.

\[
\rho_v(h) =
\begin{cases}
1 & h = 0, \\
\tfrac{2}{3} & h = \pm 1, \\
\tfrac{1}{3} & h = \pm 2, \\
0 & |h| > 2.
\end{cases}
\]
Figure 1.12 shows a plot of the autocorrelations as a function of lag h. Note that the ACF is symmetric about lag zero.

Example 1.21 A Random Walk is Not Stationary
A random walk is not stationary because its autocovariance function, γ(s,t) = min{s,t} σ_w², depends on time; see Example 1.18 and Problem 1.8. Also, the random walk with drift violates both conditions of Definition 1.7 because, as shown in Example 1.14, the mean function, μ_{xt} = δt, is also a function of time t.

Example 1.22 Trend Stationarity
For example, if x_t = α + βt + y_t, where y_t is stationary, then the mean function is μ_{x,t} = E(x_t) = α + βt + μ_y, which is not independent of time. Therefore, the process is not stationary. The autocovariance function, however, is independent of time, because
\[ \gamma_x(h) = \mathrm{cov}(x_{t+h}, x_t) = E[(x_{t+h} - \mu_{x,t+h})(x_t - \mu_{x,t})] = E[(y_{t+h} - \mu_y)(y_t - \mu_y)] = \gamma_y(h). \]
Thus, the model may be considered as having stationary behavior around a linear trend; this behavior is sometimes called trend stationarity. An example of such a process is the price of chicken series displayed in Figure 2.1.

The autocovariance function of a stationary process has several special properties. First, γ(h) is non-negative definite (see Problem 1.25) ensuring that variances of linear combinations of the variates x_t will never be negative. That is, for any n ≥ 1, and constants a_1, ..., a_n,
\[
0 \le \mathrm{var}(a_1 x_1 + \cdots + a_n x_n) = \sum_{j=1}^{n}\sum_{k=1}^{n} a_j a_k\, \gamma(j-k), \tag{1.25}
\]
using Property 1.1. Also, the value at h = 0, namely

\[
\gamma(0) = E[(x_t - \mu)^2] \tag{1.26}
\]
is the variance of the time series and the Cauchy–Schwarz inequality implies
\[ |\gamma(h)| \le \gamma(0). \]
A final useful property, noted in a previous example, is that the autocovariance function of a stationary series is symmetric around the origin; that is,
\[
\gamma(h) = \gamma(-h) \tag{1.27}
\]
for all h. This property follows because
\[ \gamma((t+h) - t) = \mathrm{cov}(x_{t+h}, x_t) = \mathrm{cov}(x_t, x_{t+h}) = \gamma(t - (t+h)), \]
which shows how to use the notation as well as proving the result.

When several series are available, a notion of stationarity still applies with additional conditions.

Definition 1.10 Two time series, say, x_t and y_t, are said to be jointly stationary if they are each stationary, and the cross-covariance function
\[
\gamma_{xy}(h) = \mathrm{cov}(x_{t+h}, y_t) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)] \tag{1.28}
\]
is a function only of lag h.

Definition 1.11 The cross-correlation function (CCF) of jointly stationary time series x_t and y_t is defined as
\[
\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\,\gamma_y(0)}}. \tag{1.29}
\]
Again, we have the result −1 ≤ ρ_{xy}(h) ≤ 1 which enables comparison with the extreme values −1 and 1 when looking at the relation between x_{t+h} and y_t. The cross-correlation function is not generally symmetric about zero, i.e., typically ρ_{xy}(h) ≠ ρ_{xy}(−h). This is an important concept; it should be clear that cov(x_2, y_1) and cov(x_1, y_2) need not be the same. It is the case, however, that
\[
\rho_{xy}(h) = \rho_{yx}(-h), \tag{1.30}
\]
which can be shown by manipulations similar to those used to show (1.27).

Example 1.23 Joint Stationarity
Consider the two series, x_t and y_t, formed from the sum and difference of two successive values of a white noise process, say,
\[ x_t = w_t + w_{t-1} \quad\text{and}\quad y_t = w_t - w_{t-1}, \]

Fig. 1.13. Demonstration of the results of Example 1.24 when ℓ = 5. The title shows which side leads.

where w_t are independent random variables with zero means and variance σ_w². It is easy to show that γ_x(0) = γ_y(0) = 2σ_w² and γ_x(1) = γ_x(−1) = σ_w², γ_y(1) = γ_y(−1) = −σ_w². Also,
\[ \gamma_{xy}(1) = \mathrm{cov}(x_{t+1}, y_t) = \mathrm{cov}(w_{t+1} + w_t,\ w_t - w_{t-1}) = \sigma_w^2 \]
because only one term is nonzero. Similarly, γ_{xy}(0) = 0, γ_{xy}(−1) = −σ_w². We obtain, using (1.29),
\[
\rho_{xy}(h) =
\begin{cases}
0 & h = 0, \\
1/2 & h = 1, \\
-1/2 & h = -1, \\
0 & |h| \ge 2.
\end{cases}
\]
Clearly, the autocovariance and cross-covariance functions depend only on the lag separation, h, so the series are jointly stationary.

Example 1.24 Prediction Using Cross-Correlation
As a simple example of cross-correlation, consider the problem of determining possible leading or lagging relations between two series x_t and y_t. If the model
\[ y_t = A x_{t-\ell} + w_t \]
holds, the series x_t is said to lead y_t for ℓ > 0 and is said to lag y_t for ℓ < 0. Hence, the analysis of leading and lagging relations might be important in predicting the value of y_t from x_t. Assuming that the noise w_t is uncorrelated with the x_t series, the cross-covariance function can be computed as
\[ \gamma_{yx}(h) = \mathrm{cov}(y_{t+h}, x_t) = \mathrm{cov}(A x_{t+h-\ell} + w_{t+h},\ x_t) = \mathrm{cov}(A x_{t+h-\ell},\ x_t) = A\,\gamma_x(h - \ell). \]
Since (Cauchy–Schwarz) the largest absolute value of γ_x(h − ℓ) is γ_x(0), i.e., when h = ℓ, the cross-covariance function will look like the autocovariance of the input series x_t, and it will have a peak on the positive side if x_t leads y_t and a peak on the negative side if x_t lags y_t. Below is the R code of an example where x_t is white noise, ℓ = 5, and with γ̂_yx(h) shown in Figure 1.13.

x = rnorm(100)
y = lag(x, -5) + rnorm(100)
ccf(y, x, ylab='CCovF', type='covariance')

The concept of weak stationarity forms the basis for much of the analysis performed with time series. The fundamental properties of the mean and autocovariance functions (1.22) and (1.23) are satisfied by many theoretical models that appear to generate plausible sample realizations. In Example 1.9 and Example 1.10, two series were generated that produced stationary looking realizations, and in Example 1.20, we showed that the series in Example 1.9 was, in fact, weakly stationary. Both examples are special cases of the so-called linear process.

Definition 1.12 A linear process, x_t, is defined to be a linear combination of white noise variates w_t, and is given by
\[
x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j\, w_{t-j}, \qquad \sum_{j=-\infty}^{\infty} |\psi_j| < \infty. \tag{1.31}
\]
For the linear process (see Problem 1.11), we may show that the autocovariance function is given by
\[
\gamma_x(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h}\,\psi_j \tag{1.32}
\]
for h ≥ 0; recall that γ_x(−h) = γ_x(h). This method exhibits the autocovariance function of the process in terms of the lagged products of the coefficients. We only need Σ_{j=−∞}^{∞} ψ_j² < ∞ for the process to have finite variance, but we will discuss this further in Chapter 5. Note that, for Example 1.9, we have ψ_{−1} = ψ_0 = ψ_1 = 1/3 and the result in Example 1.20 comes out immediately. The autoregressive series in Example 1.10 can also be put in this form, as can the general autoregressive moving average processes considered in Chapter 3.

Notice that the linear process (1.31) is dependent on the future (j < 0), the present (j = 0), and the past (j > 0). For the purpose of forecasting, a future dependent model will be useless. Consequently, we will focus on processes that do not depend on the future. Such models are called causal, and a causal linear process has ψ_j = 0 for j < 0; we will discuss this further in Chapter 3.

Finally, as previously mentioned, an important case in which a weakly stationary series is also strictly stationary is the normal or Gaussian series.

Definition 1.13 A process, {x_t}, is said to be a Gaussian process if the n-dimensional vectors x = (x_{t_1}, x_{t_2}, ..., x_{t_n})', for every collection of distinct time points t_1, t_2, ..., t_n, and every positive integer n, have a multivariate normal distribution.

Defining the n × 1 mean vector E(x) ≡ μ = (μ_{t_1}, μ_{t_2}, ..., μ_{t_n})' and the n × n covariance matrix as var(x) ≡ Γ = {γ(t_i, t_j); i, j = 1, ..., n}, which is assumed to be positive definite, the multivariate normal density function can be written as
\[
f(x) = (2\pi)^{-n/2}\, |\Gamma|^{-1/2} \exp\Bigl\{ -\tfrac{1}{2}(x - \mu)'\, \Gamma^{-1}\, (x - \mu) \Bigr\} \tag{1.33}
\]

for x ∈ R^n, where |·| denotes the determinant.

We list some important items regarding linear and Gaussian processes.
• If a Gaussian time series, {x_t}, is weakly stationary, then μ_t is constant and γ(t_i, t_j) = γ(|t_i − t_j|), so that the vector μ and the matrix Γ are independent of time. These facts imply that all the finite distributions, (1.33), of the series {x_t} depend only on time lag and not on the actual times, and hence the series must be strictly stationary. In a sense, weak stationarity and normality go hand-in-hand in that we will base our analyses on the idea that it is enough for the first two moments to behave nicely. We use the multivariate normal density in the form given above as well as in a modified version, applicable to complex random variables, throughout the text.
• A result called the Wold Decomposition (Theorem B.5) states that a stationary non-deterministic time series is a causal linear process (but with Σ ψ_j² < ∞). A linear process need not be Gaussian, but if a time series is Gaussian, then it is a causal linear process with w_t ∼ iid N(0, σ_w²). Hence, stationary Gaussian processes form the basis of modeling many time series.
• It is not enough for the marginal distributions to be Gaussian for the process to be Gaussian. It is easy to construct a situation where X and Y are normal, but (X, Y) is not bivariate normal; e.g., let X and Z be independent normals and let Y = Z if XZ > 0 and Y = −Z if XZ ≤ 0.

1.5 Estimation of Correlation

Although the theoretical autocorrelation and cross-correlation functions are useful for describing the properties of certain hypothesized models, most of the analyses must be performed using sampled data. This limitation means the sampled points x_1, x_2, ..., x_n only are available for estimating the mean, autocovariance, and autocorrelation functions. From the point of view of classical statistics, this poses a problem because we will typically not have iid copies of x_t that are available for estimating the covariance and correlation functions. In the usual situation with only one realization, however, the assumption of stationarity becomes critical. Somehow, we must use averages over this single realization to estimate the population means and covariance functions.

Accordingly, if a time series is stationary, the mean function (1.22), μ_t = μ, is constant so that we can estimate it by the sample mean,
\[
\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t. \tag{1.34}
\]
In our case, E(x̄) = μ, and the standard error of the estimate is the square root of var(x̄), which can be computed using first principles (recall Property 1.1), and is given by

\[
\mathrm{var}(\bar{x}) = \mathrm{var}\Bigl(\frac{1}{n}\sum_{t=1}^{n} x_t\Bigr)
= \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\mathrm{cov}(x_t, x_s)
\]
\[
= \frac{1}{n^2}\Bigl( n\gamma_x(0) + (n-1)\gamma_x(1) + (n-2)\gamma_x(2) + \cdots + \gamma_x(n-1)
+ (n-1)\gamma_x(-1) + (n-2)\gamma_x(-2) + \cdots + \gamma_x(1-n) \Bigr)
\]
\[
= \frac{1}{n}\sum_{h=-n}^{n}\Bigl(1 - \frac{|h|}{n}\Bigr)\gamma_x(h). \tag{1.35}
\]
If the process is white noise, (1.35) reduces to the familiar σ_x²/n recalling that γ_x(0) = σ_x². Note that, in the case of dependence, the standard error of x̄ may be smaller or larger than the white noise case depending on the nature of the correlation structure (see Problem 1.19).

The theoretical autocovariance function, (1.23), is estimated by the sample autocovariance function defined as follows.

Definition 1.14 The sample autocovariance function is defined as
\[
\hat{\gamma}(h) = n^{-1}\sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x}), \tag{1.36}
\]
with γ̂(−h) = γ̂(h) for h = 0, 1, ..., n − 1.

The sum in (1.36) runs over a restricted range because x_{t+h} is not available for t + h > n. The estimator in (1.36) is preferred to the one that would be obtained by dividing by n − h because (1.36) is a non-negative definite function. Recall that the autocovariance function of a stationary process is non-negative definite [(1.25); see also Problem 1.25] ensuring that variances of linear combinations of the variates x_t will never be negative. And because a variance is never negative, the estimate of that variance
\[
\widehat{\mathrm{var}}(a_1 x_1 + \cdots + a_n x_n) = \sum_{j=1}^{n}\sum_{k=1}^{n} a_j a_k\, \hat{\gamma}(j-k),
\]
should also be non-negative. The estimator in (1.36) guarantees this result, but no such guarantee exists if we divide by n − h. Note that neither dividing by n nor n − h in (1.36) yields an unbiased estimator of γ(h).

Definition 1.15 The sample autocorrelation function is defined, analogously to (1.24), as
\[
\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}. \tag{1.37}
\]
The sample autocorrelation function has a sampling distribution that allows us to assess whether the data comes from a completely random or white series or whether correlations are statistically significant at some lags.
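Before turning to that sampling distribution, here is a rough numerical check of (1.35), which is not part of the text. It evaluates the formula for the three-point moving average of Example 1.9 (σ_w = 1, n = 100) and compares it with a simulation; the 2000 replicates and the seed are arbitrary choices.
n = 100
gam = c(3, 2, 1, rep(0, n - 3))/9                # gamma_v(h), h = 0, 1, 2, ...
h = -(n - 1):(n - 1)                             # terms with |h| = n contribute zero in (1.35)
sum((1 - abs(h)/n) * gam[abs(h) + 1])/n          # var(xbar) from (1.35), about .0099
set.seed(1)
xbar = replicate(2000, mean(filter(rnorm(n + 2), rep(1/3, 3), sides = 2), na.rm = TRUE))
var(xbar)                                        # simulated var(xbar), close to the formula value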

Fig. 1.14. Display for Example 1.25. For the SOI series, the scatterplots show pairs of values one month apart (left) and six months apart (right). The estimated correlation is displayed in the box.

Example 1.25 Sample ACF and Scatterplots
Estimating autocorrelation is similar to estimating correlation in the usual setup where we have pairs of observations, say (x_i, y_i), for i = 1, ..., n. For example, if we have time series data x_t for t = 1, ..., n, then the pairs of observations for estimating ρ(h) are the n − h pairs given by {(x_t, x_{t+h}); t = 1, ..., n − h}. Figure 1.14 shows an example using the SOI series where ρ̂(1) = .604 and ρ̂(6) = −.187. The following code was used for Figure 1.14.
(r = round(acf(soi, 6, plot=FALSE)$acf[-1], 3)) # first 6 sample acf values
[1]  0.604  0.374  0.214  0.050 -0.107 -0.187
par(mfrow=c(1,2))
plot(lag(soi,-1), soi); legend('topleft', legend=r[1])
plot(lag(soi,-6), soi); legend('topleft', legend=r[6])

Property 1.2 Large-Sample Distribution of the ACF
Under general conditions,[1.5] if x_t is white noise, then for n large, the sample ACF, ρ̂_x(h), for h = 1, 2, ..., H, where H is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation given by
\[
\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}. \tag{1.38}
\]
Based on the previous result, we obtain a rough method of assessing whether peaks in ρ̂(h) are significant by determining whether the observed peak is outside the interval ±2/√n (or plus/minus two standard errors); for a white noise sequence, approximately 95% of the sample ACFs should be within these limits. The applications of this property develop because many statistical modeling procedures depend on reducing a time series to a white noise series using various kinds of transformations. After such a procedure is applied, the plotted ACFs of the residuals should then lie roughly within the limits given above.

[1.5] The general conditions are that x_t is iid with finite fourth moment. A sufficient condition for this to hold is that x_t is white Gaussian noise. Precise details are given in Theorem A.7 in Appendix A.
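A quick demonstration of Property 1.2, which is our own and not in the text: for simulated white noise, roughly 95% of the sample ACF values at nonzero lags should fall inside ±2/√n. The seed, series length, and number of lags are arbitrary.
set.seed(666)
n = 500
r = acf(rnorm(n), lag.max = 40, plot = FALSE)$acf[-1]   # sample ACF at lags 1-40
mean(abs(r) < 2/sqrt(n))                                # proportion inside the bounds, near .95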

Example 1.26 A Simulated Time Series
To compare the sample ACF for various sample sizes to the theoretical ACF, consider a contrived set of data generated by tossing a fair coin, letting x_t = 1 when a head is obtained and x_t = −1 when a tail is obtained. Then, construct y_t as
\[
y_t = 5 + x_t - .7\, x_{t-1}. \tag{1.39}
\]
To simulate data, we consider two cases, one with a small sample size (n = 10) and another with a moderate sample size (n = 100).
set.seed(101010)
x1 = 2*rbinom(11, 1, .5) - 1    # simulated sequence of coin tosses
x2 = 2*rbinom(101, 1, .5) - 1
y1 = 5 + filter(x1, sides=1, filter=c(1,-.7))[-1]
y2 = 5 + filter(x2, sides=1, filter=c(1,-.7))[-1]
plot.ts(y1, type='s'); plot.ts(y2, type='s')   # plot both series (not shown)
c(mean(y1), mean(y2))           # the sample means
[1] 5.080 5.002
acf(y1, lag.max=4, plot=FALSE)  # 1/sqrt(10) = .32
Autocorrelations of series 'y1', by lag
     0      1      2      3      4
 1.000 -0.688  0.425 -0.306 -0.007
acf(y2, lag.max=4, plot=FALSE)  # 1/sqrt(100) = .1
Autocorrelations of series 'y2', by lag
     0      1      2      3      4
 1.000 -0.480 -0.002 -0.004  0.000
# Note that the sample ACF at lag zero is always 1 (Why?).
The theoretical ACF can be obtained from the model (1.39) using the fact that the mean of x_t is zero and the variance of x_t is one. It can be shown that
\[
\rho_y(1) = \frac{-.7}{1 + .7^2} = -.47
\]
and ρ_y(h) = 0 for |h| > 1 (Problem 1.24). It is interesting to compare the theoretical ACF with sample ACFs for the realization where n = 10 and the other realization where n = 100; note the increased variability in the smaller size sample.

Example 1.27 ACF of a Speech Signal
Computing the sample ACF as in the previous example can be thought of as matching the time series x_t against itself, h units in the future, say, x_{t+h}. Figure 1.15 shows the ACF of the speech series of Figure 1.3. The original series appears to contain a sequence of repeating short signals. The ACF confirms this behavior, showing repeating peaks spaced at about 106-109 points. Autocorrelation functions of the short signals appear, spaced at the intervals mentioned above. The distance between the repeating signals is known as the pitch period and is a fundamental parameter of interest in systems that encode and decipher speech. Because the series is sampled at 10,000 points per second, the pitch period appears to be between .0106 and .0109 seconds. To compute the sample ACF in R, use acf(speech, 250).
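As a purely illustrative sketch (not from the text), the peak location can also be read off the ACF numerically and converted to a pitch period; the speech series is in the astsa package, and the 90-125 lag search window is our assumption based on the plot, not part of the example.
library(astsa)                                   # for the speech series
a = acf(speech, 250, plot = FALSE)
lags = drop(a$lag); vals = drop(a$acf)
win = lags >= 90 & lags <= 125                   # assumed window around the first repeating peak
(peak = lags[win][which.max(vals[win])])         # lag of the first major repeat
peak/10000                                       # approximate pitch period in seconds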

Fig. 1.15. ACF of the speech series.

Definition 1.16 The estimators for the cross-covariance function, γ_xy(h), as given in (1.28) and the cross-correlation, ρ_xy(h), in (1.29) are given, respectively, by the sample cross-covariance function
\[
\hat{\gamma}_{xy}(h) = n^{-1}\sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y}), \tag{1.40}
\]
where γ̂_xy(−h) = γ̂_yx(h) determines the function for negative lags, and the sample cross-correlation function
\[
\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}}. \tag{1.41}
\]
The sample cross-correlation function can be examined graphically as a function of lag h to search for leading or lagging relations in the data using the property mentioned in Example 1.24 for the theoretical cross-covariance function. Because −1 ≤ ρ̂_xy(h) ≤ 1, the practical importance of peaks can be assessed by comparing their magnitudes with their theoretical maximum values. Furthermore, for x_t and y_t independent linear processes of the form (1.31), we have the following property.

Property 1.3 Large-Sample Distribution of Cross-Correlation
The large sample distribution of ρ̂_xy(h) is normal with mean zero and
\[
\sigma_{\hat{\rho}_{xy}} = \frac{1}{\sqrt{n}} \tag{1.42}
\]
if at least one of the processes is independent white noise (see Theorem A.8).
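A small check of Property 1.3, which is not in the text: the sample CCF of two independent white noise series should stay mostly inside ±2/√n. The seed and lengths are arbitrary choices.
set.seed(314)
n = 400
r = ccf(rnorm(n), rnorm(n), lag.max = 30, plot = FALSE)$acf
mean(abs(r) < 2/sqrt(n))                          # roughly .95 of the values inside +/- 2/sqrt(n)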

Fig. 1.16. Sample ACFs of the SOI series (top) and of the Recruitment series (middle), and the sample CCF of the two series (bottom); negative lags indicate SOI leads Recruitment. The lag axes are in terms of seasons (12 months).

Example 1.28 SOI and Recruitment Correlation Analysis
The autocorrelation and cross-correlation functions are also useful for analyzing the joint behavior of two stationary series whose behavior may be related in some unspecified way. In Example 1.5 (see Figure 1.5), we have considered simultaneous monthly readings of the SOI and the number of new fish (Recruitment) computed from a model. Figure 1.16 shows the autocorrelation and cross-correlation functions (ACFs and CCF) for these two series. Both of the ACFs exhibit periodicities corresponding to the correlation between values separated by 12 units. Observations 12 months or one year apart are strongly positively correlated, as are observations at multiples such as 24, 36, 48, .... Observations separated by six months are negatively correlated, showing that positive excursions tend to be associated with negative excursions six months removed. The sample CCF in Figure 1.16, however, shows some departure from the cyclic component of each series and there is an obvious peak at h = −6. This result implies that SOI measured at time t − 6 months is associated with the Recruitment series at time t. We could say the SOI leads the Recruitment series by six months. The sign of the CCF is negative, leading to the conclusion that the two series move in different directions; that is, increases in SOI lead to decreases in Recruitment

and vice versa. We will discover in Chapter 2 that there is a relationship between the series, but the relationship is nonlinear. The dashed lines shown on the plots indicate ±2/√453 [see (1.42)], but since neither series is noise, these lines do not apply. To reproduce Figure 1.16 in R, use the following commands:
par(mfrow=c(3,1))
acf(soi, 48, main="Southern Oscillation Index")
acf(rec, 48, main="Recruitment")
ccf(soi, rec, 48, main="SOI vs Recruitment", ylab="CCF")

Example 1.29 Prewhitening and Cross Correlation Analysis
Although we do not have all the tools necessary yet, it is worthwhile to discuss the idea of prewhitening a series prior to a cross-correlation analysis. The basic idea is simple; in order to use Property 1.3, at least one of the series must be white noise. If this is not the case, there is no simple way to tell if a cross-correlation estimate is significantly different from zero. Hence, in Example 1.28, we were only guessing at the linear dependence relationship between SOI and Recruitment.

For example, in Figure 1.17 we generated two series, x_t and y_t, for t = 1, ..., 120, independently as
\[
x_t = 2\cos\bigl(2\pi t\,\tfrac{1}{12}\bigr) + w_{t1} \quad\text{and}\quad y_t = 2\cos\bigl(2\pi [t+5]\,\tfrac{1}{12}\bigr) + w_{t2},
\]
where {w_{t1}, w_{t2}; t = 1, ..., 120} are all independent standard normals. The series are made to resemble SOI and Recruitment. The generated data are shown in the top row of the figure. The middle row of Figure 1.17 shows the sample ACF of each series, each of which exhibits the cyclic nature of each series. The bottom row (left) of Figure 1.17 shows the sample CCF between x_t and y_t, which appears to show cross-correlation even though the series are independent. The bottom row (right) also displays the sample CCF between x_t and the prewhitened y_t, which shows that the two sequences are uncorrelated. By prewhitening y_t, we mean that the signal has been removed from the data by running a regression of y_t on cos(2πt/12) and sin(2πt/12) [see Example 2.10] and then putting ỹ_t = y_t − ŷ_t, where ŷ_t are the predicted values from the regression. The following code will reproduce Figure 1.17.
set.seed(1492)
num=120; t=1:num
X = ts(2*cos(2*pi*t/12) + rnorm(num), freq=12)
Y = ts(2*cos(2*pi*(t+5)/12) + rnorm(num), freq=12)
Yw = resid( lm(Y~ cos(2*pi*t/12) + sin(2*pi*t/12), na.action=NULL) )
par(mfrow=c(3,2), mgp=c(1.6,.6,0), mar=c(3,3,1,1) )
plot(X)
plot(Y)
acf(X, 48, ylab='ACF(X)')
acf(Y, 48, ylab='ACF(Y)')
ccf(X, Y, 24, ylab='CCF(X,Y)')
ccf(X, Yw, 24, ylab='CCF(X,Yw)', ylim=c(-.6,.6))
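As a rough alternative sketch, which is not part of the example, prewhitening can also be attempted without knowing the signal's form by fitting a short autoregression to Y and using its residuals in place of Y; this typically damps the spurious cyclic cross-correlations seen on the left, though less cleanly than the regression-based approach above. The code below assumes X and Y from the previous chunk; the maximum AR order of 15 is an arbitrary choice.
fitY = ar(Y, order.max = 15)                     # automatic AR order selection
Yw2  = na.omit(fitY$resid)                       # AR residuals play the role of the prewhitened series
ccf(X[-(1:fitY$order)], Yw2, 24, ylab='CCF(X, AR-prewhitened Y)')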

Fig. 1.17. Display for Example 1.29. Top row: the generated series. Middle row: the sample ACF of each series. Bottom row: the sample CCF of the series (left) and the sample CCF of the first series with the prewhitened second series (right).

1.6 Vector-Valued and Multidimensional Series

We frequently encounter situations in which the relationships between a number of jointly measured time series are of interest. For example, in the previous sections, we considered discovering the relationships between the SOI and Recruitment series. Hence, it will be useful to consider the notion of a vector time series x_t = (x_{t1}, x_{t2}, ..., x_{tp})', which contains as its components p univariate time series. We denote the p × 1 column vector of the observed series as x_t. The row vector x_t' is its transpose. For the stationary case, the p × 1 mean vector
\[
\mu = E(x_t) \tag{1.43}
\]
of the form μ = (μ_{t1}, μ_{t2}, ..., μ_{tp})' and the p × p autocovariance matrix
\[
\Gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)'] \tag{1.44}
\]
can be defined, where the elements of the matrix Γ(h) are the cross-covariance functions
\[
\gamma_{ij}(h) = E[(x_{t+h,\,i} - \mu_i)(x_{tj} - \mu_j)] \tag{1.45}
\]

Fig. 1.18. Two-dimensional time series of temperature measurements taken on a rectangular field (64 × 36 with 17-foot spacing). Data are from Bazza et al. (1988).

for i, j = 1, ..., p. Because γ_ij(h) = γ_ji(−h), it follows that
\[
\Gamma(-h) = \Gamma(h)'. \tag{1.46}
\]
Now, the sample autocovariance matrix of the vector series x_t is the p × p matrix of sample cross-covariances, defined as
\[
\hat{\Gamma}(h) = n^{-1}\sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x})', \tag{1.47}
\]
where
\[
\bar{x} = n^{-1}\sum_{t=1}^{n} x_t \tag{1.48}
\]
denotes the p × 1 sample mean vector. The symmetry property of the theoretical autocovariance (1.46) extends to the sample autocovariance (1.47), which is defined for negative values by taking
\[
\hat{\Gamma}(-h) = \hat{\Gamma}(h)'. \tag{1.49}
\]
In many applied problems, an observed series may be indexed by more than time alone. For example, the position in space of an experimental unit might be described by two coordinates, say, s_1 and s_2. We may proceed in these cases by defining a multidimensional process x_s as a function of the r × 1 vector s = (s_1, s_2, ..., s_r)', where s_i denotes the coordinate of the i-th index.
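Before turning to multidimensional series, here is a direct numerical illustration of (1.47), which is not part of the text: the lag-one sample autocovariance matrix of the bivariate series (SOI, Recruitment), computed straight from the definition. The series are taken from the astsa package.
library(astsa)                                   # soi and rec
x = cbind(soi, rec)
n = nrow(x); h = 1
xc = sweep(x, 2, colMeans(x))                    # center each component series
(Gamma1 = t(xc[(1 + h):n, ]) %*% xc[1:(n - h), ] / n)   # 2 x 2 sample autocovariance matrix at lag 1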

Fig. 1.19. Row averages of the two-dimensional soil temperature profile, x̄_{s_1, ·} = Σ_{s_2} x_{s_1, s_2} / 36.

Example 1.30 Soil Surface Temperatures
As an example, the two-dimensional (r = 2) temperature series x_{s_1, s_2} in Figure 1.18 is indexed by a row number s_1 and a column number s_2 that represent positions on a 64 × 36 spatial grid set out on an agricultural field. The value of the temperature measured at row s_1 and column s_2 is denoted by x_s = x_{s_1, s_2}. We can note from the two-dimensional plot that a distinct change occurs in the character of the two-dimensional surface starting at about row 40, where the oscillations along the row axis become fairly stable and periodic. For example, averaging over the 36 columns, we may compute an average value for each s_1 as in Figure 1.19. It is clear that the noise present in the first part of the two-dimensional series is nicely averaged out, and we see a clear and consistent temperature signal.

To generate Figure 1.18 and Figure 1.19 in R, use the following commands:
persp(1:64, 1:36, soiltemp, phi=25, theta=25, scale=FALSE, expand=4,
      ticktype="detailed", xlab="rows", ylab="cols", zlab="temperature")
plot.ts(rowMeans(soiltemp), xlab="row", ylab="Average Temperature")

The autocovariance function of a stationary multidimensional process, x_s, can be defined as a function of the multidimensional lag vector, say, h = (h_1, h_2, ..., h_r)', as
\[
\gamma(h) = E[(x_{s+h} - \mu)(x_s - \mu)], \tag{1.50}
\]
where
\[
\mu = E(x_s) \tag{1.51}
\]
does not depend on the spatial coordinate s. For the two dimensional temperature process, (1.50) becomes
\[
\gamma(h_1, h_2) = E[(x_{s_1+h_1,\, s_2+h_2} - \mu)(x_{s_1, s_2} - \mu)], \tag{1.52}
\]
which is a function of lag, both in the row (h_1) and column (h_2) directions.

The multidimensional sample autocovariance function is defined as
\[
\hat{\gamma}(h) = (S_1 S_2 \cdots S_r)^{-1} \sum_{s_1}\sum_{s_2}\cdots\sum_{s_r} (x_{s+h} - \bar{x})(x_s - \bar{x}), \tag{1.53}
\]

where s = (s_1, s_2, ..., s_r)' and the range of summation for each argument is 1 ≤ s_i ≤ S_i − h_i, for i = 1, ..., r. The mean is computed over the r-dimensional array, that is,
\[
\bar{x} = (S_1 S_2 \cdots S_r)^{-1} \sum_{s_1}\sum_{s_2}\cdots\sum_{s_r} x_{s_1, s_2, \ldots, s_r}, \tag{1.54}
\]
where the arguments s_i are summed over 1 ≤ s_i ≤ S_i. The multidimensional sample autocorrelation function follows, as usual, by taking the scaled ratio
\[
\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}. \tag{1.55}
\]

Example 1.31 Sample ACF of the Soil Temperature Series
The autocorrelation function of the two-dimensional (2d) temperature process can be written in the form
\[
\hat{\rho}(h_1, h_2) = \frac{\hat{\gamma}(h_1, h_2)}{\hat{\gamma}(0, 0)},
\]
where
\[
\hat{\gamma}(h_1, h_2) = (S_1 S_2)^{-1} \sum_{s_1}\sum_{s_2} (x_{s_1+h_1,\, s_2+h_2} - \bar{x})(x_{s_1, s_2} - \bar{x}).
\]
Figure 1.20 shows the autocorrelation function for the temperature data, and we note the systematic periodic variation that appears along the rows. The autocovariance over columns seems to be strongest for h_1 = 0, implying columns may form replicates of some underlying process that has a periodicity over the rows. This idea can be investigated by examining the mean series over columns as shown in Figure 1.19.

The easiest way (that we know of) to calculate a 2d ACF in R is by using the fast Fourier transform (FFT) as shown below. Unfortunately, the material needed to understand this approach is given in Chapter 4, Section 4.3. The 2d autocovariance function is obtained in two steps and is contained in cs below; γ̂(0, 0) is the (1,1) element so that ρ̂(h_1, h_2) is obtained by dividing each element by that value. The 2d ACF is contained in rs below, and the rest of the code is simply to arrange the results to yield a nice display.
fs = Mod(fft(soiltemp-mean(soiltemp)))^2/(64*36)
cs = Re(fft(fs, inverse=TRUE)/sqrt(64*36))   # ACovF
rs = cs/cs[1,1]                              # ACF
rs2 = cbind(rs[1:41,21:2], rs[1:41,1:21])
rs3 = rbind(rs2[41:2,], rs2)
par(mar = c(1,2.5,0,0)+.1)
persp(-40:40, -20:20, rs3, phi=30, theta=30, expand=30, scale=FALSE,
      ticktype="detailed", xlab="row lags", ylab="column lags", zlab="ACF")

The sampling requirements for multidimensional processes are rather severe because values must be available over some uniform grid in order to compute the ACF. In some areas of application, such as in soil science, we may prefer to sample a limited number of rows or transects and hope these are essentially replicates of the

Fig. 1.20. Two-dimensional autocorrelation function for the soil temperature data.

basic underlying phenomenon of interest. One-dimensional methods can then be applied. When observations are irregular in time space, modifications to the estimators need to be made. Systematic approaches to the problems introduced by irregularly spaced observations have been developed by Journel and Huijbregts (1978) or Cressie (1993). We shall not pursue such methods in detail here, but it is worth noting that the introduction of the variogram
\[
2 V_x(h) = \mathrm{var}\{x_{s+h} - x_s\} \tag{1.56}
\]
and its sample estimator
\[
2 \hat{V}_x(h) = \frac{1}{N(h)} \sum_{s} (x_{s+h} - x_s)^2 \tag{1.57}
\]
play key roles, where N(h) denotes the number of points located within h, and the sum runs over the points in the neighborhood. Clearly, substantial indexing difficulties will develop from estimators of this kind, and often it will be difficult to find non-negative definite estimators for the covariance function. Problem 1.27 investigates the relation between the variogram and the autocovariance function in the stationary case.
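As a one-dimensional illustration of (1.56)–(1.57), which is not in the text, the sketch below computes a sample variogram for a stationary series (the three-point moving average of Example 1.9) and compares it with γ̂(0) − γ̂(h), the relationship explored in Problem 1.27. The seed and the series length are arbitrary choices.
set.seed(2)
v = na.omit(filter(rnorm(5000), rep(1/3, 3), sides = 2))  # a stationary series
n = length(v)
vario = sapply(1:5, function(h) mean((v[(1 + h):n] - v[1:(n - h)])^2)/2)
gam = acf(v, lag.max = 5, type = "covariance", plot = FALSE)$acf
cbind(vario, gam[1] - gam[2:6])                  # the two columns should nearly agree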

Problems

Section 1.1

1.1 To compare the earthquake and explosion signals, plot the data displayed in Figure 1.7 on the same graph using different colors or different line types and comment on the results. (The R code in Example 1.11 may be of help on how to add lines to existing plots.)

1.2 Consider a signal-plus-noise model of the general form x_t = s_t + w_t, where w_t is Gaussian white noise with σ_w² = 1. Simulate and plot n = 200 observations from each of the following two models.
(a) x_t = s_t + w_t, for t = 1, ..., 200, where
\[
s_t = \begin{cases} 0, & t = 1, \ldots, 100, \\ 10 \exp\{-\tfrac{t-100}{20}\}\cos(2\pi t/4), & t = 101, \ldots, 200. \end{cases}
\]
Hint:
s = c(rep(0,100), 10*exp(-(1:100)/20)*cos(2*pi*1:100/4))
x = s + rnorm(200)
plot.ts(x)
(b) x_t = s_t + w_t, for t = 1, ..., 200, where
\[
s_t = \begin{cases} 0, & t = 1, \ldots, 100, \\ 10 \exp\{-\tfrac{t-100}{200}\}\cos(2\pi t/4), & t = 101, \ldots, 200. \end{cases}
\]
(c) Compare the general appearance of the series (a) and (b) with the earthquake series and the explosion series shown in Figure 1.7. In addition, plot (or sketch) and compare the signal modulators (a) exp{−t/20} and (b) exp{−t/200}, for t = 1, 2, ..., 100.

Section 1.2

1.3 (a) Generate n = 100 observations from the autoregression
\[ x_t = -.9\, x_{t-2} + w_t \]
with σ_w = 1, using the method described in Example 1.10. Next, apply the moving average filter
\[ v_t = (x_t + x_{t-1} + x_{t-2} + x_{t-3})/4 \]
to x_t, the data you generated. Now plot x_t as a line and superimpose v_t as a dashed line. Comment on the behavior of x_t and how applying the moving average filter changes that behavior. [Hints: Use v = filter(x, rep(1/4, 4), sides = 1) for the filter and note that the R code in Example 1.11 may be of help on how to add lines to existing plots.]

(b) Repeat (a) but with
\[ x_t = \cos(2\pi t/4). \]
(c) Repeat (b) but with added N(0, 1) noise,
\[ x_t = \cos(2\pi t/4) + w_t. \]
(d) Compare and contrast (a)–(c); i.e., how does the moving average change each series.

Section 1.3

1.4 Show that the autocovariance function can be written as
\[ \gamma(s,t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E(x_s x_t) - \mu_s \mu_t, \]
where E[x_t] = μ_t.

1.5 For the two series, x_t, in Problem 1.2 (a) and (b):
(a) Compute and plot the mean functions μ_x(t), for t = 1, ..., 200.
(b) Calculate the autocovariance functions, γ_x(s,t), for s, t = 1, ..., 200.

Section 1.4

1.6 Consider the time series
\[ x_t = \beta_1 + \beta_2 t + w_t, \]
where β_1 and β_2 are known constants and w_t is a white noise process with variance σ_w².
(a) Determine whether x_t is stationary.
(b) Show that the process y_t = x_t − x_{t−1} is stationary.
(c) Show that the mean of the moving average
\[ v_t = \frac{1}{2q+1}\sum_{j=-q}^{q} x_{t-j} \]
is β_1 + β_2 t, and give a simplified expression for the autocovariance function.

1.7 For a moving average process of the form
\[ x_t = w_{t-1} + 2 w_t + w_{t+1}, \]
where w_t are independent with zero means and variance σ_w², determine the autocovariance and autocorrelation functions as a function of lag h = s − t and plot the ACF as a function of h.

1.8 Consider the random walk with drift model
\[ x_t = \delta + x_{t-1} + w_t, \]
for t = 1, 2, ..., with x_0 = 0, where w_t is white noise with variance σ_w².
(a) Show that the model can be written as x_t = δt + Σ_{k=1}^{t} w_k.
(b) Find the mean function and the autocovariance function of x_t.
(c) Argue that x_t is not stationary.
(d) Show ρ_x(t−1, t) = √((t−1)/t) → 1 as t → ∞. What is the implication of this result?
(e) Suggest a transformation to make the series stationary, and prove that the transformed series is stationary. (Hint: See Problem 1.6b.)

1.9 A time series with a periodic component can be constructed from
\[ x_t = U_1 \sin(2\pi\omega_0 t) + U_2 \cos(2\pi\omega_0 t), \]
where U_1 and U_2 are independent random variables with zero means and E(U_1²) = E(U_2²) = σ². The constant ω_0 determines the period or time it takes the process to make one complete cycle. Show that this series is weakly stationary with autocovariance function
\[ \gamma(h) = \sigma^2 \cos(2\pi\omega_0 h). \]

1.10 Suppose we would like to predict a single stationary series x_t with zero mean and autocorrelation function γ(h) at some time in the future, say, t + ℓ, for ℓ > 0.
(a) If we predict using only x_t and some scale multiplier A, show that the mean-square prediction error
\[ MSE(A) = E[(x_{t+\ell} - A x_t)^2] \]
is minimized by the value
\[ A = \rho(\ell). \]
(b) Show that the minimum mean-square prediction error is
\[ MSE(A) = \gamma(0)[1 - \rho^2(\ell)]. \]
(c) Show that if x_{t+ℓ} = A x_t, then ρ(ℓ) = 1 if A > 0, and ρ(ℓ) = −1 if A < 0.

1.11 Consider the linear process defined in (1.31).
(a) Verify that the autocovariance function of the process is given by (1.32). Use the result to verify your answer to Problem 1.7. Hint: For h ≥ 0, cov(x_{t+h}, x_t) = cov(Σ_k ψ_k w_{t+h−k}, Σ_j ψ_j w_{t−j}). For each j ∈ Z, the only "survivor" will be when k = h + j.
(b) Show that x_t exists as a limit in mean square (see Appendix A).

1.12 For two weakly stationary series x_t and y_t, verify (1.30).

1.13 Consider the two series
\[ x_t = w_t, \qquad y_t = w_t - \theta w_{t-1} + u_t, \]
where w_t and u_t are independent white noise series with variances σ_w² and σ_u², respectively, and θ is an unspecified constant.
(a) Express the ACF, ρ_y(h), for h = 0, ±1, ±2, ..., of the series y_t as a function of σ_w², σ_u², and θ.
(b) Determine the CCF, ρ_xy(h), relating x_t and y_t.
(c) Show that x_t and y_t are jointly stationary.

1.14 Let x_t be a stationary normal process with mean μ_x and autocovariance function γ(h). Define the nonlinear time series
\[ y_t = \exp\{x_t\}. \]
(a) Express the mean function E(y_t) in terms of μ_x and γ(0). The moment generating function of a normal random variable x with mean μ and variance σ² is
\[ M_x(\lambda) = E[\exp\{\lambda x\}] = \exp\Bigl\{\mu\lambda + \tfrac{1}{2}\sigma^2\lambda^2\Bigr\}. \]
(b) Determine the autocovariance function of y_t. The sum of the two normal random variables x_{t+h} + x_t is still a normal random variable.

1.15 Let w_t, for t = 0, ±1, ±2, ..., be a normal white noise process, and consider the series
\[ x_t = w_t w_{t-1}. \]
Determine the mean and autocovariance function of x_t, and state whether it is stationary.

1.16 Consider the series
\[ x_t = \sin(2\pi U t), \]
t = 1, 2, ..., where U has a uniform distribution on the interval (0, 1).
(a) Prove x_t is weakly stationary.
(b) Prove x_t is not strictly stationary.

1.17 Suppose we have the linear process x_t generated by
\[ x_t = w_t - \theta w_{t-1}, \]
t = 0, 1, 2, ..., where {w_t} is independent and identically distributed with characteristic function φ_w(·), and θ is a fixed constant. [Replace "characteristic function" with "moment generating function" if instructed to do so.]

(a) Express the joint characteristic function of x_1, x_2, ..., x_n, say,
\[ \phi_{x_1, x_2, \ldots, x_n}(\lambda_1, \lambda_2, \ldots, \lambda_n), \]
in terms of φ_w(·).
(b) Deduce from (a) that x_t is strictly stationary.

1.18 Suppose that x_t is a linear process of the form (1.31). Prove
\[ \sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty. \]

Section 1.5

1.19 Suppose x_t = μ + w_t + θ w_{t−1}, where w_t ∼ wn(0, σ_w²).
(a) Show that the mean function is E(x_t) = μ.
(b) Show that the autocovariance function of x_t is given by γ_x(0) = σ_w²(1 + θ²), γ_x(±1) = σ_w² θ, and γ_x(h) = 0 otherwise.
(c) Show that x_t is stationary for all values of θ ∈ R.
(d) Use (1.35) to calculate var(x̄) for estimating μ when (i) θ = 1, (ii) θ = 0, and (iii) θ = −1.
(e) In time series, the sample size n is typically large, so that (n−1)/n ≈ 1. With this as a consideration, comment on the results of part (d); in particular, how does the accuracy in the estimate of the mean μ change for the three different cases?

1.20 (a) Simulate a series of n = 500 Gaussian white noise observations as in Example 1.8 and compute the sample ACF, ρ̂(h), to lag 20. Compare the sample ACF you obtain to the actual ACF, ρ(h). [Recall Example 1.19.]
(b) Repeat part (a) using only n = 50. How does changing n affect the results?

1.21 (a) Simulate a series of n = 500 moving average observations as in Example 1.9 and compute the sample ACF, ρ̂(h), to lag 20. Compare the sample ACF you obtain to the actual ACF, ρ(h). [Recall Example 1.20.]
(b) Repeat part (a) using only n = 50. How does changing n affect the results?

1.22 Although the model in Problem 1.2(a) is not stationary (Why?), the sample ACF can be informative. For the data you generated in that problem, calculate and plot the sample ACF, and then comment.

1.23 Simulate a series of n = 500 observations from the signal-plus-noise model presented in Example 1.12 with σ_w² = 1. Compute the sample ACF to lag 100 of the data you generated and comment.

1.24 For the time series y_t described in Example 1.26, verify the stated result that ρ_y(1) = −.47 and ρ_y(h) = 0 for h > 1.

1.25 A real-valued function g(t), defined on the integers, is non-negative definite if and only if
\[ \sum_{i=1}^{n}\sum_{j=1}^{n} a_i\, g(t_i - t_j)\, a_j \ge 0 \]
for all positive integers n and for all vectors a = (a_1, a_2, ..., a_n)' and t = (t_1, t_2, ..., t_n)'. For the matrix G = {g(t_i − t_j); i, j = 1, 2, ..., n}, this implies that a'Ga ≥ 0 for all vectors a. It is called positive definite if we can replace '≥' with '>' for all a ≠ 0, the zero vector.
(a) Prove that γ(h), the autocovariance function of a stationary process, is a non-negative definite function.
(b) Verify that the sample autocovariance γ̂(h) is a non-negative definite function.

Section 1.6

1.26 Consider a collection of time series x_{1t}, x_{2t}, ..., x_{Nt} that are observing some common signal μ_t observed in noise processes e_{1t}, e_{2t}, ..., e_{Nt}, with a model for the j-th observed series given by
\[ x_{jt} = \mu_t + e_{jt}. \]
Suppose the noise series have zero means and are uncorrelated for different j. The common autocovariance functions of all series are given by γ_e(s, t). Define the sample mean
\[ \bar{x}_t = \frac{1}{N}\sum_{j=1}^{N} x_{jt}. \]
(a) Show that E[x̄_t] = μ_t.
(b) Show that E[(x̄_t − μ_t)²] = N^{−1} γ_e(t, t).
(c) How can we use the results in estimating the common signal?

1.27 A concept used in geostatistics, see Journel and Huijbregts (1978) or Cressie (1993), is that of the variogram, defined for a spatial process x_s, for s = (s_1, s_2), with s_1, s_2 = 0, ±1, ±2, ..., as
\[ V_x(h) = \tfrac{1}{2} E[(x_{s+h} - x_s)^2], \]
where h = (h_1, h_2), for h_1, h_2 = 0, ±1, ±2, .... Show that, for a stationary process, the variogram and autocovariance functions can be related through
\[ V_x(h) = \gamma(0) - \gamma(h), \]
where γ(h) is the usual lag h covariance function and 0 = (0, 0). Note the easy extension to any spatial dimension.

The following problems require the material given in Appendix A

1.28 Suppose x_t = β_0 + β_1 t, where β_0 and β_1 are constants. Prove as n → ∞, ρ̂_x(h) → 1 for fixed h, where ρ̂_x(h) is the ACF (1.37).

1.29 (a) Suppose x_t is a weakly stationary time series with mean zero and with absolutely summable autocovariance function, γ(h), such that
\[ \sum_{h=-\infty}^{\infty} \gamma(h) = 0. \]
Prove that √n x̄ → 0 in probability, where x̄ is the sample mean (1.34).
(b) Give an example of a process that satisfies the conditions of part (a). What is special about this process?

1.30 Let x_t be a linear process of the form (A.43)–(A.44). If we define
\[ \tilde{\gamma}(h) = n^{-1}\sum_{t=1}^{n} (x_{t+h} - \mu_x)(x_t - \mu_x), \]
show that
\[ n^{1/2}\bigl(\tilde{\gamma}(h) - \hat{\gamma}(h)\bigr) = o_p(1). \]
Hint: The Markov Inequality
\[ \Pr\{|x| \ge \epsilon\} < \frac{E|x|}{\epsilon} \]
can be helpful for the cross-product terms.

1.31 For a linear process of the form
\[ x_t = \sum_{j=0}^{\infty} \phi^j w_{t-j}, \]
where {w_t} satisfies the conditions of Theorem A.7 and |φ| < 1, show that
\[ \sqrt{n}\,\frac{\hat{\rho}_x(1) - \rho_x(1)}{\sqrt{1 - \rho_x^2(1)}} \;\xrightarrow{d}\; N(0, 1), \]
and construct a 95% confidence interval for φ when ρ̂_x(1) = .64 and n = 100.

1.32 Let {x_t; t = 0, ±1, ±2, ...} be iid(0, σ²).
(a) For h ≥ 1 and k ≥ 1, show that x_t x_{t+h} and x_s x_{s+k} are uncorrelated for all s ≠ t.
(b) For fixed h ≥ 1, show that the h × 1 vector
\[ \sigma^{-2} n^{-1/2} \sum_{t=1}^{n} (x_t x_{t+1}, \ldots, x_t x_{t+h})' \;\xrightarrow{d}\; (z_1, \ldots, z_h)', \]
where z_1, ..., z_h are iid N(0, 1) random variables. [Hint: Use the Cramér–Wold device.]

(c) Show, for each h ≥ 1,
\[
n^{-1/2}\Biggl[\sum_{t=1}^{n-h} x_t x_{t+h} - \sum_{t=1}^{n-h} (x_t - \bar{x})(x_{t+h} - \bar{x})\Biggr] \;\xrightarrow{p}\; 0 \quad \text{as } n \to \infty,
\]
where x̄ = n^{−1} Σ_{t=1}^{n} x_t.
(d) Noting that n^{−1} Σ_{t=1}^{n} x_t² →_p σ² by the WLLN, conclude that
\[
n^{1/2}\,[\hat{\rho}(1), \ldots, \hat{\rho}(h)]' \;\xrightarrow{d}\; (z_1, \ldots, z_h)',
\]
where ρ̂(h) is the sample ACF of the data x_1, ..., x_n.


Chapter 2
Time Series Regression and Exploratory Data Analysis

In this chapter we introduce classical multiple linear regression in a time series context, model selection, exploratory data analysis for preprocessing nonstationary time series (for example trend removal), the concept of differencing and the backshift operator, variance stabilization, and nonparametric smoothing of time series.

2.1 Classical Regression in the Time Series Context

We begin our discussion of linear regression in the time series context by assuming some output or dependent time series, say, x_t, for t = 1, ..., n, is being influenced by a collection of possible inputs or independent series, say, z_{t1}, z_{t2}, ..., z_{tq}, where we first regard the inputs as fixed and known. This assumption, necessary for applying conventional linear regression, will be relaxed later on. We express this relation through the linear regression model
\[
x_t = \beta_0 + \beta_1 z_{t1} + \beta_2 z_{t2} + \cdots + \beta_q z_{tq} + w_t, \tag{2.1}
\]
where β_0, β_1, ..., β_q are unknown fixed regression coefficients, and {w_t} is a random error or noise process consisting of independent and identically distributed (iid) normal variables with mean zero and variance σ_w². For time series regression, it is rarely the case that the noise is white, and we will need to eventually relax that assumption. A more general setting within which to embed mean square estimation and linear regression is given in Appendix B, where we introduce Hilbert spaces and the Projection Theorem.

Example 2.1 Estimating a Linear Trend
Consider the monthly price (per pound) of a chicken in the US from mid-2001 to mid-2016 (180 months), say x_t, shown in Figure 2.1. There is an obvious upward trend in the series, and we might use simple linear regression to estimate that trend by fitting the model

Fig. 2.1. The price of chicken: monthly whole bird spot price, Georgia docks, US cents per pound, August 2001 to July 2016, with fitted linear trend line.

x_t = β_0 + β_1 z_t + w_t,   z_t = 2001 7/12, 2001 8/12, . . ., 2016 6/12.

This is in the form of the regression model (2.1) with q = 1. Note that we are making the assumption that the errors, w_t, are an iid normal sequence, which may not be true; the problem of autocorrelated errors is discussed in detail in Chapter 3. In ordinary least squares (OLS), we minimize the error sum of squares

Q = Σ_{t=1}^{n} w_t^2 = Σ_{t=1}^{n} ( x_t − [β_0 + β_1 z_t] )^2

with respect to β_i for i = 0, 1. In this case we can use simple calculus to evaluate ∂Q/∂β_i = 0 for i = 0, 1, to obtain two equations to solve for the βs. The OLS estimates of the coefficients are explicit and given by

β̂_1 = Σ_{t=1}^{n} (x_t − x̄)(z_t − z̄) / Σ_{t=1}^{n} (z_t − z̄)^2   and   β̂_0 = x̄ − β̂_1 z̄,

where x̄ = Σ_t x_t / n and z̄ = Σ_t z_t / n are the respective sample means.

Using R, we obtained the estimated slope coefficient of β̂_1 = 3.59 (with a standard error of .08), yielding a significant estimated increase of about 3.6 cents per year. Finally, Figure 2.1 shows the data with the estimated trend line superimposed.

R code with partial output:
summary(fit <- lm(chicken~time(chicken), na.action=NULL))
              Estimate  Std.Error  t.value
(Intercept)   -7131.02     162.41    -43.9
time(chicken)     3.59       0.08     44.4
--
Residual standard error: 4.7 on 178 degrees of freedom
plot(chicken, ylab="cents per pound")
abline(fit)    # add the fitted line

The multiple linear regression model described by (2.1) can be conveniently written in a more general notation by defining the column vectors z_t = (1, z_{t1}, z_{t2}, . . ., z_{tq})′

and β = (β_0, β_1, . . ., β_q)′, where ′ denotes transpose, so (2.1) can be written in the alternate form

x_t = β_0 + β_1 z_{t1} + · · · + β_q z_{tq} + w_t = β′ z_t + w_t,    (2.2)

where w_t ∼ iid N(0, σ_w^2). As in the previous example, OLS estimation finds the coefficient vector β that minimizes the error sum of squares

Q = Σ_{t=1}^{n} w_t^2 = Σ_{t=1}^{n} ( x_t − β′ z_t )^2,    (2.3)

with respect to β_0, β_1, . . ., β_q. This minimization can be accomplished by differentiating (2.3) with respect to the vector β or by using the properties of projections. Either way, the solution must satisfy Σ_{t=1}^{n} z_t ( x_t − β̂′ z_t ) = 0. This procedure gives the normal equations

( Σ_{t=1}^{n} z_t z_t′ ) β̂ = Σ_{t=1}^{n} z_t x_t.    (2.4)

If Σ_{t=1}^{n} z_t z_t′ is non-singular, the least squares estimate of β is

β̂ = ( Σ_{t=1}^{n} z_t z_t′ )^{−1} Σ_{t=1}^{n} z_t x_t.

The minimized error sum of squares (2.3), denoted SSE, can be written as

SSE = Σ_{t=1}^{n} ( x_t − β̂′ z_t )^2.    (2.5)

The ordinary least squares estimators are unbiased, i.e., E(β̂) = β, and have the smallest variance within the class of linear unbiased estimators.

If the errors w_t are normally distributed, β̂ is also the maximum likelihood estimator for β and is normally distributed with

cov(β̂) = σ_w^2 C,    (2.6)

where

C = ( Σ_{t=1}^{n} z_t z_t′ )^{−1}    (2.7)

is a convenient notation. An unbiased estimator for the variance σ_w^2 is

s_w^2 = MSE = SSE / ( n − (q + 1) ),    (2.8)

where MSE denotes the mean squared error. Under the normal assumption,

t = ( β̂_i − β_i ) / ( s_w √c_ii )    (2.9)

has the t-distribution with n − (q + 1) degrees of freedom; c_ii denotes the i-th diagonal element of C, as defined in (2.7). This result is often used for individual tests of the null hypothesis H_0: β_i = 0 for i = 1, . . ., q.
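The quantities in (2.4)–(2.9) are easy to compute directly. The following is a minimal sketch (our own illustration, not part of the text; it assumes the astsa package and the chicken series of Example 2.1) that forms the normal equations by hand and can be checked against summary(lm(chicken ~ time(chicken))):

library(astsa)
Z   = cbind(1, as.numeric(time(chicken)))   # n x (q+1) design matrix, q = 1
x   = as.numeric(chicken)
ZtZ = t(Z) %*% Z
beta.hat = solve(ZtZ, t(Z) %*% x)           # normal equations (2.4)
SSE = sum((x - Z %*% beta.hat)^2)           # (2.5)
s2w = SSE / (nrow(Z) - 2)                   # (2.8), MSE with q = 1
C   = solve(ZtZ)                            # (2.7)
se  = sqrt(s2w * diag(C))                   # standard errors from (2.6)
cbind(estimate = as.numeric(beta.hat), std.error = se)  # compare with the lm() output above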

Various competing models are often of interest to isolate or select the best subset of independent variables. Suppose a proposed model specifies that only a subset r < q independent variables, say, z_{t,1:r} = {z_{t1}, z_{t2}, . . ., z_{tr}}, is influencing the dependent variable x_t. The reduced model is

x_t = β_0 + β_1 z_{t1} + · · · + β_r z_{tr} + w_t,    (2.10)

where β_1, β_2, . . ., β_r are a subset of coefficients of the original q variables.

The null hypothesis in this case is H_0: β_{r+1} = · · · = β_q = 0. We can test the reduced model (2.10) against the full model (2.2) by comparing the error sums of squares under the two models using the F-statistic

F = [ (SSE_r − SSE) / (q − r) ] / [ SSE / (n − q − 1) ] = MSR / MSE,    (2.11)

where SSE_r is the error sum of squares under the reduced model (2.10). Note that SSE_r ≥ SSE because the full model has more parameters. If H_0: β_{r+1} = · · · = β_q = 0 is true, then SSE_r ≈ SSE because the estimates of those βs will be close to 0. Hence, we do not believe H_0 if SSR = SSE_r − SSE is big. Under the null hypothesis, (2.11) has a central F-distribution with q − r and n − q − 1 degrees of freedom when (2.10) is the correct model.

These results are often summarized in an Analysis of Variance (ANOVA) table as given in Table 2.1 for this particular case. The difference in the numerator is often called the regression sum of squares (SSR).

Table 2.1. Analysis of Variance for Regression
Source        df            Sum of Squares       Mean Square             F
z_{t,r+1:q}   q − r         SSR = SSE_r − SSE    MSR = SSR/(q − r)       F = MSR/MSE
Error         n − (q + 1)   SSE                  MSE = SSE/(n − q − 1)

The null hypothesis is rejected at level α if F > F^{q−r}_{n−q−1}(α), the 1 − α percentile of the F distribution with q − r numerator and n − q − 1 denominator degrees of freedom.

A special case of interest is the null hypothesis H_0: β_1 = · · · = β_q = 0. In this case r = 0, and the model in (2.10) becomes

x_t = β_0 + w_t.

We may measure the proportion of variation accounted for by all the variables using

R^2 = ( SSE_0 − SSE ) / SSE_0,    (2.12)

where the residual sum of squares under the reduced model is

SSE_0 = Σ_{t=1}^{n} ( x_t − x̄ )^2.    (2.13)

In this case SSE_0 is the sum of squared deviations from the mean x̄ and is otherwise known as the adjusted total sum of squares. The measure R^2 is called the coefficient of determination.

The techniques discussed in the previous paragraph can be used to test various models against one another using the F test given in (2.11). These tests have been used in the past in a stepwise manner, where variables are added or deleted when the values from the F-test either exceed or fail to exceed some predetermined levels. The procedure, called stepwise multiple regression, is useful in arriving at a set of useful variables. An alternative is to focus on a procedure for model selection that does not proceed sequentially, but simply evaluates each model on its own merits. Suppose we consider a normal regression model with k coefficients and denote the maximum likelihood estimator for the variance as

σ̂_k^2 = SSE(k) / n,    (2.14)

where SSE(k) denotes the residual sum of squares under the model with k regression coefficients. Then, Akaike (1969, 1973, 1974) suggested measuring the goodness of fit for this particular model by balancing the error of the fit against the number of parameters in the model; we define the following.^{2.1}

Definition 2.1 Akaike's Information Criterion (AIC)

AIC = log σ̂_k^2 + (n + 2k) / n,    (2.15)

where σ̂_k^2 is given by (2.14) and k is the number of parameters in the model.

The value of k yielding the minimum AIC specifies the best model. The idea is roughly that minimizing σ̂_k^2 would be a reasonable objective, except that it decreases monotonically as k increases. Therefore, we ought to penalize the error variance by a term proportional to the number of parameters. The choice for the penalty term given by (2.15) is not the only one, and a considerable literature is available advocating different penalty terms. A corrected form, suggested by Sugiura (1978), and expanded by Hurvich and Tsai (1989), can be based on small-sample distributional results for the linear regression model (details are provided in Problem 2.4 and Problem 2.5). The corrected form is defined as follows.

Definition 2.2 AIC, Bias Corrected (AICc)

AICc = log σ̂_k^2 + (n + k) / (n − k − 2),    (2.16)

2.1 Formally, AIC is defined as −2 log L_k + 2k, where L_k is the maximized likelihood and k is the number of parameters in the model. For the normal regression problem, AIC can be reduced to the form given by (2.15). AIC is an estimate of the Kullback–Leibler discrepancy between a true model and a candidate model; see Problem 2.4 and Problem 2.5 for further details.

where σ̂_k^2 is given by (2.14), k is the number of parameters in the model, and n is the sample size.

We may also derive a correction term based on Bayesian arguments, as in Schwarz (1978), which leads to the following.

Definition 2.3 Bayesian Information Criterion (BIC)

BIC = log σ̂_k^2 + k log n / n,    (2.17)

using the same notation as in Definition 2.2.

BIC is also called the Schwarz Information Criterion (SIC); see also Rissanen (1978) for an approach yielding the same statistic based on a minimum description length argument. Notice that the penalty term in BIC is much larger than in AIC; consequently, BIC tends to choose smaller models. Various simulation studies have tended to verify that BIC does well at getting the correct order in large samples, whereas AICc tends to be superior in smaller samples where the relative number of parameters is large; see McQuarrie and Tsai (1998) for detailed comparisons. In fitting regression models, two measures that have been used in the past are adjusted R-squared, which is essentially s_w^2, and Mallows C_p, Mallows (1973), which we do not consider in this context.

Example 2.2 Pollution, Temperature and Mortality
The data shown in Figure 2.2 are extracted series from a study by Shumway et al. (1988) of the possible effects of temperature and pollution on weekly mortality in Los Angeles County. Note the strong seasonal components in all of the series, corresponding to winter-summer variations, and the downward trend in the cardiovascular mortality over the 10-year period.

A scatterplot matrix, shown in Figure 2.3, indicates a possible linear relation between mortality and the pollutant particulates and a possible relation to temperature. Note the curvilinear shape of the temperature mortality curve, indicating that higher temperatures as well as lower temperatures are associated with increases in cardiovascular mortality.

Based on the scatterplot matrix, we entertain, tentatively, four models where M_t denotes cardiovascular mortality, T_t denotes temperature, and P_t denotes the particulate levels. They are

M_t = β_0 + β_1 t + w_t    (2.18)
M_t = β_0 + β_1 t + β_2 (T_t − T_·) + w_t    (2.19)
M_t = β_0 + β_1 t + β_2 (T_t − T_·) + β_3 (T_t − T_·)^2 + w_t    (2.20)
M_t = β_0 + β_1 t + β_2 (T_t − T_·) + β_3 (T_t − T_·)^2 + β_4 P_t + w_t    (2.21)

where we adjust temperature for its mean, T_· = 74.26, to avoid collinearity problems. It is clear that (2.18) is a trend only model, (2.19) is linear temperature, (2.20) is curvilinear temperature, and (2.21) is curvilinear temperature and pollution.
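As a sketch of how such a comparison can be organized in R (our own illustration, not from the text; it assumes the astsa series cmort, tempr, and part, and reuses the conversion of R's AIC() and BIC() to the forms (2.15) and (2.17) that appears in the code at the end of this example), the four candidate models can be fit and their criteria collected in one pass:

library(astsa)
trend = time(cmort)                 # time index
temp  = tempr - mean(tempr)         # centered temperature
temp2 = temp^2
num   = length(cmort)               # n = 508
fits  = list(lm(cmort ~ trend, na.action=NULL),                        # (2.18)
             lm(cmort ~ trend + temp, na.action=NULL),                 # (2.19)
             lm(cmort ~ trend + temp + temp2, na.action=NULL),         # (2.20)
             lm(cmort ~ trend + temp + temp2 + part, na.action=NULL))  # (2.21)
# AIC and BIC in the form of (2.15) and (2.17); R's versions differ only by a constant
t(sapply(fits, function(f)
    c(AIC = AIC(f)/num - log(2*pi), BIC = BIC(f)/num - log(2*pi))))

The resulting values should be close to the AIC and BIC columns of the summary table that follows.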

Fig. 2.2. Average weekly cardiovascular mortality (top), temperature (middle) and particulate pollution (bottom) in Los Angeles County. There are 508 six-day smoothed averages obtained by filtering daily values over the 10 year period 1970-1979.

Table 2.2. Summary Statistics for Mortality Models
Model    k    SSE      df    MSE    R^2    AIC    BIC
(2.18)   2    40,020   506   79.0   .21    5.38   5.40
(2.19)   3    31,413   505   62.2   .38    5.14   5.17
(2.20)   4    27,985   504   55.5   .45    5.03   5.07
(2.21)   5    20,508   503   40.8   .60    4.72   4.77

We summarize some of the statistics given for this particular case in Table 2.2. We note that each model does substantially better than the one before it and that the model including temperature, temperature squared, and particulates does the best, accounting for some 60% of the variability and with the best value for AIC and BIC (because of the large sample size, AIC and AICc are nearly the same). Note that one can compare any two models using the residual sums of squares and (2.11). Hence, a model with only trend could be compared to the full model, H_0: β_2 = β_3 = β_4 = 0, using q = 4, r = 1, and n = 508,

F_{3,503} = [ (40,020 − 20,508)/3 ] / [ 20,508/503 ] = 160,

which exceeds F_{3,503}(.001) = 5.51. We obtain the best prediction model,

M̂_t = 2831.5 − 1.396_{(.10)} t − .472_{(.032)} (T_t − 74.26) + .023_{(.003)} (T_t − 74.26)^2 + .255_{(.019)} P_t,

for mortality, where the standard errors, computed from (2.6)–(2.8), are given in parentheses.

As expected, a negative trend is present in time as well as a negative coefficient for adjusted temperature. The quadratic effect of temperature can clearly be seen in the scatterplots of Figure 2.3. Pollution weights positively and can be interpreted as the incremental contribution to daily deaths per unit of particulate pollution. It would still be essential to check the residuals ŵ_t = M_t − M̂_t for autocorrelation (of which there is a substantial amount), but we defer this question to Section 3.8 when we discuss regression with correlated errors.

Fig. 2.3. Scatterplot matrix showing relations between mortality, temperature, and pollution.
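This comparison of the reduced (trend only) and full models can be reproduced with anova() in R. The following is a sketch of our own (it assumes the astsa series and re-creates the regressors used in the code that follows), not part of the original example:

library(astsa)
trend = time(cmort); temp = tempr - mean(tempr); temp2 = temp^2
fit.r = lm(cmort ~ trend, na.action=NULL)                        # reduced model (2.18)
fit.f = lm(cmort ~ trend + temp + temp2 + part, na.action=NULL)  # full model (2.21)
anova(fit.r, fit.f)    # partial F test; F is about 160 on 3 and 503 df
qf(1-.001, 3, 503)     # the .001 critical value, about 5.51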
Below is the R code to plot the series, display the scatterplot matrix, fit the final regression model (2.21), and compute the corresponding values of AIC, AICc and BIC.^{2.2} Finally, the use of na.action in lm() is to retain the time series attributes for the residuals and fitted values.

2.2 The easiest way to extract AIC and BIC from an lm() run in R is to use the command AIC() or BIC(). Our definitions differ from R by terms that do not change from model to model. In the example, we show how to obtain (2.15) and (2.17) from the R output. It is more difficult to obtain AICc.

par(mfrow=c(3,1))    # plot the data
plot(cmort, main="Cardiovascular Mortality", xlab="", ylab="")
plot(tempr, main="Temperature", xlab="", ylab="")
plot(part, main="Particulates", xlab="", ylab="")
dev.new()            # open a new graphic device
ts.plot(cmort,tempr,part, col=1:3)   # all on same plot (not shown)
dev.new()
pairs(cbind(Mortality=cmort, Temperature=tempr, Particulates=part))
temp  = tempr-mean(tempr)   # center temperature
temp2 = temp^2
trend = time(cmort)         # time
fit = lm(cmort~ trend + temp + temp2 + part, na.action=NULL)
summary(fit)                # regression results
summary(aov(fit))           # ANOVA table (compare to next line)
summary(aov(lm(cmort~cbind(trend, temp, temp2, part))))   # Table 2.1
num = length(cmort)         # sample size
AIC(fit)/num - log(2*pi)    # AIC
BIC(fit)/num - log(2*pi)    # BIC
(AICc = log(sum(resid(fit)^2)/num) + (num+5)/(num-5-2))    # AICc

As previously mentioned, it is possible to include lagged variables in time series regression models and we will continue to discuss this type of problem throughout the text. This concept is explored further in Problem 2.2 and Problem 2.10. The following is a simple example of lagged regression.

Example 2.3 Regression With Lagged Variables
In Example 1.28, we discovered that the Southern Oscillation Index (SOI) measured at time t − 6 months is associated with the Recruitment series at time t, indicating that the SOI leads the Recruitment series by six months. Although there is evidence that the relationship is not linear (this is discussed further in Example 2.8 and Example 2.9), consider the following regression,

R_t = β_0 + β_1 S_{t−6} + w_t,    (2.22)

where R_t denotes Recruitment for month t and S_{t−6} denotes SOI six months prior. Assuming the w_t sequence is white, the fitted model is

R̂_t = 65.79 − 44.28_{(2.78)} S_{t−6}    (2.23)

with σ̂_w = 22.5 on 445 degrees of freedom. This result indicates the strong predictive ability of SOI for Recruitment six months in advance. Of course, it is still essential to check the model assumptions, but again we defer this until later.

Performing lagged regression in R is a little difficult because the series must be aligned prior to running the regression. The easiest way to do this is to create a data frame (that we call fish) using ts.intersect, which aligns the lagged series.
fish = ts.intersect(rec, soiL6=lag(soi,-6), dframe=TRUE)
summary(fit1 <- lm(rec~soiL6, data=fish, na.action=NULL))

The headache of aligning the lagged series can be avoided by using the R package dynlm, which must be downloaded and installed.
library(dynlm)
summary(fit2 <- dynlm(rec~ L(soi,6)))

We note that fit2 is similar to the fit1 object, but the time series attributes are retained without any additional commands.
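The deferred check of the model assumptions can at least be started with a look at the residual autocorrelation; a one-line sketch (our own addition, using the fit1 object created above):

acf(resid(fit1), 48)   # residual ACF; regression with autocorrelated errors is taken up in Chapter 3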

2.2 Exploratory Data Analysis

In general, it is necessary for time series data to be stationary so that averaging lagged products over time, as in the previous section, will be a sensible thing to do. With time series data, it is the dependence between the values of the series that is important to measure; we must, at least, be able to estimate autocorrelations with precision. It would be difficult to measure that dependence if the dependence structure is not regular or is changing at every time point. Hence, to achieve any meaningful statistical analysis of time series data, it will be crucial that, if nothing else, the mean and the autocovariance functions satisfy the conditions of stationarity (for at least some reasonable stretch of time) stated in Definition 1.7. Often, this is not the case, and we will mention some methods in this section for playing down the effects of nonstationarity so the stationary properties of the series may be studied.

A number of our examples came from clearly nonstationary series. The Johnson & Johnson series in Figure 1.1 has a mean that increases exponentially over time, and the increase in the magnitude of the fluctuations around this trend causes changes in the covariance function; the variance of the process, for example, clearly increases as one progresses over the length of the series. Also, the global temperature series shown in Figure 1.2 contains some evidence of a trend over time; human-induced global warming advocates seize on this as empirical evidence to advance the hypothesis that temperatures are increasing.

Perhaps the easiest form of nonstationarity to work with is the trend stationary model wherein the process has stationary behavior around a trend. We may write this type of model as

x_t = μ_t + y_t,    (2.24)

where x_t are the observations, μ_t denotes the trend, and y_t is a stationary process. Quite often, strong trend will obscure the behavior of the stationary process, y_t, as we shall see in numerous examples. Hence, there is some advantage to removing the trend as a first step in an exploratory analysis of such time series. The steps involved are to obtain a reasonable estimate of the trend component, say μ̂_t, and then work with the residuals

ŷ_t = x_t − μ̂_t.    (2.25)

Example 2.4 Detrending Chicken Prices
Here we suppose the model is of the form of (2.24),

x_t = μ_t + y_t,

where, as we suggested in the analysis of the chicken price data presented in Example 2.1, a straight line might be useful for detrending the data; i.e.,

Fig. 2.4. Detrended (top) and differenced (bottom) chicken price series. The original data are shown in Figure 2.1.

μ_t = β_0 + β_1 t.

In that example, we estimated the trend using ordinary least squares and found

μ̂_t = −7131 + 3.59 t,

where we are using t instead of z_t for time. Figure 2.1 shows the data with the estimated trend line superimposed. To obtain the detrended series^{2.3} we simply subtract μ̂_t from the observations, x_t, to obtain the detrended series

ŷ_t = x_t + 7131 − 3.59 t.

The top graph of Figure 2.4 shows the detrended series. Figure 2.5 shows the ACF of the original data (top panel) as well as the ACF of the detrended data (middle panel).

In Example 1.11 and the corresponding Figure 1.10 we saw that a random walk might also be a good model for trend. That is, rather than modeling trend as fixed (as in Example 2.4), we might model trend as a stochastic component using the random walk with drift model,

μ_t = δ + μ_{t−1} + w_t,    (2.26)

where w_t is white noise and is independent of y_t. If the appropriate model is (2.24), then differencing the data, x_t, yields a stationary process; that is,

2.3 Because the error term, y_t, is not assumed to be iid, the reader may feel that weighted least squares is called for in this case. The problem is, we do not know the behavior of y_t and that is precisely what we are trying to assess at this stage. A notable result by Grenander and Rosenblatt (1957, Ch 7), however, is that under mild conditions on y_t, for polynomial regression or periodic regression, asymptotically, ordinary least squares is equivalent to weighted least squares with regard to efficiency.

x_t − x_{t−1} = (μ_t + y_t) − (μ_{t−1} + y_{t−1})    (2.27)
             = δ + w_t + y_t − y_{t−1}.

It is easy to show z_t = y_t − y_{t−1} is stationary using Property 1.1. That is, because y_t is stationary,

γ_z(h) = cov(z_{t+h}, z_t) = cov( y_{t+h} − y_{t+h−1}, y_t − y_{t−1} )
       = 2γ_y(h) − γ_y(h + 1) − γ_y(h − 1)

is independent of time; we leave it as an exercise (Problem 2.7) to show that x_t − x_{t−1} in (2.27) is stationary.

One advantage of differencing over detrending to remove trend is that no parameters are estimated in the differencing operation. One disadvantage, however, is that differencing does not yield an estimate of the stationary process y_t, as can be seen in (2.27). If an estimate of y_t is essential, then detrending may be more appropriate. If the goal is to coerce the data to stationarity, then differencing may be more appropriate. Differencing is also a viable tool if the trend is fixed, as in Example 2.4. That is, e.g., if μ_t = β_0 + β_1 t in the model (2.24), differencing the data produces stationarity (see Problem 2.6):

x_t − x_{t−1} = (μ_t + y_t) − (μ_{t−1} + y_{t−1}) = β_1 + y_t − y_{t−1}.

Because differencing plays a central role in time series analysis, it receives its own notation. The first difference is denoted as

∇x_t = x_t − x_{t−1}.    (2.28)

As we have seen, the first difference eliminates a linear trend. A second difference, that is, the difference of (2.28), can eliminate a quadratic trend, and so on. In order to define higher differences, we need a variation in notation that we will use often in our discussion of ARIMA models in Chapter 3.

Definition 2.4 We define the backshift operator by

B x_t = x_{t−1}

and extend it to powers B^2 x_t = B(B x_t) = B x_{t−1} = x_{t−2}, and so on. Thus,

B^k x_t = x_{t−k}.    (2.29)

The idea of an inverse operator can also be given if we require B^{−1} B = 1, so that

x_t = B^{−1} B x_t = B^{−1} x_{t−1}.

That is, B^{−1} is the forward-shift operator. In addition, it is clear that we may rewrite (2.28) as

Fig. 2.5. Sample ACFs of chicken prices (top), and of the detrended (middle) and the differenced (bottom) series. Compare the top plot with the sample ACF of a straight line: acf(1:100).

∇x_t = (1 − B) x_t,    (2.30)

and we may extend the notion further. For example, the second difference becomes

∇^2 x_t = (1 − B)^2 x_t = (1 − 2B + B^2) x_t = x_t − 2x_{t−1} + x_{t−2}    (2.31)

by the linearity of the operator. To check, just take the difference of the first difference

∇(∇x_t) = ∇(x_t − x_{t−1}) = (x_t − x_{t−1}) − (x_{t−1} − x_{t−2}).

Definition 2.5 Differences of order d are defined as

∇^d = (1 − B)^d,    (2.32)

where we may expand the operator (1 − B)^d algebraically to evaluate for higher integer values of d. When d = 1, we drop it from the notation.

The first difference (2.28) is an example of a linear filter applied to eliminate a trend. Other filters, formed by averaging values near x_t, can produce adjusted series that eliminate other kinds of unwanted fluctuations, as in Chapter 4. The differencing technique is an important component of the ARIMA model of Box and Jenkins (1970) (see also Box et al., 1994), to be discussed in Chapter 3.
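In R, differences of any order are computed with diff(), where the argument differences gives the power d in (2.32). A small sketch (our own check on an arbitrary simulated series, not from the text) confirms that the second difference agrees with the expansion in (2.31):

set.seed(1)
x  = cumsum(rnorm(10))                  # any short series will do
d2 = diff(x, differences=2)             # (1 - B)^2 x_t
by.hand = x[3:10] - 2*x[2:9] + x[1:8]   # x_t - 2*x_{t-1} + x_{t-2}, cf. (2.31)
all.equal(as.numeric(d2), by.hand)      # TRUE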

Example 2.5 Differencing Chicken Prices
The first difference of the chicken prices series, also shown in Figure 2.4, produces different results than removing trend by detrending via regression. For example, the differenced series does not contain the long (five-year) cycle we observe in the detrended series. The ACF of this series is also shown in Figure 2.5. In this case, the differenced series exhibits an annual cycle that was obscured in the original or detrended data. The R code to reproduce Figure 2.4 and Figure 2.5 is as follows.
# regress chicken on time
fit = lm(chicken~time(chicken), na.action=NULL)
par(mfrow=c(2,1))
plot(resid(fit), type="o", main="detrended")
plot(diff(chicken), type="o", main="first difference")
par(mfrow=c(3,1))   # plot ACFs
acf(chicken, 48, main="chicken")
acf(resid(fit), 48, main="detrended")
acf(diff(chicken), 48, main="first difference")

Example 2.6 Differencing Global Temperature
The global temperature series shown in Figure 1.2 appears to behave more as a random walk than a trend stationary series. Hence, rather than detrend the data, it would be more appropriate to use differencing to coerce it into stationarity. The differenced data are shown in Figure 2.6 along with the corresponding sample ACF. In this case it appears that the differenced process shows minimal autocorrelation, which may imply the global temperature series is nearly a random walk with drift. It is interesting to note that if the series is a random walk with drift, the mean of the differenced series, which is an estimate of the drift, is about .008, or an increase of about one degree centigrade per 100 years. The R code to reproduce Figure 2.6 is as follows.
par(mfrow=c(2,1))
plot(diff(globtemp), type="o")
mean(diff(globtemp))        # drift estimate = .008
acf(diff(globtemp), 48)

Fig. 2.6. Differenced global temperature series and its sample ACF.

An alternative to differencing is a less-severe operation that still assumes stationarity of the underlying time series. This alternative, called fractional differencing, extends the notion of the difference operator (2.32) to fractional powers −.5 < d < .5, which still define stationary processes. Granger and Joyeux (1980) and Hosking (1981) introduced long memory time series, which corresponds to the case when 0 < d < .5. This model is often used for environmental time series arising in hydrology. We will discuss long memory processes in more detail in Section 5.1.

Often, obvious aberrations are present that can contribute nonstationary as well as nonlinear behavior in observed time series. In such cases, transformations may be useful to equalize the variability over the length of a single series. A particularly useful transformation is

y_t = log x_t,    (2.33)

which tends to suppress larger fluctuations that occur over portions of the series where the underlying values are larger. Other possibilities are power transformations in the

Box–Cox family of the form

y_t = (x_t^λ − 1) / λ,   for λ ≠ 0,
y_t = log x_t,           for λ = 0.    (2.34)

Methods for choosing the power λ are available (see Johnson and Wichern, 1992, §4.7) but we do not pursue them here. Often, transformations are also used to improve the approximation to normality or to improve linearity in predicting the value of one series from another.

Example 2.7 Paleoclimatic Glacial Varves
Melting glaciers deposit yearly layers of sand and silt during the spring melting seasons, which can be reconstructed yearly over a period ranging from the time deglaciation began in New England (about 12,600 years ago) to the time it ended (about 6,000 years ago). Such sedimentary deposits, called varves, can be used as proxies for paleoclimatic parameters, such as temperature, because, in a warm year, more sand and silt are deposited from the receding glacier. Figure 2.7 shows the thicknesses of the yearly varves collected from one location in Massachusetts for 634 years, beginning 11,834 years ago. For further information, see Shumway and Verosub (1992). Because the variation in thicknesses increases in proportion to the amount deposited, a logarithmic transformation could remove the nonstationarity observable in the variance as a function of time. Figure 2.7 shows the original and transformed varves, and it is clear that this improvement has occurred. We may also plot the histogram of the original and transformed data, as in Problem 2.8, to argue that the approximation to normality is improved. The ordinary first differences (2.30) are also computed in Problem 2.8, and we note that the first differences have

a significant negative correlation at lag h = 1. Later, in Chapter 5, we will show that perhaps the varve series has long memory and will propose using fractional differencing. Figure 2.7 was generated in R as follows:
par(mfrow=c(2,1))
plot(varve, main="varve", ylab="")
plot(log(varve), main="log(varve)", ylab="")

Fig. 2.7. Glacial varve thicknesses (top) from Massachusetts for n = 634 years compared with log transformed thicknesses (bottom).
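Example 2.7 used the log transformation, the λ = 0 case of (2.34). As a sketch of how other powers in the Box–Cox family might be compared (our own illustration, not from the text; the grid of λ values is an arbitrary choice), one can apply (2.34) directly to the varve series and inspect the resulting plots:

library(astsa)
bc = function(x, lambda)                 # the Box-Cox family (2.34)
       if (lambda == 0) log(x) else (x^lambda - 1)/lambda
par(mfrow=c(2,2))
for (lam in c(1, .5, .25, 0))            # arbitrary grid; lambda = 1 is just a shift of the raw series
  plot(bc(varve, lam), ylab="", main=paste("lambda =", lam))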

Next, we consider another preliminary data processing technique that is used for the purpose of visualizing the relations between series at different lags, namely, scatterplot matrices. In the definition of the ACF, we are essentially interested in relations between x_t and x_{t−h}; the autocorrelation function tells us whether a substantial linear relation exists between the series and its own lagged values. The ACF gives a profile of the linear correlation at all possible lags and shows which values of h lead to the best predictability. The restriction of this idea to linear predictability, however, may mask a possible nonlinear relation between current values, x_t, and past values, x_{t−h}. This idea extends to two series where one may be interested in examining scatterplots of y_t versus x_{t−h}.

Example 2.8 Scatterplot Matrices, SOI and Recruitment
To check for nonlinear relations of this form, it is convenient to display a lagged scatterplot matrix, as in Figure 2.8, that displays values of the SOI, S_t, on the vertical axis plotted against S_{t−h} on the horizontal axis. The sample autocorrelations are displayed in the upper right-hand corner and superimposed on the scatterplots are locally weighted scatterplot smoothing (lowess) lines that can be used to help discover any nonlinearities. We discuss smoothing in the next section, but for now, think of lowess as a robust method for fitting local regression.

Fig. 2.8. Scatterplot matrix relating current SOI values, S_t, to past SOI values, S_{t−h}, at lags h = 1, 2, . . ., 12. The values in the upper right corner are the sample autocorrelations and the lines are a lowess fit.

In Figure 2.8, we notice that the lowess fits are approximately linear, so that the sample autocorrelations are meaningful. Also, we see strong positive linear relations at lags h = 1, 2, 11, 12, that is, between S_t and S_{t−1}, S_{t−2}, S_{t−11}, S_{t−12}, and a negative linear relation at lags h = 6, 7. These results match up well with peaks noticed in the ACF in Figure 1.16.

Similarly, we might want to look at values of one series, say Recruitment, denoted R_t, plotted against another series at various lags, say the SOI, S_{t−h}, to look for possible nonlinear relations between the two series. Because, for example, we might wish to predict the Recruitment series, R_t, from current or past values of the SOI series, S_{t−h}, for h = 0, 1, 2, . . ., it would be worthwhile to examine the scatterplot matrix. Figure 2.9 shows the lagged scatterplot of the Recruitment series R_t on the

vertical axis plotted against the SOI index S_{t−h} on the horizontal axis. In addition, the figure exhibits the sample cross-correlations as well as lowess fits.

Fig. 2.9. Scatterplot matrix of the Recruitment series, R_t, on the vertical axis plotted against the SOI series, S_{t−h}, on the horizontal axis at lags h = 0, 1, . . ., 8. The values in the upper right corner are the sample cross-correlations and the lines are a lowess fit.

Figure 2.9 shows a fairly strong nonlinear relationship between Recruitment, R_t, and the SOI series at S_{t−5}, S_{t−6}, S_{t−7}, S_{t−8}, indicating the SOI series tends to lead the Recruitment series and the coefficients are negative, implying that increases in the SOI lead to decreases in the Recruitment. The nonlinearity observed in the scatterplots (with the help of the superimposed lowess fits) indicates that the behavior between Recruitment and the SOI is different for positive values of SOI than for negative values of SOI.

Simple scatterplot matrices for one series can be obtained in R using the lag.plot command. Figure 2.8 and Figure 2.9 may be reproduced using the following scripts provided with astsa:
lag1.plot(soi, 12)      # Figure 2.8
lag2.plot(soi, rec, 8)  # Figure 2.9

Example 2.9 Regression with Lagged Variables (cont)
In Example 2.3 we regressed Recruitment on lagged SOI,

R_t = β_0 + β_1 S_{t−6} + w_t.

Fig. 2.10. Display for Example 2.9: Plot of Recruitment (R_t) vs SOI lagged 6 months (S_{t−6}) with the fitted values of the regression as points (+) and a lowess fit (solid line).

However, in Example 2.8, we saw that the relationship is nonlinear and different when SOI is positive or negative. In this case, we may consider adding a dummy variable to account for this change. In particular, we fit the model
R_t = β_0 + β_1 S_{t−6} + β_2 D_{t−6} + β_3 D_{t−6} S_{t−6} + w_t,
where D_t is a dummy variable that is 0 if S_t < 0 and 1 otherwise. This means that
R_t = β_0 + β_1 S_{t−6} + w_t                        if S_{t−6} < 0,
R_t = (β_0 + β_2) + (β_1 + β_3) S_{t−6} + w_t        if S_{t−6} ≥ 0.
The result of the fit is given in the R code below. Figure 2.10 shows R_t vs S_{t−6} with the fitted values of the regression and a lowess fit superimposed. The piecewise regression fit is similar to the lowess fit, but we note that the residuals are not white noise (see the code below). This is followed up in Example 3.45.
dummy = ifelse(soi<0, 0, 1)
fish  = ts.intersect(rec, soiL6=lag(soi,-6), dL6=lag(dummy,-6), dframe=TRUE)
summary(fit <- lm(rec~ soiL6*dL6, data=fish, na.action=NULL))
Coefficients:
             Estimate Std.Error t.value
(Intercept)    74.479     2.865  25.998
soiL6         -15.358     7.401  -2.075
dL6            -1.139     3.711  -0.307
soiL6:dL6     -51.244     9.523  -5.381
---
Residual standard error: 21.84 on 443 degrees of freedom
Multiple R-squared: 0.4024
F-statistic: 99.43 on 3 and 443 DF
attach(fish)
plot(soiL6, rec)
lines(lowess(soiL6, rec), col=4, lwd=2)
points(soiL6, fitted(fit), pch='+', col=2)
plot(resid(fit))   # not shown ...
acf(resid(fit))    # ... but obviously not noise
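Spelled out, the two regression lines implied by the coefficient table above are as follows (this small addendum is ours, not part of the original example; it simply combines the estimates already shown):
b = coef(fit)                 # from the fit above: 74.48, -15.36, -1.14, -51.24
c(b[1], b[2])                 # SOI(t-6) <  0:  intercept 74.5, slope -15.4
c(b[1] + b[3], b[2] + b[4])   # SOI(t-6) >= 0:  intercept 73.3, slope -66.6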
As a final exploratory tool, we discuss assessing periodic behavior in time series data using regression analysis. In Example 1.12, we briefly discussed the problem of identifying cyclic or periodic signals in time series.

A number of the time series we have seen so far exhibit periodic behavior. For example, the data from the pollution study example shown in Figure 2.2 exhibit strong yearly cycles. The Johnson & Johnson data shown in Figure 1.1 make one cycle every year (four quarters) on top of an increasing trend and the speech data in Figure 1.3 are highly repetitive. The monthly SOI and Recruitment series in Figure 1.5 show strong yearly cycles, which obscure the slower El Niño cycle.

Example 2.10 Using Regression to Discover a Signal in Noise
In Example 1.12, we generated n = 500 observations from the model
x_t = A cos(2πωt + φ) + w_t,   (2.35)
where ω = 1/50, A = 2, φ = .6π, and σ_w = 5; the data are shown on the bottom panel of Figure 1.11. At this point we assume the frequency of oscillation ω = 1/50 is known, but A and φ are unknown parameters. In this case the parameters appear in (2.35) in a nonlinear way, so we use a trigonometric identity^{2.4} and write
cos(2πωt + φ) = β_1 cos(2πωt) + β_2 sin(2πωt),
where β_1 = A cos(φ) and β_2 = −A sin(φ). Now the model (2.35) can be written in the usual linear regression form given by (no intercept term is needed here)
x_t = β_1 cos(2πt/50) + β_2 sin(2πt/50) + w_t.   (2.36)
Using linear regression, we find β̂_1 = −.74_(.33) and β̂_2 = −1.99_(.33) with σ̂_w = 5.18; the values in parentheses are the standard errors. We note the actual values of the coefficients for this example are β_1 = 2 cos(.6π) = −.62 and β_2 = −2 sin(.6π) = −1.90. It is clear that we are able to detect the signal in the noise using regression, even though the signal-to-noise ratio is small. Figure 2.11 shows data generated by (2.35) with the fitted line superimposed.
To reproduce the analysis and Figure 2.11 in R, use the following:
set.seed(90210)   # so you can reproduce these results
x  = 2*cos(2*pi*1:500/50 + .6*pi) + rnorm(500,0,5)
z1 = cos(2*pi*1:500/50)
z2 = sin(2*pi*1:500/50)
summary(fit <- lm(x~0+z1+z2))   # zero to exclude the intercept
Coefficients:
    Estimate Std. Error t value
z1   -0.7442     0.3274  -2.273
z2   -1.9949     0.3274  -6.093
Residual standard error: 5.177 on 498 degrees of freedom
par(mfrow=c(2,1))
plot.ts(x)
plot.ts(x, col=8, ylab=expression(hat(x)))
lines(fitted(fit), col=2)
We will discuss this and related approaches in more detail in Chapter 4.

^{2.4} cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β).
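Continuing Example 2.10 (this addendum is ours): since β_1 = A cos(φ) and β_2 = −A sin(φ), the amplitude and phase can be recovered from the regression estimates.
b = coef(fit)                     # b[1] = -0.744, b[2] = -1.995 from the fit above
A.hat   = sqrt(b[1]^2 + b[2]^2)   # estimated amplitude, about 2.13 (true A = 2)
phi.hat = atan2(-b[2], b[1])      # estimated phase, about 1.93 radians or .61*pi (true phi = .6*pi)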

Fig. 2.11. Data generated by (2.35) [top] and the fitted line superimposed on the data [bottom].

2.3 Smoothing in the Time Series Context

In Section 1.2, we introduced the concept of filtering or smoothing a time series, and in Example 1.9, we discussed using a moving average to smooth white noise. This method is useful in discovering certain traits in a time series, such as long-term trend and seasonal components. In particular, if x_t represents the observations, then
m_t = Σ_{j=−k}^{k} a_j x_{t−j},   (2.37)
where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^{k} a_j = 1, is a symmetric moving average of the data.

Example 2.11 Moving Average Smoother
For example, Figure 2.12 shows the monthly SOI series discussed in Example 1.5 smoothed using (2.37) with weights a_0 = a_{±1} = ··· = a_{±5} = 1/12 and a_{±6} = 1/24; k = 6. This particular method removes (filters out) the obvious annual temperature cycle and helps emphasize the El Niño cycle. To reproduce Figure 2.12 in R:
wgts = c(.5, rep(1,11), .5)/12
soif = filter(soi, sides=2, filter=wgts)
plot(soi)
lines(soif, lwd=2, col=4)
par(fig = c(.65, 1, .65, 1), new = TRUE)   # the insert
nwgts = c(rep(0,20), wgts, rep(0,20))
plot(nwgts, type="l", ylim = c(-.02,.1), xaxt='n', yaxt='n', ann=FALSE)

Although the moving average smoother does a good job in highlighting the El Niño effect, it might be considered too choppy. We can obtain a smoother fit using the normal distribution for the weights, instead of boxcar-type weights of (2.37).
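As a quick sanity check (ours, not in the text), the weights used in Example 2.11 satisfy the requirements of (2.37): they are symmetric and sum to one.
wgts = c(.5, rep(1,11), .5)/12
sum(wgts)                # equals 1
all(wgts == rev(wgts))   # TRUE: a_j = a_{-j}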

Fig. 2.12. Moving average smoother of SOI. The insert shows the shape of the moving average ("boxcar") kernel [not drawn to scale] described in (2.39).

Fig. 2.13. Kernel smoother of SOI. The insert shows the shape of the normal kernel [not drawn to scale].

Example 2.12 Kernel Smoothing
Kernel smoothing is a moving average smoother that uses a weight function, or kernel, to average the observations. Figure 2.13 shows kernel smoothing of the SOI series, where m_t is now
m_t = Σ_{i=1}^{n} w_i(t) x_i,   (2.38)
where
w_i(t) = K((t − i)/b) / Σ_{j=1}^{n} K((t − j)/b)   (2.39)
are the weights and K(·) is a kernel function. This estimator, which was originally explored by Parzen (1962) and Rosenblatt (1956b), is often called the Nadaraya–Watson estimator (Watson, 1966). In this example, and typically, the normal kernel, K(z) = (1/√(2π)) exp(−z²/2), is used.
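Equations (2.38)–(2.39) can also be coded directly; the sketch below is our own, not from the text, and only roughly matches ksmooth's bandwidth convention. The implementation the text actually uses, ksmooth, appears in the next code block.
# direct Nadaraya-Watson smoother, equations (2.38)-(2.39), with a normal kernel
nw.smooth = function(x, b) {
  n = length(x)
  sapply(1:n, function(t) {
    k = dnorm((t - 1:n)/b)   # K((t - i)/b) for i = 1, ..., n
    sum(k/sum(k) * x)        # weighted average (2.38) with weights (2.39)
  })
}
# e.g., for monthly SOI, b = 12 smooths over roughly one year:
# plot(soi); lines(ts(nw.smooth(soi, 12), start=start(soi), frequency=12), col=4)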

Fig. 2.14. Locally weighted scatterplot smoothers (lowess) of the SOI series.

To implement this in R, use the ksmooth function where a bandwidth can be chosen. The wider the bandwidth, b, the smoother the result. From the R ksmooth help file: The kernels are scaled so that their quartiles (viewed as probability densities) are at ±0.25∗bandwidth. For the standard normal distribution, the quartiles are at ±.674. In our case, we are smoothing over time, which is of the form t/12 for the SOI time series. In Figure 2.13, we used the value of b = 1 to correspond to approximately smoothing a little over one year. Figure 2.13 can be reproduced in R as follows.
plot(soi)
lines(ksmooth(time(soi), soi, "normal", bandwidth=1), lwd=2, col=4)
par(fig = c(.65, 1, .65, 1), new = TRUE)   # the insert
gauss = function(x) { 1/sqrt(2*pi) * exp(-(x^2)/2) }
x = seq(from = -3, to = 3, by = 0.001)
plot(x, gauss(x), type ="l", ylim=c(-.02,.45), xaxt='n', yaxt='n', ann=FALSE)

Example 2.13 Lowess
Another approach to smoothing a time plot is nearest neighbor regression. The technique is based on k-nearest neighbors regression, wherein one uses only the data {x_{t−k/2}, . . ., x_t, . . ., x_{t+k/2}} to predict x_t via regression, and then sets m̂_t = x̂_t.
Lowess is a method of smoothing that is rather complex, but the basic idea is close to nearest neighbor regression. Figure 2.14 shows smoothing of SOI using the R function lowess (see Cleveland, 1979). First, a certain proportion of nearest neighbors to x_t are included in a weighting scheme; values closer to x_t in time get more weight. Then, a robust weighted regression is used to predict x_t and obtain the smoothed values m_t. The larger the fraction of nearest neighbors included, the smoother the fit will be. In Figure 2.14, one smoother uses 5% of the data to obtain an estimate of the El Niño cycle of the data. In addition, a (negative) trend in SOI would indicate the long-term warming of the Pacific Ocean. To investigate this, we used lowess with the default smoother span of f=2/3 of the data. Figure 2.14 can be reproduced in R as follows.
plot(soi)
lines(lowess(soi, f=.05), lwd=2, col=4)   # El Nino cycle

Fig. 2.15. Smoothing splines fit to the SOI series.

lines(lowess(soi), lty=2, lwd=2, col=2)   # trend (with default span)

Example 2.14 Smoothing Splines
An obvious way to smooth data would be to fit a polynomial regression in terms of time. For example, a cubic polynomial would have x_t = m_t + w_t where
m_t = β_0 + β_1 t + β_2 t² + β_3 t³.
We could then fit m_t via ordinary least squares.
An extension of polynomial regression is to first divide time t = 1, . . ., n, into k intervals, [t_0 = 1, t_1], [t_1 + 1, t_2], . . ., [t_{k−1} + 1, t_k = n]; the values t_0, t_1, . . ., t_k are called knots. Then, in each interval, one fits a polynomial regression, typically the order is 3, and this is called cubic splines.
A related method is smoothing splines, which minimizes a compromise between the fit and the degree of smoothness given by
Σ_{t=1}^{n} [x_t − m_t]² + λ ∫ (m_t″)² dt,   (2.40)
where m_t is a cubic spline with a knot at each t and primes denote differentiation. The degree of smoothness is controlled by λ > 0.
Think of taking a long drive where m_t is the position of your car at time t. In this case, m_t″ is instantaneous acceleration/deceleration, and ∫ (m_t″)² dt is a measure of the total amount of acceleration and deceleration on your trip. A smooth drive would be one where a constant velocity is maintained (i.e., m_t″ = 0). A choppy ride would be when the driver is constantly accelerating and decelerating, such as beginning drivers tend to do.
If λ = 0, we don't care how choppy the ride is, and this leads to m_t = x_t, the data, which are not smooth. If λ = ∞, we insist on no acceleration or deceleration (m_t″ = 0); in this case, our drive must be at constant velocity, m_t = c + vt, and consequently very smooth.

Fig. 2.16. Smooth of mortality as a function of temperature using lowess.

Thus, λ is seen as a trade-off between linear regression (completely smooth) and the data itself (no smoothness). The larger the value of λ, the smoother the fit.
In R, the smoothing parameter is called spar and it is monotonically related to λ; type ?smooth.spline to view the help file for details. Figure 2.15 shows smoothing spline fits on the SOI series using spar=.5 to emphasize the El Niño cycle, and spar=1 to emphasize the trend. The figure can be reproduced in R as follows.
plot(soi)
lines(smooth.spline(time(soi), soi, spar=.5), lwd=2, col=4)
lines(smooth.spline(time(soi), soi, spar=1), lty=2, lwd=2, col=2)

Example 2.15 Smoothing One Series as a Function of Another
In addition to smoothing time plots, smoothing techniques can be applied to smoothing a time series as a function of another time series. We have already seen this idea used in Example 2.8 when we used lowess to visualize the nonlinear relationship between Recruitment and SOI at various lags. In this example, we smooth the scatterplot of two contemporaneously measured time series, mortality as a function of temperature. In Example 2.2, we discovered a nonlinear relationship between mortality and temperature. Continuing along these lines, Figure 2.16 shows a scatterplot of mortality, M_t, and temperature, T_t, along with M_t smoothed as a function of T_t using lowess. Note that mortality increases at extreme temperatures, but in an asymmetric way; mortality is higher at colder temperatures than at hotter temperatures. The minimum mortality rate seems to occur at approximately 83°F. Figure 2.16 can be reproduced in R as follows using the defaults.
plot(tempr, cmort, xlab="Temperature", ylab="Mortality")
lines(lowess(tempr, cmort))
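As a small follow-up (ours, not in the text), the temperature at which the smoothed mortality curve bottoms out can be read directly off the lowess output:
sm = lowess(tempr, cmort)   # the same smoother used in the plot above
sm$x[which.min(sm$y)]       # roughly 83 degrees F, matching the figure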

Problems

Section 2.1

2.1 A Structural Model For the Johnson & Johnson data, say y_t, shown in Figure 1.1, let x_t = log(y_t). In this problem, we are going to fit a special type of structural model, x_t = T_t + S_t + N_t, where T_t is a trend component, S_t is a seasonal component, and N_t is noise. In our case, time t is in quarters (1960.00, 1960.25, . . .) so one unit of time is a year.
(a) Fit the regression model
x_t = βt [trend] + α_1 Q_1(t) + α_2 Q_2(t) + α_3 Q_3(t) + α_4 Q_4(t) [seasonal] + w_t [noise],
where Q_i(t) = 1 if time t corresponds to quarter i = 1, 2, 3, 4, and zero otherwise. The Q_i(t)'s are called indicator variables. We will assume for now that w_t is a Gaussian white noise sequence. Hint: Detailed code is given in Code R.4, the last example of Section R.4.
(b) If the model is correct, what is the estimated average annual increase in the logged earnings per share?
(c) If the model is correct, does the average logged earnings rate increase or decrease from the third quarter to the fourth quarter? And, by what percentage does it increase or decrease?
(d) What happens if you include an intercept term in the model in (a)? Explain why there was a problem.
(e) Graph the data, x_t, and superimpose the fitted values, say x̂_t, on the graph. Examine the residuals, x_t − x̂_t, and state your conclusions. Does it appear that the model fits the data well (do the residuals look white)?

2.2 For the mortality data examined in Example 2.2:
(a) Add another component to the regression in (2.21) that accounts for the particulate count four weeks prior; that is, add P_{t−4} to the regression in (2.21). State your conclusion.
(b) Draw a scatterplot matrix of M_t, T_t, P_t and P_{t−4} and then calculate the pairwise correlations between the series. Compare the relationship between M_t and P_t versus M_t and P_{t−4}.

2.3 In this problem, we explore the difference between a random walk and a trend stationary process.
(a) Generate four series that are random walk with drift, (1.4), of length n = 100 with δ = .01 and σ_w = 1. Call the data x_t for t = 1, . . ., 100. Fit the regression x_t = βt + w_t using least squares. Plot the data, the true mean function (i.e., μ_t = .01t) and the fitted line, x̂_t = β̂t, on the same graph. Hint: The following R code may be useful.

par(mfrow=c(2,2), mar=c(2.5,2.5,0,0)+.5, mgp=c(1.6,.6,0))   # set up
for (i in 1:4){
  x = ts(cumsum(rnorm(100,.01,1)))          # data
  regx = lm(x~0+time(x), na.action=NULL)    # regression
  plot(x, ylab='Random Walk w Drift')       # plots
  abline(a=0, b=.01, col=2, lty=2)          # true mean (red - dashed)
  abline(regx, col=4)                       # fitted line (blue - solid)
}
(b) Generate four series of length n = 100 that are linear trend plus noise, say y_t = .01t + w_t, where t and w_t are as in part (a). Fit the regression y_t = βt + w_t using least squares. Plot the data, the true mean function (i.e., μ_t = .01t) and the fitted line, ŷ_t = β̂t, on the same graph.
(c) Comment (what did you learn from this assignment).

2.4 Kullback-Leibler Information Given the random n × 1 vector y, we define the information for discriminating between two densities in the same family, indexed by a parameter θ, say f(y; θ_1) and f(y; θ_2), as
I(θ_1; θ_2) = n^{−1} E_1 log[ f(y; θ_1) / f(y; θ_2) ],   (2.41)
where E_1 denotes expectation with respect to the density determined by θ_1. For the Gaussian regression model, the parameters are θ = (β′, σ²)′. Show that
I(θ_1; θ_2) = (1/2) [ σ_1²/σ_2² − log(σ_1²/σ_2²) − 1 ] + (1/2) (β_1 − β_2)′ Z′Z (β_1 − β_2) / (n σ_2²).   (2.42)

2.5 Model Selection Both selection criteria (2.15) and (2.16) are derived from information theoretic arguments, based on the well-known Kullback-Leibler discrimination information numbers (see Kullback and Leibler, 1951, Kullback, 1958). We give an argument due to Hurvich and Tsai (1989). We think of the measure (2.42) as measuring the discrepancy between the two densities, characterized by the parameter values θ_1 = (β_1′, σ_1²)′ and θ_2 = (β_2′, σ_2²)′. Now, if the true value of the parameter vector is θ_1, we argue that the best model would be one that minimizes the discrepancy between the theoretical value and the sample, say I(θ_1; θ̂). Because θ_1 will not be known, Hurvich and Tsai (1989) considered finding an unbiased estimator for E_1[I(β_1, σ_1²; β̂, σ̂²)], where
I(β_1, σ_1²; β̂, σ̂²) = (1/2) [ σ_1²/σ̂² − log(σ_1²/σ̂²) − 1 ] + (1/2) (β_1 − β̂)′ Z′Z (β_1 − β̂) / (n σ̂²)
and β is a k × 1 regression vector. Show that
E_1[I(β_1, σ_1²; β̂, σ̂²)] = (1/2) { − log σ_1² + E_1 log σ̂² + (n + k)/(n − k − 2) − 1 },   (2.43)
using the distributional properties of the regression coefficients and error variance. An unbiased estimator for E_1 log σ̂² is log σ̂². Hence, we have shown that the expectation

of the above discrimination information is as claimed. As models with differing dimensions k are considered, only the second and third terms in (2.43) will vary and we only need unbiased estimators for those two terms. This gives the form of AICc quoted in (2.16) in the chapter. You will need the two distributional results
n σ̂² / σ_1² ∼ χ²_{n−k}   and   (β̂ − β_1)′ Z′Z (β̂ − β_1) / σ_1² ∼ χ²_k.
The two quantities are distributed independently as chi-squared distributions with the indicated degrees of freedom. If x ∼ χ²_n, then E(1/x) = 1/(n − 2).

Section 2.2

2.6 Consider a process consisting of a linear trend with an additive noise term consisting of independent random variables w_t with zero means and variances σ_w², that is,
x_t = β_0 + β_1 t + w_t,
where β_0, β_1 are fixed constants.
(a) Prove x_t is nonstationary.
(b) Prove that the first difference series ∇x_t = x_t − x_{t−1} is stationary by finding its mean and autocovariance function.
(c) Repeat part (b) if w_t is replaced by a general stationary process, say y_t, with mean function μ_y and autocovariance function γ_y(h).

2.7 Show (2.27) is stationary.

2.8 The glacial varve record plotted in Figure 2.7 exhibits some nonstationarity that can be improved by transforming to logarithms and some additional nonstationarity that can be corrected by differencing the logarithms.
(a) Argue that the glacial varves series, say x_t, exhibits heteroscedasticity by computing the sample variance over the first half and the second half of the data. Argue that the transformation y_t = log x_t stabilizes the variance over the series. Plot the histograms of x_t and y_t to see whether the approximation to normality is improved by transforming the data.
(b) Plot the series y_t. Do any time intervals, of the order 100 years, exist where one can observe behavior comparable to that observed in the global temperature records in Figure 1.2?
(c) Examine the sample ACF of y_t and comment.
(d) Compute the difference u_t = y_t − y_{t−1}, examine its time plot and sample ACF, and argue that differencing the logged varve data produces a reasonably stationary series. Can you think of a practical interpretation for u_t? Hint: Recall Footnote 1.2.

(e) Based on the sample ACF of the differenced transformed series computed in (c), argue that a generalization of the model given by Example 1.26 might be reasonable. Assume
u_t = μ + w_t + θ w_{t−1}
is stationary when the inputs w_t are assumed independent with mean 0 and variance σ_w². Show that
γ_u(h) = σ_w²(1 + θ²) if h = 0,  θ σ_w² if h = ±1,  and  0 if |h| > 1.
(f) Based on part (e), use ρ̂_u(1) and the estimate of the variance of u_t, γ̂_u(0), to derive estimates of θ and σ_w². This is an application of the method of moments from classical statistics, where estimators of the parameters are derived by equating sample moments to theoretical moments.

2.9 In this problem, we will explore the periodic nature of S_t, the SOI series displayed in Figure 1.5.
(a) Detrend the series by fitting a regression of S_t on time t. Is there a significant trend in the sea surface temperature? Comment.
(b) Calculate the periodogram for the detrended series obtained in part (a). Identify the frequencies of the two main peaks (with an obvious one at the frequency of one cycle every 12 months). What is the probable El Niño cycle indicated by the minor peak?

Section 2.3

2.10 Consider the two weekly time series oil and gas. The oil series is in dollars per barrel, while the gas series is in cents per gallon.
(a) Plot the data on the same graph. Which of the simulated series displayed in Section 1.2 do these series most resemble? Do you believe the series are stationary (explain your answer)?
(b) In economics, it is often the percentage change in price (termed growth rate or return), rather than the absolute price change, that is important. Argue that a transformation of the form y_t = ∇ log x_t might be applied to the data, where x_t is the oil or gas price series. Hint: Recall Footnote 1.2.
(c) Transform the data as described in part (b), plot the data on the same graph, look at the sample ACFs of the transformed data, and comment.
(d) Plot the CCF of the transformed data and comment. The small, but significant, values when oil leads gas might be considered as feedback.
(e) Exhibit scatterplots of the oil and gas growth rate series for up to three weeks of lead time of oil prices; include a nonparametric smoother in each plot and comment on the results (e.g., Are there outliers? Are the relationships linear?).

(f) There have been a number of studies questioning whether gasoline prices respond more quickly when oil prices are rising than when oil prices are falling ("asymmetry"). We will attempt to explore this question here with simple lagged regression; we will ignore some obvious problems such as outliers and autocorrelated errors, so this will not be a definitive analysis. Let G_t and O_t denote the gas and oil growth rates.
(i) Fit the regression (and comment on the results)
G_t = α_1 + α_2 I_t + β_1 O_t + β_2 O_{t−1} + w_t,
where I_t = 1 if O_t ≥ 0 and 0 otherwise (I_t is the indicator of no growth or positive growth in oil price). Hint:
poil = diff(log(oil))
pgas = diff(log(gas))
indi = ifelse(poil < 0, 0, 1)
mess = ts.intersect(pgas, poil, poilL = lag(poil,-1), indi)
summary(fit <- lm(pgas~ poil + poilL + indi, data=mess))
(ii) What is the fitted model when there is negative growth in oil price at time t? What is the fitted model when there is no or positive growth in oil price? Do these results support the asymmetry hypothesis?
(iii) Analyze the residuals from the fit and comment.

2.11 Use two different smoothing techniques described in Section 2.3 to estimate the trend in the global temperature series globtemp. Comment.

Chapter 3
ARIMA Models

Classical regression is often insufficient for explaining all of the interesting dynamics of a time series. For example, the ACF of the residuals of the simple linear regression fit to the price of chicken data (see Example 2.4) reveals additional structure in the data that regression did not capture. Instead, the introduction of correlation that may be generated through lagged linear relations leads to proposing the autoregressive (AR) and autoregressive moving average (ARMA) models that were presented in Whittle (1951). Adding nonstationary models to the mix leads to the autoregressive integrated moving average (ARIMA) model popularized in the landmark work by Box and Jenkins (1970). The Box–Jenkins method for identifying ARIMA models is given in this chapter along with techniques for forecasting and parameter estimation for these models. A partial theoretical justification of the use of ARMA models is discussed in Section B.4.

3.1 Autoregressive Moving Average Models

The classical regression model of Chapter 2 was developed for the static case, namely, we only allow the dependent variable to be influenced by current values of the independent variables. In the time series case, it is desirable to allow the dependent variable to be influenced by the past values of the independent variables and possibly by its own past values. If the present can be plausibly modeled in terms of only the past values of the independent inputs, we have the enticing prospect that forecasting will be possible.

Introduction to Autoregressive Models

Autoregressive models are based on the idea that the current value of the series, x_t, can be explained as a function of p past values, x_{t−1}, x_{t−2}, . . ., x_{t−p}, where p determines the number of steps into the past needed to forecast the current value. As a typical case, recall Example 1.10 in which data were generated using the model
x_t = x_{t−1} − .90 x_{t−2} + w_t,

where w_t is white Gaussian noise with σ_w² = 1. We have now assumed the current value is a particular linear function of past values. The regularity that persists in Figure 1.9 gives an indication that forecasting for such a model might be a distinct possibility, say, through some version such as
x_{n+1}^{n} = x_n − .90 x_{n−1},
where the quantity on the left-hand side denotes the forecast at the next period n + 1 based on the observed data, x_1, x_2, . . ., x_n. We will make this notion more precise in our discussion of forecasting (Section 3.4).
The extent to which it might be possible to forecast a real data series from its own past values can be assessed by looking at the autocorrelation function and the lagged scatterplot matrices discussed in Chapter 2. For example, the lagged scatterplot matrix for the Southern Oscillation Index (SOI), shown in Figure 2.8, gives a distinct indication that lags 1 and 2, for example, are linearly associated with the current value. The ACF shown in Figure 1.16 shows relatively large positive values at lags 1, 2, 12, 24, and 36 and large negative values at 18, 30, and 42. We note also the possible relation between the SOI and Recruitment series indicated in the scatterplot matrix shown in Figure 2.9. We will indicate in later sections on transfer function and vector AR modeling how to handle the dependence on values taken by other series.
The preceding discussion motivates the following definition.

Definition 3.1 An autoregressive model of order p, abbreviated AR(p), is of the form
x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ··· + φ_p x_{t−p} + w_t,   (3.1)
where x_t is stationary, w_t ∼ wn(0, σ_w²), and φ_1, φ_2, . . ., φ_p are constants (φ_p ≠ 0). The mean of x_t in (3.1) is zero. If the mean, μ, of x_t is not zero, replace x_t by x_t − μ in (3.1),
x_t − μ = φ_1 (x_{t−1} − μ) + φ_2 (x_{t−2} − μ) + ··· + φ_p (x_{t−p} − μ) + w_t,
or write
x_t = α + φ_1 x_{t−1} + φ_2 x_{t−2} + ··· + φ_p x_{t−p} + w_t,   (3.2)
where α = μ(1 − φ_1 − ··· − φ_p).

We note that (3.2) is similar to the regression model of Section 2.1, and hence the term auto (or self) regression. Some technical difficulties, however, develop from applying that model because the regressors, x_{t−1}, . . ., x_{t−p}, are random components, whereas z_t was assumed to be fixed. A useful form follows by using the backshift operator (2.29) to write the AR(p) model, (3.1), as
(1 − φ_1 B − φ_2 B² − ··· − φ_p B^p) x_t = w_t,   (3.3)
or even more concisely as
φ(B) x_t = w_t.   (3.4)
The properties of φ(B) are important in solving (3.4) for x_t. This leads to the following definition.

Definition 3.2 The autoregressive operator is defined to be
φ(B) = 1 − φ_1 B − φ_2 B² − ··· − φ_p B^p.   (3.5)

Example 3.1 The AR(1) Model
We initiate the investigation of AR models by considering the first-order model, AR(1), given by x_t = φ x_{t−1} + w_t. Iterating backwards k times, we get
x_t = φ x_{t−1} + w_t = φ(φ x_{t−2} + w_{t−1}) + w_t
    = φ² x_{t−2} + φ w_{t−1} + w_t
    ⋮
    = φ^k x_{t−k} + Σ_{j=0}^{k−1} φ^j w_{t−j}.
This method suggests that, by continuing to iterate backward, and provided that |φ| < 1 and sup_t var(x_t) < ∞, we can represent an AR(1) model as a linear process given by^{3.1}
x_t = Σ_{j=0}^{∞} φ^j w_{t−j}.   (3.6)
Representation (3.6) is called the stationary solution of the model. In fact, by simple substitution,
Σ_{j=0}^{∞} φ^j w_{t−j} = φ ( Σ_{k=0}^{∞} φ^k w_{t−1−k} ) + w_t,
where the left-hand side is x_t and the quantity in parentheses is x_{t−1}. The AR(1) process defined by (3.6) is stationary with mean
E(x_t) = Σ_{j=0}^{∞} φ^j E(w_{t−j}) = 0,
and autocovariance function,
γ(h) = cov(x_{t+h}, x_t) = E[ ( Σ_{j=0}^{∞} φ^j w_{t+h−j} ) ( Σ_{k=0}^{∞} φ^k w_{t−k} ) ]
     = E[ (w_{t+h} + ··· + φ^h w_t + φ^{h+1} w_{t−1} + ···)(w_t + φ w_{t−1} + ···) ]
     = σ_w² Σ_{j=0}^{∞} φ^{h+j} φ^j = σ_w² φ^h Σ_{j=0}^{∞} φ^{2j} = σ_w² φ^h / (1 − φ²),   h ≥ 0.   (3.7)

^{3.1} Note that lim_{k→∞} E( x_t − Σ_{j=0}^{k−1} φ^j w_{t−j} )² = lim_{k→∞} φ^{2k} E(x_{t−k}²) = 0, so (3.6) exists in the mean square sense (see Appendix A for a definition).
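A quick numerical check of (3.7) (this sketch is ours, not part of the example): dividing γ(h) by γ(0) = σ_w²/(1 − φ²) leaves φ^h, which R's ARMAacf reproduces for any particular value, say φ = .9.
phi = .9
ARMAacf(ar = phi, lag.max = 5)   # 1, .9, .81, .729, .6561, .59049
phi^(0:5)                        # the same values: gamma(h)/gamma(0) = phi^h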

Recall that γ(h) = γ(−h), so we will only exhibit the autocovariance function for h ≥ 0. From (3.7), the ACF of an AR(1) is
ρ(h) = γ(h)/γ(0) = φ^h,   h ≥ 0,   (3.8)
and ρ(h) satisfies the recursion
ρ(h) = φ ρ(h − 1),   h = 1, 2, . . . .   (3.9)
We will discuss the ACF of a general AR(p) model in Section 3.3.

Example 3.2 The Sample Path of an AR(1) Process
Figure 3.1 shows a time plot of two AR(1) processes, one with φ = .9 and one with φ = −.9; in both cases, σ_w² = 1. In the first case, ρ(h) = .9^h, for h ≥ 0, so observations close together in time are positively correlated with each other. This result means that observations at contiguous time points will tend to be close in value to each other; this fact shows up in the top of Figure 3.1 as a very smooth sample path for x_t. Now, contrast this with the case in which φ = −.9, so that ρ(h) = (−.9)^h, for h ≥ 0. This result means that observations at contiguous time points are negatively correlated but observations two time points apart are positively correlated. This fact shows up in the bottom of Figure 3.1, where, for example, if an observation, x_t, is positive, the next observation, x_{t+1}, is typically negative, and the next observation, x_{t+2}, is typically positive. Thus, in this case, the sample path is very choppy. The following R code can be used to obtain a figure similar to Figure 3.1:
par(mfrow=c(2,1))
plot(arima.sim(list(order=c(1,0,0), ar=.9), n=100), ylab="x",
     main=(expression(AR(1)~~~phi==+.9)))
plot(arima.sim(list(order=c(1,0,0), ar=-.9), n=100), ylab="x",
     main=(expression(AR(1)~~~phi==-.9)))

Example 3.3 Explosive AR Models and Causality
In Example 1.18, it was discovered that the random walk x_t = x_{t−1} + w_t is not stationary. We might wonder whether there is a stationary AR(1) process with |φ| > 1. Such processes are called explosive because the values of the time series quickly become large in magnitude. Clearly, because |φ|^j increases without bound as j → ∞, Σ_{j=0}^{k−1} φ^j w_{t−j} will not converge (in mean square) as k → ∞, so the intuition used to get (3.6) will not work directly. We can, however, modify that argument to obtain a stationary model as follows. Write x_{t+1} = φ x_t + w_{t+1}, in which case,
x_t = φ^{−1} x_{t+1} − φ^{−1} w_{t+1} = φ^{−1} ( φ^{−1} x_{t+2} − φ^{−1} w_{t+2} ) − φ^{−1} w_{t+1}
    ⋮
    = φ^{−k} x_{t+k} − Σ_{j=1}^{k−1} φ^{−j} w_{t+j},   (3.10)

Fig. 3.1. Simulated AR(1) models: φ = .9 (top); φ = −.9 (bottom).

by iterating forward k steps. Because |φ|^{−1} < 1, this result suggests the stationary future dependent AR(1) model
x_t = − Σ_{j=1}^{∞} φ^{−j} w_{t+j}.   (3.11)
The reader can verify that this is stationary and of the AR(1) form x_t = φ x_{t−1} + w_t. Unfortunately, this model is useless because it requires us to know the future to be able to predict the future. When a process does not depend on the future, such as the AR(1) when |φ| < 1, we will say the process is causal. In the explosive case of this example, the process is stationary, but it is also future dependent, and not causal.

Example 3.4 Every Explosion Has a Cause
Excluding explosive models from consideration is not a problem because the models have causal counterparts. For example, if
x_t = φ x_{t−1} + w_t  with  |φ| > 1
and w_t ∼ iid N(0, σ_w²), then using (3.11), {x_t} is a non-causal stationary Gaussian process with E(x_t) = 0 and
γ_x(h) = cov(x_{t+h}, x_t) = cov( − Σ_{j=1}^{∞} φ^{−j} w_{t+h+j}, − Σ_{k=1}^{∞} φ^{−k} w_{t+k} )
       = σ_w² φ^{−2} φ^{−h} / (1 − φ^{−2}).

Thus, using (3.7), the causal process defined by
y_t = φ^{−1} y_{t−1} + v_t,
where v_t ∼ iid N(0, σ_w² φ^{−2}), is stochastically equal to the x_t process (i.e., all finite distributions of the processes are the same). For example, if x_t = 2 x_{t−1} + w_t with σ_w² = 1, then y_t = (1/2) y_{t−1} + v_t with σ_v² = 1/4 is an equivalent causal process (see Problem 3.3). This concept generalizes to higher orders, but it is easier to show using Chapter 4 techniques; see Example 4.8.

The technique of iterating backward to get an idea of the stationary solution of AR models works well when p = 1, but not for larger orders. A general technique is that of matching coefficients. Consider the AR(1) model in operator form
φ(B) x_t = w_t,   (3.12)
where φ(B) = 1 − φB, and |φ| < 1. Also, write the model in equation (3.6) using operator form as
x_t = Σ_{j=0}^{∞} ψ_j w_{t−j} = ψ(B) w_t,   (3.13)
where ψ(B) = Σ_{j=0}^{∞} ψ_j B^j and ψ_j = φ^j. Suppose we did not know that ψ_j = φ^j. We could substitute ψ(B)w_t from (3.13) for x_t in (3.12) to obtain
φ(B) ψ(B) w_t = w_t.   (3.14)
The coefficients of B on the left-hand side of (3.14) must be equal to those on the right-hand side of (3.14), which means
(1 − φB)(1 + ψ_1 B + ψ_2 B² + ··· + ψ_j B^j + ···) = 1.   (3.15)
Reorganizing the coefficients in (3.15),
1 + (ψ_1 − φ)B + (ψ_2 − ψ_1 φ)B² + ··· + (ψ_j − ψ_{j−1} φ)B^j + ··· = 1,
we see that for each j = 1, 2, . . ., the coefficient of B^j on the left must be zero because it is zero on the right. The coefficient of B on the left is (ψ_1 − φ), and equating this to zero, ψ_1 − φ = 0, leads to ψ_1 = φ. Continuing, the coefficient of B² is (ψ_2 − ψ_1 φ), so ψ_2 = φ². In general,
ψ_j = ψ_{j−1} φ,
with ψ_0 = 1, which leads to the solution ψ_j = φ^j.
Another way to think about the operations we just performed is to consider the AR(1) model in operator form, φ(B) x_t = w_t. Now multiply both sides by φ^{−1}(B) (assuming the inverse operator exists) to get
φ^{−1}(B) φ(B) x_t = φ^{−1}(B) w_t,
or

x_t = φ^{−1}(B) w_t.
We know already that
φ^{−1}(B) = 1 + φB + φ²B² + ··· + φ^j B^j + ···,
that is, φ^{−1}(B) is ψ(B) in (3.13). Thus, we notice that working with operators is like working with polynomials. That is, consider the polynomial φ(z) = 1 − φz, where z is a complex number and |φ| < 1. Then,
φ^{−1}(z) = 1/(1 − φz) = 1 + φz + φ²z² + ··· + φ^j z^j + ···,   |z| ≤ 1,
and the coefficients of B^j in φ^{−1}(B) are the same as the coefficients of z^j in φ^{−1}(z). In other words, we may treat the backshift operator, B, as a complex number, z. These results will be generalized in our discussion of ARMA models. We will find the polynomials corresponding to the operators useful in exploring the general properties of ARMA models.

Introduction to Moving Average Models

As an alternative to the autoregressive representation in which the x_t on the left-hand side of the equation are assumed to be combined linearly, the moving average model of order q, abbreviated as MA(q), assumes the white noise w_t on the right-hand side of the defining equation are combined linearly to form the observed data.

Definition 3.3 The moving average model of order q, or MA(q) model, is defined to be
x_t = w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + ··· + θ_q w_{t−q},   (3.16)
where w_t ∼ wn(0, σ_w²), and θ_1, θ_2, . . ., θ_q (θ_q ≠ 0) are parameters.^{3.2}

The system is the same as the infinite moving average defined as the linear process (3.13), where ψ_0 = 1, ψ_j = θ_j for j = 1, . . ., q, and ψ_j = 0 for other values. We may also write the MA(q) process in the equivalent form
x_t = θ(B) w_t,   (3.17)
using the following definition.

Definition 3.4 The moving average operator is
θ(B) = 1 + θ_1 B + θ_2 B² + ··· + θ_q B^q.   (3.18)

Unlike the autoregressive process, the moving average process is stationary for any values of the parameters θ_1, . . ., θ_q; details of this result are provided in Section 3.3.

^{3.2} Some texts and software packages write the MA model with negative coefficients; that is, x_t = w_t − θ_1 w_{t−1} − θ_2 w_{t−2} − ··· − θ_q w_{t−q}.

Fig. 3.2. Simulated MA(1) models: θ = .9 (top); θ = −.9 (bottom).

Example 3.5 The MA(1) Process
Consider the MA(1) model x_t = w_t + θ w_{t−1}. Then, E(x_t) = 0,
γ(h) = (1 + θ²)σ_w² if h = 0,  θσ_w² if h = 1,  and  0 if h > 1,
and the ACF is
ρ(h) = θ/(1 + θ²) if h = 1,  and  0 if h > 1.
Note |ρ(1)| ≤ 1/2 for all values of θ (Problem 3.1). Also, x_t is correlated with x_{t−1}, but not with x_{t−2}, x_{t−3}, . . . . Contrast this with the case of the AR(1) model in which the correlation between x_t and x_{t−k}, for example, is never zero. When θ = .9, x_t and x_{t−1} are positively correlated, and ρ(1) = .497. When θ = −.9, x_t and x_{t−1} are negatively correlated, ρ(1) = −.497. Figure 3.2 shows a time plot of these two processes with σ_w² = 1. The series for which θ = .9 is smoother than the series for which θ = −.9. A figure similar to Figure 3.2 can be created in R as follows:
par(mfrow = c(2,1))
plot(arima.sim(list(order=c(0,0,1), ma=.9), n=100), ylab="x",
     main=(expression(MA(1)~~~theta==+.9)))
plot(arima.sim(list(order=c(0,0,1), ma=-.9), n=100), ylab="x",
     main=(expression(MA(1)~~~theta==-.9)))
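The MA(1) autocorrelations in this example can be verified numerically (a check of ours, not in the text): ARMAacf returns θ/(1 + θ²) at lag one and zeros at higher lags, for either sign of θ.
ARMAacf(ma =  .9, lag.max = 4)   # 1, .497, 0, 0, 0
ARMAacf(ma = -.9, lag.max = 4)   # 1, -.497, 0, 0, 0
.9/(1 + .9^2)                    # .4972, the lag-one value quoted in the example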

Example 3.6 Non-uniqueness of MA Models and Invertibility
Using Example 3.5, we note that for an MA(1) model, ρ(h) is the same for θ and 1/θ; try 5 and 1/5, for example. In addition, the pair σ_w² = 1 and θ = 5 yield the same autocovariance function as the pair σ_w² = 25 and θ = 1/5, namely,
γ(h) = 26 if h = 0,  5 if h = 1,  and  0 if h > 1.
Thus, the MA(1) processes
x_t = w_t + (1/5) w_{t−1},   w_t ∼ iid N(0, 25),
and
y_t = v_t + 5 v_{t−1},   v_t ∼ iid N(0, 1),
are the same because of normality (i.e., all finite distributions are the same). We can only observe the time series, x_t or y_t, and not the noise, w_t or v_t, so we cannot distinguish between the models. Hence, we will have to choose only one of them. For convenience, by mimicking the criterion of causality for AR models, we will choose the model with an infinite AR representation. Such a process is called an invertible process.
To discover which model is the invertible model, we can reverse the roles of x_t and w_t (because we are mimicking the AR case) and write the MA(1) model as w_t = −θ w_{t−1} + x_t. Following the steps that led to (3.6), if |θ| < 1, then w_t = Σ_{j=0}^{∞} (−θ)^j x_{t−j}, which is the desired infinite AR representation of the model. Hence, given a choice, we will choose the model with σ_w² = 25 and θ = 1/5 because it is invertible.
As in the AR case, the polynomial, θ(z), corresponding to the moving average operators, θ(B), will be useful in exploring general properties of MA processes. For example, following the steps of equations (3.12)–(3.15), we can write the MA(1) model as x_t = θ(B) w_t, where θ(B) = 1 + θB. If |θ| < 1, then we can write the model as π(B) x_t = w_t, where π(B) = θ^{−1}(B). Let θ(z) = 1 + θz, for |z| ≤ 1, then π(z) = θ^{−1}(z) = 1/(1 + θz) = Σ_{j=0}^{∞} (−θ)^j z^j, and we determine that π(B) = Σ_{j=0}^{∞} (−θ)^j B^j.

Autoregressive Moving Average Models

We now proceed with the general development of autoregressive, moving average, and mixed autoregressive moving average (ARMA), models for stationary time series.

Definition 3.5 A time series {x_t; t = 0, ±1, ±2, . . .} is ARMA(p, q) if it is stationary and
x_t = φ_1 x_{t−1} + ··· + φ_p x_{t−p} + w_t + θ_1 w_{t−1} + ··· + θ_q w_{t−q},   (3.19)
with φ_p ≠ 0, θ_q ≠ 0, and σ_w² > 0. The parameters p and q are called the autoregressive and the moving average orders, respectively. If x_t has a nonzero mean μ, we set α = μ(1 − φ_1 − ··· − φ_p) and write the model as

x_t = α + φ_1 x_{t−1} + ··· + φ_p x_{t−p} + w_t + θ_1 w_{t−1} + ··· + θ_q w_{t−q},   (3.20)
where w_t ∼ wn(0, σ_w²).

As previously noted, when q = 0, the model is called an autoregressive model of order p, AR(p), and when p = 0, the model is called a moving average model of order q, MA(q). To aid in the investigation of ARMA models, it will be useful to write them using the AR operator, (3.5), and the MA operator, (3.18). In particular, the ARMA(p, q) model in (3.19) can then be written in concise form as
φ(B) x_t = θ(B) w_t.   (3.21)
The concise form of the model points to a potential problem in that we can unnecessarily complicate the model by multiplying both sides by another operator, say
η(B) φ(B) x_t = η(B) θ(B) w_t,
without changing the dynamics. Consider the following example.

Example 3.7 Parameter Redundancy
Consider a white noise process x_t = w_t. If we multiply both sides of the equation by η(B) = 1 − .5B, then the model becomes (1 − .5B)x_t = (1 − .5B)w_t, or
x_t = .5 x_{t−1} − .5 w_{t−1} + w_t,   (3.22)
which looks like an ARMA(1, 1) model. Of course, x_t is still white noise; nothing has changed in this regard [i.e., x_t = w_t is the solution to (3.22)], but we have hidden the fact that x_t is white noise because of the parameter redundancy or over-parameterization.
The consideration of parameter redundancy will be crucial when we discuss estimation for general ARMA models. As this example points out, we might fit an ARMA(1, 1) model to white noise data and find that the parameter estimates are significant. If we were unaware of parameter redundancy, we might claim the data are correlated when in fact they are not (Problem 3.20). Although we have not yet discussed estimation, we present the following demonstration of the problem. We generated 150 iid normals and then fit an ARMA(1, 1) to the data. Note that φ̂ = −.96 and θ̂ = .95, and both are significant. Below is the R code (note that the estimate called 'intercept' is really the estimate of the mean).
set.seed(8675309)          # Jenny, I got your number
x = rnorm(150, mean=5)     # generate iid N(5,1)s
arima(x, order=c(1,0,1))   # estimation
Coefficients:
          ar1     ma1  intercept   <= misnomer
      -0.9595  0.9527     5.0462
s.e.   0.1688  0.1750     0.0727
Thus, forgetting the mean estimate, the fitted model looks like
(1 + .96B) x_t = (1 + .95B) w_t,
which we should recognize as an over-parametrized model.
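The non-uniqueness in Example 3.6 is just as easy to see numerically (a sketch of ours): the MA(1) models with θ = 5 and θ = 1/5 have identical autocorrelation functions, since θ/(1 + θ²) is unchanged when θ is replaced by 1/θ.
ARMAacf(ma = 5,   lag.max = 2)   # 1, .1923, 0
ARMAacf(ma = 1/5, lag.max = 2)   # 1, .1923, 0 -- the same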

Example 3.3, Example 3.6, and Example 3.7 point to a number of problems with the general definition of ARMA(p, q) models, as given by (3.19), or, equivalently, by (3.21). To summarize, we have seen the following problems: (i) parameter redundant models, (ii) stationary AR models that depend on the future, and (iii) MA models that are not unique. To overcome these problems, we will require some additional restrictions on the model parameters. First, we make the following definitions.

Definition 3.6 The AR and MA polynomials are defined as
φ(z) = 1 − φ_1 z − ··· − φ_p z^p,   φ_p ≠ 0,   (3.23)
and
θ(z) = 1 + θ_1 z + ··· + θ_q z^q,   θ_q ≠ 0,   (3.24)
respectively, where z is a complex number.

To address the first problem, we will henceforth refer to an ARMA(p, q) model to mean that it is in its simplest form. That is, in addition to the original definition given in equation (3.19), we will also require that φ(z) and θ(z) have no common factors. So, the process, x_t = .5 x_{t−1} − .5 w_{t−1} + w_t, discussed in Example 3.7 is not referred to as an ARMA(1, 1) process because, in its reduced form, x_t is white noise.
To address the problem of future-dependent models, we formally introduce the concept of causality.

Definition 3.7 An ARMA(p, q) model is said to be causal, if the time series {x_t; t = 0, ±1, ±2, . . .} can be written as a one-sided linear process:
x_t = Σ_{j=0}^{∞} ψ_j w_{t−j} = ψ(B) w_t,   (3.25)
where ψ(B) = Σ_{j=0}^{∞} ψ_j B^j, and Σ_{j=0}^{∞} |ψ_j| < ∞; we set ψ_0 = 1.

In Example 3.3, the AR(1) process, x_t = φ x_{t−1} + w_t, is causal only when |φ| < 1. Equivalently, the process is causal only when the root of φ(z) = 1 − φz is bigger than one in absolute value. That is, the root, say, z_0, of φ(z) is z_0 = 1/φ (because φ(z_0) = 0) and |z_0| > 1 because |φ| < 1. In general, we have the following property.

Property 3.1 Causality of an ARMA(p, q) Process
An ARMA(p, q) model is causal if and only if φ(z) ≠ 0 for |z| ≤ 1. The coefficients of the linear process given in (3.25) can be determined by solving
ψ(z) = Σ_{j=0}^{∞} ψ_j z^j = θ(z)/φ(z),   |z| ≤ 1.
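The condition in Property 3.1 (equivalently, all roots of φ(z) outside the unit circle, as rephrased below) is easy to check numerically with polyroot; a minimal sketch, with coefficients of our own choosing:
# is the AR(2) model x_t = x_{t-1} - .9 x_{t-2} + w_t causal?
# here phi(z) = 1 - z + .9 z^2, and polyroot takes coefficients in increasing powers
abs(polyroot(c(1, -1, .9)))   # both moduli are about 1.054 > 1, so the model is causal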

Another way to phrase Property 3.1 is that an ARMA process is causal only when the roots of φ(z) lie outside the unit circle; that is, φ(z) = 0 only when |z| > 1. Finally, to address the problem of uniqueness discussed in Example 3.6, we choose the model that allows an infinite autoregressive representation.

Definition 3.8 An ARMA(p, q) model is said to be invertible, if the time series {x_t; t = 0, ±1, ±2, . . .} can be written as
π(B) x_t = Σ_{j=0}^{∞} π_j x_{t−j} = w_t,   (3.26)
where π(B) = Σ_{j=0}^{∞} π_j B^j, and Σ_{j=0}^{∞} |π_j| < ∞; we set π_0 = 1.

Analogous to Property 3.1, we have the following property.

Property 3.2 Invertibility of an ARMA(p, q) Process
An ARMA(p, q) model is invertible if and only if θ(z) ≠ 0 for |z| ≤ 1. The coefficients π_j of π(B) given in (3.26) can be determined by solving
π(z) = Σ_{j=0}^{∞} π_j z^j = φ(z)/θ(z),   |z| ≤ 1.

Another way to phrase Property 3.2 is that an ARMA process is invertible only when the roots of θ(z) lie outside the unit circle; that is, θ(z) = 0 only when |z| > 1. The proof of Property 3.1 is given in Section B.2 (the proof of Property 3.2 is similar). The following examples illustrate these concepts.

Example 3.8 Parameter Redundancy, Causality, Invertibility
Consider the process
x_t = .4 x_{t−1} + .45 x_{t−2} + w_t + w_{t−1} + .25 w_{t−2},
or, in operator form,
(1 − .4B − .45B²) x_t = (1 + B + .25B²) w_t.
At first, x_t appears to be an ARMA(2, 2) process. But notice that
φ(B) = 1 − .4B − .45B² = (1 + .5B)(1 − .9B)
and
θ(B) = (1 + B + .25B²) = (1 + .5B)²
have a common factor that can be canceled. After cancellation, the operators are φ(B) = (1 − .9B) and θ(B) = (1 + .5B), so the model is an ARMA(1, 1) model, (1 − .9B)x_t = (1 + .5B)w_t, or
x_t = .9 x_{t−1} + .5 w_{t−1} + w_t.   (3.27)

The model is causal because φ(z) = (1 − .9z) = 0 when z = 10/9, which is outside the unit circle. The model is also invertible because the root of θ(z) = (1 + .5z) is z = −2, which is outside the unit circle.
To write the model as a linear process, we can obtain the ψ-weights using Property 3.1, φ(z)ψ(z) = θ(z), or
(1 − .9z)(1 + ψ_1 z + ψ_2 z² + ··· + ψ_j z^j + ···) = 1 + .5z.
Rearranging, we get
1 + (ψ_1 − .9)z + (ψ_2 − .9ψ_1)z² + ··· + (ψ_j − .9ψ_{j−1})z^j + ··· = 1 + .5z.
Matching the coefficients of z on the left and right sides we get ψ_1 − .9 = .5 and ψ_j − .9ψ_{j−1} = 0 for j > 1. Thus, ψ_j = 1.4(.9)^{j−1} for j ≥ 1, and (3.27) can be written as
x_t = w_t + 1.4 Σ_{j=1}^{∞} .9^{j−1} w_{t−j}.
The values of ψ_j may be calculated in R as follows:
ARMAtoMA(ar = .9, ma = .5, 10)   # first 10 psi-weights
 [1] 1.40 1.26 1.13 1.02 0.92 0.83 0.74 0.67 0.60 0.54
The invertible representation using Property 3.2 is obtained by matching coefficients in θ(z)π(z) = φ(z),
(1 + .5z)(1 + π_1 z + π_2 z² + π_3 z³ + ···) = 1 − .9z.
In this case, the π-weights are given by π_j = (−1)^j 1.4(.5)^{j−1}, for j ≥ 1, and hence, because w_t = Σ_{j=0}^{∞} π_j x_{t−j}, we can also write (3.27) as
x_t = 1.4 Σ_{j=1}^{∞} (−.5)^{j−1} x_{t−j} + w_t.
The values of π_j may be calculated in R as follows by reversing the roles of w_t and x_t; i.e., write the model as w_t = −.5 w_{t−1} + x_t − .9 x_{t−1}:
ARMAtoMA(ar = -.5, ma = -.9, 10)   # first 10 pi-weights
 [1] -1.400 .700 -.350 .175 -.087 .044 -.022 .011 -.006 .003

Example 3.9 Causal Conditions for an AR(2) Process
For an AR(1) model, (1 − φB)x_t = w_t, to be causal, the root of φ(z) = 1 − φz must lie outside of the unit circle. In this case, φ(z) = 0 when z = 1/φ, so it is easy to go from the causal requirement on the root, |1/φ| > 1, to a requirement on the parameter, |φ| < 1. It is not so easy to establish this relationship for higher order models.
For example, the AR(2) model, (1 − φ_1 B − φ_2 B²)x_t = w_t, is causal when the two roots of φ(z) = 1 − φ_1 z − φ_2 z² lie outside of the unit circle. Using the quadratic formula, this requirement can be written as

| (φ_1 ± \sqrt{φ_1^2 + 4φ_2}) / (−2φ_2) | > 1.

Fig. 3.3. Causal region for an AR(2) in terms of the parameters φ_1 and φ_2; the region splits into a real-roots subregion and a complex-roots subregion.

The roots of φ(z) may be real and distinct, real and equal, or a complex conjugate pair. If we denote those roots by z_1 and z_2, we can write φ(z) = (1 − z_1^{−1} z)(1 − z_2^{−1} z); note that φ(z_1) = φ(z_2) = 0. The model can be written in operator form as (1 − z_1^{−1}B)(1 − z_2^{−1}B) x_t = w_t. From this representation, it follows that φ_1 = (z_1^{−1} + z_2^{−1}) and φ_2 = −(z_1 z_2)^{−1}. This relationship and the fact that |z_1| > 1 and |z_2| > 1 can be used to establish the following equivalent condition for causality:

φ_1 + φ_2 < 1,   φ_2 − φ_1 < 1,   and   |φ_2| < 1.   (3.28)

This causality condition specifies a triangular region in the parameter space; see Figure 3.3. We leave the details of the equivalence to the reader (Problem 3.5).

3.2 Difference Equations

The study of the behavior of ARMA processes and their ACFs is greatly enhanced by a basic knowledge of difference equations, simply because they are difference equations. We will give a brief and heuristic account of the topic along with some examples of the usefulness of the theory. For details, the reader is referred to Mickens (1990).

Suppose we have a sequence of numbers u_0, u_1, u_2, ..., such that

u_n − α u_{n−1} = 0,   α ≠ 0,   n = 1, 2, ....   (3.29)

For example, recall (3.9) in which we showed that the ACF of an AR(1) process is a sequence, ρ(h), satisfying

ρ(h) − φ ρ(h − 1) = 0,   h = 1, 2, ....

Equation (3.29) represents a homogeneous difference equation of order 1. To solve the equation, we write:

u_1 = α u_0
u_2 = α u_1 = α^2 u_0
 ⋮
u_n = α u_{n−1} = α^n u_0.

Given an initial condition u_0 = c, we may solve (3.29), namely, u_n = α^n c.

In operator notation, (3.29) can be written as (1 − αB) u_n = 0. The polynomial associated with (3.29) is α(z) = 1 − αz, and the root, say, z_0, of this polynomial is z_0 = 1/α; that is, α(z_0) = 0. We know a solution (in fact, the solution) to (3.29), with initial condition u_0 = c, is

u_n = α^n c = (z_0^{−1})^n c.   (3.30)

That is, the solution to the difference equation (3.29) depends only on the initial condition and the inverse of the root of the associated polynomial α(z).

Now suppose that the sequence satisfies

u_n − α_1 u_{n−1} − α_2 u_{n−2} = 0,   α_2 ≠ 0,   n = 2, 3, ....   (3.31)

This equation is a homogeneous difference equation of order 2. The corresponding polynomial is

α(z) = 1 − α_1 z − α_2 z^2,

which has two roots, say, z_1 and z_2; that is, α(z_1) = α(z_2) = 0. We will consider two cases. First suppose z_1 ≠ z_2. Then the general solution to (3.31) is

u_n = c_1 z_1^{−n} + c_2 z_2^{−n},   (3.32)

where c_1 and c_2 depend on the initial conditions. The claim that (3.32) is a solution can be verified by direct substitution into (3.31):

(c_1 z_1^{−n} + c_2 z_2^{−n}) − α_1 (c_1 z_1^{−(n−1)} + c_2 z_2^{−(n−1)}) − α_2 (c_1 z_1^{−(n−2)} + c_2 z_2^{−(n−2)})
  = c_1 z_1^{−n} (1 − α_1 z_1 − α_2 z_1^2) + c_2 z_2^{−n} (1 − α_1 z_2 − α_2 z_2^2)
  = c_1 z_1^{−n} α(z_1) + c_2 z_2^{−n} α(z_2) = 0.

Given two initial conditions u_0 and u_1, we may solve for c_1 and c_2:

u_0 = c_1 + c_2   and   u_1 = c_1 z_1^{−1} + c_2 z_2^{−1},

where z_1 and z_2 can be solved for in terms of α_1 and α_2 using the quadratic formula, for example.
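The distinct-roots solution (3.32) is easy to check numerically. The following sketch (not part of the text; the coefficients and initial conditions are arbitrary choices) finds the roots with polyroot, solves for c_1 and c_2 from u_0 and u_1, and compares the closed form to the recursion.
a1 = .5; a2 = .3                      # alpha_1, alpha_2 (gives real, distinct roots)
z = polyroot(c(1, -a1, -a2))          # roots of alpha(z) = 1 - a1*z - a2*z^2
u0 = 1; u1 = .7                       # initial conditions
cc = solve(rbind(c(1, 1), 1/z), c(u0, u1))   # c1, c2 from u0 = c1 + c2, u1 = c1/z1 + c2/z2
n = 0:10
closed = Re(cc[1]*z[1]^(-n) + cc[2]*z[2]^(-n))
u = c(u0, u1); for (k in 3:11) u[k] = a1*u[k-1] + a2*u[k-2]
cbind(closed, u)                      # the two columns agree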

When the roots are equal, z_1 = z_2 (= z_0), a general solution to (3.31) is

u_n = z_0^{−n} (c_1 + c_2 n).   (3.33)

This claim can also be verified by direct substitution of (3.33) into (3.31):

z_0^{−n} [c_1 + c_2 n] − α_1 z_0^{−(n−1)} [c_1 + c_2 (n−1)] − α_2 z_0^{−(n−2)} [c_1 + c_2 (n−2)]
  = z_0^{−n} (c_1 + c_2 n)(1 − α_1 z_0 − α_2 z_0^2) + c_2 z_0^{−n+1} (α_1 + 2α_2 z_0)
  = c_2 z_0^{−n+1} (α_1 + 2α_2 z_0).

To show that (α_1 + 2α_2 z_0) = 0, write 1 − α_1 z − α_2 z^2 = (1 − z_0^{−1} z)^2, and take derivatives with respect to z on both sides of the equation to obtain (α_1 + 2α_2 z) = 2 z_0^{−1} (1 − z_0^{−1} z). Thus, (α_1 + 2α_2 z_0) = 2 z_0^{−1} (1 − z_0^{−1} z_0) = 0, as was to be shown. Finally, given two initial conditions, u_0 and u_1, we can solve for c_1 and c_2:

u_0 = c_1   and   u_1 = (c_1 + c_2) z_0^{−1}.

It can also be shown that these solutions are unique.

To summarize these results, in the case of distinct roots, the solution to the homogeneous difference equation of degree two was

u_n = z_1^{−n} × (a polynomial in n of degree m_1 − 1) + z_2^{−n} × (a polynomial in n of degree m_2 − 1),   (3.34)

where m_1 is the multiplicity of the root z_1 and m_2 is the multiplicity of the root z_2. In this example, of course, m_1 = m_2 = 1, and we called the polynomials of degree zero c_1 and c_2, respectively. In the case of the repeated root, the solution was

u_n = z_0^{−n} × (a polynomial in n of degree m_0 − 1),   (3.35)

where m_0 is the multiplicity of the root z_0; that is, m_0 = 2. In this case, we wrote the polynomial of degree one as c_1 + c_2 n. In both cases, we solved for c_1 and c_2 given two initial conditions, u_0 and u_1.

These results generalize to the homogeneous difference equation of order p:

u_n − α_1 u_{n−1} − ··· − α_p u_{n−p} = 0,   α_p ≠ 0,   n = p, p+1, ....   (3.36)

The associated polynomial is α(z) = 1 − α_1 z − ··· − α_p z^p. Suppose α(z) has r distinct roots, z_1 with multiplicity m_1, z_2 with multiplicity m_2, ..., and z_r with multiplicity m_r, such that m_1 + m_2 + ··· + m_r = p. The general solution to the difference equation (3.36) is

u_n = z_1^{−n} P_1(n) + z_2^{−n} P_2(n) + ··· + z_r^{−n} P_r(n),   (3.37)

where P_j(n), for j = 1, 2, ..., r, is a polynomial in n of degree m_j − 1. Given p initial conditions u_0, ..., u_{p−1}, we can solve for the P_j(n) explicitly.
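The repeated-root case (3.33) can be checked the same way. A sketch (not part of the text; the coefficients and initial conditions are arbitrary choices): with α_1 = 1 and α_2 = −.25, the polynomial is (1 − .5z)^2, so z_0 = 2 is a double root.
a1 = 1; a2 = -.25; z0 = 2             # repeated root z0 = 2
u0 = 1; u1 = .3                       # initial conditions
c1 = u0; c2 = u1*z0 - c1              # from u0 = c1 and u1 = (c1 + c2)/z0
n = 0:10
closed = z0^(-n)*(c1 + c2*n)
u = c(u0, u1); for (k in 3:11) u[k] = a1*u[k-1] + a2*u[k-2]
cbind(closed, u)                      # the two columns agree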

Example 3.10 The ACF of an AR(2) Process
Suppose x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + w_t is a causal AR(2) process. Multiply each side of the model by x_{t−h} for h > 0, and take expectation:

E(x_t x_{t−h}) = φ_1 E(x_{t−1} x_{t−h}) + φ_2 E(x_{t−2} x_{t−h}) + E(w_t x_{t−h}).

The result is

γ(h) = φ_1 γ(h−1) + φ_2 γ(h−2),   h = 1, 2, ....   (3.38)

In (3.38), we used the fact that E(x_t) = 0 and, for h > 0,

E(w_t x_{t−h}) = E( w_t \sum_{j=0}^{∞} ψ_j w_{t−h−j} ) = 0.

Divide (3.38) through by γ(0) to obtain the difference equation for the ACF of the process:

ρ(h) − φ_1 ρ(h−1) − φ_2 ρ(h−2) = 0,   h = 1, 2, ....   (3.39)

The initial conditions are ρ(0) = 1 and ρ(−1) = φ_1/(1 − φ_2), which is obtained by evaluating (3.39) for h = 1 and noting that ρ(1) = ρ(−1).

Using the results for the homogeneous difference equation of order two, let z_1 and z_2 be the roots of the associated polynomial, φ(z) = 1 − φ_1 z − φ_2 z^2. Because the model is causal, we know the roots are outside the unit circle: |z_1| > 1 and |z_2| > 1. Now, consider the solution for three cases:

(i) When z_1 and z_2 are real and distinct, then

ρ(h) = c_1 z_1^{−h} + c_2 z_2^{−h},

so ρ(h) → 0 exponentially fast as h → ∞.

(ii) When z_1 = z_2 (= z_0) are real and equal, then

ρ(h) = z_0^{−h} (c_1 + c_2 h),

so ρ(h) → 0 exponentially fast as h → ∞.

(iii) When z_1 = z̄_2 are a complex conjugate pair, then c_2 = c̄_1 (because ρ(h) is real), and

ρ(h) = c_1 z_1^{−h} + c̄_1 z̄_1^{−h}.

Write c_1 and z_1 in polar coordinates, for example, z_1 = |z_1| e^{iθ}, where θ is the angle whose tangent is the ratio of the imaginary part and the real part of z_1 (sometimes called arg(z_1); the range of θ is [−π, π]). Then, using the fact that e^{iα} + e^{−iα} = 2 cos(α), the solution has the form

ρ(h) = a |z_1|^{−h} cos(hθ + b),

where a and b are determined by the initial conditions. Again, ρ(h) dampens to zero exponentially fast as h → ∞, but it does so in a sinusoidal fashion. The implication of this result is shown in the next example.
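Cases (i) and (iii) can be contrasted numerically. The sketch below (not part of the text; the first pair of coefficients is an arbitrary choice) uses polyroot to inspect the roots and ARMAacf to display the two decay patterns.
phi.real    = c(.9, -.2)      # roots of 1 - .9z + .2z^2 are real (2 and 2.5)
phi.complex = c(1.5, -.75)    # roots are a complex conjugate pair (Example 3.11)
polyroot(c(1, -phi.real))     # real roots outside the unit circle
polyroot(c(1, -phi.complex))  # complex conjugate roots
plot(ARMAacf(ar = phi.real, lag.max = 30), type = "h", ylab = "ACF")     # smooth exponential decay
plot(ARMAacf(ar = phi.complex, lag.max = 30), type = "h", ylab = "ACF")  # damped sinusoidal decay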

Fig. 3.4. Simulated AR(2) model, n = 144, with φ_1 = 1.5 and φ_2 = −.75.

Example 3.11 An AR(2) with Complex Roots
Figure 3.4 shows n = 144 observations from the AR(2) model

x_t = 1.5 x_{t−1} − .75 x_{t−2} + w_t,

with σ_w^2 = 1, and with complex roots chosen so the process exhibits pseudo-cyclic behavior at the rate of one cycle every 12 time points. The autoregressive polynomial for this model is φ(z) = 1 − 1.5z + .75z^2. The roots of φ(z) are 1 ± i/√3, and θ = tan^{−1}(1/√3) = 2π/12 radians per unit time. To convert the angle to cycles per unit time, divide by 2π to get 1/12 cycles per unit time. The ACF for this model is shown in the left-hand side of Figure 3.5.

To calculate the roots of the polynomial and solve for arg in R:
z = c(1,-1.5,.75)     # coefficients of the polynomial
(a = polyroot(z)[1])  # print one root = 1 + i/sqrt(3)
[1] 1+0.57735i
arg = Arg(a)/(2*pi)   # arg in cycles/pt
1/arg                 # the pseudo period
[1] 12
To reproduce Figure 3.4:
set.seed(8675309)
ar2 = arima.sim(list(order=c(2,0,0), ar=c(1.5,-.75)), n = 144)
plot(ar2, axes=FALSE, xlab="Time")
axis(2); axis(1, at=seq(0,144,by=12)); box()
abline(v=seq(0,144,by=12), lty=2)
To calculate and display the ACF for this model:
ACF = ARMAacf(ar=c(1.5,-.75), ma=0, 50)
plot(ACF, type="h", xlab="lag")
abline(h=0)

Example 3.12 The ψ-weights for an ARMA Model
For a causal ARMA(p, q) model, φ(B)x_t = θ(B)w_t, where the zeros of φ(z) are outside the unit circle, recall that we may write

x_t = \sum_{j=0}^{∞} ψ_j w_{t−j},

where the ψ-weights are determined using Property 3.1.

For the pure MA(q) model, ψ_0 = 1, ψ_j = θ_j, for j = 1, ..., q, and ψ_j = 0, otherwise. For the general case of ARMA(p, q) models, the task of solving for the ψ-weights is much more complicated, as was demonstrated in Example 3.8. The use of the theory of homogeneous difference equations can help here. To solve for the ψ-weights in general, we must match the coefficients in φ(z)ψ(z) = θ(z):

(1 − φ_1 z − φ_2 z^2 − ···)(ψ_0 + ψ_1 z + ψ_2 z^2 + ···) = 1 + θ_1 z + θ_2 z^2 + ···.

The first few values are

ψ_0 = 1
ψ_1 − φ_1 ψ_0 = θ_1
ψ_2 − φ_1 ψ_1 − φ_2 ψ_0 = θ_2
ψ_3 − φ_1 ψ_2 − φ_2 ψ_1 − φ_3 ψ_0 = θ_3
 ⋮

where we would take φ_j = 0 for j > p, and θ_j = 0 for j > q. The ψ-weights satisfy the homogeneous difference equation given by

ψ_j − \sum_{k=1}^{p} φ_k ψ_{j−k} = 0,   j ≥ max(p, q + 1),   (3.40)

with initial conditions

ψ_j − \sum_{k=1}^{j} φ_k ψ_{j−k} = θ_j,   0 ≤ j < max(p, q + 1).   (3.41)

The general solution depends on the roots of the AR polynomial φ(z) = 1 − φ_1 z − ··· − φ_p z^p, as seen from (3.40). The specific solution will, of course, depend on the initial conditions.

Consider the ARMA process given in (3.27), x_t = .9x_{t−1} + .5w_{t−1} + w_t. Because max(p, q + 1) = 2, using (3.41), we have ψ_0 = 1 and ψ_1 = .9 + .5 = 1.4. By (3.40), for j = 2, 3, ..., the ψ-weights satisfy ψ_j − .9ψ_{j−1} = 0. The general solution is ψ_j = c .9^j. To find the specific solution, use the initial condition ψ_1 = 1.4, so 1.4 = .9c or c = 1.4/.9. Finally, ψ_j = 1.4(.9)^{j−1}, for j ≥ 1, as we saw in Example 3.8.

To view, for example, the first 50 ψ-weights in R, use:
ARMAtoMA(ar=.9, ma=.5, 50)        # for a list
plot(ARMAtoMA(ar=.9, ma=.5, 50))  # for a graph
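The recursions (3.40)–(3.41) are also simple to code directly. The following is a rough sketch (not part of the text; the function name is ours): it builds the ψ-weights from the initial conditions and the homogeneous recursion, and is checked against ARMAtoMA for the ARMA(1,1) model (3.27).
psi.weights = function(phi, theta, lag.max) {
  p = length(phi); q = length(theta)
  psi = numeric(lag.max + 1); psi[1] = 1          # psi[j+1] stores psi_j, psi_0 = 1
  for (j in 1:lag.max) {
    th = if (j <= q) theta[j] else 0              # theta_j = 0 for j > q
    ar.part = 0
    for (k in 1:p) if (j - k >= 0) ar.part = ar.part + phi[k]*psi[j - k + 1]
    psi[j + 1] = th + ar.part                     # (3.40)-(3.41) combined
  }
  psi[-1]                                         # return psi_1, psi_2, ...
}
psi.weights(phi = .9, theta = .5, lag.max = 10)
ARMAtoMA(ar = .9, ma = .5, 10)                    # same values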

3.3 Autocorrelation and Partial Autocorrelation

We begin by exhibiting the ACF of an MA(q) process, x_t = θ(B)w_t, where θ(B) = 1 + θ_1 B + ··· + θ_q B^q. Because x_t is a finite linear combination of white noise terms, the process is stationary with mean

E(x_t) = \sum_{j=0}^{q} θ_j E(w_{t−j}) = 0,

where we have written θ_0 = 1, and with autocovariance function

γ(h) = cov(x_{t+h}, x_t) = cov( \sum_{j=0}^{q} θ_j w_{t+h−j}, \sum_{k=0}^{q} θ_k w_{t−k} )
     = σ_w^2 \sum_{j=0}^{q−h} θ_j θ_{j+h}   for 0 ≤ h ≤ q,   and 0 for h > q.   (3.42)

Recall that γ(h) = γ(−h), so we will only display the values for h ≥ 0. Note that γ(q) cannot be zero because θ_q ≠ 0. The cutting off of γ(h) after q lags is the signature of the MA(q) model. Dividing (3.42) by γ(0) yields the ACF of an MA(q):

ρ(h) = ( \sum_{j=0}^{q−h} θ_j θ_{j+h} ) / ( 1 + θ_1^2 + ··· + θ_q^2 )   for 1 ≤ h ≤ q,   and 0 for h > q.   (3.43)

For a causal ARMA(p, q) model, φ(B)x_t = θ(B)w_t, where the zeros of φ(z) are outside the unit circle, write

x_t = \sum_{j=0}^{∞} ψ_j w_{t−j}.   (3.44)

It follows immediately that E(x_t) = 0 and the autocovariance function of x_t is

γ(h) = cov(x_{t+h}, x_t) = σ_w^2 \sum_{j=0}^{∞} ψ_j ψ_{j+h},   h ≥ 0.   (3.45)

We could then use (3.40) and (3.41) to solve for the ψ-weights. In turn, we could solve for γ(h), and the ACF ρ(h) = γ(h)/γ(0). As in Example 3.10, it is also possible to obtain a homogeneous difference equation directly in terms of γ(h). First, we write

γ(h) = cov(x_{t+h}, x_t) = cov( \sum_{j=1}^{p} φ_j x_{t+h−j} + \sum_{j=0}^{q} θ_j w_{t+h−j}, x_t )
     = \sum_{j=1}^{p} φ_j γ(h−j) + σ_w^2 \sum_{j=h}^{q} θ_j ψ_{j−h},   h ≥ 0,   (3.46)

107 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 97 — #107 i i 97 3.3 Autocorrelation and Partial Autocorrelation ≥ 0 , h where we have used the fact that, for ∞ ( ) ’ 2 w , , x cov ) = cov ( w σ ψ w . = ψ t − k k j − h j − h + + t t t − j h w 0 = k From (3.46), we can write a general homogeneous equation for the ACF of a causal ARMA process : γ ( h )− φ (3.47) γ ( h − 1 )−···− φ , γ ( h − p ) = 0 , h ≥ max ( p , q + 1 ) p 1 with initial conditions q p ’ ’ 2 (3.48) + q , p ( max . h ≤ 0 , ψ θ ) 1 < j h σ h = φ γ ( ) γ − ( )− h − j j j w j = 1 = h j Dividing (3.47) and (3.48) through by γ ( 0 ) will allow us to solve for the ACF, 0 ρ h ) = γ ( h )/ γ ( ( ) . Example 3.13 The ACF of an AR( p ) In Example 3.10 we considered the case where p = 2 . For the general case, it follows immediately from (3.47) that p ≥ (3.49) . ρ ( h )− φ h ρ ( h − 1 )−···− φ , ρ ( h − p ) = 0 p 1 z Let z , respec- denote the roots of φ ( z ) , each with multiplicity m , . . ., , . . ., m 1 r 1 r + ··· m m + = p . Then, from (3.37), the general solution is tively, where 1 r − h h − h − , z P (3.50) ( h ) + z = ) h P , ( h ) + ··· + z ( ρ p P ≥ ( h ) h 2 1 r r 1 2 1 P ( h ) is a polynomial in h of degree where . − m j j | z Recall that for a causal model, all of the roots are outside the unit circle, | > 1 , i ( = dampens exponentially fast to ) for i h 1 , . . ., r . If all the roots are real, then ρ zero as h If some of the roots are complex, then they will be in conjugate → ∞ . ρ h ) will dampen, in a sinusoidal fashion, exponentially fast to zero as pairs and ( → ∞ . In the case of complex roots, the time series will appear to be cyclic in h nature. This, of course, is also true for ARMA models in which the AR part has complex roots. Example 3.14 The ACF of an ARMA ( 1 , 1 ) Consider the ARMA ( 1 , 1 ) process x = φ x + + θ w w , where | φ | < 1 . Based 1 t − t − 1 t t on (3.47), the autocovariance function satisfies = ( h )− φγ ( h − 1 ) γ 0 , h = 2 , 3 , . . ., and it follows from (3.29)–(3.30) that the general solution is h 2 (3.51) , . . . . γ ( h ) = c φ , , h = 1 i i i i

108 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 98 — #108 i i 3 ARIMA Models 98 To obtain the initial conditions, we use (3.48): 2 2 2 φγ ( 1 ) + σ γ γ [ 1 + θφ + θ ( ] and 0 ( 1 ) = φγ ( 0 ) + σ ) = θ. w w ( 0 ) and γ γ 1 ) , we obtain: Solving for ( 2 θ + φ )( θφ + 1 + 2 θφ + θ ) 1 ( 2 2 and ) = σ σ ( γ ( 0 1 γ = ) . w w 2 2 φ − 1 1 − φ , note that from (3.51), γ ( 1 ) = c φ or c = γ ( 1 )/ φ . Hence, the specific To solve for c ≥ h solution for 1 is + φ ( )( θφ ) θ 1 + ) 1 ( γ 2 h 1 − h = σ φ φ . = γ ( h ) w 2 φ φ 1 − γ 0 ) yields the ACF Finally, dividing through by ( ) θ + ( 1 + θφ )( φ h − 1 . 1 ≥ , h (3.52) φ = ) h ( ρ 2 + 1 θ θφ 2 + in (3.52) is not different from ρ ) versus h h Notice that the general pattern of ( that of an AR(1) given in (3.8). Hence, it is unlikely that we will be able to tell the difference between an ARMA(1,1) and an AR(1) based solely on an ACF estimated from a sample. This consideration will lead us to the partial autocorrelation function. The Partial Autocorrelation Function (PACF) q We have seen in (3.43), for MA( ) models, the ACF will be zero for lags greater q . Moreover, because θ . Thus, the ACF , 0 , the ACF will not be zero at lag than q q provides a considerable amount of information about the order of the dependence when the process is a moving average process. If the process, however, is ARMA or AR, the ACF alone tells us little about the orders of dependence. Hence, it is worthwhile pursuing a function that will behave like the ACF of MA models, but for AR models, namely, the partial autocorrelation function (PACF) . Recall that if X , Y , and Z are random variables, then the partial correlation ˆ Z Y given Z is obtained by regressing X on and to obtain between X , regressing Y X ˆ Z to obtain Y on , and then calculating ˆ ˆ } Y = corr { X − ρ X , Y − . XY Z | and X with the linear Y The idea is that ρ measures the correlation between | XY Z effect of Z removed (or partialled out). If the variables are multivariate normal, then ρ X . ) = corr ( this definition coincides with , Y | Z Z | XY To motivate the idea for time series, consider a causal AR(1) model, x = φ x + t − 1 t w . Then, t γ ( 2 ) = cov ( x , x ) ) = cov ( φ x + w , x t − t t t − 1 − x t 2 2 2 2 w = cov ( φ . x + φ ) ( + w , x ) = φ 0 γ − t t − 2 − 2 1 t t x i i i i

This result follows from causality because x_{t−2} involves {w_{t−2}, w_{t−3}, ...}, which are all uncorrelated with w_t and w_{t−1}. The correlation between x_t and x_{t−2} is not zero, as it would be for an MA(1), because x_t is dependent on x_{t−2} through x_{t−1}. Suppose we break this chain of dependence by removing (or partialling out) the effect of x_{t−1}. That is, we consider the correlation between x_t − φx_{t−1} and x_{t−2} − φx_{t−1}, because it is the correlation between x_t and x_{t−2} with the linear dependence of each on x_{t−1} removed. In this way, we have broken the dependence chain between x_t and x_{t−2}. In fact,

cov(x_t − φx_{t−1}, x_{t−2} − φx_{t−1}) = cov(w_t, x_{t−2} − φx_{t−1}) = 0.

Hence, the tool we need is partial autocorrelation, which is the correlation between x_s and x_t with the linear effect of everything "in the middle" removed.

To formally define the PACF for mean-zero stationary time series, let x̂_{t+h}, for h ≥ 2, denote the regression^{3.3} of x_{t+h} on {x_{t+h−1}, x_{t+h−2}, ..., x_{t+1}}, which we write as

x̂_{t+h} = β_1 x_{t+h−1} + β_2 x_{t+h−2} + ··· + β_{h−1} x_{t+1}.   (3.53)

No intercept term is needed in (3.53) because the mean of x_t is zero (otherwise, replace x_t by x_t − μ_x in this discussion). In addition, let x̂_t denote the regression of x_t on {x_{t+1}, x_{t+2}, ..., x_{t+h−1}}; then

x̂_t = β_1 x_{t+1} + β_2 x_{t+2} + ··· + β_{h−1} x_{t+h−1}.   (3.54)

Because of stationarity, the coefficients, β_1, ..., β_{h−1}, are the same in (3.53) and (3.54); we will explain this result in the next section, but it will be evident from the examples.

Definition 3.9 The partial autocorrelation function (PACF) of a stationary process, x_t, denoted φ_hh, for h = 1, 2, ..., is

φ_11 = corr(x_{t+1}, x_t) = ρ(1)   (3.55)

and

φ_hh = corr(x_{t+h} − x̂_{t+h}, x_t − x̂_t),   h ≥ 2.   (3.56)

The reason for using a double subscript will become evident in the next section. The PACF, φ_hh, is the correlation between x_{t+h} and x_t with the linear dependence of {x_{t+1}, ..., x_{t+h−1}} on each, removed. If the process x_t is Gaussian, then φ_hh = corr(x_{t+h}, x_t | x_{t+1}, ..., x_{t+h−1}); that is, φ_hh is the correlation coefficient between x_{t+h} and x_t in the bivariate distribution of (x_{t+h}, x_t) conditional on {x_{t+1}, ..., x_{t+h−1}}.

^{3.3} The term regression here refers to regression in the population sense. That is, x̂_{t+h} is the linear combination of {x_{t+1}, x_{t+2}, ..., x_{t+h−1}} that minimizes the mean squared error E(x_{t+h} − \sum_{j=1}^{h−1} α_j x_{t+j})^2.
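Although the regressions in Definition 3.9 are population-level, the same idea can be mimicked on data. A sample-based sketch (not part of the text; the AR(1) parameter, seed, and sample size are arbitrary choices): regress out the middle value and correlate the residuals, then compare with the lag-2 sample PACF.
set.seed(101)
x = arima.sim(list(ar = .8), n = 500)
x0 = x[1:498]; x1 = x[2:499]; x2 = x[3:500]     # x_t, x_{t+1}, x_{t+2}
e2 = resid(lm(x2 ~ 0 + x1))                     # x_{t+2} minus its regression on x_{t+1}
e0 = resid(lm(x0 ~ 0 + x1))                     # x_t minus its regression on x_{t+1}
cor(e2, e0)                                     # near zero for an AR(1); see Example 3.15
acf(x, plot = FALSE, type = "partial")$acf[2]   # lag-2 sample PACF, also near zero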

110 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 100 — #110 i i 100 3 ARIMA Models 1.0 1.0 0.5 0.5 ACF PACF 0.0 0.0 −0.5 −0.5 15 5 20 5 10 15 20 10 lag lag φ 1 = Fig. 3.5. . The ACF and PACF of an AR(2) model with and φ . = − . 75 5 2 1 ) 1 Example 3.15 The PACF of an AR( = φ x + x Consider the PACF of the AR(1) process given by , with | φ | < 1 . w t t − t 1 φ By definition, ρ ( 1 ) = φ . To calculate φ on , consider the regression of x = t + 22 2 11 , say, ˆ x = β x . We choose β to minimize x + 1 t + 1 t + 2 t 2 2 2 E ( x ( γ ) − 0 x β + ) ) ˆ = E ( x 1 ( βγ − β x 2 )− 0 ) . = γ ( 1 + 2 + t 2 + t 2 + t t β Taking derivatives with respect to and setting the result equal to zero, we have = γ ( 1 )/ γ ( 0 ) = ρ β 1 ) = φ. Next, consider the regression of x , say on x ( t + 1 t β . We choose ˆ to minimize x x = β t + 1 t 2 2 2 E ( x − ˆ x . ) γ = E ( ) − β x ) 0 = γ ( 0 )− 2 βγ ( 1 ) + β ( x + 1 t t t t This is the same equation as before, so φ . Hence, = β = corr ( x − ˆ x , x − ˆ ) ) = corr ( x − φ x φ , x − φ x x t + 2 t t 22 t + 1 t t t + 2 + t + 1 2 φ = w , x − ( x corr ) = 0 t t + 1 t + 2 by causality. Thus, φ = 0 . In the next example, we will see that in this case, 22 for all φ . = 0 > h 1 hh Example 3.16 The PACF of an AR( ) p Õ p x = The model implies ( are outside ) z , where the roots of x φ w + φ t + h j + t + h t − j h 1 = j the unit circle. When h > p , the regression of x , is } x on { x , . . ., + + h t + h − 1 t t 1 p ’ x φ . = ˆ x t j + h − j h + t = 1 j We have not proved this obvious result yet, but we will prove it in the next section. p > , h Thus, when i i i i

111 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 101 — #111 i i 3.3 Autocorrelation and Partial Autocorrelation 101 Table 3.1. Behavior of the ACF and PACF for ARMA Models AR( p ) MA( q ) ARMA( p , q ) ACF Tails off Cuts off Tails off after lag q Tails off Tails off PACF Cuts off after lag p − = φ = corr ( x − ˆ x , x 0 ˆ x ) = corr ( w , , x − ˆ x ) h + h t + h t hh t t t t t + { , . . . − ˆ x depends only on x w , w because, by causality, } ; recall equation h − 1 t 2 t + h − t t + (3.54). When are not necessarily zero. ≤ p , φ , . . ., φ is not zero, and φ h pp p − 1 , p − 1 11 = φ . Figure 3.5 shows the ACF and the PACF We will see later that, in fact, φ p pp of the AR(2) model presented in Example 3.11. To reproduce Figure 3.5 in R, use the following commands: ACF = ARMAacf(ar=c(1.5,-.75), ma=0, 24)[-1] PACF = ARMAacf(ar=c(1.5,-.75), ma=0, 24, pacf=TRUE) par(mfrow=c(1,2)) plot(ACF, type="h", xlab="lag", ylim=c(-.8,1)); abline(h=0) plot(PACF, type="h", xlab="lag", ylim=c(-.8,1)); abline(h=0) Example 3.17 The PACF of an Invertible MA(q) Õ ∞ + x . Moreover, no finite = − w For an invertible MA( q ), we can write x π t j − j t t j = 1 representation exists. From this result, it should be apparent that the PACF will never cut off, as in the case of an AR( p ). For an MA(1), x 1 , calculations similar to Exam- w < + θ w | θ | , with = 1 − t t t 2 4 2 ple 3.15 will yield φ ) = . For the MA(1) in general, we can show θ θ /( 1 + θ − + 22 that 2 h (− ) θ ) ( 1 − θ = φ − , . 1 ≥ h hh + h 1 ) 2 ( θ 1 − In the next section, we will discuss methods of calculating the PACF. The PACF for MA models behaves much like the ACF for AR models. Also, the PACF for AR models behaves much like the ACF for MA models. Because an invertible ARMA model has an infinite AR representation, the PACF will not cut off. We may summarize these results in Table 3.1. Example 3.18 Preliminary Analysis of the Recruitment Series We consider the problem of modeling the Recruitment series shown in Figure 1.5. There are 453 months of observed recruitment ranging over the years 1950-1987. The ACF and the PACF given in Figure 3.6 are consistent with the behavior of an AR(2). The ACF has cycles corresponding roughly to a 12-month period, and the PACF has large values for h = 1 , 2 and then is essentially zero for higher ) = 2 order lags. Based on Table 3.1, these results suggest that a second-order ( p autoregressive model might provide a good fit. Although we will discuss estimation i i i i

112 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 102 — #112 i i 102 3 ARIMA Models 1.0 0.5 ACF 0.0 −0.5 0 3 2 4 1 LAG 1.0 0.5 PACF 0.0 −0.5 0 4 1 2 3 LAG Fig. 3.6. ACF and PACF of the Recruitment series. Note that the lag axes are in terms of season (12 months in this case). in detail in Section 3.5, we ran a regression (see Section 2.1) using the data triplets x ; z x , z to fit a model of the form ) : ( x )} ; x x , x , ) , ( x x ; {( ; , x x ) , . . ., ( 2 1 4 452 1 2 451 3 2 453 3 φ = φ x + + x x + w φ 2 t 1 0 t − 2 t − t 1 ˆ t 3 , 4 , . . ., 453 . The estimates and standard errors (in parentheses) are = φ for = 0 2 ˆ ˆ . 6 . 74 = . σ ˆ , and , 72 φ 46 = 1 . 35 . − = φ , 89 1 2 ) ) ( . 04 ) . 11 . 1 ( ( 04 w The following R code can be used for this analysis. We use acf2 from astsa to print and plot the ACF and PACF. # will produce values and a graphic acf2(rec, 48) (regr = ar.ols(rec, order=2, demean=FALSE, intercept=TRUE)) regr$asy.se.coef # standard errors of the estimates 3.4 Forecasting In forecasting, the goal is to predict future values of a time series, x , . . . , 2 , m = 1 , m n + } x , . . ., . Throughout this based on the data collected to the present, x x , = { x 1 2 n n 1: section, we will assume x is stationary and the model parameters are known. The t problem of forecasting when the model parameters are unknown will be discussed in the next section; also, see Problem 3.26. The minimum mean square error predictor of x is n + m n x (3.57) | ) x x ( E = n n + m 1: m n + because the conditional expectation minimizes the mean square error 2 ] [ (3.58) E , x ) x ( − g m 1: n n + x is a function of the observations ) x where ; see Problem 3.14. ( g 1: n 1: n i i i i

113 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 103 — #113 i i 103 3.4 Forecasting First, we will restrict attention to predictors that are linear functions of the data, that is, predictors of the form n ’ n + α x = , α (3.59) x 0 k k n m + = k 1 α where , . . .,α are real numbers. We note that the α ,α n and m , but for s depend on n 0 1 1 now we drop the dependence from the notation. For example, if n is m = 1 , then x = 2 1 + . α = x α x the one-step-ahead linear forecast of x given x . In terms of (3.59), 0 1 1 2 1 2 2 But if n = 2 , x x . In terms is the one-step-ahead linear forecast of x and given x 1 3 2 3 2 1 2 will be different. and x + α x + α = of (3.59), , and in general, the α s in x x α x 2 2 1 0 1 3 3 2 Linear predictors of the form (3.59) that minimize the mean square prediction (BLPs). As we shall see, linear prediction error (3.58) are called best linear predictors depends only on the second-order moments of the process, which are easy to estimate from the data. Much of the material in this section is enhanced by the theoretical material presented in Appendix B. For example, Theorem B.3 states that if the process is Gaussian, minimum mean square error predictors and best linear predictors are the same. The following property, which is based on the Projection Theorem, Theorem B.1, is a key result. Property 3.3 Best Linear Prediction for Stationary Processes Õ n n x x , . . ., x α , the best linear predictor, x , of = α x + , Given data n 0 m n k 1 k + m n + 1 k = m ≥ 1 , is found by solving for [ ] ( ) n n x , 0 , . . ., − x 1 (3.60) E x = , = 0 , k k + n m n + m ,α where 1 , for α . = , . . . α x 1 n 0 0 prediction equations , and they The equations specified in (3.60) are called the { ,α , . . .,α α } . The results of Property 3.3 can are used to solve for the coefficients 1 0 n Õ n 2 E ( x with − ) also be obtained via least squares; i.e., to minimize Q = x α k m + n k = k 0 α α ∂ Q / ∂α respect to the = 0 for the s, solve , j = 0 , 1 , . . ., n . This leads to (3.60). j j μ If E ( x ) of (3.60) implies ) = 0 , the first equation ( k = t n μ. = ) ) = E ( x x E ( n + m + m n Thus, taking expectation in (3.59), we have n n ) ( ’ ’ α − 1 μ α . μ or α = α + μ = k 0 k 0 1 = k = 1 k Hence, the form of the BLP is n ’ n μ μ = x . ) + α x ( − k k m n + 1 = k Thus, until we discuss estimation, there is no loss of generality in considering the = α . 0 case that μ = 0 , in which case, 0 i i i i

First, consider one-step-ahead prediction. That is, given {x_1, ..., x_n}, we wish to forecast the value of the time series at the next time point, x_{n+1}. The BLP of x_{n+1} is of the form

x^n_{n+1} = φ_{n1} x_n + φ_{n2} x_{n−1} + ··· + φ_{nn} x_1,   (3.61)

where we now display the dependence of the coefficients on n; in this case, α_k in (3.59) is φ_{n,n+1−k} in (3.61), for k = 1, ..., n. Using Property 3.3, the coefficients {φ_{n1}, φ_{n2}, ..., φ_{nn}} satisfy

E[ ( x_{n+1} − \sum_{j=1}^{n} φ_{nj} x_{n+1−j} ) x_{n+1−k} ] = 0,   k = 1, ..., n,

or

\sum_{j=1}^{n} φ_{nj} γ(k − j) = γ(k),   k = 1, ..., n.   (3.62)

The prediction equations (3.62) can be written in matrix notation as

Γ_n φ_n = γ_n,   (3.63)

where Γ_n = {γ(k − j)}_{j,k=1}^{n} is an n × n matrix, φ_n = (φ_{n1}, ..., φ_{nn})' is an n × 1 vector, and γ_n = (γ(1), ..., γ(n))' is an n × 1 vector.

The matrix Γ_n is nonnegative definite. If Γ_n is singular, there are many solutions to (3.63), but, by the Projection Theorem (Theorem B.1), x^n_{n+1} is unique. If Γ_n is nonsingular, the elements of φ_n are unique, and are given by

φ_n = Γ_n^{−1} γ_n.   (3.64)

For ARMA models, the fact that σ_w^2 > 0 and γ(h) → 0 as h → ∞ is enough to ensure that Γ_n is positive definite (Problem 3.12). It is sometimes convenient to write the one-step-ahead forecast in vector notation

x^n_{n+1} = φ_n' x,   (3.65)

where x = (x_n, x_{n−1}, ..., x_1)'.

The mean square one-step-ahead prediction error is

P^n_{n+1} = E(x_{n+1} − x^n_{n+1})^2 = γ(0) − γ_n' Γ_n^{−1} γ_n.   (3.66)

To verify (3.66) using (3.64) and (3.65),

E(x_{n+1} − x^n_{n+1})^2 = E(x_{n+1} − φ_n' x)^2 = E(x_{n+1} − γ_n' Γ_n^{−1} x)^2
  = E( x_{n+1}^2 − 2 γ_n' Γ_n^{−1} x x_{n+1} + γ_n' Γ_n^{−1} x x' Γ_n^{−1} γ_n )
  = γ(0) − 2 γ_n' Γ_n^{−1} γ_n + γ_n' Γ_n^{−1} Γ_n Γ_n^{−1} γ_n
  = γ(0) − γ_n' Γ_n^{−1} γ_n.
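A small numerical illustration of (3.63)–(3.66) (a sketch, not part of the text; the model and the number of observations are arbitrary choices): for the AR(2) of Example 3.11, build Γ_n and γ_n from the model autocovariances and solve for the BLP coefficients and the one-step-ahead prediction error.
phi = c(1.5, -.75); sigw2 = 1; n = 5
psi = c(1, ARMAtoMA(ar = phi, ma = 0, 2000))   # psi_0 = 1, psi_1, psi_2, ...
gamma0 = sigw2*sum(psi^2)                      # gamma(0) = sigma_w^2 * sum of psi_j^2
gam = gamma0*ARMAacf(ar = phi, lag.max = n)    # gamma(0), ..., gamma(n)
Gamma.n = toeplitz(gam[1:n])                   # {gamma(k - j)}
gamma.n = gam[2:(n+1)]                         # (gamma(1), ..., gamma(n))'
(phi.n = solve(Gamma.n, gamma.n))              # about (1.5, -.75, 0, 0, 0); cf. (3.67)
(P = gamma0 - sum(gamma.n*phi.n))              # about sigma_w^2 = 1, by (3.66)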

Example 3.19 Prediction for an AR(2)
Suppose we have a causal AR(2) process x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + w_t, and one observation x_1. Then, using equation (3.64), the one-step-ahead prediction of x_2 based on x_1 is

x^1_2 = φ_11 x_1 = [γ(1)/γ(0)] x_1 = ρ(1) x_1.

Now, suppose we want the one-step-ahead prediction of x_3 based on two observations x_1 and x_2; i.e., x^2_3 = φ_21 x_2 + φ_22 x_1. We could use (3.62)

φ_21 γ(0) + φ_22 γ(1) = γ(1)
φ_21 γ(1) + φ_22 γ(0) = γ(2)

to solve for φ_21 and φ_22, or use the matrix form in (3.64) and solve

(φ_21, φ_22)' = [ γ(0), γ(1); γ(1), γ(0) ]^{−1} (γ(1), γ(2))',

but, it should be apparent from the model that x^2_3 = φ_1 x_2 + φ_2 x_1. Because φ_1 x_2 + φ_2 x_1 satisfies the prediction equations (3.60),

E{[x_3 − (φ_1 x_2 + φ_2 x_1)] x_1} = E(w_3 x_1) = 0,
E{[x_3 − (φ_1 x_2 + φ_2 x_1)] x_2} = E(w_3 x_2) = 0,

it follows that, indeed, x^2_3 = φ_1 x_2 + φ_2 x_1, and by the uniqueness of the coefficients in this case, that φ_21 = φ_1 and φ_22 = φ_2. Continuing in this way, it is easy to verify that, for n ≥ 2,

x^n_{n+1} = φ_1 x_n + φ_2 x_{n−1}.

That is, φ_n1 = φ_1, φ_n2 = φ_2, and φ_nj = 0, for j = 3, 4, ..., n.

From Example 3.19, it should be clear (Problem 3.45) that, if the time series is a causal AR(p) process, then, for n ≥ p,

x^n_{n+1} = φ_1 x_n + φ_2 x_{n−1} + ··· + φ_p x_{n−p+1}.   (3.67)

For ARMA models in general, the prediction equations will not be as simple as the pure AR case. In addition, for n large, the use of (3.64) is prohibitive because it requires the inversion of a large matrix. There are, however, iterative solutions that do not require any matrix inversion. In particular, we mention the recursive solution due to Levinson (1947) and Durbin (1960).

Property 3.4 The Durbin–Levinson Algorithm
Equations (3.64) and (3.66) can be solved iteratively as follows:

φ_00 = 0,   P^0_1 = γ(0).   (3.68)

For n ≥ 1,

φ_nn = ( ρ(n) − \sum_{k=1}^{n−1} φ_{n−1,k} ρ(n−k) ) / ( 1 − \sum_{k=1}^{n−1} φ_{n−1,k} ρ(k) ),   P^n_{n+1} = P^{n−1}_n (1 − φ_nn^2),   (3.69)

where, for n ≥ 2,

φ_{nk} = φ_{n−1,k} − φ_nn φ_{n−1,n−k},   k = 1, 2, ..., n−1.   (3.70)

The proof of Property 3.4 is left as an exercise; see Problem 3.13.

Example 3.20 Using the Durbin–Levinson Algorithm
To use the algorithm, start with φ_00 = 0, P^0_1 = γ(0). Then, for n = 1,

φ_11 = ρ(1),   P^1_2 = γ(0)[1 − φ_11^2].

For n = 2,

φ_22 = ( ρ(2) − φ_11 ρ(1) ) / ( 1 − φ_11 ρ(1) ),   φ_21 = φ_11 − φ_22 φ_11,
P^2_3 = P^1_2 [1 − φ_22^2] = γ(0)[1 − φ_11^2][1 − φ_22^2].

For n = 3,

φ_33 = ( ρ(3) − φ_21 ρ(2) − φ_22 ρ(1) ) / ( 1 − φ_21 ρ(1) − φ_22 ρ(2) ),
φ_32 = φ_22 − φ_33 φ_21,   φ_31 = φ_21 − φ_33 φ_22,
P^3_4 = P^2_3 [1 − φ_33^2] = γ(0)[1 − φ_11^2][1 − φ_22^2][1 − φ_33^2],

and so on. Note that, in general, the standard error of the one-step-ahead forecast is the square root of

P^n_{n+1} = γ(0) \prod_{j=1}^{n} [1 − φ_jj^2].   (3.71)

An important consequence of the Durbin–Levinson algorithm is (see Problem 3.13) as follows.

Property 3.5 Iterative Solution for the PACF
The PACF of a stationary process x_t can be obtained iteratively via (3.69) as φ_nn, for n = 1, 2, ....

Using Property 3.5 and putting n = p in (3.61) and (3.67), it follows that for an AR(p) model,

x^p_{p+1} = φ_p1 x_p + φ_p2 x_{p−1} + ··· + φ_pp x_1
          = φ_1 x_p + φ_2 x_{p−1} + ··· + φ_p x_1.   (3.72)

Result (3.72) shows that for an AR(p) model, the partial autocorrelation coefficient at lag p, φ_pp, is also the last coefficient in the model, φ_p, as was claimed in Example 3.16.
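The Durbin–Levinson recursion is short enough to write out in a few lines of R. The following is a sketch (not part of the text; the function name is ours), which returns φ_11, φ_22, ... for a given ACF and is checked against ARMAacf with pacf = TRUE.
durbin.levinson = function(rho, n.max) {      # rho = c(rho(1), ..., rho(n.max))
  pacf = numeric(n.max); phi.prev = numeric(0)
  for (n in 1:n.max) {
    if (n == 1) phi.nn = rho[1]
    else phi.nn = (rho[n] - sum(phi.prev*rho[(n-1):1])) /
                  (1 - sum(phi.prev*rho[1:(n-1)]))            # (3.69)
    phi = c(phi.prev - phi.nn*rev(phi.prev), phi.nn)           # (3.70), then append phi_nn
    pacf[n] = phi.nn; phi.prev = phi
  }
  pacf
}
rho = ARMAacf(ar = c(1.5, -.75), lag.max = 5)[-1]
durbin.levinson(rho, 5)
ARMAacf(ar = c(1.5, -.75), lag.max = 5, pacf = TRUE)   # agrees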

117 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 107 — #117 i i 107 3.4 Forecasting ) Example 3.21 The PACF of an AR( 2 We will use the results of Example 3.20 and Property 3.5 to calculate the first three ρ , φ values, , φ − , of the PACF. Recall from Example 3.10 that φ ( h )− φ h ρ ( 1 33 22 11 3 φ = ρ ( h − 2 ) = 0 for h ≥ 1 . When h = 1 , 2 , 1 , we have ρ ( 1 ) = φ )− /( 1 − φ ) ) , ρ ( 2 2 2 1 ρ ( 1 ) + φ Thus, , ρ ( 3 )− φ ρ ( 2 )− φ ρ ( 1 ) = 0 . φ 1 2 1 2 φ 1 ( 1 ) = φ ρ = 11 − φ 1 2 ) [ ) ( ] ( 2 φ φ 1 1 + φ φ − 2 1 2 − φ 1 − φ 1 ρ ( 1 ) ρ ( 2 )− 2 2 = φ = φ = 22 2 ( ) 2 2 − ρ ( 1 ) 1 φ 1 − 1 φ 1 − 2 φ φ = = ρ ( 1 )[ 1 − φ ] 2 1 21 1 ( ρ ρ ( ) )− φ φ ρ ( 2 )− 3 1 2 0 . = = φ 33 − φ 1 ρ ( 1 )− φ ) ρ ( 2 2 1 Notice that, as shown in (3.72), φ = φ for an AR(2) model. 22 2 So far, we have concentrated on one-step-ahead prediction, but Property 3.3 } , the x x for any m ≥ 1 . Given data, { x , . . ., allows us to calculate the BLP of n 1 + m n m -step-ahead predictor is ( ) m ) m ( m ) ( n , φ ··· + x + x (3.73) + φ x x φ = n − 1 1 n nn n m + 2 n 1 n ) m ( ( m ) ) m ( } satisfy the prediction equations, , . . ., φ , φ where φ { nn n 2 1 n n ’ ( m ) = , n , . . ., E ( x k x ) = E ( x x ) , 1 φ + 1 − j n n + m n n + 1 − k + 1 − k n j 1 j = or n ’ ( m ) φ (3.74) . n , . . ., γ ( k − j ) = γ ( m + k − 1 ) , k = 1 n j 1 = j The prediction equations can again be written in matrix notation as ) ) m ( m ( φ = γ Γ , (3.75) n n n m ) ( ( ) ) m ( m ( m ) ′ ′ ) ( ( m + n − 1 ) γ ( , and φ = where γ ( = m φ ) , . . ., γ are vectors. 1 ) , . . ., φ n × n n nn n 1 The is mean square m-step-ahead prediction error ) ( ′ 2 m ( ) ( m ) 1 n − n (3.76) γ γ . = γ ( 0 )− x x Γ = E − P m + n n n m n n + + n m Another useful algorithm for calculating forecasts was given by Brockwell and Davis (1991, Chapter 5). This algorithm follows directly from applying the projection t − 1 n theorem (Theorem B.1) to the innovations , x = − x , . . ., 1 , using the fact that t , for t t t − 1 1 − s are uncorrelated for − x x (see Problem 3.46). t , and s the innovations − x x s t s t x We present the case in which is a mean-zero stationary time series. t i i i i

Property 3.6 The Innovations Algorithm
The one-step-ahead predictors, x^t_{t+1}, and their mean-squared errors, P^t_{t+1}, can be calculated iteratively as

x^0_1 = 0,   P^0_1 = γ(0),

x^t_{t+1} = \sum_{j=1}^{t} θ_{tj} ( x_{t+1−j} − x^{t−j}_{t+1−j} ),   t = 1, 2, ...,   (3.77)

P^t_{t+1} = γ(0) − \sum_{j=0}^{t−1} θ_{t,t−j}^2 P^j_{j+1},   t = 1, 2, ...,   (3.78)

where, for j = 0, 1, ..., t − 1,

θ_{t,t−j} = ( γ(t − j) − \sum_{k=0}^{j−1} θ_{j,j−k} θ_{t,t−k} P^k_{k+1} ) / P^j_{j+1}.   (3.79)

Given data x_1, ..., x_n, the innovations algorithm can be calculated successively for t = 1, then t = 2 and so on, in which case the calculation of x^n_{n+1} and P^n_{n+1} is made at the final step t = n. The m-step-ahead predictor and its mean-square error based on the innovations algorithm (Problem 3.46) are given by

x^n_{n+m} = \sum_{j=m}^{n+m−1} θ_{n+m−1, j} ( x_{n+m−j} − x^{n+m−j−1}_{n+m−j} ),   (3.80)

P^n_{n+m} = γ(0) − \sum_{j=m}^{n+m−1} θ_{n+m−1, j}^2 P^{n+m−j−1}_{n+m−j},   (3.81)

where the θ_{n+m−1, j} are obtained by continued iteration of (3.79).

Example 3.22 Prediction for an MA(1)
The innovations algorithm lends itself well to prediction for moving average processes. Consider an MA(1) model, x_t = w_t + θw_{t−1}. Recall that γ(0) = (1 + θ^2)σ_w^2, γ(1) = θσ_w^2, and γ(h) = 0 for h > 1. Then, using Property 3.6, we have

θ_{n1} = θσ_w^2 / P^{n−1}_n,
θ_{nj} = 0,   j = 2, ..., n,
P^0_1 = (1 + θ^2)σ_w^2,
P^n_{n+1} = (1 + θ^2 − θθ_{n1})σ_w^2.

Finally, from (3.77), the one-step-ahead predictor is

x^n_{n+1} = θ ( x_n − x^{n−1}_n ) σ_w^2 / P^{n−1}_n.
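For the MA(1), the recursions collapse to a pair of scalar updates, which makes a direct check easy. The following is a sketch (not part of the text; θ, σ_w^2, the seed, and the sample size are arbitrary choices) that runs (3.77)–(3.78) on a simulated series.
theta = .9; sigw2 = 1; n = 10
set.seed(1)
x = arima.sim(list(ma = theta), n = n, sd = sqrt(sigw2))
P = (1 + theta^2)*sigw2                        # P^0_1
xhat = 0                                       # x^0_1
for (t in 1:n) {
  th.t1 = theta*sigw2/P[t]                     # theta_{t,1}; theta_{t,j} = 0 for j >= 2
  xhat[t + 1] = th.t1*(x[t] - xhat[t])         # (3.77)
  P[t + 1] = (1 + theta^2 - theta*th.t1)*sigw2 # (3.78), as in Example 3.22
}
cbind(predictor = xhat[-1], MSE = P[-1])       # x^t_{t+1} and P^t_{t+1}, t = 1, ..., n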

119 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 109 — #119 i i 109 3.4 Forecasting Forecasting ARMA Processes The general prediction equations (3.60) provide little insight into forecasting for ARMA models in general. There are a number of different ways to express these forecasts, and each aids in understanding the special structure of ARMA prediction. x = is a causal and invertible ARMA( p , Throughout, we assume ) process, φ ( B ) x q t t 2 ( B ) w , simply , where w μ ∼ iid N ( 0 , σ ) = ) . In the non-zero mean case, E ( x θ t x t t w x − with x in the model. First, we consider two types of forecasts. We replace μ x t t n write x based on the to mean the minimum mean square error predictor of x m + n m + n { x x , . . ., data , that is, } 1 n n x = E ( x . , . . ., ) x x n 1 + n m m + n For ARMA models, it is easier to calculate the predictor of x , assuming we have n + m the complete history of the process { x We will denote the , x . } , . . . , . . ., x x , x , 0 1 1 − n n − 1 x predictor of as based on the infinite past m + n ) = E ( x ̃ x x , x , . . ., x , x , x , . . . . − 1 m + 1 m n 0 n n − 1 + n n and x are not the same, but the idea here is that, for large samples, x ̃ In general, n + m m + n n x . will provide a good approximation to ̃ x n m + n m + Now, write x in its causal and invertible forms: + m n ∞ ’ x = w , ψ (3.82) ψ 1 = n m + j n + m − j 0 j 0 = ∞ ’ w = π , π (3.83) 1 x = . m + n j + n 0 m − j 0 = j Then, taking conditional expectations in (3.82), we have ∞ ∞ ’ ’ w (3.84) , ψ ̃ = w ψ ̃ x = j + n m − j − j j n + m n + m m = j j 0 = because, by causality and invertibility, { > n t 0 = E ( w = w ̃ x ) , x , . . . x , , . . ., x 0 − 1 n n t t 1 − ≤ w n t . t Similarly, taking conditional expectations in (3.83), we have ∞ ’ π ̃ , x + ̃ = 0 x j − + n j m m + n 1 = j or ∞ m − 1 ’ ’ x − x = − π (3.85) , ̃ ̃ x π n j m n + m − j + n − m + j j j = m = j 1 i i i i

120 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 110 — #120 i i 3 ARIMA Models 110 x x x using the fact , x E . Prediction is accom- n , . . ., x ≤ ( t , for x , . . . ) = , 1 t − t 0 1 − n n = 1 , plished recursively using (3.85), starting with the one-step-ahead predictor, m m = 3 , . . . . Using (3.84), we can write and then continuing for , 2 1 m − ’ x x w ψ , − ̃ = m n m + j + n + m − j n j = 0 so the can be written as mean-square prediction error − 1 m ’ 2 n 2 2 = ψ − ̃ x . (3.86) = ) x E σ ( P m m + + n n w m n + j = j 0 n Also, we note, for a fixed sample size, , the prediction errors are correlated. That is, k ≥ 1 for , 1 − m ’ 2 {( x )( (3.87) . − ̃ x ψ ψ σ E x = )} x ̃ − m k + m + k + n m + k n n + n + j j + m w 0 = j Example 3.23 Long-Range Forecasts − Consider forecasting an ARMA process with mean μ x . Replacing x with + n x n + m m μ in (3.82), and taking conditional expectation as in (3.84), we deduce that the x m -step-ahead forecast can be written as ∞ ’ + = μ ̃ . x (3.88) ψ w x m n + m + n j j − m = j Noting that the ψ -weights dampen to zero exponentially fast, it is clear that x (3.89) μ ̃ → n + m x . Moreover, by (3.86), the exponentially fast (in the mean square sense) as m →∞ mean square prediction error ∞ ’ 2 2 2 n (3.90) , ( 0 ) σ γ = ψ = σ → P x x w + n m j j = 0 exponentially fast as m →∞ . It should be clear from (3.89) and (3.90) that ARMA forecasts quickly settle to the mean with a constant prediction error as the forecast horizon, m , grows. This effect can be seen in Figure 3.7 where the Recruitment series is forecast for 24 months; see Example 3.25. n When is small, the general prediction equations (3.60) can be used easily. is large, we would use (3.85) by truncating, because we do not observe n When i i i i

x_0, x_{−1}, ..., and only the data x_1, ..., x_n are available. In this case, we can truncate (3.85) by setting \sum_{j=n+m}^{∞} π_j x_{n+m−j} = 0. The truncated predictor is then written as

x̃^n_{n+m} = − \sum_{j=1}^{m−1} π_j x̃^n_{n+m−j} − \sum_{j=m}^{n+m−1} π_j x_{n+m−j},   (3.91)

for m = 1, 2, ..., which is calculated recursively. The mean square prediction error, in this case, is approximated using (3.86).

For AR(p) models, and when n > p, equation (3.67) yields the exact predictor, x^n_{n+m}, of x_{n+m}, and there is no need for approximations. That is, for n > p, x̃^n_{n+m} = x^n_{n+m}. Also, in this case, the one-step-ahead prediction error is E(x_{n+1} − x^n_{n+1})^2 = σ_w^2. For pure MA(q) or ARMA(p, q) models, truncated prediction has a fairly nice form.

Property 3.7 Truncated Prediction for ARMA
For ARMA(p, q) models, the truncated predictors for m = 1, 2, ..., are

x̃^n_{n+m} = φ_1 x̃^n_{n+m−1} + ··· + φ_p x̃^n_{n+m−p} + θ_1 w̃^n_{n+m−1} + ··· + θ_q w̃^n_{n+m−q},   (3.92)

where x̃^n_t = x_t for 1 ≤ t ≤ n and x̃^n_t = 0 for t ≤ 0. The truncated prediction errors are given by: w̃^n_t = 0 for t ≤ 0 or t > n, and

w̃^n_t = φ(B) x̃^n_t − θ_1 w̃^n_{t−1} − ··· − θ_q w̃^n_{t−q}

for 1 ≤ t ≤ n.

Example 3.24 Forecasting an ARMA(1, 1) Series
Given data x_1, ..., x_n, for forecasting purposes, write the model as

x_{n+1} = φx_n + w_{n+1} + θw_n.

Then, based on (3.92), the one-step-ahead truncated forecast is

x̃^n_{n+1} = φx_n + 0 + θw̃^n_n.

For m ≥ 2, we have

x̃^n_{n+m} = φ x̃^n_{n+m−1},

which can be calculated recursively, m = 2, 3, ....

To calculate w̃^n_n, which is needed to initialize the successive forecasts, the model can be written as w_t = x_t − φx_{t−1} − θw_{t−1} for t = 1, ..., n. For truncated forecasting using (3.92), put w̃^n_0 = 0, x_0 = 0, and then iterate the errors forward in time

w̃^n_t = x_t − φx_{t−1} − θw̃^n_{t−1},   t = 1, ..., n.

The approximate forecast variance is computed from (3.86) using the ψ-weights determined as in Example 3.12. In particular, the ψ-weights satisfy ψ_j = (φ + θ)φ^{j−1}, for j ≥ 1. This result gives

P^n_{n+m} = σ_w^2 [ 1 + (φ + θ)^2 \sum_{j=1}^{m−1} φ^{2(j−1)} ] = σ_w^2 [ 1 + (φ + θ)^2 (1 − φ^{2(m−1)}) / (1 − φ^2) ].
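A rough sketch of these recursions in R (not part of the text; the parameters, seed, and sample size are arbitrary choices, and the final comparison uses estimated rather than true parameters, so the two sets of forecasts will only be similar, not identical):
phi = .9; theta = .5
set.seed(666)
x = arima.sim(list(ar = phi, ma = theta), n = 200)
n = length(x)
w = 0                                            # w-tilde_0 = 0, x_0 = 0
for (t in 1:n) w[t + 1] = x[t] - phi*(if (t > 1) x[t-1] else 0) - theta*w[t]
fore = phi*x[n] + theta*w[n + 1]                 # one-step truncated forecast
for (m in 2:5) fore[m] = phi*fore[m - 1]         # (3.92) for m >= 2
fore
predict(arima(x, order = c(1, 0, 1), include.mean = FALSE), n.ahead = 5)$pred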

Fig. 3.7. Twenty-four month forecasts for the Recruitment series. The actual data shown are from about January 1980 to September 1987, and then the forecasts plus and minus one standard error are displayed.

To assess the precision of the forecasts, prediction intervals are typically calculated along with the forecasts. In general, (1 − α) prediction intervals are of the form

x^n_{n+m} ± c_{α/2} \sqrt{P^n_{n+m}},   (3.93)

where c_{α/2} is chosen to get the desired degree of confidence. For example, if the process is Gaussian, then choosing c_{α/2} = 2 will yield an approximate 95% prediction interval for x_{n+m}. If we are interested in establishing prediction intervals over more than one time period, then c_{α/2} should be adjusted appropriately, for example, by using Bonferroni's inequality [see (4.63) in Chapter 4 or Johnson and Wichern, 1992, Chapter 5].

Example 3.25 Forecasting the Recruitment Series
Using the parameter estimates as the actual parameter values, Figure 3.7 shows the result of forecasting the Recruitment series given in Example 3.18 over a 24-month horizon, m = 1, 2, ..., 24. The actual forecasts are calculated as

x^n_{n+m} = 6.74 + 1.35 x^n_{n+m−1} − .46 x^n_{n+m−2}

for n = 453 and m = 1, 2, ..., 12. Recall that x^n_t = x_t when t ≤ n. The forecast errors P^n_{n+m} are calculated using (3.86). Recall that σ̂_w^2 = 89.72, and using (3.40) from Example 3.12, we have ψ_0 = 1, ψ_1 = 1.35, and ψ_j = 1.35ψ_{j−1} − .46ψ_{j−2} for j ≥ 2. Thus, for n = 453,

P^n_{n+1} = 89.72,
P^n_{n+2} = 89.72 (1 + 1.35^2),
P^n_{n+3} = 89.72 (1 + 1.35^2 + [1.35^2 − .46]^2),

and so on.
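These error recursions can also be generated compactly from the ψ-weights, since (3.86) only needs their squares. A sketch (not part of the text), using ARMAtoMA with the estimates above:
sw2 = 89.72                                     # estimated sigma_w^2
psi = ARMAtoMA(ar = c(1.35, -.46), ma = 0, 23)  # psi_1, ..., psi_23
P = sw2*cumsum(c(1, psi^2))                     # P^n_{n+m} for m = 1, ..., 24, by (3.86)
sqrt(P)[1:3]                                    # one-, two-, and three-step prediction errors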

123 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 113 — #123 i i 3.4 Forecasting 113 Note how the forecast levels off quickly and the prediction intervals are wide, even though in this case the forecast limits are only based on one standard error; √ n n ± that is, . x P n + m n + m To reproduce the analysis and Figure 3.7, use the following commands: regr = ar.ols(rec, order=2, demean=FALSE, intercept=TRUE) fore = predict(regr, n.ahead=24) ts.plot(rec, fore$pred, col=1:2, xlim=c(1980,1990), ylab="Recruitment") U = fore$pred+fore$se; L = fore$pred-fore$se xx = c(time(U), rev(time(U))); yy = c(L, rev(U)) polygon(xx, yy, border = 8, col = gray(.6, alpha = .2)) lines(fore$pred, type="p", col=2) backcasting . In backcasting, We complete this section with a brief discussion of , . . . x , for m = 1 , 2 we want to predict , based on the data { x , . . ., x } . Write the − m n 1 1 backcast as n ’ n x α x = (3.94) . j j m − 1 j = 1 ) are μ 0 Analogous to (3.74), the prediction equations (assuming = x n ’ n , , . . ., 1 = α k E ( x , x ) (3.95) = E ( x x ) − 1 k k m j j 1 = j or n ’ γ . n , . . ., α 1 γ ( k − j ) = m ( = + k − 1 ) , k (3.96) j 1 = j These equations are precisely the prediction equations for forward prediction. That is, ( m ) ( m ) are given by (3.75). Finally, the backcasts n , . . ., φ where the , , for j = 1 ≡ α φ j n j n j are given by ( m ) ( m ) n = = φ 2 (3.97) x m , x x + ··· + φ , . . . . , 1 1 n nn − 1 m n 1 Example 3.26 Backcasting an ARMA 1 , 1 ) ( w ) we will call this the Consider an ARMA ( 1 , 1 ; process, x + = φ x w θ + t 1 t t − 1 − t . We have just seen that best linear prediction backward in time is forward model the same as best linear prediction forward in time for stationary models. Assuming the models are Gaussian, we also have that minimum mean square error prediction 3.4 Thus, the backward in time is the same as forward in time for ARMA models. backward model , process can equivalently be generated by the x φ , x = + θ v + v 1 t + 1 t t t + 4 3 . x is the same as (b) the } In the stationary Gaussian case, (a) the distribution of { x , ..., x , n 1 n 1 + x ) x distribution of { x , ..., , x x ..., x | } . In forecasting we use (a) to obtain E ( ; in backcasting n n n 1 1 0 1 + x , ..., ) we use (b) to obtain E ( x x | . Because (a) and (b) are the same, the two problems are n 0 1 equivalent. i i i i

124 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 114 — #124 i i 3 ARIMA Models 114 Backcasting 8 6 l l l 4 l t l l l X l l l 2 0 −2 20 60 80 100 0 40 Time Fig. 3.8. ( 1 , 1 ) . Display for Example 3.26; backcasts from a simulated ARMA 2 { = where is a Gaussian white noise process with variance σ } v . We may write x t t w Õ ∞ , } , . . . ψ v v , v { , where ψ is uncorrelated with = 1 ; this means that x t t + t − 1 t j j − 2 0 0 = j in analogy to the forward model. n x { to zero and then , . . . ., x ) } , truncate v Given data x = E ( v , . . . ., | x 1 n n 1 n n n ̃ v iterate backward. That is, put , as an initial approximation, and then generate = 0 n the errors backward n n v v ̃ ̃ θ − = x x − φ 1 , , t = ( n − . ) 1 ( n − 2 ) , . . ., + t t 1 t 1 + t Then, n n n n ̃ x + ̃ v θ x = φ ̃ v + = φ x + θ ̃ v , 1 1 0 1 0 1 n ≤ t because . Continuing, the general truncated backcasts are given by ̃ v 0 for = 0 t n n , ̃ x 2 = m , = , . . . . ̃ x 3 φ 2 − m m − 1 To backcast data in R, simply reverse the data, fit the model and predict. In the following, we backcasted a simulated ARMA(1,1) process; see Figure 3.8. set.seed(90210) x = arima.sim(list(order = c(1,0,1), ar =.9, ma=.5), n = 100) xr = rev(x) # xr is the reversed data pxr = predict(arima(xr, order=c(1,0,1)), 10) # predict the reversed data pxrp = rev(pxr$pred) # reorder the predictors (for plotting) pxrse = rev(pxr$se) # reorder the SEs nx = ts(c(pxrp, x), start=-9) # attach the backcasts to the data plot(nx, ylab=expression(X[~t]), main= ' Backcasting ' ) U = nx[1:10] + pxrse; L = nx[1:10] - pxrse xx = c(-9:0, 0:-9); yy = c(L, rev(U)) polygon(xx, yy, border = 8, col = gray(0.6, alpha = 0.2)) ) ' ' o lines(-9:0, nx[1:10], col=2, type= i i i i

3.5 Estimation

Throughout this section, we assume we have n observations, x_1, ..., x_n, from a causal and invertible Gaussian ARMA(p, q) process in which, initially, the order parameters, p and q, are known. Our goal is to estimate the parameters, φ_1, ..., φ_p, θ_1, ..., θ_q, and σ_w^2. We will discuss the problem of determining p and q later in this section.

We begin with method of moments estimators. The idea behind these estimators is that of equating population moments to sample moments and then solving for the parameters in terms of the sample moments. We immediately see that, if E(x_t) = μ, then the method of moments estimator of μ is the sample average, x̄. Thus, while discussing method of moments, we will assume μ = 0. Although the method of moments can produce good estimators, they can sometimes lead to suboptimal estimators. We first consider the case in which the method leads to optimal (efficient) estimators, that is, AR(p) models,

x_t = φ_1 x_{t−1} + ··· + φ_p x_{t−p} + w_t,

where the first p + 1 equations of (3.47) and (3.48) lead to the following:

Definition 3.10 The Yule–Walker equations are given by

γ(h) = φ_1 γ(h−1) + ··· + φ_p γ(h−p),   h = 1, 2, ..., p,   (3.98)
σ_w^2 = γ(0) − φ_1 γ(1) − ··· − φ_p γ(p).   (3.99)

In matrix notation, the Yule–Walker equations are

Γ_p φ = γ_p,   σ_w^2 = γ(0) − φ' γ_p,   (3.100)

where Γ_p = {γ(k − j)}_{j,k=1}^{p} is a p × p matrix, φ = (φ_1, ..., φ_p)' is a p × 1 vector, and γ_p = (γ(1), ..., γ(p))' is a p × 1 vector. Using the method of moments, we replace γ(h) in (3.100) by γ̂(h) [see equation (1.36)] and solve

φ̂ = Γ̂_p^{−1} γ̂_p,   σ̂_w^2 = γ̂(0) − γ̂_p' Γ̂_p^{−1} γ̂_p.   (3.101)

These estimators are typically called the Yule–Walker estimators. For calculation purposes, it is sometimes more convenient to work with the sample ACF. By factoring γ̂(0) in (3.101), we can write the Yule–Walker estimates as

φ̂ = R̂_p^{−1} ρ̂_p,   σ̂_w^2 = γ̂(0) [ 1 − ρ̂_p' R̂_p^{−1} ρ̂_p ],   (3.102)

where R̂_p = {ρ̂(k − j)}_{j,k=1}^{p} is a p × p matrix and ρ̂_p = (ρ̂(1), ..., ρ̂(p))' is a p × 1 vector.

For AR(p) models, if the sample size is large, the Yule–Walker estimators are approximately normally distributed, and σ̂_w^2 is close to the true value of σ_w^2. We state these results in Property 3.8; for details, see Section B.3.

Property 3.8 Large Sample Results for Yule–Walker Estimators
The asymptotic (n → ∞) behavior of the Yule–Walker estimators in the case of causal AR(p) processes is as follows:

\sqrt{n} ( φ̂ − φ ) →^d N( 0, σ_w^2 Γ_p^{−1} ),   σ̂_w^2 →^p σ_w^2.   (3.103)

The Durbin–Levinson algorithm, (3.68)–(3.70), can be used to calculate φ̂ without inverting Γ̂_p or R̂_p, by replacing γ(h) by γ̂(h) in the algorithm. In running the algorithm, we will iteratively calculate the h × 1 vector, φ̂_h = (φ̂_{h1}, ..., φ̂_{hh})', for h = 1, 2, .... Thus, in addition to obtaining the desired forecasts, the Durbin–Levinson algorithm yields φ̂_{hh}, the sample PACF. Using (3.103), we can show the following property.

Property 3.9 Large Sample Distribution of the PACF
For a causal AR(p) process, asymptotically (n → ∞),

\sqrt{n} φ̂_{hh} →^d N(0, 1),   for h > p.   (3.104)

Example 3.27 Yule–Walker Estimation for an AR(2) Process
The data shown in Figure 3.4 were n = 144 simulated observations from the AR(2) model

x_t = 1.5 x_{t−1} − .75 x_{t−2} + w_t,

where w_t ∼ iid N(0, 1). For these data, γ̂(0) = 8.903, ρ̂(1) = .849, and ρ̂(2) = .519. Thus,

φ̂ = (φ̂_1, φ̂_2)' = [ 1, .849; .849, 1 ]^{−1} (.849, .519)' = (1.463, −.723)'

and

σ̂_w^2 = 8.903 [ 1 − (.849, .519)(1.463, −.723)' ] = 1.187.

By Property 3.8, the asymptotic variance–covariance matrix of φ̂,

(1/144)(1.187/8.903) [ 1, .849; .849, 1 ]^{−1} = [ .058^2, −.003; −.003, .058^2 ],

can be used to get confidence regions for, or make inferences about φ̂ and its components. For example, an approximate 95% confidence interval for φ_2 is −.723 ± 2(.058), or (−.838, −.608), which contains the true value of φ_2 = −.75.

For these data, the first three sample partial autocorrelations are φ̂_11 = ρ̂(1) = .849, φ̂_22 = φ̂_2 = −.721, and φ̂_33 = −.085. According to Property 3.9, the asymptotic standard error of φ̂_33 is 1/\sqrt{144} = .083, and the observed value, −.085, is only about one standard deviation from φ_33 = 0.
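The hand calculation above is easy to mimic in R (a sketch, not part of the text), assuming the series ar2 simulated in Example 3.11 is in the workspace; the results can be compared with ar.yw(ar2, order = 2).
r = acf(ar2, lag.max = 2, plot = FALSE)$acf[-1]                 # rho-hat(1), rho-hat(2)
R = matrix(c(1, r[1], r[1], 1), 2)                              # R-hat_2
(phi.hat = solve(R, r))                                         # Yule-Walker phi-hat, (3.102)
g0 = acf(ar2, lag.max = 0, type = "covariance", plot = FALSE)$acf[1]   # gamma-hat(0)
(sig2.hat = g0*(1 - sum(phi.hat*r)))                            # sigma_w^2 estimate, (3.102)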

127 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 117 — #127 i i 117 3.5 Estimation Example 3.28 Yule–Walker Estimation of the Recruitment Series In Example 3.18 we fit an AR(2) model to the recruitment series using ordinary least squares (OLS). For AR models, the estimators obtained via OLS and Yule- Walker are nearly identical; we will see this when we discuss conditional sum of squares estimation in (3.111)–(3.116). Below are the results of fitting the same model using Yule-Walker estimation in R, which are nearly identical to the values in Example 3.18. rec.yw = ar.yw(rec, order=2) rec.yw$x.mean # = 62.26 (mean estimate) # = 1.33, -.44 (coefficient estimates) rec.yw$ar sqrt(diag(rec.yw$asy.var.coef)) # = .04, .04 (standard errors) # = 94.80 (error variance estimate) rec.yw$var.pred To obtain the 24 month ahead predictions and their standard errors, and then plot the results (not shown) as in Example 3.25, use the R commands: rec.pr = predict(rec.yw, n.ahead=24) ts.plot(rec, rec.pr$pred, col=1:2) lines(rec.pr$pred + rec.pr$se, col=4, lty=2) lines(rec.pr$pred - rec.pr$se, col=4, lty=2) In the case of AR( p ) models, the Yule–Walker estimators given in (3.102) are optimal in the sense that the asymptotic distribution, (3.103), is the best asymptotic p normal distribution. This is because, given initial conditions, AR( ) models are linear models, and the Yule–Walker estimators are essentially least squares estimators. If we use method of moments for MA or ARMA models, we will not get optimal estimators because such processes are nonlinear in the parameters. Example 3.29 Method of Moments Estimation for an MA( 1 ) Consider the time series θ + , x w = w t t − 1 t . The model can then be written as < 1 | θ | where ∞ ’ j , w + (− θ ) x x = j − t t t j = 1 2 ( θ . The first two population autocovariances are γ 0 + ) = σ which is nonlinear in ( 1 w 2 2 θ is found by solving: so the estimate of θ, ) and γ ( 1 ) = σ θ w ˆ θ ) 1 ( ˆ γ . = = 1 ˆ ) ρ ( 2 ˆ 0 ( γ ˆ ) 1 + θ 1 , the solutions | ˆ ρ ( 1 )| ≤ Two solutions exist, so we would pick the invertible one. If 2 1 )| 1 < are real, otherwise, a real solution does not exist. Even though | ρ ( for an 2 1 ( ˆ ρ because it is an estimator. For 1 )| ≥ invertible MA(1), it may happen that | 2 = example, the following simulation in R produces a value of ˆ ρ ( 1 ) . 507 when the 2 . 497 . true value is ρ ( 1 ) = . 9 /( 1 + . 9 = ) i i i i

set.seed(2)
ma1 = arima.sim(list(order = c(0,0,1), ma = 0.9), n = 50)
acf(ma1, plot=FALSE)[1]   # = .507 (lag 1 sample ACF)
When $|\hat\rho(1)| < \tfrac12$, the invertible estimate is
$$ \hat\theta = \frac{1 - \sqrt{1 - 4\hat\rho(1)^2}}{2\hat\rho(1)}. \tag{3.105} $$
It can be shown that^{3.5}
$$ \hat\theta \sim \text{AN}\!\left(\theta,\ \frac{1 + \theta^2 + 4\theta^4 + \theta^6 + \theta^8}{n(1 - \theta^2)^2}\right); $$
AN is read asymptotically normal and is defined in Definition A.5. The maximum likelihood estimator (which we discuss next) of $\theta$, in this case, has an asymptotic variance of $(1 - \theta^2)/n$. When $\theta = .5$, for example, the ratio of the asymptotic variance of the method of moments estimator to the maximum likelihood estimator of $\theta$ is about 3.5. That is, for large samples, the variance of the method of moments estimator is about 3.5 times larger than the variance of the MLE of $\theta$ when $\theta = .5$.

Maximum Likelihood and Least Squares Estimation

To fix ideas, we first focus on the causal AR(1) case. Let
$$ x_t = \mu + \phi(x_{t-1} - \mu) + w_t, \tag{3.106} $$
where $|\phi| < 1$ and $w_t \sim \text{iid } N(0, \sigma_w^2)$. Given data $x_1, x_2, \ldots, x_n$, we seek the likelihood
$$ L(\mu, \phi, \sigma_w^2) = f(x_1, x_2, \ldots, x_n \mid \mu, \phi, \sigma_w^2). $$
In the case of an AR(1), we may write the likelihood as
$$ L(\mu, \phi, \sigma_w^2) = f(x_1)\, f(x_2 \mid x_1)\cdots f(x_n \mid x_{n-1}), $$
where we have dropped the parameters in the densities, $f(\cdot)$, to ease the notation. Because $x_t \mid x_{t-1} \sim N\!\left(\mu + \phi(x_{t-1} - \mu),\ \sigma_w^2\right)$, we have
$$ f(x_t \mid x_{t-1}) = f_w\!\left[(x_t - \mu) - \phi(x_{t-1} - \mu)\right], $$
where $f_w(\cdot)$ is the density of $w_t$, that is, the normal density with mean zero and variance $\sigma_w^2$. We may then write the likelihood as
$$ L(\mu, \phi, \sigma_w^2) = f(x_1)\prod_{t=2}^{n} f_w\!\left[(x_t - \mu) - \phi(x_{t-1} - \mu)\right]. $$

3.5 The result follows from Theorem A.7 and the delta method. See the proof of Theorem A.7 for details on the delta method.

To find $f(x_1)$, we can use the causal representation
$$ x_1 = \mu + \sum_{j=0}^{\infty}\phi^j w_{1-j} $$
to see that $x_1$ is normal, with mean $\mu$ and variance $\sigma_w^2/(1 - \phi^2)$. Finally, for an AR(1), the likelihood is
$$ L(\mu, \phi, \sigma_w^2) = (2\pi\sigma_w^2)^{-n/2}(1 - \phi^2)^{1/2}\exp\!\left[-\frac{S(\mu, \phi)}{2\sigma_w^2}\right], \tag{3.107} $$
where
$$ S(\mu, \phi) = (1 - \phi^2)(x_1 - \mu)^2 + \sum_{t=2}^{n}\left[(x_t - \mu) - \phi(x_{t-1} - \mu)\right]^2. \tag{3.108} $$
Typically, $S(\mu, \phi)$ is called the unconditional sum of squares. We could have also considered the estimation of $\mu$ and $\phi$ using unconditional least squares, that is, estimation by minimizing $S(\mu, \phi)$.
Taking the partial derivative of the log of (3.107) with respect to $\sigma_w^2$ and setting the result equal to zero, we get the typical normal result that for any given values of $\mu$ and $\phi$ in the parameter space, $\sigma_w^2 = n^{-1}S(\mu, \phi)$ maximizes the likelihood. Thus, the maximum likelihood estimate of $\sigma_w^2$ is
$$ \hat\sigma_w^2 = n^{-1}S(\hat\mu, \hat\phi), \tag{3.109} $$
where $\hat\mu$ and $\hat\phi$ are the MLEs of $\mu$ and $\phi$, respectively. If we replace $n$ in (3.109) by $n - 2$, we would obtain the unconditional least squares estimate of $\sigma_w^2$.
If, in (3.107), we take logs, replace $\sigma_w^2$ by $\hat\sigma_w^2$, and ignore constants, $\hat\mu$ and $\hat\phi$ are the values that minimize the criterion function
$$ l(\mu, \phi) = \log\!\left[n^{-1}S(\mu, \phi)\right] - n^{-1}\log(1 - \phi^2); \tag{3.110} $$
that is, $l(\mu, \phi) \propto -2\log L(\mu, \phi, \hat\sigma_w^2)$.^{3.6} Because (3.108) and (3.110) are complicated functions of the parameters, the minimization of $l(\mu, \phi)$ or $S(\mu, \phi)$ is accomplished numerically. In the case of AR models, we have the advantage that, conditional on initial values, they are linear models. That is, we can drop the term in the likelihood that causes the nonlinearity. Conditioning on $x_1$, the conditional likelihood becomes
$$ L(\mu, \phi, \sigma_w^2 \mid x_1) = \prod_{t=2}^{n} f_w\!\left[(x_t - \mu) - \phi(x_{t-1} - \mu)\right] = (2\pi\sigma_w^2)^{-(n-1)/2}\exp\!\left[-\frac{S_c(\mu, \phi)}{2\sigma_w^2}\right], \tag{3.111} $$
where the conditional sum of squares is

3.6 The criterion function is sometimes called the profile or concentrated likelihood.

$$ S_c(\mu, \phi) = \sum_{t=2}^{n}\left[(x_t - \mu) - \phi(x_{t-1} - \mu)\right]^2. \tag{3.112} $$
The conditional MLE of $\sigma_w^2$ is
$$ \hat\sigma_w^2 = S_c(\hat\mu, \hat\phi)/(n - 1), \tag{3.113} $$
and $\hat\mu$ and $\hat\phi$ are the values that minimize the conditional sum of squares, $S_c(\mu, \phi)$. Letting $\alpha = \mu(1 - \phi)$, the conditional sum of squares can be written as
$$ S_c(\mu, \phi) = \sum_{t=2}^{n}\left[x_t - (\alpha + \phi x_{t-1})\right]^2. \tag{3.114} $$
The problem is now the linear regression problem stated in Section 2.1. Following the results from least squares estimation, we have $\hat\alpha = \bar{x}_{(2)} - \hat\phi\,\bar{x}_{(1)}$, where $\bar{x}_{(1)} = (n-1)^{-1}\sum_{t=1}^{n-1}x_t$ and $\bar{x}_{(2)} = (n-1)^{-1}\sum_{t=2}^{n}x_t$, and the conditional estimates are then
$$ \hat\mu = \frac{\bar{x}_{(2)} - \hat\phi\,\bar{x}_{(1)}}{1 - \hat\phi}, \tag{3.115} $$
$$ \hat\phi = \frac{\sum_{t=2}^{n}(x_t - \bar{x}_{(2)})(x_{t-1} - \bar{x}_{(1)})}{\sum_{t=2}^{n}(x_{t-1} - \bar{x}_{(1)})^2}. \tag{3.116} $$
From (3.115) and (3.116), we see that $\hat\mu \approx \bar{x}$ and $\hat\phi \approx \hat\rho(1)$. That is, the Yule–Walker estimators and the conditional least squares estimators are approximately the same. The only difference is the inclusion or exclusion of terms involving the endpoints, $x_1$ and $x_n$. We can also adjust the estimate of $\sigma_w^2$ in (3.113) to be equivalent to the least squares estimator, that is, divide $S_c(\hat\mu, \hat\phi)$ by $(n - 3)$ instead of $(n - 1)$ in (3.113).
For general AR($p$) models, maximum likelihood estimation, unconditional least squares, and conditional least squares follow analogously to the AR(1) example. For general ARMA models, it is difficult to write the likelihood as an explicit function of the parameters. Instead, it is advantageous to write the likelihood in terms of the innovations, or one-step-ahead prediction errors, $x_t - x_t^{t-1}$. This will also be useful in Chapter 6 when we study state-space models.
For a normal ARMA($p, q$) model, let $\beta = (\mu, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)'$ be the $(p + q + 1)$-dimensional vector of the model parameters. The likelihood can be written as
$$ L(\beta, \sigma_w^2) = \prod_{t=1}^{n} f(x_t \mid x_{t-1}, \ldots, x_1). $$
The conditional distribution of $x_t$ given $x_{t-1}, \ldots, x_1$ is Gaussian with mean $x_t^{t-1}$ and variance $P_t^{t-1}$. Recall from (3.71) that $P_t^{t-1} = \gamma(0)\prod_{j=1}^{t-1}(1 - \phi_{jj}^2)$. For ARMA models, $\gamma(0) = \sigma_w^2\sum_{j=0}^{\infty}\psi_j^2$, in which case we may write
$$ P_t^{t-1} = \sigma_w^2\left\{\left[\sum_{j=0}^{\infty}\psi_j^2\right]\left[\prod_{j=1}^{t-1}(1 - \phi_{jj}^2)\right]\right\} \overset{\text{def}}{=} \sigma_w^2\, r_t, $$

where $r_t$ is the term in the braces. Note that the $r_t$ terms are functions only of the regression parameters and that they may be computed recursively as $r_{t+1} = (1 - \phi_{tt}^2)\,r_t$ with initial condition $r_1 = \sum_{j=0}^{\infty}\psi_j^2$. The likelihood of the data can now be written as
$$ L(\beta, \sigma_w^2) = (2\pi\sigma_w^2)^{-n/2}\left[r_1(\beta)\, r_2(\beta)\cdots r_n(\beta)\right]^{-1/2}\exp\!\left[-\frac{S(\beta)}{2\sigma_w^2}\right], \tag{3.117} $$
where
$$ S(\beta) = \sum_{t=1}^{n}\frac{\left[x_t - x_t^{t-1}(\beta)\right]^2}{r_t(\beta)}. \tag{3.118} $$
Both $x_t^{t-1}$ and $r_t$ are functions of $\beta$ alone, and we make that fact explicit in (3.117)–(3.118). Given values for $\beta$ and $\sigma_w^2$, the likelihood may be evaluated using the techniques of Section 3.4. Maximum likelihood estimation would now proceed by maximizing (3.117) with respect to $\beta$ and $\sigma_w^2$. As in the AR(1) example, we have
$$ \hat\sigma_w^2 = n^{-1}S(\hat\beta), \tag{3.119} $$
where $\hat\beta$ is the value of $\beta$ that minimizes the concentrated likelihood
$$ l(\beta) = \log\!\left[n^{-1}S(\beta)\right] + n^{-1}\sum_{t=1}^{n}\log r_t(\beta). \tag{3.120} $$
For the AR(1) model (3.106) discussed previously, recall that $x_1^0 = \mu$ and $x_t^{t-1} = \mu + \phi(x_{t-1} - \mu)$, for $t = 2, \ldots, n$. Also, using the fact that $\phi_{11} = \phi$ and $\phi_{hh} = 0$ for $h > 1$, we have $r_1 = \sum_{j=0}^{\infty}\phi^{2j} = (1 - \phi^2)^{-1}$, $r_2 = (1 - \phi^2)^{-1}(1 - \phi^2) = 1$, and in general, $r_t = 1$ for $t = 2, \ldots, n$. Hence, the likelihood presented in (3.107) is identical to the innovations form of the likelihood given by (3.117). Moreover, the generic $S(\beta)$ in (3.118) is $S(\mu, \phi)$ given in (3.108) and the generic $l(\beta)$ in (3.120) is $l(\mu, \phi)$ in (3.110).
Unconditional least squares would be performed by minimizing (3.118) with respect to $\beta$. Conditional least squares estimation would involve minimizing (3.118) with respect to $\beta$ but where, to ease the computational burden, the predictions and their errors are obtained by conditioning on initial values of the data. In general, numerical optimization routines are used to obtain the actual estimates and their standard errors.

Example 3.30 The Newton–Raphson and Scoring Algorithms
Two common numerical optimization routines for accomplishing maximum likelihood estimation are Newton–Raphson and scoring. We will give a brief account of the mathematical ideas here. The actual implementation of these algorithms is much more complicated than our discussion might imply. For details, the reader is referred to any of the Numerical Recipes books, for example, Press et al. (1993).
Let $l(\beta)$ be a criterion function of $k$ parameters $\beta = (\beta_1, \ldots, \beta_k)$ that we wish to minimize with respect to $\beta$. For example, consider the likelihood function given by (3.110) or by (3.120). Suppose $l(\hat\beta)$ is the extremum that we are interested in

finding, and $\hat\beta$ is found by solving $\partial l(\beta)/\partial\beta_j = 0$, for $j = 1, \ldots, k$. Let $l^{(1)}(\beta)$ denote the $k \times 1$ vector of partials
$$ l^{(1)}(\beta) = \left(\frac{\partial l(\beta)}{\partial\beta_1}, \ldots, \frac{\partial l(\beta)}{\partial\beta_k}\right)'. $$
Note, $l^{(1)}(\hat\beta) = 0$, the $k \times 1$ zero vector. Let $l^{(2)}(\beta)$ denote the $k \times k$ matrix of second-order partials
$$ l^{(2)}(\beta) = \left\{-\frac{\partial^2 l(\beta)}{\partial\beta_i\,\partial\beta_j}\right\}_{i,j=1}^{k}, $$
and assume $l^{(2)}(\beta)$ is nonsingular. Let $\beta_{(0)}$ be a "sufficiently good" initial estimator of $\beta$. Then, using a Taylor expansion, we have the following approximation:
$$ 0 = l^{(1)}(\hat\beta) \approx l^{(1)}(\beta_{(0)}) - l^{(2)}(\beta_{(0)})\left[\hat\beta - \beta_{(0)}\right]. $$
Setting the right-hand side equal to zero and solving for $\hat\beta$ [call the solution $\beta_{(1)}$], we get
$$ \beta_{(1)} = \beta_{(0)} + \left[l^{(2)}(\beta_{(0)})\right]^{-1} l^{(1)}(\beta_{(0)}). $$
The Newton–Raphson algorithm proceeds by iterating this result, replacing $\beta_{(0)}$ by $\beta_{(1)}$ to get $\beta_{(2)}$, and so on, until convergence. Under a set of appropriate conditions, the sequence of estimators, $\beta_{(1)}, \beta_{(2)}, \ldots$, will converge to $\hat\beta$, the MLE of $\beta$.
For maximum likelihood estimation, the criterion function used is $l(\beta)$ given by (3.120); $l^{(1)}(\beta)$ is called the score vector, and $l^{(2)}(\beta)$ is called the Hessian. In the method of scoring, we replace $l^{(2)}(\beta)$ by $E[l^{(2)}(\beta)]$, the information matrix. Under appropriate conditions, the inverse of the information matrix is the asymptotic variance–covariance matrix of the estimator $\hat\beta$. This is sometimes approximated by the inverse of the Hessian at $\hat\beta$. If the derivatives are difficult to obtain, it is possible to use quasi-maximum likelihood estimation where numerical techniques are used to approximate the derivatives.

Example 3.31 MLE for the Recruitment Series
So far, we have fit an AR(2) model to the Recruitment series using ordinary least squares (Example 3.18) and using Yule–Walker (Example 3.28). The following is an R session used to fit an AR(2) model via maximum likelihood estimation to the Recruitment series; these results can be compared to the results in Example 3.18 and Example 3.28.
rec.mle = ar.mle(rec, order=2)
rec.mle$x.mean                     # 62.26
rec.mle$ar                         # 1.35, -.46
sqrt(diag(rec.mle$asy.var.coef))   # .04, .04
rec.mle$var.pred                   # 89.34

We now discuss least squares for ARMA($p, q$) models via Gauss–Newton. For general and complete details of the Gauss–Newton procedure, the reader is referred

to Fuller (1996). As before, write $\beta = (\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)'$, and for the ease of discussion, we will put $\mu = 0$. We write the model in terms of the errors
$$ w_t(\beta) = x_t - \sum_{j=1}^{p}\phi_j x_{t-j} - \sum_{k=1}^{q}\theta_k w_{t-k}(\beta), \tag{3.121} $$
emphasizing the dependence of the errors on the parameters.
For conditional least squares, we approximate the residual sum of squares by conditioning on $x_1, \ldots, x_p$ (if $p > 0$) and $w_p = w_{p-1} = \cdots = w_{p+1-q} = 0$ (if $q > 0$), in which case, given $\beta$, we may evaluate (3.121) for $t = p + 1, p + 2, \ldots, n$. Using this conditioning argument, the conditional error sum of squares is
$$ S_c(\beta) = \sum_{t=p+1}^{n} w_t^2(\beta). \tag{3.122} $$
Minimizing $S_c(\beta)$ with respect to $\beta$ yields the conditional least squares estimates. If $q = 0$, the problem is linear regression and no iterative technique is needed to minimize $S_c(\phi_1, \ldots, \phi_p)$. If $q > 0$, the problem becomes nonlinear regression and we will have to rely on numerical optimization. When $n$ is large, conditioning on a few initial values will have little influence on the final parameter estimates. In the case of small to moderate sample sizes, one may wish to rely on unconditional least squares. The unconditional least squares problem is to choose $\beta$ to minimize the unconditional sum of squares, which we have generically denoted by $S(\beta)$ in this section. The unconditional sum of squares can be written in various ways, and one useful form in the case of ARMA($p, q$) models is derived in Box et al. (1994, Appendix A7.3). They showed (see Problem 3.19) the unconditional sum of squares can be written as
$$ S(\beta) = \sum_{t=-\infty}^{n}\tilde{w}_t^2(\beta), \tag{3.123} $$
where $\tilde{w}_t(\beta) = E(w_t \mid x_1, \ldots, x_n)$. When $t \le 0$, the $\tilde{w}_t(\beta)$ are obtained by backcasting. As a practical matter, we approximate $S(\beta)$ by starting the sum at $t = -M + 1$, where $M$ is chosen large enough to guarantee $\sum_{t=-\infty}^{-M}\tilde{w}_t^2(\beta) \approx 0$. In the case of unconditional least squares estimation, a numerical optimization technique is needed even when $q = 0$.
To employ Gauss–Newton, let $\beta_{(0)} = (\phi_1^{(0)}, \ldots, \phi_p^{(0)}, \theta_1^{(0)}, \ldots, \theta_q^{(0)})'$ be an initial estimate of $\beta$. For example, we could obtain $\beta_{(0)}$ by method of moments. The first-order Taylor expansion of $w_t(\beta)$ is
$$ w_t(\beta) \approx w_t(\beta_{(0)}) - \left(\beta - \beta_{(0)}\right)'\, z_t(\beta_{(0)}), \tag{3.124} $$
where
$$ z_t(\beta_{(0)}) = \left(-\frac{\partial w_t(\beta)}{\partial\beta_1}, \ldots, -\frac{\partial w_t(\beta)}{\partial\beta_{p+q}}\right)'\Bigg|_{\beta = \beta_{(0)}}, \quad t = 1, \ldots, n. $$

The linear approximation of $S_c(\beta)$ is
$$ Q(\beta) = \sum_{t=p+1}^{n}\left[w_t(\beta_{(0)}) - \left(\beta - \beta_{(0)}\right)'\, z_t(\beta_{(0)})\right]^2, \tag{3.125} $$
and this is the quantity that we will minimize. For approximate unconditional least squares, we would start the sum in (3.125) at $t = -M + 1$, for a large value of $M$, and work with the backcasted values.
Using the results of ordinary least squares (Section 2.1), we know
$$ \widehat{(\beta - \beta_{(0)})} = \left(n^{-1}\sum_{t=p+1}^{n} z_t(\beta_{(0)})\, z_t'(\beta_{(0)})\right)^{-1}\left(n^{-1}\sum_{t=p+1}^{n} z_t(\beta_{(0)})\, w_t(\beta_{(0)})\right) \tag{3.126} $$
minimizes $Q(\beta)$. From (3.126), we write the one-step Gauss–Newton estimate as
$$ \beta_{(1)} = \beta_{(0)} + \Delta(\beta_{(0)}), \tag{3.127} $$
where $\Delta(\beta_{(0)})$ denotes the right-hand side of (3.126). Gauss–Newton estimation is accomplished by replacing $\beta_{(0)}$ by $\beta_{(1)}$ in (3.127). This process is repeated by calculating, at iteration $j = 2, 3, \ldots$,
$$ \beta_{(j)} = \beta_{(j-1)} + \Delta(\beta_{(j-1)}) $$
until convergence.

Example 3.32 Gauss–Newton for an MA(1)
Consider an invertible MA(1) process, $x_t = w_t + \theta w_{t-1}$. Write the truncated errors as
$$ w_t(\theta) = x_t - \theta w_{t-1}(\theta), \quad t = 1, \ldots, n, \tag{3.128} $$
where we condition on $w_0(\theta) = 0$. Taking derivatives and negating,
$$ -\frac{\partial w_t(\theta)}{\partial\theta} = w_{t-1}(\theta) + \theta\,\frac{\partial w_{t-1}(\theta)}{\partial\theta}, \quad t = 1, \ldots, n, \tag{3.129} $$
where $\partial w_0(\theta)/\partial\theta = 0$. We can also write (3.129) as
$$ z_t(\theta) = w_{t-1}(\theta) - \theta z_{t-1}(\theta), \quad t = 1, \ldots, n, \tag{3.130} $$
where $z_t(\theta) = -\partial w_t(\theta)/\partial\theta$ and $z_0(\theta) = 0$.
Let $\theta_{(0)}$ be an initial estimate of $\theta$, for example, the estimate given in Example 3.29. Then, the Gauss–Newton procedure for conditional least squares is given by
$$ \theta_{(j+1)} = \theta_{(j)} + \frac{\sum_{t=1}^{n} z_t(\theta_{(j)})\, w_t(\theta_{(j)})}{\sum_{t=1}^{n} z_t^2(\theta_{(j)})}, \quad j = 0, 1, 2, \ldots, \tag{3.131} $$
where the values in (3.131) are calculated recursively using (3.128) and (3.130). The calculations are stopped when $|\theta_{(j+1)} - \theta_{(j)}|$, or $|Q(\theta_{(j+1)}) - Q(\theta_{(j)})|$, are smaller than some preset amount.
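Although not done in the text, a Gauss–Newton fit of this kind can be checked against R's built-in conditional-sum-of-squares fitting: arima() with method="CSS" minimizes essentially the criterion (3.122). A minimal sketch, with an arbitrary seed and θ = .8 chosen only for illustration:
# simulate an invertible MA(1) and fit it by conditional least squares
set.seed(1)
x = arima.sim(list(ma = .8), n = 200)
# method="CSS" minimizes the conditional sum of squares, as in (3.122)
arima(x, order = c(0, 0, 1), include.mean = FALSE, method = "CSS")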

Fig. 3.9. ACF and PACF of transformed glacial varves.

Example 3.33 Fitting the Glacial Varve Series
Consider the series of glacial varve thicknesses from Massachusetts for $n = 634$ years, as analyzed in Example 2.7 and in Problem 2.8, where it was argued that a first-order moving average model might fit the logarithmically transformed and differenced varve series, say,
$$ \nabla\log(x_t) = \log(x_t) - \log(x_{t-1}) = \log\!\left(\frac{x_t}{x_{t-1}}\right), $$
which can be interpreted as being approximately the percentage change in the thickness.
The sample ACF and PACF, shown in Figure 3.9, confirm the tendency of $\nabla\log(x_t)$ to behave as a first-order moving average process as the ACF has only a significant peak at lag one and the PACF decreases exponentially. Using Table 3.1, this sample behavior fits that of the MA(1) very well.
Since $\hat\rho(1) = -.397$, our initial estimate is $\theta_{(0)} = -.495$ using (3.105). The results of eleven iterations of the Gauss–Newton procedure, (3.131), starting with $\theta_{(0)}$ are given in Table 3.2. The final estimate is $\hat\theta = \theta_{(11)} = -.773$; interim values and the corresponding value of the conditional sum of squares, $S_c(\theta)$ given in (3.122), are also displayed in the table. The final estimate of the error variance is $\hat\sigma_w^2 = 148.98/632 = .236$ with 632 degrees of freedom (one is lost in differencing). The value of the sum of the squared derivatives at convergence is $\sum_{t=1}^{n} z_t^2(\theta_{(11)}) = 368.741$, and consequently, the estimated standard error of $\hat\theta$ is $\sqrt{.236/368.741} = .025$;^{3.7} this leads to a t-value of $-.773/.025 = -30.92$ with 632 degrees of freedom.
Figure 3.10 displays the conditional sum of squares, $S_c(\theta)$, as a function of $\theta$, as well as indicating the values of each step of the Gauss–Newton algorithm. Note

3.7 To estimate the standard error, we are using the standard regression results from (2.6) as an approximation.

Fig. 3.10. Conditional sum of squares versus values of the moving average parameter for the glacial varve example, Example 3.33. Vertical lines indicate the values of the parameter obtained via Gauss–Newton; see Table 3.2 for the actual values.

that the Gauss–Newton procedure takes large steps toward the minimum initially, and then takes very small steps as it gets close to the minimizing value. When there is only one parameter, as in this case, it would be easy to evaluate $S_c(\theta)$ on a grid of points, and then choose the appropriate value of $\theta$ from the grid search. It would be difficult, however, to perform grid searches when there are many parameters. The following code was used in this example.
x = diff(log(varve))
# Evaluate Sc on a Grid
c(0) -> w -> z
c() -> Sc -> Sz -> Szw
num = length(x)
th = seq(-.3, -.94, -.01)
for (p in 1:length(th)){
  for (i in 2:num){ w[i] = x[i] - th[p]*w[i-1] }
  Sc[p] = sum(w^2)
}
plot(th, Sc, type="l", ylab=expression(S[c](theta)), xlab=expression(theta), lwd=2)
# Gauss-Newton Estimation
r = acf(x, lag=1, plot=FALSE)$acf[-1]
rstart = (1-sqrt(1-4*(r^2)))/(2*r)   # from (3.105)
c(0) -> w -> z
c() -> Sc -> Sz -> Szw -> para
niter = 12
para[1] = rstart
for (p in 1:niter){
  for (i in 2:num){
    w[i] = x[i] - para[p]*w[i-1]
    z[i] = w[i-1] - para[p]*z[i-1]
  }
  Sc[p]  = sum(w^2)
  Sz[p]  = sum(z^2)
  Szw[p] = sum(z*w)
  para[p+1] = para[p] + Szw[p]/Sz[p]
}

Table 3.2. Gauss–Newton Results for Example 3.33

  j     θ_(j)    S_c(θ_(j))   Σ_t z_t²(θ_(j))
  0    -.495     158.739      171.240
  1    -.668     150.747      235.266
  2    -.733     149.264      300.562
  3    -.756     149.031      336.823
  4    -.766     148.990      354.173
  5    -.769     148.982      362.167
  6    -.771     148.980      365.801
  7    -.772     148.980      367.446
  8    -.772     148.980      368.188
  9    -.772     148.980      368.522
 10    -.773     148.980      368.673
 11    -.773     148.980      368.741

round(cbind(iteration=0:(niter-1), thetahat=para[1:niter], Sc, Sz), 3)
abline(v = para[1:12], lty=2)
points(para[1:12], Sc[1:12], pch=16)

In the general case of causal and invertible ARMA($p, q$) models, maximum likelihood estimation and conditional and unconditional least squares estimation (and Yule–Walker estimation in the case of AR models) all lead to optimal estimators. The proof of this general result can be found in a number of texts on theoretical time series analysis (for example, Brockwell and Davis, 1991, or Hannan, 1970, to mention a few). We will denote the ARMA coefficient parameters by $\beta = (\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)'$.

Property 3.10 Large Sample Distribution of the Estimators
Under appropriate conditions, for causal and invertible ARMA processes, the maximum likelihood, the unconditional least squares, and the conditional least squares estimators, each initialized by the method of moments estimator, all provide optimal estimators of $\sigma_w^2$ and $\beta$, in the sense that $\hat\sigma_w^2$ is consistent, and the asymptotic distribution of $\hat\beta$ is the best asymptotic normal distribution. In particular, as $n \to \infty$,
$$ \sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} N\!\left(0,\ \sigma_w^2\,\Gamma_{p,q}^{-1}\right). \tag{3.132} $$
The asymptotic variance–covariance matrix of the estimator $\hat\beta$ is the inverse of the information matrix. In particular, the $(p+q)\times(p+q)$ matrix $\Gamma_{p,q}$ has the form
$$ \Gamma_{p,q} = \begin{pmatrix} \Gamma_{\phi\phi} & \Gamma_{\phi\theta} \\ \Gamma_{\theta\phi} & \Gamma_{\theta\theta} \end{pmatrix}. \tag{3.133} $$
The $p \times p$ matrix $\Gamma_{\phi\phi}$ is given by (3.100), that is, the $ij$-th element of $\Gamma_{\phi\phi}$, for $i, j = 1, \ldots, p$, is $\gamma_x(i - j)$ from an AR($p$) process, $\phi(B)x_t = w_t$. Similarly, $\Gamma_{\theta\theta}$ is a $q \times q$ matrix with the $ij$-th element, for $i, j = 1, \ldots, q$, equal to $\gamma_y(i - j)$ from an AR($q$) process, $\theta(B)y_t = w_t$. The $p \times q$ matrix $\Gamma_{\phi\theta} = \{\gamma_{xy}(i - j)\}$, for $i = 1, \ldots, p$; $j = 1, \ldots, q$; that is, the $ij$-th element is the cross-covariance between the two AR processes given by $\phi(B)x_t = w_t$ and $\theta(B)y_t = w_t$. Finally, $\Gamma_{\theta\phi} = \Gamma_{\phi\theta}'$ is $q \times p$.

Further discussion of Property 3.10, including a proof for the case of least squares estimators for AR($p$) processes, can be found in Section B.3.

Example 3.34 Some Specific Asymptotic Distributions
The following are some specific cases of Property 3.10.
AR(1): $\gamma_x(0) = \sigma_w^2/(1 - \phi^2)$, so $\sigma_w^2\,\Gamma_{1,0}^{-1} = 1 - \phi^2$. Thus,
$$ \hat\phi \sim \text{AN}\!\left[\phi,\ n^{-1}(1 - \phi^2)\right]. \tag{3.134} $$
AR(2): The reader can verify that
$$ \gamma_x(0) = \left(\frac{1 - \phi_2}{1 + \phi_2}\right)\frac{\sigma_w^2}{(1 - \phi_2)^2 - \phi_1^2} $$
and $\gamma_x(1) = \phi_1\gamma_x(0) + \phi_2\gamma_x(1)$. From these facts, we can compute $\Gamma_{2,0}^{-1}$. In particular, we have
$$ \begin{pmatrix} \hat\phi_1 \\ \hat\phi_2 \end{pmatrix} \sim \text{AN}\!\left[\begin{pmatrix} \phi_1 \\ \phi_2 \end{pmatrix},\ n^{-1}\begin{pmatrix} 1 - \phi_2^2 & -\phi_1(1 + \phi_2) \\ \text{sym} & 1 - \phi_2^2 \end{pmatrix}\right]. \tag{3.135} $$
MA(1): In this case, write $\theta(B)y_t = w_t$, or $y_t + \theta y_{t-1} = w_t$. Then, analogous to the AR(1) case, $\gamma_y(0) = \sigma_w^2/(1 - \theta^2)$, so $\sigma_w^2\,\Gamma_{0,1}^{-1} = 1 - \theta^2$. Thus,
$$ \hat\theta \sim \text{AN}\!\left[\theta,\ n^{-1}(1 - \theta^2)\right]. \tag{3.136} $$
MA(2): Write $y_t + \theta_1 y_{t-1} + \theta_2 y_{t-2} = w_t$, so, analogous to the AR(2) case, we have
$$ \begin{pmatrix} \hat\theta_1 \\ \hat\theta_2 \end{pmatrix} \sim \text{AN}\!\left[\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix},\ n^{-1}\begin{pmatrix} 1 - \theta_2^2 & \theta_1(1 + \theta_2) \\ \text{sym} & 1 - \theta_2^2 \end{pmatrix}\right]. \tag{3.137} $$
ARMA(1,1): To calculate $\Gamma_{\phi\theta}$, we must find $\gamma_{xy}(0)$, where $x_t - \phi x_{t-1} = w_t$ and $y_t + \theta y_{t-1} = w_t$. We have
$$ \gamma_{xy}(0) = \text{cov}(x_t, y_t) = \text{cov}(\phi x_{t-1} + w_t,\ -\theta y_{t-1} + w_t) = -\phi\theta\gamma_{xy}(0) + \sigma_w^2. $$
Solving, we find $\gamma_{xy}(0) = \sigma_w^2/(1 + \phi\theta)$. Thus,
$$ \begin{pmatrix} \hat\phi \\ \hat\theta \end{pmatrix} \sim \text{AN}\!\left[\begin{pmatrix} \phi \\ \theta \end{pmatrix},\ n^{-1}\begin{pmatrix} (1 - \phi^2)^{-1} & (1 + \phi\theta)^{-1} \\ \text{sym} & (1 - \theta^2)^{-1} \end{pmatrix}^{-1}\right]. \tag{3.138} $$

Example 3.35 Overfitting Caveat
The asymptotic behavior of the parameter estimators gives us an additional insight into the problem of fitting ARMA models to data. For example, suppose a time series follows an AR(1) process and we decide to fit an AR(2) to the data. Do any problems occur in doing this? More generally, why not simply fit large-order AR models to make sure that we capture the dynamics of the process? After all,

if the process is truly an AR(1), the other autoregressive parameters will not be significant. The answer is that if we overfit, we obtain less efficient, or less precise, parameter estimates. For example, if we fit an AR(1) to an AR(1) process, for large $n$, $\text{var}(\hat\phi_1) \approx n^{-1}(1 - \phi_1^2)$. But, if we fit an AR(2) to the AR(1) process, for large $n$, $\text{var}(\hat\phi_1) \approx n^{-1}(1 - \phi_2^2) = n^{-1}$ because $\phi_2 = 0$. Thus, the variance of $\hat\phi_1$ has been inflated, making the estimator less precise.
We do want to mention, however, that overfitting can be used as a diagnostic tool. For example, if we fit an AR(2) model to the data and are satisfied with that model, then adding one more parameter and fitting an AR(3) should lead to approximately the same model as in the AR(2) fit. We will discuss model diagnostics in more detail in Section 3.7.
The reader might wonder, for example, why the asymptotic distributions of $\hat\phi$ from an AR(1) and $\hat\theta$ from an MA(1) are of the same form; compare (3.134) to (3.136). It is possible to explain this unexpected result heuristically using the intuition of linear regression. That is, for the normal regression model presented in Section 2.1 with no intercept term, $x_t = \beta z_t + w_t$, we know $\hat\beta$ is normally distributed with mean $\beta$, and from (2.6),
$$ \text{var}\!\left\{\sqrt{n}\,(\hat\beta - \beta)\right\} = n\sigma_w^2\left(\sum_{t=1}^{n} z_t^2\right)^{-1} = \sigma_w^2\left(n^{-1}\sum_{t=1}^{n} z_t^2\right)^{-1}. $$
For the causal AR(1) model given by $x_t = \phi x_{t-1} + w_t$, the intuition of regression tells us to expect that, for $n$ large,
$$ \sqrt{n}\,(\hat\phi - \phi) $$
is approximately normal with mean zero and with variance given by
$$ \sigma_w^2\left(n^{-1}\sum_{t=2}^{n} x_{t-1}^2\right)^{-1}. $$
Now, $n^{-1}\sum_{t=2}^{n} x_{t-1}^2$ is the sample variance (recall that the mean of $x_t$ is zero) of the $x_t$, so as $n$ becomes large we would expect it to approach $\text{var}(x_t) = \gamma(0) = \sigma_w^2/(1 - \phi^2)$. Thus, the large sample variance of $\sqrt{n}\,(\hat\phi - \phi)$ is
$$ \sigma_w^2\,\gamma_x(0)^{-1} = \sigma_w^2\left(\frac{\sigma_w^2}{1 - \phi^2}\right)^{-1} = 1 - \phi^2; $$
that is, (3.134) holds.
In the case of an MA(1), we may use the discussion of Example 3.32 to write an approximate regression model for the MA(1). That is, consider the approximation (3.130) as the regression model
$$ z_t(\hat\theta) = -\theta z_{t-1}(\hat\theta) + w_{t-1}, $$

where now, $z_{t-1}(\hat\theta)$ as defined in Example 3.32, plays the role of the regressor. Continuing with the analogy, we would expect the asymptotic distribution of $\sqrt{n}\,(\hat\theta - \theta)$ to be normal, with mean zero, and approximate variance
$$ \sigma_w^2\left(n^{-1}\sum_{t=2}^{n} z_{t-1}^2(\hat\theta)\right)^{-1}. $$
As in the AR(1) case, $n^{-1}\sum_{t=2}^{n} z_{t-1}^2(\hat\theta)$ is the sample variance of the $z_t(\hat\theta)$ so, for large $n$, this should be $\text{var}\{z_t(\theta)\} = \gamma_z(0)$, say. But note, as seen from (3.130), $z_t(\theta)$ is approximately an AR(1) process with parameter $-\theta$. Thus,
$$ \sigma_w^2\,\gamma_z(0)^{-1} = \sigma_w^2\left(\frac{\sigma_w^2}{1 - (-\theta)^2}\right)^{-1} = 1 - \theta^2, $$
which agrees with (3.136). Finally, the asymptotic distributions of the AR parameter estimates and the MA parameter estimates are of the same form because in the MA case, the "regressors" are the differential processes $z_t(\theta)$ that have AR structure, and it is this structure that determines the asymptotic variance of the estimators. For a rigorous account of this approach for the general case, see Fuller (1996, Theorem 5.5.4).
In Example 3.33, the estimated standard error of $\hat\theta$ was .025. In that example, we used regression results to estimate the standard error as the square root of
$$ n^{-1}\hat\sigma_w^2\left(n^{-1}\sum_{t=1}^{n} z_t^2(\hat\theta)\right)^{-1} = \frac{\hat\sigma_w^2}{\sum_{t=1}^{n} z_t^2(\hat\theta)}, $$
where $n = 632$, $\hat\sigma_w^2 = .236$, $\hat\theta = -.773$, and $\sum_{t=1}^{n} z_t^2(\hat\theta) = 368.74$. Using (3.136), we could have also calculated this value using the asymptotic approximation, the square root of $(1 - (-.773)^2)/632$, which is also .025.
If $n$ is small, or if the parameters are close to the boundaries, the asymptotic approximations can be quite poor. The bootstrap can be helpful in this case; for a broad treatment of the bootstrap, see Efron and Tibshirani (1994). We discuss the case of an AR(1) here and leave the general discussion for Chapter 6. For now, we give a simple example of the bootstrap for an AR(1) process.

Example 3.36 Bootstrapping an AR(1)
We consider an AR(1) model with a regression coefficient near the boundary of causality and an error process that is symmetric but not normal. Specifically, consider the causal model
$$ x_t = \mu + \phi(x_{t-1} - \mu) + w_t, \tag{3.139} $$
where $\mu = 50$, $\phi = .95$, and $w_t$ are iid double exponential (Laplace) with location zero, and scale parameter $\beta = 2$. The density of $w_t$ is given by

Fig. 3.11. One hundred observations generated from the model in Example 3.36.

$$ f(w) = \frac{1}{2\beta}\exp\{-|w|/\beta\}, \quad -\infty < w < \infty. $$
In this example, $E(w_t) = 0$ and $\text{var}(w_t) = 2\beta^2 = 8$. Figure 3.11 shows $n = 100$ simulated observations from this process. This particular realization is interesting; the data look like they were generated from a nonstationary process with three different mean levels. In fact, the data were generated from a well-behaved, albeit non-normal, stationary and causal model. To show the advantages of the bootstrap, we will act as if we do not know the actual error distribution. The data in Figure 3.11 were generated as follows.
set.seed(101010)
e = rexp(150, rate=.5); u = runif(150,-1,1); de = e*sign(u)
dex = 50 + arima.sim(n=100, list(ar=.95), innov=de, n.start=50)
plot.ts(dex, type='o', ylab=expression(X[~t]))
Using these data, we obtained the Yule–Walker estimates $\hat\mu = 45.25$, $\hat\phi = .96$, and $\hat\sigma_w^2 = 7.88$, as follows.
fit = ar.yw(dex, order=1)
round(cbind(fit$x.mean, fit$ar, fit$var.pred), 2)
[1,] 45.25 0.96 7.88
To assess the finite sample distribution of $\hat\phi$ when $n = 100$, we simulated 1000 realizations of this AR(1) process and estimated the parameters via Yule–Walker. The finite sampling density of the Yule–Walker estimate of $\phi$, based on the 1000 repeated simulations, is shown in Figure 3.12. Based on Property 3.10, we would say that $\hat\phi$ is approximately normal with mean $\phi$ (which we supposedly do not know) and variance $(1 - \phi^2)/100$, which we would approximate by $(1 - .96^2)/100 = .03^2$; this distribution is superimposed on Figure 3.12. Clearly the sampling distribution is not close to normality for this sample size. The R code to perform the simulation is as follows. We use the results at the end of the example.
set.seed(111)
phi.yw = rep(NA, 1000)
for (i in 1:1000){

  e = rexp(150, rate=.5); u = runif(150,-1,1); de = e*sign(u)
  x = 50 + arima.sim(n=100, list(ar=.95), innov=de, n.start=50)
  phi.yw[i] = ar.yw(x, order=1)$ar
}
The preceding simulation required full knowledge of the model, the parameter values and the noise distribution. Of course, in a sampling situation, we would not have the information necessary to do the preceding simulation and consequently would not be able to generate a figure like Figure 3.12. The bootstrap, however, gives us a way to attack the problem.
To simplify the discussion and the notation, we condition on $x_1$ throughout the example. In this case, the one-step-ahead predictors have a simple form,
$$ x_t^{t-1} = \mu + \phi(x_{t-1} - \mu), \quad t = 2, \ldots, 100. $$
Consequently, the innovations, $\epsilon_t = x_t - x_t^{t-1}$, are given by
$$ \epsilon_t = (x_t - \mu) - \phi(x_{t-1} - \mu), \quad t = 2, \ldots, 100, \tag{3.140} $$
each with MSPE $P_t^{t-1} = E(\epsilon_t^2) = E(w_t^2) = \sigma_w^2$ for $t = 2, \ldots, 100$. We can use (3.140) to write the model in terms of the innovations,
$$ x_t = x_t^{t-1} + \epsilon_t = \mu + \phi(x_{t-1} - \mu) + \epsilon_t, \quad t = 2, \ldots, 100. \tag{3.141} $$
To perform the bootstrap simulation, we replace the parameters with their estimates in (3.141), that is, $\hat\mu = 45.25$ and $\hat\phi = .96$, and denote the resulting sample innovations as $\{\hat\epsilon_2, \ldots, \hat\epsilon_{100}\}$. To obtain one bootstrap sample, first randomly sample, with replacement, $n = 99$ values from the set of sample innovations; call the sampled values $\{\epsilon_2^*, \ldots, \epsilon_{100}^*\}$. Now, generate a bootstrapped data set sequentially by setting
$$ x_t^* = 45.25 + .96(x_{t-1}^* - 45.25) + \epsilon_t^*, \quad t = 2, \ldots, 100, \tag{3.142} $$
with $x_1^*$ held fixed at $x_1$. Next, estimate the parameters as if the data were $x_t^*$. Call these estimates $\hat\mu(1)$, $\hat\phi(1)$, and $\hat\sigma_w^2(1)$. Repeat this process a large number, $B$, of times, generating a collection of bootstrapped parameter estimates, $\{\hat\mu(b), \hat\phi(b), \hat\sigma_w^2(b);\ b = 1, \ldots, B\}$. We can then approximate the finite sample distribution of an estimator from the bootstrapped parameter values. For example, we can approximate the distribution of $\hat\phi - \phi$ by the empirical distribution of $\hat\phi(b) - \hat\phi$, for $b = 1, \ldots, B$.
Figure 3.12 shows the bootstrap histogram of 500 bootstrapped estimates of $\phi$ using the data shown in Figure 3.11. Note that the bootstrap distribution of $\hat\phi$ is close to the distribution of $\hat\phi$ shown in Figure 3.12. The following code was used to perform the bootstrap.
set.seed(666)              # not that 666
fit = ar.yw(dex, order=1)  # assumes the data were retained
m   = fit$x.mean           # estimate of mean
phi = fit$ar               # estimate of phi
nboot  = 500               # number of bootstrap replicates
resids = fit$resid[-1]     # the 99 innovations

Fig. 3.12. Finite sample density of the Yule–Walker estimate of $\phi$ (solid line) in Example 3.36 and the corresponding asymptotic normal density (dashed line). Bootstrap histogram of $\hat\phi$ based on 500 bootstrapped samples.

x.star = dex                    # initialize x*
phi.star.yw = rep(NA, nboot)
# Bootstrap
for (i in 1:nboot) {
  resid.star = sample(resids, replace=TRUE)
  for (t in 1:99){ x.star[t+1] = m + phi*(x.star[t]-m) + resid.star[t] }
  phi.star.yw[i] = ar.yw(x.star, order=1)$ar
}
# Picture
culer = rgb(.5,.7,1,.5)
hist(phi.star.yw, 15, main="", prob=TRUE, xlim=c(.65,1.05), ylim=c(0,14),
     col=culer, xlab=expression(hat(phi)))
lines(density(phi.yw, bw=.02), lwd=2)               # from previous simulation
u = seq(.75, 1.1, by=.001)
lines(u, dnorm(u, mean=.96, sd=.03), lty=2, lwd=2)  # normal approximation
legend(.65, 14, legend=c('true distribution', 'bootstrap distribution',
  'normal approximation'), bty='n', lty=c(1,0,2), lwd=c(2,0,2),
  col=1, pch=c(NA,22,NA), pt.bg=c(NA,culer,NA), pt.cex=2.5)

3.6 Integrated Models for Nonstationary Data

In Chapter 1 and Chapter 2, we saw that if $x_t$ is a random walk, $x_t = x_{t-1} + w_t$, then by differencing $x_t$, we find that $\nabla x_t = w_t$ is stationary. In many situations, time series can be thought of as being composed of two components, a nonstationary trend component and a zero-mean stationary component. For example, in Section 2.1 we considered the model
$$ x_t = \mu_t + y_t, \tag{3.143} $$
where $\mu_t = \beta_0 + \beta_1 t$ and $y_t$ is stationary. Differencing such a process will lead to a stationary process:

$$ \nabla x_t = x_t - x_{t-1} = \beta_1 + y_t - y_{t-1} = \beta_1 + \nabla y_t. $$
Another model that leads to first differencing is the case in which $\mu_t$ in (3.143) is stochastic and slowly varying according to a random walk. That is,
$$ \mu_t = \mu_{t-1} + v_t, $$
where $v_t$ is stationary. In this case,
$$ \nabla x_t = v_t + \nabla y_t $$
is stationary. If $\mu_t$ in (3.143) is a $k$-th order polynomial, $\mu_t = \sum_{j=0}^{k}\beta_j t^j$, then (Problem 3.27) the differenced series $\nabla^k x_t$ is stationary. Stochastic trend models can also lead to higher order differencing. For example, suppose
$$ \mu_t = \mu_{t-1} + v_t \quad \text{and} \quad v_t = v_{t-1} + e_t, $$
where $e_t$ is stationary. Then, $\nabla x_t = v_t + \nabla y_t$ is not stationary, but
$$ \nabla^2 x_t = e_t + \nabla^2 y_t $$
is stationary.
The integrated ARMA, or ARIMA, model is a broadening of the class of ARMA models to include differencing.

Definition 3.11 A process $x_t$ is said to be ARIMA($p, d, q$) if
$$ \nabla^d x_t = (1 - B)^d x_t $$
is ARMA($p, q$). In general, we will write the model as
$$ \phi(B)(1 - B)^d x_t = \theta(B)w_t. \tag{3.144} $$
If $E(\nabla^d x_t) = \mu$, we write the model as
$$ \phi(B)(1 - B)^d x_t = \delta + \theta(B)w_t, $$
where $\delta = \mu(1 - \phi_1 - \cdots - \phi_p)$.

Because of the nonstationarity, care must be taken when deriving forecasts. For the sake of completeness, we discuss this issue briefly here, but we stress the fact that both the theoretical and computational aspects of the problem are best handled via state-space models. We discuss the theoretical details in Chapter 6. For information on the state-space based computational aspects in R, see the ARIMA help files (?arima and ?predict.Arima); our scripts sarima and sarima.for are basically wrappers for these R scripts.
It should be clear that, since $y_t = \nabla^d x_t$ is ARMA, we can use Section 3.4 methods to obtain forecasts of $y_t$, which in turn lead to forecasts for $x_t$. For example, if $d = 1$, given forecasts $y_{n+m}^n$ for $m = 1, 2, \ldots$, we have $y_{n+m}^n = x_{n+m}^n - x_{n+m-1}^n$, so that

$$ x_{n+m}^n = y_{n+m}^n + x_{n+m-1}^n $$
with initial condition $x_{n+1}^n = y_{n+1}^n + x_n$ (noting $x_n^n = x_n$).
It is a little more difficult to obtain the prediction errors $P_{n+m}^n$, but for large $n$, the approximation used in Section 3.4, equation (3.86), works well. That is, the mean-squared prediction error can be approximated by
$$ P_{n+m}^n = \sigma_w^2\sum_{j=0}^{m-1}\psi_j^{*2}, \tag{3.145} $$
where $\psi_j^*$ is the coefficient of $z^j$ in $\psi^*(z) = \theta(z)/\left[\phi(z)(1 - z)^d\right]$.
To better understand integrated models, we examine the properties of some simple cases; Problem 3.29 covers the ARIMA(1, 1, 0) case.

Example 3.37 Random Walk with Drift
To fix ideas, we begin by considering the random walk with drift model first presented in Example 1.11, that is,
$$ x_t = \delta + x_{t-1} + w_t, $$
for $t = 1, 2, \ldots$, and $x_0 = 0$. Technically, the model is not ARIMA, but we could include it trivially as an ARIMA(0, 1, 0) model. Given data $x_1, \ldots, x_n$, the one-step-ahead forecast is given by
$$ x_{n+1}^n = E(x_{n+1} \mid x_n, \ldots, x_1) = E(\delta + x_n + w_{n+1} \mid x_n, \ldots, x_1) = \delta + x_n. $$
The two-step-ahead forecast is given by $x_{n+2}^n = \delta + x_{n+1}^n = 2\delta + x_n$, and consequently, the $m$-step-ahead forecast, for $m = 1, 2, \ldots$, is
$$ x_{n+m}^n = m\delta + x_n. \tag{3.146} $$
To obtain the forecast errors, it is convenient to recall equation (1.4); i.e., $x_n = n\delta + \sum_{j=1}^{n}w_j$, in which case we may write
$$ x_{n+m} = (n + m)\delta + \sum_{j=1}^{n+m}w_j = m\delta + x_n + \sum_{j=n+1}^{n+m}w_j. $$
From this it follows that the $m$-step-ahead prediction error is given by
$$ P_{n+m}^n = E\left(x_{n+m} - x_{n+m}^n\right)^2 = E\left(\sum_{j=n+1}^{n+m}w_j\right)^2 = m\sigma_w^2. \tag{3.147} $$
Hence, unlike the stationary case (see Example 3.23), as the forecast horizon grows, the prediction errors, (3.147), increase without bound and the forecasts follow a straight line with slope $\delta$ emanating from $x_n$. We note that (3.145) is exact in this case because $\psi^*(z) = 1/(1 - z) = \sum_{j=0}^{\infty}z^j$ for $|z| < 1$, so that $\psi_j^* = 1$ for all $j$.
The $w_t$ are Gaussian, so estimation is straightforward because the differenced data, say $y_t = \nabla x_t$, are independent and identically distributed normal variates with mean $\delta$ and variance $\sigma_w^2$. Consequently, optimal estimates of $\delta$ and $\sigma_w^2$ are the sample mean and variance of the $y_t$, respectively.
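The forecasting formulas in Example 3.37 are easy to check numerically. The following sketch, which is ours and not from the text, simulates a random walk with drift, estimates δ and σ²_w from the differenced data, and forms the m-step-ahead forecasts (3.146) and their prediction error variances (3.147); the seed and sample size are arbitrary.
# random walk with drift: forecasts follow a line with slope delta-hat and
# the prediction error variance grows like m * sigma^2_w (arbitrary seed)
set.seed(314)
n = 150; delta = .1
x = cumsum(delta + rnorm(n))     # x_t = delta + x_{t-1} + w_t, with x_0 = 0
y = diff(x)                      # differenced data are iid N(delta, sigma^2_w)
delta.hat = mean(y); sig2.hat = var(y)
m = 1:12
cbind(m, fore = x[n] + m*delta.hat, se = sqrt(m*sig2.hat))   # (3.146) and (3.147)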

Example 3.38 IMA(1, 1) and EWMA
The ARIMA(0,1,1), or IMA(1,1) model is of interest because many economic time series can be successfully modeled this way. In addition, the model leads to a frequently used, and abused, forecasting method called exponentially weighted moving averages (EWMA). We will write the model as
$$ x_t = x_{t-1} + w_t - \lambda w_{t-1}, \tag{3.148} $$
with $|\lambda| < 1$, for $t = 1, 2, \ldots$, and $x_0 = 0$, because this model formulation is easier to work with here, and it leads to the standard representation for EWMA. We could have included a drift term in (3.148), as was done in the previous example, but for the sake of simplicity, we leave it out of the discussion. If we write
$$ y_t = w_t - \lambda w_{t-1}, $$
we may write (3.148) as $x_t = x_{t-1} + y_t$. Because $|\lambda| < 1$, $y_t$ has an invertible representation, $y_t = \sum_{j=1}^{\infty}\lambda^j y_{t-j} + w_t$, and substituting $y_t = x_t - x_{t-1}$, we may write
$$ x_t = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1}x_{t-j} + w_t \tag{3.149} $$
as an approximation for large $t$ (put $x_t = 0$ for $t \le 0$). Verification of (3.149) is left to the reader (Problem 3.28). Using the approximation (3.149), we have that the approximate one-step-ahead predictor, using the notation of Section 3.4, is
$$ \tilde{x}_{n+1} = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1}x_{n+1-j} = (1 - \lambda)x_n + \lambda\sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1}x_{n-j} = (1 - \lambda)x_n + \lambda\tilde{x}_n. \tag{3.150} $$
From (3.150), we see that the new forecast is a linear combination of the old forecast and the new observation. Based on (3.150) and the fact that we only observe $x_1, \ldots, x_n$, and consequently $y_1, \ldots, y_n$ (because $y_t = x_t - x_{t-1}$; $x_0 = 0$), the truncated forecasts are
$$ \tilde{x}_{n+1}^n = (1 - \lambda)x_n + \lambda\tilde{x}_n^{n-1}, \quad n \ge 1, \tag{3.151} $$
with $\tilde{x}_1^0 = x_1$ as an initial value. The mean-square prediction error can be approximated using (3.145) by noting that $\psi^*(z) = (1 - \lambda z)/(1 - z) = 1 + (1 - \lambda)\sum_{j=1}^{\infty}z^j$ for $|z| < 1$; consequently, for large $n$, (3.145) leads to
$$ P_{n+m}^n \approx \sigma_w^2\left[1 + (m - 1)(1 - \lambda)^2\right]. $$
In EWMA, the parameter $1 - \lambda$ is often called the smoothing parameter and is restricted to be between zero and one. Larger values of $\lambda$ lead to smoother forecasts.
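A direct implementation of the truncated EWMA forecasts (3.151) takes only a few lines of R. The sketch below is ours and not from the text; λ is fixed at .8 for illustration rather than estimated from the data, and the seed is arbitrary.
# one-step-ahead EWMA forecasts via the recursion (3.151), lambda = .8
set.seed(90)
x = as.numeric(arima.sim(list(order = c(0,1,1), ma = -.8), n = 100))  # IMA(1,1)
lam = .8; n = length(x)
xtilde = numeric(n); xtilde[1] = x[1]          # initial value x-tilde_1 = x_1
for (t in 1:(n-1)) xtilde[t+1] = (1-lam)*x[t] + lam*xtilde[t]
ts.plot(cbind(x, xtilde), col = 1:2)           # data and one-step-ahead EWMA forecasts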

This method of forecasting is popular because it is easy to use; we need only retain the previous forecast value and the current observation to forecast the next time period. Unfortunately, as previously suggested, the method is often abused because some forecasters do not verify that the observations follow an IMA(1, 1) process, and often arbitrarily pick values of $\lambda$. In the following, we show how to generate 100 observations from an IMA(1,1) model with $\lambda = -\theta = .8$ and then calculate and display the fitted EWMA superimposed on the data. This is accomplished using the Holt-Winters command in R (see the help file ?HoltWinters for details; no output is shown):
set.seed(666)
x = arima.sim(list(order = c(0,1,1), ma = -0.8), n = 100)
(x.ima = HoltWinters(x, beta=FALSE, gamma=FALSE))  # α below is 1 − λ
    Smoothing parameter:  alpha: 0.1663072
plot(x.ima)

3.7 Building ARIMA Models

There are a few basic steps to fitting ARIMA models to time series data. These steps involve
• plotting the data,
• possibly transforming the data,
• identifying the dependence orders of the model,
• parameter estimation,
• diagnostics, and
• model choice.
First, as with any data analysis, we should construct a time plot of the data, and inspect the graph for any anomalies. If, for example, the variability in the data grows with time, it will be necessary to transform the data to stabilize the variance. In such cases, the Box–Cox class of power transformations, equation (2.34), could be employed. Also, the particular application might suggest an appropriate transformation. For example, we have seen numerous examples where the data behave as $x_t = (1 + p_t)x_{t-1}$, where $p_t$ is a small percentage change from period $t - 1$ to $t$, which may be negative. If $p_t$ is a relatively stable process, then $\nabla\log(x_t) \approx p_t$ will be relatively stable. Frequently, $\nabla\log(x_t)$ is called the return or growth rate. This general idea was used in Example 3.33, and we will use it again in Example 3.39.
After suitably transforming the data, the next step is to identify preliminary values of the autoregressive order, $p$, the order of differencing, $d$, and the moving average order, $q$. A time plot of the data will typically suggest whether any differencing is needed. If differencing is called for, then difference the data once, $d = 1$, and inspect the time plot of $\nabla x_t$. If additional differencing is necessary, then try differencing again and inspect a time plot of $\nabla^2 x_t$. Be careful not to overdifference because this may introduce dependence where none exists. For example, $x_t = w_t$ is serially uncorrelated, but $\nabla x_t = w_t - w_{t-1}$ is MA(1). In addition to time plots, the sample

Fig. 3.13. Top: Quarterly U.S. GNP from 1947(1) to 2002(3). Bottom: Sample ACF of the GNP data. Lag is in terms of years.

ACF can help in indicating whether differencing is needed. Because the polynomial $\phi(z)(1 - z)^d$ has a unit root, the sample ACF, $\hat\rho(h)$, will not decay to zero fast as $h$ increases. Thus, a slow decay in $\hat\rho(h)$ is an indication that differencing may be needed.
When preliminary values of $d$ have been settled, the next step is to look at the sample ACF and PACF of $\nabla^d x_t$ for whatever values of $d$ have been chosen. Using Table 3.1 as a guide, preliminary values of $p$ and $q$ are chosen. Note that it cannot be the case that both the ACF and PACF cut off. Because we are dealing with estimates, it will not always be clear whether the sample ACF or PACF is tailing off or cutting off. Also, two models that are seemingly different can actually be very similar. With this in mind, we should not worry about being so precise at this stage of the model fitting. At this point, a few preliminary values of $p$, $d$, and $q$ should be at hand, and we can start estimating the parameters.

Example 3.39 Analysis of GNP Data
In this example, we consider the analysis of quarterly U.S. GNP from 1947(1) to 2002(3), $n = 223$ observations. The data are real U.S. gross national product in billions of chained 1996 dollars and have been seasonally adjusted. The data were obtained from the Federal Reserve Bank of St. Louis (http://research.stlouisfed.org/). Figure 3.13 shows a plot of the data, say, $y_t$. Because strong trend tends to obscure other effects, it is difficult to see any other variability in the data except for periodic large dips in the economy. When reports of GNP and similar economic indicators are given, it is often the growth rate (percent change), rather than the actual (or adjusted) values, that is of interest. The growth rate, say, $x_t = \nabla\log(y_t)$, is plotted in Figure 3.14, and it appears to be a stable process.

Fig. 3.14. U.S. GNP quarterly growth rate. The horizontal line displays the average growth of the process, which is close to 1%.

The sample ACF and PACF of the quarterly growth rate are plotted in Figure 3.15. Inspecting the sample ACF and PACF, we might feel that the ACF is cutting off at lag 2 and the PACF is tailing off. This would suggest the GNP growth rate follows an MA(2) process, or log GNP follows an ARIMA(0, 1, 2) model. Rather than focus on one model, we will also suggest that it appears that the ACF is tailing off and the PACF is cutting off at lag 1. This suggests an AR(1) model for the growth rate, or ARIMA(1, 1, 0) for log GNP. As a preliminary analysis, we will fit both models.
Using MLE to fit the MA(2) model for the growth rate, $x_t$, the estimated model is
$$ \hat{x}_t = .008_{(.001)} + .303_{(.065)}\hat{w}_{t-1} + .204_{(.064)}\hat{w}_{t-2} + \hat{w}_t, \tag{3.152} $$
where $\hat\sigma_w = .0094$ is based on 219 degrees of freedom. The values in parentheses are the corresponding estimated standard errors. All of the regression coefficients are significant, including the constant. We make a special note of this because, as a default, some computer packages do not fit a constant in a differenced model. That is, these packages assume, by default, that there is no drift. In this example, not including a constant leads to the wrong conclusions about the nature of the U.S. economy. Not including a constant assumes the average quarterly growth rate is zero, whereas the U.S. GNP average quarterly growth rate is about 1% (which can be seen easily in Figure 3.14). We leave it to the reader to investigate what happens when the constant is not included.
The estimated AR(1) model is
$$ \hat{x}_t = .008_{(.001)}(1 - .347) + .347_{(.063)}x_{t-1} + \hat{w}_t, \tag{3.153} $$
where $\hat\sigma_w = .0095$ on 220 degrees of freedom; note that the constant in (3.153) is $.008(1 - .347) = .005$.
We will discuss diagnostics next, but assuming both of these models fit well, how are we to reconcile the apparent differences of the estimated models (3.152)

Fig. 3.15. Sample ACF and PACF of the GNP quarterly growth rate. Lag is in terms of years.

and (3.153)? In fact, the fitted models are nearly the same. To show this, consider an AR(1) model of the form in (3.153) without a constant term; that is,
$$ x_t = .35x_{t-1} + w_t, $$
and write it in its causal form, $x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j}$, where we recall $\psi_j = .35^j$. Thus, $\psi_0 = 1$, $\psi_1 = .350$, $\psi_2 = .123$, $\psi_3 = .043$, $\psi_4 = .015$, $\psi_5 = .005$, $\psi_6 = .002$, $\psi_7 = .001$, $\psi_8 = 0$, $\psi_9 = 0$, $\psi_{10} = 0$, and so forth. Thus,
$$ x_t \approx .35w_{t-1} + .12w_{t-2} + w_t, $$
which is similar to the fitted MA(2) model in (3.152).
The analysis can be performed in R as follows.
plot(gnp)
acf2(gnp, 50)
gnpgr = diff(log(gnp))      # growth rate
plot(gnpgr)
acf2(gnpgr, 24)
sarima(gnpgr, 1, 0, 0)      # AR(1)
sarima(gnpgr, 0, 0, 2)      # MA(2)
ARMAtoMA(ar=.35, ma=0, 10)  # prints psi-weights

The next step in model fitting is diagnostics. This investigation includes the analysis of the residuals as well as model comparisons. Again, the first step involves a time plot of the innovations (or residuals), $x_t - \hat{x}_t^{t-1}$, or of the standardized innovations
$$ e_t = \left(x_t - \hat{x}_t^{t-1}\right)\Big/\sqrt{\hat{P}_t^{t-1}}, \tag{3.154} $$
where $\hat{x}_t^{t-1}$ is the one-step-ahead prediction of $x_t$ based on the fitted model and $\hat{P}_t^{t-1}$ is the estimated one-step-ahead error variance. If the model fits well, the standardized

residuals should behave as an iid sequence with mean zero and variance one. The time plot should be inspected for any obvious departures from this assumption. Unless the time series is Gaussian, it is not enough that the residuals are uncorrelated. For example, it is possible in the non-Gaussian case to have an uncorrelated process for which values contiguous in time are highly dependent. As an example, we mention the family of GARCH models that are discussed in Chapter 5.
Investigation of marginal normality can be accomplished visually by looking at a histogram of the residuals. In addition to this, a normal probability plot or a Q-Q plot can help in identifying departures from normality. See Johnson and Wichern (1992, Chapter 4) for details of this test as well as additional tests for multivariate normality.
There are several tests of randomness, for example the runs test, that could be applied to the residuals. We could also inspect the sample autocorrelations of the residuals, say, $\hat\rho_e(h)$, for any patterns or large values. Recall that, for a white noise sequence, the sample autocorrelations are approximately independently and normally distributed with zero means and variances $1/n$. Hence, a good check on the correlation structure of the residuals is to plot $\hat\rho_e(h)$ versus $h$ along with the error bounds of $\pm 2/\sqrt{n}$. The residuals from a model fit, however, will not quite have the properties of a white noise sequence and the variance of $\hat\rho_e(h)$ can be much less than $1/n$. Details can be found in Box and Pierce (1970) and McLeod (1978). This part of the diagnostics can be viewed as a visual inspection of $\hat\rho_e(h)$ with the main concern being the detection of obvious departures from the independence assumption.
In addition to plotting $\hat\rho_e(h)$, we can perform a general test that takes into consideration the magnitudes of $\hat\rho_e(h)$ as a group. For example, it may be the case that, individually, each $\hat\rho_e(h)$ is small in magnitude, say, each one is just slightly less than $2/\sqrt{n}$ in magnitude, but, collectively, the values are large. The Ljung–Box–Pierce Q-statistic given by
$$ Q = n(n + 2)\sum_{h=1}^{H}\frac{\hat\rho_e^2(h)}{n - h} \tag{3.155} $$
can be used to perform such a test. The value $H$ in (3.155) is chosen somewhat arbitrarily, typically, $H = 20$. Under the null hypothesis of model adequacy, asymptotically ($n \to \infty$), $Q \sim \chi^2_{H-p-q}$. Thus, we would reject the null hypothesis at level $\alpha$ if the value of $Q$ exceeds the $(1 - \alpha)$-quantile of the $\chi^2_{H-p-q}$ distribution. Details can be found in Box and Pierce (1970), Ljung and Box (1978), and Davies et al. (1977). The basic idea is that if $w_t$ is white noise, then by Property 1.2, $n\hat\rho_w^2(h)$, for $h = 1, \ldots, H$, are asymptotically independent $\chi^2_1$ random variables. This means that $n\sum_{h=1}^{H}\hat\rho_w^2(h)$ is approximately a $\chi^2_H$ random variable. Because the test involves the ACF of residuals from a model fit, there is a loss of $p + q$ degrees of freedom; the other values in (3.155) are used to adjust the statistic to better match the asymptotic chi-squared distribution.

Example 3.40 Diagnostics for GNP Growth Rate Example
We will focus on the MA(2) fit from Example 3.39; the analysis of the AR(1) residuals is similar. Figure 3.16 displays a plot of the standardized residuals, the ACF of the residuals, a normal Q-Q plot of the standardized residuals, and the p-values

Fig. 3.16. Diagnostics of the residuals from MA(2) fit on GNP growth rate (standardized residuals, ACF of residuals, normal Q-Q plot of standardized residuals, and p-values for the Ljung–Box statistic).

associated with the Q-statistic, (3.155), at lags $H = 3$ through $H = 20$ (with corresponding degrees of freedom $H - 2$).
Inspection of the time plot of the standardized residuals in Figure 3.16 shows no obvious patterns. Notice that there may be outliers, with a few values exceeding 3 standard deviations in magnitude. The ACF of the standardized residuals shows no apparent departure from the model assumptions, and the Q-statistic is never significant at the lags shown. The normal Q-Q plot of the residuals shows that the assumption of normality is reasonable, with the exception of the possible outliers. The model appears to fit well. The diagnostics shown in Figure 3.16 are a by-product of the sarima command from the previous example.^{3.8}

3.8 The script tsdiag is available in R to run diagnostics for an ARIMA object, however, the script has errors and we do not recommend using it.
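Although sarima produces these p-values automatically, the statistic (3.155) is also simple to compute directly; Box.test in base R implements it. A sketch, assuming gnpgr from Example 3.39 is still in the workspace (the object names here are ours):
# Ljung-Box Q-statistic (3.155) for the MA(2) residuals at lag H = 20
res = resid(arima(gnpgr, order = c(0, 0, 2)))
H = 20
Box.test(res, lag = H, type = "Ljung-Box", fitdf = 2)   # fitdf = p + q = 2
# or, computing (3.155) by hand
n = length(res); rho = acf(res, H, plot = FALSE)$acf[-1]
Q = n*(n + 2)*sum(rho^2/(n - 1:H))
c(Q = Q, p.value = pchisq(Q, df = H - 2, lower.tail = FALSE))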

Fig. 3.17. Q-statistic p-values for the ARIMA(0, 1, 1) fit (top) and the ARIMA(1, 1, 1) fit (bottom) to the logged varve data.

Example 3.41 Diagnostics for the Glacial Varve Series
In Example 3.33, we fit an ARIMA(0, 1, 1) model to the logarithms of the glacial varve data and there appears to be a small amount of autocorrelation left in the residuals and the Q-tests are all significant; see Figure 3.17.

To adjust for this problem, we fit an ARIMA(1, 1, 1) to the logged varve data and obtained the estimates (standard errors in parentheses)

φ̂ = .23 (.05),   θ̂ = −.89 (.03),   σ̂²_w = .23.

Hence the AR term is significant. The Q-statistic p-values for this model are also displayed in Figure 3.17, and it appears this model fits the data well.

As previously stated, the diagnostics are byproducts of the individual sarima runs. We note that we did not fit a constant in either model because there is no apparent drift in the differenced, logged varve series. This fact can be verified by noting the constant is not significant when the command no.constant=TRUE is removed in the code:
sarima(log(varve), 0, 1, 1, no.constant=TRUE)   # ARIMA(0,1,1)
sarima(log(varve), 1, 1, 1, no.constant=TRUE)   # ARIMA(1,1,1)

In Example 3.39, we have two competing models, an AR(1) and an MA(2) on the GNP growth rate, that each appear to fit the data well. In addition, we might also consider that an AR(2) or an MA(3) might do better for forecasting. Perhaps combining both models, that is, fitting an ARMA(1, 2) to the GNP growth rate, would be the best. As previously mentioned, we have to be concerned with overfitting the model; it is not always the case that more is better. Overfitting leads to less-precise estimators, and adding more parameters may fit the data better but may also lead to bad forecasts. This result is illustrated in the following example.

Fig. 3.18. A perfect fit and a terrible forecast (U.S. population by official census, ×10⁸).

Example 3.42 A Problem with Overfitting
Figure 3.18 shows the U.S. population by official census, every ten years from 1910 to 1990, as points. If we use these nine observations to predict the future population, we can use an eight-degree polynomial so the fit to the nine observations is perfect. The model in this case is

x_t = β_0 + β_1 t + β_2 t² + ··· + β_8 t⁸ + w_t.

The fitted line, which is plotted in the figure, passes through the nine observations. The model predicts that the population of the United States will be close to zero in the year 2000, and will cross zero sometime in the year 2002!

The final step of model fitting is model choice or model selection. That is, we must decide which model we will retain for forecasting. The most popular techniques, AIC, AICc, and BIC, were described in Section 2.1 in the context of regression models.

Example 3.43 Model Choice for the U.S. GNP Series
Returning to the analysis of the U.S. GNP data presented in Example 3.39 and Example 3.40, recall that two models, an AR(1) and an MA(2), fit the GNP growth rate well. To choose the final model, we compare the AIC, the AICc, and the BIC for both models. These values are a byproduct of the sarima runs displayed at the end of Example 3.39, but for convenience, we display them again here (recall the growth rate data are in gnpgr):
sarima(gnpgr, 1, 0, 0)   # AR(1)
$AIC: -8.294403   $AICc: -8.284898   $BIC: -9.263748
sarima(gnpgr, 0, 0, 2)   # MA(2)
$AIC: -8.297693   $AICc: -8.287854   $BIC: -9.251711
The AIC and AICc both prefer the MA(2) fit, whereas the BIC prefers the simpler AR(1) model. It is often the case that the BIC will select a model of smaller order than the AIC or AICc. In either case, it is not unreasonable to retain the AR(1) because pure autoregressive models are easier to work with.
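A minimal sketch of the fit in Example 3.42 follows; the census values are approximate figures in millions and the object names are ours, but the qualitative point of the example, a perfect fit to the nine points and a wildly unstable extrapolation, does not depend on these details.
year = seq(1910, 1990, by = 10)
pop  = c(92.2, 106.0, 122.8, 132.2, 151.3, 179.3, 203.2, 226.5, 248.7)  # approximate, in millions
fit  = lm(pop ~ poly(year, 8))           # degree-8 polynomial: saturated, so the fit is exact
yrs  = seq(1910, 2005, by = .5)
plot(year, pop, xlim = c(1910, 2005), ylim = c(0, 300), ylab = "Population (millions)")
lines(yrs, predict(fit, data.frame(year = yrs)))   # interpolates perfectly, extrapolates terribly
predict(fit, data.frame(year = 2000))              # nonsensical forecast for the 2000 census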

3.8 Regression with Autocorrelated Errors

In Section 2.1, we covered the classical regression model with uncorrelated errors w_t. In this section, we discuss the modifications that might be considered when the errors are correlated. That is, consider the regression model

y_t = Σ_{j=1}^{r} β_j z_{tj} + x_t    (3.156)

where x_t is a process with some covariance function γ_x(s, t). In ordinary least squares, the assumption is that x_t is white Gaussian noise, in which case γ_x(s, t) = 0 for s ≠ t and γ_x(t, t) = σ², independent of t. If this is not the case, then weighted least squares should be used.

Write the model in vector notation, y = Zβ + x, where y = (y_1, . . ., y_n)′ and x = (x_1, . . ., x_n)′ are n × 1 vectors, β = (β_1, . . ., β_r)′ is r × 1, and Z = [z_1 | z_2 | ··· | z_n]′ is the n × r matrix composed of the input variables. Let Γ = {γ_x(s, t)}; then Γ^{−1/2} y = Γ^{−1/2} Zβ + Γ^{−1/2} x, so that we can write the model as

y* = Z*β + δ,

where y* = Γ^{−1/2} y, Z* = Γ^{−1/2} Z, and δ = Γ^{−1/2} x. Consequently, the covariance matrix of δ is the identity and the model is in the classical linear model form. It follows that the weighted estimate of β is β̂_w = (Z*′Z*)^{−1} Z*′ y* = (Z′Γ^{−1}Z)^{−1} Z′Γ^{−1} y, and the variance–covariance matrix of the estimator is var(β̂_w) = (Z′Γ^{−1}Z)^{−1}. If x_t is white noise, then Γ = σ²I and these results reduce to the usual least squares results.

In the time series case, it is often possible to assume a stationary covariance structure for the error process x_t that corresponds to a linear process and try to find an ARMA representation for x_t. For example, if we have a pure AR(p) error, then

φ(B) x_t = w_t,

and φ(B) = 1 − φ_1 B − ··· − φ_p B^p is the linear transformation that, when applied to the error process, produces the white noise w_t. Multiplying the regression equation through by the transformation φ(B) yields

φ(B) y_t = Σ_{j=1}^{r} β_j φ(B) z_{tj} + φ(B) x_t,

and we are back to the linear regression model where the observations have been transformed so that y*_t = φ(B) y_t is the dependent variable, z*_{tj} = φ(B) z_{tj}, for j = 1, . . ., r, are the independent variables, but the βs are the same as in the original model. For example, if p = 1, then y*_t = y_t − φ y_{t−1} and z*_{tj} = z_{tj} − φ z_{t−1, j}.

In the AR case, we may set up the least squares problem as minimizing the error sum of squares

S(φ, β) = Σ_{t=1}^{n} w_t² = Σ_{t=1}^{n} [ φ(B) y_t − Σ_{j=1}^{r} β_j φ(B) z_{tj} ]²

with respect to all the parameters, φ = {φ_1, . . ., φ_p} and β = {β_1, . . ., β_r}. Of course, the optimization is performed using numerical methods.

If the error process is ARMA(p, q), i.e., φ(B) x_t = θ(B) w_t, then in the above discussion, we transform by π(B) x_t = w_t, where π(B) = θ(B)^{−1} φ(B). In this case the error sum of squares also depends on θ = {θ_1, . . ., θ_q}:

S(φ, θ, β) = Σ_{t=1}^{n} w_t² = Σ_{t=1}^{n} [ π(B) y_t − Σ_{j=1}^{r} β_j π(B) z_{tj} ]²

At this point, the main problem is that we do not typically know the behavior of the noise x_t prior to the analysis. An easy way to tackle this problem was first presented in Cochrane and Orcutt (1949), and with the advent of cheap computing is modernized below:
(i) First, run an ordinary regression of y_t on z_{t1}, . . ., z_{tr} (acting as if the errors are uncorrelated). Retain the residuals, x̂_t = y_t − Σ_{j=1}^{r} β̂_j z_{tj}.
(ii) Identify ARMA model(s) for the residuals x̂_t.
(iii) Run weighted least squares (or MLE) on the regression model with autocorrelated errors using the model specified in step (ii).
(iv) Inspect the residuals ŵ_t for whiteness, and adjust the model if necessary.

Example 3.44 Mortality, Temperature and Pollution
We consider the analyses presented in Example 2.2, relating mean adjusted temperature T_t and particulate levels P_t to cardiovascular mortality M_t. We consider the regression model

M_t = β_1 + β_2 t + β_3 T_t + β_4 T_t² + β_5 P_t + x_t,    (3.157)

where, for now, we assume that x_t is white noise. The sample ACF and PACF of the residuals from the ordinary least squares fit of (3.157) are shown in Figure 3.19, and the results suggest an AR(2) model for the residuals.

Our next step is to fit the correlated error model (3.157), but where x_t is AR(2),

x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + w_t

and w_t is white noise. The model can be fit using the sarima function as follows (partial output shown).
trend = time(cmort); temp = tempr - mean(tempr); temp2 = temp^2
summary(fit <- lm(cmort~trend + temp + temp2 + part, na.action=NULL))
acf2(resid(fit), 52)   # implies AR2
sarima(cmort, 2,0,0, xreg=cbind(trend,temp,temp2,part))
Coefficients:
         ar1     ar2  intercept    trend     temp   temp2    part
      0.3848  0.4326    80.2116  -1.5165  -0.0190  0.0154  0.1545
s.e.  0.0436  0.0400     1.8072   0.4226   0.0495  0.0020  0.0272
sigma^2 estimated as 26.01: loglikelihood = -1549.04, aic = 3114.07

Fig. 3.19. Sample ACF and PACF of the mortality residuals indicating an AR(2) process.

The residual analysis output from sarima (not shown) shows no obvious departure of the residuals from whiteness.

Example 3.45 Regression with Lagged Variables (cont)
In Example 2.9 we fit the model

R_t = β_0 + β_1 S_{t−6} + β_2 D_{t−6} + β_3 D_{t−6} S_{t−6} + w_t,

where R_t is Recruitment, S_t is SOI, and D_t is a dummy variable that is 0 if S_t < 0 and 1 otherwise. However, residual analysis indicates that the residuals are not white noise. The sample (P)ACF of the residuals indicates that an AR(2) model might be appropriate, which is similar to the results of Example 3.44. We display partial results of the final model below.
dummy = ifelse(soi<0, 0, 1)
fish = ts.intersect(rec, soiL6=lag(soi,-6), dL6=lag(dummy,-6), dframe=TRUE)
summary(fit <- lm(rec ~soiL6*dL6, data=fish, na.action=NULL))
attach(fish)
plot(resid(fit))
acf2(resid(fit))     # indicates AR(2)
intract = soiL6*dL6  # interaction term
sarima(rec, 2,0,0, xreg = cbind(soiL6, dL6, intract))
$ttable
          Estimate      SE   t.value  p.value
ar1         1.3624  0.0440   30.9303   0.0000
ar2        -0.4703  0.0444  -10.5902   0.0000
intercept  64.8028  4.1121   15.7590   0.0000
soiL6       8.6671  2.2205    3.9033   0.0001
dL6        -2.5945  0.9535   -2.7209   0.0068
intract   -10.3092  2.8311   -3.6415   0.0003
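Before moving on, here is a minimal self-contained sketch of the transformation idea behind this section: with a single input and a known AR(1) error parameter φ, quasi-differencing both sides of (3.156) returns us to an ordinary regression with white-noise errors. The simulated data and object names are ours; in practice φ is unknown and would be estimated as in steps (i)–(iii), which is essentially what the sarima fits with xreg did in the two examples above.
set.seed(90210)
n = 500; beta = 2; phi = .8
z = rnorm(n)                            # a single input variable
x = arima.sim(list(ar = phi), n = n)    # AR(1) errors
y = beta*z + x                          # regression model (3.156) with correlated errors
ystar = y[-1] - phi*y[-n]               # y*_t = y_t - phi y_{t-1}
zstar = z[-1] - phi*z[-n]               # z*_t = z_t - phi z_{t-1}
summary(lm(ystar ~ 0 + zstar))          # OLS on the transformed data; errors are now white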

3.9 Multiplicative Seasonal ARIMA Models

In this section, we introduce several modifications made to the ARIMA model to account for seasonal and nonstationary behavior. Often, the dependence on the past tends to occur most strongly at multiples of some underlying seasonal lag s. For example, with monthly economic data, there is a strong yearly component occurring at lags that are multiples of s = 12, because of the strong connections of all activity to the calendar year. Data taken quarterly will exhibit the yearly repetitive period at s = 4 quarters. Natural phenomena such as temperature also have strong components corresponding to seasons. Hence, the natural variability of many physical, biological, and economic processes tends to match with seasonal fluctuations. Because of this, it is appropriate to introduce autoregressive and moving average polynomials that identify with the seasonal lags. The resulting pure seasonal autoregressive moving average model, say, ARMA(P, Q)_s, then takes the form

Φ_P(B^s) x_t = Θ_Q(B^s) w_t,    (3.158)

where the operators

Φ_P(B^s) = 1 − Φ_1 B^s − Φ_2 B^{2s} − ··· − Φ_P B^{Ps}    (3.159)

and

Θ_Q(B^s) = 1 + Θ_1 B^s + Θ_2 B^{2s} + ··· + Θ_Q B^{Qs}    (3.160)

are the seasonal autoregressive operator and the seasonal moving average operator of orders P and Q, respectively, with seasonal period s.

Analogous to the properties of nonseasonal ARMA models, the pure seasonal ARMA(P, Q)_s is causal only when the roots of Φ_P(z^s) lie outside the unit circle, and it is invertible only when the roots of Θ_Q(z^s) lie outside the unit circle.

Example 3.46 A Seasonal AR Series
A first-order seasonal autoregressive series that might run over months could be written as

(1 − Φ B^{12}) x_t = w_t

or

x_t = Φ x_{t−12} + w_t.

This model exhibits the series x_t in terms of past lags at the multiple of the yearly seasonal period s = 12 months. It is clear from the above form that estimation and forecasting for such a process involves only straightforward modifications of the unit lag case already treated. In particular, the causal condition requires |Φ| < 1.

We simulated 3 years of data from the model with Φ = .9, and exhibit the theoretical ACF and PACF of the model. See Figure 3.20.
set.seed(666)
phi = c(rep(0,11),.9)
sAR = arima.sim(list(order=c(12,0,0), ar=phi), n=37)
sAR = ts(sAR, freq=12)
layout(matrix(c(1,1,2, 1,1,3), nc=2))

par(mar=c(3,3,2,1), mgp=c(1.6,.6,0))
plot(sAR, axes=FALSE, main='seasonal AR(1)', xlab="year", type='c')
Months = c("J","F","M","A","M","J","J","A","S","O","N","D")
points(sAR, pch=Months, cex=1.25, font=4, col=1:4)
axis(1, 1:4); abline(v=1:4, lty=2, col=gray(.7))
axis(2); box()
ACF = ARMAacf(ar=phi, ma=0, 100)
PACF = ARMAacf(ar=phi, ma=0, 100, pacf=TRUE)
plot(ACF, type="h", xlab="LAG", ylim=c(-.1,1)); abline(h=0)
plot(PACF, type="h", xlab="LAG", ylim=c(-.1,1)); abline(h=0)

Fig. 3.20. Data generated from a seasonal (s = 12) AR(1), and the true ACF and PACF of the model x_t = .9 x_{t−12} + w_t.

For the first-order seasonal (s = 12) MA model, x_t = w_t + Θ w_{t−12}, it is easy to verify that

γ(0) = (1 + Θ²) σ²
γ(±12) = Θ σ²
γ(h) = 0, otherwise.

Thus, the only nonzero correlation, aside from lag zero, is

ρ(±12) = Θ / (1 + Θ²).

For the first-order seasonal (s = 12) AR model, using the techniques of the nonseasonal AR(1), we have

γ(0) = σ² / (1 − Φ²)
γ(±12k) = σ² Φ^k / (1 − Φ²),   k = 1, 2, . . .
γ(h) = 0, otherwise.

In this case, the only non-zero correlations are

ρ(±12k) = Φ^k,   k = 0, 1, 2, . . . .

These results can be verified using the general result that γ(h) = Φγ(h − 12), for h ≥ 1. For example, when h = 1, γ(1) = Φγ(11), but when h = 11, we have γ(11) = Φγ(1), which implies that γ(1) = γ(11) = 0. In addition to these results, the PACF have the analogous extensions from nonseasonal to seasonal models. These results are demonstrated in Figure 3.20.

As an initial diagnostic criterion, we can use the properties for the pure seasonal autoregressive and moving average series listed in Table 3.3. These properties may be considered as generalizations of the properties for nonseasonal models that were presented in Table 3.1.

Table 3.3. Behavior of the ACF and PACF for Pure SARMA Models
        AR(P)_s                   MA(Q)_s                   ARMA(P, Q)_s
ACF*    Tails off at lags ks,     Cuts off after lag Qs     Tails off at lags ks
        k = 1, 2, . . .
PACF*   Cuts off after lag Ps     Tails off at lags ks,     Tails off at lags ks
                                  k = 1, 2, . . .
*The values at nonseasonal lags h ≠ ks, for k = 1, 2, . . ., are zero.

In general, we can combine the seasonal and nonseasonal operators into a multiplicative seasonal autoregressive moving average model, denoted by ARMA(p, q) × (P, Q)_s, and write

Φ_P(B^s) φ(B) x_t = Θ_Q(B^s) θ(B) w_t    (3.161)

as the overall model. Although the diagnostic properties in Table 3.3 are not strictly true for the overall mixed model, the behavior of the ACF and PACF tends to show rough patterns of the indicated form. In fact, for mixed models, we tend to see a mixture of the facts listed in Table 3.1 and Table 3.3. In fitting such models, focusing on the seasonal autoregressive and moving average components first generally leads to more satisfactory results.

Example 3.47 A Mixed Seasonal Model
Consider an ARMA(0, 1) × (1, 0)_12 model

x_t = Φ x_{t−12} + w_t + θ w_{t−1},

where |Φ| < 1 and |θ| < 1. Then, because x_{t−12}, w_t, and w_{t−1} are uncorrelated, and x_t is stationary, γ(0) = Φ² γ(0) + σ²_w + θ² σ²_w, or

γ(0) = σ²_w (1 + θ²) / (1 − Φ²).

In addition, multiplying the model by x_{t−h}, h > 0, and taking expectations, we have γ(1) = Φγ(11) + θ σ²_w, and γ(h) = Φγ(h − 12), for h ≥ 2. Thus, the ACF for this model is

ρ(12h) = Φ^h,   h = 1, 2, . . .
ρ(12h − 1) = ρ(12h + 1) = [θ / (1 + θ²)] Φ^h,   h = 0, 1, 2, . . .
ρ(h) = 0, otherwise.

The ACF and PACF for this model, with Φ = .8 and θ = −.5, are shown in Figure 3.21. These types of correlation relationships, although idealized here, are typically seen with seasonal data.

Fig. 3.21. ACF and PACF of the mixed seasonal ARMA model x_t = .8 x_{t−12} + w_t − .5 w_{t−1}.

To reproduce Figure 3.21 in R, use the following commands:
phi = c(rep(0,11),.8)
ACF = ARMAacf(ar=phi, ma=-.5, 50)[-1]   # [-1] removes the 0 lag
PACF = ARMAacf(ar=phi, ma=-.5, 50, pacf=TRUE)
par(mfrow=c(1,2))
plot(ACF, type="h", xlab="LAG", ylim=c(-.4,.8)); abline(h=0)
plot(PACF, type="h", xlab="LAG", ylim=c(-.4,.8)); abline(h=0)

Seasonal persistence occurs when the process is nearly periodic in the season. For example, with average monthly temperatures over the years, each January would be approximately the same, each February would be approximately the same, and so on. In this case, we might think of average monthly temperature x_t as being modeled as

x_t = S_t + w_t,

where S_t is a seasonal component that varies a little from one year to the next, according to a random walk,

S_t = S_{t−12} + v_t.

In this model, w_t and v_t are uncorrelated white noise processes. The tendency of data to follow this type of model will be exhibited in a sample ACF that is large and decays very slowly at lags h = 12k, for k = 1, 2, . . . . If we subtract the effect of successive years from each other, we find that

(1 − B^{12}) x_t = x_t − x_{t−12} = v_t + w_t − w_{t−12}.
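A minimal simulation sketch of the seasonal persistence model just described (all object names are ours): the sample ACF of x_t is large and decays very slowly at lags 12, 24, 36, . . ., while the ACF of the seasonally differenced series has a single (negative) spike at lag 12, consistent with the MA(1)_12 form discussed next.
set.seed(666)
num = 120*12                      # 120 years of monthly observations
v = rnorm(num); w = rnorm(num)
S = filter(v, filter = c(rep(0,11), 1), method = "recursive")   # S_t = S_{t-12} + v_t
x = S + w                         # x_t = S_t + w_t
acf(x, 60)                        # large, slowly decaying correlations at lags 12, 24, ...
acf(diff(x, 12), 60)              # a single negative spike at lag 12: an MA(1)_12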

The differenced series, (1 − B^{12}) x_t = v_t + w_t − w_{t−12}, is a stationary MA(1)_12, and its ACF will have a peak only at lag 12. In general, seasonal differencing can be indicated when the ACF decays slowly at multiples of some season s, but is negligible between the periods. Then, a seasonal difference of order D is defined as

∇_s^D x_t = (1 − B^s)^D x_t,    (3.162)

where D = 1, 2, . . ., takes positive integer values. Typically, D = 1 is sufficient to obtain seasonal stationarity. Incorporating these ideas into a general model leads to the following definition.

Definition 3.12 The multiplicative seasonal autoregressive integrated moving average model, or SARIMA model, is given by

Φ_P(B^s) φ(B) ∇_s^D ∇^d x_t = δ + Θ_Q(B^s) θ(B) w_t,    (3.163)

where w_t is the usual Gaussian white noise process. The general model is denoted as ARIMA(p, d, q) × (P, D, Q)_s. The ordinary autoregressive and moving average components are represented by polynomials φ(B) and θ(B) of orders p and q, respectively, the seasonal autoregressive and moving average components by Φ_P(B^s) and Θ_Q(B^s) of orders P and Q, and the ordinary and seasonal difference components by ∇^d = (1 − B)^d and ∇_s^D = (1 − B^s)^D.

Example 3.48 An SARIMA Model
Consider the following model, which often provides a reasonable representation for seasonal, nonstationary, economic time series. We exhibit the equations for the model, denoted by ARIMA(0, 1, 1) × (0, 1, 1)_12 in the notation given above, where the seasonal fluctuations occur every 12 months. Then, with δ = 0, the model (3.163) becomes

∇_12 ∇ x_t = Θ(B^{12}) θ(B) w_t

or

(1 − B^{12})(1 − B) x_t = (1 + Θ B^{12})(1 + θ B) w_t.    (3.164)

Expanding both sides of (3.164) leads to the representation

(1 − B − B^{12} + B^{13}) x_t = (1 + θ B + Θ B^{12} + Θθ B^{13}) w_t,

or in difference equation form

x_t = x_{t−1} + x_{t−12} − x_{t−13} + w_t + θ w_{t−1} + Θ w_{t−12} + Θθ w_{t−13}.

Note that the multiplicative nature of the model implies that the coefficient of w_{t−13} is the product of the coefficients of w_{t−1} and w_{t−12} rather than a free parameter. The multiplicative model assumption seems to work well with many seasonal time series data sets while reducing the number of parameters that must be estimated.
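A minimal sketch, with our own object and parameter names, of how data from the model in Example 3.48 can be generated directly from its difference-equation form: the right-hand side is an MA(13) in w_t, and the left-hand side is then inverted with a recursive filter. The parameter values are chosen to be close to those fitted to the airline data in the next example.
set.seed(1984)
theta = -.4; Theta = -.56; num = 200
w = rnorm(num + 13)
ma = c(theta, rep(0,10), Theta, theta*Theta)   # MA coefficients at lags 1, 12, and 13
u = filter(w, filter = c(1, ma), sides = 1)    # u_t = w_t + theta*w_{t-1} + Theta*w_{t-12} + theta*Theta*w_{t-13}
u = u[-(1:13)]                                 # drop the start-up NAs
x = filter(u, filter = c(1, rep(0,10), 1, -1), method = "recursive")
# x_t = x_{t-1} + x_{t-12} - x_{t-13} + u_t, i.e., (1-B)(1-B^12) x_t = u_t
plot.ts(x)                                     # nonstationary with a strong seasonal pattern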

Fig. 3.22. R data set AirPassengers, which are the monthly totals of international airline passengers x, and the transformed data: lx = log x_t, dlx = ∇ log x_t, and ddlx = ∇_12 ∇ log x_t.

Selecting the appropriate model for a given set of data from all of those represented by the general form (3.163) is a daunting task, and we usually think first in terms of finding difference operators that produce a roughly stationary series and then in terms of finding a set of simple autoregressive moving average or multiplicative seasonal ARMA to fit the resulting residual series. Differencing operations are applied first, and then the residuals are constructed from a series of reduced length. Next, the ACF and the PACF of these residuals are evaluated. Peaks that appear in these functions can often be eliminated by fitting an autoregressive or moving average component in accordance with the general properties of Table 3.1 and Table 3.3. In considering whether the model is satisfactory, the diagnostic techniques discussed in Section 3.7 still apply.

Example 3.49 Air Passengers
We consider the R data set AirPassengers, which are the monthly totals of international airline passengers, 1949 to 1960, taken from Box & Jenkins (1970). Various plots of the data and transformed data are shown in Figure 3.22 and were obtained as follows:
x = AirPassengers
lx = log(x); dlx = diff(lx); ddlx = diff(dlx, 12)

plot.ts(cbind(x,lx,dlx,ddlx), main="")
# below of interest for showing seasonal RW (not shown here):
par(mfrow=c(2,1))
monthplot(dlx); monthplot(ddlx)

Note that x is the original series, which shows trend plus increasing variance. The logged data are in lx, and the transformation stabilizes the variance. The logged data are then differenced to remove trend, and are stored in dlx. It is clear that there is still persistence in the seasons (i.e., dlx_t ≈ dlx_{t−12}), so that a twelfth-order difference is applied and stored in ddlx. The transformed data appears to be stationary and we are now ready to fit a model.

The sample ACF and PACF of ddlx (∇_12 ∇ log x_t) are shown in Figure 3.23. The R code is:
acf2(ddlx, 50)

Fig. 3.23. Sample ACF and PACF of ∇_12 ∇ log x_t (ddlx).

Seasonal Component: It appears that at the seasons, the ACF is cutting off at lag 1s (s = 12), whereas the PACF is tailing off at lags 1s, 2s, 3s, 4s, . . . . These results imply an SMA(1), P = 0, Q = 1, in the season (s = 12).

Non-Seasonal Component: Inspecting the sample ACF and PACF at the lower lags, it appears as though both are tailing off. This suggests an ARMA(1, 1) within the seasons, p = q = 1.

Thus, we first try an ARIMA(1, 1, 1) × (0, 1, 1)_12 on the logged data:
sarima(lx, 1,1,1, 0,1,1,12)
Coefficients:
         ar1      ma1     sma1
      0.1960  -0.5784  -0.5643
s.e.  0.2475   0.2132   0.0747
sigma^2 estimated as 0.001341
$AIC -5.5726   $AICc -5.556713   $BIC -6.510729
However, the AR parameter is not significant, so we should try dropping one parameter from the within seasons part. In this case, we try both an ARIMA(0, 1, 1) × (0, 1, 1)_12 and an ARIMA(1, 1, 0) × (0, 1, 1)_12 model:

sarima(lx, 0,1,1, 0,1,1,12)
Coefficients:
          ma1     sma1
      -0.4018  -0.5569
s.e.   0.0896   0.0731
sigma^2 estimated as 0.001348
$AIC -5.58133   $AICc -5.56625   $BIC -6.540082
sarima(lx, 1,1,0, 0,1,1,12)
Coefficients:
          ar1     sma1
      -0.3395  -0.5619
s.e.   0.0822   0.0748
sigma^2 estimated as 0.001367
$AIC -5.567081   $AICc -5.552002   $BIC -6.525834

All information criteria prefer the ARIMA(0, 1, 1) × (0, 1, 1)_12 model, which is the model displayed in (3.164). The residual diagnostics are shown in Figure 3.24, and except for one or two outliers, the model seems to fit well.

Fig. 3.24. Residual analysis for the ARIMA(0, 1, 1) × (0, 1, 1)_12 fit to the logged air passengers data set.
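The preferred fit can also be reproduced with the arima function in the stats package (a minimal sketch, assuming lx = log(AirPassengers) as above); the estimates should agree closely with those displayed above.
(fit <- arima(lx, order = c(0,1,1), seasonal = list(order = c(0,1,1), period = 12)))
# forecasts could then be obtained with predict(fit, n.ahead = 12)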

Finally, we forecast the logged data out twelve months, and the results are shown in Figure 3.25.
sarima.for(lx, 12, 0,1,1, 0,1,1,12)

Fig. 3.25. Twelve month forecast using the ARIMA(0, 1, 1) × (0, 1, 1)_12 model on the logged air passenger data set.

Problems

Section 3.1

3.1 For an MA(1), x_t = w_t + θ w_{t−1}, show that |ρ_x(1)| ≤ 1/2 for any number θ. For which values of θ does ρ_x(1) attain its maximum and minimum?

3.2 Let {w_t; t = 0, 1, . . .} be a white noise process with variance σ²_w and let |φ| < 1 be a constant. Consider the process x_0 = w_0, and

x_t = φ x_{t−1} + w_t,   t = 1, 2, . . . .

We might use this method to simulate an AR(1) process from simulated white noise.
(a) Show that x_t = Σ_{j=0}^{t} φ^j w_{t−j} for any t = 0, 1, . . . .
(b) Find the E(x_t).
(c) Show that, for t = 0, 1, . . .,

var(x_t) = σ²_w (1 − φ^{2(t+1)}) / (1 − φ²).

(d) Show that, for h ≥ 0,

cov(x_{t+h}, x_t) = φ^h var(x_t).

(e) Is x_t stationary?

(f) Argue that, as t → ∞, the process becomes stationary, so in a sense, x_t is "asymptotically stationary."
(g) Comment on how you could use these results to simulate n observations of a stationary Gaussian AR(1) model from simulated iid N(0,1) values.
(h) Now suppose x_0 = w_0 / √(1 − φ²). Is this process stationary? Hint: Show var(x_t) is constant.

3.3 Verify the calculations made in Example 3.4 as follows.
(a) Let x_t = φ x_{t−1} + w_t where |φ| > 1 and w_t ∼ iid N(0, σ²_w). Show E(x_t) = 0 and γ_x(h) = σ²_w φ^{−2} φ^{−h} / (1 − φ^{−2}) for h ≥ 0.
(b) Let y_t = φ^{−1} y_{t−1} + v_t where v_t ∼ iid N(0, σ²_w φ^{−2}) and φ and σ_w are as in part (a). Argue that y_t is causal with the same mean function and autocovariance function as x_t.

3.4 Identify the following models as ARMA(p, q) models (watch out for parameter redundancy), and determine whether they are causal and/or invertible:
(a) x_t = .80 x_{t−1} − .15 x_{t−2} + w_t − .30 w_{t−1}.
(b) x_t = x_{t−1} − .50 x_{t−2} + w_t − w_{t−1}.

3.5 Verify the causal conditions for an AR(2) model given in (3.28). That is, show that an AR(2) is causal if and only if (3.28) holds.

Section 3.2

3.6 For the AR(2) model given by x_t = −.9 x_{t−2} + w_t, find the roots of the autoregressive polynomial, and then plot the ACF, ρ(h).

3.7 For the AR(2) series shown below, use the results of Example 3.10 to determine a set of difference equations that can be used to find the ACF ρ(h), h = 0, 1, . . .; solve for the constants in the ACF using the initial conditions. Then plot the ACF values to lag 10 (use ARMAacf as a check on your answers).
(a) x_t + 1.6 x_{t−1} + .64 x_{t−2} = w_t.
(b) x_t − .40 x_{t−1} − .45 x_{t−2} = w_t.
(c) x_t − 1.2 x_{t−1} + .85 x_{t−2} = w_t.

Section 3.3

3.8 Verify the calculations for the autocorrelation function of an ARMA(1, 1) process given in Example 3.14. Compare the form with that of the ACF for the ARMA(1, 0) and the ARMA(0, 1) series. Plot the ACFs of the three series on the same graph for φ = .6, θ = .9, and comment on the diagnostic capabilities of the ACF in this case.

3.9 Generate n = 100 observations from each of the three models discussed in Problem 3.8. Compute the sample ACF for each model and compare it to the theoretical values. Compute the sample PACF for each of the generated series and compare the sample ACFs and PACFs with the general results given in Table 3.1.

Section 3.4

3.10 Let x_t represent the cardiovascular mortality series (cmort) discussed in Example 2.2.
(a) Fit an AR(2) to x_t using linear regression as in Example 3.18.
(b) Assuming the fitted model in (a) is the true model, find the forecasts over a four-week horizon, x^n_{n+m}, for m = 1, 2, 3, 4, and the corresponding 95% prediction intervals.

3.11 Consider the MA(1) series

x_t = w_t + θ w_{t−1},

where w_t is white noise with variance σ²_w.
(a) Derive the minimum mean-square error one-step forecast based on the infinite past, and determine the mean-square error of this forecast.
(b) Let x̃^n_{n+1} be the truncated one-step-ahead forecast as given in (3.92). Show that

E[(x_{n+1} − x̃^n_{n+1})²] = σ² (1 + θ^{2+2n}).

Compare the result with (a), and indicate how well the finite approximation works in this case.

3.12 In the context of equation (3.63), show that, if γ(0) > 0 and γ(h) → 0 as h → ∞, then Γ_n is positive definite.

3.13 Suppose x_t is stationary with zero mean and recall the definition of the PACF given by (3.55) and (3.56). That is, let

ε_t = x_t − Σ_{i=1}^{h−1} a_i x_{t−i}   and   δ_{t−h} = x_{t−h} − Σ_{j=1}^{h−1} b_j x_{t−j}

be the two residuals where {a_1, . . ., a_{h−1}} and {b_1, . . ., b_{h−1}} are chosen so that they minimize the mean-squared errors

E[ε_t²]   and   E[δ²_{t−h}].

The PACF at lag h was defined as the cross-correlation between ε_t and δ_{t−h}; that is,

φ_{hh} = E(ε_t δ_{t−h}) / √(E(ε_t²) E(δ²_{t−h})).

Let R_h be the h × h matrix with elements ρ(i − j), i, j = 1, . . ., h, and let ρ_h = (ρ(1), ρ(2), . . ., ρ(h))′ be the vector of lagged autocorrelations, ρ(h) = corr(x_{t+h}, x_t). Let ρ̃_h = (ρ(h), ρ(h − 1), . . ., ρ(1))′ be the reversed vector. In addition, let x_t^h denote the BLP of x_t given {x_{t−1}, . . ., x_{t−h}}:

x_t^h = α_{h1} x_{t−1} + ··· + α_{hh} x_{t−h},

as described in Property 3.3. Prove

φ_{hh} = α_{hh} = [ρ(h) − ρ̃′_{h−1} R^{−1}_{h−1} ρ_{h−1}] / [1 − ρ̃′_{h−1} R^{−1}_{h−1} ρ̃_{h−1}].

In particular, this result proves Property 3.4.
Hint: Divide the prediction equations [see (3.63)] by γ(0) and write the matrix equation in the partitioned form as

( R_{h−1}     ρ̃_{h−1} ) ( α_1   )   ( ρ_{h−1} )
( ρ̃′_{h−1}    ρ(0)   ) ( α_{hh} ) = ( ρ(h)   ),

where the h × 1 vector of coefficients α = (α_1, . . ., α_{hh})′ is partitioned as α = (α′_1, α_{hh})′.

3.14 Suppose we wish to find a prediction function g(x) that minimizes

MSE = E[(y − g(x))²],

where x and y are jointly distributed random variables with density function f(x, y).
(a) Show that MSE is minimized by the choice

g(x) = E(y | x).

Hint: MSE = E[E{(y − g(x))² | x}].
(b) Apply the above result to the model

y = x² + z,

where x and z are independent zero-mean normal variables with variance one. Show that MSE = 1.
(c) Suppose we restrict our choices for the function g(x) to linear functions of the form

g(x) = a + bx

and determine a and b to minimize MSE. Show that a = 1 and

b = E(xy) / E(x²) = 0,

and MSE = 3. What do you interpret this to mean?

3.15 For an AR(1) model, determine the general form of the m-step-ahead forecast x^t_{t+m} and show

E[(x_{t+m} − x^t_{t+m})²] = σ²_w (1 − φ^{2m}) / (1 − φ²).

3.16 Consider the ARMA(1,1) model discussed in Example 3.8, equation (3.27); that is, x_t = .9 x_{t−1} + .5 w_{t−1} + w_t. Show that truncated prediction as defined in (3.91) is equivalent to truncated prediction using the recursive formula (3.92).

3.17 Verify statement (3.87), that for a fixed sample size, the ARMA prediction errors are correlated.

Section 3.5

3.18 Fit an AR(2) model to the cardiovascular mortality series (cmort) discussed in Example 2.2 using linear regression and using Yule–Walker.
(a) Compare the parameter estimates obtained by the two methods.
(b) Compare the estimated standard errors of the coefficients obtained by linear regression with their corresponding asymptotic approximations, as given in Property 3.10.

3.19 Suppose x_1, . . ., x_n are observations from an AR(1) process with μ = 0.
(a) Show the backcasts can be written as x^n_t = φ^{1−t} x_1, for t ≤ 1.
(b) In turn, show, for t ≤ 1, the backcasted errors are

w̃_t(φ) = x^n_t − φ x^n_{t−1} = φ^{1−t} (1 − φ²) x_1.

(c) Use the result of (b) to show Σ_{t=−∞}^{1} w̃_t²(φ) = (1 − φ²) x_1².
(d) Use the result of (c) to verify that the unconditional sum of squares, S(φ), can be written as Σ_{t=−∞}^{n} w̃_t²(φ).
(e) Find x^{t−1}_t and r_t for 1 ≤ t ≤ n, and show that

S(φ) = Σ_{t=1}^{n} (x_t − x^{t−1}_t)² / r_t.

3.20 Repeat the following numerical exercise three times. Generate n = 500 observations from the ARMA model given by

x_t = .9 x_{t−1} + w_t − .9 w_{t−1},

with w_t ∼ iid N(0, 1). Plot the simulated data, compute the sample ACF and PACF of the simulated data, and fit an ARMA(1, 1) model to the data. What happened and how do you explain the results?

3.21 Generate 10 realizations of length n = 200 each of an ARMA(1,1) process with φ = .9, θ = .5 and σ² = 1. Find the MLEs of the three parameters in each case and compare the estimators to the true values.

3.22 Generate n = 50 observations from a Gaussian AR(1) model with φ = .99 and σ_w = 1. Using an estimation technique of your choice, compare the approximate asymptotic distribution of your estimate (the one you would use for inference) with the results of a bootstrap experiment (use B = 200).

3.23 Using Example 3.32 as your guide, find the Gauss–Newton procedure for estimating the autoregressive parameter, φ, from the AR(1) model, x_t = φ x_{t−1} + w_t, given data x_1, . . ., x_n. Does this procedure produce the unconditional or the conditional estimator? Hint: Write the model as w_t(φ) = x_t − φ x_{t−1}; your solution should work out to be a non-recursive procedure.

3.24 Consider the stationary series generated by

x_t = α + φ x_{t−1} + w_t + θ w_{t−1},

where E(x_t) = μ, |θ| < 1, |φ| < 1 and the w_t are iid random variables with zero mean and variance σ²_w.
(a) Determine the mean as a function of α for the above model. Find the autocovariance and ACF of the process x_t, and show that the process is weakly stationary. Is the process strictly stationary?
(b) Prove the limiting distribution as n → ∞ of the sample mean,

x̄ = n^{−1} Σ_{t=1}^{n} x_t,

is normal, and find its limiting mean and variance in terms of α, φ, θ, and σ²_w. (Note: This part uses results from Appendix A.)

3.25 A problem of interest in the analysis of geophysical time series involves a simple model for observed data containing a signal and a reflected version of the signal with unknown amplification factor a and unknown time delay δ. For example, the depth of an earthquake is proportional to the time delay δ for the P wave and its reflected form pP on a seismic record. Assume the signal, say s_t, is white and Gaussian with variance σ²_s, and consider the generating model

x_t = s_t + a s_{t−δ}.

(a) Prove the process x_t is stationary. If |a| < 1, show that

s_t = Σ_{j=0}^{∞} (−a)^j x_{t−jδ}

is a mean square convergent representation for the signal s_t, for t = 0, ±1, ±2, . . . .
(b) If the time delay δ is assumed to be known, suggest an approximate computational method for estimating the parameters a and σ²_s using maximum likelihood and the Gauss–Newton method.
(c) If the time delay δ is an unknown integer, specify how we could estimate the parameters including δ. Generate an n = 500 point series with a = .9, σ²_w = 1 and δ = 5. Estimate the integer time delay δ by searching over δ = 3, 4, . . ., 7.

3.26 Forecasting with estimated parameters: Let x_1, x_2, . . ., x_n be a sample of size n from a causal AR(1) process, x_t = φ x_{t−1} + w_t. Let φ̂ be the Yule–Walker estimator of φ.
(a) Show φ̂ − φ = O_p(n^{−1/2}). See Appendix A for the definition of O_p(·).
(b) Let x^n_{n+1} be the one-step-ahead forecast of x_{n+1} given the data x_1, . . ., x_n, based on the known parameter, φ, and let x̂^n_{n+1} be the one-step-ahead forecast when the parameter is replaced by φ̂. Show x^n_{n+1} − x̂^n_{n+1} = O_p(n^{−1/2}).

Section 3.6

3.27 Suppose

y_t = β_0 + β_1 t + ··· + β_q t^q + x_t,   β_q ≠ 0,

where x_t is stationary. First, show that ∇^k x_t is stationary for any k = 1, 2, . . ., and then show that ∇^k y_t is not stationary for k < q, but is stationary for k ≥ q.

3.28 Verify that the IMA(1,1) model given in (3.148) can be inverted and written as (3.149).

3.29 For the ARIMA(1, 1, 0) model with drift, (1 − φB)(1 − B) x_t = δ + w_t, let y_t = (1 − B) x_t = ∇ x_t.
(a) Noting that y_t is AR(1), show that, for j ≥ 1,

y^n_{n+j} = δ [1 + φ + ··· + φ^{j−1}] + φ^j y_n.

(b) Use part (a) to show that, for m = 1, 2, . . .,

x^n_{n+m} = x_n + (δ / (1 − φ)) [ m − φ(1 − φ^m) / (1 − φ) ] + (x_n − x_{n−1}) φ(1 − φ^m) / (1 − φ).

Hint: From (a), x^n_{n+j} − x^n_{n+j−1} = δ (1 − φ^j)/(1 − φ) + φ^j (x_n − x_{n−1}). Now sum both sides over j from 1 to m.
(c) Use (3.145) to find P^n_{n+m} by first showing that ψ*_0 = 1, ψ*_1 = (1 + φ), and ψ*_j − (1 + φ) ψ*_{j−1} + φ ψ*_{j−2} = 0 for j ≥ 2, in which case ψ*_j = (1 − φ^{j+1})/(1 − φ), for j ≥ 1. Note that, as in Example 3.37, equation (3.145) is exact here.

3.30 For the logarithm of the glacial varve data, say, x_t, presented in Example 3.33, use the first 100 observations and calculate the EWMA, x̃^t_{t+1}, given in (3.151) for t = 1, . . ., 100, using λ = .25, .50, and .75, and plot the EWMAs and the data superimposed on each other. Comment on the results.

Section 3.7

3.31 In Example 3.40, we presented the diagnostics for the MA(2) fit to the GNP growth rate series. Using that example as a guide, complete the diagnostics for the AR(1) fit.

3.32 Crude oil prices in dollars per barrel are in oil. Fit an ARIMA(p, d, q) model to the growth rate performing all necessary diagnostics. Comment.

3.33 Fit an ARIMA(p, d, q) model to the global temperature data (globtemp) performing all of the necessary diagnostics. After deciding on an appropriate model, forecast (with limits) the next 10 years. Comment.

3.34 Fit an ARIMA(p, d, q) model to the sulfur dioxide series, so2, performing all of the necessary diagnostics. After deciding on an appropriate model, forecast the data into the future four time periods ahead (about one month) and calculate 95% prediction intervals for each of the four forecasts. Comment. (Sulfur dioxide is one of the pollutants monitored in the mortality study described in Example 2.2.)

Section 3.8

3.35 Let S_t represent the monthly sales data in sales (n = 150), and let L_t be the leading indicator in lead.
(a) Fit an ARIMA model to S_t, the monthly sales data. Discuss your model fitting in a step-by-step fashion, presenting your (A) initial examination of the data, (B) transformations, if necessary, (C) initial identification of the dependence orders and degree of differencing, (D) parameter estimation, (E) residual diagnostics and model choice.
(b) Use the CCF and lag plots between ∇S_t and ∇L_t to argue that a regression of ∇S_t on ∇L_{t−3} is reasonable. [Note that in lag2.plot(), the first named series is the one that gets lagged.]
(c) Fit the regression model ∇S_t = β_0 + β_1 ∇L_{t−3} + x_t, where x_t is an ARMA process (explain how you decided on your model for x_t). Discuss your results. [See Example 3.45 for help on coding this problem.]

3.36 One of the remarkable technological developments in the computer industry has been the ability to store information densely on a hard drive. In addition, the cost of storage has steadily declined causing problems of too much data as opposed to big data. The data set for this assignment is cpg, which consists of the median annual retail price per GB of hard drives, say c_t, taken from a sample of manufacturers from 1980 to 2008.
(a) Plot c_t and describe what you see.
(b) Argue that the curve c_t versus t behaves like c_t ≈ α e^{βt} by fitting a linear regression of log c_t on t and then plotting the fitted line to compare it to the logged data. Comment.

(c) Inspect the residuals of the linear regression fit and comment.
(d) Fit the regression again, but now using the fact that the errors are autocorrelated. Comment.

3.37 Redo Problem 2.2 without assuming the error term is white noise.

Section 3.9

3.38 Consider the ARIMA model

x_t = w_t + Θ w_{t−2}.

(a) Identify the model using the notation ARIMA(p, d, q) × (P, D, Q)_s.
(b) Show that the series is invertible for |Θ| < 1, and find the coefficients in the representation

w_t = Σ_{k=0}^{∞} π_k x_{t−k}.

(c) Develop equations for the m-step ahead forecast, x̃_{n+m}, and its variance based on the infinite past, x_n, x_{n−1}, . . . .

3.39 Plot the ACF of the seasonal ARIMA(0, 1) × (1, 0)_12 model with Φ = .8 and θ = .5.

3.40 Fit a seasonal ARIMA model of your choice to the chicken price data in chicken. Use the estimated model to forecast the next 12 months.

3.41 Fit a seasonal ARIMA model of your choice to the unemployment data in unemp. Use the estimated model to forecast the next 12 months.

3.42 Fit a seasonal ARIMA model of your choice to the unemployment data in UnempRate. Use the estimated model to forecast the next 12 months.

3.43 Fit a seasonal ARIMA model of your choice to the U.S. Live Birth Series (birth). Use the estimated model to forecast the next 12 months.

3.44 Fit an appropriate seasonal ARIMA model to the log-transformed Johnson and Johnson earnings series (jj) of Example 1.1. Use the estimated model to forecast the next 4 quarters.

The following problems require supplemental material given in Appendix B.

3.45 Suppose x_t = Σ_{j=1}^{p} φ_j x_{t−j} + w_t, where φ_p ≠ 0 and w_t is white noise such that w_t is uncorrelated with {x_k; k < t}. Use the Projection Theorem, Theorem B.1, to show that, for n > p, the BLP of x_{n+1} on sp{x_k, k ≤ n} is

x̂_{n+1} = Σ_{j=1}^{p} φ_j x_{n+1−j}.

3.46 Use the Projection Theorem to derive the Innovations Algorithm, Property 3.6, equations (3.77)–(3.79). Then, use Theorem B.2 to derive the m-step-ahead forecast results given in (3.80) and (3.81).

3.47 Consider the series x_t = w_t − w_{t−1}, where w_t is a white noise process with mean zero and variance σ²_w. Suppose we consider the problem of predicting x_{n+1}, based on only x_1, . . ., x_n. Use the Projection Theorem to answer the questions below.
(a) Show the best linear predictor is

x^n_{n+1} = − (1 / (n + 1)) Σ_{k=1}^{n} k x_k.

(b) Prove the mean square error is

E(x_{n+1} − x^n_{n+1})² = ((n + 2) / (n + 1)) σ²_w.

3.48 Use Theorem B.2 and Theorem B.3 to verify (3.117).

3.49 Prove Theorem B.2.

3.50 Prove Property 3.2.


Chapter 4
Spectral Analysis and Filtering

In this chapter, we focus on the frequency domain approach to time series analysis. We argue that the concept of regularity of a series can best be expressed in terms of periodic variations of the underlying phenomenon that produced the series. Many of the examples in Section 1.1 are time series that are driven by periodic components. For example, the speech recording in Figure 1.3 contains a complicated mixture of frequencies related to the opening and closing of the glottis. The monthly SOI displayed in Figure 1.5 contains two periodicities, a seasonal periodic component of 12 months and an El Niño component of about three to seven years. Of fundamental interest is the return period of the El Niño phenomenon, which can have profound effects on local climate.

An important part of analyzing data in the frequency domain, as well as the time domain, is the investigation and exploitation of the properties of the time-invariant linear filter. This special linear transformation is used similarly to linear regression in conventional statistics, and we use many of the same terms in the time series context. We also introduce coherency as a tool for relating the common periodic behavior of two series. Coherency is a frequency based measure of the correlation between two series at a given frequency, and we show later that it measures the performance of the best linear filter relating the two series.

Many frequency scales will often coexist, depending on the nature of the problem. For example, in the Johnson & Johnson data set in Figure 1.1, the predominant frequency of oscillation is one cycle per year (4 quarters), or ω = .25 cycles per observation. The predominant frequency in the SOI and fish populations series in Figure 1.5 is also one cycle per year, but this corresponds to 1 cycle every 12 months, or ω = .083 cycles per observation. Throughout the text, we measure frequency, ω, in cycles per time point rather than the alternative λ = 2πω that would give radians per point. Of descriptive interest is the period of a time series, defined as the number of points in a cycle, i.e., 1/ω. Hence, the predominant period of the Johnson & Johnson series is 1/.25 or 4 quarters per cycle, whereas the predominant period of the SOI series is 12 months per cycle.

4.1 Cyclical Behavior and Periodicity

We have already encountered the notion of periodicity in numerous examples in Chapters 1, 2 and 3. The general notion of periodicity can be made more precise by introducing some terminology. In order to define the rate at which a series oscillates, we first define a cycle as one complete period of a sine or cosine function defined over a unit time interval. As in (1.5), we consider the periodic process

x_t = A cos(2πωt + φ)    (4.1)

for t = 0, ±1, ±2, . . ., where ω is a frequency index, defined in cycles per unit time, with A determining the height or amplitude of the function and φ, called the phase, determining the start point of the cosine function. We can introduce random variation in this time series by allowing the amplitude and phase to vary randomly.

As discussed in Example 2.10, for purposes of data analysis, it is easier to use a trigonometric identity^{4.1} and write (4.1) as

x_t = U_1 cos(2πωt) + U_2 sin(2πωt),    (4.2)

where U_1 = A cos φ and U_2 = −A sin φ are often taken to be normally distributed random variables. In this case, the amplitude is A = √(U_1² + U_2²) and the phase is φ = tan^{−1}(−U_2 / U_1). From these facts we can show that if, and only if, in (4.1), A and φ are independent random variables, where A² is chi-squared with 2 degrees of freedom, and φ is uniformly distributed on (−π, π), then U_1 and U_2 are independent, standard normal random variables (see Problem 4.3).

^{4.1} cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β).

If we assume that U_1 and U_2 are uncorrelated random variables with mean 0 and variance σ², then x_t in (4.2) is stationary with mean E(x_t) = 0 and, writing c_t = cos(2πωt) and s_t = sin(2πωt), autocovariance function

γ_x(h) = cov(x_{t+h}, x_t) = cov(U_1 c_{t+h} + U_2 s_{t+h}, U_1 c_t + U_2 s_t)
      = cov(U_1 c_{t+h}, U_1 c_t) + cov(U_1 c_{t+h}, U_2 s_t)
        + cov(U_2 s_{t+h}, U_1 c_t) + cov(U_2 s_{t+h}, U_2 s_t)    (4.3)
      = σ² c_{t+h} c_t + 0 + 0 + σ² s_{t+h} s_t = σ² cos(2πωh),

using Footnote 4.1 and noting that cov(U_1, U_2) = 0. From (4.3), we see that

γ_x(0) = σ² = var(x_t).

Thus, if we observe U_1 = a and U_2 = b, an estimate of σ² is the sample variance of these two observations, which in this case is simply S² = (a² + b²)/(2 − 1) = a² + b².

The random process in (4.2) is a function of its frequency, ω. For ω = 1, the series makes one cycle per time unit; for ω = .50, the series makes a cycle every two time units; for ω = .25, every four units, and so on. In general, for data that occur at discrete time points, we will need at least two points to determine a cycle, so the

highest frequency of interest is .5 cycles per point. This frequency is called the folding frequency and defines the highest frequency that can be seen in discrete sampling. Higher frequencies sampled this way will appear at lower frequencies, called aliases; an example is the way a camera samples a rotating wheel on a moving automobile in a movie, in which the wheel appears to be rotating at a different rate, and sometimes backwards (the wagon wheel effect). For example, most movies are recorded at 24 frames per second (or 24 Hertz). If the camera is filming a wheel that is rotating at 24 Hertz, the wheel will appear to stand still.

Consider a generalization of (4.2) that allows mixtures of periodic series with multiple frequencies and amplitudes,

x_t = Σ_{k=1}^{q} [ U_{k1} cos(2πω_k t) + U_{k2} sin(2πω_k t) ],    (4.4)

where U_{k1}, U_{k2}, for k = 1, 2, . . ., q, are uncorrelated zero-mean random variables with variances σ²_k, and the ω_k are distinct frequencies. Notice that (4.4) exhibits the process as a sum of uncorrelated components, with variance σ²_k for frequency ω_k. As in (4.3), it is easy to show (Problem 4.4) that the autocovariance function of the process is

γ_x(h) = Σ_{k=1}^{q} σ²_k cos(2πω_k h),    (4.5)

and we note the autocovariance function is the sum of periodic components with weights proportional to the variances σ²_k. Hence, x_t is a mean-zero stationary process with variance

γ_x(0) = var(x_t) = Σ_{k=1}^{q} σ²_k,    (4.6)

exhibiting the overall variance as a sum of variances of each of the component parts.

As in the simple case, if we observe U_{k1} = a_k and U_{k2} = b_k for k = 1, . . ., q, then an estimate of the kth variance component, σ²_k, of var(x_t), would be the sample variance S²_k = a²_k + b²_k. In addition, an estimate of the total variance of x_t, namely, γ_x(0), would be the sum of the sample variances,

γ̂_x(0) = v̂ar(x_t) = Σ_{k=1}^{q} (a²_k + b²_k).    (4.7)

Hold on to this idea because we will use it in Example 4.2.

Example 4.1 A Periodic Series
Figure 4.1 shows an example of the mixture (4.4) with q = 3 constructed in the following way. First, for t = 1, . . ., 100, we generated three series

x_{t1} = 2 cos(2πt 6/100) + 3 sin(2πt 6/100)
x_{t2} = 4 cos(2πt 10/100) + 5 sin(2πt 10/100)
x_{t3} = 6 cos(2πt 40/100) + 7 sin(2πt 40/100)

Fig. 4.1. Periodic components and their sum as described in Example 4.1 (ω = 6/100, A² = 13; ω = 10/100, A² = 41; ω = 40/100, A² = 85; and the sum).

These three series are displayed in Figure 4.1 along with the corresponding frequencies and squared amplitudes. For example, the squared amplitude of x_{t1} is A² = 2² + 3² = 13. Hence, the maximum and minimum values that x_{t1} will attain are ±√13 = ±3.61.

Finally, we constructed

x_t = x_{t1} + x_{t2} + x_{t3}

and this series is also displayed in Figure 4.1. We note that x_t appears to behave as some of the periodic series we saw in Chapters 1 and 2. The systematic sorting out of the essential frequency components in a time series, including their relative contributions, constitutes one of the main objectives of spectral analysis. The R code to reproduce Figure 4.1 is
x1 = 2*cos(2*pi*1:100*6/100) + 3*sin(2*pi*1:100*6/100)
x2 = 4*cos(2*pi*1:100*10/100) + 5*sin(2*pi*1:100*10/100)
x3 = 6*cos(2*pi*1:100*40/100) + 7*sin(2*pi*1:100*40/100)
x = x1 + x2 + x3
par(mfrow=c(2,2))
plot.ts(x1, ylim=c(-10,10), main=expression(omega==6/100~~~A^2==13))
plot.ts(x2, ylim=c(-10,10), main=expression(omega==10/100~~~A^2==41))
plot.ts(x3, ylim=c(-10,10), main=expression(omega==40/100~~~A^2==85))
plot.ts(x, ylim=c(-16,16), main="sum")

The model given in (4.4) along with the corresponding autocovariance function given in (4.5) are population constructs. Although, in (4.7), we hinted as to how we would estimate the variance components, we now discuss the practical aspects of how, given data x_1, . . ., x_n, to actually estimate the variance components σ²_k in (4.6).

Example 4.2 Estimation and the Periodogram
For any time series sample x_1, . . ., x_n, where n is odd, we may write, exactly,

x_t = a_0 + Σ_{j=1}^{(n−1)/2} [ a_j cos(2πtj/n) + b_j sin(2πtj/n) ],    (4.8)

for t = 1, . . ., n and suitably chosen coefficients. If n is even, the representation (4.8) can be modified by summing to (n/2 − 1) and adding an additional component given by a_{n/2} cos(2πt 1/2) = a_{n/2} (−1)^t. The crucial point here is that (4.8) is exact for any sample. Hence (4.4) may be thought of as an approximation to (4.8), the idea being that many of the coefficients in (4.8) may be close to zero.

Using the regression results from Chapter 2, the coefficients a_j and b_j are of the form Σ_{t=1}^{n} x_t z_{tj} / Σ_{t=1}^{n} z²_{tj}, where z_{tj} is either cos(2πtj/n) or sin(2πtj/n). Using Problem 4.1, Σ_{t=1}^{n} z²_{tj} = n/2 when j/n ≠ 0, 1/2, so the regression coefficients in (4.8) can be written as (a_0 = x̄),

a_j = (2/n) Σ_{t=1}^{n} x_t cos(2πtj/n)   and   b_j = (2/n) Σ_{t=1}^{n} x_t sin(2πtj/n).

We then define the scaled periodogram to be

P(j/n) = a²_j + b²_j,    (4.9)

and it is of interest because it indicates which frequency components in (4.8) are large in magnitude and which components are small. The scaled periodogram is simply the sample variance at each frequency component and consequently is an estimate of σ²_j corresponding to the sinusoid oscillating at a frequency of ω_j = j/n. These particular frequencies are called the Fourier or fundamental frequencies. Large values of P(j/n) indicate which frequencies ω_j = j/n are predominant in the series, whereas small values of P(j/n) may be associated with noise.

The periodogram was introduced in Schuster (1898) and used in Schuster (1906) for studying the periodicities in the sunspot series (shown in Figure 4.22). Fortunately, it is not necessary to run a large regression to obtain the values of a_j and b_j because they can be computed quickly if n is a highly composite integer. Although we will discuss it in more detail in Section 4.3, the discrete Fourier transform (DFT) is a complex-valued weighted average of the data given by^{4.2}

d(j/n) = n^{−1/2} Σ_{t=1}^{n} x_t exp(−2πitj/n)
       = n^{−1/2} ( Σ_{t=1}^{n} x_t cos(2πtj/n) − i Σ_{t=1}^{n} x_t sin(2πtj/n) ),    (4.10)

^{4.2} Euler's formula: e^{iα} = cos(α) + i sin(α). Consequently, cos(α) = (e^{iα} + e^{−iα})/2 and sin(α) = (e^{iα} − e^{−iα})/(2i). Also, 1/i = −i because −i × i = 1. If z = a + ib is complex, then |z|² = z z* = (a + ib)(a − ib) = a² + b²; the * denotes conjugation.

Fig. 4.2. The scaled periodogram (4.12) of the data generated in Example 4.1.

for j = 0, 1, ..., n - 1, where the frequencies ω_j = j/n are the Fourier or fundamental frequencies. Because of a large number of redundancies in the calculation, (4.10) may be computed quickly using the fast Fourier transform (FFT). Note that

    |d(j/n)|^2 = (1/n) ( Σ_{t=1}^n x_t cos(2πtj/n) )^2 + (1/n) ( Σ_{t=1}^n x_t sin(2πtj/n) )^2      (4.11)

and it is this quantity that is called the periodogram. We may calculate the scaled periodogram, (4.9), using the periodogram as

    P(j/n) = (4/n) |d(j/n)|^2.                                                              (4.12)

The scaled periodogram of the data, x_t, simulated in Example 4.1 is shown in Figure 4.2, and it clearly identifies the three components x_{t1}, x_{t2}, and x_{t3} of x_t. Note that

    P(j/n) = P(1 - j/n),   j = 0, 1, ..., n - 1,

so there is a mirroring effect at the folding frequency of 1/2; consequently, the periodogram is typically not plotted for frequencies higher than the folding frequency. In addition, note that the heights of the scaled periodogram shown in the figure are

    P(6/100) = P(94/100) = 13,   P(10/100) = P(90/100) = 41,   P(40/100) = P(60/100) = 85,

and P(j/n) = 0 otherwise. These are exactly the values of the squared amplitudes of the components generated in Example 4.1. Assuming the simulated data, x, were retained from the previous example, the R code to reproduce Figure 4.2 is

P = Mod(2*fft(x)/100)^2; Fr = 0:99/100
plot(Fr, P, type="o", xlab="frequency", ylab="scaled periodogram")

Different packages scale the FFT differently, so it is a good idea to consult the documentation. R computes it without the factor n^{-1/2} and with an additional factor of e^{2πiω_j} that can be ignored because we will be interested in the squared modulus.
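As a quick numerical check of (4.8) and (4.9), the coefficients a_j and b_j can also be computed directly from their defining sums and compared with the FFT-based scaled periodogram. The sketch below assumes the simulated series x from Example 4.1 (n = 100) is still in the workspace.

n = length(x)                                 # 100 in Example 4.1
a = b = numeric(n/2)
for (j in 1:(n/2)) {
  a[j] = (2/n)*sum(x*cos(2*pi*(1:n)*j/n))     # regression coefficient a_j
  b[j] = (2/n)*sum(x*sin(2*pi*(1:n)*j/n))     # regression coefficient b_j
}
P = a^2 + b^2                                 # scaled periodogram (4.9)
round(P[c(6,10,40)], 2)                       # should return 13, 41, 85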

Fig. 4.3. Star magnitudes and part of the corresponding periodogram.

If we consider the data x_t in Example 4.1 as a color (waveform) made up of primary colors x_{t1}, x_{t2}, x_{t3} at various strengths (amplitudes), then we might consider the periodogram as a prism that decomposes the color x_t into its primary colors (spectrum). Hence the term spectral analysis. The following is an example using actual data.

Example 4.3 Star Magnitude
The data in Figure 4.3 are the magnitude of a star taken at midnight for 600 consecutive days. The data are taken from the classic text, The Calculus of Observations, a Treatise on Numerical Mathematics, by E.T. Whittaker and G. Robinson (1923, Blackie & Son, Ltd.).

The periodogram for frequencies less than .08 is also displayed in the figure; the periodogram ordinates for frequencies higher than .08 are essentially zero. Note that the 29 (≈ 1/.035) day cycle and the 24 (≈ 1/.041) day cycle are the most prominent periodic components of the data.

We can interpret this result as observing an amplitude modulated signal. For example, suppose we are observing signal-plus-noise, x_t = s_t + v_t, where s_t = cos(2πωt) cos(2πδt), and δ is very small. In this case, the process will oscillate at frequency ω, but the amplitude will be modulated by cos(2πδt). Since 2 cos(α) cos(δ) = cos(α + δ) + cos(α - δ), the periodogram of data generated as x_t will have two peaks close to each other at α ± δ. Try this on your own:

t = 1:200
plot.ts(x <- 2*cos(2*pi*.2*t)*cos(2*pi*.01*t))     # not shown
lines(cos(2*pi*.19*t)+cos(2*pi*.21*t), col=2)      # the same
Px = Mod(fft(x))^2; plot(0:199/200, Px, type='o')  # the periodogram

The R code to reproduce Figure 4.3 is

n = length(star)
par(mfrow=c(2,1), mar=c(3,3,1,1), mgp=c(1.6,.6,0))
plot(star, ylab="star magnitude", xlab="day")

Per = Mod(fft(star-mean(star)))^2/n
Freq = (1:n -1)/n
plot(Freq[1:50], Per[1:50], type='h', lwd=3, ylab="Periodogram", xlab="Frequency")
u = which.max(Per[1:50])          # 22  freq=21/600=.035 cycles/day
uu = which.max(Per[1:50][-u])     # 25  freq=25/600=.041 cycles/day
1/Freq[22]; 1/Freq[26]            # period = days/cycle
text(.05, 7000, "24 day cycle"); text(.027, 9000, "29 day cycle")
### another way to find the two peaks is to order on Per
y = cbind(1:50, Freq[1:50], Per[1:50]); y[order(y[,3]),]

4.2 The Spectral Density

In this section, we define the fundamental frequency domain tool, the spectral density. In addition, we discuss the spectral representations for stationary processes. Just as the Wold decomposition (Theorem B.5) theoretically justified the use of regression for analyzing time series, the spectral representation theorems supply the theoretical justifications for decomposing stationary time series into periodic components appearing in proportion to their underlying variances. This material is enhanced by the results presented in Appendix C.

Example 4.4 A Periodic Stationary Process
Consider a periodic stationary random process given by (4.2), with a fixed frequency ω_0, say,

    x_t = U_1 cos(2πω_0 t) + U_2 sin(2πω_0 t),                                              (4.13)

where U_1 and U_2 are uncorrelated zero-mean random variables with equal variance σ^2. The number of time periods needed for the above series to complete one cycle is exactly 1/ω_0, and the process makes exactly ω_0 cycles per point for t = 0, ±1, ±2, .... Recalling (4.3) and using Footnote 4.2, we have

    γ(h) = σ^2 cos(2πω_0 h) = (σ^2/2) e^{-2πiω_0 h} + (σ^2/2) e^{2πiω_0 h}
         = ∫_{-1/2}^{1/2} e^{2πiωh} dF(ω)

using Riemann–Stieltjes integration (see Section C.4.1), where F(ω) is the function defined by

    F(ω) = 0        if ω < -ω_0,
         = σ^2/2    if -ω_0 ≤ ω < ω_0,
         = σ^2      if ω ≥ ω_0.

The function F(ω) behaves like a cumulative distribution function for a discrete random variable, except that F(∞) = σ^2 = var(x_t) instead of one. In fact, F(ω) is a cumulative distribution function, not of probabilities, but rather of variances, with F(∞) being the total variance of the process x_t. Hence, we term F(ω) the spectral distribution function. This example is continued in Example 4.9.
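A small Monte Carlo sketch can make the covariance calculation in Example 4.4 concrete; the choices ω_0 = 1/12, σ^2 = 1, t = 10, and h = 2 below are arbitrary and are used only for illustration.

set.seed(101)
w0 = 1/12; t = 10; h = 2
x.t = x.th = numeric(10000)
for (i in 1:10000) {
  U = rnorm(2)                                   # U1, U2 uncorrelated, each with variance 1
  x.t[i]  = U[1]*cos(2*pi*w0*t)     + U[2]*sin(2*pi*w0*t)
  x.th[i] = U[1]*cos(2*pi*w0*(t+h)) + U[2]*sin(2*pi*w0*(t+h))
}
c(mean(x.th*x.t), cos(2*pi*w0*h))                # both should be near 0.5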

A representation such as the one given in Example 4.4 always exists for a stationary process. For details, see Theorem C.1 and its proof; Riemann–Stieltjes integration is described in Section C.4.1.

Property 4.1 Spectral Representation of an Autocovariance Function
If {x_t} is stationary with autocovariance γ(h) = cov(x_{t+h}, x_t), then there exists a unique monotonically increasing function F(ω), called the spectral distribution function, with F(-∞) = F(-1/2) = 0 and F(∞) = F(1/2) = γ(0), such that

    γ(h) = ∫_{-1/2}^{1/2} e^{2πiωh} dF(ω).                                                  (4.14)

An important situation we use repeatedly is the case when the autocovariance function is absolutely summable, in which case the spectral distribution function is absolutely continuous with dF(ω) = f(ω) dω, and the representation (4.14) becomes the motivation for the property given below.

Property 4.2 The Spectral Density
If the autocovariance function, γ(h), of a stationary process satisfies

    Σ_{h=-∞}^{∞} |γ(h)| < ∞,                                                                (4.15)

then it has the representation

    γ(h) = ∫_{-1/2}^{1/2} e^{2πiωh} f(ω) dω,   h = 0, ±1, ±2, ...,                          (4.16)

as the inverse transform of the spectral density,

    f(ω) = Σ_{h=-∞}^{∞} γ(h) e^{-2πiωh},   -1/2 ≤ ω ≤ 1/2.                                  (4.17)

This spectral density is the analogue of the probability density function; the fact that γ(h) is non-negative definite ensures f(ω) ≥ 0 for all ω. It follows immediately from (4.17) that

    f(ω) = f(-ω),

verifying the spectral density is an even function. Because of the evenness, we will typically only plot f(ω) for 0 ≤ ω ≤ 1/2. In addition, putting h = 0 in (4.16) yields

    γ(0) = var(x_t) = ∫_{-1/2}^{1/2} f(ω) dω,

which expresses the total variance as the integrated spectral density over all of the frequencies. We show later on that a linear filter can isolate the variance in certain frequency intervals or bands.

It should now be clear that the autocovariance and the spectral distribution functions contain the same information. That information, however, is expressed in different ways. The autocovariance function expresses information in terms of lags, whereas the spectral distribution expresses the same information in terms of cycles. Some problems are easier to work with when considering lagged information and we would tend to handle those problems in the time domain. Nevertheless, other problems are easier to work with when considering periodic information and we would tend to handle those problems in the spectral domain.

We note that the autocovariance function, γ(h), in (4.16) and the spectral density, f(ω), in (4.17) are Fourier transform pairs. In particular, this means that if f(ω) and g(ω) are two spectral densities for which

    γ_f(h) = ∫_{-1/2}^{1/2} f(ω) e^{2πiωh} dω = ∫_{-1/2}^{1/2} g(ω) e^{2πiωh} dω = γ_g(h)   (4.18)

for all h = 0, ±1, ±2, ..., then

    f(ω) = g(ω).                                                                            (4.19)

Finally, the absolute summability condition, (4.15), is not satisfied by (4.5), the example that we have used to introduce the idea of a spectral representation. The condition, however, is satisfied for ARMA models.

It is illuminating to examine the spectral density for the series that we have looked at in earlier discussions.

Example 4.5 White Noise Series
As a simple example, consider the theoretical power spectrum of a sequence of uncorrelated random variables, w_t, with variance σ_w^2. A simulated set of data is displayed in the top of Figure 1.8. Because the autocovariance function was computed in Example 1.16 as γ_w(h) = σ_w^2 for h = 0, and zero otherwise, it follows from (4.17) that

    f_w(ω) = σ_w^2

for -1/2 ≤ ω ≤ 1/2. Hence the process contains equal power at all frequencies. This property is seen in the realization, which seems to contain all different frequencies in a roughly equal mix. In fact, the name white noise comes from the analogy to white light, which contains all frequencies in the color spectrum at the same level of intensity. The top of Figure 4.4 shows a plot of the white noise spectrum for σ_w^2 = 1. The R code to reproduce the figure is given at the end of Example 4.7.

Since the linear process is an essential tool, it is worthwhile investigating the spectrum of such a process. In general, a linear filter uses a set of specified coefficients, say a_j, for j = 0, ±1, ±2, ..., to transform an input series, x_t, producing an output series, y_t, of the form

    y_t = Σ_{j=-∞}^{∞} a_j x_{t-j},   Σ_{j=-∞}^{∞} |a_j| < ∞.                               (4.20)

The form (4.20) is also called a convolution in some statistical contexts. The coefficients are collectively called the impulse response function, and the Fourier transform

    A(ω) = Σ_{j=-∞}^{∞} a_j e^{-2πiωj}                                                      (4.21)

is called the frequency response function. If, in (4.20), x_t has spectral density f_x(ω), we have the following result.

Property 4.3 Output Spectrum of a Filtered Stationary Series
For the process in (4.20), if x_t has spectrum f_x(ω), then the spectrum of the filtered output, y_t, say f_y(ω), is related to the spectrum of the input x_t by

    f_y(ω) = |A(ω)|^2 f_x(ω),                                                               (4.22)

where the frequency response function A(ω) is defined in (4.21).

Proof: The autocovariance function of the filtered output y_t in (4.20) is

    γ_y(h) = cov(y_{t+h}, y_t)
           = cov( Σ_r a_r x_{t+h-r}, Σ_s a_s x_{t-s} )
           = Σ_r Σ_s a_r γ_x(h - r + s) a_s
           = Σ_r Σ_s a_r [ ∫_{-1/2}^{1/2} e^{2πiω(h-r+s)} f_x(ω) dω ] a_s
           = ∫_{-1/2}^{1/2} ( Σ_r a_r e^{-2πiωr} )( Σ_s a_s e^{2πiωs} ) e^{2πiωh} f_x(ω) dω
           = ∫_{-1/2}^{1/2} e^{2πiωh} |A(ω)|^2 f_x(ω) dω,

where we have (1) replaced γ_x(·) by its representation (4.16), and (2) substituted A(ω) from (4.21). The result holds by exploiting the uniqueness of the Fourier transform.

The use of Property 4.3 is explored further in Section 4.7. If x_t is ARMA, its spectral density can be obtained explicitly using the fact that it is a linear process, i.e., x_t = Σ_{j=0}^{∞} ψ_j w_{t-j}, where Σ_{j=0}^{∞} |ψ_j| < ∞. The following property is a direct consequence of Property 4.3, by using the additional facts that the spectral density of white noise is f_w(ω) = σ_w^2, and by Property 3.1, ψ(z) = θ(z)/φ(z).
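To make Property 4.3 concrete before moving on, the following sketch evaluates the frequency response (4.21) of the simple first difference filter y_t = x_t - x_{t-1}, a filter chosen here only for illustration. If the input were white noise with unit variance, (4.22) says the output spectrum would be exactly this squared modulus.

w = seq(0, .5, by=.001)
A = 1 - exp(-2i*pi*w)                 # A(omega) for a_0 = 1, a_1 = -1
A2 = Mod(A)^2                         # |A(omega)|^2 = 2 - 2*cos(2*pi*w)
plot(w, A2, type="l", xlab="frequency", ylab="squared frequency response")
# low frequencies are attenuated and high frequencies passed, so differencing
# acts as a crude high-pass filter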

Property 4.4 The Spectral Density of ARMA
If x_t is ARMA(p, q), φ(B) x_t = θ(B) w_t, its spectral density is given by

    f_x(ω) = σ_w^2 |θ(e^{-2πiω})|^2 / |φ(e^{-2πiω})|^2,                                     (4.23)

where φ(z) = 1 - Σ_{k=1}^{p} φ_k z^k and θ(z) = 1 + Σ_{k=1}^{q} θ_k z^k.

Example 4.6 Moving Average
As an example of a series that does not have an equal mix of frequencies, we consider a moving average model. Specifically, consider the MA(1) model given by

    x_t = w_t + .5 w_{t-1}.

A sample realization is shown in the top of Figure 3.2 and we note that the series has less of the higher or faster frequencies. The spectral density will verify this observation.

The autocovariance function is displayed in Example 3.5, and for this particular example, we have

    γ(0) = (1 + .5^2) σ_w^2 = 1.25 σ_w^2;   γ(±1) = .5 σ_w^2;   γ(±h) = 0 for h > 1.

Substituting this directly into the definition given in (4.17), we have

    f(ω) = Σ_{h=-∞}^{∞} γ(h) e^{-2πiωh} = σ_w^2 [ 1.25 + .5 (e^{-2πiω} + e^{2πiω}) ]
         = σ_w^2 [ 1.25 + cos(2πω) ].                                                       (4.24)

We can also compute the spectral density using Property 4.4, which states that for an MA, f(ω) = σ_w^2 |θ(e^{-2πiω})|^2. Because θ(z) = 1 + .5z, we have

    |θ(e^{-2πiω})|^2 = (1 + .5e^{-2πiω})(1 + .5e^{2πiω}) = 1.25 + .5 (e^{-2πiω} + e^{2πiω}),

which leads to agreement with (4.24). Plotting the spectrum for σ_w^2 = 1, as in the middle of Figure 4.4, shows the lower or slower frequencies have greater power than the higher or faster frequencies.

Example 4.7 A Second-Order Autoregressive Series
We now consider the spectrum of an AR(2) series of the form

    x_t - φ_1 x_{t-1} - φ_2 x_{t-2} = w_t,

for the special case φ_1 = 1 and φ_2 = -.9. Figure 1.9 shows a sample realization of such a process for σ_w = 1. We note the data exhibit a strong periodic component that makes a cycle about every six points.

Fig. 4.4. Theoretical spectra of white noise (top), a first-order moving average (middle), and a second-order autoregressive process (bottom).

To use Property 4.4, note that θ(z) = 1, φ(z) = 1 - z + .9z^2 and

    |φ(e^{-2πiω})|^2 = (1 - e^{-2πiω} + .9e^{-4πiω})(1 - e^{2πiω} + .9e^{4πiω})
                     = 2.81 - 1.9 (e^{2πiω} + e^{-2πiω}) + .9 (e^{4πiω} + e^{-4πiω})
                     = 2.81 - 3.8 cos(2πω) + 1.8 cos(4πω).

Using this result in (4.23), we have that the spectral density of x_t is

    f_x(ω) = σ_w^2 / [ 2.81 - 3.8 cos(2πω) + 1.8 cos(4πω) ].

Setting σ_w = 1, the bottom of Figure 4.4 displays f_x(ω) and shows a strong power component at about ω = .16 cycles per point, or a period between six and seven points per cycle, and very little power at other frequencies. In this case, modifying the white noise series by applying the second-order AR operator has concentrated the power or variance of the resulting series in a very narrow frequency band.

The spectral density can also be obtained from first principles, without having to use Property 4.4. Because w_t = x_t - x_{t-1} + .9x_{t-2} in this example, we have

    γ_w(h) = cov(w_{t+h}, w_t)
           = cov( x_{t+h} - x_{t+h-1} + .9x_{t+h-2},  x_t - x_{t-1} + .9x_{t-2} )
           = 2.81 γ_x(h) - 1.9 [ γ_x(h + 1) + γ_x(h - 1) ] + .9 [ γ_x(h + 2) + γ_x(h - 2) ].

Now, substituting the spectral representation (4.16) for γ_x(h) in the above equation yields

    γ_w(h) = ∫_{-1/2}^{1/2} [ 2.81 - 1.9 (e^{2πiω} + e^{-2πiω}) + .9 (e^{4πiω} + e^{-4πiω}) ] e^{2πiωh} f_x(ω) dω
           = ∫_{-1/2}^{1/2} [ 2.81 - 3.8 cos(2πω) + 1.8 cos(4πω) ] e^{2πiωh} f_x(ω) dω.

If the spectrum of the white noise process, w_t, is g_w(ω), the uniqueness of the Fourier transform allows us to identify

    g_w(ω) = [ 2.81 - 3.8 cos(2πω) + 1.8 cos(4πω) ] f_x(ω).

But, as we have already seen, g_w(ω) = σ_w^2, from which we deduce that

    f_x(ω) = σ_w^2 / [ 2.81 - 3.8 cos(2πω) + 1.8 cos(4πω) ]

is the spectrum of the autoregressive series. To reproduce Figure 4.4, use arma.spec from astsa:

par(mfrow=c(3,1))
arma.spec(log="no", main="White Noise")
arma.spec(ma=.5, log="no", main="Moving Average")
arma.spec(ar=c(1,-.9), log="no", main="Autoregression")

Example 4.8 Every Explosion has a Cause (cont)
In Example 3.4, we discussed the fact that explosive models have causal counterparts. In that example, we also indicated that it was easier to show this result in general in the spectral domain. In this example, we give the details for an AR(1) model, but the techniques used here will indicate how to generalize the result.

As in Example 3.4, we suppose that x_t = 2x_{t-1} + w_t, where w_t ~ iid N(0, σ_w^2). Then, the spectral density of x_t is

    f_x(ω) = σ_w^2 |1 - 2e^{-2πiω}|^{-2}.                                                   (4.25)

But,

    |1 - 2e^{-2πiω}|^2 = (1 - 2e^{-2πiω})(1 - 2e^{2πiω}) = 2^2 |1 - (1/2) e^{-2πiω}|^2.

Thus, (4.25) can be written as

    f_x(ω) = (1/4) σ_w^2 |1 - (1/2) e^{-2πiω}|^{-2},

which implies that x_t = (1/2) x_{t-1} + v_t, with v_t ~ iid N(0, (1/4) σ_w^2), is an equivalent form of the model.

We end this section by mentioning another spectral representation that deals with the process directly. In nontechnical terms, the result suggests that (4.4) is approximately true for any stationary time series, and this gives an additional theoretical justification for decomposing time series into harmonic components.
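Before moving on, note that the equivalence in Example 4.8 is easy to verify numerically; the short sketch below simply evaluates the two spectral densities on a grid (taking σ_w^2 = 1) and confirms that they coincide.

w = seq(-.5, .5, by=.01)
f1 = 1 / Mod(1 - 2*exp(-2i*pi*w))^2          # spectrum (4.25) of the explosive AR(1)
f2 = (1/4) / Mod(1 - .5*exp(-2i*pi*w))^2     # spectrum of the causal version
max(abs(f1 - f2))                            # essentially zero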

Example 4.9 A Periodic Stationary Process (cont)
In Example 4.4, we considered the periodic stationary process given in (4.13), namely, x_t = U_1 cos(2πω_0 t) + U_2 sin(2πω_0 t). Using Footnote 4.2, we may write this as

    x_t = (1/2)(U_1 - iU_2) e^{2πiω_0 t} + (1/2)(U_1 + iU_2) e^{-2πiω_0 t},

where we recall that U_1 and U_2 are uncorrelated, mean-zero, random variables each with variance σ^2. If we call Z = (1/2)(U_1 - iU_2), then Z* = (1/2)(U_1 + iU_2), where * denotes conjugation. In this case, E(Z) = (1/2)[E(U_1) - iE(U_2)] = 0 and similarly E(Z*) = 0. For mean-zero complex random variables, say X and Y, cov(X, Y) = E(XY*). Thus

    var(Z) = E(|Z|^2) = E(Z Z*) = (1/4) E[(U_1 - iU_2)(U_1 + iU_2)]
           = (1/4) [ E(U_1^2) + E(U_2^2) ] = σ^2/2.

Similarly, var(Z*) = σ^2/2. Moreover, since Z** = Z,

    cov(Z, Z*) = E(Z Z**) = E(Z Z) = (1/4) E[(U_1 - iU_2)(U_1 - iU_2)]
               = (1/4) [ E(U_1^2) - E(U_2^2) ] = 0.

Hence, (4.13) may be written as

    x_t = Z e^{2πiω_0 t} + Z* e^{-2πiω_0 t} = ∫_{-1/2}^{1/2} e^{2πiωt} dZ(ω),

where Z(ω) is a complex-valued random process that makes uncorrelated jumps at -ω_0 and ω_0 with mean-zero and variance σ^2/2. Stochastic integration is discussed further in Section C.4.2. This notion generalizes to all stationary series in the following property (also, see Theorem C.2).

Property 4.5 Spectral Representation of a Stationary Process
If x_t is a mean-zero stationary process with spectral distribution F(ω) as given in Property 4.1, then there exists a complex-valued stochastic process Z(ω), on the interval ω ∈ [-1/2, 1/2], having stationary uncorrelated non-overlapping increments, such that x_t can be written as the stochastic integral (see Section C.4.2)

    x_t = ∫_{-1/2}^{1/2} e^{2πiωt} dZ(ω),

where, for -1/2 ≤ ω_1 ≤ ω_2 ≤ 1/2,

    var{ Z(ω_2) - Z(ω_1) } = F(ω_2) - F(ω_1).

4.3 Periodogram and Discrete Fourier Transform

We are now ready to tie together the periodogram, which is the sample-based concept presented in Section 4.1, with the spectral density, which is the population-based concept of Section 4.2.

Definition 4.1 Given data x_1, ..., x_n, we define the discrete Fourier transform (DFT) to be

    d(ω_j) = n^{-1/2} Σ_{t=1}^n x_t e^{-2πiω_j t}                                           (4.26)

for j = 0, 1, ..., n - 1, where the frequencies ω_j = j/n are called the Fourier or fundamental frequencies.

If n is a highly composite integer (i.e., it has many factors), the DFT can be computed by the fast Fourier transform (FFT) introduced in Cooley and Tukey (1965). Also, different packages scale the FFT differently, so it is a good idea to consult the documentation. R computes the DFT defined in (4.26) without the factor n^{-1/2}, but with an additional factor of e^{2πiω_j} that can be ignored because we will be interested in the squared modulus of the DFT. Sometimes it is helpful to exploit the inversion result for DFTs, which shows the linear transformation is one-to-one. For the inverse DFT we have

    x_t = n^{-1/2} Σ_{j=0}^{n-1} d(ω_j) e^{2πiω_j t}                                        (4.27)

for t = 1, ..., n. The following example shows how to calculate the DFT and its inverse in R for the data set {1, 2, 3, 4}; note that R writes a complex number z = a + ib as a+bi.

(dft = fft(1:4)/sqrt(4))
[1]  5+0i -1+1i -1+0i -1-1i
(idft = fft(dft, inverse=TRUE)/sqrt(4))
[1] 1+0i 2+0i 3+0i 4+0i
(Re(idft))   # keep it real
[1] 1 2 3 4

We now define the periodogram as the squared modulus of the DFT.

Definition 4.2 Given data x_1, ..., x_n, we define the periodogram to be

    I(ω_j) = |d(ω_j)|^2                                                                     (4.28)

for j = 0, 1, 2, ..., n - 1.

Note that I(0) = n x̄^2, where x̄ is the sample mean. Also, Σ_{t=1}^n exp(-2πitj/n) = 0 for j ≠ 0 (see Footnote 4.3), so we can write the DFT as

    d(ω_j) = n^{-1/2} Σ_{t=1}^n (x_t - x̄) e^{-2πiω_j t}                                     (4.29)

for j ≠ 0. Thus,

[Footnote 4.3] Σ_{t=1}^n z^t = z (1 - z^n)/(1 - z) for z ≠ 1. In this case, z = e^{-2πij/n}, so that z^n = 1.

    I(ω_j) = |d(ω_j)|^2 = n^{-1} Σ_{t=1}^n Σ_{s=1}^n (x_t - x̄)(x_s - x̄) e^{-2πiω_j (t-s)}
           = n^{-1} Σ_{h=-(n-1)}^{n-1} Σ_{t=1}^{n-|h|} (x_{t+|h|} - x̄)(x_t - x̄) e^{-2πiω_j h}
           = Σ_{h=-(n-1)}^{n-1} γ̂(h) e^{-2πiω_j h}                                          (4.30)

for j ≠ 0, where we have put h = t - s, with γ̂(h) as given in (1.36) (see Footnote 4.4).

In view of (4.30), the periodogram, I(ω_j), is the sample version of f(ω_j) given in (4.17). That is, we may think of the periodogram as the sample spectral density of x_t.

At first, (4.30) seems to be an obvious way to estimate a spectral density (4.17); i.e., simply put a hat on γ(h) and sum as far as the sample size will allow. However, after further consideration, it turns out that this is not a very good estimator because it uses some bad estimates of γ(h). For example, there is only one pair of observations, (x_1, x_n), for estimating γ(n - 1), and only two pairs, (x_1, x_{n-1}) and (x_2, x_n), that can be used to estimate γ(n - 2), and so on. We will discuss this problem further as we progress, but an obvious improvement over (4.30) would be something like f̂(ω) = Σ_{|h| ≤ m} γ̂(h) e^{-2πiωh}, where m is much smaller than n.

It is sometimes useful to work with the real and imaginary parts of the DFT individually. To this end, we define the following transforms.

Definition 4.3 Given data x_1, ..., x_n, we define the cosine transform

    d_c(ω_j) = n^{-1/2} Σ_{t=1}^n x_t cos(2πω_j t)                                          (4.31)

and the sine transform

    d_s(ω_j) = n^{-1/2} Σ_{t=1}^n x_t sin(2πω_j t),                                         (4.32)

where ω_j = j/n for j = 0, 1, ..., n - 1.

We note that d(ω_j) = d_c(ω_j) - i d_s(ω_j) and hence

    I(ω_j) = d_c^2(ω_j) + d_s^2(ω_j).                                                       (4.33)

We have also discussed the fact that spectral analysis can be thought of as an analysis of variance. The next example examines this notion.

[Footnote 4.4] Note that (4.30) can be used to obtain γ̂(h) by taking the inverse DFT of I(ω_j). This approach was used in Example 1.31 to obtain a two-dimensional ACF.
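First, though, a direct numerical check of (4.30) is straightforward; the sketch below uses an arbitrary simulated series (not a data set from the text) and compares a periodogram ordinate computed from the DFT with the corresponding weighted sum of sample autocovariances.

set.seed(1)
x = rnorm(32); n = length(x)
j = 5                                                # any nonzero Fourier index
I.dft = Mod(fft(x - mean(x))[j+1])^2 / n             # |d(omega_j)|^2; R's fft is unscaled
gam = acf(x, lag.max=n-1, type="covariance", plot=FALSE)$acf[,1,1]   # gamma-hat(0),...,gamma-hat(n-1)
I.acf = gam[1] + 2*sum(gam[2:n]*cos(2*pi*(j/n)*(1:(n-1))))           # right side of (4.30)
c(I.dft, I.acf)                                      # the two values agree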

Example 4.10 Spectral ANOVA
Let x_1, ..., x_n be a sample of size n, where for ease, n is odd. Then, recalling Example 4.2,

    x_t = a_0 + Σ_{j=1}^{m} [ a_j cos(2πω_j t) + b_j sin(2πω_j t) ],                        (4.34)

where m = (n - 1)/2, is exact for t = 1, ..., n. In particular, using multiple regression formulas, we have a_0 = x̄,

    a_j = (2/n) Σ_{t=1}^n x_t cos(2πω_j t) = (2/√n) d_c(ω_j),
    b_j = (2/n) Σ_{t=1}^n x_t sin(2πω_j t) = (2/√n) d_s(ω_j).

Hence, we may write

    (x_t - x̄) = (2/√n) Σ_{j=1}^{m} [ d_c(ω_j) cos(2πω_j t) + d_s(ω_j) sin(2πω_j t) ]

for t = 1, ..., n. Squaring both sides and summing, we obtain

    Σ_{t=1}^n (x_t - x̄)^2 = 2 Σ_{j=1}^{m} [ d_c^2(ω_j) + d_s^2(ω_j) ] = 2 Σ_{j=1}^{m} I(ω_j),

using the results of Problem 4.1. Thus, we have partitioned the sum of squares into harmonic components represented by frequency ω_j, with the periodogram, I(ω_j), being the mean square regression. This leads to the ANOVA table for n odd:

    Source   df     SS                      MS
    ω_1      2      2 I(ω_1)                I(ω_1)
    ω_2      2      2 I(ω_2)                I(ω_2)
    ...      ...    ...                     ...
    ω_m      2      2 I(ω_m)                I(ω_m)
    Total    n - 1  Σ_{t=1}^n (x_t - x̄)^2

The following is an R example to help explain this concept. We consider n = 5 observations given by x_1 = 1, x_2 = 2, x_3 = 3, x_4 = 2, x_5 = 1. Note that the data complete one cycle, but not in a sinusoidal way. Thus, we should expect the ω_1 = 1/5 component to be relatively large but not exhaustive, and the ω_2 = 2/5 component to be small.

x = c(1, 2, 3, 2, 1)
c1 = cos(2*pi*1:5*1/5); s1 = sin(2*pi*1:5*1/5)
c2 = cos(2*pi*1:5*2/5); s2 = sin(2*pi*1:5*2/5)

omega1 = cbind(c1, s1); omega2 = cbind(c2, s2)
anova(lm(x~omega1+omega2))   # ANOVA Table
             Df   Sum Sq   Mean Sq
  omega1      2   2.74164  1.37082
  omega2      2    .05836   .02918
  Residuals   0    .00000
Mod(fft(x))^2/5   # the periodogram (as a check)
[1] 16.2  1.37082  .029179  .029179  1.37082
#    I(0)  I(1/5)   I(2/5)   I(3/5)   I(4/5)

Note that I(0) = 16.2 = n x̄^2 = 5 × 1.8^2. Also, the sum of squares associated with the residuals (SSE) is zero, indicating an exact fit.

Example 4.11 Spectral Analysis as Principal Component Analysis
It is also possible to think of spectral analysis as a principal component analysis. In Section C.5, we show that the spectral density may be thought of as the approximate eigenvalues of the covariance matrix of a stationary process. If X = (x_1, ..., x_n)' are n values of a mean-zero time series, x_t, with spectral density f_x(ω), then

    Γ_n = cov(X) = [ γ(0)    γ(1)    ...  γ(n-1)
                     γ(1)    γ(0)    ...  γ(n-2)
                      ...     ...    ...   ...
                     γ(n-1)  γ(n-2)  ...  γ(0)   ].

For n sufficiently large, the eigenvalues of Γ_n are

    λ_j ≈ f(ω_j) = Σ_{h=-∞}^{∞} γ(h) e^{-2πihj/n},

with approximate eigenvectors

    g_j* = n^{-1/2} ( e^{-2πi0j/n}, e^{-2πi1j/n}, ..., e^{-2πi(n-1)j/n} ),

for j = 0, 1, ..., n - 1. If we let G be the complex matrix with columns g_j, then the complex vector Y = G* X has elements that are the DFTs,

    y_j = n^{-1/2} Σ_{t=1}^n x_t e^{-2πitj/n}

for j = 0, 1, ..., n - 1. In this case, the elements of Y are asymptotically uncorrelated complex random variables, with mean-zero and variance f(ω_j). Also, X may be recovered as X = GY, so that x_t = n^{-1/2} Σ_{j=0}^{n-1} y_j e^{2πitj/n}.

We are now ready to present some large sample properties of the periodogram. First, let μ be the mean of a stationary process x_t with absolutely summable autocovariance function γ(h) and spectral density f(ω). We can use the same argument as in (4.30), replacing x̄ by μ in (4.29), to write

    I(ω_j) = n^{-1} Σ_{h=-(n-1)}^{n-1} Σ_{t=1}^{n-|h|} (x_{t+|h|} - μ)(x_t - μ) e^{-2πiω_j h},   (4.35)

where ω_j is a non-zero fundamental frequency. Taking expectation in (4.35) we obtain

    E[I(ω_j)] = Σ_{h=-(n-1)}^{n-1} ( (n - |h|)/n ) γ(h) e^{-2πiω_j h}.                      (4.36)

For any given ω ≠ 0, choose a sequence of fundamental frequencies ω_{j:n} → ω (see Footnote 4.5), from which it follows by (4.36) that, as n → ∞ (see Footnote 4.6),

    E[I(ω_{j:n})] → f(ω) = Σ_{h=-∞}^{∞} γ(h) e^{-2πiωh}.                                    (4.37)

In other words, under absolute summability of γ(h), the spectral density is the long-term average of the periodogram.

Additional asymptotic properties may be established under the condition that the autocovariance function satisfies

    θ = Σ_{h=-∞}^{∞} |h| |γ(h)| < ∞.                                                        (4.38)

First, we note that straightforward calculations lead to

    cov[d_c(ω_j), d_c(ω_k)] = n^{-1} Σ_{s=1}^n Σ_{t=1}^n γ(s - t) cos(2πω_j s) cos(2πω_k t),   (4.39)
    cov[d_c(ω_j), d_s(ω_k)] = n^{-1} Σ_{s=1}^n Σ_{t=1}^n γ(s - t) cos(2πω_j s) sin(2πω_k t),   (4.40)
    cov[d_s(ω_j), d_s(ω_k)] = n^{-1} Σ_{s=1}^n Σ_{t=1}^n γ(s - t) sin(2πω_j s) sin(2πω_k t),   (4.41)

where the variance terms are obtained by setting ω_j = ω_k in (4.39) and (4.41).

In Appendix C, Section C.2, we show the terms in (4.39)–(4.41) have interesting properties under the assumption that (4.38) holds. In particular, for ω_j, ω_k ≠ 0 or 1/2,

    cov[d_c(ω_j), d_c(ω_k)] = f(ω_j)/2 + ε_n   if ω_j = ω_k,
                            = ε_n              if ω_j ≠ ω_k,                                (4.42)

[Footnote 4.5] By this we mean ω_{j:n} = j_n/n, where {j_n} is a sequence of integers chosen so that j_n/n is the closest Fourier frequency to ω; consequently, |j_n/n - ω| ≤ 1/(2n).
[Footnote 4.6] From Definition 4.2 we have I(0) = n x̄^2, so the analogous result of (4.37) for the case ω = 0 is E[I(0)] - nμ^2 = n var(x̄) → f(0) as n → ∞.

197 i i “tsa4_trimmed” — 2017/12/8 — 15:01 — page 187 — #197 i i 187 4.3 Periodogram and Discrete Fourier Transform { 2 f ( ω ω )/ , + ε = ω j n j k ω [ ) , d (4.43) ( ω d )] = cov ( k j s s , ω , ω ε n j k and d = ( ω cov ) , d (4.44) ( ω , )] [ ε n s c j k where the error term ε in the approximations can be bounded, n n | ≤ θ / | , (4.45) ε n θ ω 1 = ω is given by (4.38). If = 0 or and / 2 in (4.42), the multiplier 1 / 2 disappears; j k 1 ( 0 ) = d , so (4.43) does not apply in these cases. ( note that / 2 ) = 0 d s s Example 4.12 Covariance of Sine and Cosine Transforms n = 256 For the three-point moving average series of Example 1.9 and obser- vations, the theoretical covariance matrix of the vector = ( d D ( ω , ) , d ) ( ω 26 c s 26 ′ d ω ) , d using (4.39)–(4.41) is ( ω ( )) 27 c 27 s − . − . 3752 0010 . 0009 − . 0022 © ™ ≠ Æ . − 3777 . . 0009 0003 − . 0009 ≠ Æ . D = ) cov ( ≠ Æ ≠ Æ 0009 . 3667 − . 0010 − . 0022 − . ≠ Æ . 3692 0003 − 0010 0010 . . . − ́ ̈ The diagonal elements can be compared with half the theoretical spectral val- 1 ues of f ( ω 256 ) = . , and of for the spectrum at frequency ω / = 26 3774 26 26 2 1 . Hence, the cosine and sine 256 / 27 = f ( ω ω ) = . 3689 for the spectrum at 27 27 2 transforms produce nearly uncorrelated variables with variances approximately equal to one half of the theoretical spectrum. For this particular case, the uniform θ = 8 / 9 , yielding | ε bound is determined from | ≤ . 0035 for the bound on the 256 approximation error. 4.7 2 ∼ iid ( 0 x If ) , then it follows from (4.38)–(4.44), and a central limit theorem , σ t that 2 2 ( ω ) 2 / )∼ AN ( 0 , σ , σ / 2 ) and d d ( ω (4.46) 0 ( )∼ AN n j s n : j c : jointly and independently, and independent of d → ( ω ω provided ) ) and d ω ( j s n n : k : : n c k 2 . and ω σ = ) → ω ω ω 0 < ω ( , ω f < 1 / 2 . We note that in this case, where 2 x 1 2 n : k 1 In view of (4.46), it follows immediately that as , →∞ n 2 ( ) I ω 2 ) ( I ω d d n : j k : n 2 2 → χ (4.47) χ → and 2 2 2 2 σ σ 2 denotes a chi- ω with I χ ) ( I ( ω being asymptotically independent, where ) and n : k n : j ν squared random variable with ν degrees of freedom. If the process is also Gaussian, then the above statements are true for any sample size. Using the central limit theory of Section C.2, it is fairly easy to extend the results of the iid case to the case of a linear process. Õ n 2 2 4 . 2 7 } ∼ iid ( 0 , σ If ) and { a n } are constants for which , then { as →∞ Y a →∞ a / max j j j ≤ n 1 ≤ j 1 = j j ) ( Õ Õ n n 2 2 ; see Definition A.5. asymptotically normal . AN is read a AN Y 0 , σ a ∼ j j 1 j = 1 = j j i i i i
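The entries in cov(D) can be reproduced by brute force from (4.39); the sketch below does this for the (1,1) element, using the autocovariances γ_v(0) = 3/9, γ_v(±1) = 2/9, γ_v(±2) = 1/9 of the three-point moving average with unit noise variance.

n = 256; w26 = 26/n
gam = function(h) ifelse(abs(h)==0, 3/9, ifelse(abs(h)==1, 2/9, ifelse(abs(h)==2, 1/9, 0)))
s = 1:n; t = 1:n
G  = outer(s, t, function(s,t) gam(s-t))                        # matrix of gamma(s - t)
CC = outer(s, t, function(s,t) cos(2*pi*w26*s)*cos(2*pi*w26*t)) # cosine products in (4.39)
sum(G*CC)/n                                                     # compare with the (1,1) entry of cov(D)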

Property 4.6 Distribution of the Periodogram Ordinates
If

    x_t = Σ_{j=-∞}^{∞} ψ_j w_{t-j},   Σ_{j=-∞}^{∞} |ψ_j| < ∞,                               (4.48)

where w_t ~ iid(0, σ_w^2), and (4.38) holds, then for any collection of m distinct frequencies ω_j ∈ (0, 1/2) with ω_{j:n} → ω_j,

    2 I(ω_{j:n}) / f(ω_j)  →_d  iid χ_2^2,                                                  (4.49)

provided f(ω_j) > 0, for j = 1, ..., m.

This result is stated more precisely in Theorem C.7. Other approaches to large sample normality of the periodogram ordinates are in terms of cumulants, as in Brillinger (1981), or in terms of mixing conditions, such as in Rosenblatt (1956a). Here, we adopt the approach used by Hannan (1970), Fuller (1996), and Brockwell and Davis (1991).

The distributional result (4.49) can be used to derive an approximate confidence interval for the spectrum in the usual way. Let χ_ν^2(α) denote the lower α probability tail for the chi-squared distribution with ν degrees of freedom; that is,

    Pr{ χ_ν^2 ≤ χ_ν^2(α) } = α.                                                             (4.50)

Then, an approximate 100(1 - α)% confidence interval for the spectral density function would be of the form

    2 I(ω_{j:n}) / χ_2^2(1 - α/2)  ≤  f(ω)  ≤  2 I(ω_{j:n}) / χ_2^2(α/2).                   (4.51)

Often, trends are present that should be eliminated before computing the periodogram. Trends introduce extremely low frequency components in the periodogram that tend to obscure the appearance at higher frequencies. For this reason, it is usually conventional to center the data prior to a spectral analysis using either mean-adjusted data of the form x_t - x̄ to eliminate the zero or d-c component or to use detrended data of the form x_t - β̂_1 - β̂_2 t to eliminate the term that will be considered a half cycle by the spectral analysis. Note that higher order polynomial regressions in t or nonparametric smoothing (linear filtering) could be used in cases where the trend is nonlinear.

As previously indicated, it is often convenient to calculate the DFTs, and hence the periodogram, using the fast Fourier transform algorithm. The FFT utilizes a number of redundancies in the calculation of the DFT when n is highly composite; that is, an integer with many factors of 2, 3, or 5, the best case being when n = 2^p is a factor of 2. Details may be found in Cooley and Tukey (1965). To accommodate this property, we can pad the centered (or detrended) data of length n to the next highly composite integer n' by adding zeros, i.e., setting x^c_{n+1} = x^c_{n+2} = ··· = x^c_{n'} = 0, where x^c_t

Fig. 4.5. Periodogram of SOI and Recruitment, n = 453 (n' = 480), where the frequency axis is labeled in multiples of Δ = 1/12. Note the common peaks at ω = 1Δ = 1/12, or one cycle per year (12 months), and some larger values near ω = (1/4)Δ = 1/48, or one cycle every four years (48 months).

denotes the centered data. This means that the fundamental frequency ordinates will be ω_j = j/n' instead of j/n. We illustrate by considering the periodogram of the SOI and Recruitment series shown in Figure 1.5. Recall that they are monthly series and n = 453 months. To find n' in R, use the command nextn(453) to see that n' = 480 will be used in the spectral analyses by default.

Example 4.13 Periodogram of SOI and Recruitment Series
Figure 4.5 shows the periodograms of each series, where the frequency axis is labeled in multiples of Δ = 1/12. As previously indicated, the centered data have been padded to a series of length 480. We notice a narrow-band peak at the obvious yearly (12 month) cycle, ω = 1Δ = 1/12. In addition, there is considerable power in a wide band at the lower frequencies that is centered around the four-year (48 month) cycle ω = (1/4)Δ = 1/48, representing a possible El Niño effect. This wide band activity suggests that the possible El Niño cycle is irregular, but tends to be around four years on average. We will continue to address this problem as we move to more sophisticated analyses.

Noting χ_2^2(.025) = .05 and χ_2^2(.975) = 7.38, we can obtain approximate 95% confidence intervals for the frequencies of interest. For example, the periodogram

of the SOI series is I_S(1/12) = .97 at the yearly cycle. An approximate 95% confidence interval for the spectrum f_S(1/12) is then

    [ 2(.97)/7.38,  2(.97)/.05 ] = [ .26, 38.4 ],

which is too wide to be of much use. We do notice, however, that the lower value of .26 is higher than any other periodogram ordinate, so it is safe to say that this value is significant. On the other hand, an approximate 95% confidence interval for the spectrum at the four-year cycle, f_S(1/48), is

    [ 2(.05)/7.38,  2(.05)/.05 ] = [ .01, 2.12 ],

which again is extremely wide, and with which we are unable to establish significance of the peak.

We now give the R commands that can be used to reproduce Figure 4.5. To calculate and graph the periodogram, we used the mvspec command available in astsa. We note that the value of Δ is the reciprocal of the value of frequency for the data of a time series object. If the data are not a time series object, frequency is set to 1. Also, we set log="no" because the periodogram is plotted on a log_10 scale by default. Figure 4.5 displays a bandwidth. We will discuss bandwidth in the next section, so ignore this for the time being.

par(mfrow=c(2,1))
soi.per = mvspec(soi, log="no")
abline(v=1/4, lty=2)
rec.per = mvspec(rec, log="no")
abline(v=1/4, lty=2)

The confidence intervals for the SOI series at the yearly cycle, ω = 1/12 = 40/480, and the possible El Niño cycle of four years, ω = 1/48 = 10/480, can be computed in R as follows:

soi.per$spec[40]          # 0.97223; soi pgram at freq 1/12 = 40/480
soi.per$spec[10]          # 0.05372; soi pgram at freq 1/48 = 10/480
# conf intervals - returned value:
U = qchisq(.025,2)        # 0.05063
L = qchisq(.975,2)        # 7.37775
2*soi.per$spec[10]/L      # 0.01456
2*soi.per$spec[10]/U      # 2.12220
2*soi.per$spec[40]/L      # 0.26355
2*soi.per$spec[40]/U      # 38.40108

The preceding example made it clear that the periodogram as an estimator is susceptible to large uncertainties, and we need to find a way to reduce the variance. Not surprisingly, this result follows if we consider (4.49) and the fact that, for any n, the periodogram is based on only two observations. Recall that the mean and variance of the χ_ν^2 distribution are ν and 2ν, respectively. Thus, using (4.49), we have, approximately,

    I(ω) ~ (f(ω)/2) χ_2^2,

implying

    E[I(ω)] ≈ f(ω)   and   var[I(ω)] ≈ f^2(ω).

Consequently, var[I(ω)] does not go to 0 as n → ∞, and thus the periodogram is not a consistent estimator of the spectral density. The solution to this dilemma can be resolved by smoothing the periodogram.
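A short simulation illustrates the point; for Gaussian white noise with unit variance, f(ω) = 1, and the periodogram ordinates keep mean and variance near 1 no matter how large n gets. The sample sizes below are arbitrary.

set.seed(666)
for (n in c(128, 512, 2048)) {
  I = replicate(500, Mod(fft(rnorm(n))[2:(n/2)])^2 / n)   # periodogram at nonzero frequencies
  cat("n =", n, " mean:", round(mean(I), 2), " var:", round(var(as.vector(I)), 2), "\n")
}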

Fig. 4.6. A small section (near the peak) of the AR(2) spectrum shown in Figure 4.4.

4.4 Nonparametric Spectral Estimation

To continue the discussion that ended the previous section, we introduce a frequency band, B, of L << n contiguous fundamental frequencies, centered around frequency ω_j = j/n, which is chosen close to a frequency of interest, ω. For frequencies of the form ω* = ω_j + k/n, let

    B = { ω* : ω_j - m/n ≤ ω* ≤ ω_j + m/n },                                                (4.52)

where

    L = 2m + 1                                                                              (4.53)

is an odd number, chosen such that the spectral values in the interval B,

    f(ω_j + k/n),   k = -m, ..., 0, ..., m,

are approximately equal to f(ω). This structure can be realized for large sample sizes, as shown formally in Section C.2.

Values of the spectrum in this band should be relatively constant for the smoothed spectra defined below to be good estimators. For example, to see a small section of the AR(2) spectrum (near the peak) shown in Figure 4.4, use

arma.spec(ar=c(1,-.9), xlim=c(.15,.151), n.freq=100000)

which is displayed in Figure 4.6.

We now define an averaged (or smoothed) periodogram as the average of the periodogram values, say,

    f̄(ω) = (1/L) Σ_{k=-m}^{m} I(ω_j + k/n),                                                 (4.54)

over the band B. Under the assumption that the spectral density is fairly constant in the band B, and in view of (4.49), we can show that under appropriate conditions (see Footnote 4.8),

[Footnote 4.8] The conditions, which are sufficient, are that x_t is a linear process, as described in Property 4.6, with Σ_j √|j| |ψ_j| < ∞, and w_t has a finite fourth moment.

for large n, the periodograms in (4.54) are approximately distributed as independent f(ω) χ_2^2 / 2 random variables, for 0 < ω < 1/2, as long as we keep L fairly small relative to n. This result is discussed formally in Section C.2. Thus, under these conditions, L f̄(ω) is the sum of L approximately independent f(ω) χ_2^2 / 2 random variables. It follows that, for large n,

    2L f̄(ω) / f(ω)  ~·  χ_{2L}^2,                                                           (4.55)

where ~· means is approximately distributed as.

In this scenario, where we smooth the periodogram by simple averaging, it seems reasonable to call the width of the frequency interval defined by (4.52),

    B = L/n,                                                                                (4.56)

the bandwidth (see Footnote 4.9). The concept of bandwidth, however, becomes more complicated with the introduction of spectral estimators that smooth with unequal weights. Note that (4.56) implies the degrees of freedom can be expressed as

    2L = 2Bn,                                                                               (4.57)

or twice the time-bandwidth product. The result (4.55) can be rearranged to obtain an approximate 100(1 - α)% confidence interval of the form

    2L f̄(ω) / χ_{2L}^2(1 - α/2)  ≤  f(ω)  ≤  2L f̄(ω) / χ_{2L}^2(α/2)                        (4.58)

for the true spectrum, f(ω).

Many times, the visual impact of a spectral density plot will be improved by plotting the logarithm of the spectrum instead of the spectrum (the log transformation is the variance stabilizing transformation in this situation). This phenomenon can occur when regions of the spectrum exist with peaks of interest much smaller than some of the main power components. Taking logs in (4.58), we obtain an interval for the logged spectrum given by

    [ log f̄(ω) - a_L,  log f̄(ω) + b_L ],                                                    (4.59)

where

[Footnote 4.9] There are many definitions of bandwidth and an excellent discussion may be found in Percival and Walden (1993, §6.7). The bandwidth value used in R for spec.pgram is based on Grenander (1951). The basic idea is that bandwidth can be related to the standard deviation of the weighting distribution. For the uniform distribution on the frequency range -m/n to m/n, the standard deviation is L/(n√12) (using a continuity correction). Consequently, in the case of (4.54), R will report a bandwidth of L/(n√12), which amounts to dividing our definition by √12. Note that in the extreme case L = n, we would have B = 1, indicating that everything was used in the estimation. In this case, R would report a bandwidth of 1/√12 ≈ .29, which seems to miss the point.

    a_L = -log 2L + log χ_{2L}^2(1 - α/2)   and   b_L = log 2L - log χ_{2L}^2(α/2)

do not depend on ω.

If zeros are appended before computing the spectral estimators, we need to adjust the degrees of freedom (because you do not get more information by padding), and an approximation is to replace 2L by 2Ln/n'. Hence, we define the adjusted degrees of freedom as

    df = 2Ln/n'                                                                             (4.60)

and use it instead of 2L in the confidence intervals (4.58) and (4.59). For example, (4.58) becomes

    df f̄(ω) / χ_{df}^2(1 - α/2)  ≤  f(ω)  ≤  df f̄(ω) / χ_{df}^2(α/2).                       (4.61)

A number of assumptions are made in computing the approximate confidence intervals given above, which may not hold in practice. In such cases, it may be reasonable to employ resampling techniques such as one of the parametric bootstraps proposed by Hurvich and Zeger (1987) or a nonparametric local bootstrap proposed by Paparoditis and Politis (1999). To develop the bootstrap distributions, we assume that the contiguous DFTs in a frequency band of the form (4.52) all came from a time series with identical spectrum f(ω). This, in fact, is exactly the same assumption made in deriving the large-sample theory. We may then simply resample the L DFTs in the band, with replacement, calculating a spectral estimate from each bootstrap sample. The sampling distribution of the bootstrap estimators approximates the distribution of the nonparametric spectral estimator. For further details, including the theoretical properties of such estimators, see Paparoditis and Politis (1999).

Before proceeding further, we consider computing the average periodograms for the SOI and Recruitment series.

Example 4.14 Averaged Periodogram for SOI and Recruitment
Generally, it is a good idea to try several bandwidths that seem to be compatible with the general overall shape of the spectrum, as suggested by the periodogram. We will discuss this problem in more detail after the example. The SOI and Recruitment series periodograms, previously computed in Figure 4.5, suggest the power in the lower El Niño frequency needs smoothing to identify the predominant overall period. Trying values of L leads to the choice L = 9 as a reasonable value, and the result is displayed in Figure 4.7.

The smoothed spectra shown provide a sensible compromise between the noisy version, shown in Figure 4.5, and a more heavily smoothed spectrum, which might lose some of the peaks. An undesirable effect of averaging can be noticed at the yearly cycle, ω = 1Δ, where the narrow band peaks that appeared in the periodograms in Figure 4.5 have been flattened and spread out to nearby frequencies. We also notice, and have marked, the appearance of harmonics of the yearly cycle, that is, frequencies of the form ω = kΔ for k = 1, 2, .... Harmonics typically occur when a periodic non-sinusoidal component is present; see Example 4.15.

Fig. 4.7. The averaged periodogram of the SOI and Recruitment series, n = 453, n' = 480, L = 9, df = 17, showing common peaks at the four year period, ω = (1/4)Δ = 1/48 cycles/month, the yearly period, ω = 1Δ = 1/12 cycles/month, and some of its harmonics ω = kΔ for k = 2, 3.

Figure 4.7 can be reproduced in R using the following commands. To compute averaged periodograms, use the Daniell kernel, and specify m, where L = 2m + 1 (L = 9 and m = 4 in this example). We will explain the kernel concept later in this section, specifically just prior to Example 4.16.

soi.ave = mvspec(soi, kernel('daniell',4), log='no')
abline(v=c(.25,1,2,3), lty=2)
soi.ave$bandwidth    # = 0.225
# Repeat above lines using rec in place of soi

The displayed bandwidth (.225) is adjusted for the fact that the frequency scale of the plot is in terms of cycles per year instead of cycles per month. Using (4.56), the bandwidth in terms of months is 9/480 = .01875; the displayed value is simply the bandwidth converted to years, .01875 × 12 = .225.

The adjusted degrees of freedom are df = 2(9)(453)/480 ≈ 17. We can use this value for the 95% confidence intervals, with χ_{df}^2(.025) = 7.56 and χ_{df}^2(.975) = 30.17. Substituting into (4.61) gives the intervals in Table 4.1 for the two frequency bands identified as having the maximum power. To examine the two peak power possibilities, we may look at the 95% confidence intervals and see whether the lower limits are substantially larger than adjacent baseline spectral levels. For example,

Fig. 4.8. Figure 4.7 with the average periodogram ordinates plotted on a log_10 scale. The display in the upper right-hand corner represents a generic 95% confidence interval where the middle tick mark is the width of the bandwidth.

the El Niño frequency of 48 months has lower limits that exceed the values the spectrum would have if there were simply a smooth underlying spectral function without the peaks. The relative distribution of power over frequencies is different, with the SOI having less power at the lower frequency, relative to the seasonal periods, and the Recruitment series having more power at the lower or El Niño frequency.

The entries in Table 4.1 for SOI can be obtained in R as follows:

df = soi.ave$df            # df = 16.9875  (returned values)
U = qchisq(.025, df)       # U = 7.555916
L = qchisq(.975, df)       # L = 30.17425
soi.ave$spec[10]           # 0.0495202
soi.ave$spec[40]           # 0.1190800
# intervals
df*soi.ave$spec[10]/L      # 0.0278789
df*soi.ave$spec[10]/U      # 0.1113333
df*soi.ave$spec[40]/L      # 0.0670396
df*soi.ave$spec[40]/U      # 0.2677201
# repeat above commands with soi replaced by rec

Finally, Figure 4.8 shows the averaged periodograms in Figure 4.7 plotted on a log_10 scale. This is the default, and it can be obtained by removing the statement log="no".
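As a check on what the smoothing in Example 4.14 is doing, the averaged estimate at a given frequency should be reproducible by simply averaging the L = 9 raw periodogram ordinates around it. Assuming the objects soi.per and soi.ave from the previous code chunks are still available, the following comparison should agree up to rounding (small discrepancies can occur at frequencies near the ends of the spectrum, but not here).

mean(soi.per$spec[36:44])   # average of the 9 raw ordinates around freq 40/480
soi.ave$spec[40]            # the smoothed value at the yearly cycle, 0.1190800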

Table 4.1. Confidence Intervals for the Spectra of the SOI and Recruitment Series

    Series             ω      Period    Power   Lower   Upper
    SOI                1/48   4 years    .05     .03     .11
                       1/12   1 year     .12     .07     .27
    Recruits (×10^2)   1/48   4 years   6.59    3.71   14.82
                       1/12   1 year    2.19    1.24    4.93

Notice that the default plot also shows a generic confidence interval of the form (4.59) (with log replaced by log_10) in the upper right-hand corner. To use it, imagine placing the middle tick mark (the width of which is the bandwidth) on the averaged periodogram ordinate of interest; the resulting bar then constitutes an approximate 95% confidence interval for the spectrum at that frequency. We note that displaying the estimates on a log scale tends to emphasize the harmonic components.

Example 4.15 Harmonics
In the previous example, we saw that the spectra of the annual signals displayed minor peaks at the harmonics; that is, the signal spectra had a large peak at ω = 1Δ = 1/12 cycles/month (the one-year cycle) and minor peaks at its harmonics ω = kΔ for k = 2, 3, ... (two-, three-, and so on, cycles per year). This will often be the case because most signals are not perfect sinusoids (or perfectly cyclic). In this case, the harmonics are needed to capture the non-sinusoidal behavior of the signal. As an example, consider the signal formed in Figure 4.9 from a (fundamental) sinusoid oscillating at two cycles per unit time along with the second through sixth harmonics at decreasing amplitudes. In particular, the signal was formed as

    x_t = sin(2π2t) + .5 sin(2π4t) + .4 sin(2π6t) + .3 sin(2π8t) + .2 sin(2π10t) + .1 sin(2π12t)      (4.62)

for 0 ≤ t ≤ 1. Notice that the signal is non-sinusoidal in appearance and rises quickly then falls slowly.

A figure similar to Figure 4.9 can be generated in R as follows.

t = seq(0, 1, by=1/200)
amps = c(1, .5, .4, .3, .2, .1)
x = matrix(0, 201, 6)
for (j in 1:6){ x[,j] = amps[j]*sin(2*pi*t*2*j) }
x = ts(cbind(x, rowSums(x)), start=0, deltat=1/200)
ts.plot(x, lty=c(1:6, 1), lwd=c(rep(1,6), 2), ylab="Sinusoids")
names = c("Fundamental","2nd Harmonic","3rd Harmonic","4th Harmonic",
          "5th Harmonic", "6th Harmonic", "Formed Signal")
legend("topright", names, lty=c(1:6, 1), lwd=c(rep(1,6), 2))

Example 4.14 points out the necessity for having some relatively systematic procedure for deciding whether peaks are significant. The question of deciding whether a single peak is significant usually rests on establishing what we might think of as a baseline level for the spectrum, defined rather loosely as the shape that one would

Fig. 4.9. A signal (thick solid line) formed by a fundamental sinusoid (thin solid line) oscillating at two cycles per unit time and its harmonics as specified in (4.62).

expect to see if no spectral peaks were present. This profile can usually be guessed by looking at the overall shape of the spectrum that includes the peaks; usually, a kind of baseline level will be apparent, with the peaks seeming to emerge from this baseline level. If the lower confidence limit for the spectral value is still greater than the baseline level at some predetermined level of significance, we may claim that frequency value as a statistically significant peak. To be consistent with our stated indifference to the upper limits, we might use a one-sided confidence interval.

An important aspect of interpreting the significance of confidence intervals and tests involving spectra is that typically, more than one frequency will be of interest, so that we will potentially be interested in simultaneous statements about a whole collection of frequencies. For example, it would be unfair to claim in Table 4.1 the two frequencies of interest as being statistically significant and all other potential candidates as nonsignificant at the overall level of α = .05. In this case, we follow the usual statistical approach, noting that if K statements S_1, S_2, ..., S_K are made at significance level α, i.e., P{S_k} = 1 - α, then the overall probability all statements are true satisfies the Bonferroni inequality

    P{ all S_k true } ≥ 1 - Kα.                                                             (4.63)

For this reason, it is desirable to set the significance level for testing each frequency at α/K if there are K potential frequencies of interest. If, a priori, potentially K = 10 frequencies are of interest, setting α = .01 would give an overall significance level with bound .10.

The use of the confidence intervals and the necessity for smoothing requires that we make a decision about the bandwidth B over which the spectrum will be essentially

The use of the confidence intervals and the necessity for smoothing requires that we make a decision about the bandwidth B over which the spectrum will be essentially constant. Taking too broad a band will tend to smooth out valid peaks in the data when the constant variance assumption is not met over the band. Taking too narrow a band will lead to confidence intervals so wide that peaks are no longer statistically significant. Thus, we note that there is a conflict here between variance properties, or bandwidth stability, which can be improved by increasing B, and resolution, which can be improved by decreasing B. A common approach is to try a number of different bandwidths and to look qualitatively at the spectral estimators for each case.

To address the problem of resolution, it should be evident that the flattening of the peaks in Figure 4.7 and Figure 4.8 was due to the fact that simple averaging was used in computing $\bar f(\omega)$ defined in (4.54). There is no particular reason to use simple averaging, and we might improve the estimator by employing a weighted average, say
\[
\hat f(\omega) = \sum_{k=-m}^{m} h_k\, I(\omega_j + k/n), \tag{4.64}
\]
using the same definitions as in (4.54) but where the weights $h_k > 0$ satisfy
\[
\sum_{k=-m}^{m} h_k = 1.
\]
In particular, it seems reasonable that the resolution of the estimator will improve if we use weights that decrease as distance from the center weight $h_0$ increases; we will return to this idea shortly. To obtain the averaged periodogram, $\bar f(\omega)$, in (4.64), set $h_k = L^{-1}$ for all $k$, where $L = 2m+1$. The asymptotic theory established for $\bar f(\omega)$ still holds for $\hat f(\omega)$ provided that the weights satisfy the additional condition that if $m \to \infty$ as $n \to \infty$ but $m/n \to 0$, then
\[
\sum_{k=-m}^{m} h_k^2 \to 0.
\]
Under these conditions, as $n \to \infty$,
(i) $E\big(\hat f(\omega)\big) \to f(\omega)$
(ii) $\big(\sum_{k=-m}^{m} h_k^2\big)^{-1} \operatorname{cov}\big(\hat f(\omega), \hat f(\lambda)\big) \to f^2(\omega)$ for $\omega = \lambda \neq 0, 1/2$.
In (ii), replace $f^2(\omega)$ by 0 if $\omega \neq \lambda$ and by $2 f^2(\omega)$ if $\omega = \lambda = 0$ or $1/2$.

We have already seen these results in the case of $\bar f(\omega)$, where the weights are constant, $h_k = L^{-1}$, in which case $\sum_{k=-m}^{m} h_k^2 = L^{-1}$. The distributional properties of (4.64) are more difficult now because $\hat f(\omega)$ is a weighted linear combination of asymptotically independent $\chi^2$ random variables. An approximation that seems to work well is to replace $L$ by $\big(\sum_{k=-m}^{m} h_k^2\big)^{-1}$. That is, define
\[
L_h = \Big(\sum_{k=-m}^{m} h_k^2\Big)^{-1} \tag{4.65}
\]

and use the approximation
\[
\frac{2 L_h \hat f(\omega)}{f(\omega)} \;\stackrel{\cdot}{\sim}\; \chi^2_{2L_h}. \tag{4.66}
\]
[Footnote 4.10: The approximation proceeds as follows. If $\hat f \stackrel{\cdot}{\sim} c\,\chi^2_\nu$, where $c$ is a constant, then $E\hat f \approx c\nu$ and $\operatorname{var}\hat f \approx f^2 \sum_k h_k^2 \approx c^2\, 2\nu$. Solving, $c \approx f \sum_k h_k^2 / 2 = f/2L_h$ and $\nu \approx 2\big(\sum_k h_k^2\big)^{-1} = 2L_h$.]

In analogy to (4.56), we will define the bandwidth in this case to be
\[
B = \frac{L_h}{n}. \tag{4.67}
\]
Using the approximation (4.66) we obtain an approximate $100(1-\alpha)\%$ confidence interval of the form
\[
\frac{2 L_h \hat f(\omega)}{\chi^2_{2L_h}(1-\alpha/2)} \;\le\; f(\omega) \;\le\; \frac{2 L_h \hat f(\omega)}{\chi^2_{2L_h}(\alpha/2)} \tag{4.68}
\]
for the true spectrum, $f(\omega)$. If the data are padded to $n'$, then replace $2L_h$ in (4.68) with $df = 2 L_h n / n'$ as in (4.60).

An easy way to generate the weights in R is by repeated use of the Daniell kernel. For example, with $m = 1$ and $L = 2m+1 = 3$, the Daniell kernel has weights $\{h_k\} = \{\tfrac13, \tfrac13, \tfrac13\}$; applying this kernel to a sequence of numbers, $\{u_t\}$, produces
\[
\hat u_t = \tfrac13 u_{t-1} + \tfrac13 u_t + \tfrac13 u_{t+1}.
\]
We can apply the same kernel again to the $\hat u_t$,
\[
\hat{\hat u}_t = \tfrac13 \hat u_{t-1} + \tfrac13 \hat u_t + \tfrac13 \hat u_{t+1},
\]
which simplifies to
\[
\hat{\hat u}_t = \tfrac19 u_{t-2} + \tfrac29 u_{t-1} + \tfrac39 u_t + \tfrac29 u_{t+1} + \tfrac19 u_{t+2}.
\]
The modified Daniell kernel puts half weights at the end points, so with $m = 1$ the weights are $\{h_k\} = \{\tfrac14, \tfrac24, \tfrac14\}$ and
\[
\hat u_t = \tfrac14 u_{t-1} + \tfrac12 u_t + \tfrac14 u_{t+1}.
\]
Applying the same kernel again to $\hat u_t$ yields
\[
\hat{\hat u}_t = \tfrac1{16} u_{t-2} + \tfrac4{16} u_{t-1} + \tfrac6{16} u_t + \tfrac4{16} u_{t+1} + \tfrac1{16} u_{t+2}.
\]
These coefficients can be obtained in R by issuing the kernel command. For example, kernel("modified.daniell", c(1,1)) would produce the coefficients of the last example. The other kernels that are currently available in R are the Dirichlet kernel and the Fejér kernel, which we will discuss shortly.

It is interesting to note that these kernel weights form a probability distribution. If X and Y are independent discrete uniforms on the integers {-1, 0, 1}, each with probability 1/3, then the convolution X + Y is discrete on the integers {-2, -1, 0, 1, 2} with corresponding probabilities {1/9, 2/9, 3/9, 2/9, 1/9}.
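A minimal sketch of these calculations in R: it recovers the twice-applied modified Daniell weights, computes $L_h$ in (4.65), the bandwidth (4.67), and the interval (4.68) for a generic smoothed ordinate. The ordinate value fhat and the sample size n = 480 are illustrative stand-ins rather than values from a particular example.
# Sketch: kernel weights, L_h, bandwidth, and the chi-squared interval (4.68)
k    <- kernel("modified.daniell", c(1,1))      # modified Daniell applied twice, m = 1
hk   <- c(rev(k$coef[-1]), k$coef)              # full symmetric weight vector {1,4,6,4,1}/16
Lh   <- 1 / sum(hk^2)                           # effective L, cf. (4.65)
n    <- 480                                     # illustrative series length
B    <- Lh / n                                  # bandwidth, cf. (4.67)
df   <- 2 * Lh                                  # approximate degrees of freedom
fhat <- 0.12                                    # an illustrative smoothed ordinate
c(lower = df*fhat/qchisq(.975, df),
  upper = df*fhat/qchisq(.025, df))             # approximate 95% interval, cf. (4.68)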

Fig. 4.10. Smoothed (tapered) spectral estimates of the SOI and Recruitment series; see Example 4.16 for details.

Example 4.16 Smoothed Periodogram for SOI and Recruitment
In this example, we estimate the spectra of the SOI and Recruitment series using the smoothed periodogram estimate in (4.64). We used a modified Daniell kernel twice, with $m = 3$ both times. This yields $L_h = 1/\sum_{k=-m}^{m} h_k^2 = 9.232$, which is close to the value of $L = 9$ used in Example 4.14. In this case, the bandwidth is $B = 9.232/480 = .019$ and the modified degrees of freedom is $df = 2 L_h\, 453/480 = 17.43$. The weights, $h_k$, can be obtained and graphed in R as follows:
kernel("modified.daniell", c(3,3))
coef[-6] = 0.006944 = coef[ 6]
coef[-5] = 0.027778 = coef[ 5]
coef[-4] = 0.055556 = coef[ 4]
coef[-3] = 0.083333 = coef[ 3]
coef[-2] = 0.111111 = coef[ 2]
coef[-1] = 0.138889 = coef[ 1]
coef[ 0] = 0.152778
plot(kernel("modified.daniell", c(3,3)))   # not shown
The resulting spectral estimates can be viewed in Figure 4.10 and we notice that the estimates are more appealing than those in Figure 4.7. Figure 4.10 was generated in R as follows; we also show how to obtain the associated bandwidth and degrees of freedom.

k = kernel("modified.daniell", c(3,3))
soi.smo = mvspec(soi, kernel=k, taper=.1, log="no")
abline(v=c(.25,1), lty=2)
## Repeat above lines with rec replacing soi in line 3
df = soi.smo$df            # df = 17.42618
soi.smo$bandwidth          # B  = 0.2308103
Note that a taper was applied in the estimation process; we discuss tapering in the next part. Reissuing the mvspec commands with log="no" removed will result in a figure similar to Figure 4.8. Finally, we mention that the modified Daniell kernel is used by default and an easier way to obtain soi.smo is to issue the command:
soi.smo = mvspec(soi, taper=.1, spans=c(7,7))
Notice that spans is a vector of odd integers, given in terms of $L = 2m+1$ instead of $m$.

There have been many attempts at dealing with the problem of smoothing the periodogram in an automatic way; an early reference is Wahba (1980). It is apparent from Example 4.16 that the smoothing bandwidth for the broadband El Niño behavior (near the 4 year cycle) should be much larger than the bandwidth for the annual cycle (the 1 year cycle). Consequently, it is perhaps better to perform automatic adaptive smoothing for estimating the spectrum. We refer interested readers to Fan and Kreutzberger (1998) and the numerous references within.

Tapering

We are now ready to introduce the concept of tapering; a more detailed discussion may be found in Bloomfield (2000, §9.5). Suppose $x_t$ is a mean-zero, stationary process with spectral density $f_x(\omega)$. If we replace the original series by the tapered series
\[
y_t = h_t x_t, \tag{4.69}
\]
for $t = 1, 2, \ldots, n$, use the modified DFT
\[
d_y(\omega_j) = n^{-1/2} \sum_{t=1}^{n} h_t x_t\, e^{-2\pi i \omega_j t}, \tag{4.70}
\]
and let $I_y(\omega_j) = |d_y(\omega_j)|^2$, we obtain (see Problem 4.17)
\[
E[I_y(\omega_j)] = \int_{-1/2}^{1/2} W_n(\omega_j - \omega)\, f_x(\omega)\, d\omega, \tag{4.71}
\]
where
\[
W_n(\omega) = |H_n(\omega)|^2 \tag{4.72}
\]
and
\[
H_n(\omega) = n^{-1/2} \sum_{t=1}^{n} h_t\, e^{-2\pi i \omega t}. \tag{4.73}
\]
The value $W_n(\omega)$ is called a spectral window because, in view of (4.71), it is determining which part of the spectral density $f_x(\omega)$ is being "seen" by the estimator $I_y(\omega_j)$ on average.
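To make the spectral window idea concrete, the sketch below evaluates $H_n(\omega)$ and $W_n(\omega)$ in (4.72)-(4.73) numerically for two choices of taper, no taper and a full cosine bell (the taper given later in (4.76)). The code is our own illustration; the frequency grid and the plotting choices are arbitrary.
# Sketch: spectral windows W_n(omega) = |H_n(omega)|^2 for two tapers, cf. (4.72)-(4.73)
n  <- 480
t  <- 1:n
h1 <- rep(1, n)                                  # no taper
h2 <- .5*(1 + cos(2*pi*(t - (n+1)/2)/n))         # full cosine bell taper, cf. (4.76)
omega <- seq(-.04, .04, by=.0001)                # frequencies near zero
Hn <- function(h, w) sapply(w, function(wj) sum(h*exp(-2i*pi*wj*t))/sqrt(n))
W1 <- Mod(Hn(h1, omega))^2                       # Fejer window, cf. (4.74)
W2 <- Mod(Hn(h2, omega))^2                       # cosine-taper window
plot(omega, W1, type="l", xlab="frequency", ylab="window")
lines(omega, W2, lty=2)                          # note the suppressed sidelobes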

Fig. 4.11. Averaged Fejér window (top row) and the corresponding cosine taper window (bottom row) for L = 9, n = 480. The extra tic marks on the horizontal axis of the left-hand plots exhibit the predicted bandwidth, B = 9/480 = .01875.

In the case that $h_t = 1$ for all $t$, $I_y(\omega_j) = I_x(\omega_j)$ is simply the periodogram of the data and the window is
\[
W_n(\omega) = \frac{\sin^2(n\pi\omega)}{n \sin^2(\pi\omega)}, \tag{4.74}
\]
with $W_n(0) = n$, which is known as the Fejér or modified Bartlett kernel. If we consider the averaged periodogram in (4.54), namely
\[
\bar f_x(\omega) = \frac{1}{L} \sum_{k=-m}^{m} I_x(\omega_j + k/n),
\]
the window, $W_n(\omega)$, in (4.71) will take the form
\[
W_n(\omega) = \frac{1}{nL} \sum_{k=-m}^{m} \frac{\sin^2[n\pi(\omega + k/n)]}{\sin^2[\pi(\omega + k/n)]}. \tag{4.75}
\]
Tapers generally have a shape that enhances the center of the data relative to the extremities, such as a cosine bell of the form

Fig. 4.12. Smoothed spectral estimates of the SOI without tapering (dashed line) and with full tapering (solid line); see Example 4.17. The insert shows a full cosine bell taper, (4.76), with horizontal axis $(t - \bar t)/n$, for $t = 1, \ldots, n$.

\[
h_t = .5\Big[1 + \cos\Big(\frac{2\pi(t - \bar t)}{n}\Big)\Big], \tag{4.76}
\]
where $\bar t = (n+1)/2$, favored by Blackman and Tukey (1959). The shape of this taper is shown in the insert to Figure 4.12.

In Figure 4.11, we have plotted the shapes of two windows, $W_n(\omega)$, for $n = 480$ and $L = 9$, when (i) $h_t \equiv 1$, in which case (4.75) applies, and (ii) $h_t$ is the cosine taper in (4.76). In both cases the predicted bandwidth should be $B = 9/480 = .01875$ cycles per point, which corresponds to the "width" of the windows shown in Figure 4.11. Both windows produce an integrated average spectrum over this band, but the untapered window in the top panels shows considerable ripples over the band and outside the band. The ripples outside the band are called sidelobes and tend to introduce frequencies from outside the interval that may contaminate the desired spectral estimate within the band. For example, a large dynamic range for the values in the spectrum introduces spectra in contiguous frequency intervals several orders of magnitude greater than the value in the interval of interest. This effect is sometimes called leakage. Figure 4.11 emphasizes the suppression of the sidelobes of the Fejér kernel when a cosine taper is used.

Example 4.17 The Effect of Tapering the SOI Series
The estimates in Example 4.16 were obtained by tapering the upper and lower 10% of the data. In this example, we examine the effect of tapering on the estimate of the spectrum of the SOI series (the results for the Recruitment series are similar). Figure 4.12 shows two spectral estimates plotted on a log scale. The dashed line in Figure 4.12 shows the estimate without any tapering. The solid line shows the result with full tapering. Notice that the tapered spectrum does a better job in separating the yearly cycle ($\omega = 1$) and the El Niño cycle ($\omega = 1/4$).

The following R session was used to generate Figure 4.12. We note that, by default, mvspec does not taper.

For full tapering, we use the argument taper=.5 to instruct mvspec to taper 50% of each end of the data; any value between 0 and .5 is acceptable. In Example 4.16, we used taper=.1.
s0  = mvspec(soi, spans=c(7,7), plot=FALSE)              # no taper
s50 = mvspec(soi, spans=c(7,7), taper=.5, plot=FALSE)    # full taper
plot(s50$freq, s50$spec, log="y", type="l", ylab="spectrum", xlab="frequency")  # solid line
lines(s0$freq, s0$spec, lty=2)                           # dashed line

We close this section with a brief discussion of lag window estimators. First, consider the periodogram, $I(\omega_j)$, which was shown in (4.30) to be
\[
I(\omega_j) = \sum_{|h| < n} \hat\gamma(h)\, e^{-2\pi i \omega_j h}.
\]
Thus, (4.64) can be written as
\[
\hat f(\omega) = \sum_{|k| \le m} h_k\, I(\omega_j + k/n)
 = \sum_{|k| \le m} h_k \sum_{|h| < n} \hat\gamma(h)\, e^{-2\pi i (\omega_j + k/n) h}
 = \sum_{|h| < n} g(h/n)\, \hat\gamma(h)\, e^{-2\pi i \omega_j h}, \tag{4.77}
\]
where $g(h/n) = \sum_{|k| \le m} h_k \exp(-2\pi i k h/n)$. Equation (4.77) suggests estimators of the form
\[
\tilde f(\omega) = \sum_{|h| \le r} w(h/r)\, \hat\gamma(h)\, e^{-2\pi i \omega h}, \tag{4.78}
\]
where $w(\cdot)$ is a weight function, called the lag window, that satisfies
(i) $w(0) = 1$,
(ii) $|w(x)| \le 1$ and $w(x) = 0$ for $|x| > 1$,
(iii) $w(x) = w(-x)$.
Note that if $w(x) = 1$ for $|x| < 1$ and $r = n$, then $\tilde f(\omega_j) = I(\omega_j)$, the periodogram. This result indicates that the problem with the periodogram as an estimator of the spectral density is that it gives too much weight to the values of $\hat\gamma(h)$ when $h$ is large, and hence is unreliable [e.g., there is only one pair of observations used in the estimate of $\hat\gamma(n-1)$, and so on]. The smoothing window is defined to be
\[
W_r(\omega) = \sum_{h=-r}^{r} w(h/r)\, e^{-2\pi i \omega h}, \tag{4.79}
\]
and it determines which part of the periodogram will be used to form the estimate of $f(\omega)$. The asymptotic theory for $\hat f(\omega)$ holds for $\tilde f(\omega)$ under the same conditions and provided $r \to \infty$ as $n \to \infty$ but with $r/n \to 0$. That is,
\[
E\{\tilde f(\omega)\} \to f(\omega), \tag{4.80}
\]
\[
\Big(\frac{n}{r}\Big) \operatorname{cov}\big(\tilde f(\omega), \tilde f(\lambda)\big) \to f^2(\omega) \int_{-1}^{1} w^2(x)\, dx, \qquad \omega = \lambda \neq 0, 1/2. \tag{4.81}
\]
In (4.81), replace $f^2(\omega)$ by 0 if $\omega \neq \lambda$ and by $2 f^2(\omega)$ if $\omega = \lambda = 0$ or $1/2$.
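As a rough illustration of (4.78), the following sketch computes a lag window estimate by hand for the SOI series using a Bartlett (triangular) lag window, $w(x) = 1 - |x|$. The choice of window and of the truncation point r = 48 is ours, purely for illustration; no claim is made that this reproduces a particular figure in the text.
# Sketch: a lag window spectral estimate (4.78) with a Bartlett window, w(x) = 1 - |x|
library(astsa)                        # for the soi series
n  <- length(soi)
r  <- 48                              # truncation point (illustrative choice)
g  <- acf(soi, lag.max=r, type="covariance", demean=TRUE, plot=FALSE)$acf[,,1]  # gamma-hat(0..r)
w  <- 1 - (0:r)/r                     # Bartlett lag window weights
freq <- seq(0, .5, by=1/n)            # cycles per month (multiply by 12 for cycles per year)
ftilde <- sapply(freq, function(om)
  g[1] + 2*sum(w[-1]*g[-1]*cos(2*pi*om*(1:r))))   # uses the symmetry gamma(-h) = gamma(h)
plot(freq, ftilde, type="l", xlab="frequency", ylab="spectrum")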

Many authors have developed various windows, and Brillinger (2001, Ch 3) and Brockwell and Davis (1991, Ch 10) are good sources of detailed information on this topic.

4.5 Parametric Spectral Estimation

The methods of the previous section lead to what is generally referred to as nonparametric spectral estimators because no assumption is made about the parametric form of the spectral density. In Property 4.4, we exhibited the spectrum of an ARMA process and we might consider basing a spectral estimator on this function, substituting the parameter estimates from an ARMA(p, q) fit on the data into the formula for the spectral density $f_x(\omega)$ given in (4.23). Such an estimator is called a parametric spectral estimator. For convenience, a parametric spectral estimator is obtained by fitting an AR(p) to the data, where the order p is determined by one of the model selection criteria, such as AIC, AICc, and BIC, defined in (2.15)-(2.17). Parametric autoregressive spectral estimators will often have superior resolution in problems when several closely spaced narrow spectral peaks are present and are preferred by engineers for a broad variety of problems (see Kay, 1988). The development of autoregressive spectral estimators has been summarized by Parzen (1983).

If $\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_p$ and $\hat\sigma^2_w$ are the estimates from an AR(p) fit to $x_t$, then based on Property 4.4, a parametric spectral estimate of $f_x(\omega)$ is attained by substituting these estimates into (4.23), that is,
\[
\hat f_x(\omega) = \frac{\hat\sigma^2_w}{|\hat\phi(e^{-2\pi i \omega})|^2}, \tag{4.82}
\]
where
\[
\hat\phi(z) = 1 - \hat\phi_1 z - \hat\phi_2 z^2 - \cdots - \hat\phi_p z^p. \tag{4.83}
\]
The asymptotic distribution of the autoregressive spectral estimator has been obtained by Berk (1974) under the conditions $p \to \infty$, $p^3/n \to 0$ as $p, n \to \infty$, which may be too severe for most applications. The limiting results imply a confidence interval of the form
\[
\frac{\hat f_x(\omega)}{1 + C z_{\alpha/2}} \;\le\; f_x(\omega) \;\le\; \frac{\hat f_x(\omega)}{1 - C z_{\alpha/2}}, \tag{4.84}
\]
where $C = \sqrt{2p/n}$ and $z_{\alpha/2}$ is the ordinate corresponding to the upper $\alpha/2$ probability of the standard normal distribution. If the sampling distribution is to be checked, we suggest applying the bootstrap estimator to get the sampling distribution of $\hat f_x(\omega)$ using a procedure similar to the one used for $p = 1$ in Example 3.36. An alternative for higher order autoregressive series is to put the AR(p) in state-space form and use the bootstrap procedure discussed in Section 6.7.

An interesting fact about rational spectra of the form (4.23) is that any spectral density can be approximated, arbitrarily close, by the spectrum of an AR process.
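Before stating that approximation result formally, here is a minimal sketch of how (4.82) and the interval (4.84) might be computed directly from an AR fit. The order p = 15 is the one chosen for the SOI series in Example 4.18, but the variable names, the frequency grid, and the fitting method are our own choices; this is not the text's spec.ar code.
# Sketch: AR(p) spectral estimate (4.82) with the approximate 95% interval (4.84)
library(astsa)
n    <- length(soi)
p    <- 15
fit  <- ar(soi, order.max=p, aic=FALSE, method="yule-walker")
phi  <- fit$ar; sig2 <- fit$var.pred
freq <- seq(0, .5, by=.001)                          # cycles per month
fhat <- sapply(freq, function(om)
  sig2 / Mod(1 - sum(phi*exp(-2i*pi*om*(1:p))))^2)   # cf. (4.82)-(4.83)
C <- sqrt(2*p/n); z <- qnorm(.975)
plot(freq, fhat, type="l", xlab="frequency", ylab="spectrum")
lines(freq, fhat/(1 + C*z), lty=2)                   # lower limit, cf. (4.84)
lines(freq, fhat/(1 - C*z), lty=2)                   # upper limit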

Fig. 4.13. Model selection criteria AIC and BIC as a function of order p for autoregressive models fitted to the SOI series.

Property 4.7 AR Spectral Approximation
Let $g(\omega)$ be the spectral density of a stationary process. Then, given $\epsilon > 0$, there is a time series with the representation
\[
x_t = \sum_{k=1}^{p} \phi_k x_{t-k} + w_t,
\]
where $w_t$ is white noise with variance $\sigma^2_w$, such that
\[
|f_x(\omega) - g(\omega)| < \epsilon \quad \text{for all } \omega \in [-1/2, 1/2].
\]
Moreover, $p$ is finite and the roots of $\phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k$ are outside the unit circle.

One drawback of the property is that it does not tell us how large $p$ must be before the approximation is reasonable; in some situations $p$ may be extremely large. Property 4.7 also holds for MA and for ARMA processes in general, and a proof of the result may be found in Section C.6.
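To get a feel for Property 4.7, the sketch below compares the true spectrum of an MA(1) process with the spectra of AR(p) approximations obtained from the Yule-Walker equations for increasing p. The MA parameter theta = .9 and the orders tried are our own illustrative choices.
# Sketch: approximating an MA(1) spectrum by AR(p) spectra of increasing order
theta <- .9
freq  <- seq(0, .5, by=.001)
f.ma  <- Mod(1 + theta*exp(-2i*pi*freq))^2             # true MA(1) spectrum (sigma_w^2 = 1)
gamma <- ARMAacf(ma=theta, lag.max=60)*(1 + theta^2)   # autocovariances of the MA(1)
plot(freq, f.ma, type="l", xlab="frequency", ylab="spectrum")
for (p in c(2, 5, 20)){
  G    <- toeplitz(gamma[1:p])                         # Gamma_p
  phi  <- solve(G, gamma[2:(p+1)])                     # Yule-Walker coefficients
  s2   <- gamma[1] - sum(phi*gamma[2:(p+1)])           # innovations variance
  f.ar <- sapply(freq, function(om)
            s2 / Mod(1 - sum(phi*exp(-2i*pi*om*(1:p))))^2)
  lines(freq, f.ar, lty=2)                             # the approximation improves with p
}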

We demonstrate the technique in the following example.

Fig. 4.14. Autoregressive spectral estimator for the SOI series using the AR(15) model selected by AIC, AICc, and BIC.

Example 4.18 Autoregressive Spectral Estimator for SOI
Consider obtaining results comparable to the nonparametric estimators shown in Figure 4.7 for the SOI series. Fitting successively higher order AR(p) models for $p = 1, 2, \ldots, 30$ yields a minimum BIC and a minimum AIC at $p = 15$, as shown in Figure 4.13. We can see from Figure 4.13 that BIC is very definite about which model it chooses; that is, the minimum BIC is very distinct. On the other hand, it is not clear what is going to happen with AIC; that is, the minimum is not so clear, and there is some concern that AIC will start decreasing after $p = 30$. Minimum AICc selects the $p = 15$ model, but suffers from the same uncertainty as AIC. The spectrum is shown in Figure 4.14, and we note the strong peaks near the four year and one year cycles as in the nonparametric estimates obtained in Section 4.4. In addition, the harmonics of the yearly period are evident in the estimated spectrum.

To perform a similar analysis in R, the command spec.ar can be used to fit the best model via AIC and plot the resulting spectrum. A quick way to obtain the AIC values is to run the ar command as follows.
spaic = spec.ar(soi, log="no")               # min AIC spec
abline(v=frequency(soi)*1/52, lty=3)         # El Nino peak
(soi.ar = ar(soi, order.max=30))             # estimates and AICs
dev.new()
plot(1:30, soi.ar$aic[-1], type="o")         # plot AICs
No likelihood is calculated here, so the use of the term AIC is loose. To generate Figure 4.13 we used the following code to (loosely) obtain AIC, AICc, and BIC. Because AIC and AICc are nearly identical in this example, we only graphed AIC and BIC+1; we added 1 to the BIC to reduce white space in the graphic.
n = length(soi)
AIC = rep(0, 30) -> AICc -> BIC
for (k in 1:30){
  sigma2  = ar(soi, order=k, aic=FALSE)$var.pred
  BIC[k]  = log(sigma2) + (k*log(n)/n)
  AICc[k] = log(sigma2) + ((n+k)/(n-k-2))
  AIC[k]  = log(sigma2) + ((n+2*k)/n)
}
IC = cbind(AIC, BIC+1)
ts.plot(IC, type="o", xlab="p", ylab="AIC / BIC")

Finally, it should be mentioned that any parametric spectrum, say $f(\omega; \theta)$, depending on the vector parameter $\theta$, can be estimated via the Whittle likelihood (Whittle, 1961), using the approximate properties of the discrete Fourier transform derived in Appendix C. We have that the DFTs, $d(\omega_j)$, are approximately complex normally distributed with mean zero and variance $f(\omega_j; \theta)$ and are approximately independent for $\omega_j \neq \omega_k$. This implies that an approximate log likelihood can be written in the form

\[
\ln L(x; \theta) \approx - \sum_{0 < \omega_j < 1/2} \Big( \ln f_x(\omega_j; \theta) + \frac{|d(\omega_j)|^2}{f_x(\omega_j; \theta)} \Big), \tag{4.85}
\]
where the sum is sometimes expanded to include the frequencies $\omega_j = 0, 1/2$. If the form with the two additional frequencies is used, the multiplier of the sum will be unity, except for the purely real points at $\omega_j = 0, 1/2$, for which the multiplier is 1/2. For a discussion of applying the Whittle approximation to the problem of estimating parameters in an ARMA spectrum, see Anderson (1978). The Whittle likelihood is especially useful for fitting long memory models that will be discussed in Chapter 5.

4.6 Multiple Series and Cross-Spectra

The notion of analyzing frequency fluctuations using classical statistical ideas extends to the case in which there are several jointly stationary series, for example, $x_t$ and $y_t$. In this case, we can introduce the idea of a correlation indexed by frequency, called the coherence. The results in Section C.2 imply the covariance function
\[
\gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]
\]
has the representation
\[
\gamma_{xy}(h) = \int_{-1/2}^{1/2} f_{xy}(\omega)\, e^{2\pi i \omega h}\, d\omega, \quad h = 0, \pm 1, \pm 2, \ldots, \tag{4.86}
\]
where the cross-spectrum is defined as the Fourier transform
\[
f_{xy}(\omega) = \sum_{h=-\infty}^{\infty} \gamma_{xy}(h)\, e^{-2\pi i \omega h}, \quad -1/2 \le \omega \le 1/2, \tag{4.87}
\]
assuming that the cross-covariance function is absolutely summable, as was the case for the autocovariance. The cross-spectrum is generally a complex-valued function, and it is often written as
\[
f_{xy}(\omega) = c_{xy}(\omega) - i\, q_{xy}(\omega), \tag{4.88}
\]
where
\[
c_{xy}(\omega) = \sum_{h=-\infty}^{\infty} \gamma_{xy}(h) \cos(2\pi\omega h) \tag{4.89}
\]
and
\[
q_{xy}(\omega) = \sum_{h=-\infty}^{\infty} \gamma_{xy}(h) \sin(2\pi\omega h) \tag{4.90}
\]
are defined as the cospectrum and quadspectrum, respectively. Because of the relationship $\gamma_{yx}(h) = \gamma_{xy}(-h)$, it follows, by substituting into (4.87) and rearranging, that

\[
f_{yx}(\omega) = f_{xy}^*(\omega), \tag{4.91}
\]
with $*$ denoting conjugation. This result, in turn, implies that the cospectrum and quadspectrum satisfy
\[
c_{yx}(\omega) = c_{xy}(\omega) \tag{4.92}
\]
and
\[
q_{yx}(\omega) = -q_{xy}(\omega). \tag{4.93}
\]
An important example of the application of the cross-spectrum is to the problem of predicting an output series $y_t$ from some input series $x_t$ through a linear filter relation such as the three-point moving average considered below. A measure of the strength of such a relation is the squared coherence function, defined as
\[
\rho^2_{y \cdot x}(\omega) = \frac{|f_{yx}(\omega)|^2}{f_{xx}(\omega)\, f_{yy}(\omega)}, \tag{4.94}
\]
where $f_{xx}(\omega)$ and $f_{yy}(\omega)$ are the individual spectra of the $x_t$ and $y_t$ series, respectively. Although we consider a more general form of this that applies to multiple inputs later, it is instructive to display the single input case as (4.94) to emphasize the analogy with conventional squared correlation, which takes the form
\[
\rho^2_{yx} = \frac{\sigma^2_{yx}}{\sigma^2_x\, \sigma^2_y},
\]
for random variables with variances $\sigma^2_x$ and $\sigma^2_y$ and covariance $\sigma_{yx} = \sigma_{xy}$. This motivates the interpretation of squared coherence as the squared correlation between two time series at frequency $\omega$.

Example 4.19 Three-Point Moving Average
As a simple example, we compute the cross-spectrum between $x_t$ and the three-point moving average $y_t = (x_{t-1} + x_t + x_{t+1})/3$, where $x_t$ is a stationary input process with spectral density $f_{xx}(\omega)$. First,
\[
\gamma_{xy}(h) = \operatorname{cov}(x_{t+h}, y_t) = \tfrac13 \operatorname{cov}(x_{t+h},\, x_{t-1} + x_t + x_{t+1})
 = \tfrac13 \big[\gamma_{xx}(h+1) + \gamma_{xx}(h) + \gamma_{xx}(h-1)\big]
\]
\[
 = \tfrac13 \int_{-1/2}^{1/2} \big(e^{2\pi i \omega} + 1 + e^{-2\pi i \omega}\big) f_{xx}(\omega)\, e^{2\pi i \omega h}\, d\omega
 = \tfrac13 \int_{-1/2}^{1/2} \big[1 + 2\cos(2\pi\omega)\big] f_{xx}(\omega)\, e^{2\pi i \omega h}\, d\omega,
\]
where we have used (4.16). Using the uniqueness of the Fourier transform, we argue from the spectral representation (4.86) that
\[
f_{xy}(\omega) = \tfrac13 \big[1 + 2\cos(2\pi\omega)\big] f_{xx}(\omega),
\]

so that the cross-spectrum is real in this case. Using Property 4.3, the spectral density of $y_t$ is
\[
f_{yy}(\omega) = \tfrac19 \big|e^{2\pi i \omega} + 1 + e^{-2\pi i \omega}\big|^2 f_{xx}(\omega)
 = \tfrac19 \big[1 + 2\cos(2\pi\omega)\big]^2 f_{xx}(\omega).
\]
Substituting into (4.94) yields
\[
\rho^2_{y \cdot x}(\omega) = \frac{\big\{\tfrac13 [1 + 2\cos(2\pi\omega)]\, f_{xx}(\omega)\big\}^2}{f_{xx}(\omega)\cdot \tfrac19 [1 + 2\cos(2\pi\omega)]^2 f_{xx}(\omega)} = 1;
\]
that is, the squared coherence between $x_t$ and $y_t$ is unity over all frequencies. This is a characteristic inherited by more general linear filters; see Problem 4.30. However, if some noise is added to the three-point moving average, the coherence is not unity; these kinds of models will be considered in detail later.

Property 4.8 Spectral Representation of a Vector Stationary Process
If $x_t = (x_{t1}, x_{t2}, \ldots, x_{tp})'$ is a $p \times 1$ stationary process with autocovariance matrix $\Gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)'] = \{\gamma_{jk}(h)\}$ satisfying
\[
\sum_{h=-\infty}^{\infty} |\gamma_{jk}(h)| < \infty \tag{4.95}
\]
for all $j, k = 1, \ldots, p$, then $\Gamma(h)$ has the representation
\[
\Gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i \omega h} f(\omega)\, d\omega, \quad h = 0, \pm 1, \pm 2, \ldots, \tag{4.96}
\]
as the inverse transform of the spectral density matrix, $f(\omega) = \{f_{jk}(\omega)\}$, for $j, k = 1, \ldots, p$. The matrix $f(\omega)$ has the representation
\[
f(\omega) = \sum_{h=-\infty}^{\infty} \Gamma(h)\, e^{-2\pi i \omega h}, \quad -1/2 \le \omega \le 1/2. \tag{4.97}
\]
The spectral matrix $f(\omega)$ is Hermitian, $f(\omega) = f^*(\omega)$, where $*$ means to conjugate and transpose.

Example 4.20 Spectral Matrix of a Bivariate Process
Consider a jointly stationary bivariate process $(x_t, y_t)$. We arrange the autocovariances in the matrix
\[
\Gamma(h) = \begin{pmatrix} \gamma_{xx}(h) & \gamma_{xy}(h) \\ \gamma_{yx}(h) & \gamma_{yy}(h) \end{pmatrix}.
\]
The spectral matrix would be given by
\[
f(\omega) = \begin{pmatrix} f_{xx}(\omega) & f_{xy}(\omega) \\ f_{yx}(\omega) & f_{yy}(\omega) \end{pmatrix},
\]
where the Fourier transforms (4.96) and (4.97) relate the autocovariance and spectral matrices.
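The unit coherence in Example 4.19 is easy to check empirically. The sketch below simulates a white noise input, forms the three-point moving average, and estimates the squared coherence with mvspec from astsa; the smoothing spans, the sample size, and the seed are arbitrary choices, and the estimate hovers near (not exactly at) one because of sampling variability and smoothing.
# Sketch: empirical check of Example 4.19, coherence of x_t and its three-point average
library(astsa)
set.seed(90210)
n <- 1024
x <- rnorm(n)                                   # white noise input
y <- stats::filter(x, rep(1/3, 3), sides=2)     # three-point moving average
keep <- !is.na(y)                               # drop the end points lost to filtering
xy <- ts(cbind(x = x[keep], y = as.numeric(y)[keep]))
sr <- mvspec(xy, spans=c(7,7), plot=FALSE)
plot(sr$freq, sr$coh, type="l", ylim=c(0,1),
     xlab="frequency", ylab="squared coherency") # near 1; dips may appear near omega = 1/3,
                                                 # where the filter response vanishes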

The extension of spectral estimation to vector series is fairly obvious. For the $p \times 1$ vector series $x_t = (x_{t1}, x_{t2}, \ldots, x_{tp})'$, we may use the vector of DFTs, say $d(\omega_j) = (d_1(\omega_j), d_2(\omega_j), \ldots, d_p(\omega_j))'$, and estimate the spectral matrix by
\[
\bar f(\omega) = L^{-1} \sum_{k=-m}^{m} I(\omega_j + k/n), \tag{4.98}
\]
where now
\[
I(\omega_j) = d(\omega_j)\, d^*(\omega_j) \tag{4.99}
\]
is a $p \times p$ complex matrix. The series may be tapered before the DFT is taken in (4.98) and we can use weighted estimation,
\[
\hat f(\omega) = \sum_{k=-m}^{m} h_k\, I(\omega_j + k/n), \tag{4.100}
\]
where $\{h_k\}$ are weights as defined in (4.64). The estimate of squared coherence between two series, $y_t$ and $x_t$, is
\[
\hat\rho^2_{y \cdot x}(\omega) = \frac{|\hat f_{yx}(\omega)|^2}{\hat f_{xx}(\omega)\, \hat f_{yy}(\omega)}. \tag{4.101}
\]
If the spectral estimates in (4.101) are obtained using equal weights, we will write $\bar\rho^2_{y \cdot x}(\omega)$ for the estimate.

Under general conditions, if $\rho^2_{y \cdot x}(\omega) > 0$, then
\[
|\hat\rho_{y \cdot x}(\omega)| \sim \mathrm{AN}\Big( |\rho_{y \cdot x}(\omega)|,\; \big(1 - \rho^2_{y \cdot x}(\omega)\big)^2 \big/ 2 L_h \Big), \tag{4.102}
\]
where $L_h$ is defined in (4.65); the details of this result may be found in Brockwell and Davis (1991, Ch 11). We may use (4.102) to obtain approximate confidence intervals for the squared coherence, $\rho^2_{y \cdot x}(\omega)$.

We may also test the null hypothesis that $\rho^2_{y \cdot x}(\omega) = 0$ if we use $\bar\rho^2_{y \cdot x}(\omega)$ for the estimate with $L > 1$ [Footnote 4.11: If $L = 1$ then $\bar\rho^2_{y \cdot x}(\omega) \equiv 1$.], that is,
\[
\bar\rho^2_{y \cdot x}(\omega) = \frac{|\bar f_{yx}(\omega)|^2}{\bar f_{xx}(\omega)\, \bar f_{yy}(\omega)}. \tag{4.103}
\]
In this case, under the null hypothesis, the statistic
\[
F = \frac{\bar\rho^2_{y \cdot x}(\omega)}{1 - \bar\rho^2_{y \cdot x}(\omega)}\, (L - 1) \tag{4.104}
\]
has an approximate F-distribution with 2 and $2L - 2$ degrees of freedom. When the series have been extended to length $n'$, we replace $2L - 2$ by $df - 2$, where $df$ is defined in (4.60). Solving (4.104) for a particular significance level $\alpha$ leads to

\[
C_\alpha = \frac{F_{2, 2L-2}(\alpha)}{L - 1 + F_{2, 2L-2}(\alpha)} \tag{4.105}
\]
as the approximate value that must be exceeded for the original squared coherence to be able to reject $\rho^2_{y \cdot x}(\omega) = 0$ at an a priori specified frequency.

Fig. 4.15. Squared coherency between the SOI and Recruitment series; L = 19, n = 453, n' = 480, and alpha = .001. The horizontal line is $C_{.001}$.

Example 4.21 Coherence Between SOI and Recruitment
Figure 4.15 shows the squared coherence between the SOI and Recruitment series over a wider band than was used for the spectrum. In this case, we used $L = 19$, $df = 2(19)(453/480) \approx 36$, and $F_{2, df-2}(.001) \approx 8.53$ at the significance level $\alpha = .001$. Hence, we may reject the hypothesis of no coherence for values of $\bar\rho^2_{y \cdot x}(\omega)$ that exceed $C_{.001} = .32$. We emphasize that this method is crude because, in addition to the fact that the F-statistic is approximate, we are examining the squared coherence across all frequencies with the Bonferroni inequality, (4.63), in mind. Figure 4.15 also exhibits confidence bands as part of the R plotting routine. We emphasize that these bands are only valid for $\omega$ where $\rho^2_{y \cdot x}(\omega) > 0$.

In this case, the two series are obviously strongly coherent at the annual seasonal frequency. The series are also strongly coherent at lower frequencies that may be attributed to the El Niño cycle, which we claimed had a 3 to 7 year period. The peak in the coherency, however, occurs closer to the 9 year cycle. Other frequencies are also coherent, although the strong coherence is less impressive because the underlying power spectrum at these higher frequencies is fairly small. Finally, we note that the coherence is persistent at the seasonal harmonic frequencies.

This example may be reproduced using the following R commands.
sr = mvspec(cbind(soi,rec), kernel("daniell",9), plot=FALSE)
sr$df                       # df = 35.8625
f = qf(.999, 2, sr$df-2)    # = 8.529792
C = f/(18+f)                # = 0.321517
plot(sr, plot.type = "coh", ci.lty = 2)
abline(h = C)

4.7 Linear Filters

Some of the examples of the previous sections have hinted at the possibility that the distribution of power or variance in a time series can be modified by making a linear transformation. In this section, we explore that notion further by showing how linear filters can be used to extract signals from a time series. These filters modify the spectral characteristics of a time series in a predictable way, and the systematic development of methods for taking advantage of the special properties of linear filters is an important topic in time series analysis.

Recall Property 4.3, which stated that if
\[
y_t = \sum_{j=-\infty}^{\infty} a_j x_{t-j}, \qquad \sum_{j=-\infty}^{\infty} |a_j| < \infty,
\]
and $x_t$ has spectrum $f_{xx}(\omega)$, then $y_t$ has spectrum
\[
f_{yy}(\omega) = |A_{yx}(\omega)|^2 f_{xx}(\omega),
\]
where
\[
A_{yx}(\omega) = \sum_{j=-\infty}^{\infty} a_j e^{-2\pi i \omega j}
\]
is the frequency response function. This result shows that the filtering effect can be characterized as a frequency-by-frequency multiplication by the squared magnitude of the frequency response function.

Example 4.22 First Difference and Moving Average Filters
We illustrate the effect of filtering with two common examples, the first difference filter
\[
y_t = \nabla x_t = x_t - x_{t-1}
\]
and the annual symmetric moving average filter,
\[
y_t = \tfrac{1}{24}\big(x_{t-6} + x_{t+6}\big) + \tfrac{1}{12} \sum_{r=-5}^{5} x_{t-r},
\]
which is a modified Daniell kernel with $m = 6$. The results of filtering the SOI series using the two filters are shown in the middle and bottom panels of Figure 4.16. Notice that the effect of differencing is to roughen the series because it tends to retain the higher or faster frequencies. The centered moving average smoothes the series because it retains the lower frequencies and tends to attenuate the higher frequencies. In general, differencing is an example of a high-pass filter because it retains or passes the higher frequencies, whereas the moving average is a low-pass filter because it passes the lower or slower frequencies.

Notice that the slower periods are enhanced in the symmetric moving average and the seasonal or yearly frequencies are attenuated. The filtered series makes

about 9 cycles in the length of the data (about one cycle every 52 months) and the moving average filter tends to enhance or extract the El Niño signal. Moreover, by low-pass filtering the data, we get a better sense of the El Niño effect and its irregularity.

Fig. 4.16. SOI series (top) compared with the differenced SOI (middle) and a centered 12-month moving average (bottom).

Now, having done the filtering, it is essential to determine the exact way in which the filters change the input spectrum. We shall use (4.21) and (4.22) for this purpose. The first difference filter can be written in the form (4.20) by letting $a_0 = 1$, $a_1 = -1$, and $a_r = 0$ otherwise. This implies that
\[
A_{yx}(\omega) = 1 - e^{-2\pi i \omega},
\]
and the squared frequency response becomes
\[
|A_{yx}(\omega)|^2 = \big(1 - e^{-2\pi i \omega}\big)\big(1 - e^{2\pi i \omega}\big) = 2\big[1 - \cos(2\pi\omega)\big]. \tag{4.106}
\]
The top panel of Figure 4.17 shows that the first difference filter will attenuate the lower frequencies and enhance the higher frequencies because the multiplier of the spectrum, $|A_{yx}(\omega)|^2$, is large for the higher frequencies and small for the lower frequencies. Generally, the slow rise of this kind of filter does not particularly recommend it as a procedure for retaining only the high frequencies.

For the centered 12-month moving average, we can take $a_{-6} = a_6 = 1/24$, $a_k = 1/12$ for $-5 \le k \le 5$, and $a_k = 0$ elsewhere. Substituting and recognizing the cosine terms gives

\[
A_{yx}(\omega) = \tfrac{1}{12}\Big[1 + \cos(12\pi\omega) + 2 \sum_{k=1}^{5} \cos(2\pi\omega k)\Big]. \tag{4.107}
\]

Fig. 4.17. Squared frequency response functions of the first difference (top) and twelve-month moving average (bottom) filters.

Plotting the squared frequency response of this function as in the bottom of Figure 4.17 shows that we can expect this filter to cut most of the frequency content above .05 cycles per point, and nearly all of the frequency content above $1/12 \approx .083$. In particular, this drives down the yearly components with periods of 12 months and enhances the El Niño frequency, which is somewhat lower. The filter is not completely efficient at attenuating high frequencies; some power contributions are left at higher frequencies, as shown in the function $|A_{yx}(\omega)|^2$.

The following R session shows how to filter the data, perform the spectral analysis of a filtered series, and plot the squared frequency response curves of the difference and moving average filters.
par(mfrow=c(3,1), mar=c(3,3,1,1), mgp=c(1.6,.6,0))
plot(soi)                              # plot data
plot(diff(soi))                        # plot first difference
k = kernel("modified.daniell", 6)      # filter weights
plot(soif <- kernapply(soi, k))        # plot 12 month filter
dev.new()
spectrum(soif, spans=9, log="no")      # spectral analysis (not shown)
abline(v=12/52, lty="dashed")
dev.new()
##-- frequency responses --##
par(mfrow=c(2,1), mar=c(3,3,1,1), mgp=c(1.6,.6,0))
w = seq(0, .5, by=.01)
FRdiff = abs(1-exp(2i*pi*w))^2
plot(w, FRdiff, type='l', xlab='frequency')

u = cos(2*pi*w)+cos(4*pi*w)+cos(6*pi*w)+cos(8*pi*w)+cos(10*pi*w)
FRma = ((1 + cos(12*pi*w) + 2*u)/12)^2
plot(w, FRma, type='l', xlab='frequency')

The two filters discussed in the previous example were different in that the frequency response function of the first difference was complex-valued, whereas the frequency response of the moving average was purely real. A short derivation similar to that used to verify (4.22) shows, when $x_t$ and $y_t$ are related by the linear filter relation (4.20), the cross-spectrum satisfies
\[
f_{yx}(\omega) = A_{yx}(\omega) f_{xx}(\omega),
\]
so the frequency response is of the form
\[
A_{yx}(\omega) = \frac{f_{yx}(\omega)}{f_{xx}(\omega)} \tag{4.108}
\]
\[
 = \frac{c_{yx}(\omega)}{f_{xx}(\omega)} - i\, \frac{q_{yx}(\omega)}{f_{xx}(\omega)}, \tag{4.109}
\]
where we have used (4.88) to get the last form. Then, we may write (4.109) in polar coordinates as
\[
A_{yx}(\omega) = |A_{yx}(\omega)| \exp\{-i\, \phi_{yx}(\omega)\}, \tag{4.110}
\]
where the amplitude and phase of the filter are defined by
\[
|A_{yx}(\omega)| = \frac{\sqrt{c^2_{yx}(\omega) + q^2_{yx}(\omega)}}{f_{xx}(\omega)} \tag{4.111}
\]
and
\[
\phi_{yx}(\omega) = \tan^{-1}\Big( -\frac{q_{yx}(\omega)}{c_{yx}(\omega)} \Big). \tag{4.112}
\]
A simple interpretation of the phase of a linear filter is that it exhibits time delays as a function of frequency in the same way as the spectrum represents the variance as a function of frequency. Additional insight can be gained by considering the simple delaying filter
\[
y_t = A x_{t-D},
\]
where the series gets replaced by a version amplified by multiplying by $A$ and delayed by $D$ points. For this case,
\[
f_{yx}(\omega) = A e^{-2\pi i \omega D} f_{xx}(\omega),
\]
the amplitude is $|A|$, and the phase is
\[
\phi_{yx}(\omega) = -2\pi\omega D,
\]
or just a linear function of frequency $\omega$. For this case, applying a simple time delay causes phase delays that depend on the frequency of the periodic component being delayed. Interpretation is further enhanced by setting

\[
x_t = \cos(2\pi\omega t),
\]
in which case
\[
y_t = A \cos(2\pi\omega t - 2\pi\omega D).
\]
Thus, the output series, $y_t$, has the same period as the input series, $x_t$, but the amplitude of the output has increased by a factor of $|A|$ and the phase has been changed by a factor of $-2\pi\omega D$.

Example 4.23 Difference and Moving Average Filters
We consider calculating the amplitude and phase of the two filters discussed in Example 4.22. The case for the moving average is easy because $A_{yx}(\omega)$ given in (4.107) is purely real. So, the amplitude is just $|A_{yx}(\omega)|$ and the phase is $\phi_{yx}(\omega) = 0$. In general, symmetric ($a_j = a_{-j}$) filters have zero phase. The first difference, however, changes this, as we might expect from the example above involving the time delay filter. In this case, the squared amplitude is given in (4.106). To compute the phase, we write
\[
A_{yx}(\omega) = 1 - e^{-2\pi i \omega} = e^{-i\pi\omega}\big(e^{i\pi\omega} - e^{-i\pi\omega}\big)
 = 2 i\, e^{-i\pi\omega} \sin(\pi\omega)
 = 2 \sin^2(\pi\omega) + 2 i \cos(\pi\omega)\sin(\pi\omega)
 = \frac{c_{yx}(\omega)}{f_{xx}(\omega)} - i\, \frac{q_{yx}(\omega)}{f_{xx}(\omega)},
\]
so
\[
\phi_{yx}(\omega) = \tan^{-1}\Big( -\frac{q_{yx}(\omega)}{c_{yx}(\omega)} \Big) = \tan^{-1}\Big( \frac{\cos(\pi\omega)}{\sin(\pi\omega)} \Big).
\]
Noting that
\[
\cos(\pi\omega) = \sin(-\pi\omega + \pi/2)
\]
and that
\[
\sin(\pi\omega) = \cos(-\pi\omega + \pi/2),
\]
we get
\[
\phi_{yx}(\omega) = -\pi\omega + \pi/2,
\]
and the phase is again a linear function of frequency.

The above tendency of the frequencies to arrive at different times in the filtered version of the series remains as one of two annoying features of the difference type filters. The other weakness is the gentle increase in the frequency response function. If low frequencies are really unimportant and high frequencies are to be preserved, we would like to have a somewhat sharper response than is obvious in Figure 4.17. Similarly, if low frequencies are important and high frequencies are not, the moving average filters are also not very efficient at passing the low frequencies and attenuating the high frequencies. Improvement is possible by designing better and longer filters, but we do not discuss this here.
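Returning briefly to Example 4.23, a quick numerical check of the phase calculation is possible: the sketch below evaluates $A_{yx}(\omega) = 1 - e^{-2\pi i\omega}$ on a grid and compares its numerical phase with $-\pi\omega + \pi/2$. The grid and variable names are ours.
# Sketch: verify the phase of the first difference filter numerically
w   <- seq(.01, .49, by=.01)          # avoid omega = 0, where the response vanishes
A   <- 1 - exp(-2i*pi*w)              # frequency response of the first difference
phs <- Arg(A)                         # numerical phase, cf. (4.112)
max(abs(phs - (pi/2 - pi*w)))         # agrees with -pi*omega + pi/2 (essentially zero)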

We will occasionally use results for multivariate series $x_t = (x_{t1}, \ldots, x_{tp})'$ that are comparable to the simple property shown in (4.22). Consider the matrix filter
\[
y_t = \sum_{j=-\infty}^{\infty} A_j x_{t-j}, \tag{4.113}
\]
where $\{A_j\}$ denotes a sequence of $q \times p$ matrices such that $\sum_{j=-\infty}^{\infty} \|A_j\| < \infty$ and $\|\cdot\|$ denotes any matrix norm, $x_t = (x_{t1}, \ldots, x_{tp})'$ is a $p \times 1$ stationary vector process with mean vector $\mu_x$, $p \times p$ matrix covariance function $\Gamma_{xx}(h)$, and spectral matrix $f_{xx}(\omega)$, and $y_t$ is the $q \times 1$ vector output process. Then, we can obtain the following property.

Property 4.9 Output Spectral Matrix of Filtered Vector Series
The spectral matrix of the filtered output $y_t$ in (4.113) is related to the spectrum of the input $x_t$ by
\[
f_{yy}(\omega) = A(\omega)\, f_{xx}(\omega)\, A^*(\omega), \tag{4.114}
\]
where the matrix frequency response function $A(\omega)$ is defined by
\[
A(\omega) = \sum_{j=-\infty}^{\infty} A_j \exp(-2\pi i \omega j). \tag{4.115}
\]

4.8 Lagged Regression Models

One of the intriguing possibilities offered by the coherence analysis of the relation between the SOI and Recruitment series discussed in Example 4.21 would be extending classical regression to the analysis of lagged regression models of the form
\[
y_t = \sum_{r=-\infty}^{\infty} \beta_r x_{t-r} + v_t, \tag{4.116}
\]
where $v_t$ is a stationary noise process, $x_t$ is the observed input series, and $y_t$ is the observed output series. We are interested in estimating the filter coefficients $\beta_r$ relating the adjacent lagged values of $x_t$ to the output series $y_t$.

In the case of the SOI and Recruitment series, we might identify the El Niño driving series, SOI, as the input, $x_t$, and $y_t$, the Recruitment series, as the output. In general, there will be more than a single possible input series and we may envision a $q \times 1$ vector of driving series. This multivariate input situation is covered in Chapter 7. The model given by (4.116) is useful under several different scenarios, corresponding to different assumptions that can be made about the components.

We assume that the inputs and outputs have zero means and are jointly stationary with the $2 \times 1$ vector process $(x_t, y_t)'$ having a spectral matrix of the form
\[
f(\omega) = \begin{pmatrix} f_{xx}(\omega) & f_{xy}(\omega) \\ f_{yx}(\omega) & f_{yy}(\omega) \end{pmatrix}. \tag{4.117}
\]
Here, $f_{xy}(\omega)$ is the cross-spectrum relating the input $x_t$ to the output $y_t$, and $f_{xx}(\omega)$ and $f_{yy}(\omega)$ are the spectra of the input and output series, respectively. Generally, we

observe two series, regarded as input and output, and search for regression functions $\{\beta_t\}$ relating the inputs to the outputs. We assume all autocovariance functions satisfy the absolute summability conditions of the form (4.38).

Then, minimizing the mean squared error
\[
MSE = E\Big( y_t - \sum_{r=-\infty}^{\infty} \beta_r x_{t-r} \Big)^2 \tag{4.118}
\]
leads to the usual orthogonality conditions
\[
E\Big[\Big( y_t - \sum_{r=-\infty}^{\infty} \beta_r x_{t-r} \Big) x_{t-s}\Big] = 0 \tag{4.119}
\]
for all $s = 0, \pm 1, \pm 2, \ldots$. Taking the expectations inside leads to the normal equations
\[
\sum_{r=-\infty}^{\infty} \beta_r\, \gamma_{xx}(s - r) = \gamma_{yx}(s) \tag{4.120}
\]
for $s = 0, \pm 1, \pm 2, \ldots$. These equations might be solved, with some effort, if the covariance functions were known exactly. If data $(x_t, y_t)$ for $t = 1, \ldots, n$ are available, we might use a finite approximation to the above equations with $\hat\gamma_{xx}(h)$ and $\hat\gamma_{yx}(h)$ substituted into (4.120). If the regression vectors are essentially zero for $|s| \ge M/2$, and $M < n$, the system (4.120) would be of full rank and the solution would involve inverting an $(M-1) \times (M-1)$ matrix.

A frequency domain approximate solution is easier in this case for two reasons. First, the computations depend on spectra and cross-spectra that can be estimated from sample data using the techniques of Section 4.5. In addition, no matrices will have to be inverted, although the frequency domain ratio will have to be computed for each frequency. In order to develop the frequency domain solution, substitute the representation (4.96) into the normal equations, using the convention defined in (4.117). The left side of (4.120) can then be written in the form
\[
\int_{-1/2}^{1/2} \sum_{r=-\infty}^{\infty} \beta_r\, e^{2\pi i \omega (s-r)}\, f_{xx}(\omega)\, d\omega
 = \int_{-1/2}^{1/2} B(\omega)\, f_{xx}(\omega)\, e^{2\pi i \omega s}\, d\omega,
\]
where
\[
B(\omega) = \sum_{r=-\infty}^{\infty} \beta_r\, e^{-2\pi i \omega r} \tag{4.121}
\]
is the Fourier transform of the regression coefficients $\beta_r$. Now, because $\gamma_{yx}(s)$ is the inverse transform of the cross-spectrum $f_{yx}(\omega)$, we might write the system of equations in the frequency domain, using the uniqueness of the Fourier transform, as
\[
B(\omega)\, f_{xx}(\omega) = f_{yx}(\omega), \tag{4.122}
\]
which then become the analogs of the usual normal equations. Then, we may take

\[
\hat B(\omega_k) = \frac{\hat f_{yx}(\omega_k)}{\hat f_{xx}(\omega_k)} \tag{4.123}
\]
as the estimator for the Fourier transform of the regression coefficients, evaluated at some subset of fundamental frequencies $\omega_k = k/M$ with $M \ll n$. Generally, we assume smoothness of $B(\cdot)$ over intervals of the form $\{\omega_k + \ell/n;\ \ell = -m, \ldots, 0, \ldots, m\}$, with $L = 2m+1$. The inverse transform of the function $\hat B(\omega)$ would give $\hat\beta_t$, and we note that the discrete time approximation can be taken as
\[
\hat\beta_t = M^{-1} \sum_{k=0}^{M-1} \hat B(\omega_k)\, e^{2\pi i \omega_k t} \tag{4.124}
\]
for $t = 0, \pm 1, \pm 2, \ldots, \pm(M/2 - 1)$. If we were to use (4.124) to define $\hat\beta_t$ for $|t| \ge M/2$, we would end up with a sequence of coefficients that is periodic with a period of $M$. In practice we define $\hat\beta_t = 0$ for $|t| \ge M/2$ instead. Problem 4.32 explores the error resulting from this approximation.

Fig. 4.18. Estimated impulse response functions relating SOI to Recruitment (top) and Recruitment to SOI (bottom); L = 15, M = 32.

Example 4.24 Lagged Regression for SOI and Recruitment
The high coherence between the SOI and Recruitment series noted in Example 4.21 suggests a lagged regression relation between the two series. A natural direction for the implication is suggested because we feel that the sea surface temperature, or SOI, should be the input and the Recruitment series should be the output. With this in mind, let $x_t$ be the SOI series and $y_t$ the Recruitment series.

Although we think naturally of the SOI as the input and the Recruitment as the output, two input-output configurations are of interest. With SOI as the input, the model is
\[
y_t = \sum_{r=-\infty}^{\infty} a_r x_{t-r} + w_t,
\]

whereas a model that reverses the two roles would be
\[
x_t = \sum_{r=-\infty}^{\infty} b_r y_{t-r} + v_t,
\]
where $w_t$ and $v_t$ are white noise processes. Even though there is no plausible environmental explanation for the second of these two models, displaying both possibilities helps to settle on a parsimonious transfer function model.

Based on the script LagReg in astsa, the estimated regression or impulse response function for SOI, with $M = 32$ and $L = 15$, is
LagReg(soi, rec, L=15, M=32, threshold=6)
      lag s    beta(s)
 [1,]     5  -18.479306
 [2,]     6  -12.263296
 [3,]     7   -8.539368
 [4,]     8   -6.984553
The prediction equation is
  rec(t) = alpha + sum_s[ beta(s)*soi(t-s) ], where alpha = 65.97
MSE = 414.08
Note the negative peak at a lag of five points in the top of Figure 4.18; in this case, SOI is the input series. The fall-off after lag five seems to be approximately exponential and a possible model is
\[
y_t = 66 - 18.5\, x_{t-5} - 12.3\, x_{t-6} - 8.5\, x_{t-7} - 7\, x_{t-8} + w_t.
\]
If we examine the inverse relation, namely, a regression model with the Recruitment series $y_t$ as the input, the bottom of Figure 4.18 implies a much simpler model,
LagReg(rec, soi, L=15, M=32, inverse=TRUE, threshold=.01)
      lag s    beta(s)
 [1,]     4   0.01593167
 [2,]     5  -0.02120013
The prediction equation is
  soi(t) = alpha + sum_s[ beta(s)*rec(t+s) ], where alpha = 0.41
MSE = 0.07
depending on only two coefficients, namely,
\[
x_t = .41 + .016\, y_{t+4} - .02\, y_{t+5} + v_t.
\]
Multiplying both sides by $50 B^5$ and rearranging, we have
\[
(1 - .8 B)\, y_t = 20.5 - 50\, x_{t-5} + \epsilon_t.
\]
Finally, we check whether the noise, $\epsilon_t$, is white. In addition, at this point, it simplifies matters if we rerun the regression with autocorrelated errors and reestimate the coefficients. The model is referred to as an ARMAX model (the X stands for exogenous; see Section 5.6 and Section 6.6.1):
fish = ts.intersect(R=rec, RL1=lag(rec,-1), SL5=lag(soi,-5))
(u = lm(fish[,1]~fish[,2:3], na.action=NULL))    # suggests ar1
acf2(resid(u))
sarima(fish[,1], 1, 0, 0, xreg=fish[,2:3])       # armax model

Coefficients:
          ar1  intercept     RL1       SL5
       0.4487    12.3323  0.8005  -21.0307
s.e.   0.0503     1.5746  0.0234    1.0915
sigma^2 estimated as 49.93

Our final parsimonious fitted model is (with rounding)
\[
y_t = 12 + .8\, y_{t-1} - 21\, x_{t-5} + \epsilon_t, \qquad \epsilon_t = .45\, \epsilon_{t-1} + w_t,
\]
where $w_t$ is white noise with $\sigma^2_w = 50$. This example is also examined in Chapter 5, where the fitted values for the final model can be viewed in Figure 5.12.

The example shows we can get a clean estimator for the transfer functions relating the two series if the coherence $\hat\rho^2_{xy}(\omega)$ is large. The reason is that we can write the minimized mean squared error (4.118) as
\[
MSE = E\Big[\Big( y_t - \sum_{r} \beta_r x_{t-r} \Big) y_t\Big] = \gamma_{yy}(0) - \sum_{r} \beta_r\, \gamma_{xy}(-r),
\]
using the result about the orthogonality of the data and error term in the Projection theorem. Then, substituting the spectral representations of the autocovariance and cross-covariance functions and identifying the Fourier transform (4.121) in the result leads to
\[
MSE = \int_{-1/2}^{1/2} \big[ f_{yy}(\omega) - B(\omega) f_{xy}(\omega) \big]\, d\omega
 = \int_{-1/2}^{1/2} f_{yy}(\omega)\big[ 1 - \rho^2_{yx}(\omega) \big]\, d\omega, \tag{4.125}
\]
where $\rho^2_{yx}(\omega)$ is just the squared coherence given by (4.94). The similarity of (4.125) to the usual mean square error that results from predicting $y$ from $x$ is obvious. In that case, we would have
\[
E(y - \beta x)^2 = \sigma^2_y\big(1 - \rho^2_{xy}\big)
\]
for jointly distributed random variables $x$ and $y$ with zero means, variances $\sigma^2_x$ and $\sigma^2_y$, and covariance $\sigma_{xy} = \rho_{xy}\sigma_x\sigma_y$. Because the mean squared error in (4.125) satisfies $MSE \ge 0$ with $f_{yy}(\omega)$ a non-negative function, it follows that the coherence satisfies
\[
0 \le \rho^2_{xy}(\omega) \le 1
\]
for all $\omega$. Furthermore, Problem 4.33 shows the squared coherence is one when the output and the input are linearly related by the filter relation (4.116), and there is no noise, i.e., $v_t = 0$. Hence, the multiple coherence gives a measure of the association or correlation between the input and output series as a function of frequency.

The matter of verifying that the F-distribution claimed for (4.104) will hold when the sample coherence values are substituted for theoretical values still remains. Again,

the form of the F-statistic is exactly analogous to the usual t-test for no correlation in a regression context. We give an argument leading to this conclusion later using the results in Section C.3. Another question that has not been resolved in this section is the extension to the case of multiple inputs $x_{t1}, x_{t2}, \ldots, x_{tq}$. Often, more than just a single input series is present that can possibly form a lagged predictor of the output series $y_t$. An example is the cardiovascular mortality series that depended on possibly a number of pollution series and temperature. We discuss this particular extension as a part of the multivariate time series techniques considered in Chapter 7.

4.9 Signal Extraction and Optimum Filtering

A model closely related to regression can be developed by assuming again that
\[
y_t = \sum_{r=-\infty}^{\infty} \beta_r x_{t-r} + v_t, \tag{4.126}
\]
but where the $\beta_s$ are known and $x_t$ is some unknown random signal that is uncorrelated with the noise process $v_t$. In this case, we observe only $y_t$ and are interested in an estimator for the signal $x_t$ of the form
\[
\hat x_t = \sum_{r=-\infty}^{\infty} a_r y_{t-r}. \tag{4.127}
\]
In the frequency domain, it is convenient to make the additional assumptions that the series $x_t$ and $v_t$ are both mean-zero stationary series with spectra $f_{xx}(\omega)$ and $f_{vv}(\omega)$, often referred to as the signal spectrum and noise spectrum, respectively. Often, the special case $\beta_t = \delta_t$, in which $\delta_t$ is the Kronecker delta, is of interest because (4.126) reduces to the simple signal plus noise model
\[
y_t = x_t + v_t \tag{4.128}
\]
in that case. In general, we seek the set of filter coefficients $a_t$ that minimize the mean squared error of estimation, say,
\[
MSE = E\Big[\Big( x_t - \sum_{r=-\infty}^{\infty} a_r y_{t-r} \Big)^2\Big]. \tag{4.129}
\]
This problem was originally solved by Kolmogorov (1941) and by Wiener (1949), who derived the result in 1941 and published it in classified reports during World War II.

We can apply the orthogonality principle to write
\[
E\Big[\Big( x_t - \sum_{r=-\infty}^{\infty} a_r y_{t-r} \Big) y_{t-s}\Big] = 0
\]

for $s = 0, \pm 1, \pm 2, \ldots$, which leads to
\[
\sum_{r=-\infty}^{\infty} a_r\, \gamma_{yy}(s - r) = \gamma_{xy}(s),
\]
to be solved for the filter coefficients. Substituting the spectral representations for the autocovariance functions into the above and identifying the spectral densities through the uniqueness of the Fourier transform produces
\[
A(\omega)\, f_{yy}(\omega) = f_{xy}(\omega), \tag{4.130}
\]
where $A(\omega)$ and the optimal filter $a_t$ are Fourier transform pairs, as are $B(\omega)$ and $\beta_t$. Now, a special consequence of the model is that (see Problem 4.30)
\[
f_{xy}(\omega) = B^*(\omega)\, f_{xx}(\omega) \tag{4.131}
\]
and
\[
f_{yy}(\omega) = |B(\omega)|^2 f_{xx}(\omega) + f_{vv}(\omega), \tag{4.132}
\]
implying the optimal filter would be the Fourier transform of
\[
A(\omega) = \frac{B^*(\omega)}{|B(\omega)|^2 + \dfrac{f_{vv}(\omega)}{f_{xx}(\omega)}}, \tag{4.133}
\]
where the second term in the denominator is just the inverse of the signal to noise ratio, say,
\[
SNR(\omega) = \frac{f_{xx}(\omega)}{f_{vv}(\omega)}. \tag{4.134}
\]
The result shows the optimum filters can be computed for this model if the signal and noise spectra are both known or if we can assume knowledge of the signal-to-noise ratio $SNR(\omega)$ as a function of frequency. In Chapter 7, we show some methods for estimating these two parameters in conjunction with random effects analysis of variance models, but we assume here that it is possible to specify the signal-to-noise ratio a priori. If the signal-to-noise ratio is known, the optimal filter can be computed by the inverse transform of the function $A(\omega)$. It is more likely that the inverse transform will be intractable and a finite filter approximation like that used in the previous section can be applied to the data. In this case, we will have
\[
a_t = M^{-1} \sum_{k=0}^{M-1} A(\omega_k)\, e^{2\pi i \omega_k t} \tag{4.135}
\]
as the estimated filter function. It will often be the case that the form of the specified frequency response will have some rather sharp transitions between regions where the signal-to-noise ratio is high and regions where there is little signal. In these cases, the shape of the frequency response function will have ripples that can introduce frequencies at different amplitudes.
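For the pure signal plus noise case (4.128), where $B(\omega) \equiv 1$, the optimal response (4.133) collapses to $SNR(\omega)/(1 + SNR(\omega))$. The sketch below builds such a response for a hypothetical signal-to-noise ratio that is large at low frequencies and zero above a cutoff, and then computes the finite filter coefficients (4.135). The SNR shape, M, and cutoff are our own illustrative choices rather than those used in Example 4.25; the script SigExtract in astsa handles that case.
# Sketch: optimal filter for the signal-plus-noise model, A = SNR/(1+SNR), cf. (4.133)-(4.135)
M      <- 64
wk     <- (0:(M-1))/M                      # fundamental frequencies k/M
w.fold <- pmin(wk, 1 - wk)                 # fold onto [0, 1/2]
snr    <- ifelse(w.fold < .05, 10, 0)      # hypothetical signal-to-noise ratio, cf. (4.134)
Awk    <- snr/(1 + snr)                    # optimal response when B(omega) = 1
tt     <- -(M/2 - 1):(M/2 - 1)
a      <- sapply(tt, function(t) Re(mean(Awk*exp(2i*pi*wk*t))))   # coefficients, cf. (4.135)
plot(tt, a, type="h", xlab="s", ylab="a(s)")   # untapered coefficients; compare Figure 4.19
These untapered coefficients exhibit exactly the rippling behavior described above, which motivates the tapering discussed next.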

Fig. 4.19. Filter coefficients (top) and frequency response functions (bottom) for designed SOI filters.

An aesthetic solution to this problem is to introduce tapering as was done with spectral estimation in (4.69)-(4.76). We use below the tapered filter $\tilde a_t = h_t a_t$, where $h_t$ is the cosine taper given in (4.76). The squared frequency response of the resulting filter will be $|\tilde A(\omega)|^2$, where
\[
\tilde A(\omega) = \sum_{t=-\infty}^{\infty} h_t a_t\, e^{-2\pi i \omega t}. \tag{4.136}
\]
The results are illustrated in the following example that extracts the El Niño component of the sea surface temperature series.

Example 4.25 Estimating the El Niño Signal via Optimal Filters
Figure 4.7 shows the spectrum of the SOI series, and we note that essentially two components have power, the El Niño frequency of about .02 cycles per month (the four-year cycle) and a yearly frequency of about .08 cycles per month (the annual cycle). We assume, for this example, that we wish to preserve the lower frequency as signal and to eliminate the higher order frequencies, and in particular, the annual cycle. In this case, we assume the simple signal plus noise model
\[
y_t = x_t + v_t,
\]
so that there is no convolving function $\beta_t$. Furthermore, the signal-to-noise ratio is assumed to be high to about .06 cycles per month and zero thereafter. The optimal frequency response was assumed to be unity to .05 cycles per point and then to decay linearly to zero in several steps. Figure 4.19 shows the coefficients as specified by (4.135) with $M = 64$, as well as the frequency response function given by (4.136) of the cosine tapered coefficients; recall Figure 4.11, where we demonstrated the

Fig. 4.20. Original SOI series (top) compared to the filtered version showing the estimated El Niño temperature signal (bottom).

The constructed response function is compared to the ideal window in Figure 4.19. Figure 4.20 shows the original and filtered SOI index, and we see a smooth extracted signal that conveys the essence of the underlying El Niño signal. The frequency response of the designed filter can be compared with that of the symmetric 12-month moving average applied to the same series in Example 4.22. The filtered series, shown in Figure 4.16, shows a good deal of higher frequency chatter riding on the smoothed version, which has been introduced by the higher frequencies that leak through in the squared frequency response, as in Figure 4.17. The analysis can be replicated using the script SigExtract.
SigExtract(soi, L=9, M=64, max.freq=.05)

The design of finite filters with a specified frequency response requires some experimentation with various target frequency response functions, and we have only touched on the methodology here. The filter designed here, sometimes called a low-pass filter, reduces the high frequencies and keeps or passes the low frequencies. Alternately, we could design a high-pass filter to keep high frequencies if that is where the signal is located. An example of a simple high-pass filter is the first difference, with a frequency response that is shown in Figure 4.17. We can also design band-pass filters that keep frequencies in specified bands. For example, seasonal adjustment filters are often used in economics to reject seasonal frequencies while keeping the higher frequencies, the lower frequencies, and the trend (see, for example, Grether and Nerlove, 1970).
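As a quick illustration of the low-pass/high-pass distinction just described, the following sketch (my own code, not from the text) plots the squared frequency responses of the first difference filter and of a symmetric 12-month moving average with half weights at the end lags; the moving-average weights here are an assumed form and may differ from the exact filter used in Example 4.22.
w = seq(0, .5, length=500)                      # frequencies in cycles per point
# first difference x_t - x_{t-1}: |A(w)|^2 = |1 - exp(-2 pi i w)|^2 = 2 - 2 cos(2 pi w)
A2.diff = 2 - 2*cos(2*pi*w)
# assumed symmetric 12-month moving average with half weights at lags +/- 6
a = c(.5, rep(1, 11), .5)/12
lags = -6:6
A2.ma = sapply(w, function(f) Mod(sum(a*exp(-2i*pi*f*lags)))^2)
plot(w, A2.diff, type="l", ylab="squared frequency response", xlab="frequency")
lines(w, A2.ma, lty=2)    # the moving average passes low frequencies, attenuates the rest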

The filters we have discussed here are all symmetric two-sided filters, because the designed frequency response functions were purely real. Alternatively, we may design recursive filters to produce a desired response. An example of a recursive filter is one that replaces the input x_t by the filtered output

  y_t = \sum_{k=1}^{p} \phi_k y_{t-k} + x_t - \sum_{k=1}^{q} \theta_k x_{t-k}.   (4.137)

Note the similarity between (4.137) and the ARMA(p, q) model, in which the white noise component is replaced by the input. Transposing the terms involving y_t and using the basic linear filter result in Property 4.3 leads to

  f_y(\omega) = \frac{|\theta(e^{-2\pi i \omega})|^2}{|\phi(e^{-2\pi i \omega})|^2}\, f_x(\omega),   (4.138)

where

  \phi(e^{-2\pi i \omega}) = 1 - \sum_{k=1}^{p} \phi_k e^{-2\pi i k \omega}

and

  \theta(e^{-2\pi i \omega}) = 1 - \sum_{k=1}^{q} \theta_k e^{-2\pi i k \omega}.

Recursive filters such as those given by (4.138) distort the phases of arriving frequencies, and we do not consider the problem of designing such filters in any detail.
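To see what the gain in (4.138) looks like for a particular recursive filter, here is a small R sketch (parameter values chosen arbitrarily for illustration) that evaluates |θ(e^{-2πiω})|²/|φ(e^{-2πiω})|² for a filter with one autoregressive and one moving-average coefficient.
phi = .9; theta = .5                           # illustrative values, p = q = 1
w   = seq(0, .5, length=500)
z   = exp(-2i*pi*w)                            # e^{-2 pi i omega}
gain = Mod(1 - theta*z)^2 / Mod(1 - phi*z)^2   # |theta(z)|^2 / |phi(z)|^2
plot(w, gain, type="l", xlab="frequency", ylab="squared gain")
# with phi = .9 this filter strongly amplifies low frequencies relative to high ones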

4.10 Spectral Analysis of Multidimensional Series

Multidimensional series of the form x_s, where s = (s_1, s_2, …, s_r)′ is an r-dimensional vector of spatial coordinates or a combination of space and time coordinates, were introduced in Section 1.6. The example given there, shown in Figure 1.18, was a collection of temperature measurements taken on a rectangular field. These data would form a two-dimensional process, indexed by row and column in space. In that section, the multidimensional autocovariance function of an r-dimensional stationary series was given as \gamma_x(h) = E[x_{s+h} x_s], where the multidimensional lag vector is h = (h_1, h_2, …, h_r)′.

The multidimensional wavenumber spectrum is given as the Fourier transform of the autocovariance, namely,

  f_x(\omega) = \sum_h \gamma_x(h) e^{-2\pi i \omega' h}.   (4.139)

Again, the inverse result

  \gamma_x(h) = \int_{-1/2}^{1/2} \cdots \int_{-1/2}^{1/2} f_x(\omega) e^{2\pi i \omega' h}\, d\omega   (4.140)

holds, where the integral is over the multidimensional range of the vector \omega. The wavenumber argument is exactly analogous to the frequency argument, and we have the corresponding intuitive interpretation as the cycling rate \omega_i per distance traveled s_i in the i-th direction.

Two-dimensional processes occur often in practical applications, and the representations above reduce to

  f_x(\omega_1, \omega_2) = \sum_{h_1=-\infty}^{\infty} \sum_{h_2=-\infty}^{\infty} \gamma_x(h_1, h_2)\, e^{-2\pi i (\omega_1 h_1 + \omega_2 h_2)}   (4.141)

and

  \gamma_x(h_1, h_2) = \int_{-1/2}^{1/2} \int_{-1/2}^{1/2} f_x(\omega_1, \omega_2)\, e^{2\pi i (\omega_1 h_1 + \omega_2 h_2)}\, d\omega_1\, d\omega_2   (4.142)

in the case r = 2. The notion of linear filtering generalizes easily to the two-dimensional case by defining the impulse response function a_{s_1, s_2} and the spatial filter output as

  y_{s_1, s_2} = \sum_{u_1} \sum_{u_2} a_{u_1, u_2}\, x_{s_1 - u_1, s_2 - u_2}.   (4.143)

The spectrum of the output of this filter can be derived as

  f_y(\omega_1, \omega_2) = |A(\omega_1, \omega_2)|^2 f_x(\omega_1, \omega_2),   (4.144)

where

  A(\omega_1, \omega_2) = \sum_{u_1} \sum_{u_2} a_{u_1, u_2}\, e^{-2\pi i (\omega_1 u_1 + \omega_2 u_2)}.   (4.145)

These results are analogous to those in the one-dimensional case, described by Property 4.3.

The multidimensional DFT is also a straightforward generalization of the univariate expression. In the two-dimensional case with data on a rectangular grid, {x_{s_1, s_2}; s_1 = 1, …, n_1, s_2 = 1, …, n_2}, we will write, for -1/2 ≤ \omega_1, \omega_2 ≤ 1/2,

  d(\omega_1, \omega_2) = (n_1 n_2)^{-1/2} \sum_{s_1=1}^{n_1} \sum_{s_2=1}^{n_2} x_{s_1, s_2}\, e^{-2\pi i (\omega_1 s_1 + \omega_2 s_2)}   (4.146)

as the two-dimensional DFT, where the frequencies \omega_1, \omega_2 are evaluated at multiples of (1/n_1, 1/n_2) on the spatial frequency scale. The two-dimensional wavenumber spectrum can be estimated by the smoothed sample wavenumber spectrum

  \bar f_x(\omega_1, \omega_2) = (L_1 L_2)^{-1} \sum_{\ell_1, \ell_2} |d(\omega_1 + \ell_1/n_1,\; \omega_2 + \ell_2/n_2)|^2,   (4.147)

where the sum is taken over the grid {-m_j ≤ \ell_j ≤ m_j; j = 1, 2}, where L_1 = 2m_1 + 1 and L_2 = 2m_2 + 1. The statistic

  \frac{2 L_1 L_2\, \bar f_x(\omega_1, \omega_2)}{f_x(\omega_1, \omega_2)} \sim \chi^2_{2 L_1 L_2}   (4.148)

can be used to set confidence intervals or make approximate tests against a fixed assumed spectrum f_0(\omega_1, \omega_2).

Fig. 4.21. Two-dimensional periodogram of the soil temperature profile showing the peak at .0625 cycles/row. The period is 16 rows, and this corresponds to 16 × 17 ft = 272 ft.

Example 4.26 Soil Surface Temperatures
As an example, consider the periodogram of the two-dimensional temperature series shown in Figure 1.18 and analyzed by Bazza et al. (1988). We recall the spatial coordinates in this case will be (s_1, s_2), which define rows and columns, so that the frequencies in the two directions will be expressed as cycles per row and cycles per column. Figure 4.21 shows the periodogram of the two-dimensional temperature series, and we note the ridge of strong spectral peaks running over rows at a column frequency of zero. An obvious periodic component appears at frequencies of .0625 and −.0625 cycles per row, which corresponds to 16 rows or about 272 ft. On further investigation of previous irrigation patterns over this field, treatment levels of salt were found to have varied periodically over columns. This analysis is extended in Problem 4.24, where we recover the salt treatment profile over rows and compare it to a signal computed by averaging over columns.

Figure 4.21 may be reproduced in R as follows. In the code for this example, the periodogram is computed in one step as per; the rest of the code is simply manipulation to obtain a nice graphic.
per  = Mod(fft(soiltemp-mean(soiltemp))/sqrt(64*36))^2
per2 = cbind(per[1:32,18:2], per[1:32,1:18])
per3 = rbind(per2[32:2,], per2)
par(mar=c(1,2.5,0,0)+.1)
persp(-31:31/64, -17:17/36, per3, phi=30, theta=30, expand=.6, ticktype="detailed",
      xlab="cycles/row", ylab="cycles/column", zlab="Periodogram Ordinate")
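As a follow-up, the smoothed sample wavenumber spectrum (4.147) can be obtained from the raw periodogram computed above by averaging neighboring ordinates. The sketch below is one possible implementation (my own, not part of astsa), assuming the matrix per from the previous chunk and using L1 = L2 = 3 (m1 = m2 = 1).
m1 = 1; m2 = 1
n1 = nrow(per); n2 = ncol(per)
fbar = matrix(NA, n1, n2)
for (i in 1:n1) {
  for (j in 1:n2) {
    # wrap indices so averaging respects the periodicity of the DFT frequencies
    ii = ((i - m1 - 1):(i + m1 - 1)) %% n1 + 1
    jj = ((j - m2 - 1):(j + m2 - 1)) %% n2 + 1
    fbar[i, j] = mean(per[ii, jj])
  }
}
# an approximate 95% interval at one (arbitrarily chosen) wavenumber via (4.148)
df = 2*(2*m1 + 1)*(2*m2 + 1)
c(lower = df*fbar[5,1]/qchisq(.975, df), upper = df*fbar[5,1]/qchisq(.025, df))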

Another application of two-dimensional spectral analysis of agricultural field trials is given in McBratney and Webster (1981), who used it to detect ridge and furrow patterns in yields. The requirement for regular, equally spaced samples on fairly large grids has tended to limit enthusiasm for strict two-dimensional spectral analysis. An exception is when a propagating signal from a given velocity and azimuth is present, so predicting the wavenumber spectrum as a function of velocity and azimuth becomes feasible (see Shumway et al., 1999).

Problems

Section 4.1

4.1 Verify that for any positive integer n and j, k = 0, 1, …, [[n/2]], where [[·]] denotes the greatest integer function:
(a) Except for j = 0 or j = n/2,

  \sum_{t=1}^{n} \cos^2(2\pi t j/n) = \sum_{t=1}^{n} \sin^2(2\pi t j/n) = n/2.

(b) When j = 0 or j = n/2,

  \sum_{t=1}^{n} \cos^2(2\pi t j/n) = n,  but  \sum_{t=1}^{n} \sin^2(2\pi t j/n) = 0.

(c) For j ≠ k,

  \sum_{t=1}^{n} \cos(2\pi t j/n) \cos(2\pi t k/n) = \sum_{t=1}^{n} \sin(2\pi t j/n) \sin(2\pi t k/n) = 0.

Also, for any j and k,

  \sum_{t=1}^{n} \cos(2\pi t j/n) \sin(2\pi t k/n) = 0.

Hint: We'll do part of the problem.

  \sum_{t=1}^{n} \cos^2(2\pi t j/n) = \frac{1}{4} \sum_{t=1}^{n} \left( e^{2\pi i t j/n} + e^{-2\pi i t j/n} \right)\left( e^{2\pi i t j/n} + e^{-2\pi i t j/n} \right)
    = \frac{1}{4} \sum_{t=1}^{n} \left( e^{4\pi i t j/n} + 1 + 1 + e^{-4\pi i t j/n} \right).
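A quick numerical check of these orthogonality relations (not a proof, just a sanity check added here) can be run in R:
n = 16; t = 1:n
j = 3; k = 5                                # arbitrary indices with 0 < j, k < n/2
sum(cos(2*pi*t*j/n)^2)                      # = n/2 = 8
sum(sin(2*pi*t*j/n)^2)                      # = n/2 = 8
sum(cos(2*pi*t*j/n) * cos(2*pi*t*k/n))      # = 0 (up to rounding) since j != k
sum(cos(2*pi*t*j/n) * sin(2*pi*t*k/n))      # = 0 (up to rounding)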

4.2 Repeat the simulations and analyses in Example 4.1 and Example 4.2 with the following changes:
(a) Change the sample size to n = 128 and generate and plot the same series as in Example 4.1:

  x_{t1} = 2 \cos(2\pi\, .06\, t) + 3 \sin(2\pi\, .06\, t),
  x_{t2} = 4 \cos(2\pi\, .10\, t) + 5 \sin(2\pi\, .10\, t),
  x_{t3} = 6 \cos(2\pi\, .40\, t) + 7 \sin(2\pi\, .40\, t),
  x_t = x_{t1} + x_{t2} + x_{t3}.

What is the major difference between these series and the series generated in Example 4.1? (Hint: The answer is fundamental. But if your answer is that the series are longer, you may be punished severely.)
(b) As in Example 4.2, compute and plot the periodogram of the series, x_t, generated in (a) and comment.
(c) Repeat the analyses of (a) and (b) but with n = 100 (as in Example 4.1), and adding noise to x_t; that is

  x_t = x_{t1} + x_{t2} + x_{t3} + w_t,

where w_t ∼ iid N(0, 25). That is, you should simulate and plot the data, and then plot the periodogram of x_t and comment.

4.3 With reference to equations (4.1) and (4.2), let Z_1 = U_1 and Z_2 = -U_2 be independent, standard normal variables. Consider the polar coordinates of the point (Z_1, Z_2), that is,

  A^2 = Z_1^2 + Z_2^2   and   \phi = \tan^{-1}(Z_2/Z_1).

(a) Find the joint density of A^2 and \phi, and from the result, conclude that A^2 and \phi are independent random variables, where A^2 is a chi-squared random variable with 2 df, and \phi is uniformly distributed on (−π, π).
(b) Going in reverse from polar coordinates to rectangular coordinates, suppose we assume that A^2 and \phi are independent random variables, where A^2 is chi-squared with 2 df, and \phi is uniformly distributed on (−π, π). With Z_1 = A \cos(\phi) and Z_2 = A \sin(\phi), where A is the positive square root of A^2, show that Z_1 and Z_2 are independent, standard normal random variables.

4.4 Verify (4.5).

Section 4.2

4.5 A time series was generated by first drawing the white noise series w_t from a normal distribution with mean zero and variance one. The observed series x_t was generated from

  x_t = w_t − \theta w_{t−1},   t = 0, ±1, ±2, …,

where \theta is a parameter.

(a) Derive the theoretical mean value and autocovariance functions for the series x_t and w_t. Are the series x_t and w_t stationary? Give your reasons.
(b) Give a formula for the power spectrum of x_t, expressed in terms of \theta and \omega.

4.6 A first-order autoregressive model is generated from the white noise series w_t using the generating equations

  x_t = \phi x_{t−1} + w_t,

where \phi, for |\phi| < 1, is a parameter and the w_t are independent random variables with mean zero and variance \sigma_w^2.
(a) Show that the power spectrum of x_t is given by

  f_x(\omega) = \frac{\sigma_w^2}{1 − 2\phi \cos(2\pi\omega) + \phi^2}.

(b) Verify the autocovariance function of this process is

  \gamma_x(h) = \frac{\sigma_w^2\, \phi^{|h|}}{1 − \phi^2},

h = 0, ±1, ±2, …, by showing that the inverse transform of \gamma_x(h) is the spectrum derived in part (a).

4.7 In applications, we will often observe series containing a signal that has been delayed by some unknown time D, i.e.,

  x_t = s_t + A s_{t−D} + n_t,

where s_t and n_t are stationary and independent with zero means and spectral densities f_s(\omega) and f_n(\omega), respectively. The delayed signal is multiplied by some unknown constant A. Show that

  f_x(\omega) = [1 + A^2 + 2A \cos(2\pi\omega D)]\, f_s(\omega) + f_n(\omega).

4.8 Suppose x_t and y_t are stationary zero-mean time series with x_t independent of y_s for all s and t. Consider the product series

  z_t = x_t y_t.

Prove the spectral density for z_t can be written as

  f_z(\omega) = \int_{-1/2}^{1/2} f_x(\omega − \nu) f_y(\nu)\, d\nu.

Fig. 4.22. Smoothed 12-month sunspot numbers (sunspotz) sampled twice per year.

Section 4.3

4.9 Figure 4.22 shows the biyearly smoothed (12-month moving average) number of sunspots from June 1749 to December 1978 with n = 459 points that were taken twice per year; the data are contained in sunspotz. With Example 4.13 as a guide, perform a periodogram analysis identifying the predominant periods and obtaining confidence intervals for the identified periods. Interpret your findings.

4.10 The levels of salt concentration known to have occurred over rows, corresponding to the average temperature levels for the soil science data considered in Figure 1.18 and Figure 1.19, are in salt and saltemp. Plot the series and then identify the dominant frequencies by performing separate spectral analyses on the two series. Include confidence intervals for the dominant frequencies and interpret your findings.

4.11 Let the observed series x_t be composed of a periodic signal and noise so it can be written as

  x_t = \beta_1 \cos(2\pi\omega_k t) + \beta_2 \sin(2\pi\omega_k t) + w_t,

where w_t is a white noise process with variance \sigma_w^2. The frequency \omega_k is assumed to be known and of the form k/n in this problem. Suppose we consider estimating \beta_1, \beta_2 and \sigma_w^2 by least squares, or equivalently, by maximum likelihood if the w_t are assumed to be Gaussian.
(a) Prove, for a fixed \omega_k, the minimum squared error is attained by

  \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = 2 n^{-1/2} \begin{pmatrix} d_c(\omega_k) \\ d_s(\omega_k) \end{pmatrix},

where the cosine and sine transforms (4.31) and (4.32) appear on the right-hand side.
(b) Prove that the error sum of squares can be written as

  SSE = \sum_{t=1}^{n} x_t^2 − 2 I_x(\omega_k)

so that the value of \omega_k that minimizes squared error is the same as the value that maximizes the periodogram estimator I_x(\omega_k) of (4.28).
(c) Under the Gaussian assumption and fixed \omega_k, show that the F-test of no regression leads to an F-statistic that is a monotone function of I_x(\omega_k).

4.12 Prove the convolution property of the DFT, namely,

  \sum_{s=1}^{n} a_s x_{t−s} = \sum_{k=0}^{n−1} d_A(\omega_k)\, d_x(\omega_k)\, \exp\{2\pi i \omega_k t\},

for t = 1, 2, …, n, where d_A(\omega_k) and d_x(\omega_k) are the discrete Fourier transforms of a_t and x_t, respectively, and we assume that x_t = x_{t+n} is periodic.

Section 4.4

4.13 Analyze the chicken price data (chicken) using a nonparametric spectral estimation procedure. Aside from the obvious annual cycle discovered in Example 2.5, what other interesting cycles are revealed?

4.14 Repeat Problem 4.9 using a nonparametric spectral estimation procedure. In addition to discussing your findings in detail, comment on your choice of a spectral estimate with regard to smoothing and tapering.

4.15 Repeat Problem 4.10 using a nonparametric spectral estimation procedure. In addition to discussing your findings in detail, comment on your choice of a spectral estimate with regard to smoothing and tapering.

4.16 Cepstral Analysis. The periodic behavior of a time series induced by echoes can also be observed in the spectrum of the series; this fact can be seen from the results stated in Problem 4.7. Using the notation of that problem, suppose we observe x_t = s_t + A s_{t−D} + n_t, which implies the spectra satisfy f_x(\omega) = [1 + A^2 + 2A \cos(2\pi\omega D)] f_s(\omega) + f_n(\omega). If the noise is negligible (f_n(\omega) ≈ 0), then log f_x(\omega) is approximately the sum of a periodic component, log[1 + A^2 + 2A \cos(2\pi\omega D)], and log f_s(\omega). Bogart et al. (1962) proposed treating the detrended log spectrum as a pseudo time series and calculating its spectrum, or cepstrum, which should show a peak at a quefrency corresponding to 1/D. The cepstrum can be plotted as a function of quefrency, from which the delay D can be estimated. For the speech series presented in Example 1.3, estimate the pitch period using cepstral analysis as follows. The data are in speech.
(a) Calculate and display the log-periodogram of the data. Is the periodogram periodic, as predicted?
(b) Perform a cepstral (spectral) analysis on the detrended logged periodogram, and use the results to estimate the delay D. How does your answer compare with the analysis of Example 1.27, which was based on the ACF?

4.17 Use Property 4.2 to verify (4.71). Then verify (4.74) and (4.75).

4.18 Consider two time series

  x_t = w_t − w_{t−1},
  y_t = \tfrac{1}{2}(w_t + w_{t−1}),

formed from the white noise series w_t with variance \sigma_w^2 = 1.
(a) Are x_t and y_t jointly stationary? Recall the cross-covariance function must also be a function only of the lag h and cannot depend on time.
(b) Compute the spectra f_y(\omega) and f_x(\omega), and comment on the difference between the two results.
(c) Suppose sample spectral estimators \bar f_y(.10) are computed for the series using L = 3. Find a and b such that

  P\{\, a ≤ \bar f_y(.10) ≤ b \,\} = .90.

This expression gives two points that will contain 90% of the sample spectral values. Put 5% of the area in each tail.

Section 4.5

4.19 Often, the periodicities in the sunspot series are investigated by fitting an autoregressive spectrum of sufficiently high order. The main periodicity is often stated to be in the neighborhood of 11 years. Fit an autoregressive spectral estimator to the sunspot data using a model selection method of your choice. Compare the result with a conventional nonparametric spectral estimator found in Problem 4.9.

4.20 Analyze the chicken price data (chicken) using a parametric spectral estimation procedure. Compare the results to Problem 4.13.

4.21 Fit an autoregressive spectral estimator to the Recruitment series and compare it to the results of Example 4.16.

4.22 Suppose a sample time series with n = 256 points is available from the first-order autoregressive model. Furthermore, suppose a sample spectrum computed with L = 3 yields the estimated value \bar f_x(1/8) = 2.25. Is this sample value consistent with \sigma_w^2 = 1, \phi = .5? Repeat using L = 11 if we just happen to obtain the same sample value.

4.23 Suppose we wish to test the noise alone hypothesis H_0: x_t = n_t against the signal-plus-noise hypothesis H_1: x_t = s_t + n_t, where s_t and n_t are uncorrelated zero-mean stationary processes with spectra f_s(\omega) and f_n(\omega). Suppose that we want the test over a band of L = 2m + 1 frequencies of the form \omega_{j:n} + k/n, for k = 0, ±1, ±2, …, ±m near some fixed frequency \omega_j. Assume that both the signal and noise spectra are approximately constant over the interval.

(a) Prove the approximate likelihood-based test statistic for testing H_0 against H_1 is proportional to

  T = \sum_k |d_x(\omega_{j:n} + k/n)|^2 \left( \frac{1}{f_n(\omega)} − \frac{1}{f_s(\omega) + f_n(\omega)} \right).

(b) Find the approximate distributions of T under H_0 and H_1.
(c) Define the false alarm and signal detection probabilities as P_F = P\{T > K \mid H_0\} and P_d = P\{T > k \mid H_1\}, respectively. Express these probabilities in terms of the signal-to-noise ratio f_s(\omega)/f_n(\omega) and appropriate chi-squared integrals.

Section 4.6

4.24 Analyze the coherency between the temperature and salt data discussed in Problem 4.10. Discuss your findings.

4.25 Consider two processes

  x_t = w_t   and   y_t = \phi x_{t−D} + v_t,

where w_t and v_t are independent white noise processes with common variance \sigma^2, \phi is a constant, and D is a fixed integer delay.
(a) Compute the coherency between x_t and y_t.
(b) Simulate n = 1024 normal observations from x_t and y_t for \phi = .9, \sigma^2 = 1, and D = 0. Then estimate and plot the coherency between the simulated series for the following values of L and comment: (i) L = 1, (ii) L = 3, (iii) L = 41, and (iv) L = 101.

Section 4.7

4.26 For the processes in Problem 4.25:
(a) Compute the phase between x_t and y_t.
(b) Simulate n = 1024 observations from x_t and y_t for \phi = .9, \sigma^2 = 1, and D = 1. Then estimate and plot the phase between the simulated series for the following values of L and comment: (i) L = 1, (ii) L = 3, (iii) L = 41, and (iv) L = 101.

4.27 Consider the bivariate time series records containing monthly U.S. production (prod) as measured by the Federal Reserve Board Production Index and the monthly unemployment series (unemp).
(a) Compute the spectrum and the log spectrum for each series, and identify statistically significant peaks. Explain what might be generating the peaks. Compute the coherence, and explain what is meant when a high coherence is observed at a particular frequency.

(b) What would be the effect of applying the filter

  u_t = x_t − x_{t−1}   followed by   v_t = u_t − u_{t−12}

to the series given above? Plot the predicted frequency responses of the simple difference filter and of the seasonal difference of the first difference.
(c) Apply the filters successively to one of the two series and plot the output. Examine the output after taking a first difference and comment on whether stationarity is a reasonable assumption. Why or why not? Plot after taking the seasonal difference of the first difference. What can be noticed about the output that is consistent with what you have predicted from the frequency response? Verify by computing the spectrum of the output after filtering.

4.28 Determine the theoretical power spectrum of the series formed by combining the white noise series w_t to form

  y_t = w_{t−2} + 4 w_{t−1} + 6 w_t + 4 w_{t+1} + w_{t+2}.

Determine which frequencies are present by plotting the power spectrum.

4.29 Let x_t = \cos(2\pi\omega t), and consider the output

  y_t = \sum_{k=-\infty}^{\infty} a_k x_{t−k},

where \sum_k |a_k| < ∞. Show

  y_t = |A(\omega)| \cos(2\pi\omega t + \phi(\omega)),

where |A(\omega)| and \phi(\omega) are the amplitude and phase of the filter, respectively. Interpret the result in terms of the relationship between the input series, x_t, and the output series, y_t.

4.30 Suppose x_t is a stationary series, and we apply two filtering operations in succession, say,

  y_t = \sum_r a_r x_{t−r}   then   z_t = \sum_s b_s y_{t−s}.

(a) Show the spectrum of the output is

  f_z(\omega) = |A(\omega)|^2 |B(\omega)|^2 f_x(\omega),

where A(\omega) and B(\omega) are the Fourier transforms of the filter sequences a_t and b_t, respectively.
(b) What would be the effect of applying the filter

  u_t = x_t − x_{t−1}   followed by   v_t = u_t − u_{t−12}

to a time series?

(c) Plot the predicted frequency responses of the simple difference filter and of the seasonal difference of the first difference. Filters like these are called seasonal adjustment filters in economics because they tend to attenuate frequencies at multiples of the monthly periods. The difference filter tends to attenuate low-frequency trends.

4.31 Suppose we are given a stationary zero-mean series x_t with spectrum f_x(\omega) and then construct the derived series

  y_t = a y_{t−1} + x_t,   t = ±1, ±2, … .

(a) Show how the theoretical f_y(\omega) is related to f_x(\omega).
(b) Plot the function that multiplies f_x(\omega) in part (a) for a = .1 and for a = .8. This filter is called a recursive filter.

Section 4.8

4.32 Consider the problem of approximating the filter output

  y_t = \sum_{k=-\infty}^{\infty} a_k x_{t−k},   \sum_k |a_k| < ∞,

by

  y_t^M = \sum_{|k| < M/2} a_k^M x_{t−k}

for t = M/2 − 1, M/2, …, n − M/2, where x_t is available for t = 1, …, n and

  a_t^M = M^{-1} \sum_{k=0}^{M-1} A(\omega_k) \exp\{2\pi i \omega_k t\}

with \omega_k = k/M. Prove

  E\{(y_t − y_t^M)^2\} ≤ 4 \gamma_x(0) \left( \sum_{|k| ≥ M/2} |a_k| \right)^2.

4.33 Prove the squared coherence \rho^2_{y\cdot x}(\omega) = 1 for all \omega when

  y_t = \sum_{r=-\infty}^{\infty} a_r x_{t−r},

that is, when x_t and y_t can be related exactly by a linear filter.

4.34 The data set climhyd contains 454 months of measured values for six climatic variables: (i) air temperature [Temp], (ii) dew point [DewPt], (iii) cloud cover [CldCvr], (iv) wind speed [WndSpd], (v) precipitation [Precip], and (vi) inflow [Inflow], at Lake Shasta in California; the data are displayed in Figure 7.3. We would like to look at possible relations among the weather factors and between the weather factors and the inflow to Lake Shasta.

(a) First transform the inflow and precipitation series as follows: I_t = \log i_t, where i_t is inflow, and P_t = \sqrt{p_t}, where p_t is precipitation. Then, compute the squared coherencies between all the weather variables and transformed inflow and argue that the strongest determinant of the inflow series is (transformed) precipitation. [Tip: If x contains multiple time series, then the easiest way to display all the squared coherencies is to plot the coherencies suppressing the confidence intervals, e.g.,
mvspec(x, spans=c(7,7), taper=.5, plot.type="coh", ci=-1).]
(b) Fit a lagged regression model of the form

  I_t = \beta_0 + \sum_{j=0}^{\infty} \beta_j P_{t−j} + w_t,

using thresholding, and then comment on the predictive ability of precipitation for inflow.

Section 4.9

4.35 Consider the signal plus noise model

  y_t = \sum_{r=-\infty}^{\infty} \beta_r x_{t−r} + v_t,

where the signal and noise series, x_t and v_t, are both stationary with spectra f_x(\omega) and f_v(\omega), respectively. Assuming that x_t and v_t are independent of each other for all t, verify (4.131) and (4.132).

4.36 Consider the model

  y_t = x_t + v_t,

where

  x_t = \phi x_{t−1} + w_t,

such that v_t is Gaussian white noise and independent of x_t with var(v_t) = \sigma_v^2, and w_t is Gaussian white noise and independent of v_t, with var(w_t) = \sigma_w^2, and |\phi| < 1 and E x_0 = 0. Prove that the spectrum of the observed series is

  f_y(\omega) = \sigma^2 \frac{|1 − \theta e^{-2\pi i\omega}|^2}{|1 − \phi e^{-2\pi i\omega}|^2},

where

  \theta = \frac{c ± \sqrt{c^2 − 4}}{2},   \sigma^2 = \frac{\phi \sigma_v^2}{\theta},

and

  c = \frac{\sigma_w^2 + \sigma_v^2(1 + \phi^2)}{\phi \sigma_v^2}.

4.37 Consider the same model as in the preceding problem.

(a) Prove the optimal smoothed estimator of the form

  \hat x_t = \sum_{s=-\infty}^{\infty} a_s y_{t−s}

has

  a_s = \frac{\sigma_w^2}{\sigma^2(1 − \theta^2)}\, \theta^{|s|}.

(b) Show the mean square error is given by

  E\{(x_t − \hat x_t)^2\} = \frac{\sigma_v^2 \sigma_w^2}{\sigma^2(1 − \theta^2)}.

(c) Compare the mean square error of the estimator in part (b) with that of the optimal finite estimator of the form

  \hat x_t = a_1 y_{t−1} + a_2 y_{t−2}

when \sigma_v^2 = .053, \sigma_w^2 = .172, and \phi = .9.

Section 4.10

4.38 Consider the two-dimensional linear filter given as the output (4.143).
(a) Express the two-dimensional autocovariance function of the output, say, \gamma_y(h_1, h_2), in terms of an infinite sum involving the autocovariance function of x_s and the filter coefficients a_{s_1, s_2}.
(b) Use the expression derived in (a), combined with (4.142) and (4.145), to derive the spectrum of the filtered output (4.144).

The following problems require supplemental material from Appendix C

4.39 Let w_t be a Gaussian white noise series with variance \sigma_w^2. Prove that the results of Theorem C.4 hold without error for the DFT of w_t.

4.40 Show that condition (4.48) implies (C.19) by showing

  n^{-1/2} \sum_h \sqrt{|h|}\, |\gamma_w(h)| ≤ \sigma_w^2 \sum_{j ≥ 0} |\psi_j| \sum_{k ≥ 0} \sqrt{k}\, |\psi_k|.

4.41 Prove Lemma C.4.

4.42 Finish the proof of Theorem C.5.

4.43 For the zero-mean complex random vector z = x_c − i x_s, with cov(z) = \Sigma = C − iQ and \Sigma = \Sigma^*, define

  w = 2 \mathrm{Re}(a^* z),

where a = a_c − i a_s is an arbitrary non-zero complex vector. Prove

  \mathrm{cov}(w) = 2 a^* \Sigma a.

Recall * denotes the complex conjugate transpose.

Chapter 5
Additional Time Domain Topics

In this chapter, we present material that may be considered special or advanced topics in the time domain. Chapter 6 is devoted to one of the most useful and interesting time domain topics, state-space models. Consequently, we do not cover state-space models or related topics (of which there are many) in this chapter. This chapter contains sections of independent topics that may be read in any order. Most of the sections depend on a basic knowledge of ARMA models, forecasting and estimation, which is the material covered in Chapter 3. A few sections, for example the section on long memory models, require some knowledge of spectral analysis and related topics covered in Chapter 4. In addition to long memory, we discuss unit root testing, GARCH models, threshold models, lagged regression or transfer functions, and selected topics in multivariate ARMAX models.

5.1 Long Memory ARMA and Fractional Differencing

The conventional ARMA(p, q) process is often referred to as a short-memory process because the coefficients in the representation

  x_t = \sum_{j=0}^{\infty} \psi_j w_{t−j},

obtained by solving

  \phi(z) \psi(z) = \theta(z),

are dominated by exponential decay. As pointed out in Section 3.2 and Section 3.3, this result implies the ACF of the short memory process satisfies \rho(h) → 0 exponentially fast as h → ∞. When the sample ACF of a time series decays slowly, the advice given in Chapter 3 has been to difference the series until it seems stationary. Following this advice with the glacial varve series first presented in Example 3.33 leads to the first difference of the logarithms of the data being represented as a first-order moving average. In Example 3.41, further analysis of the residuals leads to fitting an ARIMA(1, 1, 1) model,

Fig. 5.1. Sample ACF of the log transformed varve series.

  \nabla x_t = \phi \nabla x_{t−1} + w_t + \theta w_{t−1},

where we understand x_t is the log-transformed varve series. In particular, the estimates of the parameters (and the standard errors) were \hat\phi = .23 (.05), \hat\theta = −.89 (.03), and \hat\sigma_w^2 = .23.

The use of the first difference \nabla x_t = (1 − B)x_t, however, can sometimes be too severe a modification in the sense that the nonstationary model might represent an overdifferencing of the original process. Long memory (or persistent) time series were considered in Hosking (1981) and Granger and Joyeux (1980) as intermediate compromises between the short memory ARMA type models and the fully integrated nonstationary processes in the Box–Jenkins class. The easiest way to generate a long memory series is to think of using the difference operator (1 − B)^d for fractional values of d, say, 0 < d < .5, so a basic long memory series gets generated as

  (1 − B)^d x_t = w_t,   (5.1)

where w_t still denotes white noise with variance \sigma_w^2. The fractionally differenced series (5.1), for |d| < .5, is often called fractional noise (except when d is zero). Now, d becomes a parameter to be estimated along with \sigma_w^2. Differencing the original process, as in the Box–Jenkins approach, may be thought of as simply assigning a value of d = 1. This idea has been extended to the class of fractionally integrated ARMA, or ARFIMA models, where −.5 < d < .5; when d is negative, the term antipersistent is used. Long memory processes occur in hydrology (see Hurst, 1951, and McLeod and Hipel, 1978) and in environmental series, such as the varve data we have previously analyzed, to mention a few examples. Long memory time series data tend to exhibit sample autocorrelations that are not necessarily large (as in the case of d = 1), but persist for a long time. Figure 5.1 shows the sample ACF, to lag 100, of the log-transformed varve series, which exhibits classic long memory behavior:
acf(log(varve), 100)
acf(cumsum(rnorm(1000)), 100)   # compare to ACF of a random walk (not shown)
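For readers who want to see what pure fractional noise looks like, the fracdiff package (used again below) can simulate series from (5.1). The following sketch, assuming fracdiff.sim's usual interface and an arbitrarily chosen d = .4, compares the slowly decaying sample ACF of the simulated series to the behavior just described.
library(fracdiff)
set.seed(666)
# simulate fractional noise (1 - B)^d x_t = w_t with d = .4 (illustrative value)
x = fracdiff.sim(1000, d=.4)$series
acf(x, 100)   # sample ACF decays slowly, characteristic of long memory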

Figure 5.1 can be contrasted with the ACF of the original GNP series shown in Figure 3.13, which is also persistent and decays linearly, but the values of the ACF are large.

To investigate its properties, we can use the binomial expansion (d > −1) to write

  w_t = (1 − B)^d x_t = \sum_{j=0}^{\infty} \pi_j B^j x_t = \sum_{j=0}^{\infty} \pi_j x_{t−j},   (5.2)

where

  \pi_j = \frac{\Gamma(j − d)}{\Gamma(j + 1)\Gamma(−d)},   (5.3)

with \Gamma(x + 1) = x\Gamma(x) being the gamma function. Similarly (d < 1), we can write

  x_t = (1 − B)^{−d} w_t = \sum_{j=0}^{\infty} \psi_j B^j w_t = \sum_{j=0}^{\infty} \psi_j w_{t−j},   (5.4)

where

  \psi_j = \frac{\Gamma(j + d)}{\Gamma(j + 1)\Gamma(d)}.   (5.5)

When |d| < .5, the processes (5.2) and (5.4) are well-defined stationary processes (see Brockwell and Davis, 1991, for details). In the case of fractional differencing, however, the coefficients satisfy \sum \pi_j^2 < ∞ and \sum \psi_j^2 < ∞, as opposed to the absolute summability of the coefficients in ARMA processes.

Using the representation (5.4)–(5.5), and after some nontrivial manipulations, it can be shown that the ACF of x_t is

  \rho(h) = \frac{\Gamma(h + d)\Gamma(1 − d)}{\Gamma(h − d + 1)\Gamma(d)} \sim h^{2d−1}   (5.6)

for large h. From this we see that for 0 < d < .5,

  \sum_{h=-\infty}^{\infty} |\rho(h)| = ∞,

and hence the term long memory.

In order to examine a series such as the varve series for a possible long memory pattern, it is convenient to look at ways of estimating d. Using (5.3) it is easy to derive the recursions

  \pi_{j+1}(d) = \frac{(j − d)\,\pi_j(d)}{(j + 1)},   (5.7)

for j = 0, 1, …, with \pi_0(d) = 1. Maximizing the joint likelihood of the errors under normality, say, w_t(d), will involve minimizing the sum of squared errors

  Q(d) = \sum_t w_t^2(d).

The usual Gauss–Newton method, described in Section 3.5, leads to the expansion

Fig. 5.2. Coefficients \pi_j(.384), j = 1, 2, …, 30, in the representation (5.7).

  w_t(d) = w_t(d_0) + w_t'(d_0)(d − d_0),

where

  w_t'(d_0) = \frac{\partial w_t}{\partial d}\Big|_{d = d_0}

and d_0 is an initial estimate (guess) at the value of d. Setting up the usual regression leads to

  d = d_0 − \frac{\sum_t w_t'(d_0)\, w_t(d_0)}{\sum_t w_t'(d_0)^2}.   (5.8)

The derivatives are computed recursively by differentiating (5.7) successively with respect to d:

  \pi_{j+1}'(d) = \frac{(j − d)\,\pi_j'(d) − \pi_j(d)}{(j + 1)},

where \pi_0'(d) = 0. The errors are computed from an approximation to (5.2), namely,

  w_t(d) = \sum_{j=0}^{t} \pi_j(d)\, x_{t−j}.   (5.9)

It is advisable to omit a number of initial terms from the computation and start the sum, (5.8), at some fairly large value of t to have a reasonable approximation.

Example 5.1 Long Memory Fitting of the Glacial Varve Series
We consider analyzing the glacial varve series discussed in various examples and first presented in Example 2.7. Figure 2.7 shows the original and log-transformed series (which we denote by x_t). In Example 3.41, we noted that x_t could be modeled as an ARIMA(1, 1, 1) process. We fit the fractionally differenced model, (5.1), to the mean-adjusted series, x_t − \bar x. Applying the Gauss–Newton iterative procedure previously described, starting with d = .1 and omitting the first 30 points from the computation, leads to a final value of d = .384, which implies the set of coefficients \pi_j(.384), as given in Figure 5.2 with \pi_0(.384) = 1. We can compare roughly the performance of the fractional difference operator with the ARIMA model by

Fig. 5.3. ACF of residuals from the ARIMA(1, 1, 1) fit to the logged varve series (top) and of the residuals from the long memory model fit, (1 − B)^d x_t = w_t, with d = .384 (bottom).

examining the autocorrelation functions of the two residual series as shown in Figure 5.3. The ACFs of the two residual series are roughly comparable with the white noise model.

To perform this analysis in R, first download and install the fracdiff package. Then use
library(fracdiff)
lvarve = log(varve)-mean(log(varve))
varve.fd = fracdiff(lvarve, nar=0, nma=0, M=30)
varve.fd$d              # = 0.3841688
varve.fd$stderror.dpq   # = 4.589514e-06 (questionable result!!)
p = rep(1,31)
for (k in 1:30){ p[k+1] = (k-varve.fd$d)*p[k]/(k+1) }
plot(1:30, p[-1], ylab=expression(pi(d)), xlab="Index", type="h")
res.fd = diffseries(log(varve), varve.fd$d)            # frac diff resids
res.arima = resid(arima(log(varve), order=c(1,1,1)))   # arima resids
par(mfrow=c(2,1))
acf(res.arima, 100, xlim=c(4,97), ylim=c(-.2,.2), main="")
acf(res.fd, 100, xlim=c(4,97), ylim=c(-.2,.2), main="")
The R package uses a truncated maximum likelihood procedure that was discussed in Haslett and Raftery (1989), which is a little more elaborate than simply zeroing out initial values. The default truncation value in R is M = 100. In the default case, the estimate is \hat d = .37 with approximately the same (questionable) standard error. The standard error is (supposedly) obtained from the Hessian as described in Example 3.30. A more believable standard error is given in Example 5.2.

Forecasting long memory processes is similar to forecasting ARIMA models. That is, (5.2) and (5.7) can be used to obtain the truncated forecasts

  \tilde x_{n+m}^n = − \sum_{j=1}^{n+m−1} \pi_j(\hat d)\, \tilde x_{n+m−j}^n,   (5.10)

for m = 1, 2, … . Error bounds can be approximated by using

  P_{n+m}^n = \hat\sigma_w^2 \left( \sum_{j=0}^{m−1} \psi_j^2(\hat d) \right),   (5.11)

where, as in (5.7),

  \psi_{j+1}(\hat d) = \frac{(j + \hat d)\,\psi_j(\hat d)}{(j + 1)},   (5.12)

with \psi_0(\hat d) = 1.

No obvious short memory ARMA-type component can be seen in the ACF of the residuals from the fractionally differenced varve series shown in Figure 5.3. It is natural, however, that cases will exist in which substantial short memory-type components will also be present in data that exhibits long memory. Hence, it is natural to define the general ARFIMA(p, d, q), −.5 < d < .5, process as

  \phi(B) \nabla^d (x_t − \mu) = \theta(B) w_t,   (5.13)

where \phi(B) and \theta(B) are as given in Chapter 3. Writing the model in the form

  \phi(B) \pi(B)(x_t − \mu) = \theta(B) w_t   (5.14)

makes it clear how we go about estimating the parameters for the more general model. Forecasting for the ARFIMA(p, d, q) series can be easily done, noting that we may equate coefficients in

  \phi(z) \psi(z) = (1 − z)^{−d} \theta(z)   (5.15)

and

  \theta(z) \pi(z) = (1 − z)^{d} \phi(z)   (5.16)

to obtain the representations

  x_t = \mu + \sum_{j=0}^{\infty} \psi_j w_{t−j}   and   w_t = \sum_{j=0}^{\infty} \pi_j (x_{t−j} − \mu).

We then can proceed as discussed in (5.10) and (5.11).
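As a small illustration of (5.10)–(5.12), the sketch below (my own code, not from the text) computes a few truncated forecasts and approximate error bounds for pure fractional noise, using the \pi_j and \psi_j recursions; it assumes the objects varve.fd and lvarve from Example 5.1 are available, and uses a rough value for the innovation variance.
dhat = varve.fd$d
n    = length(lvarve)
pi.j = rep(1, n+10)                         # pi_0(d) = 1, recursion (5.7)
for (j in 1:(n+9)) pi.j[j+1] = (j-1-dhat)*pi.j[j]/j
psi.j = rep(1, 10)                          # psi_0(d) = 1, recursion (5.12)
for (j in 1:9) psi.j[j+1] = (j-1+dhat)*psi.j[j]/j
xtilde = c(lvarve, rep(NA, 10))             # data for t <= n, forecasts afterwards
for (m in 1:10)
  xtilde[n+m] = -sum(pi.j[2:(n+m)] * xtilde[(n+m-1):1])   # (5.10)
sig2 = .23                                  # roughly the estimated innovation variance
P = sig2 * cumsum(psi.j^2)                  # approximate m-step prediction variances (5.11)
cbind(forecast = xtilde[n+1:10], error.bound = sqrt(P))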

Comprehensive treatments of long memory time series models are given in the texts by Beran (1994), Palma (2007), and Robinson (2003), and it should be noted that several other techniques for estimating the parameters, especially the long memory parameter, can be developed in the frequency domain. In this case, we may think of the equations as generated by an infinite order autoregressive series with coefficients \pi_j given by (5.7). Using the same approach as before, we obtain

  f_x(\omega) = \frac{\sigma_w^2}{\left| \sum_{k=0}^{\infty} \pi_k e^{-2\pi i k \omega} \right|^2}
              = \sigma_w^2 |1 − e^{-2\pi i\omega}|^{-2d} = \sigma_w^2 [4 \sin^2(\pi\omega)]^{-d}   (5.17)

as equivalent representations of the spectrum of a long memory process. The long memory spectrum approaches infinity as the frequency \omega → 0.

The main reason for defining the Whittle approximation to the log likelihood in the long memory case is to propose its use for estimating the parameter d as an alternative to the time domain method previously mentioned. The time domain approach is useful because of its simplicity and easily computed standard errors. One may also use an exact likelihood approach by developing an innovations form of the likelihood as in Brockwell and Davis (1991).

For the approximate approach using the Whittle likelihood (4.85), we consider using the approach of Fox and Taqqu (1986) who showed that maximizing the Whittle log likelihood leads to a consistent estimator with the usual asymptotic normal distribution that would be obtained by treating (4.85) as a conventional log likelihood (see also Dahlhaus, 1989; Robinson, 1995; Hurvich et al., 1998). Unfortunately, the periodogram ordinates are not asymptotically independent (Hurvich and Beltrao, 1993), although a quasi-likelihood in the form of the Whittle approximation works well and has good asymptotic properties.

To see how this would work for the purely long memory case, write the long memory spectrum as

  f_x(\omega_k; d, \sigma_w^2) = \sigma_w^2\, g_k^{-d},   (5.18)

where

  g_k = 4 \sin^2(\pi\omega_k).   (5.19)

Then, differentiating the log likelihood, say,

  \ln L(x; d, \sigma_w^2) \approx −m \ln \sigma_w^2 + d \sum_{k=1}^{m} \ln g_k − \frac{1}{\sigma_w^2} \sum_{k=1}^{m} g_k^d\, I(\omega_k)   (5.20)

at m = n/2 − 1 frequencies and solving for \sigma_w^2 yields

  \sigma_w^2(d) = \frac{1}{m} \sum_{k=1}^{m} g_k^d\, I(\omega_k)   (5.21)

as the approximate maximum likelihood estimator for the variance parameter. To estimate d, we can use a grid search of the concentrated log likelihood

  \ln L(x; d) \approx −m \ln \sigma_w^2(d) + d \sum_{k=1}^{m} \ln g_k − m   (5.22)

over the interval (0, .5), followed by a Newton–Raphson procedure to convergence.

Fig. 5.4. Long memory (d = .380) [solid line] and autoregressive AR(8) [dashed line] spectral estimators for the paleoclimatic glacial varve series.

Example 5.2 Long Memory Spectra for the Varve Series
In Example 5.1, we fit a long memory model to the glacial varve data via time domain methods. Fitting the same model using frequency domain methods and the Whittle approximation above gives \hat d = .380, with an estimated standard error of .028. The earlier time domain method gave \hat d = .384 with M = 30 and \hat d = .370 with M = 100. Both estimates obtained via time domain methods had a standard error of about 4.6 × 10^{-6}, which seems implausible. The error variance estimate in this case is \hat\sigma_w^2 = .2293; in Example 5.1, we could have used var(res.fd) as an estimate, in which case we obtain .2298. The R code to perform this analysis is
series = log(varve)   # specify series to be analyzed
d0 = .1               # initial value of d
n.per = nextn(length(series))
m = (n.per)/2 - 1
per = Mod(fft(series-mean(series))[-1])^2   # remove 0 freq and
per = per/n.per                             # scale the periodogram
g = 4*(sin(pi*((1:m)/n.per))^2)
# Function to calculate -log.likelihood
whit.like = function(d){
  g.d = g^d
  sig2 = (sum(g.d*per[1:m])/m)
  log.like = m*log(sig2) - d*sum(log(g)) + m
  return(log.like)
}
# Estimation (output not shown)
(est = optim(d0, whit.like, gr=NULL, method="L-BFGS-B", hessian=TRUE,
       lower=-.5, upper=.5, control=list(trace=1,REPORT=1)))
##-- Results: d.hat = .380, se(dhat) = .028, and sig2hat = .229 --##
cat("d.hat =", est$par, "se(dhat) = ", 1/sqrt(est$hessian), "\n")
g.dhat = g^est$par; sig2 = sum(g.dhat*per[1:m])/m
cat("sig2hat =", sig2, "\n")
One might also consider fitting an autoregressive model to these data using a procedure similar to that used in Example 4.18. Following this approach gave an autoregressive model with p = 8 and \hat\phi_{1:8} = {.34, .11, .04, .09, .08, .08, .02, .09},

with \hat\sigma_w^2 = .23 as the error variance. The two log spectra are plotted in Figure 5.4 for \omega > 0, and we note that the long memory spectrum will eventually become infinite, whereas the AR(8) spectrum is finite at \omega = 0. The R code used for this part of the example (assuming the previous values have been retained) is
u = spec.ar(log(varve), plot=FALSE)   # produces AR(8)
g = 4*(sin(pi*((1:500)/2000))^2)
fhat = sig2*g^{-est$par}              # long memory spectral estimate
plot(1:500/2000, log(fhat), type="l", ylab="log(spectrum)", xlab="frequency")
lines(u$freq[1:250], log(u$spec[1:250]), lty="dashed")
ar.mle(log(varve))                    # to get AR(8) estimates

Often, time series are not purely long memory. A common situation has the long memory component multiplied by a short memory component, leading to an alternate version of (5.18) of the form

  f_x(\omega_k; d, \theta) = g_k^{-d}\, f_0(\omega_k; \theta),   (5.23)

where f_0(\omega_k; \theta) might be the spectrum of an autoregressive moving average process with vector parameter \theta, or it might be unspecified. If the spectrum has a parametric form, the Whittle likelihood can be used. However, there is a substantial amount of semiparametric literature that develops the estimators when the underlying spectrum f_0(\omega; \theta) is unknown. A class of Gaussian semi-parametric estimators simply uses the same Whittle likelihood (5.22), evaluated over a sub-band of low frequencies, say m = \sqrt{n}. There is some latitude in selecting a band that is relatively free from low frequency interference due to the short memory component in (5.23).

If the spectrum is highly parameterized, one might estimate using the Whittle log likelihood (5.19) and jointly estimate the parameters d and \theta under (5.23) using the Newton–Raphson method. If we are interested in a nonparametric estimator, using the conventional smoothed spectral estimator for the periodogram, adjusted for the long memory component, say g_k^d I(\omega_k), might be a possible approach.

Geweke and Porter–Hudak (1983) developed an approximate method for estimating d based on a regression model, derived from (5.22). Note that we may write a simple equation for the logarithm of the spectrum as

  \ln f_x(\omega_k; d) = \ln f_0(\omega_k; \theta) − d \ln[4 \sin^2(\pi\omega_k)],   (5.24)

with the frequencies \omega_k = k/n restricted to a range k = 1, 2, …, m near the zero frequency with m = \sqrt{n} as the recommended value. Relationship (5.24) suggests using a simple linear regression model of the form

  \ln I(\omega_k) = \beta_0 − d \ln[4 \sin^2(\pi\omega_k)] + e_k   (5.25)

for the periodogram to estimate the parameters \sigma_w^2 and d. In this case, one performs least squares using \ln I(\omega_k) as the dependent variable, and \ln[4 \sin^2(\pi\omega_k)] as the independent variable for k = 1, 2, …, m. The resulting slope estimate is then used as an estimate of −d. For a good discussion of various alternative methods for selecting m, see Hurvich and Deo (1999). The R package fracdiff also provides this method via the command fdGPH(); see the help file for further information. Here is a quick example using the logged varve data.
library(fracdiff)
fdGPH(log(varve), bandw=.9)   # m = n^bandw
  dhat = 0.383
  se(dhat) = 0.041
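The GPH regression (5.25) can also be computed by hand; the sketch below (my own code, not the book's) uses m = \sqrt{n} low-frequency periodogram ordinates. It will not match the fdGPH() output exactly, since the bandwidth used above is larger, but it makes the regression explicit.
x = log(varve)
n = length(x)
m = floor(sqrt(n))                      # number of low frequencies used
k = 1:m
per = Mod(fft(x - mean(x)))^2 / n       # periodogram at frequencies k/n
yk = log(per[k + 1])                    # log I(omega_k), k = 1, ..., m
xk = log(4*sin(pi*k/n)^2)               # regressor ln[4 sin^2(pi omega_k)]
fit = lm(yk ~ xk)
-coef(fit)[2]                           # slope estimates -d, so this is dhat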

5.2 Unit Root Testing

As discussed in the previous section, the use of the first difference \nabla x_t = (1 − B)x_t can be too severe a modification in the sense that the nonstationary model might represent an overdifferencing of the original process. For example, consider a causal AR(1) process (we assume throughout this section that the noise is Gaussian),

  x_t = \phi x_{t−1} + w_t.   (5.26)

Applying the operator \nabla = (1 − B) to both sides shows that differencing,

  \nabla x_t = \phi \nabla x_{t−1} + \nabla w_t,

or

  y_t = \phi y_{t−1} + w_t − w_{t−1},

where y_t = \nabla x_t, introduces extraneous correlation and invertibility problems. That is, while x_t is a causal AR(1) process, working with the differenced process y_t will be problematic because it is a non-invertible ARMA(1, 1).

A unit root test provides a way to test whether (5.26) is a random walk (the null case) as opposed to a causal process (the alternative). That is, it provides a procedure for testing

  H_0: \phi = 1   versus   H_1: |\phi| < 1.

An obvious test statistic would be to consider (\hat\phi − 1), appropriately normalized, in the hope to develop an asymptotically normal test statistic, where \hat\phi is one of the optimal estimators discussed in Chapter 3. Unfortunately, the theory of Section 3.5 will not work in the null case because the process is nonstationary. Moreover, as seen in Example 3.36, estimation near the boundary of stationarity produces highly skewed sample distributions (see Figure 3.12) and this is a good indication that the problem will be atypical.

To examine the behavior of (\hat\phi − 1) under the null hypothesis that \phi = 1, or more precisely that the model is a random walk, x_t = \sum_{j=1}^{t} w_j, or x_t = x_{t−1} + w_t with \mu = 0, consider the least squares estimator of \phi. Noting that x_0 = 0, the least squares estimator can be written as

  \hat\phi = \frac{\sum_{t=1}^{n} x_t x_{t−1}}{\sum_{t=1}^{n} x_{t−1}^2} = 1 + \frac{n^{-1}\sum_{t=1}^{n} w_t x_{t−1}}{n^{-1}\sum_{t=1}^{n} x_{t−1}^2},   (5.27)

where we have written x_t = x_{t−1} + w_t in the numerator; recall that x_0 = 0 and in the least squares setting, we are regressing x_t on x_{t−1} for t = 1, …, n. Hence, under H_0, we have that

  \hat\phi − 1 = \frac{\frac{1}{n\sigma_w^2}\sum_{t=1}^{n} w_t x_{t−1}}{\frac{1}{n\sigma_w^2}\sum_{t=1}^{n} x_{t−1}^2}.   (5.28)

Consider the numerator of (5.28). Note first that by squaring both sides of x_t = x_{t−1} + w_t, we obtain x_t^2 = x_{t−1}^2 + 2 x_{t−1} w_t + w_t^2, so that

  x_{t−1} w_t = \tfrac{1}{2}\left( x_t^2 − x_{t−1}^2 − w_t^2 \right),

and summing,

  \frac{1}{n\sigma_w^2} \sum_{t=1}^{n} x_{t−1} w_t = \frac{1}{2}\left( \frac{x_n^2}{n\sigma_w^2} − \frac{\sum_{t=1}^{n} w_t^2}{n\sigma_w^2} \right).

Because x_n = \sum_{t=1}^{n} w_t, we have that x_n ∼ N(0, n\sigma_w^2), so that \frac{1}{n\sigma_w^2} x_n^2 has a chi-squared distribution with one degree of freedom. Moreover, because w_t is white Gaussian noise, \frac{1}{n}\sum_{t=1}^{n} w_t^2 →_p \sigma_w^2, or \frac{1}{n\sigma_w^2}\sum_{t=1}^{n} w_t^2 →_p 1. Consequently (n → ∞),

  \frac{1}{n\sigma_w^2} \sum_{t=1}^{n} x_{t−1} w_t \;\xrightarrow{d}\; \tfrac{1}{2}\left( \chi_1^2 − 1 \right).   (5.29)

Next we focus on the denominator of (5.28). First, we introduce standard Brownian motion.

Definition 5.1 A continuous time process {W(t); t ≥ 0} is called standard Brownian motion if it satisfies the following conditions:
(i) W(0) = 0;
(ii) {W(t_2) − W(t_1), W(t_3) − W(t_2), …, W(t_n) − W(t_{n−1})} are independent for any collection of points, 0 ≤ t_1 < t_2 < ··· < t_n, and integer n > 2;
(iii) W(t + \Delta t) − W(t) ∼ N(0, \Delta t) for \Delta t > 0.

In addition to (i)–(iii), it is assumed that almost all sample paths of W(t) are continuous in t. The result for the denominator uses the functional central limit theorem, which can be found in Billingsley (1999, §2.8). In particular, if \xi_1, …, \xi_n is a sequence of iid random variables with mean 0 and variance 1, then, for 0 ≤ t ≤ 1, the continuous time process

  S_n(t) = \frac{1}{\sqrt{n}} \sum_{j=1}^{[[nt]]} \xi_j \;\xrightarrow{d}\; W(t),   (5.30)

as n → ∞, where [[ ]] is the greatest integer function and W(t) is standard Brownian motion on [0, 1].^{5.1} Note that under the null hypothesis, x_s = w_1 + ··· + w_s ∼ N(0, s\sigma_w^2), and based on (5.30), we have \frac{x_s}{\sigma_w \sqrt{n}} →_d W(s). From this fact, we can show that (n → ∞)

  \frac{1}{n} \sum_{t=1}^{n} \left( \frac{x_{t−1}}{\sigma_w \sqrt{n}} \right)^2 \;\xrightarrow{d}\; \int_0^1 W^2(t)\, dt.   (5.31)

The denominator in (5.28) is off from the left side of (5.31) by a factor of n^{-1}, and we adjust accordingly to finally obtain (n → ∞),

^{5.1} The intuition here is, for [[nt]] = k and fixed t, the central limit theorem gives \frac{1}{\sqrt{n}} \sum_{j=1}^{k} \xi_j ∼ AN(0, t) as n → ∞.

  n(\hat\phi − 1) = \frac{\frac{1}{n\sigma_w^2}\sum_{t=1}^{n} x_{t−1} w_t}{\frac{1}{n^2\sigma_w^2}\sum_{t=1}^{n} x_{t−1}^2} \;\xrightarrow{d}\; \frac{\tfrac{1}{2}\left( \chi_1^2 − 1 \right)}{\int_0^1 W^2(t)\, dt}.   (5.32)

The test statistic n(\hat\phi − 1) is known as the unit root or Dickey–Fuller (DF) statistic (see Fuller, 1976 or 1996), although the actual DF test statistic is normalized a little differently. Related derivations were discussed in Rao (1978; Correction 1980) and in Evans and Savin (1981). Because the distribution of the test statistic does not have a closed form, quantiles of the distribution must be computed by numerical approximation or by simulation. The R package tseries provides this test along with more general tests that we mention briefly.

Toward a more general model, we note that the DF test was established by noting that if x_t = \phi x_{t−1} + w_t, then \nabla x_t = (\phi − 1)x_{t−1} + w_t = \gamma x_{t−1} + w_t, and one could test H_0: \gamma = 0 by regressing \nabla x_t on x_{t−1}. They formed a Wald statistic and derived its limiting distribution [the previous derivation based on Brownian motion is due to Phillips (1987)]. The test was extended to accommodate AR(p) models, x_t = \sum_{j=1}^{p} \phi_j x_{t−j} + w_t, as follows. Subtract x_{t−1} from both sides to obtain

  \nabla x_t = \gamma x_{t−1} + \sum_{j=1}^{p−1} \psi_j \nabla x_{t−j} + w_t,   (5.33)

where \gamma = \sum_{j=1}^{p} \phi_j − 1 and \psi_j = −\sum_{i=j+1}^{p} \phi_i for j = 1, …, p − 1. For a quick check of (5.33) when p = 2, note that x_t = (\phi_1 + \phi_2)x_{t−1} − \phi_2(x_{t−1} − x_{t−2}) + w_t; now subtract x_{t−1} from both sides. To test the hypothesis that the process has a unit root at 1 (i.e., the AR polynomial \phi(z) = 0 when z = 1), we can test H_0: \gamma = 0 by estimating \gamma in the regression of \nabla x_t on x_{t−1}, \nabla x_{t−1}, …, \nabla x_{t−p+1}, and forming a Wald test based on t_\gamma = \hat\gamma / \mathrm{se}(\hat\gamma). This test leads to the so-called augmented Dickey–Fuller test (ADF). While the calculations for obtaining the asymptotic null distribution change, the basic ideas and machinery remain the same as in the simple case. The choice of p is crucial, and we will discuss some suggestions in the example. For ARMA(p, q) models, the ADF test can be used by assuming p is large enough to capture the essential correlation structure; another alternative is the Phillips–Perron (PP) test, which differs from the ADF tests mainly in how it deals with serial correlation and heteroskedasticity in the errors.

One can extend the model to include a constant, or even non-stochastic trend. For example, consider the model

  x_t = \beta_0 + \beta_1 t + \phi x_{t−1} + w_t.

If we assume \beta_1 = 0, then under the null hypothesis, \phi = 1, the process is a random walk with drift \beta_0. Under the alternate hypothesis, the process is a causal AR(1) with mean \mu_x = \beta_0/(1 − \phi). If we cannot assume \beta_1 = 0, then the interest here is testing the null that (\beta_1, \phi) = (0, 1), simultaneously, versus the alternative that \beta_1 ≠ 0 and |\phi| < 1. In this case, the null hypothesis is that the process is a random walk with drift, versus the alternative hypothesis that the process is trend stationary, such as might be considered for the chicken price series in Example 2.1.
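Since the limiting distribution in (5.32) has no closed form, it is instructive to look at it by simulation. The following sketch (not from the text) approximates the null distribution of n(\hat\phi − 1) by generating random walks and computing the least squares estimator (5.27) for each; the pronounced left skew it shows is why normal critical values cannot be used.
# simulate the null distribution of the unit root statistic n*(phihat - 1)
set.seed(90210)
n = 100; nrep = 1000
stat = replicate(nrep, {
  x = cumsum(rnorm(n))                    # random walk with x_0 = 0
  xlag = c(0, x[-n])                      # x_{t-1}, t = 1, ..., n
  phihat = sum(x*xlag)/sum(xlag^2)        # least squares estimator (5.27)
  n*(phihat - 1)
})
hist(stat, breaks=50, main="", xlab="n(phihat - 1)")
quantile(stat, c(.01, .05, .10))          # left-tail critical values are well below 0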

Example 5.3 Testing Unit Roots in the Glacial Varve Series
In this example we use the R package tseries to test the null hypothesis that the log of the glacial varve series has a unit root, versus the alternate hypothesis that the process is stationary. We test the null hypothesis using the available DF, ADF and PP tests; note that in each case, the general regression equation incorporates a constant and a linear trend. In the ADF test, the default number of AR components included in the model, say k, is [[(n − 1)^{1/3}]], which corresponds to the suggested upper bound on the rate at which the number of lags, k, should be made to grow with the sample size for the general ARMA(p, q) setup. For the PP test, the default value of k is [[.04 n^{1/4}]].
library(tseries)
adf.test(log(varve), k=0)   # DF test
  Dickey-Fuller = -12.8572, Lag order = 0, p-value < 0.01
  alternative hypothesis: stationary
adf.test(log(varve))        # ADF test
  Dickey-Fuller = -3.5166, Lag order = 8, p-value = 0.04071
  alternative hypothesis: stationary
pp.test(log(varve))         # PP test
  Dickey-Fuller Z(alpha) = -304.5376, Truncation lag parameter = 6, p-value < 0.01
  alternative hypothesis: stationary
In each test, we reject the null hypothesis that the logged varve series has a unit root. The conclusion of these tests supports the conclusion of the previous section that the logged varve series is long memory rather than integrated.

5.3 GARCH Models

Various problems such as option pricing in finance have motivated the study of the volatility, or variability, of a time series. ARMA models were used to model the conditional mean of a process when the conditional variance was constant. Using an AR(1) as an example, we assumed

  E(x_t \mid x_{t−1}, x_{t−2}, …) = \phi x_{t−1}   and   var(x_t \mid x_{t−1}, x_{t−2}, …) = var(w_t) = \sigma_w^2.

In many problems, however, the assumption of a constant conditional variance will be violated. Models such as the autoregressive conditionally heteroscedastic or ARCH model, first introduced by Engle (1982), were developed to model changes in volatility. These models were later extended to generalized ARCH, or GARCH models by Bollerslev (1986).

In these problems, we are concerned with modeling the return or growth rate of a series. For example, if x_t is the value of an asset at time t, then the return or relative gain, r_t, of the asset at time t is

  r_t = \frac{x_t − x_{t−1}}{x_{t−1}}.   (5.34)

Definition (5.34) implies that x_t = (1 + r_t) x_{t−1}. Thus, based on the discussion in Section 3.7, if the return represents a small (in magnitude) percentage change then

  \nabla \log(x_t) \approx r_t.   (5.35)

Either value, \nabla\log(x_t) or (x_t - x_{t-1})/x_{t-1}, will be called the return and will be denoted by r_t. [Footnote 5.2: Recall from Footnote 1.2 that if r_t = (x_t - x_{t-1})/x_{t-1} is a small percentage, then \log(1 + r_t) \approx r_t. It is easier to program \nabla\log x_t, so this is often used instead of calculating r_t directly. Although it is a misnomer, \nabla\log x_t is often called the log-return; but the returns are not being logged.] An alternative to the GARCH model is the stochastic volatility model; we will discuss these models in Chapter 6 because they are state-space models.

Typically, for financial series, the return r_t does not have a constant conditional variance, and highly volatile periods tend to be clustered together. In other words, there is a strong dependence of sudden bursts of variability in a return on the series' own past. For example, Figure 1.4 shows the daily returns of the Dow Jones Industrial Average (DJIA) from April 20, 2006 to April 20, 2016. In this case, as is typical, the return r_t is fairly stable, except for short-term bursts of high volatility.

The simplest ARCH model, the ARCH(1), models the return as

  r_t = \sigma_t \epsilon_t,   (5.36)
  \sigma_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2,   (5.37)

where \epsilon_t is standard Gaussian white noise, \epsilon_t \sim iid N(0, 1). The normal assumption may be relaxed; we will discuss this later. As with ARMA models, we must impose some constraints on the model parameters to obtain desirable properties. An obvious constraint is that \alpha_0, \alpha_1 \geq 0 because \sigma_t^2 is a variance.

As we shall see, under the ARCH(1) model the returns form a white noise process with nonconstant conditional variance, and that conditional variance depends on the previous return. First, notice that the conditional distribution of r_t given r_{t-1} is Gaussian:

  r_t \mid r_{t-1} \sim N(0, \alpha_0 + \alpha_1 r_{t-1}^2).   (5.38)

In addition, it is possible to write the ARCH(1) model as a non-Gaussian AR(1) model in the square of the returns r_t^2. First, rewrite (5.36)-(5.37) as

  r_t^2 = \sigma_t^2 \epsilon_t^2
  \alpha_0 + \alpha_1 r_{t-1}^2 = \sigma_t^2,

and subtract the two equations to obtain

  r_t^2 - (\alpha_0 + \alpha_1 r_{t-1}^2) = \sigma_t^2 \epsilon_t^2 - \sigma_t^2.

Now, write this equation as

  r_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + v_t,   (5.39)

where v_t = \sigma_t^2 (\epsilon_t^2 - 1). Because \epsilon_t^2 is the square of a N(0, 1) random variable, \epsilon_t^2 - 1 is a shifted (to have mean zero) \chi_1^2 random variable.
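A short simulation makes the volatility clustering implied by (5.36)-(5.37) easy to see. This is a minimal sketch (not from the text) with arbitrary parameter values \alpha_0 = .1 and \alpha_1 = .5; the simulated returns look like white noise, while their squares are clearly autocorrelated, as (5.39) suggests.

set.seed(90210)
n = 500; a0 = .1; a1 = .5
r = sig2 = numeric(n)
sig2[1] = a0/(1 - a1)                 # start at the stationary variance
r[1]    = sqrt(sig2[1]) * rnorm(1)
for (t in 2:n) {
  sig2[t] = a0 + a1 * r[t-1]^2        # conditional variance, as in (5.37)
  r[t]    = sqrt(sig2[t]) * rnorm(1)  # return, as in (5.36)
}
plot.ts(r)    # bursts of volatility
acf(r^2)      # the squared returns behave like an AR(1)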

To explore the properties of ARCH, we define R_s = \{r_s, r_{s-1}, ...\}. Then, using (5.38), we immediately see that r_t has a zero mean:

  E(r_t) = E\,E(r_t \mid R_{t-1}) = E\,E(r_t \mid r_{t-1}) = 0.   (5.40)

Because E(r_t \mid R_{t-1}) = 0, the process r_t is said to be a martingale difference.

Because r_t is a martingale difference, it is also an uncorrelated sequence. For example, with h > 0,

  cov(r_{t+h}, r_t) = E(r_t r_{t+h}) = E\,E(r_t r_{t+h} \mid R_{t+h-1})
                    = E\{ r_t\, E(r_{t+h} \mid R_{t+h-1}) \} = 0.   (5.41)

The last line of (5.41) follows because r_t belongs to the information set R_{t+h-1} for h > 0, and E(r_{t+h} \mid R_{t+h-1}) = 0, as determined in (5.40).

An argument similar to (5.40) and (5.41) will establish the fact that the error process v_t in (5.39) is also a martingale difference and, consequently, an uncorrelated sequence. If the variance of v_t is finite and constant with respect to time, and 0 \leq \alpha_1 < 1, then based on Property 3.1, (5.39) specifies a causal AR(1) process for r_t^2. Therefore, E(r_t^2) and var(r_t^2) must be constant with respect to time t. This implies that

  E(r_t^2) = var(r_t) = \frac{\alpha_0}{1 - \alpha_1}   (5.42)

and, after some manipulations,

  E(r_t^4) = \frac{3\alpha_0^2}{(1 - \alpha_1)^2}\,\frac{1 - \alpha_1^2}{1 - 3\alpha_1^2},   (5.43)

provided 3\alpha_1^2 < 1. Note that

  var(r_t^2) = E(r_t^4) - [E(r_t^2)]^2,

which exists only if 0 < \alpha_1 < 1/\sqrt{3} \approx .58. In addition, these results imply that the kurtosis, \kappa, of r_t is

  \kappa = \frac{E(r_t^4)}{[E(r_t^2)]^2} = 3\,\frac{1 - \alpha_1^2}{1 - 3\alpha_1^2},   (5.44)

which is never smaller than 3, the kurtosis of the normal distribution. Thus, the marginal distribution of the returns, r_t, is leptokurtic, or has "fat tails." Summarizing, if 0 \leq \alpha_1 < 1, the process r_t itself is white noise and its unconditional distribution is symmetrically distributed around zero; this distribution is leptokurtic. If, in addition, 3\alpha_1^2 < 1, the square of the process, r_t^2, follows a causal AR(1) model with ACF given by \rho_{r^2}(h) = \alpha_1^h \geq 0, for all h > 0. If 3\alpha_1^2 \geq 1, but \alpha_1 < 1, it can be shown that r_t^2 is strictly stationary with infinite variance (see Douc, et al., 2014).
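The moment formulas (5.42)-(5.44) are easy to verify numerically. Here is a minimal sketch (not from the text), again with the arbitrary values \alpha_0 = .1 and \alpha_1 = .5 (so that 3\alpha_1^2 < 1 and the fourth moment exists); the empirical variance and kurtosis of a long simulated series should be close to the theoretical values.

set.seed(1)
n = 1e5; a0 = .1; a1 = .5
r = numeric(n); r[1] = rnorm(1, sd = sqrt(a0/(1 - a1)))
for (t in 2:n) r[t] = sqrt(a0 + a1*r[t-1]^2) * rnorm(1)
c(empirical = var(r), theoretical = a0/(1 - a1))                               # (5.42)
c(empirical = mean(r^4)/mean(r^2)^2, theoretical = 3*(1 - a1^2)/(1 - 3*a1^2))  # (5.44)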

[Fig. 5.5. ACF and PACF of the squares of the residuals from the AR(1) fit on U.S. GNP.]

Estimation of the parameters \alpha_0 and \alpha_1 of the ARCH(1) model is typically accomplished by conditional MLE. The conditional likelihood of the data r_2, ..., r_n given r_1 is given by

  L(\alpha_0, \alpha_1 \mid r_1) = \prod_{t=2}^n f_{\alpha_0,\alpha_1}(r_t \mid r_{t-1}),   (5.45)

where the density f_{\alpha_0,\alpha_1}(r_t \mid r_{t-1}) is the normal density specified in (5.38). Hence, the criterion function to be minimized, l(\alpha_0, \alpha_1) \propto -\ln L(\alpha_0, \alpha_1 \mid r_1), is given by

  l(\alpha_0, \alpha_1) = \frac{1}{2} \sum_{t=2}^n \ln(\alpha_0 + \alpha_1 r_{t-1}^2) + \frac{1}{2} \sum_{t=2}^n \frac{r_t^2}{\alpha_0 + \alpha_1 r_{t-1}^2}.   (5.46)

Estimation is accomplished by numerical methods, as described in Section 3.5. In this case, analytic expressions for the gradient vector, l^{(1)}(\alpha_0, \alpha_1), and Hessian matrix, l^{(2)}(\alpha_0, \alpha_1), as described in Example 3.30, can be obtained by straightforward calculations. For example, the 2 x 1 gradient vector, l^{(1)}(\alpha_0, \alpha_1), is given by

  \begin{pmatrix} \partial l / \partial\alpha_0 \\ \partial l / \partial\alpha_1 \end{pmatrix}
  = \sum_{t=2}^n \begin{pmatrix} 1 \\ r_{t-1}^2 \end{pmatrix} \times \frac{\alpha_0 + \alpha_1 r_{t-1}^2 - r_t^2}{2(\alpha_0 + \alpha_1 r_{t-1}^2)^2}.

The calculation of the Hessian matrix is left as an exercise (Problem 5.8). The likelihood of the ARCH model tends to be flat unless n is very large. A discussion of this problem can be found in Shephard (1996).

It is also possible to combine a regression or an ARMA model for the mean with an ARCH model for the errors. For example, a regression with ARCH(1) errors model would have the observations x_t as a linear function of p regressors, z_t = (z_{t1}, ..., z_{tp})', and ARCH(1) noise y_t, say,

  x_t = \beta' z_t + y_t,

where y_t satisfies (5.36)-(5.37) but, in this case, is unobserved. Similarly, for example, an AR(1) model for data x_t exhibiting ARCH(1) errors would be

  x_t = \phi_0 + \phi_1 x_{t-1} + y_t.

These types of models were explored by Weiss (1984).
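In practice one simply hands the criterion (5.46) to a numerical optimizer. The following is a minimal sketch (not from the text) that minimizes (5.46) with optim() for a return series r, for instance the ARCH(1) series simulated above, and reads off approximate standard errors from the numerically differentiated Hessian; the starting values and box constraints are arbitrary.

arch1.crit = function(par, r) {          # par = (alpha0, alpha1)
  a0 = par[1]; a1 = par[2]
  sig2 = a0 + a1 * r[-length(r)]^2       # alpha0 + alpha1*r_{t-1}^2 for t = 2,...,n
  .5 * sum(log(sig2) + r[-1]^2 / sig2)   # criterion (5.46)
}
fit = optim(c(.05, .1), arch1.crit, r = r, method = "L-BFGS-B",
            lower = c(1e-6, 0), upper = c(Inf, .99), hessian = TRUE)
fit$par                          # estimates of alpha0 and alpha1
sqrt(diag(solve(fit$hessian)))   # approximate standard errors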

Example 5.4 Analysis of U.S. GNP

In Example 3.39, we fit an MA(2) model and an AR(1) model to the U.S. GNP series and we concluded that the residuals from both fits appeared to behave like a white noise process. In Example 3.43 we concluded that the AR(1) is probably the better model in this case. It has been suggested that the U.S. GNP series has ARCH errors, and in this example, we will investigate this claim. If the GNP noise term is ARCH, the squares of the residuals from the fit should behave like a non-Gaussian AR(1) process, as pointed out in (5.39). Figure 5.5 shows the ACF and PACF of the squared residuals; it appears that there may be some dependence, albeit small, left in the residuals. The figure was generated in R as follows.

u = sarima(diff(log(gnp)), 1, 0, 0)
acf2(resid(u$fit)^2, 20)

We used the R package fGarch to fit an AR(1)-ARCH(1) model to the U.S. GNP returns with the following results. A partial output is shown; we note that garch(1,0) specifies an ARCH(1) in the code below (details later).

library(fGarch)
summary(garchFit(~arma(1,0)+garch(1,0), diff(log(gnp))))
        Estimate  Std.Error  t.value  p.value
mu         0.005      0.001    5.867    0.000
ar1        0.367      0.075    4.878    0.000
omega      0.000      0.000    8.135    0.000
alpha1     0.194      0.096    2.035    0.042
--
Standardised Residuals Tests:
                                Statistic  p-Value
 Jarque-Bera Test   R    Chi^2      9.118    0.010
 Shapiro-Wilk Test  R    W          0.984    0.014
 Ljung-Box Test     R    Q(20)     23.414    0.269
 Ljung-Box Test     R^2  Q(20)     37.743    0.010

Note that the p-values given in the estimation paragraph are two-sided, so they should be halved when considering the ARCH parameters. In this example, we obtain \hat\phi_0 = .005 (called mu in the output) and \hat\phi_1 = .367 (called ar1) for the AR(1) parameter estimates; in Example 3.39 the values were .005 and .347, respectively. The ARCH(1) parameter estimates are \hat\alpha_0 = 0 (called omega) for the constant and \hat\alpha_1 = .194, which is significant with a p-value of about .02. There are a number of tests that are performed on the residuals [R] or the squared residuals [R^2]. For example, the Jarque-Bera statistic tests the residuals of the fit for normality based on the observed skewness and kurtosis, and it appears that the residuals have some non-normal skewness and kurtosis. The Shapiro-Wilk statistic tests the residuals of the fit for normality based on the empirical order statistics. The other tests, primarily based on the Q-statistic, are used on the residuals and their squares.

The ARCH(1) model can be extended to the general ARCH(p) model in an obvious way. That is, (5.36), r_t = \sigma_t \epsilon_t, is retained, but (5.37) is extended to

  \sigma_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + \cdots + \alpha_p r_{t-p}^2.   (5.47)

Estimation for ARCH(p) also follows in an obvious way from the discussion of estimation for ARCH(1) models. That is, the conditional likelihood of the data r_{p+1}, ..., r_n given r_1, ..., r_p is given by

  L(\alpha \mid r_1, ..., r_p) = \prod_{t=p+1}^n f_\alpha(r_t \mid r_{t-1}, ..., r_{t-p}),   (5.48)

where \alpha = (\alpha_0, \alpha_1, ..., \alpha_p) and, under the assumption of normality, the conditional densities f_\alpha(\cdot \mid \cdot) in (5.48) are, for t > p, given by

  r_t \mid r_{t-1}, ..., r_{t-p} \sim N(0, \alpha_0 + \alpha_1 r_{t-1}^2 + \cdots + \alpha_p r_{t-p}^2).

Another extension of ARCH is the generalized ARCH or GARCH model developed by Bollerslev (1986). For example, a GARCH(1, 1) model retains (5.36), r_t = \sigma_t \epsilon_t, but extends (5.37) as follows:

  \sigma_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 \sigma_{t-1}^2.   (5.49)

Under the condition that \alpha_1 + \beta_1 < 1, using similar manipulations as in (5.39), the GARCH(1, 1) model, (5.36) and (5.49), admits a non-Gaussian ARMA(1, 1) model for the squared process,

  r_t^2 = \alpha_0 + (\alpha_1 + \beta_1) r_{t-1}^2 + v_t - \beta_1 v_{t-1},   (5.50)

where v_t is as defined in (5.39). Representation (5.50) follows by writing (5.36) as

  r_t^2 - \sigma_t^2 = \sigma_t^2 (\epsilon_t^2 - 1)
  \beta_1 (r_{t-1}^2 - \sigma_{t-1}^2) = \beta_1 \sigma_{t-1}^2 (\epsilon_{t-1}^2 - 1),

subtracting the second equation from the first, and using the fact that, from (5.49), \sigma_t^2 - \beta_1 \sigma_{t-1}^2 = \alpha_0 + \alpha_1 r_{t-1}^2, on the left-hand side of the result. The GARCH(p, q) model retains (5.36) and extends (5.49) to

  \sigma_t^2 = \alpha_0 + \sum_{j=1}^p \alpha_j r_{t-j}^2 + \sum_{j=1}^q \beta_j \sigma_{t-j}^2.   (5.51)

Conditional maximum likelihood estimation of the GARCH(p, q) model parameters is similar to the ARCH(p) case, wherein the conditional likelihood, (5.48), is the product of N(0, \sigma_t^2) densities with \sigma_t^2 given by (5.51) and where the conditioning is on the first max(p, q) observations, with \sigma_1^2 = \cdots = \sigma_{\max(p,q)}^2 = 0. Once the parameter estimates are obtained, the model can be used to obtain one-step-ahead forecasts of the volatility, say \hat\sigma_{t+1}^2, given by

  \hat\sigma_{t+1}^2 = \hat\alpha_0 + \sum_{j=1}^p \hat\alpha_j r_{t+1-j}^2 + \sum_{j=1}^q \hat\beta_j \hat\sigma_{t+1-j}^2.   (5.52)
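The mechanics of (5.49) and (5.52) can be checked with a few lines of code. This is a minimal sketch (not from the text) that simulates a GARCH(1, 1) series with arbitrary parameters satisfying \alpha_1 + \beta_1 < 1 and then forms the one-step-ahead volatility forecast using the parameters that generated the data; in practice the forecast would use the estimated parameters instead.

set.seed(1)
n = 1000; a0 = .01; a1 = .10; b1 = .85        # alpha1 + beta1 < 1
r = sig2 = numeric(n)
sig2[1] = a0/(1 - a1 - b1)                    # unconditional variance as a starting value
r[1]    = sqrt(sig2[1]) * rnorm(1)
for (t in 2:n) {
  sig2[t] = a0 + a1*r[t-1]^2 + b1*sig2[t-1]   # (5.49)
  r[t]    = sqrt(sig2[t]) * rnorm(1)
}
(sig2.fcst = a0 + a1*r[n]^2 + b1*sig2[n])     # one-step-ahead sigma^2, as in (5.52)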

We explore these concepts in the following example.

[Fig. 5.6. GARCH one-step-ahead predictions of the DJIA volatility, \hat\sigma_t, superimposed on part of the data including the financial crisis of 2008.]

Example 5.5 GARCH Analysis of the DJIA Returns

As previously mentioned, the daily returns of the DJIA shown in Figure 1.4 exhibit classic GARCH features. In addition, there is some low level autocorrelation in the series itself, and to include this behavior, we used the R package fGarch to fit an AR(1)-GARCH(1, 1) model to the series using t errors:

library(xts)
djiar = diff(log(djia$Close))[-1]
acf2(djiar)      # exhibits some autocorrelation (not shown)
acf2(djiar^2)    # oozes autocorrelation (not shown)
library(fGarch)
summary(djia.g <- garchFit(~arma(1,0)+garch(1,1), data=djiar, cond.dist='std'))
plot(djia.g)     # to see all plot options
          Estimate   Std.Error   t.value    p.value
mu       8.585e-04   1.470e-04     5.842   5.16e-09
ar1     -5.531e-02   2.023e-02    -2.735   0.006239
omega    1.610e-06   4.459e-07     3.611   0.000305
alpha1   1.244e-01   1.660e-02     7.497   6.55e-14
beta1    8.700e-01   1.526e-02    57.022    < 2e-16
shape    5.979e+00   7.917e-01     7.552   4.31e-14
---
Standardised Residuals Tests:
                              Statistic    p-Value
 Ljung-Box Test  R    Q(10)    16.81507  0.0785575
 Ljung-Box Test  R^2  Q(10)    15.39137  0.1184312

To explore the GARCH predictions of volatility, we calculated and plotted part of the data surrounding the financial crisis of 2008 along with the one-step-ahead predictions of the corresponding volatility, \sigma_t^2, as a solid line in Figure 5.6.

Another model that we mention briefly is the asymmetric power ARCH model. The model retains (5.36), r_t = \sigma_t \epsilon_t, but the conditional variance is modeled as

  \sigma_t^\delta = \alpha_0 + \sum_{j=1}^p \alpha_j (|r_{t-j}| - \gamma_j r_{t-j})^\delta + \sum_{j=1}^q \beta_j \sigma_{t-j}^\delta.   (5.53)

Note that the model is GARCH when \delta = 2 and \gamma_j = 0, for j \in \{1, ..., p\}. The parameters \gamma_j (|\gamma_j| \leq 1) are the leverage parameters, which are a measure of asymmetry, and \delta > 0 is the parameter for the power term. A positive [negative] value of the \gamma_j's means that past negative [positive] shocks have a deeper impact on current conditional volatility than past positive [negative] shocks. This model couples the flexibility of a varying exponent with the asymmetry coefficient to take the leverage effect into account. Further, to guarantee that \sigma_t > 0, we assume that \alpha_0 > 0, \alpha_j \geq 0 with at least one \alpha_j > 0, and \beta_j \geq 0.

We continue the analysis of the DJIA returns in the following example.

Example 5.6 APARCH Analysis of the DJIA Returns

The R package fGarch was used to fit an AR-APARCH model to the DJIA returns discussed in Example 5.5. As in the previous example, we include an AR(1) in the model to account for the conditional mean. In this case, we may think of the model as r_t = \mu_t + y_t, where \mu_t is an AR(1), and y_t is APARCH noise with conditional variance modeled as (5.53) with t-errors. A partial output of the analysis is given below. We do not include displays, but we show how to obtain them. The predicted volatility is, of course, different than the values shown in Figure 5.6, but appears similar when graphed.

library(xts)
library(fGarch)
summary(djia.ap <- garchFit(~arma(1,0)+aparch(1,1), data=djiar, cond.dist='std'))
plot(djia.ap)    # to see all plot options (none shown)
          Estimate   Std. Error   t value    p.value
mu       5.234e-04    1.525e-04     3.432   0.000598
ar1     -4.818e-02    1.934e-02    -2.491   0.012727
omega    1.798e-04    3.443e-05     5.222   1.77e-07
alpha1   9.809e-02    1.030e-02     9.525    < 2e-16
gamma1   1.000e+00    1.045e-02    95.731    < 2e-16
beta1    8.945e-01    1.049e-02    85.280    < 2e-16
delta    1.070e+00    1.350e-01     7.928   2.22e-15
shape    7.286e+00    1.123e+00     6.489   8.61e-11
---
Standardised Residuals Tests:
                              Statistic   p-Value
 Ljung-Box Test  R    Q(10)    15.71403  0.108116
 Ljung-Box Test  R^2  Q(10)    16.87473  0.077182

In most applications, the distribution of the noise, \epsilon_t in (5.36), is rarely normal. The R package fGarch allows for various distributions to be fit to the data; see the help file for information. Some drawbacks of GARCH and related models are as follows. (i) The GARCH model assumes positive and negative returns have the same effect because volatility depends on squared returns; the asymmetric models help alleviate this problem. (ii) These models are often restrictive because of the tight constraints on the model parameters (e.g., for an ARCH(1), 0 \leq \alpha_1^2 < 1/3). (iii) The likelihood is

flat unless n is very large. (iv) The models tend to overpredict volatility because they respond slowly to large isolated returns.

Various extensions to the original model have been proposed to overcome some of the shortcomings we have just mentioned. For example, we have already discussed the fact that fGarch allows for asymmetric return dynamics. In the case of persistence in volatility, the integrated GARCH (IGARCH) model may be used. Recall (5.50), where we showed the GARCH(1, 1) model can be written as

  r_t^2 = \alpha_0 + (\alpha_1 + \beta_1) r_{t-1}^2 + v_t - \beta_1 v_{t-1},

and r_t^2 is stationary if \alpha_1 + \beta_1 < 1. The IGARCH model sets \alpha_1 + \beta_1 = 1, in which case the IGARCH(1, 1) model is

  r_t = \sigma_t \epsilon_t   and   \sigma_t^2 = \alpha_0 + (1 - \beta_1) r_{t-1}^2 + \beta_1 \sigma_{t-1}^2.

There are many different extensions to the basic ARCH model that were developed to handle the various situations noticed in practice. Interested readers might find the general discussions in Engle et al. (1994) and Shephard (1996) worthwhile reading. Also, Gouriéroux (1997) gives a detailed presentation of ARCH and related models with financial applications and contains an extensive bibliography. Two excellent texts on financial time series analysis are Chan (2002) and Tsay (2002).

Finally, we briefly discuss stochastic volatility models; a detailed treatment of these models is given in Chapter 6. The volatility component, \sigma_t^2, in GARCH and related models is conditionally nonstochastic. For example, in the ARCH(1) model, any time the previous return is valued at, say, c, i.e., r_{t-1} = c, it must be the case that \sigma_t^2 = \alpha_0 + \alpha_1 c^2. This assumption seems a bit unrealistic. The stochastic volatility model adds a stochastic component to the volatility in the following way. In the GARCH model, a return, say r_t, is

  r_t = \sigma_t \epsilon_t   \Rightarrow   \log r_t^2 = \log \sigma_t^2 + \log \epsilon_t^2.   (5.54)

Thus, the observations \log r_t^2 are generated by two components, the unobserved volatility, \log \sigma_t^2, and the unobserved noise, \log \epsilon_t^2. While, for example, the GARCH(1, 1) model uses \sigma_{t+1}^2 = \alpha_0 + \alpha_1 r_t^2 + \beta_1 \sigma_t^2 and so models volatility without error, the basic stochastic volatility model assumes the logged latent variable is an autoregressive process,

  \log \sigma_{t+1}^2 = \phi_0 + \phi_1 \log \sigma_t^2 + w_t,   (5.55)

where w_t \sim iid N(0, \sigma_w^2). The introduction of the noise term w_t makes the latent volatility process stochastic. Together (5.54) and (5.55) comprise the stochastic volatility model. Given n observations, the goals are to estimate the parameters \phi_0, \phi_1 and \sigma_w^2, and then predict future volatility. Details are provided in Section 6.11.
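A quick simulation clarifies the two-component structure of (5.54)-(5.55). This is a minimal sketch (not from the text) with arbitrary parameter values \phi_0 = -.2, \phi_1 = .95, and \sigma_w = .3; the latent log-volatility is generated first and the returns are then built from it.

set.seed(1)
n = 1000; phi0 = -.2; phi1 = .95; sw = .3
lsig2 = numeric(n)
lsig2[1] = phi0/(1 - phi1)                                           # start at the stationary mean
for (t in 2:n) lsig2[t] = phi0 + phi1*lsig2[t-1] + rnorm(1, 0, sw)   # latent AR(1), as in (5.55)
r = exp(lsig2/2) * rnorm(n)                                          # r_t = sigma_t * eps_t
y = log(r^2)                                                         # the observations in (5.54)
plot.ts(cbind(returns = r, log.r2 = y))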

5.4 Threshold Models

In Section 3.4 we discussed the fact that, for a stationary time series, best linear prediction forward in time is the same as best linear prediction backward in time. This result followed from the fact that the variance-covariance matrix of x_{1:n} = (x_1, x_2, ..., x_n)', say \Gamma = \{\gamma(i - j)\}_{i,j=1}^n, is the same as the variance-covariance matrix of x_{n:1} = (x_n, x_{n-1}, ..., x_1)'. In addition, if the process is Gaussian, the distributions of x_{1:n} and x_{n:1} are identical. In this case, a time plot of x_{1:n} (that is, the data plotted forward in time) should look similar to a time plot of x_{n:1} (that is, the data plotted backward in time).

[Fig. 5.7. U.S. monthly pneumonia and influenza deaths per 10,000.]

There are, however, many series that do not fit into this category. For example, Figure 5.7 shows a plot of monthly pneumonia and influenza deaths per 10,000 in the U.S. for 11 years, 1968 to 1978. Typically, the number of deaths tends to increase faster than it decreases (↑↘), especially during epidemics. Thus, if the data were plotted backward in time, that series would tend to increase slower than it decreases. Also, if monthly pneumonia and influenza deaths followed a linear Gaussian process, we would not expect to see such large bursts of positive and negative changes that occur periodically in this series. Moreover, although the number of deaths is typically largest during the winter months, the data are not perfectly seasonal. That is, although the peak of the series often occurs in January, in other years, the peak occurs in February or in March. Hence, seasonal ARMA models would not capture this behavior.

Many approaches to modeling nonlinear series exist that could be used (see Priestley, 1988); here, we focus on the class of threshold ARMA (TARMA) models presented in Tong (1983, 1990). The basic idea of these models is that of fitting local linear ARMA models, and their appeal is that we can use the intuition from fitting global linear ARMA models. For example, a k-regimes self-exciting threshold ARMA (SETARMA) model has the form

  x_t =
    \phi_0^{(1)} + \sum_{i=1}^{p_1} \phi_i^{(1)} x_{t-i} + w_t^{(1)} + \sum_{j=1}^{q_1} \theta_j^{(1)} w_{t-j}^{(1)},   if x_{t-d} \leq r_1,
    \phi_0^{(2)} + \sum_{i=1}^{p_2} \phi_i^{(2)} x_{t-i} + w_t^{(2)} + \sum_{j=1}^{q_2} \theta_j^{(2)} w_{t-j}^{(2)},   if r_1 < x_{t-d} \leq r_2,
      ...
    \phi_0^{(k)} + \sum_{i=1}^{p_k} \phi_i^{(k)} x_{t-i} + w_t^{(k)} + \sum_{j=1}^{q_k} \theta_j^{(k)} w_{t-j}^{(k)},   if r_{k-1} < x_{t-d},
  (5.56)

where w_t^{(j)} \sim iid N(0, \sigma_j^2), for j = 1, ..., k, the positive integer d is a specified delay, and -\infty < r_1 < \cdots < r_{k-1} < \infty is a partition of the real line.
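The self-exciting mechanism in (5.56) is easy to mimic with a couple of lines of code. The following is a minimal sketch (not from the text) of a two-regime SETAR(1) series with hypothetical coefficients, delay d = 1, and threshold r_1 = 0; the asymmetric, time-irreversible behavior shows up clearly in a lag plot.

set.seed(101)
n = 200; x = numeric(n)
for (t in 2:n) {
  phi  = if (x[t-1] <= 0) -1.5 else .5    # regime determined by x_{t-1} (d = 1, r1 = 0)
  x[t] = phi * x[t-1] + rnorm(1)
}
plot.ts(x)
plot(x[-n], x[-1], xlab = "x(t-1)", ylab = "x(t)")   # two linear regimes are visible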

These models allow for changes in the ARMA coefficients over time, and those changes are determined by comparing previous values (back-shifted by a time lag equal to d) to fixed threshold values. Each different ARMA model is referred to as a regime. In the definition above, the values (p_j, q_j) of the orders of the ARMA models can differ in each regime, although in many applications, they are equal. Stationarity and invertibility are obvious concerns when fitting time series models. For threshold time series models, such as TAR, TMA and TARMA models, however, the stationarity and invertibility conditions in the literature are less well-known in general and are often restricted to models of order one.

The model can be generalized to include the possibility that the regimes depend on a collection of the past values of the process, or that the regimes depend on an exogenous variable (in which case the model is not self-exciting), such as in predator-prey cases. For example, Canadian lynx have been thoroughly studied (see the R data set lynx) and the series is typically used to demonstrate the fitting of threshold models. The lynx prey varies from small rodents to deer, with the Snowshoe Hare being its overwhelmingly favored prey. In fact, in certain areas the lynx is so closely tied to the Snowshoe that its population rises and falls with that of the hare, even though other food sources may be abundant. In this case, it seems reasonable to replace x_{t-d} in (5.56) with, say, y_{t-d}, where y_t is the size of the Snowshoe Hare population.

The popularity of TAR models is due to their being relatively simple to specify, estimate, and interpret as compared to many other nonlinear time series models. In addition, despite its apparent simplicity, the class of TAR models can reproduce many nonlinear phenomena. In the following example, we use these methods to fit a threshold model to the monthly pneumonia and influenza deaths series previously mentioned.

Example 5.7 Threshold Modeling of the Influenza Series

As previously discussed, examination of Figure 5.7 leads us to believe that the monthly pneumonia and influenza deaths time series, say flu_t, is not linear. It is also evident from Figure 5.7 that there is a slight negative trend in the data. We have found that the most convenient way to fit a threshold model to these data, while removing the trend, is to work with the first differences. The differenced data, say

  x_t = flu_t - flu_{t-1},

are exhibited in Figure 5.9 as points (+) representing the observations.

The nonlinearity of the data is more pronounced in the plot of the first differences, x_t. Clearly x_t slowly rises for some months and then, sometime in the winter, has a possibility of jumping to a large number once x_t exceeds about .05. If the process does make a large jump, then a subsequent significant decrease occurs in x_t. Another telling graphic is the lag plot of x_t versus x_{t-1} shown in Figure 5.8, which suggests the possibility of two linear regimes based on whether or not x_{t-1} exceeds .05.

[Fig. 5.8. Scatterplot of dflu_t = flu_t - flu_{t-1} versus dflu_{t-1} with a lowess fit superimposed (line). A vertical dashed line indicates dflu_{t-1} = .05.]

As an initial analysis, we fit the following threshold model

  x_t = \alpha^{(1)} + \sum_{j=1}^p \phi_j^{(1)} x_{t-j} + w_t^{(1)},   x_{t-1} < .05;
  x_t = \alpha^{(2)} + \sum_{j=1}^p \phi_j^{(2)} x_{t-j} + w_t^{(2)},   x_{t-1} \geq .05,
  (5.57)

with p = 6, assuming this would be larger than necessary. Model (5.57) is easy to fit using two linear regression runs, one when x_{t-1} < .05 and the other when x_{t-1} \geq .05. Details are provided in the R code at the end of this example.

An order p = 4 was finally selected and the fit was

  \hat x_t = 0 + .51_{(.08)} x_{t-1} - .20_{(.06)} x_{t-2} + .12_{(.05)} x_{t-3} - .11_{(.05)} x_{t-4} + \hat w_t^{(1)},   for x_{t-1} < .05;

  \hat x_t = .40_{(.05)} - .75_{(.17)} x_{t-1} - 1.03_{(.21)} x_{t-2} - 2.05_{(1.05)} x_{t-3} - 6.71_{(1.25)} x_{t-4} + \hat w_t^{(2)},   for x_{t-1} \geq .05,

where \hat\sigma_1 = .05 and \hat\sigma_2 = .07; the standard errors are given in parentheses. The threshold of .05 was exceeded 17 times.

Using the final model, one-month-ahead predictions can be made, and these are shown in Figure 5.9 as a line. The model does extremely well at predicting a flu epidemic; the peak at 1976, however, was missed by this model. When we fit a model with a smaller threshold of .04, flu epidemics were somewhat underestimated, but the flu epidemic in the eighth year was predicted one month early. We chose the model with a threshold of .05 because the residual diagnostics showed no obvious departure from the model assumption (except for one outlier at 1976); the model with a threshold of .04 still had some correlation left in the residuals and there was more than one outlier. Finally, prediction beyond one-month-ahead for this model is complicated, but some approximate techniques exist (see Tong, 1983). The following commands can be used to perform this analysis in R.

# Plot data with month initials as points
plot(flu, type="c")
Months = c("J","F","M","A","M","J","J","A","S","O","N","D")
points(flu, pch=Months, cex=.8, font=2)
# Start analysis
dflu = diff(flu)
lag1.plot(dflu, corr=FALSE)               # scatterplot with lowess fit
thrsh = .05                               # threshold
Z = ts.intersect(dflu, lag(dflu,-1), lag(dflu,-2), lag(dflu,-3), lag(dflu,-4))
ind1 = ifelse(Z[,2] < thrsh, 1, NA)       # indicator < thrsh
ind2 = ifelse(Z[,2] < thrsh, NA, 1)       # indicator >= thrsh
X1 = Z[,1]*ind1
X2 = Z[,1]*ind2
summary(fit1 <- lm(X1~ Z[,2:5]))          # case 1
summary(fit2 <- lm(X2~ Z[,2:5]))          # case 2
D = cbind(rep(1, nrow(Z)), Z[,2:5])       # design matrix
p1 = D %*% coef(fit1)                     # get predictions
p2 = D %*% coef(fit2)
prd = ifelse(Z[,2] < thrsh, p1, p2)
plot(dflu, ylim=c(-.5,.5), type='p', pch=3)
lines(prd)
prde1 = sqrt(sum(resid(fit1)^2)/df.residual(fit1))
prde2 = sqrt(sum(resid(fit2)^2)/df.residual(fit2))
prde = ifelse(Z[,2] < thrsh, prde1, prde2)
tx = time(dflu)[-(1:4)]
xx = c(tx, rev(tx))
yy = c(prd-2*prde, rev(prd+2*prde))
polygon(xx, yy, border=8, col=gray(.6, alpha=.25))
abline(h=.05, col=4, lty=6)

Finally, we note that there is an R package called tsDyn that can be used to fit these models; we assume dflu already exists.

library(tsDyn)                            # load package - install it if you don't have it
# vignette("tsDyn")                       # for package details
(u = setar(dflu, m=4, thDelay=0, th=.05)) # fit model and view results
(u = setar(dflu, m=4, thDelay=0))         # let program fit threshold (=.036)
BIC(u); AIC(u)                            # if you want to try other models; m=3 works well too
plot(u)                                   # graphics - ?plot.setar for information

The threshold found here is .036, which includes a few more observations than using .04, but suffers from the same drawbacks previously noted.
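The threshold itself can also be chosen by a simple search. Here is a crude sketch (not from the text) that, for a grid of candidate thresholds, fits the two regimes of (5.57) by least squares and records the pooled residual sum of squares; it assumes dflu from the code above exists. The minimizing value should roughly agree with the threshold found by setar() above.

Z = ts.intersect(dflu, lag(dflu,-1), lag(dflu,-2), lag(dflu,-3), lag(dflu,-4))
rss = function(thrsh) {
  lo = Z[,2] < thrsh
  f1 = lm(Z[lo, 1]  ~ Z[lo, 2:5])     # lower regime
  f2 = lm(Z[!lo, 1] ~ Z[!lo, 2:5])    # upper regime
  sum(resid(f1)^2) + sum(resid(f2)^2)
}
grid = seq(.02, .08, by = .002)
plot(grid, sapply(grid, rss), type = "o", xlab = "threshold", ylab = "pooled RSS")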

[Fig. 5.9. First differenced U.S. monthly pneumonia and influenza deaths (+); one-month-ahead predictions (solid line) with ±2 prediction error bounds. The horizontal line is the threshold.]

5.5 Lagged Regression and Transfer Function Modeling

In Section 4.8, we considered lagged regression in a frequency domain approach based on coherency. For example, consider the SOI and Recruitment series that were analyzed in Example 4.24; the series are displayed in Figure 1.5. In that example, the interest was in predicting the output Recruitment series, say y_t, from the input SOI, say x_t. We considered the lagged regression model

  y_t = \sum_{j=0}^\infty \alpha_j x_{t-j} + \eta_t = \alpha(B) x_t + \eta_t,   (5.58)

where \sum_j |\alpha_j| < \infty. We assume the input process x_t and noise process \eta_t in (5.58) are both stationary and mutually independent. The coefficients \alpha_0, \alpha_1, ... describe the weights assigned to past values of x_t used in predicting y_t, and we have used the notation

  \alpha(B) = \sum_{j=0}^\infty \alpha_j B^j.   (5.59)

In the Box and Jenkins (1970) formulation, we assign ARIMA models, say, ARIMA(p, d, q) and ARIMA(p_\eta, d_\eta, q_\eta), to the series x_t and \eta_t, respectively. In Section 4.8, we assumed the noise, \eta_t, was white. The components of (5.58) in backshift notation, for the case of simple ARMA(p, q) modeling of the input and noise, would have the representation

  \phi(B) x_t = \theta(B) w_t   (5.60)

and

  \phi_\eta(B) \eta_t = \theta_\eta(B) z_t,   (5.61)

where w_t and z_t are independent white noise processes with variances \sigma_w^2 and \sigma_z^2, respectively. Box and Jenkins (1970) proposed that systematic patterns often observed in the coefficients \alpha_j, for j = 1, 2, ..., could often be expressed as a ratio of polynomials involving a small number of coefficients, along with a specified delay, d, so

  \alpha(B) = \frac{\delta(B) B^d}{\omega(B)},   (5.62)

where

  \omega(B) = 1 - \omega_1 B - \omega_2 B^2 - \cdots - \omega_r B^r   (5.63)

and

  \delta(B) = \delta_0 + \delta_1 B + \cdots + \delta_s B^s   (5.64)

are the indicated operators; in this section, we find it convenient to represent the inverse of an operator, say, \omega(B), as 1/\omega(B).
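The point of the rational form (5.62) is that a small number of \omega's and \delta's can generate an infinite sequence of lag weights \alpha_j. The following is a minimal sketch (not from the text) with hypothetical values r = 1, s = 0, and d = 2: expanding \delta_0 B^2/(1 - \omega_1 B) gives \alpha_j = 0 for j < 2 and \alpha_{2+k} = \delta_0 \omega_1^k thereafter.

delta0 = 1; omega1 = .8; d = 2
alpha  = c(rep(0, d), delta0 * omega1^(0:20))   # lag weights implied by (5.62)
plot(0:(length(alpha)-1), alpha, type = "h", xlab = "lag j", ylab = "alpha_j")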

[Fig. 5.10. Sample ACF and PACF of detrended SOI.]

Determining a parsimonious model involving a simple form for \alpha(B) and estimating all of the parameters in the above model are the main tasks in the transfer function methodology. Because of the large number of parameters, it is necessary to develop a sequential methodology. Suppose we focus first on finding the ARIMA model for the input x_t and apply this operator to both sides of (5.58), obtaining the new model

  \tilde y_t = \frac{\phi(B)}{\theta(B)}\, y_t = \alpha(B)\, \frac{\phi(B)}{\theta(B)}\, x_t + \frac{\phi(B)}{\theta(B)}\, \eta_t = \alpha(B) w_t + \tilde\eta_t,

where w_t and the transformed noise \tilde\eta_t are independent.

The series w_t is a prewhitened version of the input series, and its cross-covariance with the transformed output series \tilde y_t will be just

  \gamma_{\tilde y w}(h) = E[\tilde y_{t+h} w_t] = E\Big[ \sum_{j=0}^\infty \alpha_j w_{t+h-j} w_t \Big] = \sigma_w^2\, \alpha_h,   (5.65)

because the autocovariance function of white noise will be zero except when j = h in (5.65). Hence, computing the cross-correlation between the prewhitened input series and the transformed output series should yield a rough estimate of the behavior of \alpha(B).

Example 5.8 Relating the Prewhitened SOI to the Transformed Recruitment Series

We give a simple example of the suggested procedure for the SOI and the Recruitment series. Figure 5.10 shows the sample ACF and PACF of the detrended SOI, and it is clear, from the PACF, that an autoregressive series with p = 1 will do a

[Fig. 5.11. Sample CCF of the prewhitened, detrended SOI and the similarly transformed Recruitment series; negative lags indicate that SOI leads Recruitment.]

reasonable job. Fitting the series gave \hat\phi = .588 with \hat\sigma_w^2 = .092, and we applied the operator (1 - .588B) to both x_t and y_t and computed the cross-correlation function, which is shown in Figure 5.11. Noting the apparent shift of d = 5 months and the decrease thereafter, it seems plausible to hypothesize a model of the form

  \alpha(B) = \delta_0 B^5 (1 + \omega_1 B + \omega_1^2 B^2 + \cdots) = \frac{\delta_0 B^5}{1 - \omega_1 B}

for the transfer function. In this case, we would expect \omega_1 to be negative. The following R code was used for this example.

soi.d = resid(lm(soi~time(soi), na.action=NULL))  # detrended SOI
acf2(soi.d)
fit = arima(soi.d, order=c(1,0,0))
ar1 = as.numeric(coef(fit)[1])                    # = 0.5875
soi.pw = resid(fit)
rec.fil = filter(rec, filter=c(1, -ar1), sides=1)
ccf(soi.pw, rec.fil, ylab="CCF", na.action=na.omit, panel.first=grid())

In the code above, soi.pw is the prewhitened detrended SOI series, and rec.fil is the filtered Recruitment series.

In some cases, we may postulate the form of the separate components \delta(B) and \omega(B), so we might write the equation

  y_t = \frac{\delta(B) B^d}{\omega(B)}\, x_t + \eta_t

as

  \omega(B) y_t = \delta(B) B^d x_t + \omega(B) \eta_t,

or in regression form

  y_t = \sum_{k=1}^r \omega_k y_{t-k} + \sum_{k=0}^s \delta_k x_{t-d-k} + u_t,   (5.66)

where

  u_t = \omega(B) \eta_t.   (5.67)

Once we have (5.66), it will be easy to fit the model if we forget about (5.67) and allow u_t to have any ARMA behavior. We illustrate this technique in the following example.

Example 5.9 Transfer Function Model for SOI and Recruitment

We illustrate the procedure for fitting a lagged regression model of the form suggested in Example 5.8 to the detrended SOI series (x_t) and the Recruitment series (y_t). The results reported here are practically the same as the results obtained from the frequency domain approach used in Example 4.24. Based on Example 5.8, we have determined that

  y_t = \alpha + \omega_1 y_{t-1} + \delta_0 x_{t-5} + u_t

is a reasonable model. At this point, we simply run the regression allowing for autocorrelated errors based on the techniques discussed in Section 3.8. Based on these techniques, the fitted model is the same as the one obtained in Example 4.24, namely,

  y_t = 12 + .8\, y_{t-1} - 21\, x_{t-5} + u_t   and   u_t = .45\, u_{t-1} + w_t,

where w_t is white noise with \sigma_w^2 = 50.

Figure 5.12 displays the ACF and PACF of the estimated noise u_t, showing that an AR(1) is appropriate. In addition, the figure displays the Recruitment series and the one-step-ahead predictions based on the final model. The following R code was used for this example.

soi.d = resid(lm(soi~time(soi), na.action=NULL))
fish = ts.intersect(rec, RL1=lag(rec,-1), SL5=lag(soi.d,-5))
(u = lm(fish[,1]~fish[,2:3], na.action=NULL))
acf2(resid(u))                                      # suggests ar1
(arx = sarima(fish[,1], 1, 0, 0, xreg=fish[,2:3]))  # final model
Coefficients:
         ar1  intercept     RL1       SL5
      0.4487    12.3323  0.8005  -21.0307
s.e.  0.0503     1.5746  0.0234    1.0915
sigma^2 estimated as 49.93
pred = rec + resid(arx$fit)                         # 1-step-ahead predictions
ts.plot(pred, rec, col=c('gray90',1), lwd=c(7,1))

For completeness, we finish the discussion of the more complicated Box-Jenkins method for fitting transfer function models. We note, however, that the method has no recognizable overall optimality, and is not generally better or worse than the method previously discussed.

The form of (5.66) suggests doing a regression on the lagged versions of both the input and output series to obtain \hat\beta, the estimate of the (r + s + 1) x 1 regression vector

  \beta = (\omega_1, ..., \omega_r, \delta_0, \delta_1, ..., \delta_s)'.

The residuals from the regression, say,

[Fig. 5.12. Top: ACF and PACF of the estimated noise u_t. Bottom: The Recruitment series (line) and the one-step-ahead predictions (gray swatch) based on the final transfer function model.]

  \hat u_t = y_t - \hat\beta' z_t,

where

  z_t = (y_{t-1}, ..., y_{t-r}, x_{t-d}, ..., x_{t-d-s})'

denotes the usual vector of independent variables, could be used to approximate the best ARMA model for the noise process \eta_t, because we can compute an estimator for that process from (5.67), using \hat u_t and \hat\omega(B), and then fit an ARMA(p_\eta, q_\eta) model to this estimated noise to complete the specification. The preceding suggests the following sequential procedure for fitting the transfer function model to data; a short code sketch of steps (iv) and (v) is given after the list.

(i) Fit an ARMA model to the input series x_t to estimate the parameters \phi_1, ..., \phi_p, \theta_1, ..., \theta_q, \sigma_w^2 in the specification (5.60). Retain the ARMA coefficients for use in step (ii) and the fitted residuals \hat w_t for use in step (iii).

(ii) Apply the operator determined in step (i), that is,

  \hat\phi(B) y_t = \hat\theta(B) \tilde y_t,

to determine the transformed output series \tilde y_t.

(iii) Use the cross-correlation function between \tilde y_t and \hat w_t in steps (i) and (ii) to suggest a form for the components of the polynomial

  \alpha(B) = \frac{\delta(B) B^d}{\omega(B)}

and the estimated time delay d.

(iv) Obtain \hat\beta = (\hat\omega_1, ..., \hat\omega_r, \hat\delta_0, ..., \hat\delta_s) by fitting a linear regression of the form (5.66). Retain the residuals \hat u_t for use in step (v).

(v) Apply the moving average transformation (5.67) to the residuals \hat u_t to find the noise series \hat\eta_t, and fit an ARMA model to the noise, obtaining the estimated coefficients in \hat\phi_\eta(B) and \hat\theta_\eta(B).
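As promised, here is a minimal sketch (not from the text) of steps (iv) and (v) for the SOI/Recruitment model of Example 5.8, with r = 1, s = 0, and d = 5; it assumes the astsa series soi and rec are available. The regression gives (\omega_1, \delta_0), and the noise \eta_t is then recovered by inverting \omega(B) recursively before a model is identified for it.

soi.d = resid(lm(soi ~ time(soi), na.action = NULL))             # detrended input
fish  = ts.intersect(rec, RL1 = lag(rec,-1), SL5 = lag(soi.d,-5))
fit   = lm(fish[,1] ~ fish[,2:3], na.action = NULL)              # step (iv): regression (5.66)
omega1 = coef(fit)[2]                                            # coefficient on y_{t-1}
u.hat  = resid(fit)                                              # u_t = omega(B) eta_t
eta.hat = filter(u.hat, filter = omega1, method = "recursive")   # eta_t = u_t + omega1*eta_{t-1}
acf(eta.hat); pacf(eta.hat)                                      # step (v): identify a noise model
arima(eta.hat, order = c(1,0,0))                                 # e.g., an AR(1) for the noise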

The above procedure is fairly reasonable, but, as previously mentioned, is not optimal in any sense. Simultaneous least squares estimation, based on the observed x_t and y_t, can be accomplished by noting that the transfer function model can be written as

  y_t = \frac{\delta(B) B^d}{\omega(B)}\, x_t + \frac{\theta_\eta(B)}{\phi_\eta(B)}\, z_t,

which can be put in the form

  \omega(B) \phi_\eta(B) y_t = \phi_\eta(B) \delta(B) B^d x_t + \omega(B) \theta_\eta(B) z_t,   (5.68)

and it is clear that we may use least squares to minimize \sum_t z_t^2, as in earlier sections. In Example 5.9, we simply allowed u_t in (5.66) to have any ARMA structure. Finally, we mention that we may also express the transfer function in state-space form as an ARMAX model; see Section 5.6 and Section 6.6.1.

5.6 Multivariate ARMAX Models

To understand multivariate time series models and their capabilities, we first present an introduction to multivariate time series regression techniques. Since all processes are vector processes, we suspend the use of boldface for vectors.

A useful extension of the basic univariate regression model presented in Section 2.1 is the case in which we have more than one output series, that is, multivariate regression analysis. Suppose, instead of a single output variable y_t, a collection of k output variables y_{t1}, y_{t2}, ..., y_{tk} exist that are related to the inputs as

  y_{ti} = \beta_{i1} z_{t1} + \beta_{i2} z_{t2} + \cdots + \beta_{ir} z_{tr} + w_{ti}   (5.69)

for each of the i = 1, 2, ..., k output variables. We assume the w_{ti} variables are correlated over the variable identifier i, but are still independent over time. Formally, we assume cov\{w_{si}, w_{tj}\} = \sigma_{ij} for s = t and is zero otherwise. Then, writing (5.69) in matrix notation, with y_t = (y_{t1}, y_{t2}, ..., y_{tk})' being the vector of outputs, and B = \{\beta_{ij}\}, i = 1, ..., k, j = 1, ..., r, being the k x r matrix containing the regression coefficients, leads to the simple looking form

  y_t = B z_t + w_t.   (5.70)

Here, the k x 1 vector process w_t is assumed to be a collection of independent vectors with common covariance matrix E\{w_t w_t'\} = \Sigma_w, the k x k matrix containing the covariances \sigma_{ij}. Under the assumption of normality, the maximum likelihood estimator for the regression matrix is

  \hat B = \Big( \sum_{t=1}^n y_t z_t' \Big) \Big( \sum_{t=1}^n z_t z_t' \Big)^{-1}.   (5.71)

The error covariance matrix \Sigma_w is estimated by

  \hat\Sigma_w = \frac{1}{n - r} \sum_{t=1}^n (y_t - \hat B z_t)(y_t - \hat B z_t)'.   (5.72)

The uncertainty in the estimators can be evaluated from

  se(\hat\beta_{ij}) = \sqrt{ c_{ii}\, \hat\sigma_{jj} },   (5.73)

for i = 1, ..., r, j = 1, ..., k, where se denotes estimated standard error, \hat\sigma_{jj} is the j-th diagonal element of \hat\Sigma_w, and c_{ii} is the i-th diagonal element of ( \sum_{t=1}^n z_t z_t' )^{-1}.

Also, the information theoretic criterion changes to

  AIC = \ln|\hat\Sigma_w| + \frac{2}{n}\Big( kr + \frac{k(k+1)}{2} \Big),   (5.74)

and BIC replaces the second term in (5.74) by K \ln n / n, where K = kr + k(k+1)/2. Bedrick and Tsai (1994) have given a corrected form for AIC in the multivariate case as

  AICc = \ln|\hat\Sigma_w| + \frac{k(r + n)}{n - k - r - 1}.   (5.75)

Many data sets involve more than one time series, and we are often interested in the possible dynamics relating all series. In this situation, we are interested in modeling and forecasting k x 1 vector-valued time series x_t = (x_{t1}, ..., x_{tk})', t = 0, ±1, ±2, .... Unfortunately, extending univariate ARMA models to the multivariate case is not so simple. The multivariate autoregressive model, however, is a straightforward extension of the univariate AR model. For the first-order vector autoregressive model, VAR(1), we take

  x_t = \alpha + \Phi x_{t-1} + w_t,   (5.76)

where \Phi is a k x k transition matrix that expresses the dependence of x_t on x_{t-1}. The vector white noise process w_t is assumed to be multivariate normal with mean zero and covariance matrix

  E(w_t w_t') = \Sigma_w.   (5.77)

The vector \alpha = (\alpha_1, \alpha_2, ..., \alpha_k)' appears as the constant in the regression setting. If E(x_t) = \mu, then \alpha = (I - \Phi)\mu.
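The estimators (5.71)-(5.73) are just a few lines of matrix algebra. The following is a minimal sketch (not from the text) using hypothetical simulated data, with Y an n x k matrix whose rows are y_t' and Z an n x r matrix whose rows are z_t'; the names and coefficient values are arbitrary.

set.seed(1)
n = 100; k = 2; r = 3
Z = cbind(1, matrix(rnorm(n*(r-1)), n))        # regressors; first column is a constant
B.true = matrix(c(1, 2, 0, -1, .5, 3), k, r)   # hypothetical k x r coefficient matrix
Y = Z %*% t(B.true) + matrix(rnorm(n*k), n)    # y_t = B z_t + w_t, written row-wise
B.hat = t(Y) %*% Z %*% solve(t(Z) %*% Z)       # (5.71)
E = Y - Z %*% t(B.hat)                         # residuals y_t - B.hat z_t
S.w = t(E) %*% E / (n - r)                     # (5.72)
C = solve(t(Z) %*% Z)
SE = sqrt(outer(diag(S.w), diag(C)))           # (5.73), arranged to match the layout of B.hat
B.hat; SE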

Note the similarity between the VAR model and the multivariate linear regression model (5.70). The regression formulas carry over, and we can, on observing (x_1, ..., x_n), set up the model (5.76) with y_t = x_t, B = (\alpha, \Phi) and z_t = (1, x_{t-1}')'. Then, write the solution as (5.71) with the conditional maximum likelihood estimator for the covariance matrix given by

  \hat\Sigma_w = (n - 1)^{-1} \sum_{t=2}^n (x_t - \hat\alpha - \hat\Phi x_{t-1})(x_t - \hat\alpha - \hat\Phi x_{t-1})'.   (5.78)

The special form assumed for the constant component, \alpha, of the vector AR model in (5.76) can be generalized to include a fixed r x 1 vector of inputs, u_t. That is, we could have proposed the vector ARX model,

  x_t = \Gamma u_t + \sum_{j=1}^p \Phi_j x_{t-j} + w_t,   (5.79)

where \Gamma is a k x r parameter matrix. The X in ARX refers to the exogenous vector process we have denoted here by u_t. The introduction of exogenous variables through \Gamma u_t does not present any special problems in making inferences, and we will often drop the X for being superfluous.

Example 5.10 Pollution, Weather, and Mortality

For example, for the three-dimensional series composed of cardiovascular mortality x_{t1}, temperature x_{t2}, and particulate levels x_{t3}, introduced in Example 2.2, take x_t = (x_{t1}, x_{t2}, x_{t3})' as a vector of dimension k = 3. We might envision dynamic relations among the three series defined as the first order relation,

  x_{t1} = \alpha_1 + \beta_1 t + \phi_{11} x_{t-1,1} + \phi_{12} x_{t-1,2} + \phi_{13} x_{t-1,3} + w_{t1},

which expresses the current value of mortality as a linear combination of trend and its immediate past value and the past values of temperature and particulate levels. Similarly,

  x_{t2} = \alpha_2 + \beta_2 t + \phi_{21} x_{t-1,1} + \phi_{22} x_{t-1,2} + \phi_{23} x_{t-1,3} + w_{t2}

and

  x_{t3} = \alpha_3 + \beta_3 t + \phi_{31} x_{t-1,1} + \phi_{32} x_{t-1,2} + \phi_{33} x_{t-1,3} + w_{t3}

express the dependence of temperature and particulate levels on the other series. Of course, methods for the preliminary identification of these models exist, and we will discuss these methods shortly. The model in the form of (5.79) is

  x_t = \Gamma u_t + \Phi x_{t-1} + w_t,

where, in obvious notation, \Gamma = [\alpha \mid \beta] is 3 x 2 and u_t = (1, t)' is 2 x 1.

Throughout much of this section we will use the R package vars to fit vector AR models via least squares. For this particular example, we have (partial output shown):

library(vars)
x = cbind(cmort, tempr, part)
summary(VAR(x, p=1, type='both'))   # 'both' fits constant + trend
Estimation results for equation cmort:   # other equations not shown
cmort = cmort.l1 + tempr.l1 + part.l1 + const + trend
            Estimate  Std. Error  t value  p.value

cmort.l1    0.464824    0.036729   12.656   < 2e-16
tempr.l1   -0.360888    0.032188  -11.212   < 2e-16
part.l1     0.099415    0.019178    5.184  3.16e-07
const      73.227292    4.834004   15.148   < 2e-16
trend      -0.014459    0.001978   -7.308  1.07e-12
--
Residual standard error: 5.583 on 502 degrees of freedom
Multiple R-Squared: 0.6908, Adjusted R-squared: 0.6883
F-statistic: 280.3 on 4 and 502 DF, p-value: < 2.2e-16

Covariance matrix of residuals:          Correlation matrix of residuals:
        cmort  tempr   part                     cmort  tempr   part
cmort  31.172  5.975  16.65             cmort  1.0000 0.1672 0.2484
tempr   5.975 40.965  42.32             tempr  0.1672 1.0000 0.5506
part   16.654 42.323 144.26             part   0.2484 0.5506 1.0000

For this particular case, we obtain

  \hat\alpha = (73.23, 67.59, 67.46)',   \hat\beta = (-0.014, -0.007, -0.005)',

  \hat\Phi = [  .46_{(.04)}  -.36_{(.03)}   .10_{(.02)} ;
               -.24_{(.04)}   .49_{(.04)}  -.13_{(.02)} ;
               -.12_{(.08)}  -.48_{(.07)}   .58_{(.04)} ],

  \hat\Sigma_w = [ 31.17   5.98  16.65 ;
                    5.98  40.97  42.32 ;
                   16.65  42.32 144.26 ],

where the standard errors, computed as in (5.73), are given in parentheses.

For the vector (x_{t1}, x_{t2}, x_{t3}) = (M_t, T_t, P_t), with M_t, T_t, and P_t denoting mortality, temperature, and particulate level, respectively, we obtain the prediction equation for mortality,

  \hat M_t = 73.23 - .014\, t + .46\, M_{t-1} - .36\, T_{t-1} + .10\, P_{t-1}.

Comparing observed and predicted mortality with this model leads to an R^2 of about .69.

It is easy to extend the VAR(1) process to higher orders, VAR(p). To do this, we use the notation of (5.70) and write the vector of regressors as

  z_t = (1, x_{t-1}', x_{t-2}', ..., x_{t-p}')'

and the regression matrix as B = (\alpha, \Phi_1, \Phi_2, ..., \Phi_p). Then, this regression model can be written as

  x_t = \alpha + \sum_{j=1}^p \Phi_j x_{t-j} + w_t   (5.80)

for t = p + 1, ..., n. The k x k error sum of products matrix becomes

  SSE = \sum_{t=p+1}^n (x_t - B z_t)(x_t - B z_t)',   (5.81)

so that the conditional maximum likelihood estimator for the error covariance matrix \Sigma_w is

  \hat\Sigma_w = SSE / (n - p),   (5.82)

as in the multivariate regression case, except now only n - p residuals exist in (5.81). For the multivariate case, we have found that the Schwarz criterion

  BIC = \ln|\hat\Sigma_w| + k^2 p \ln n / n,   (5.83)

gives more reasonable classifications than either AIC or the corrected version AICc. The result is consistent with those reported in simulations by Lütkepohl (1985). Of course, estimation via Yule-Walker, unconditional least squares, and MLE follow directly from the univariate counterparts.

Example 5.11 Pollution, Weather, and Mortality (cont)

We used the R package vars first to select a VAR(p) model and then fit the model. The selection criteria used in the package are AIC, Hannan-Quinn (HQ; Hannan & Quinn, 1979), BIC (SC), and Final Prediction Error (FPE). The Hannan-Quinn procedure is similar to BIC, but with \ln n replaced by 2 \ln(\ln(n)) in the penalty term. FPE finds the model that minimizes the approximate mean squared one-step-ahead prediction error (see Akaike, 1969 for details); it is rarely used.

VARselect(x, lag.max=10, type="both")
$selection
AIC(n)  HQ(n)  SC(n) FPE(n)
     9      5      2      9
$criteria
            1      2      3      4      5      6      7      8      9     10
AIC(n) 11.738 11.302 11.268 11.230 11.176 11.153 11.152 11.129 11.119 11.120
HQ(n)  11.788 11.381 11.377 11.370 11.346 11.352 11.381 11.388 11.408 11.439
SC(n)  11.865 11.505 11.547 11.585 11.608 11.660 11.736 11.788 11.855 11.932

Note that BIC picks the order p = 2 model while AIC and FPE pick an order p = 9 model, and Hannan-Quinn selects an order p = 5 model.

Fitting the model selected by BIC we obtain

  \hat\alpha = (56.1, 49.9, 59.6)',   \hat\beta = (-0.011, -0.005, -0.008)',

  \hat\Phi_1 = [  .30_{(.04)}  -.20_{(.04)}   .04_{(.02)} ;
                 -.11_{(.05)}   .26_{(.05)}  -.05_{(.03)} ;
                  .08_{(.09)}  -.39_{(.09)}   .39_{(.05)} ],

  \hat\Phi_2 = [  .28_{(.04)}  -.08_{(.03)}   .07_{(.03)} ;
                 -.04_{(.05)}  -.36_{(.05)}   .10_{(.03)} ;
                 -.33_{(.09)}   .05_{(.09)}   .38_{(.05)} ],

where the standard errors are given in parentheses. The estimate of \Sigma_w is

  \hat\Sigma_w = [ 28.03   7.08  16.33 ;
                    7.08  37.63  40.88 ;
                   16.33  40.88 123.45 ].

To fit the model using the vars package use the following:

summary(fit <- VAR(x, p=2, type="both"))   # partial results displayed
cmort = cmort.l1 + tempr.l1 + part.l1 + cmort.l2 + tempr.l2 + part.l2 + const + trend
            Estimate  Std. Error  t value  p.value
cmort.l1    0.297059    0.043734    6.792 3.15e-11
tempr.l1   -0.199510    0.044274   -4.506 8.23e-06
part.l1     0.042523    0.024034    1.769  0.07745
cmort.l2    0.276194    0.041938    6.586 1.15e-10
tempr.l2   -0.079337    0.044679   -1.776  0.07639
part.l2     0.068082    0.025286    2.692  0.00733
const      56.098652    5.916618    9.482  < 2e-16
trend      -0.011042    0.001992   -5.543 4.84e-08

Covariance matrix of residuals:
        cmort  tempr   part
cmort  28.034  7.076  16.33
tempr   7.076 37.627  40.88
part   16.325 40.880 123.45

Using the notation of the previous example, the prediction model for cardiovascular mortality is estimated to be

  \hat M_t = 56 - .01\, t + .3\, M_{t-1} - .2\, T_{t-1} + .04\, P_{t-1} + .28\, M_{t-2} - .08\, T_{t-2} + .07\, P_{t-2}.

To examine the residuals, we can plot the cross-correlations of the residuals and examine the multivariate version of the Q-test as follows:

acf(resid(fit), 52)
serial.test(fit, lags.pt=12, type="PT.adjusted")
 Portmanteau Test (adjusted)
 data: Residuals of VAR object fit
 Chi-squared = 162.3502, df = 90, p-value = 4.602e-06

The cross-correlation matrix is shown in Figure 5.13. The figure shows the ACFs of the individual residual series along the diagonal. For example, the first diagonal graph is the ACF of M_t - \hat M_t, and so on. The off-diagonals display the CCFs between pairs of residual series. If the title of an off-diagonal plot is x & y, then y leads in the graphic; that is, on the upper-diagonal, the plot shows corr[x(t+Lag), y(t)], whereas on the lower-diagonal, if the title is x & y, you get a plot of corr[x(t+Lag), y(t)] (yes, it is the same thing, but the lags are negative in the lower diagonal). The graphic is labeled in a strange way; just remember that the second named series is the one that leads. In Figure 5.13 we notice that most of the correlations in the residual series are negligible; however, the zero-order correlation of mortality with temperature residuals is about .22, and mortality with particulate residuals is about .28 (type acf(resid(fit),52)$acf to see the actual values). This means that the AR model is not capturing the concurrent effect of temperature and pollution on mortality (recall the data evolve over a week). It is possible to fit simultaneous models; see Reinsel (1997) for further details. Thus, not unexpectedly, the Q-test rejects the null hypothesis that the noise is white. The Q-test statistic is given by

  Q = n^2 \sum_{h=1}^H \frac{1}{n - h}\, \mathrm{tr}\Big[ \hat\Gamma_w(h)'\, \hat\Gamma_w(0)^{-1}\, \hat\Gamma_w(h)\, \hat\Gamma_w(0)^{-1} \Big],   (5.84)

[Fig. 5.13. ACFs (diagonals) and CCFs (off-diagonals) for the residuals of the three-dimensional VAR(2) fit to the LA mortality-pollution data set. On the off-diagonals, the second-named series is the one that leads.]

where

  \hat\Gamma_w(h) = n^{-1} \sum_{t=1}^{n-h} \hat w_{t+h} \hat w_t',

and \hat w_t is the residual process. Under the null that w_t is white noise, (5.84) has an asymptotic \chi^2 distribution with k^2 (H - p) degrees of freedom.

Finally, prediction follows in a straightforward manner from the univariate case. Using the R package vars, use the predict command and the fanchart command, which produces a nice graphic:

(fit.pr = predict(fit, n.ahead = 24, ci = 0.95))   # 4 weeks ahead
fanchart(fit.pr)                                   # plot prediction + error

The results are displayed in Figure 5.14; we note that the package stripped time when plotting the fanchart, and the horizontal axis is labeled 1, 2, 3, ....
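The statistic (5.84) is simple to compute directly from the residual matrix, which makes the definition concrete. This is a minimal sketch (not from the text); it assumes the VAR(2) object fit from Example 5.11 exists, and the result should be close to, though not necessarily identical to, the serial.test() output above.

w = resid(fit); n = nrow(w); k = ncol(w); H = 12; p = 2
Gam = function(h) t(w[(1+h):n, ]) %*% w[1:(n-h), ] / n     # Gamma_hat_w(h)
G0i = solve(Gam(0))
Q = n^2 * sum(sapply(1:H, function(h)
        sum(diag( t(Gam(h)) %*% G0i %*% Gam(h) %*% G0i )) / (n - h)))
c(Q = Q, df = k^2*(H - p), p.value = pchisq(Q, k^2*(H - p), lower.tail = FALSE))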

[Fig. 5.14. Predictions from a VAR(2) fit to the LA mortality-pollution data.]

For pure VAR(p) models, the autocovariance structure leads to the multivariate version of the Yule-Walker equations:

  \Gamma(h) = \sum_{j=1}^p \Phi_j \Gamma(h - j),   h = 1, 2, ...,   (5.85)

  \Gamma(0) = \sum_{j=1}^p \Phi_j \Gamma(-j) + \Sigma_w,   (5.86)

where \Gamma(h) = cov(x_{t+h}, x_t) is a k x k matrix and \Gamma(-h) = \Gamma(h)'.

Estimation of the autocovariance matrix is similar to the univariate case, that is, with \bar x = n^{-1} \sum_{t=1}^n x_t as an estimate of \mu = E x_t,

  \hat\Gamma(h) = n^{-1} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(x_t - \bar x)',   h = 0, 1, 2, ..., n - 1,   (5.87)

and \hat\Gamma(-h) = \hat\Gamma(h)'. If \hat\gamma_{i,j}(h) denotes the element in the i-th row and j-th column of \hat\Gamma(h), the cross-correlation functions (CCF), as discussed in (1.35), are estimated by

  \hat\rho_{i,j}(h) = \frac{ \hat\gamma_{i,j}(h) }{ \sqrt{ \hat\gamma_{i,i}(0)\, \hat\gamma_{j,j}(0) } },   h = 0, 1, 2, ..., n - 1.   (5.88)

When i = j in (5.88), we get the estimated autocorrelation function (ACF) of the individual series.

Although least squares estimation was used in Example 5.10 and Example 5.11, we could have also used Yule-Walker estimation, or conditional or unconditional maximum likelihood estimation. As in the univariate case, the Yule-Walker estimators, the maximum likelihood estimators, and the least squares estimators are asymptotically equivalent.
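The estimates (5.87) and (5.88) are again direct matrix computations. Here is a minimal sketch (not from the text) for the mortality data, assuming x = cbind(cmort, tempr, part) as in the examples; the result can be compared with the multivariate acf() output.

n = nrow(x)
xc = scale(x, center = TRUE, scale = FALSE)                      # x_t - xbar
Gamma.hat = function(h) t(xc[(1+h):n, ]) %*% xc[1:(n-h), ] / n   # (5.87)
d = sqrt(diag(Gamma.hat(0)))
rho.hat = function(h) Gamma.hat(h) / outer(d, d)                 # (5.88)
round(rho.hat(1), 2)          # lag-1 cross-correlation matrix
# compare with acf(x, lag.max = 1)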

To exhibit the asymptotic distribution of the autoregression parameter estimators, we write

  φ = vec(Φ_1, ..., Φ_p),

where the vec operator stacks the columns of a matrix into a vector. For example, for a bivariate AR(2) model,

  φ = vec(Φ_1, Φ_2) = (Φ_{1_{11}}, Φ_{1_{21}}, Φ_{1_{12}}, Φ_{1_{22}}, Φ_{2_{11}}, Φ_{2_{21}}, Φ_{2_{12}}, Φ_{2_{22}})',

where Φ_{ℓ_{ij}} is the ij-th element of Φ_ℓ, ℓ = 1, 2. Because (Φ_1, ..., Φ_p) is a k × kp matrix, φ is a k²p × 1 vector. We now state the following property.

Property 5.1 Large-Sample Distribution of VAR Estimators
Let φ̂ denote the vector of parameter estimators (obtained via Yule–Walker, least squares, or maximum likelihood) for a k-dimensional AR(p) model. Then,

  √n (φ̂ − φ) ∼ AN(0, Σ_w ⊗ Γ_pp^{-1}),                                   (5.89)

where Γ_pp = {Γ(i − j)}_{i,j=1}^{p} is a kp × kp matrix and Σ_w ⊗ Γ_pp^{-1} = {σ_{ij} Γ_pp^{-1}} is a k²p × k²p matrix, with σ_{ij} denoting the ij-th element of Σ_w.

The variance–covariance matrix of the estimator φ̂ is approximated by replacing Σ_w by Σ̂_w, and replacing Γ(h) by Γ̂(h), in Γ_pp. The square root of the diagonal elements of Σ̂_w ⊗ Γ̂_pp^{-1} divided by √n gives the individual standard errors. For the mortality data example, the estimated standard errors for the VAR(2) fit are listed in Example 5.11; although those standard errors were taken from a regression run, they could have also been calculated using Property 5.1.
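As a check on Property 5.1, the standard errors can be assembled by hand from the sample autocovariance matrices and the residual covariance matrix. The sketch below (an illustration only, not code from the text) does this for the VAR(2) fit above; element (i, j) of SE is the approximate standard error of the coefficient on the j-th stacked lag-one/lag-two regressor in the equation for series i. The values are close to, but not identical to, the regression standard errors in Example 5.11 because that fit also includes a constant and a trend.
# sketch: approximate standard errors for a VAR(2) via Property 5.1
x = cbind(cmort, tempr, part); x = sweep(x, 2, colMeans(x))
n = nrow(x); k = ncol(x)
G = function(h) crossprod(x[(h+1):n,], x[1:(n-h),])/n        # Gamma-hat(h)
Gpp = rbind(cbind(G(0), G(1)), cbind(t(G(1)), G(0)))         # Gamma-hat_pp, p = 2
SigW = cov(resid(fit))                                       # Sigma-hat_w from the fit above
SE = sqrt(outer(diag(SigW), diag(solve(Gpp)))/n)
round(SE, 3)   # rows: equations; columns: the 2k lagged regressors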

A k × 1 vector-valued time series x_t, for t = 0, ±1, ±2, ..., is said to be VARMA(p, q) if x_t is stationary and

  x_t = α + Φ_1 x_{t−1} + ··· + Φ_p x_{t−p} + w_t + Θ_1 w_{t−1} + ··· + Θ_q w_{t−q},    (5.90)

with Φ_p ≠ 0, Θ_q ≠ 0, and Σ_w > 0 (that is, Σ_w is positive definite). The coefficient matrices Φ_j; j = 1, ..., p and Θ_j; j = 1, ..., q are, of course, k × k matrices. If x_t has mean μ then α = (I − Φ_1 − ··· − Φ_p)μ. As in the univariate case, we will have to place a number of conditions on the multivariate ARMA model to ensure the model is unique and has desirable properties such as causality. These conditions will be discussed shortly.

As in the VAR model, the special form assumed for the constant component can be generalized to include a fixed r × 1 vector of inputs, u_t. That is, we could have proposed the vector ARMAX model,

  x_t = Γ u_t + Σ_{j=1}^{p} Φ_j x_{t−j} + Σ_{k=1}^{q} Θ_k w_{t−k} + w_t,    (5.91)

where Γ is a k × r parameter matrix.

While extending univariate AR (or pure MA) models to the vector case is fairly easy, extending univariate ARMA models to the multivariate case is not a simple matter. Our discussion will be brief, but interested readers can get more details in Lütkepohl (1993), Reinsel (1997), and Tiao and Tsay (1989).

In the multivariate case, the autoregressive operator is

  Φ(B) = I − Φ_1 B − ··· − Φ_p B^p,                                        (5.92)

and the moving average operator is

  Θ(B) = I + Θ_1 B + ··· + Θ_q B^q.                                        (5.93)

The zero-mean VARMA(p, q) model is then written in the concise form as

  Φ(B) x_t = Θ(B) w_t.                                                     (5.94)

The model is said to be causal if the roots of |Φ(z)| (where |·| denotes determinant) are outside the unit circle, |z| > 1; that is, |Φ(z)| ≠ 0 for any value z such that |z| ≤ 1. In this case, we can write

  x_t = Ψ(B) w_t,

where Ψ(B) = Σ_{j=0}^{∞} Ψ_j B^j, Ψ_0 = I, and Σ_{j=0}^{∞} ||Ψ_j|| < ∞. The model is said to be invertible if the roots of |Θ(z)| lie outside the unit circle. Then, we can write

  Π(B) x_t = w_t,

where Π(B) = Σ_{j=0}^{∞} Π_j B^j, Π_0 = I, and Σ_{j=0}^{∞} ||Π_j|| < ∞. Analogous to the univariate case, we can determine the matrices Ψ_j by solving Ψ(z) = Φ(z)^{-1} Θ(z), |z| ≤ 1, and the matrices Π_j by solving Π(z) = Θ(z)^{-1} Φ(z), |z| ≤ 1.

For a causal model, we can write x_t = Ψ(B) w_t, so the general autocovariance structure of an ARMA(p, q) model is (h ≥ 0)

  Γ(h) = cov(x_{t+h}, x_t) = Σ_{j=0}^{∞} Ψ_{j+h} Σ_w Ψ_j',                 (5.95)

and Γ(−h) = Γ(h)'. For pure MA(q) processes, (5.95) becomes

  Γ(h) = Σ_{j=0}^{q−h} Θ_{j+h} Σ_w Θ_j',                                   (5.96)

where Θ_0 = I. Of course, (5.96) implies Γ(h) = 0 for h > q.
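Numerically, the Ψ_j can be obtained from the recursion Ψ_j = Θ_j + Σ_{i=1}^{min(j,p)} Φ_i Ψ_{j−i}, with Ψ_0 = I and Θ_j = 0 for j > q, the matrix analogue of the univariate ψ-weight recursion. The following short function is a sketch (not from the text); the coefficient matrices at the bottom are hypothetical values used only for illustration.
# sketch: Psi-weight matrices of a causal VARMA via the matrix recursion
vpsi = function(Phi, Theta, J){        # Phi, Theta: lists of k x k matrices
  k = nrow(Phi[[1]]); p = length(Phi); q = length(Theta)
  Psi = vector("list", J+1); Psi[[1]] = diag(k)          # Psi_0 = I
  for (j in 1:J){
    S = if (j <= q) Theta[[j]] else matrix(0, k, k)
    for (i in 1:min(j, p)) S = S + Phi[[i]] %*% Psi[[j-i+1]]
    Psi[[j+1]] = S
  }
  Psi
}
Phi1   = matrix(c(.5, .1, 0, .3), 2)   # hypothetical k = 2 example
Theta1 = matrix(c(.2, 0, .1, .4), 2)
vpsi(list(Phi1), list(Theta1), J=4)    # returns Psi_0, ..., Psi_4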

As in the univariate case, we will need conditions for model uniqueness. These conditions are similar to the condition in the univariate case that the autoregressive and moving average polynomials have no common factors. To explore the uniqueness problems that we encounter with multivariate ARMA models, consider a bivariate AR(1) process, x_t = (x_{t,1}, x_{t,2})', given by

  x_{t,1} = φ x_{t−1,2} + w_{t,1},
  x_{t,2} = w_{t,2},

where w_{t,1} and w_{t,2} are independent white noise processes and |φ| < 1. Both processes, x_{t,1} and x_{t,2}, are causal and invertible. Moreover, the processes are jointly stationary because cov(x_{t+h,1}, x_{t,2}) = φ cov(x_{t+h−1,2}, x_{t,2}) ≡ φ γ_{2,2}(h − 1) = φ σ²_{w_2} δ_1^h does not depend on t; note, δ_1^h = 1 when h = 1, otherwise, δ_1^h = 0. In matrix notation, we can write this model as

  x_t = Φ x_{t−1} + w_t,   where   Φ = [ 0  φ ; 0  0 ].                    (5.97)

We can write (5.97) in operator notation as Φ(B) x_t = w_t, where

  Φ(z) = [ 1  −φz ; 0  1 ].

In addition, model (5.97) can be written as a bivariate ARMA(1,1) model

  x_t = Φ_1 x_{t−1} + Θ_1 w_{t−1} + w_t,                                   (5.98)

where

  Φ_1 = [ 0  φ+θ ; 0  0 ]   and   Θ_1 = [ 0  −θ ; 0  0 ],

and θ is arbitrary. To verify this, we write (5.98) as Φ_1(B) x_t = Θ_1(B) w_t, or

  Θ_1(B)^{-1} Φ_1(B) x_t = w_t,

where

  Φ_1(z) = [ 1  −(φ+θ)z ; 0  1 ]   and   Θ_1(z) = [ 1  −θz ; 0  1 ].

Then,

  Θ_1(z)^{-1} Φ_1(z) = [ 1  θz ; 0  1 ] [ 1  −(φ+θ)z ; 0  1 ] = [ 1  −φz ; 0  1 ] = Φ(z),

where Φ(z) is the polynomial associated with the bivariate AR(1) model in (5.97). Because θ is arbitrary, the parameters of the ARMA(1,1) model given in (5.98) are not identifiable. No problem exists, however, in fitting the AR(1) model given in (5.97).

The problem in the previous discussion was caused by the fact that both Θ(B) and Θ(B)^{-1} are finite; such a matrix operator is called unimodular. If U(B) is unimodular, |U(z)| is constant. It is also possible for two seemingly different multivariate ARMA(p, q) models, say, Φ(B) x_t = Θ(B) w_t and Φ_*(B) x_t = Θ_*(B) w_t, to be related through a unimodular operator, U(B), as Φ_*(B) = U(B) Φ(B) and Θ_*(B) = U(B) Θ(B), in such a way that the orders of Φ(B) and Θ(B) are the same as the orders of Φ_*(B) and Θ_*(B), respectively. For example, consider the bivariate ARMA(1,1) models given by

  Φ(B) x_t ≡ [ 1  −φB ; 0  1 ] x_t = [ 1  θB ; 0  1 ] w_t ≡ Θ(B) w_t

and

  Φ_*(B) x_t ≡ [ 1  (α−φ)B ; 0  1 ] x_t = [ 1  (α+θ)B ; 0  1 ] w_t ≡ Θ_*(B) w_t,

where φ, θ, and α are arbitrary constants. Note,

  Φ_*(B) ≡ [ 1  (α−φ)B ; 0  1 ] = [ 1  αB ; 0  1 ] [ 1  −φB ; 0  1 ] ≡ U(B) Φ(B)

and

  Θ_*(B) ≡ [ 1  (α+θ)B ; 0  1 ] = [ 1  αB ; 0  1 ] [ 1  θB ; 0  1 ] ≡ U(B) Θ(B).

In this case, both models have the same infinite MA representation x_t = Ψ(B) w_t, where

  Ψ(B) = Φ(B)^{-1} Θ(B) = Φ(B)^{-1} U(B)^{-1} U(B) Θ(B) = Φ_*(B)^{-1} Θ_*(B).

This result implies the two models have the same autocovariance function Γ(h). Two such ARMA(p, q) models are said to be observationally equivalent.

As previously mentioned, in addition to requiring causality and invertibility, we will need some additional assumptions in the multivariate case to make sure that the parameters of the multivariate model are identifiable. To ensure the identifiability of the ARMA(p, q) model, we need the following two additional conditions: (i) the matrix operators Φ(B) and Θ(B) have no common left factors other than unimodular ones [that is, if Φ(B) = U(B) Φ_*(B) and Θ(B) = U(B) Θ_*(B), the common factor U(B) must be unimodular] and (ii) with q as small as possible and p as small as possible for that q, the matrix [Φ_p, Θ_q] must be full rank, k. One suggestion for avoiding most of the aforementioned problems is to fit only vector AR(p) models in multivariate situations. Although this suggestion might be reasonable for many situations, this philosophy is not in accordance with the law of parsimony because we might have to fit a large number of parameters to describe the dynamics of a process.

Asymptotic inference for the general case of vector ARMA models is more complicated than for pure AR models; details can be found in Reinsel (1997) or Lütkepohl (1993), for example. We also note that estimation for VARMA models can be recast into the problem of estimation for state-space models, which will be discussed in Chapter 6.

Example 5.12 The Spliid Algorithm for Fitting Vector ARMA
A simple algorithm for fitting vector ARMA models from Spliid (1983) is worth mentioning because it repeatedly uses the multivariate regression equations. Consider a general ARMA(p, q) model for a time series with a nonzero mean,

  x_t = α + Φ_1 x_{t−1} + ··· + Φ_p x_{t−p} + w_t + Θ_1 w_{t−1} + ··· + Θ_q w_{t−q}.    (5.99)

If μ = E x_t, then α = (I − Φ_1 − ··· − Φ_p) μ. If w_{t−1}, ..., w_{t−q} were observed, we could rearrange (5.99) as a multivariate regression model,

  x_t = B z_t + w_t,                                                       (5.100)

Fig. 5.15. Predictions (line) from a VARMA(2,1) fit to the LA mortality (points) data using Spliid's algorithm.

with

  z_t = (1, x_{t−1}', ..., x_{t−p}', w_{t−1}', ..., w_{t−q}')'             (5.101)

and

  B = [α, Φ_1, ..., Φ_p, Θ_1, ..., Θ_q],                                   (5.102)

for t = p + 1, ..., n. Given an initial estimator B_0 of B, we can reconstruct {w_{t−1}, ..., w_{t−q}} by setting

  w_{t−j} = x_{t−j} − B_0 z_{t−j},   j = 1, ..., q,   t = p + 1, ..., n,   (5.103)

where, if q > p, we put w_{t−j} = 0 for t − j ≤ 0. The new values of {w_{t−1}, ..., w_{t−q}} are then put into the regressors z_t and a new estimate, say, B_1, is obtained. The initial value, B_0, can be computed by fitting a pure autoregression of order p or higher, and taking Θ_1 = ··· = Θ_q = 0. The procedure is then iterated until the parameter estimates stabilize. The algorithm often converges, but not to the maximum likelihood estimators. Experience suggests the estimators can be reasonably close to the maximum likelihood estimators. The algorithm can be considered as a quick and easy way to fit an initial VARMA model as a starting point to using maximum likelihood estimation, which is best done via state-space models covered in the next chapter.
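Before turning to the marima fit below, a bare-bones version of the algorithm may help fix ideas. The following sketch (an illustration only, not the text's code) handles a demeaned VARMA(1, 1): it computes B_0 from a pure AR(1) regression, then alternates between rebuilding the lagged-noise regressors via (5.103) and re-estimating the coefficients by multivariate least squares as in (5.100); the constant α is omitted and the unknown initial noise value is set to zero.
# sketch: a few Spliid iterations for a demeaned VARMA(1,1); x is an n x k matrix
spliid11 = function(x, niter = 10){
  n = nrow(x); k = ncol(x)
  a0 = lsfit(x[-n,], x[-1,], intercept=FALSE)$coef       # B_0: pure AR(1), Theta_1 = 0
  w  = rbind(0, x[-1,] - x[-n,] %*% a0)                  # initial residuals (w_1 set to 0)
  for (m in 1:niter){
    Z  = cbind(rbind(0, x[-n,]), rbind(0, w[-n,]))       # z_t = (x_{t-1}', w_{t-1}')'
    cf = lsfit(Z[-1,], x[-1,], intercept=FALSE)$coef     # regress x_t on z_t, per (5.100)
    w  = rbind(0, x[-1,] - Z[-1,] %*% cf)                # rebuild the noise via (5.103)
  }
  t(cf)                                                  # rows are equations: [Phi1-hat, Theta1-hat]
}
Iterating until the estimates stabilize, rather than for a fixed number of passes, and adding the constant term would bring the sketch closer to the algorithm as described above; marima, used next, implements the general VARMA(p, q) case.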

We used the R package marima to fit a vector ARMA(2, 1) to the mortality–pollution data set, and part of the output is displayed. We note that mortality is detrended prior to the analysis. The one-step-ahead predictions for mortality are displayed in Figure 5.15.
library(marima)
model = define.model(kvar=3, ar=c(1,2), ma=c(1))
arp = model$ar.pattern; map = model$ma.pattern
cmort.d = resid(detr <- lm(cmort~ time(cmort), na.action=NULL))
xdata = matrix(cbind(cmort.d, tempr, part), ncol=3)  # strip ts attributes
fit = marima(xdata, ar.pattern=arp, ma.pattern=map, means=c(0,1,1), penalty=1)
# resid analysis (not displayed)
innov = t(resid(fit)); plot.ts(innov); acf(innov, na.action=na.pass)
# fitted values for cmort
pred = ts(t(fitted(fit))[,1], start=start(cmort), freq=frequency(cmort)) +
         detr$coef[1] + detr$coef[2]*time(cmort)
plot(pred, ylab="Cardiovascular Mortality", lwd=2, col=4); points(cmort)
# print estimates and corresponding t^2-statistic
short.form(fit$ar.estimates, leading=FALSE)
short.form(fit$ar.fvalues,   leading=FALSE)
short.form(fit$ma.estimates, leading=FALSE)
short.form(fit$ma.fvalues,   leading=FALSE)
fit$resid.cov  # estimate of noise cov matrix

        parameter estimate         t^2 statistic
 AR1:  -0.311  0.000 -0.114     51.21   0.0    7.9
        0.000 -0.656  0.048      0.00  41.7    3.1
       -0.109  0.000 -0.861      1.57   0.0  113.3
 AR2:  -0.333  0.133 -0.047     67.24  11.89   2.52
        0.000 -0.200  0.055      0.00   8.10   2.90
        0.179 -0.102 -0.151      4.86   1.77   6.48
 MA1:   0.000 -0.187 -0.106      0.00  14.51   4.75
       -0.114 -0.446  0.000      4.68  16.38   0.00
        0.000 -0.278 -0.673      0.00   8.08  47.56
 resid.cov:   27.3   6.5   13.8
               6.5  36.2   38.1
              13.8  38.1  109.2

Problems

Section 5.1

5.1 The data set arf is 1000 simulated observations from an ARFIMA(1, d, 0) model with d = .4 and φ = .75.
(a) Plot the data and comment.
(b) Plot the ACF and PACF of the data and comment.
(c) Estimate the parameters and test for the significance of the estimates φ̂ and d̂.
(d) Explain why, using the results of parts (a) and (b), it would seem reasonable to difference the data prior to the analysis. That is, if x_t represents the data, explain why we might choose to fit an ARMA model to ∇x_t.
(e) Plot the ACF and PACF of ∇x_t and comment.
(f) Fit an ARMA model to ∇x_t and comment.

5.2 Compute the sample ACF of the absolute values of the NYSE returns displayed in Figure 1.4 up to lag 200, and comment on whether the ACF indicates long memory. Fit an ARFIMA model to the absolute values and comment.

Section 5.2

5.3 Plot the global temperature series, globtemp, and then test whether there is a unit root versus the alternative that the process is stationary using the three tests, DF, ADF, and PP, discussed in Example 5.3. Comment.

5.4 Plot the GNP series, gnp, and then test for a unit root against the alternative that the process is explosive. State your conclusion.

5.5 Verify (5.33).

Section 5.3

5.6 Weekly crude oil spot prices in dollars per barrel are in oil; see Problem 2.10 and Appendix R for more details. Investigate whether the growth rate of the weekly oil price exhibits GARCH behavior. If so, fit an appropriate model to the growth rate.

5.7 The stats package of R contains the daily closing prices of four major European stock indices; type help(EuStockMarkets) for details. Fit a GARCH model to the returns of one of these series and discuss your findings. (Note: The data set contains actual values, and not returns. Hence, the data must be transformed prior to the model fitting.)

5.8 The 2 × 1 gradient vector, l^{(1)}(α_0, α_1), given for an ARCH(1) model was displayed in (5.47). Verify (5.47) and then use the result to calculate the 2 × 2 Hessian matrix

  l^{(2)}(α_0, α_1) = [ ∂²l/∂α_0²      ∂²l/∂α_0 ∂α_1 ;
                        ∂²l/∂α_0 ∂α_1  ∂²l/∂α_1²      ].

Section 5.4

5.9 The sunspot data (sunspotz) are plotted in Chapter 4, Figure 4.22. From a time plot of the data, discuss why it is reasonable to fit a threshold model to the data, and then fit a threshold model.

Section 5.5

5.10 The data in climhyd have 454 months of measured values for the climatic variables air temperature, dew point, cloud cover, wind speed, precipitation (p_t), and inflow (i_t), at Lake Shasta; the data are displayed in Figure 7.3. We would like to look at possible relations between the weather factors and the inflow to Lake Shasta.
(a) Fit ARIMA(0, 0, 0) × (0, 1, 1)_{12} models to (i) transformed precipitation P_t = √p_t and (ii) transformed inflow I_t = log i_t.

(b) Apply the ARIMA model fitted in part (a) for transformed precipitation to the flow series to generate the prewhitened flow residuals assuming the precipitation model. Compute the cross-correlation between the flow residuals using the precipitation ARIMA model and the precipitation residuals using the precipitation model and interpret. Use the coefficients from the ARIMA model to construct the transformed flow residuals.

5.11 For the climhyd data set, consider predicting the transformed flows I_t = log i_t from transformed precipitation values P_t = √p_t using a transfer function model of the form

  (1 − B^{12}) I_t = α(B)(1 − B^{12}) P_t + n_t,

where we assume that seasonal differencing is a reasonable thing to do. You may think of it as fitting

  y_t = α(B) x_t + n_t,

where y_t and x_t are the seasonally differenced transformed flows and precipitations.
(a) Argue that x_t can be fitted by a first-order seasonal moving average, and use the transformation obtained to prewhiten the series x_t.
(b) Apply the transformation applied in (a) to the series y_t, and compute the cross-correlation function relating the prewhitened series to the transformed series. Argue for a transfer function of the form

  α(B) = δ_0 / (1 − ω_1 B).

(c) Write the overall model obtained in regression form to estimate δ_0 and ω_1. Note that you will be minimizing the sums of squared residuals for the transformed noise series (1 − ω̂_1 B) n_t. Retain the residuals for further modeling involving the noise n_t. The observed residual is u_t = (1 − ω̂_1 B) n_t.
(d) Fit the noise residuals obtained in (c) with an ARMA model, and give the final form suggested by your analysis in the previous parts.
(e) Discuss the problem of forecasting y_{t+m} using the infinite past of y_t and the present and infinite past of x_t. Determine the predicted value and the forecast variance.

Section 5.6

5.12 Consider the data set econ5 containing quarterly U.S. unemployment, GNP, consumption, and government and private investment from 1948-III to 1988-II. The seasonal component has been removed from the data. Concentrating on unemployment (U_t), GNP (G_t), and consumption (C_t), fit a vector ARMA model to the data after first logging each series, and then removing the linear trend. That is, fit a vector ARMA model to x_t = (x_{1t}, x_{2t}, x_{3t})', where, for example, x_{1t} = log(U_t) − β̂_0 − β̂_1 t, where β̂_0 and β̂_1 are the least squares estimates for the regression of log(U_t) on time, t.

Run a complete set of diagnostics on the residuals.

Chapter 6
State Space Models

A very general model that subsumes a whole class of special cases of interest in much the same way that linear regression does is the state-space model or the dynamic linear model, which was introduced in Kalman (1960) and Kalman and Bucy (1961). The model arose in the space tracking setting, where the state equation defines the motion equations for the position or state of a spacecraft with location x_t, and the data y_t reflect information that can be observed from a tracking device such as velocity and azimuth. Although introduced as a method primarily for use in aerospace-related research, the model has been applied to modeling data from economics (Harrison and Stevens, 1976; Harvey and Pierse, 1984; Harvey and Todd, 1983; Kitagawa and Gersch, 1984; Shumway and Stoffer, 1982), medicine (Jones, 1984) and the soil sciences (Shumway, 1988, §3.4.5). An excellent treatment of time series analysis based on the state space model is the text by Durbin and Koopman (2001). A modern treatment of nonlinear state space models can be found in Douc, Moulines and Stoffer (2014).

In this chapter, we focus primarily on linear Gaussian state space models. We present various forms of the model, introduce the concepts of prediction, filtering, and smoothing for state space models, and include their derivations. We explain how to perform maximum likelihood estimation using various techniques, and include methods for handling missing data. In addition, we present several special topics such as hidden Markov models (HMM), switching autoregressions, smoothing splines, ARMAX models, bootstrapping, stochastic volatility, and state space models with switching. Finally, we discuss a Bayesian approach to fitting state space models using Markov chain Monte Carlo (MCMC) techniques. The essential material is supplied in Sections 6.1, 6.2, and 6.3. After that, the other sections may be read in any order with some occasional backtracking.

In general, the state space model is characterized by two principles. First, there is a hidden or latent process x_t called the state process. The state process is assumed to be a Markov process; this means that the future {x_s; s > t} and the past {x_s; s < t} are independent conditional on the present, x_t. The second condition is that the observations, y_t, are independent given the states x_t. This means that the dependence among the observations is generated by the states. The principles are displayed in Figure 6.1.

Fig. 6.1. Diagram of a state space model.

6.1 Linear Gaussian Model

The linear Gaussian state space model or dynamic linear model (DLM), in its basic form, employs an order one, p-dimensional vector autoregression as the state equation,

  x_t = Φ x_{t−1} + w_t.                                                   (6.1)

The w_t are p × 1 independent and identically distributed, zero-mean normal vectors with covariance matrix Q; we write this as w_t ∼ iid N_p(0, Q). In the DLM, we assume the process starts with a normal vector x_0, such that x_0 ∼ N_p(μ_0, Σ_0).

We do not observe the state vector x_t directly, but only a linear transformed version of it with noise added, say

  y_t = A_t x_t + v_t,                                                     (6.2)

where A_t is a q × p measurement or observation matrix; (6.2) is called the observation equation. The observed data vector, y_t, is q-dimensional, which can be larger than or smaller than p, the state dimension. The additive observation noise is v_t ∼ iid N_q(0, R). In addition, we initially assume, for simplicity, that x_0, {w_t}, and {v_t} are uncorrelated; this assumption is not necessary, but it helps in the explanation of first concepts. The case of correlated errors is discussed in Section 6.6.

As in the ARMAX model of Section 5.6, exogenous variables, or fixed inputs, may enter into the states or into the observations. In this case, we suppose we have an r × 1 vector of inputs u_t, and write the model as

  x_t = Φ x_{t−1} + Υ u_t + w_t                                            (6.3)
  y_t = A_t x_t + Γ u_t + v_t                                              (6.4)

where Υ is p × r and Γ is q × r; either of these matrices may be the zero matrix.
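Simulating from (6.3)–(6.4) only requires iterating the two equations, which can be a useful way to get a feel for the model. The following sketch (not from the text) generates a bivariate observation series driven by a univariate state, loosely in the spirit of the global temperature example of Example 6.2 below; all parameter values are hypothetical.
# sketch: simulate a DLM with p = 1 state, q = 2 observations, and u_t = 1
set.seed(90210)
n = 100; Phi = 1; Ups = .02; A = matrix(1, 2, 1)   # hypothetical values
cQ = .1; cR = diag(c(.3, .4))                      # Q = cQ^2, R = cR %*% t(cR)
x = numeric(n+1); x[1] = rnorm(1)                  # x_0
y = matrix(0, n, 2)
for (t in 1:n){
  x[t+1] = Phi*x[t] + Ups + cQ*rnorm(1)            # state equation (6.3)
  y[t,]  = A %*% x[t+1] + cR %*% rnorm(2)          # observation equation (6.4), Gamma = 0
}
ts.plot(ts(y), col=c(4,6)); lines(x[-1], lwd=2)    # data and the hidden state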

Fig. 6.2. Longitudinal series of monitored blood parameters, log(white blood count) [WBC], log(platelet) [PLT], and hematocrit [HCT], after a bone marrow transplant (n = 91 days).

Example 6.1 A Biomedical Example
Suppose we consider the problem of monitoring the level of several biomedical markers after a cancer patient undergoes a bone marrow transplant. The data in Figure 6.2, used by Jones (1984), are measurements made for 91 days on three variables, log(white blood count) [WBC], log(platelet) [PLT], and hematocrit [HCT], denoted y_{t1}, y_{t2}, and y_{t3}, with y_t = (y_{t1}, y_{t2}, y_{t3})'. Approximately 40% of the values are missing, with missing values occurring primarily after the 35th day. The main objectives are to model the three variables using the state-space approach, and to estimate the missing values. According to Jones, "Platelet count at about 100 days post transplant has previously been shown to be a good indicator of subsequent long term survival."
For this particular situation, we model the three variables in terms of the state equation (6.1); that is,

  ( x_{t1} )   ( φ_11  φ_12  φ_13 ) ( x_{t−1,1} )   ( w_{t1} )
  ( x_{t2} ) = ( φ_21  φ_22  φ_23 ) ( x_{t−1,2} ) + ( w_{t2} ).            (6.5)
  ( x_{t3} )   ( φ_31  φ_32  φ_33 ) ( x_{t−1,3} )   ( w_{t3} )

The observation equations would be y_t = A_t x_t + v_t, where the 3 × 3 observation matrix, A_t, is either the identity matrix or the zero matrix depending on whether a blood sample was taken on that day. The covariance matrices R and Q are each 3 × 3 matrices. A plot similar to Figure 6.2 can be produced as follows.
plot(blood, type='o', pch=19, xlab='day', main='')

As we progress through the chapter, it will become apparent that, while the model seems simplistic, it is quite general. For example, if the state process is VAR(2), we may write the state equation as a 2p-dimensional process,

  ( x_t     )   ( Φ_1  Φ_2 ) ( x_{t−1} )   ( w_t )
  ( x_{t−1} ) = ( I    0   ) ( x_{t−2} ) + ( 0   ),                        (6.6)

where the stacked state vectors are 2p × 1 and the transition matrix is 2p × 2p,

Fig. 6.3. Annual global temperature deviation series, measured in degrees centigrade, 1880–2015. The series differ by whether or not ocean data is included.

and the observation equation as the q-dimensional process,

  y_t = [ A_t  0 ] ( x_t ; x_{t−1} ) + v_t,                                (6.7)

where the matrix [A_t 0] is q × 2p, the stacked state vector is 2p × 1, and y_t and v_t are q × 1.

The real advantages of the state space formulation, however, do not really come through in the simple example given above. The special forms that can be developed for various versions of the matrix A_t and for the transition scheme defined by the matrix Φ allow fitting more parsimonious structures with fewer parameters needed to describe a multivariate time series. We will see numerous examples throughout the chapter; Section 6.5 on structural models is a good example of the model flexibility. The simple example shown below is instructive.

Example 6.2 Global Warming
Figure 6.3 shows two different estimators for the global temperature series from 1880 to 2015. One is globtemp, which was considered in the first chapter; it is the global mean land–ocean temperature index. The second series, globtempl, is the surface air temperature index using only meteorological station data. Conceptually, both series should be measuring the same underlying climatic signal, and we may consider the problem of extracting this underlying signal. The R code to generate the figure is
ts.plot(globtemp, globtempl, col=c(6,4), ylab='Temperature Deviations')
We suppose both series are observing the same signal with different noises; that is,

  y_{t1} = x_t + v_{t1}   and   y_{t2} = x_t + v_{t2},

or more compactly as

  ( y_{t1} )   ( 1 )       ( v_{t1} )
  ( y_{t2} ) = ( 1 ) x_t + ( v_{t2} ),                                     (6.8)

where

  R = var( v_{t1} ; v_{t2} ) = ( r_11  r_12 ; r_21  r_22 ).

It is reasonable to suppose that the unknown common signal, x_t, can be modeled as a random walk with drift of the form

  x_t = δ + x_{t−1} + w_t,                                                 (6.9)

with Q = var(w_t). In terms of the model (6.3)–(6.4), this example has p = 1, q = 2, Φ = 1, and Υ = δ with u_t ≡ 1.

The introduction of the state-space approach as a tool for modeling data in the social and biological sciences requires model identification and parameter estimation because there is rarely a well-defined differential equation describing the state transition. The questions of general interest for the dynamic linear model (6.3) and (6.4) relate to estimating the unknown parameters contained in Φ, Υ, Q, Γ, A_t, and R, that define the particular model, and estimating or forecasting values of the underlying unobserved process x_t. The advantages of the state-space formulation are in the ease with which we can treat various missing data configurations and in the incredible array of models that can be generated from (6.3) and (6.4). The analogy between the observation matrix A_t and the design matrix in the usual regression and analysis of variance setting is a useful one. We can generate fixed and random effect structures that are either constant or vary over time simply by making appropriate choices for the matrix A_t and the transition structure Φ.

Before continuing our investigation of the general model, it is instructive to consider a simple univariate state-space model wherein an AR(1) process is observed using a noisy instrument.

Example 6.3 An AR(1) Process with Observational Noise
Consider a univariate state-space model where the observations are noisy,

  y_t = x_t + v_t,                                                         (6.10)

and the signal (state) is an AR(1) process,

  x_t = φ x_{t−1} + w_t,                                                   (6.11)

for t = 1, 2, ..., where v_t ∼ iid N(0, σ_v²), w_t ∼ iid N(0, σ_w²), and x_0 ∼ N(0, σ_w²/(1 − φ²)); {v_t}, {w_t}, and x_0 are independent.
In Chapter 3, we investigated the properties of the state, x_t, because it is a stationary AR(1) process (recall Problem 3.2). For example, we know the autocovariance function of x_t is

  γ_x(h) = σ_w² φ^h / (1 − φ²),   h = 0, 1, 2, ....                        (6.12)

But here, we must investigate how the addition of observation noise affects the dynamics. Although it is not a necessary assumption, we have assumed in this example that x_t is stationary.

In this case, the observations are also stationary because y_t is the sum of two independent stationary components, x_t and v_t. We have

  γ_y(0) = var(y_t) = var(x_t + v_t) = σ_w²/(1 − φ²) + σ_v²,               (6.13)

and, when h ≥ 1,

  γ_y(h) = cov(y_t, y_{t−h}) = cov(x_t + v_t, x_{t−h} + v_{t−h}) = γ_x(h).    (6.14)

Consequently, for h ≥ 1, the ACF of the observations is

  ρ_y(h) = γ_y(h)/γ_y(0) = [ 1 + (σ_v²/σ_w²)(1 − φ²) ]^{-1} φ^h.           (6.15)

It should be clear from the correlation structure given by (6.15) that the observations, y_t, are not AR(1) unless σ_v² = 0. In addition, the autocorrelation structure of y_t is identical to the autocorrelation structure of an ARMA(1,1) process, as presented in Example 3.14. Thus, the observations can also be written in an ARMA(1,1) form,

  y_t = φ y_{t−1} + θ u_{t−1} + u_t,

where u_t is Gaussian white noise with variance σ_u², and with θ and σ_u² suitably chosen. We leave the specifics of this problem alone for now and defer the discussion to Section 6.6; in particular, see Example 6.11.

Although an equivalence exists between stationary ARMA models and stationary state-space models (see Section 6.6), it is sometimes easier to work with one form than another. As previously mentioned, in the case of missing data, complex multivariate systems, mixed effects, and certain types of nonstationarity, it is easier to work in the framework of state-space models.
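The relation (6.15) is easy to check by simulation. The sketch below (not from the text) uses the hypothetical values φ = .8 and σ_w² = σ_v² = 1, and compares the theoretical ACF of the observations with the sample ACF of a long simulated series.
# sketch: check (6.15) by simulation, with phi = .8 and sigw = sigv = 1
set.seed(1); phi = .8; n = 10000
x = arima.sim(n=n, model=list(ar=phi), sd=1)     # state (6.11)
y = x + rnorm(n)                                 # observations (6.10)
h = 1:5
rho.theory = phi^h / (1 + (1/1)*(1 - phi^2))     # (6.15), sigv^2/sigw^2 = 1
rbind(theory=rho.theory, sample=acf(y, 5, plot=FALSE)$acf[-1])
The same check can be made against the ACF of the equivalent ARMA(1,1), once θ and σ_u² are matched as discussed in Example 6.11.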

6.2 Filtering, Smoothing, and Forecasting

From a practical view, a primary aim of any analysis involving the state space model, (6.3)–(6.4), would be to produce estimators for the underlying unobserved signal x_t, given the data y_{1:s} = {y_1, ..., y_s}, to time s. As will be seen, state estimation is an essential component of parameter estimation. When s < t, the problem is called forecasting or prediction. When s = t, the problem is called filtering, and when s > t, the problem is called smoothing. In addition to these estimates, we would also want to measure their precision. The solution to these problems is accomplished via the Kalman filter and smoother and is the focus of this section.
Throughout this chapter, we will use the following definitions:

  x_t^s = E(x_t | y_{1:s})                                                 (6.16)

and

  P_{t1,t2}^s = E{ (x_{t1} − x_{t1}^s)(x_{t2} − x_{t2}^s)' }.              (6.17)

When t1 = t2 (= t, say) in (6.17), we will write P_t^s for convenience.

In obtaining the filtering and smoothing equations, we will rely heavily on the Gaussian assumption. Some knowledge of the material covered in Appendix B will be helpful in understanding the details of this section (although these details may be skipped on a casual reading of the material). Even in the non-Gaussian case, the estimators we obtain are the minimum mean-squared error estimators within the class of linear estimators. That is, we can think of E in (6.16) as the projection operator in the sense of Section B.1 rather than expectation, and y_{1:s} as the space of linear combinations of {y_1, ..., y_s}; in this case, P_t^s is the corresponding mean-squared error. Since the processes are Gaussian, (6.17) is also the conditional error covariance; that is,

  P_{t1,t2}^s = E{ (x_{t1} − x_{t1}^s)(x_{t2} − x_{t2}^s)' | y_{1:s} }.

This fact can be seen, for example, by noting that the covariance matrix between (x_t − x_t^s) and y_{1:s}, for any t and s, is zero; we could say they are orthogonal in the sense of Section B.1. This result implies that (x_t − x_t^s) and y_{1:s} are independent (because of the normality), and hence, the conditional distribution of (x_t − x_t^s) given y_{1:s} is the unconditional distribution of (x_t − x_t^s). Derivations of the filtering and smoothing equations from a Bayesian perspective are given in Meinhold and Singpurwalla (1983); more traditional approaches based on the concept of projection and on multivariate normal distribution theory are given in Jazwinski (1970) and Anderson and Moore (1979).

First, we present the Kalman filter, which gives the filtering and forecasting equations. The name filter comes from the fact that x_t^t is a linear filter of the observations y_{1:t}; that is, x_t^t = Σ_{s=1}^{t} B_s y_s for suitably chosen p × q matrices B_s. The advantage of the Kalman filter is that it specifies how to update the filter from x_{t−1}^{t−1} to x_t^t once a new observation y_t is obtained, without having to reprocess the entire data set y_{1:t}.

Property 6.1 The Kalman Filter
For the state-space model specified in (6.3) and (6.4), with initial conditions x_0^0 = μ_0 and P_0^0 = Σ_0, for t = 1, ..., n,

  x_t^{t−1} = Φ x_{t−1}^{t−1} + Υ u_t,                                     (6.18)
  P_t^{t−1} = Φ P_{t−1}^{t−1} Φ' + Q,                                      (6.19)

with

  x_t^t = x_t^{t−1} + K_t (y_t − A_t x_t^{t−1} − Γ u_t),                   (6.20)
  P_t^t = [ I − K_t A_t ] P_t^{t−1},                                       (6.21)

where

  K_t = P_t^{t−1} A_t' [ A_t P_t^{t−1} A_t' + R ]^{-1}                     (6.22)

is called the Kalman gain. Prediction for t > n is accomplished via (6.18) and (6.19) with initial conditions x_n^n and P_n^n. Important byproducts of the filter are the innovations (prediction errors)

  ε_t = y_t − E(y_t | y_{1:t−1}) = y_t − A_t x_t^{t−1} − Γ u_t,            (6.23)

and the corresponding variance–covariance matrices

  Σ_t := var(ε_t) = var[ A_t (x_t − x_t^{t−1}) + v_t ] = A_t P_t^{t−1} A_t' + R    (6.24)

for t = 1, ..., n. We assume Σ_t > 0 (is positive definite), which is guaranteed, for example, if R > 0. This assumption is not necessary and may be relaxed.

Proof: The derivations of (6.18) and (6.19) follow from straightforward calculations, because from (6.3) we have

  x_t^{t−1} = E(x_t | y_{1:t−1}) = E(Φ x_{t−1} + Υ u_t + w_t | y_{1:t−1}) = Φ x_{t−1}^{t−1} + Υ u_t,

and thus

  P_t^{t−1} = E{ (x_t − x_t^{t−1})(x_t − x_t^{t−1})' }
            = E{ [Φ(x_{t−1} − x_{t−1}^{t−1}) + w_t][Φ(x_{t−1} − x_{t−1}^{t−1}) + w_t]' }
            = Φ P_{t−1}^{t−1} Φ' + Q.

To derive (6.20), we note that cov(ε_t, y_s) = 0 for s < t, which, in view of the fact that the innovation sequence is a Gaussian process, implies that the innovations are independent of the past observations. Furthermore, the conditional covariance between x_t and ε_t given y_{1:t−1} is

  cov(x_t, ε_t | y_{1:t−1}) = cov(x_t, y_t − A_t x_t^{t−1} − Γ u_t | y_{1:t−1})
                            = cov(x_t − x_t^{t−1}, y_t − A_t x_t^{t−1} − Γ u_t | y_{1:t−1})
                            = cov[ x_t − x_t^{t−1}, A_t(x_t − x_t^{t−1}) + v_t ]
                            = P_t^{t−1} A_t'.                              (6.25)

Using these results we have that the joint conditional distribution of x_t and ε_t given y_{1:t−1} is normal,

  ( x_t ; ε_t ) | y_{1:t−1} ∼ N( ( x_t^{t−1} ; 0 ), [ P_t^{t−1}  P_t^{t−1} A_t' ; A_t P_t^{t−1}  Σ_t ] ).    (6.26)

Thus, using (B.9) of Appendix B, we can write

  x_t^t = E(x_t | y_{1:t}) = E(x_t | y_{1:t−1}, ε_t) = x_t^{t−1} + K_t ε_t,    (6.27)

where

  K_t = P_t^{t−1} A_t' Σ_t^{-1} = P_t^{t−1} A_t' (A_t P_t^{t−1} A_t' + R)^{-1}.

The evaluation of P_t^t is easily computed from (6.26) [see (B.10)] as

  P_t^t = cov(x_t | y_{1:t−1}, ε_t) = P_t^{t−1} − P_t^{t−1} A_t' Σ_t^{-1} A_t P_t^{t−1},

which simplifies to (6.21). □
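The recursions (6.18)–(6.22) translate almost line for line into code. The following univariate sketch (an illustration only; the astsa scripts Kfilter0 and Kfilter1 used later in the chapter are the general, multivariate versions) runs the filter for a scalar model with no inputs; the function name kf1 and its argument layout are ours, not the package's.
# sketch: a bare-bones univariate Kalman filter (Property 6.1 with u_t = 0)
kf1 = function(y, mu0, Sigma0, Phi, A, Q, R){
  n = length(y)
  xp = Pp = xf = Pf = numeric(n)
  xff = mu0; Pff = Sigma0                       # x_0^0 and P_0^0
  for (t in 1:n){
    xp[t] = Phi*xff;  Pp[t] = Phi^2*Pff + Q     # (6.18), (6.19)
    K     = Pp[t]*A/(A^2*Pp[t] + R)             # (6.22)
    xf[t] = xp[t] + K*(y[t] - A*xp[t])          # (6.20)
    Pf[t] = (1 - K*A)*Pp[t]                     # (6.21)
    xff = xf[t]; Pff = Pf[t]
  }
  list(xp=xp, Pp=Pp, xf=xf, Pf=Pf)
}
# e.g., kf1(y, mu0=0, Sigma0=1, Phi=.8, A=1, Q=1, R=1) for data generated as in Example 6.3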

Nothing in the proof of Property 6.1 precludes the cases where some or all of the parameters vary with time, or where the observation dimension changes with time, which leads to the following corollary.

Corollary 6.1 Kalman Filter: The Time-Varying Case
If, in (6.3) and (6.4), any or all of the parameters are time dependent, Φ = Φ_t, Υ = Υ_t, Q = Q_t in the state equation, or Γ = Γ_t, R = R_t in the observation equation, or the dimension of the observational equation is time dependent, q = q_t, Property 6.1 holds with the appropriate substitutions.

Next, we explore the model, prediction, and filtering from a density point of view. To ease the notation, we will drop the inputs from the model. There are two key ingredients to the state space model. Letting p_Θ(·) denote a generic density function with parameters represented by Θ, we have that the state process is Markovian:

  p_Θ(x_t | x_{t−1}, x_{t−2}, ..., x_0) = p_Θ(x_t | x_{t−1}),              (6.28)

and the observations are conditionally independent given the states:

  p_Θ(y_{1:n} | x_{1:n}) = Π_{t=1}^{n} p_Θ(y_t | x_t).                     (6.29)

Since we are focusing on the linear Gaussian model, if we let g(x; μ, Σ) denote a multivariate normal density with mean μ and covariance matrix Σ as given in (1.33), then

  p_Θ(x_t | x_{t−1}) = g(x_t; Φ x_{t−1}, Q)   and   p_Θ(y_t | x_t) = g(y_t; A_t x_t, R),

with initial condition p_Θ(x_0) = g(x_0; μ_0, Σ_0).

In terms of densities, the Kalman filter can be seen as a simple updating scheme, where, to determine the forecast densities, we have

  p_Θ(x_t | y_{1:t−1}) = ∫_{R^p} p_Θ(x_t, x_{t−1} | y_{1:t−1}) dx_{t−1}
                       = ∫_{R^p} p_Θ(x_t | x_{t−1}) p_Θ(x_{t−1} | y_{1:t−1}) dx_{t−1}
                       = ∫_{R^p} g(x_t; Φ x_{t−1}, Q) g(x_{t−1}; x_{t−1}^{t−1}, P_{t−1}^{t−1}) dx_{t−1}
                       = g(x_t; x_t^{t−1}, P_t^{t−1}),                     (6.30)

where the values of x_t^{t−1} and P_t^{t−1} are given in (6.18) and (6.19). These values are obtained upon evaluating the integral using the usual trick of completing the square; see Example 6.4. Since we were seeking an iterative procedure, we introduced x_{t−1} in (6.30) because we have (presumably) previously evaluated the filter density p_Θ(x_{t−1} | y_{1:t−1}). Once we have the predictor, the filter density is obtained as

  p_Θ(x_t | y_{1:t}) = p_Θ(x_t | y_t, y_{1:t−1}) ∝ p_Θ(y_t | x_t) p_Θ(x_t | y_{1:t−1})
                     = g(y_t; A_t x_t, R) g(x_t; x_t^{t−1}, P_t^{t−1}),    (6.31)

from which we deduce that p_Θ(x_t | y_{1:t}) is g(x_t; x_t^t, P_t^t), where x_t^t and P_t^t are given in (6.20) and (6.21). The following example illustrates these ideas for a simple univariate case.

Example 6.4 Local Level Model
In this example, we suppose that we observe a univariate series y_t that consists of a trend component, μ_t, and a noise component, v_t, where

  y_t = μ_t + v_t                                                          (6.32)

and v_t ∼ iid N(0, σ_v²). In particular, we assume the trend is a random walk given by

  μ_t = μ_{t−1} + w_t,                                                     (6.33)

where w_t ∼ iid N(0, σ_w²) is independent of {v_t}. Recall Example 6.2, where we suggested this type of trend model for the global temperature series.
The model is, of course, a state-space model with (6.32) being the observation equation, and (6.33) being the state equation. We will use the following notation introduced in Blight (1974). Let

  {x; μ, σ²} = exp{ −(1/(2σ²))(x − μ)² };                                  (6.34)

then simple manipulation shows

  {x; μ, σ²} = {μ; x, σ²}                                                  (6.35)

and, by completing the square,

  {x; μ_1, σ_1²}{x; μ_2, σ_2²} = { x; (μ_1/σ_1² + μ_2/σ_2²)/(1/σ_1² + 1/σ_2²), (1/σ_1² + 1/σ_2²)^{-1} }
                                 × { μ_1; μ_2, σ_1² + σ_2² }.              (6.36)

Thus, using (6.30), (6.35) and (6.36) we have

  p_Θ(μ_t | y_{1:t−1}) ∝ ∫ {μ_t; μ_{t−1}, σ_w²}{μ_{t−1}; μ_{t−1}^{t−1}, P_{t−1}^{t−1}} dμ_{t−1}
                       = ∫ {μ_{t−1}; μ_t, σ_w²}{μ_{t−1}; μ_{t−1}^{t−1}, P_{t−1}^{t−1}} dμ_{t−1}
                       = { μ_t; μ_{t−1}^{t−1}, P_{t−1}^{t−1} + σ_w² }.     (6.37)

From (6.37) we conclude that

  μ_t | y_{1:t−1} ∼ N(μ_t^{t−1}, P_t^{t−1}),                               (6.38)

where

  μ_t^{t−1} = μ_{t−1}^{t−1}   and   P_t^{t−1} = P_{t−1}^{t−1} + σ_w²,      (6.39)

which agrees with the first part of Property 6.1. To derive the filter density using (6.31) and (6.35) we have

  p_Θ(μ_t | y_{1:t}) ∝ {y_t; μ_t, σ_v²}{μ_t; μ_t^{t−1}, P_t^{t−1}}
                     = {μ_t; y_t, σ_v²}{μ_t; μ_t^{t−1}, P_t^{t−1}}.        (6.40)

An application of (6.36) gives

  μ_t | y_{1:t} ∼ N(μ_t^t, P_t^t)                                          (6.41)

with

  μ_t^t = (σ_v² μ_t^{t−1} + P_t^{t−1} y_t)/(P_t^{t−1} + σ_v²) = μ_t^{t−1} + K_t (y_t − μ_t^{t−1}),    (6.42)

where we have defined

  K_t = P_t^{t−1} / (P_t^{t−1} + σ_v²),                                    (6.43)

and

  P_t^t = ( 1/P_t^{t−1} + 1/σ_v² )^{-1} = σ_v² P_t^{t−1}/(P_t^{t−1} + σ_v²) = (1 − K_t) P_t^{t−1}.    (6.44)

The filter for this specific case, of course, agrees with Property 6.1.

Next, we consider the problem of obtaining estimators for x_t based on the entire data sample y_1, ..., y_n, where t ≤ n, namely, x_t^n. These estimators are called smoothers because a time plot of the sequence {x_t^n; t = 1, ..., n} is typically smoother than the forecasts {x_t^{t−1}; t = 1, ..., n} or the filters {x_t^t; t = 1, ..., n}. As is obvious from the above remarks, smoothing implies that each estimated value is a function of the present, future, and past, whereas the filtered estimator depends on the present and past. The forecast depends only on the past, as usual.

Property 6.2 The Kalman Smoother
For the state-space model specified in (6.3) and (6.4), with initial conditions x_n^n and P_n^n obtained via Property 6.1, for t = n, n − 1, ..., 1,

  x_{t−1}^n = x_{t−1}^{t−1} + J_{t−1} ( x_t^n − x_t^{t−1} ),               (6.45)
  P_{t−1}^n = P_{t−1}^{t−1} + J_{t−1} ( P_t^n − P_t^{t−1} ) J_{t−1}',      (6.46)

where

  J_{t−1} = P_{t−1}^{t−1} Φ' [ P_t^{t−1} ]^{-1}.                           (6.47)

Proof: The smoother can be derived in many ways. Here we provide a proof that was given in Ansley and Kohn (1982). First, for 1 ≤ t ≤ n, define

  y_{1:t−1} = {y_1, ..., y_{t−1}}   and   η_t = {v_t, ..., v_n, w_{t+1}, ..., w_n},

with y_{1:0} being empty, and let

  m_{t−1} = E{ x_{t−1} | y_{1:t−1}, x_t − x_t^{t−1}, η_t }.

Then, because y_{1:t−1}, {x_t − x_t^{t−1}}, and η_t are mutually independent, and x_{t−1} and η_t are independent, using (B.9) we have

  m_{t−1} = x_{t−1}^{t−1} + J_{t−1} ( x_t − x_t^{t−1} ),                   (6.48)

where

  J_{t−1} = cov(x_{t−1}, x_t − x_t^{t−1}) [ P_t^{t−1} ]^{-1} = P_{t−1}^{t−1} Φ' [ P_t^{t−1} ]^{-1}.

Finally, because y_{1:t−1}, x_t − x_t^{t−1}, and η_t generate y_{1:n} = {y_1, ..., y_n},

  x_{t−1}^n = E{ x_{t−1} | y_{1:n} } = E{ m_{t−1} | y_{1:n} } = x_{t−1}^{t−1} + J_{t−1} ( x_t^n − x_t^{t−1} ),

which establishes (6.45).
The recursion for the error covariance, P_{t−1}^n, is obtained by straightforward calculation. Using (6.45) we obtain

  x_{t−1} − x_{t−1}^n = x_{t−1} − x_{t−1}^{t−1} − J_{t−1} ( x_t^n − Φ x_{t−1}^{t−1} ),

or

  ( x_{t−1} − x_{t−1}^n ) + J_{t−1} x_t^n = ( x_{t−1} − x_{t−1}^{t−1} ) + J_{t−1} Φ x_{t−1}^{t−1}.    (6.49)

Multiplying each side of (6.49) by the transpose of itself and taking expectation, we have

  P_{t−1}^n + J_{t−1} E( x_t^n x_t^{n′} ) J_{t−1}' = P_{t−1}^{t−1} + J_{t−1} Φ E( x_{t−1}^{t−1} x_{t−1}^{t−1 ′} ) Φ' J_{t−1}',    (6.50)

using the fact that the cross-product terms are zero. But,

  E( x_t x_t' ) = E( x_t^n x_t^{n′} ) + P_t^n = Φ E( x_{t−1} x_{t−1}' ) Φ' + Q,

and

  E( x_{t−1}^{t−1} x_{t−1}^{t−1 ′} ) = E( x_{t−1} x_{t−1}' ) − P_{t−1}^{t−1},

so (6.50) simplifies to (6.46). □

Example 6.5 Prediction, Filtering and Smoothing for the Local Level Model
For this example, we simulated n = 50 observations from the local level trend model discussed in Example 6.4. We generated a random walk

  μ_t = μ_{t−1} + w_t                                                      (6.51)

with w_t ∼ iid N(0, 1) and μ_0 ∼ N(0, 1). We then supposed that we observe a univariate series y_t consisting of the trend component, μ_t, and a noise component, v_t ∼ iid N(0, 1), where

  y_t = μ_t + v_t.                                                         (6.52)

The sequences {w_t}, {v_t} and μ_0 were generated independently. We then ran the Kalman filter and smoother, Property 6.1 and Property 6.2, using the actual parameters.

Fig. 6.4. Displays for Example 6.5. The simulated values of μ_t, for t = 1, ..., 50, given by (6.51), are shown as points. The top panel shows the predictions μ_t^{t−1} as a line with ±2√(P_t^{t−1}) error bounds as dashed lines. The middle panel is similar, showing the filters μ_t^t and ±2√(P_t^t). The bottom panel shows the smoothers μ_t^n and ±2√(P_t^n).

The top panel of Figure 6.4 shows the actual values of μ_t as points, and the predictions μ_t^{t−1}, for t = 1, 2, ..., 50, superimposed on the graph as a line. In addition, we display μ_t^{t−1} ± 2√(P_t^{t−1}) as dashed lines on the plot. The middle panel displays the filter, μ_t^t, for t = 1, ..., 50, as a line with μ_t^t ± 2√(P_t^t) as dashed lines. The bottom panel of Figure 6.4 shows a similar plot for the smoother μ_t^n.

Table 6.1 shows the first 10 observations as well as the corresponding state values, the predictions, filters and smoothers. Note that one-step-ahead prediction is more uncertain than the corresponding filtered value, which, in turn, is more uncertain than the corresponding smoother value (that is, P_t^{t−1} ≥ P_t^t ≥ P_t^n). Also, in each case, the error variances stabilize quickly.

The R code for this example is as follows. In the example we use Ksmooth0, which calls Kfilter0 for the filtering part. In the returned values from Ksmooth0, the letters p, f, s denote prediction, filter, and smooth, respectively (e.g., xp is x_t^{t−1}, xf is x_t^t, xs is x_t^n, and so on). These scripts use a Cholesky-type decomposition of Q and R; they are denoted by cQ and cR. Practically, the scripts only require that Q or R may be reconstructed as t(cQ)%*%(cQ) or t(cR)%*%(cR), respectively, which allows more flexibility. For example, the model (6.6)–(6.7) does not pose a problem even though the state noise covariance matrix is not positive definite.

[Footnote 6.1] Given a positive definite matrix A, its Cholesky decomposition is an upper triangular matrix U with strictly positive diagonal entries such that A = U'U. For the univariate case, it is simply the positive square root of A. In R, use chol(A).

Table 6.1. First 10 Observations of Example 6.5

  t    y_t    μ_t   μ_t^{t−1}  P_t^{t−1}  μ_t^t  P_t^t  μ_t^n  P_t^n
  0      —   −.63       —         —        .00   1.00   −.32    .62
  1  −1.05   −.44      .00       2.00     −.70    .67   −.65    .47
  2   −.94  −1.28     −.70       1.67     −.85    .63   −.57    .45
  3   −.81    .32     −.85       1.63     −.83    .62   −.11    .45
  4   2.08    .65     −.83       1.62      .97    .62   1.04    .45
  5   1.81   −.17      .97       1.62     1.49    .62   1.16    .45
  6   −.05    .31     1.49       1.62      .53    .62    .63    .45
  7    .01   1.05      .53       1.62      .21    .62    .78    .45
  8   2.20   1.63      .21       1.62     1.44    .62   1.70    .45
  9   1.19   1.32     1.44       1.62     1.28    .62   2.12    .45
 10   5.24   2.83     1.28       1.62     3.73    .62   3.48    .45

# generate data
set.seed(1); num = 50
w = rnorm(num+1,0,1); v = rnorm(num,0,1)
mu = cumsum(w)        # state: mu[0], mu[1], ..., mu[50]
y = mu[-1] + v        # obs: y[1], ..., y[50]
# filter and smooth (Ksmooth0 does both)
ks = Ksmooth0(num, y, A=1, mu0=0, Sigma0=1, Phi=1, cQ=1, cR=1)
# start figure
par(mfrow=c(3,1)); Time = 1:num
plot(Time, mu[-1], main='Predict', ylim=c(-5,10))
  lines(ks$xp)
  lines(ks$xp+2*sqrt(ks$Pp), lty=2, col=4)
  lines(ks$xp-2*sqrt(ks$Pp), lty=2, col=4)
plot(Time, mu[-1], main='Filter', ylim=c(-5,10))
  lines(ks$xf)
  lines(ks$xf+2*sqrt(ks$Pf), lty=2, col=4)
  lines(ks$xf-2*sqrt(ks$Pf), lty=2, col=4)
plot(Time, mu[-1], main='Smooth', ylim=c(-5,10))
  lines(ks$xs)
  lines(ks$xs+2*sqrt(ks$Ps), lty=2, col=4)
  lines(ks$xs-2*sqrt(ks$Ps), lty=2, col=4)
# initial value info
mu[1]; ks$x0n; sqrt(ks$P0n)

When we discuss maximum likelihood estimation via the EM algorithm in the next section, we will need a set of recursions for obtaining P_{t,t−1}^n, as defined in (6.17). We give the necessary recursions in the following property.

Property 6.3 The Lag-One Covariance Smoother
For the state-space model specified in (6.3) and (6.4), with K_t, J_t (t = 1, ..., n) obtained from Property 6.1 and Property 6.2, and with initial condition

  P_{n,n−1}^n = ( I − K_n A_n ) Φ P_{n−1}^{n−1},                           (6.53)

for t = n, n − 1, ..., 2,

  P_{t−1,t−2}^n = P_{t−1}^{t−1} J_{t−2}' + J_{t−1} ( P_{t,t−1}^n − Φ P_{t−1}^{t−1} ) J_{t−2}'.    (6.54)

Proof: Because we are computing covariances, we may assume u_t ≡ 0 without loss of generality. To derive the initial term (6.53), we first define

  x̃_t^s = x_t − x_t^s.

Then, using (6.20) and (6.45), we write

  P_{t,t−1}^t = E( x̃_t^t x̃_{t−1}^{t ′} )
             = E{ [ x̃_t^{t−1} − K_t (y_t − A_t x_t^{t−1}) ][ x̃_{t−1}^{t−1} − J_{t−1} K_t (y_t − A_t x_t^{t−1}) ]' }
             = E{ [ x̃_t^{t−1} − K_t (A_t x̃_t^{t−1} + v_t) ][ x̃_{t−1}^{t−1} − J_{t−1} K_t (A_t x̃_t^{t−1} + v_t) ]' }.

Expanding terms and taking expectation, we arrive at

  P_{t,t−1}^t = P_{t,t−1}^{t−1} − P_t^{t−1} A_t' K_t' J_{t−1}' − K_t A_t P_{t,t−1}^{t−1} + K_t ( A_t P_t^{t−1} A_t' + R ) K_t' J_{t−1}',

noting E( x̃_t^{t−1} v_t' ) = 0. The final simplification occurs by realizing that K_t ( A_t P_t^{t−1} A_t' + R ) = P_t^{t−1} A_t', and P_{t,t−1}^{t−1} = Φ P_{t−1}^{t−1}. These relationships hold for any t = 1, ..., n, and (6.53) is the case t = n.

We give the basic steps in the derivation of (6.54). The first step is to use (6.45) to write

  x̃_{t−1}^n + J_{t−1} x_t^n = x̃_{t−1}^{t−1} + J_{t−1} Φ x_{t−1}^{t−1}     (6.55)

and

  x̃_{t−2}^n + J_{t−2} x_{t−1}^n = x̃_{t−2}^{t−2} + J_{t−2} Φ x_{t−2}^{t−2}.    (6.56)

Next, multiply the left-hand side of (6.55) by the transpose of the left-hand side of (6.56), and equate that to the corresponding result of the right-hand sides of (6.55) and (6.56). Then, taking expectation of both sides, the left-hand side result reduces to

  P_{t−1,t−2}^n + J_{t−1} E( x_t^n x_{t−1}^{n ′} ) J_{t−2}'                (6.57)

and the right-hand side result reduces to

  P_{t−1,t−2}^{t−2} − K_{t−1} A_{t−1} P_{t−1,t−2}^{t−2} + J_{t−1} Φ E( x_{t−1}^{t−1} x_{t−2}^{t−2 ′} ) Φ' J_{t−2}'.    (6.58)

In (6.57), write

  E( x_t^n x_{t−1}^{n ′} ) = E( x_t x_{t−1}' ) − P_{t,t−1}^n = Φ E( x_{t−1} x_{t−1}' ) − P_{t,t−1}^n,

and in (6.58), write

  E( x_{t−1}^{t−1} x_{t−2}^{t−2 ′} ) = E( x_{t−1} x_{t−2}' ) − P_{t−1,t−2}^{t−2}.

Equating (6.57) to (6.58) using these relationships and simplifying the result leads to (6.54). □
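The backward passes in Property 6.2 and Property 6.3 are just as short in code as the forward pass. Continuing the univariate filter sketch given after Property 6.1 (the helper kf1 defined there is assumed), a minimal smoother is sketched below; it is for illustration only, and the astsa script Ksmooth0 used in Example 6.5 is the general version.
# sketch: univariate Kalman smoother (Property 6.2), built on the filter sketch kf1
ks1 = function(y, mu0, Sigma0, Phi, A, Q, R){
  kf = kf1(y, mu0, Sigma0, Phi, A, Q, R); n = length(y)
  xs = kf$xf; Ps = kf$Pf                           # start from x_n^n and P_n^n
  for (t in n:2){
    J = kf$Pf[t-1]*Phi/kf$Pp[t]                    # (6.47)
    xs[t-1] = kf$xf[t-1] + J*(xs[t] - kf$xp[t])    # (6.45)
    Ps[t-1] = kf$Pf[t-1] + J^2*(Ps[t] - kf$Pp[t])  # (6.46)
  }
  list(xs=xs, Ps=Ps)
}
The lag-one covariances of Property 6.3 could be appended with one more backward loop using (6.53) and (6.54).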

6.3 Maximum Likelihood Estimation

Estimation of the parameters that specify the state space model, (6.3) and (6.4), is quite involved. We use Θ to represent the vector of unknown parameters in the initial mean and covariance μ_0 and Σ_0, the transition matrix Φ, the state and observation covariance matrices Q and R, and the input coefficient matrices, Υ and Γ. We use maximum likelihood under the assumption that the initial state is normal, x_0 ∼ N_p(μ_0, Σ_0), and the errors are normal, w_t ∼ iid N_p(0, Q) and v_t ∼ iid N_q(0, R). We continue to assume, for simplicity, {w_t} and {v_t} are uncorrelated.

The likelihood is computed using the innovations ε_1, ε_2, ..., ε_n, defined by (6.23),

  ε_t = y_t − A_t x_t^{t−1} − Γ u_t.

The innovations form of the likelihood of the data y_{1:n}, which was first given by Schweppe (1965), is obtained using an argument similar to the one leading to (3.117) and proceeds by noting the innovations are independent Gaussian random vectors with zero means and, as shown in (6.24), covariance matrices

  Σ_t = A_t P_t^{t−1} A_t' + R.                                            (6.59)

Hence, ignoring a constant, we may write the likelihood, L_Y(Θ), as

  −ln L_Y(Θ) = (1/2) Σ_{t=1}^{n} ln |Σ_t(Θ)| + (1/2) Σ_{t=1}^{n} ε_t(Θ)' Σ_t(Θ)^{-1} ε_t(Θ),    (6.60)

where we have emphasized the dependence of the innovations on the parameters Θ. Of course, (6.60) is a highly nonlinear and complicated function of the unknown parameters. The usual procedure is to fix x_0 and then develop a set of recursions for the log likelihood function and its first two derivatives (for example, Gupta and Mehra, 1974). Then, a Newton–Raphson algorithm (see Example 3.30) can be used successively to update the parameter values until the negative of the log likelihood is minimized. This approach is advocated, for example, by Jones (1980), who developed ARMA estimation by putting the ARMA model in state-space form. For the univariate case, (6.60) is identical, in form, to the likelihood for the ARMA model given in (3.117).

The steps involved in performing a Newton–Raphson estimation procedure are as follows.
(i) Select initial values for the parameters, say, Θ^{(0)}.
(ii) Run the Kalman filter, Property 6.1, using the initial parameter values, Θ^{(0)}, to obtain a set of innovations and error covariances, say, {ε_t^{(0)}; t = 1, ..., n} and {Σ_t^{(0)}; t = 1, ..., n}.
(iii) Run one iteration of a Newton–Raphson procedure with −ln L_Y(Θ) as the criterion function (refer to Example 3.30 for details), to obtain a new set of estimates, say Θ^{(1)}.
(iv) At iteration j (j = 1, 2, ...), repeat step 2 using Θ^{(j)} in place of Θ^{(j−1)} to obtain a new set of innovation values {ε_t^{(j)}; t = 1, ..., n} and {Σ_t^{(j)}; t = 1, ..., n}. Then repeat step 3 to obtain a new estimate Θ^{(j+1)}. Stop when the estimates or the likelihood stabilize; for example, stop when the values of Θ^{(j+1)} differ from Θ^{(j)}, or when L_Y(Θ^{(j+1)}) differs from L_Y(Θ^{(j)}), by some predetermined, but small, amount.
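In code, evaluating (6.60) is one more loop over the filter quantities. The following univariate sketch (an illustration only; the astsa functions Kfilter0 and Kfilter1 used in the examples below return this criterion as $like) combines the filter and the accumulation of the negative log likelihood for the AR(1)-plus-noise model of Example 6.3; the parameter layout para = (φ, σ_w, σ_v) is simply a choice made for the sketch.
# sketch: negative log likelihood (6.60) for the model of Example 6.3
negloglik = function(para, y){
  phi = para[1]; Q = para[2]^2; R = para[3]^2
  xf = 0; Pf = Q/(1 - phi^2)            # x_0^0 = 0 and stationary P_0^0
  nll = 0
  for (t in 1:length(y)){
    xp = phi*xf; Pp = phi^2*Pf + Q      # (6.18), (6.19)
    sig = Pp + R                        # innovation variance (6.59), A_t = 1
    eps = y[t] - xp                     # innovation (6.23)
    nll = nll + .5*(log(sig) + eps^2/sig)     # accumulate (6.60)
    K = Pp/sig; xf = xp + K*eps; Pf = (1 - K)*Pp
  }
  nll
}
# e.g., optim(c(.5, 1, 1), negloglik, y=y, method="BFGS") carries out the minimization numerically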

Example 6.6 Newton–Raphson for Example 6.3
In this example, we generated n = 100 observations, y_{1:100}, from the AR with noise model given in Example 6.3, to perform a Newton–Raphson estimation of the parameters φ, σ_w², and σ_v². In the notation of Section 6.2, we would have Φ = φ, Q = σ_w², and R = σ_v². The actual values of the parameters are φ = .8 and σ_w² = σ_v² = 1.
In the simple case of an AR(1) with observational noise, initial estimation can be accomplished using the results of Example 6.3. For example, using (6.15), we set

  φ^{(0)} = ρ̂_y(2)/ρ̂_y(1).

Similarly, from (6.14), γ_x(1) = γ_y(1) = φ σ_w²/(1 − φ²), so that, initially, we set

  σ_w^{2(0)} = (1 − φ^{(0)2}) γ̂_y(1)/φ^{(0)}.

Finally, using (6.13) we obtain an initial estimate of σ_v², namely,

  σ_v^{2(0)} = γ̂_y(0) − [ σ_w^{2(0)}/(1 − φ^{(0)2}) ].

Newton–Raphson estimation was accomplished using the R program optim. The code used for this example is given below. In that program, we must provide an evaluation of the function to be minimized, namely, −ln L_Y(Θ). In this case, the function call combines steps 2 and 3, using the current values of the parameters, Θ^{(j−1)}, to obtain first the filtered values, then the innovation values, and then calculating the criterion function, −ln L_Y(Θ^{(j−1)}), to be minimized. We can also provide analytic forms of the gradient or score vector, −∂ ln L_Y(Θ)/∂Θ, and the Hessian matrix, −∂² ln L_Y(Θ)/∂Θ ∂Θ', in the optimization routine, or allow the program to calculate these values numerically. In this example, we let the program proceed numerically and we note the need to be cautious when calculating gradients numerically. It is suggested in Press et al. (1993, Ch. 10) that it is better to use numerical methods for the derivatives, at least for the Hessian, along with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. Details on the gradient and Hessian are provided in Problem 6.9 and Problem 6.10; see Gupta and Mehra (1974).
# Generate Data
set.seed(999); num = 100
x = arima.sim(n=num+1, list(ar=.8), sd=1)
y = ts(x[-1] + rnorm(num,0,1))
# Initial Estimates
u = ts.intersect(y, lag(y,-1), lag(y,-2))
varu = var(u); coru = cor(u)
phi = coru[1,3]/coru[1,2]
q = (1-phi^2)*varu[1,2]/phi
r = varu[1,1] - q/(1-phi^2)

(init.par = c(phi, sqrt(q), sqrt(r)))   # = .91, .51, 1.03
# Function to evaluate the likelihood
Linn = function(para){
  phi = para[1]; sigw = para[2]; sigv = para[3]
  Sigma0 = (sigw^2)/(1-phi^2); Sigma0[Sigma0<0]=0
  kf = Kfilter0(num, y, 1, mu0=0, Sigma0, phi, sigw, sigv)
  return(kf$like)
}
# Estimation (partial output shown)
(est = optim(init.par, Linn, gr=NULL, method='BFGS', hessian=TRUE,
      control=list(trace=1, REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
cbind(estimate=c(phi=est$par[1],sigw=est$par[2],sigv=est$par[3]),SE)
      estimate    SE
phi      0.814 0.081
sigw     0.851 0.175
sigv     0.874 0.143

As seen from the output, the final estimates, along with their standard errors (in parentheses), are φ̂ = .81 (.08), σ̂_w = .85 (.18), σ̂_v = .87 (.14). The report from optim yielded the following results of the estimation procedure:
initial  value 81.313627
iter   2 value 80.169051
iter   3 value 79.866131
iter   4 value 79.222846
iter   5 value 79.021504
iter   6 value 79.014723
iter   7 value 79.014453
iter   7 value 79.014452
iter   7 value 79.014452
final  value 79.014452
converged
Note that the algorithm converged in seven steps with the final value of the negative of the log likelihood being 79.014452. The standard errors are a byproduct of the estimation procedure, and we will discuss their evaluation later in this section, after Property 6.4.

Example 6.7 Newton–Raphson for the Global Temperature Deviations
In Example 6.2, we considered two different global temperature series of n = 136 observations each, and they are plotted in Figure 6.3. In that example, we argued that both series should be measuring the same underlying climatic signal, x_t, which we model as a random walk with drift,

    x_t = δ + x_{t−1} + w_t.

Recall that the observation equation was written as

    ( y_{t1} ; y_{t2} ) = ( 1 ; 1 ) x_t + ( v_{t1} ; v_{t2} ),

and the model covariance matrices are given by Q = q_{11} and R = [ r_{11}  r_{12} ; r_{21}  r_{22} ].

Fig. 6.5. Plot for Example 6.7. The dashed lines with points (+ and △) are the two average global temperature deviations shown in Figure 6.3. The solid line is the estimated smoother x̂_t^n, and the corresponding two root mean square error bound is the gray swatch. Only the values later than 1900 are shown.

Hence, there are five parameters to estimate, δ, the drift, and the variance components, q_{11}, r_{11}, r_{12}, r_{22}, noting that r_{21} = r_{12}. We hold the initial state parameters fixed in this example at μ_0 = −.35 and Σ_0 = 1, which is large relative to the data. The final estimates were as follows (the R matrix is reassembled in the code).
       estimate    SE
sigw      0.055 0.011
cR11      0.074 0.010
cR22      0.127 0.015
cR12      0.129 0.038
drift     0.006 0.005
The observations and the smoothed estimate of the signal, x̂_t^n ± 2√P̂_t^n, are displayed in Figure 6.5. The code, which uses Kfilter1 and Ksmooth1, is as follows.
# Setup
y = cbind(globtemp, globtempl); num = nrow(y); input = rep(1,num)
A = array(rep(1,2), dim=c(2,1,num))
mu0 = -.35; Sigma0 = 1; Phi = 1
# Function to Calculate Likelihood
Linn = function(para){
  cQ = para[1]      # sigma_w
  cR1 = para[2]     # 11 element of chol(R)
  cR2 = para[3]     # 22 element of chol(R)
  cR12 = para[4]    # 12 element of chol(R)
  cR = matrix(c(cR1,0,cR12,cR2),2)  # put the matrix together
  drift = para[5]
  kf = Kfilter1(num,y,A,mu0,Sigma0,Phi,drift,0,cQ,cR,input)
  return(kf$like)
}
# Estimation
init.par = c(.1,.1,.1,0,.05)   # initial values of parameters

(est = optim(init.par, Linn, NULL, method='BFGS', hessian=TRUE,
      control=list(trace=1,REPORT=1)))   # output not shown
SE = sqrt(diag(solve(est$hessian)))
# Display estimates
u = cbind(estimate=est$par, SE)
rownames(u)=c('sigw','cR11','cR22','cR12','drift'); u
# Smooth (first set parameters to their final estimates)
cQ    = est$par[1]
cR1   = est$par[2]
cR2   = est$par[3]
cR12  = est$par[4]
cR    = matrix(c(cR1,0,cR12,cR2), 2)
(R    = t(cR)%*%cR)   # to view the estimated R matrix
drift = est$par[5]
ks    = Ksmooth1(num,y,A,mu0,Sigma0,Phi,drift,0,cQ,cR,input)
# Plot
xsm  = ts(as.vector(ks$xs), start=1880)
rmse = ts(sqrt(as.vector(ks$Ps)), start=1880)
plot(xsm, ylim=c(-.6, 1), ylab='Temperature Deviations')
xx = c(time(xsm), rev(time(xsm)))
yy = c(xsm-2*rmse, rev(xsm+2*rmse))
polygon(xx, yy, border=NA, col=gray(.6, alpha=.25))
lines(globtemp,  type='o', pch=2, col=4, lty=6)
lines(globtempl, type='o', pch=3, col=3, lty=6)

In addition to Newton–Raphson, Shumway and Stoffer (1982) presented a conceptually simpler estimation procedure based on the Baum-Welch algorithm (Baum et al., 1970), also known as the EM (expectation-maximization) algorithm (Dempster et al., 1977). For the sake of brevity, we ignore the inputs and consider the model in the form of (6.1) and (6.2). The basic idea is that if we could observe the states, x_{0:n} = {x_0, x_1, ..., x_n}, in addition to the observations y_{1:n} = {y_1, ..., y_n}, then we would consider {x_{0:n}, y_{1:n}} as the complete data, with joint density

    f_Θ(x_{0:n}, y_{1:n}) = p_{μ_0,Σ_0}(x_0) ∏_{t=1}^n p_{Φ,Q}(x_t | x_{t−1}) ∏_{t=1}^n p_R(y_t | x_t).    (6.61)

Under the Gaussian assumption and ignoring constants, the complete data likelihood, (6.61), can be written as

    −2 ln L_{X,Y}(Θ) = ln|Σ_0| + (x_0 − μ_0)' Σ_0^{−1} (x_0 − μ_0)
        + n ln|Q| + Σ_{t=1}^n (x_t − Φ x_{t−1})' Q^{−1} (x_t − Φ x_{t−1})    (6.62)
        + n ln|R| + Σ_{t=1}^n (y_t − A_t x_t)' R^{−1} (y_t − A_t x_t).

Thus, in view of (6.62), if we did have the complete data, we could then use the results from multivariate normal theory to easily obtain the MLEs of Θ. Although we do not have the complete data, the EM algorithm gives us an iterative method for finding the MLEs of Θ based on the incomplete data, y_{1:n}, by successively maximizing the conditional expectation of the complete data likelihood.

To implement the EM algorithm, we write, at iteration j (j = 1, 2, ...),

    Q(Θ | Θ^{(j−1)}) = E{ −2 ln L_{X,Y}(Θ) | y_{1:n}, Θ^{(j−1)} }.    (6.63)

Calculation of (6.63) is the expectation step. Of course, given the current value of the parameters, Θ^{(j−1)}, we can use Property 6.2 to obtain the desired conditional expectations as smoothers. This property yields

    Q(Θ | Θ^{(j−1)}) = ln|Σ_0| + tr{ Σ_0^{−1} [ P_0^n + (x_0^n − μ_0)(x_0^n − μ_0)' ] }
        + n ln|Q| + tr{ Q^{−1} [ S_{11} − S_{10} Φ' − Φ S_{10}' + Φ S_{00} Φ' ] }    (6.64)
        + n ln|R| + tr{ R^{−1} Σ_{t=1}^n [ (y_t − A_t x_t^n)(y_t − A_t x_t^n)' + A_t P_t^n A_t' ] },

where

    S_{11} = Σ_{t=1}^n ( x_t^n x_t^{n'} + P_t^n ),    (6.65)
    S_{10} = Σ_{t=1}^n ( x_t^n x_{t−1}^{n'} + P_{t,t−1}^n ),    (6.66)

and

    S_{00} = Σ_{t=1}^n ( x_{t−1}^n x_{t−1}^{n'} + P_{t−1}^n ).    (6.67)

In (6.64)–(6.67), the smoothers are calculated under the current value of the parameters Θ^{(j−1)}; for simplicity, we have not explicitly displayed this fact. In obtaining Q(· | ·), we made repeated use of the fact E(x_s x_t' | y_{1:n}) = x_s^n x_t^{n'} + P_{s,t}^n; it is important to note that one does not simply replace x_t in the likelihood with x_t^n.

Minimizing (6.64) with respect to the parameters, at iteration j, constitutes the maximization step, and is analogous to the usual multivariate regression approach, which yields the updated estimates

    Φ^{(j)} = S_{10} S_{00}^{−1},    (6.68)

    Q^{(j)} = n^{−1} ( S_{11} − S_{10} S_{00}^{−1} S_{10}' ),    (6.69)

and

    R^{(j)} = n^{−1} Σ_{t=1}^n [ (y_t − A_t x_t^n)(y_t − A_t x_t^n)' + A_t P_t^n A_t' ].    (6.70)

The updates for the initial mean and variance–covariance matrix are

    μ_0^{(j)} = x_0^n   and   Σ_0^{(j)} = P_0^n,    (6.71)

obtained from minimizing (6.64).
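To make the maximization step concrete, here is a minimal sketch of one application of (6.68)–(6.71) for the case A_t ≡ A. The arrays xs, Ps, and Pcs are hypothetical and stand for the smoothed states x_t^n, their covariances P_t^n, and the lag-one covariances P_{t,t−1}^n from Properties 6.2 and 6.3; this is only a sketch of a single M-step, and the EM0 and EM1 scripts used in the examples below carry out the full iteration.

# Sketch of one EM maximization step, (6.68)-(6.71), assuming time-invariant A.
#   y   : q x n matrix of observations (columns are time points)
#   A   : q x p observation matrix
#   xs  : p x (n+1) matrix of smoothed states x_t^n, columns t = 0,...,n
#   Ps  : p x p x (n+1) array of smoother covariances P_t^n
#   Pcs : p x p x (n+1) array; slice t+1 holds the lag-one covariance P_{t,t-1}^n
em.mstep = function(y, A, xs, Ps, Pcs){
  n = ncol(y); p = nrow(xs); q = nrow(y)
  S11 = S10 = S00 = matrix(0, p, p); R = matrix(0, q, q)
  for (t in 1:n){
    S11 = S11 + xs[,t+1] %*% t(xs[,t+1]) + Ps[,,t+1]    # (6.65)
    S10 = S10 + xs[,t+1] %*% t(xs[,t])   + Pcs[,,t+1]   # (6.66)
    S00 = S00 + xs[,t]   %*% t(xs[,t])   + Ps[,,t]      # (6.67)
  }
  Phi = S10 %*% solve(S00)                              # (6.68)
  Q   = (S11 - S10 %*% solve(S00) %*% t(S10))/n         # (6.69)
  for (t in 1:n){
    e = y[,t] - A %*% xs[,t+1]
    R = R + e %*% t(e) + A %*% Ps[,,t+1] %*% t(A)       # (6.70)
  }
  R = R/n
  list(Phi=Phi, Q=Q, R=R, mu0=xs[,1], Sigma0=Ps[,,1])   # (6.71)
}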

The overall procedure can be regarded as simply alternating between the Kalman filtering and smoothing recursions and the multivariate normal maximum likelihood estimators, as given by (6.68)–(6.71). Convergence results for the EM algorithm under general conditions can be found in Wu (1983). A thorough discussion of the convergence of the EM algorithm and related methods may be found in Douc et al. (2014, Appendix D). We summarize the iterative procedure as follows.

(i) Initialize the procedure by choosing starting values for the parameters {μ_0, Σ_0, Φ, Q, R}, say Θ^{(0)}, and compute the incomplete-data likelihood, −ln L_Y(Θ^{(0)}); see (6.60).
On iteration j (j = 1, 2, ...):
(ii) Perform the E-step. Using the parameters Θ^{(j−1)}, use Properties 6.1, 6.2, and 6.3 to obtain the smoothed values x_t^n, P_t^n, and P_{t,t−1}^n, for t = 1, ..., n, and calculate S_{11}, S_{10}, S_{00} given in (6.65)–(6.67).
(iii) Perform the M-step. Update the estimates of {μ_0, Σ_0, Φ, Q, R} using (6.68)–(6.71), obtaining Θ^{(j)}.
(iv) Compute the incomplete-data likelihood, −ln L_Y(Θ^{(j)}).
(v) Repeat Steps (ii)–(iv) to convergence.

Example 6.8 EM Algorithm for Example 6.3
Using the same data generated in Example 6.6, we performed an EM algorithm estimation of the parameters φ, σ_w², and σ_v², as well as the initial parameters μ_0 and Σ_0, using the script EM0. The convergence rate of the EM algorithm compared with the Newton–Raphson procedure is slow. In this example, convergence was claimed when the relative change in the log likelihood was less than .00001; convergence was attained after 59 iterations. The final estimates, along with their standard errors, are listed below; the results are close to those in Example 6.6.
         estimate    SE
phi         0.810 0.078
sigw        0.853 0.164
sigv        0.864 0.136
mu0        -1.981    NA
Sigma0      0.022    NA
Evaluation of the standard errors used a call to fdHess in the nlme R package to evaluate the Hessian at the final estimates. The nlme package must be loaded prior to the call to fdHess.
library(nlme)   # loads package nlme
# Generate data (same as Example 6.6)
set.seed(999); num = 100
x = arima.sim(n=num+1, list(ar = .8), sd=1)
y = ts(x[-1] + rnorm(num,0,1))
# Initial Estimates (same as Example 6.6)
u = ts.intersect(y, lag(y,-1), lag(y,-2))
varu = var(u); coru = cor(u)
phi = coru[1,3]/coru[1,2]
q = (1-phi^2)*varu[1,2]/phi
r = varu[1,1] - q/(1-phi^2)
# EM procedure - output not shown

(em = EM0(num, y, A=1, mu0=0, Sigma0=2.8, Phi=phi, cQ=sqrt(q), cR=sqrt(r),
      max.iter=75, tol=.00001))
# Standard Errors (this uses nlme)
phi = em$Phi; cq = sqrt(em$Q); cr = sqrt(em$R)
mu0 = em$mu0; Sigma0 = em$Sigma0
para = c(phi, cq, cr)
Linn = function(para){   # to evaluate likelihood at estimates
  kf = Kfilter0(num, y, 1, mu0, Sigma0, para[1], para[2], para[3])
  return(kf$like)
}
emhess = fdHess(para, function(para) Linn(para))
SE = sqrt(diag(solve(emhess$Hessian)))
# Display Summary of Estimation
estimate = c(para, em$mu0, em$Sigma0); SE = c(SE, NA, NA)
u = cbind(estimate, SE)
rownames(u) = c('phi','sigw','sigv','mu0','Sigma0'); u

Steady State and Asymptotic Distribution of the MLEs

The asymptotic distribution of estimators of the model parameters, say, Θ̂_n, is studied in very general terms in Douc, Moulines, and Stoffer (2014, Chapter 13). Earlier treatments can be found in Caines (1988, Chapters 7 and 8), and in Hannan and Deistler (1988, Chapter 4). In these references, the consistency and asymptotic normality of the estimators are established under general conditions. An essential condition is the stability of the filter. Stability of the filter assures that, for large t, the innovations ε_t are basically copies of each other with a stable covariance matrix Σ that does not depend on t and that, asymptotically, the innovations contain all of the information about the unknown parameters. Although it is not necessary, for simplicity, we shall assume here that A_t ≡ A for all t. Details on departures from this assumption can be found in Jazwinski (1970, Sections 7.6 and 7.8). We also drop the inputs and use the model in the form of (6.1) and (6.2).

For stability of the filter, we assume the eigenvalues of Φ are less than one in absolute value; this assumption can be weakened (for example, see Harvey, 1991, Section 4.3), but we retain it for simplicity. This assumption is enough to ensure the stability of the filter in that, as t → ∞, the filter error covariance matrix P_t^{t−1} converges to P, the steady-state error covariance matrix, and the gain matrix K_t converges to K, the steady-state gain matrix. From these facts, it follows that the innovation covariance matrix Σ_t converges to Σ, the steady-state covariance matrix of the stable innovations; details can be found in Jazwinski (1970, Sections 7.6 and 7.8) and Anderson and Moore (1979, Section 4.4). In particular, the steady-state filter error covariance matrix, P, satisfies the Riccati equation

    P = Φ[ P − P A' (A P A' + R)^{−1} A P ]Φ' + Q;

the steady-state gain matrix satisfies K = P A' [A P A' + R]^{−1}. In Example 6.5 (see Table 6.1), for all practical purposes, stability was reached by the third observation.
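As a small illustration, the steady-state quantities can be approximated numerically by simply iterating the covariance recursion until it stops changing. The following is a minimal sketch under the stated assumptions (time-invariant A, eigenvalues of Φ inside the unit circle); the function name and arguments are hypothetical.

# Sketch: iterate the Riccati recursion to the steady-state P, gain K,
# and innovation covariance Sigma, for given Phi, A, Q, R.
steady.state = function(Phi, A, Q, R, tol=1e-10, max.iter=1000){
  P = Q                 # start the recursion somewhere reasonable
  for (i in 1:max.iter){
    Pnew = Phi %*% (P - P %*% t(A) %*% solve(A %*% P %*% t(A) + R) %*% A %*% P) %*% t(Phi) + Q
    if (max(abs(Pnew - P)) < tol) break
    P = Pnew
  }
  K = P %*% t(A) %*% solve(A %*% P %*% t(A) + R)   # steady-state gain
  list(P=P, K=K, Sigma = A %*% P %*% t(A) + R)
}
# e.g., for the AR(1) with noise of Example 6.6 (phi = .8, Q = R = 1):
# steady.state(Phi=matrix(.8), A=matrix(1), Q=matrix(1), R=matrix(1))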

When the process is in steady-state, we may consider x_{t+1}^t as the steady-state predictor and interpret it as x_{t+1}^t = E(x_{t+1} | y_t, y_{t−1}, ...). As can be seen from (6.18) and (6.20), the steady-state predictor can be written as

    x_{t+1}^t = Φ[I − K A] x_t^{t−1} + Φ K y_t = Φ x_t^{t−1} + Φ K ε_t,    (6.72)

where ε_t is the steady-state innovation process given by

    ε_t = y_t − E(y_t | y_{t−1}, y_{t−2}, ...).

In the Gaussian case, ε_t ~ iid N(0, Σ), where Σ = A P A' + R. In steady-state, the observations can be written as

    y_t = A x_t^{t−1} + ε_t.    (6.73)

Together, (6.72) and (6.73) make up the steady-state innovations form of the dynamic linear model.

In the following property, we assume the Gaussian state space model (6.1) and (6.2), A_t ≡ A is time invariant, the eigenvalues of Φ are within the unit circle, and the model has the smallest possible dimension (see Hannan and Deistler, 1988, Section 2.3 for details). We denote the true parameters by Θ_0, and we assume the dimension of Θ_0 is the dimension of the parameter space. Although it is not necessary to assume w_t and v_t are Gaussian, certain additional conditions would have to apply and adjustments to the asymptotic covariance matrix would have to be made; see Douc et al. (2014, Chapter 13).

Property 6.4 Asymptotic Distribution of the Estimators
Under general conditions, let Θ̂_n be the estimator of Θ_0 obtained by maximizing the innovations likelihood, L_Y(Θ), as given in (6.60). Then, as n → ∞,

    √n ( Θ̂_n − Θ_0 ) →_d N[ 0, I(Θ_0)^{−1} ],

where I(Θ) is the asymptotic information matrix given by

    I(Θ) = lim_{n→∞} n^{−1} E[ −∂² ln L_Y(Θ) / ∂Θ ∂Θ' ].

For a Newton procedure, the Hessian matrix (as described in Example 6.6) at the time of convergence can be used as an estimate of n I(Θ_0) to obtain estimates of the standard errors. In the case of the EM algorithm, no derivatives are calculated, but we may include a numerical evaluation of the Hessian matrix at the time of convergence to obtain estimated standard errors. Also, extensions of the EM algorithm exist, such as the SEM algorithm (Meng and Rubin, 1991), that include a procedure for the estimation of standard errors. In the examples of this section, the estimated standard errors were obtained from the numerical Hessian matrix of −ln L_Y(Θ̂), where Θ̂ is the vector of parameter estimates at the time of convergence.

6.4 Missing Data Modifications

An attractive feature available within the state space framework is its ability to treat time series that have been observed irregularly over time. For example, Jones (1980) used the state-space representation to fit ARMA models to series with missing observations, and Palma and Chan (1997) used the model for estimation and forecasting of ARFIMA series with missing observations.

Shumway and Stoffer (1982) described the modifications necessary to fit multivariate state-space models via the EM algorithm when data are missing. We will discuss the procedure in detail in this section. Throughout this section, for notational simplicity, we assume the model is of the form (6.1) and (6.2).

Suppose, at a given time t, we define the partition of the q × 1 observation vector y_t into two parts, y_t^{(1)}, the q_{1t} × 1 component of observed values, and y_t^{(2)}, the q_{2t} × 1 component of unobserved values, where q_{1t} + q_{2t} = q. Then, write the partitioned observation equation

    ( y_t^{(1)} ; y_t^{(2)} ) = ( A_t^{(1)} ; A_t^{(2)} ) x_t + ( v_t^{(1)} ; v_t^{(2)} ),    (6.74)

where A_t^{(1)} and A_t^{(2)} are, respectively, the q_{1t} × p and q_{2t} × p partitioned observation matrices, and

    cov( v_t^{(1)} ; v_t^{(2)} ) = [ R_{11t}  R_{12t} ; R_{21t}  R_{22t} ]    (6.75)

denotes the covariance matrix of the measurement errors between the observed and unobserved parts.

In the missing data case where y_t^{(2)} is not observed, we may modify the observation equation in the DLM, (6.1)–(6.2), so that the model is

    x_t = Φ x_{t−1} + w_t   and   y_t^{(1)} = A_t^{(1)} x_t + v_t^{(1)},    (6.76)

where now, the observation equation is q_{1t}-dimensional at time t. In this case, it follows directly from Corollary 6.1 that the filter equations hold with the appropriate notational substitutions. If there are no observations at time t, then set the gain matrix, K_t, to the p × q zero matrix in Property 6.1, in which case x_t^t = x_t^{t−1} and P_t^t = P_t^{t−1}.

Rather than deal with varying observational dimensions, it is computationally easier to modify the model by zeroing out certain components and retaining a q-dimensional observation equation throughout. In particular, Corollary 6.1 holds for the missing data case if, at update t, we substitute

    y_{(t)} = ( y_t^{(1)} ; 0 ),   A_{(t)} = ( A_t^{(1)} ; 0 ),   R_{(t)} = [ R_{11t}  0 ; 0  I_{22t} ],    (6.77)

for y_t, A_t, and R, respectively, in (6.20)–(6.22), where I_{22t} is the q_{2t} × q_{2t} identity matrix. With the substitutions (6.77), the innovation values (6.23) and (6.24) will now be of the form

    ε_{(t)} = ( ε_t^{(1)} ; 0 ),   Σ_{(t)} = [ A_t^{(1)} P_t^{t−1} A_t^{(1)'} + R_{11t}   0 ; 0   I_{22t} ],    (6.78)

so that the innovations form of the likelihood given in (6.60) is correct for this case. Hence, with the substitutions in (6.77), maximum likelihood estimation via the innovations likelihood can proceed as in the complete data case.
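The substitution (6.77) is easy to carry out in practice. The following is a minimal sketch, not tied to any particular filtering script, that builds the zeroed-out observation vector, observation matrix, and error covariance at a single time point; the names y.t, A.t, and R are hypothetical, with y.t the raw q × 1 observation carrying NAs for the unobserved components.

# Sketch of the missing-data substitution (6.77) at one time point.
# y.t : length-q observation with NA for unobserved components
# A.t : q x p observation matrix,  R : q x q observation error covariance
miss.sub = function(y.t, A.t, R){
  obs = !is.na(y.t)                 # indicator of observed components
  y.sub = ifelse(obs, y.t, 0)       # y_(t): unobserved entries set to 0
  A.sub = A.t; A.sub[!obs, ] = 0    # A_(t): corresponding rows zeroed
  R.sub = diag(1, length(y.t))      # identity supplies the I_22t block
  R.sub[obs, obs] = R[obs, obs]     # keep R_11t for the observed block
  list(y=y.sub, A=A.sub, R=R.sub)
}
# With these substitutions, the usual q-dimensional filter updates, and hence
# the innovations likelihood (6.60), can be used without change.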

Once the missing data filtered values have been obtained, Stoffer (1982) also established that the smoother values can be processed using Property 6.2 and Property 6.3 with the values obtained from the missing data-filtered values. In the missing data case, the state estimators are denoted

    x_t^{(s)} = E( x_t | y_1^{(1)}, ..., y_s^{(1)} ),    (6.79)

with error variance–covariance matrix

    P_t^{(s)} = E{ (x_t − x_t^{(s)}) (x_t − x_t^{(s)})' }.    (6.80)

The missing data lag-one smoother covariances will be denoted by P_{t,t−1}^{(n)}.

The maximum likelihood estimators in the EM procedure require further modifications for the case of missing data. Now, we consider

    y_{1:n}^{(1)} = { y_1^{(1)}, ..., y_n^{(1)} }    (6.81)

as the incomplete data, and {x_{0:n}, y_{1:n}}, as defined in (6.61), as the complete data. In this case, the complete data likelihood, (6.61), or equivalently (6.62), is the same, but to implement the E-step, at iteration j, we must calculate

    Q(Θ | Θ^{(j−1)}) = E{ −2 ln L_{X,Y}(Θ) | y_{1:n}^{(1)}, Θ^{(j−1)} }
      = E_*{ ln|Σ_0| + tr[ Σ_0^{−1} (x_0 − μ_0)(x_0 − μ_0)' ] | y_{1:n}^{(1)} }
      + E_*{ n ln|Q| + Σ_{t=1}^n tr[ Q^{−1} (x_t − Φ x_{t−1})(x_t − Φ x_{t−1})' ] | y_{1:n}^{(1)} }    (6.82)
      + E_*{ n ln|R| + Σ_{t=1}^n tr[ R^{−1} (y_t − A_t x_t)(y_t − A_t x_t)' ] | y_{1:n}^{(1)} },

where E_* denotes the conditional expectation under Θ^{(j−1)} and tr denotes trace.

The first two terms in (6.82) will be like the first two terms of (6.64) with the smoothers x_t^n, P_t^n, and P_{t,t−1}^n replaced by their missing data counterparts, x_t^{(n)}, P_t^{(n)}, and P_{t,t−1}^{(n)}. In the third term of (6.82), we must additionally evaluate E_*( y_t^{(2)} | y_{1:n}^{(1)} ) and E_*( y_t^{(2)} y_t^{(2)'} | y_{1:n}^{(1)} ). In Stoffer (1982), it is shown that

    E_*{ (y_t − A_t x_t)(y_t − A_t x_t)' | y_{1:n}^{(1)} }
      = ( y_t^{(1)} − A_t^{(1)} x_t^{(n)} ; R_{*21t} R_{*11t}^{−1} (y_t^{(1)} − A_t^{(1)} x_t^{(n)}) )
        ( y_t^{(1)} − A_t^{(1)} x_t^{(n)} ; R_{*21t} R_{*11t}^{−1} (y_t^{(1)} − A_t^{(1)} x_t^{(n)}) )'
      + ( A_t^{(1)} ; R_{*21t} R_{*11t}^{−1} A_t^{(1)} ) P_t^{(n)} ( A_t^{(1)} ; R_{*21t} R_{*11t}^{−1} A_t^{(1)} )'    (6.83)
      + [ 0  0 ; 0  R_{*22t} − R_{*21t} R_{*11t}^{−1} R_{*12t} ].

In (6.83), the values of R_{*ikt}, for i, k = 1, 2, are the current values specified by Θ^{(j−1)}. In addition, x_t^{(n)} and P_t^{(n)} are the values obtained by running the smoother under the current parameter estimates specified by Θ^{(j−1)}.

In the case in which observed and unobserved components have uncorrelated errors, that is, R_{*12t} is the zero matrix, (6.83) can be simplified to

    E_*{ (y_t − A_t x_t)(y_t − A_t x_t)' | y_{1:n}^{(1)} }
      = ( y_{(t)} − A_{(t)} x_t^{(n)} )( y_{(t)} − A_{(t)} x_t^{(n)} )' + A_{(t)} P_t^{(n)} A_{(t)}' + [ 0  0 ; 0  R_{22t} ],    (6.84)

where y_{(t)} and A_{(t)} are defined in (6.77).

In this simplified case, the missing data M-step looks like the M-step given in (6.65)–(6.71). That is, with

    S_{(11)} = Σ_{t=1}^n ( x_t^{(n)} x_t^{(n)'} + P_t^{(n)} ),    (6.85)

    S_{(10)} = Σ_{t=1}^n ( x_t^{(n)} x_{t−1}^{(n)'} + P_{t,t−1}^{(n)} ),    (6.86)

and

    S_{(00)} = Σ_{t=1}^n ( x_{t−1}^{(n)} x_{t−1}^{(n)'} + P_{t−1}^{(n)} ),    (6.87)

where the smoothers are calculated under the present value of the parameters Θ^{(j−1)} using the missing data modifications, at iteration j, the maximization step is

    Φ^{(j)} = S_{(10)} S_{(00)}^{−1},    (6.88)

    Q^{(j)} = n^{−1} ( S_{(11)} − S_{(10)} S_{(00)}^{−1} S_{(10)}' ),    (6.89)

and

    R^{(j)} = n^{−1} Σ_{t=1}^n D_t { ( y_{(t)} − A_{(t)} x_t^{(n)} )( y_{(t)} − A_{(t)} x_t^{(n)} )' + A_{(t)} P_t^{(n)} A_{(t)}' + [ 0  0 ; 0  R_{22t}^{(j−1)} ] } D_t',    (6.90)

where D_t is a permutation matrix that reorders the variables at time t in their original order, and y_{(t)} and A_{(t)} are defined in (6.77). For example, suppose q = 3 and at time t, y_{2t} is missing. Then,

    y_{(t)} = ( y_{1t} ; y_{3t} ; 0 ),   A_{(t)} = ( A_{1t} ; A_{3t} ; 0' ),   and   D_t = [ 1 0 0 ; 0 0 1 ; 0 1 0 ],

where A_{ti} is the ith row of A_t and 0' is a 1 × p vector of zeros. In (6.90), only R_{11t} gets updated, and R_{22t} at iteration j is simply set to its value from the previous iteration, j − 1. Of course, if we cannot assume R_{12t} = 0, (6.90) must be changed accordingly using (6.83), but (6.88) and (6.89) remain the same. As before, the parameter estimates for the initial state are updated as

    μ_0^{(j)} = x_0^{(n)}   and   Σ_0^{(j)} = P_0^{(n)}.    (6.91)

Example 6.9 Longitudinal Biomedical Data
We consider the biomedical data in Example 6.1, which have portions of the three-dimensional vector missing after the 40th day. The maximum likelihood procedure yielded the estimators (code at the end of the example):
$Phi
       [,1]   [,2]  [,3]
[1,]  0.984 -0.041 0.009
[2,]  0.061  0.921 0.007
[3,] -1.495  2.289 0.794
$Q
       [,1]   [,2]  [,3]
[1,]  0.014 -0.002 0.012
[2,] -0.002  0.003 0.018
[3,]  0.012  0.018 3.494
$R
      [,1]  [,2]  [,3]
[1,] 0.007 0.000 0.000
[2,] 0.000 0.017 0.000
[3,] 0.000 0.000 1.147
for the transition, state error covariance and observation error covariance matrices, respectively. The coupling between the first and second series is relatively weak, whereas the third series, HCT, is strongly related to the first two; that is,

    x̂_{t3} = −1.495 x_{t−1,1} + 2.289 x_{t−1,2} + .794 x_{t−1,3}.

Hence, the HCT is negatively correlated with white blood count (WBC) and positively correlated with platelet count (PLT). Byproducts of the procedure are estimated trajectories for all three longitudinal series and their respective prediction intervals. In particular, Figure 6.6 shows the data as points, the estimated smoothed values x̂_t^{(n)} as solid lines, and error bounds, ±2√P̂_t^{(n)}, as a gray swatch.

In the following R code we use the script EM1. In this case the observation matrices A_t are either the identity or the zero matrix because all the series are either observed or not observed.
y = cbind(WBC, PLT, HCT); num = nrow(y)
A = array(0, dim=c(3,3,num))   # make array of obs matrices
for(k in 1:num) { if (y[k,1] > 0) A[,,k]= diag(1,3) }
# Initial values
mu0 = matrix(0, 3, 1); Sigma0 = diag(c(.1, .1, 1), 3)
Phi = diag(1, 3); cQ = diag(c(.1, .1, 1), 3); cR = diag(c(.1, .1, 1), 3)
# EM procedure - some output previously shown

Fig. 6.6. Smoothed values for various components in the blood parameter tracking problem. The actual data are shown as points, the smoothed values are shown as solid lines, and ±2 standard error bounds are shown as a gray swatch; tick marks indicate days with no observation.

(em = EM1(num, y, A, mu0, Sigma0, Phi, cQ, cR, 100, .001))
# Graph smoother
ks = Ksmooth1(num, y, A, em$mu0, em$Sigma0, em$Phi, 0, 0, chol(em$Q), chol(em$R), 0)
y1s = ks$xs[1,,]; y2s = ks$xs[2,,]; y3s = ks$xs[3,,]
p1 = 2*sqrt(ks$Ps[1,1,]); p2 = 2*sqrt(ks$Ps[2,2,]); p3 = 2*sqrt(ks$Ps[3,3,])
par(mfrow=c(3,1))
plot(WBC, type='p', pch=19, ylim=c(1,5), xlab='day')
lines(y1s); lines(y1s+p1, lty=2, col=4); lines(y1s-p1, lty=2, col=4)
plot(PLT, type='p', pch=19, ylim=c(3,6), xlab='day')
lines(y2s); lines(y2s+p2, lty=2, col=4); lines(y2s-p2, lty=2, col=4)
plot(HCT, type='p', pch=19, ylim=c(20,40), xlab='day')
lines(y3s); lines(y3s+p3, lty=2, col=4); lines(y3s-p3, lty=2, col=4)

6.5 Structural Models: Signal Extraction and Forecasting

Structural models are component models in which each component may be thought of as explaining a specific type of behavior. The models are often some version of the classical time series decomposition of data into trend, seasonal, and irregular components. Consequently, each component has a direct interpretation as to the nature of the variation in the data. Furthermore, the model fits into the state space framework quite easily. To illustrate these ideas, we consider an example that shows how to fit a sum of trend, seasonal, and irregular components to the quarterly earnings data that we have considered before.

Example 6.10 Johnson & Johnson Quarterly Earnings
Here, we focus on the quarterly earnings series from the U.S. company Johnson & Johnson as displayed in Figure 1.1. The series is highly nonstationary, and there is both a trend signal that is gradually increasing over time and a seasonal component that cycles every four quarters or once per year. The seasonal component is getting larger over time as well. Transforming into logarithms or even taking the nth root does not seem to make the series trend stationary; however, such a transformation does help with stabilizing the variance over time; this is explored in Problem 6.13.

Suppose, for now, we consider the series to be the sum of a trend component, a seasonal component, and a white noise. That is, let the observed series be expressed as

    y_t = T_t + S_t + v_t,    (6.92)

where T_t is trend and S_t is the seasonal component. Suppose we allow the trend to increase exponentially; that is,

    T_t = φ T_{t−1} + w_{t1},    (6.93)

where the coefficient φ > 1 characterizes the increase. Let the seasonal component be modeled as

    S_t + S_{t−1} + S_{t−2} + S_{t−3} = w_{t2},    (6.94)

which corresponds to assuming the component is expected to sum to zero over a complete period or four quarters. To express this model in state-space form, let x_t = (T_t, S_t, S_{t−1}, S_{t−2})' be the state vector, so the observation equation (6.2) can be written as

    y_t = [ 1  1  0  0 ] x_t + v_t,

with the state equation written as

    ( T_t ; S_t ; S_{t−1} ; S_{t−2} ) = [ φ 0 0 0 ; 0 −1 −1 −1 ; 0 1 0 0 ; 0 0 1 0 ] ( T_{t−1} ; S_{t−1} ; S_{t−2} ; S_{t−3} ) + ( w_{t1} ; w_{t2} ; 0 ; 0 ),

where R = r_{11} and

    Q = [ q_{11} 0 0 0 ; 0 q_{22} 0 0 ; 0 0 0 0 ; 0 0 0 0 ].

The model reduces to state-space form, (6.1) and (6.2), with p = 4 and q = 1. The parameters to be estimated are r_{11}, the noise variance in the measurement equations, q_{11} and q_{22}, the model variances corresponding to the trend and seasonal components, and φ, the transition parameter that models the growth rate.
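As a check on the state-space form, the model matrices can be written out directly. The following small sketch is only an illustration of the setup, with the parameters set at the rough starting values mentioned below; the fitting code that follows rebuilds the same matrices inside the likelihood function.

# Sketch of the structural-model matrices for (6.92)-(6.94);
# phi, q11, q22, r11 here are only the starting values used in the fit below.
phi = 1.03; q11 = .01; q22 = .01; r11 = .25
Phi = rbind(c(phi, 0,  0,  0),   # trend:    T_t = phi*T_{t-1} + w_{t1}
            c(0,  -1, -1, -1),   # seasonal: S_t = -S_{t-1}-S_{t-2}-S_{t-3} + w_{t2}
            c(0,   1,  0,  0),   # shift S_{t-1}
            c(0,   0,  1,  0))   # shift S_{t-2}
A = matrix(c(1, 1, 0, 0), 1, 4)  # y_t = T_t + S_t + v_t
Q = diag(c(q11, q22, 0, 0))      # only the first two states carry noise
R = r11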

Fig. 6.7. Estimated trend component, T_t^n, and seasonal component, S_t^n, of the Johnson and Johnson quarterly earnings series. Gray areas are three root MSE bounds.

Growth is about 3% per year, and we began with φ = 1.03. The initial mean was fixed at μ_0 = (.7, 0, 0, 0)', with uncertainty modeled by the diagonal covariance matrix with Σ_{0,ii} = .04, for i = 1, ..., 4. Initial state covariance values were taken as q_{11} = .01, q_{22} = .01. The measurement error covariance was started at r_{11} = .25.

After about 20 iterations of Newton–Raphson, the transition parameter estimate was φ̂ = 1.035, corresponding to exponential growth with inflation at about 3.5% per year. The measurement uncertainty was small at √r̂_{11} = .0005, compared with the model uncertainties √q̂_{11} = .1397 and √q̂_{22} = .2209. Figure 6.7 shows the smoothed trend estimate and the exponentially increasing seasonal components.

We may also consider forecasting the Johnson & Johnson series, and the result of a 12-quarter forecast is shown in Figure 6.8 as basically an extension of the latter part of the observed data.

This example uses the Kfilter0 and Ksmooth0 scripts as follows.
num = length(jj)
A = cbind(1,1,0,0)
# Function to Calculate Likelihood
Linn = function(para){
  Phi = diag(0,4); Phi[1,1] = para[1]
  Phi[2,]=c(0,-1,-1,-1); Phi[3,]=c(0,1,0,0); Phi[4,]=c(0,0,1,0)
  cQ1 = para[2]; cQ2 = para[3]     # sqrt q11 and q22
  cQ = diag(0,4); cQ[1,1]=cQ1; cQ[2,2]=cQ2
  cR = para[4]                     # sqrt r11
  kf = Kfilter0(num, jj, A, mu0, Sigma0, Phi, cQ, cR)

  return(kf$like)
}
# Initial Parameters
mu0 = c(.7,0,0,0); Sigma0 = diag(.04,4)
init.par = c(1.03,.1,.1,.5)   # Phi[1,1], the 2 cQs and cR
# Estimation and Results
est = optim(init.par, Linn, NULL, method='BFGS', hessian=TRUE,
      control=list(trace=1,REPORT=1))
SE = sqrt(diag(solve(est$hessian)))
u = cbind(estimate=est$par, SE)
rownames(u)=c('Phi11','sigw1','sigw2','sigv'); u
# Smooth
Phi = diag(0,4); Phi[1,1] = est$par[1]
Phi[2,]=c(0,-1,-1,-1); Phi[3,]=c(0,1,0,0); Phi[4,]=c(0,0,1,0)
cQ1 = est$par[2]; cQ2 = est$par[3]
cQ = diag(1,4); cQ[1,1]=cQ1; cQ[2,2]=cQ2
cR = est$par[4]
ks = Ksmooth0(num,jj,A,mu0,Sigma0,Phi,cQ,cR)
# Plots
Tsm = ts(ks$xs[1,,], start=1960, freq=4)
Ssm = ts(ks$xs[2,,], start=1960, freq=4)
p1 = 3*sqrt(ks$Ps[1,1,]); p2 = 3*sqrt(ks$Ps[2,2,])
par(mfrow=c(2,1))
plot(Tsm, main='Trend Component', ylab='Trend')
xx = c(time(jj), rev(time(jj)))
yy = c(Tsm-p1, rev(Tsm+p1))
polygon(xx, yy, border=NA, col=gray(.5, alpha = .3))
plot(jj, main='Data & Trend+Season', ylab='J&J QE/Share', ylim=c(-.5,17))
xx = c(time(jj), rev(time(jj)) )
yy = c((Tsm+Ssm)-(p1+p2), rev((Tsm+Ssm)+(p1+p2)) )
polygon(xx, yy, border=NA, col=gray(.5, alpha = .3))
# Forecast
n.ahead = 12; y = ts(append(jj, rep(0,n.ahead)), start=1960, freq=4)
rmspe = rep(0,n.ahead); x00 = ks$xf[,,num]; P00 = ks$Pf[,,num]
Q = t(cQ)%*%cQ; R = t(cR)%*%(cR)
for (m in 1:n.ahead){
  xp = Phi%*%x00; Pp = Phi%*%P00%*%t(Phi)+Q
  sig = A%*%Pp%*%t(A)+R; K = Pp%*%t(A)%*%(1/sig)
  x00 = xp; P00 = Pp-K%*%A%*%Pp
  y[num+m] = A%*%xp; rmspe[m] = sqrt(sig)
}
plot(y, type='o', main='', ylab='J&J QE/Share', ylim=c(5,30), xlim=c(1975,1984))
upp = ts(y[(num+1):(num+n.ahead)]+2*rmspe, start=1981, freq=4)
low = ts(y[(num+1):(num+n.ahead)]-2*rmspe, start=1981, freq=4)
xx = c(time(low), rev(time(upp)))
yy = c(low, rev(upp))
polygon(xx, yy, border=8, col=gray(.5, alpha = .3))
abline(v=1981, lty=3)

Note that the Cholesky decomposition of Q does not exist here; however, the diagonal form allows us to use standard deviations for the first two diagonal elements of cQ. This technicality can be avoided using a form of the model that we present in the next section.

Fig. 6.8. A 12-quarter forecast for the Johnson & Johnson quarterly earnings series. The forecasts are shown as a continuation of the data (points connected by a solid line). The gray area represents two root MSPE bounds.

6.6 State-Space Models with Correlated Errors

Sometimes it is advantageous to write the state-space model in a slightly different way, as is done by numerous authors; for example, Anderson and Moore (1979) and Hannan and Deistler (1988). Here, we write the state-space model as

    x_{t+1} = Φ x_t + Υ u_{t+1} + Θ w_t,   t = 0, 1, ..., n,    (6.95)

    y_t = A_t x_t + Γ u_t + v_t,   t = 1, ..., n,    (6.96)

where, in the state equation, x_0 ~ N_p(μ_0, Σ_0), Φ is p × p, Υ is p × r, Θ is p × m, and w_t ~ iid N_m(0, Q). In the observation equation, A_t is q × p and Γ is q × r, and v_t ~ iid N_q(0, R). In this model, while w_t and v_t are still white noise series (both independent of x_0), we also allow the state noise and observation noise to be correlated at time t; that is,

    cov(w_s, v_t) = S δ_s^t,    (6.97)

where δ_s^t is Kronecker's delta; note that S is an m × q matrix. The major difference between this form of the model and the one specified by (6.3)–(6.4) is that this model starts the state noise process at t = 0 in order to ease the notation related to the concurrent covariance between w_t and v_t. Also, the inclusion of the matrix Θ allows us to avoid using a singular state noise process as was done in Example 6.10.

To obtain the innovations, ε_t = y_t − A_t x_t^{t−1} − Γ u_t, and the innovation variance Σ_t = A_t P_t^{t−1} A_t' + R, in this case, we need the one-step-ahead state predictions. Of course, the filtered estimates will also be of interest, and they will be needed for smoothing. Property 6.2 (the smoother) as displayed in Section 6.2 still holds. The following property generates the predictor x_{t+1}^t from the past predictor x_t^{t−1} when the noise terms are correlated and exhibits the filter update.

Property 6.5 The Kalman Filter with Correlated Noise
For the state-space model specified in (6.95) and (6.96), with initial conditions x_1^0 and P_1^0, for t = 1, ..., n,

    x_{t+1}^t = Φ x_t^{t−1} + Υ u_{t+1} + K_t ε_t,    (6.98)

    P_{t+1}^t = Φ P_t^{t−1} Φ' + Θ Q Θ' − K_t Σ_t K_t',    (6.99)

where ε_t = y_t − A_t x_t^{t−1} − Γ u_t and the gain matrix is given by

    K_t = [ Φ P_t^{t−1} A_t' + Θ S ] [ A_t P_t^{t−1} A_t' + R ]^{−1}.    (6.100)

The filter values are given by

    x_t^t = x_t^{t−1} + P_t^{t−1} A_t' [ A_t P_t^{t−1} A_t' + R ]^{−1} ε_t,    (6.101)

    P_t^t = P_t^{t−1} − P_t^{t−1} A_t' [ A_t P_t^{t−1} A_t' + R ]^{−1} A_t P_t^{t−1}.    (6.102)

The derivation of Property 6.5 is similar to the derivation of the Kalman filter in Property 6.1 (Problem 6.17); we note that the gain matrix K_t differs in the two properties. The filter values, (6.101)–(6.102), are symbolically identical to (6.18) and (6.19). To initialize the filter, we note that

    x_1^0 = E(x_1) = Φ μ_0 + Υ u_1,   and   P_1^0 = var(x_1) = Φ Σ_0 Φ' + Θ Q Θ'.

In the next two subsections, we show how to use the model (6.95)–(6.96) for fitting ARMAX models and for fitting (multivariate) regression models with autocorrelated errors. To put it succinctly, for ARMAX models, the inputs enter in the state equation and for regression with autocorrelated errors, the inputs enter in the observation equation. It is, of course, possible to combine the two models and we give an example of this at the end of the section.

6.6.1 ARMAX Models

Consider a k-dimensional ARMAX model given by

    y_t = Υ u_t + Σ_{j=1}^p Φ_j y_{t−j} + Σ_{k=1}^q Θ_k v_{t−k} + v_t.    (6.103)

The observations y_t are a k-dimensional vector process, the Φs and Θs are k × k matrices, Υ is k × r, u_t is the r × 1 input, and v_t is a k × 1 white noise process; in fact, (6.103) and (5.91) are identical models, but here, we have written the observations as y_t. We now have the following property.

Property 6.6 A State-Space Form of ARMAX
For p ≥ q, let

    F = [ Φ_1  I  0  ⋯  0 ; Φ_2  0  I  ⋯  0 ; ⋮ ; Φ_{p−1}  0  0  ⋯  I ; Φ_p  0  0  ⋯  0 ],
    G = [ Φ_1 + Θ_1 ; ⋮ ; Φ_q + Θ_q ; Φ_{q+1} ; ⋮ ; Φ_p ],
    H = [ Υ ; 0 ; ⋮ ; 0 ],    (6.104)

where F is kp × kp, G is kp × k, and H is kp × r. Then, the state-space model given by

    x_{t+1} = F x_t + H u_{t+1} + G v_t,    (6.105)

    y_t = A x_t + v_t,    (6.106)

where A = [ I, 0, ..., 0 ] is k × pk and I is the k × k identity matrix, implies the ARMAX model (6.103). If p < q, set Φ_{p+1} = ⋯ = Φ_q = 0, in which case p = q and (6.105)–(6.106) still apply. Note that the state process is kp-dimensional, whereas the observations are k-dimensional.

We do not prove Property 6.6 directly, but the following example should suggest how to establish the general result.

Example 6.11 Univariate ARMAX(1, 1) in State-Space Form
Consider the univariate ARMAX(1, 1) model

    y_t = α_t + φ y_{t−1} + θ v_{t−1} + v_t,

where α_t = Υ u_t to ease the notation. For a simple example, if Υ = (β_0, β_1) and u_t = (1, t)', the model for y_t would be ARMA(1,1) with linear trend, y_t = β_0 + β_1 t + φ y_{t−1} + θ v_{t−1} + v_t. Using Property 6.6, we can write the model as

    x_{t+1} = φ x_t + α_{t+1} + (θ + φ) v_t,    (6.107)

and

    y_t = x_t + v_t.    (6.108)

In this case, (6.107) is the state equation with w_t ≡ v_t and (6.108) is the observation equation. Consequently, cov(w_t, v_t) = var(v_t) = R, and cov(w_t, v_s) = 0 when s ≠ t, so Property 6.5 would apply. To verify (6.107) and (6.108) specify an ARMAX(1, 1) model, we have

    y_t = x_t + v_t                                   from (6.108)
        = φ x_{t−1} + α_t + (θ + φ) v_{t−1} + v_t      from (6.107)
        = α_t + φ (x_{t−1} + v_{t−1}) + θ v_{t−1} + v_t   rearrange terms
        = α_t + φ y_{t−1} + θ v_{t−1} + v_t            from (6.108).
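A small sketch of how (6.104) might be assembled in R for the univariate case (k = 1) is given below; the function name and the argument names phi, theta, and Ups are hypothetical, denoting the AR coefficients, the MA coefficients, and the 1 × r input coefficient matrix, with p ≥ q assumed as in the statement of the property.

# Sketch: build F (p x p), G (p x 1), H (p x r), and A of (6.104)-(6.106)
# for a univariate ARMAX; assumes length(phi) >= length(theta).
armax.FGH = function(phi, theta, Ups){
  p = length(phi); q = length(theta); Ups = rbind(Ups)   # ensure 1 x r
  F = matrix(0, p, p); F[, 1] = phi                      # Phi_j down the first column
  if (p > 1) F[1:(p-1), 2:p] = diag(1, p-1)              # shifted identity block
  G = matrix(phi + c(theta, rep(0, p - q)), p, 1)        # Phi_j + Theta_j (zero for j > q)
  H = rbind(Ups, matrix(0, p-1, ncol(Ups)))              # inputs enter the first block only
  A = matrix(c(1, rep(0, p-1)), 1, p)                    # y_t = A x_t + v_t
  list(F=F, G=G, H=H, A=A)
}
# e.g., an ARMAX(1,1) with a linear trend input, as in Example 6.11
# (coefficient values here are purely illustrative):
# armax.FGH(phi=.8, theta=.4, Ups=c(1, .05))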

Together, Property 6.5 and Property 6.6 can be used to accomplish maximum likelihood estimation as described in Section 6.3 for ARMAX models. The ARMAX model is only a special case of the model (6.95)–(6.96), which is quite rich, as will be discovered in the next subsection.

6.6.2 Multivariate Regression with Autocorrelated Errors

In regression with autocorrelated errors, we are interested in fitting the regression model

    y_t = Γ u_t + ε_t    (6.109)

to a k × 1 vector process, y_t, with r regressors u_t = (u_{t1}, ..., u_{tr})', where ε_t is vector ARMA(p, q) and Γ is a k × r matrix of regression parameters. We note that the regressors do not have to vary with time (e.g., u_{t1} ≡ 1 includes a constant in the regression) and that the case k = 1 was treated in Section 3.8.

To put the model in state-space form, we simply notice that ε_t = y_t − Γ u_t is a k-dimensional ARMA(p, q) process. Thus, if we set H = 0 in (6.105), and include Γ u_t in (6.106), we obtain

    x_{t+1} = F x_t + G v_t,    (6.110)

    y_t = Γ u_t + A x_t + v_t,    (6.111)

where the model matrices A, F, and G are defined in Property 6.6. The fact that (6.110)–(6.111) is multivariate regression with autocorrelated errors follows directly from Property 6.6 by noticing that together, x_{t+1} = F x_t + G v_t and ε_t = A x_t + v_t imply that ε_t = y_t − Γ u_t is vector ARMA(p, q).

As in the case of ARMAX models, regression with autocorrelated errors is a special case of the state-space model, and the results of Property 6.5 can be used to obtain the innovations form of the likelihood for parameter estimation.

Example 6.12 Mortality, Temperature and Pollution
This example combines both techniques of Section 6.6.1 and Section 6.6.2. We will fit an ARMAX model to the detrended mortality series cmort. The detrending part of the example constitutes the regression with autocorrelated errors.

Here, we let M_t denote the weekly cardiovascular mortality series, T_t the corresponding temperature series tempr, and P_t the corresponding particulate series. A preliminary analysis suggests the following considerations (no output is shown):

• An AR(2) model fits well to detrended M_t:
  fit1 = sarima(cmort, 2,0,0, xreg=time(cmort))
• The CCF between the mortality residuals, the temperature series and the particulates series shows a strong correlation with temperature lagged one week (T_{t−1}), concurrent particulate level (P_t), and the particulate level about one month prior (P_{t−4}):
  acf(cbind(dmort <- resid(fit1$fit), tempr, part))
  lag2.plot(tempr, dmort, 8)
  lag2.plot(part, dmort, 8)

From these results, we decided to fit the ARMAX model

    M̃_t = φ_1 M̃_{t−1} + φ_2 M̃_{t−2} + β_1 T_{t−1} + β_2 P_t + β_3 P_{t−4} + v_t    (6.112)

to the detrended mortality series, M̃_t = M_t − (α + β_4 t), where v_t ~ iid N(0, σ_v²). To write the model in state-space form using Property 6.6, let

    x_{t+1} = Φ x_t + Υ u_{t+1} + Θ v_t,   t = 0, 1, ..., n,
    y_t = A x_t + Γ u_t + v_t,   t = 1, ..., n,

with

    Φ = [ φ_1  1 ; φ_2  0 ],   Θ = ( φ_1 ; φ_2 ),   Υ = [ β_1  β_2  β_3  0  0 ; 0  0  0  0  0 ],
    Γ = [ 0  0  0  β_4  α ],   A = [ 1  0 ],   u_t = ( T_{t−1}, P_t, P_{t−4}, t, 1 )',   y_t = M_t.

Note that the state process is bivariate and the observation process is univariate.

Some additional data analysis notes are: (1) Time is centered as t − t̄; in this case, α should be close to the average value of M_t. (2) P_t and P_{t−4} are highly correlated, so orthogonalizing these two inputs would be advantageous (although we did not do it here), perhaps by partialling out P_{t−4} from P_t using simple linear regression. (3) T_t and T_t², as in Chapter 2, are not needed in the model when T_{t−1} is included. (4) Initial values of the parameters are taken from a preliminary investigation that we discuss now.

A quick and dirty method for fitting the model is to first detrend cmort and then fit (6.112) using lm on the detrended series. Rather than use lm in the second phase, we use sarima because it also provides a thorough analysis of the residuals. The code for this run is quite simple; the residual analysis (not displayed) supports the model.
trend = time(cmort) - mean(time(cmort))   # center time
dcmort = resid(fit2 <- lm(cmort~trend, na.action=NULL)); fit2
 (Intercept)    trend
      88.699   -1.625
u = ts.intersect(dM=dcmort, dM1=lag(dcmort,-1), dM2=lag(dcmort,-2),
      T1=lag(tempr,-1), P=part, P4=lag(part,-4))
# lm(dM ~ ., data=u, na.action=NULL)   # and then analyze residuals ... or
sarima(u[,1], 0,0,0, xreg=u[,2:6])     # get residual analysis as a byproduct
Coefficients:
      intercept     dM1     dM2      T1       P      P4
         5.9884  0.3164  0.2989 -0.1826  0.1107  0.0495
s.e.     2.6401  0.0370  0.0395  0.0309  0.0177  0.0195
sigma^2 estimated as 25.42

We can now use Newton–Raphson and the Kalman filter to fit all the parameters simultaneously because the quick method has given us reasonable starting values. The results are close to the quick and dirty method:
        estimate    SE
phi1       0.315 0.037   # phi_1
phi2       0.318 0.041   # phi_2
sigv       5.061 0.161   # sigma_v
T1        -0.119 0.031   # beta_1

        estimate    SE
P          0.119 0.018   # beta_2
P4         0.067 0.019   # beta_3
trend     -1.340 0.220   # beta_4
constant  88.752 7.015   # alpha

The R code for the complete analysis is as follows:
trend = time(cmort) - mean(time(cmort))   # center time
const = time(cmort)/time(cmort)           # appropriate time series of 1s
ded = ts.intersect(M=cmort, T1=lag(tempr,-1), P=part, P4=lag(part,-4), trend, const)
y = ded[,1]
input = ded[,2:6]
num = length(y)
A = array(c(1,0), dim = c(1,2,num))
# Function to Calculate Likelihood
Linn=function(para){
  phi1=para[1]; phi2=para[2]; cR=para[3]; b1=para[4]
  b2=para[5]; b3=para[6]; b4=para[7]; alf=para[8]
  mu0 = matrix(c(0,0), 2, 1)
  Sigma0 = diag(100, 2)
  Phi = matrix(c(phi1, phi2, 1, 0), 2)
  Theta = matrix(c(phi1, phi2), 2)
  Ups = matrix(c(b1, 0, b2, 0, b3, 0, 0, 0, 0, 0), 2, 5)
  Gam = matrix(c(0, 0, 0, b4, alf), 1, 5); cQ = cR; S = cR^2
  kf = Kfilter2(num, y, A, mu0, Sigma0, Phi, Ups, Gam, Theta, cQ, cR, S, input)
  return(kf$like)
}
# Estimation
init.par = c(phi1=.3, phi2=.3, cR=5, b1=-.2, b2=.1, b3=.05, b4=-1.6,
      alf=mean(cmort))                   # initial parameters
L = c( 0,  0,  1, -1,  0,  0, -2, 70)    # lower bound on parameters
U = c(.5, .5, 10,  0, .5, .5,  0, 90)    # upper bound - used in optim
est = optim(init.par, Linn, NULL, method='L-BFGS-B', lower=L, upper=U,
      hessian=TRUE, control=list(trace=1, REPORT=1, factr=10^8))
SE = sqrt(diag(solve(est$hessian)))
round(cbind(estimate=est$par, SE), 3)    # results

The residual analysis involves running the Kalman filter with the final estimated values and then investigating the resulting innovations. We do not display the results, but the analysis supports the model.
# Residual Analysis (not shown)
phi1 = est$par[1]; phi2 = est$par[2]
cR = est$par[3]; b1 = est$par[4]
b2 = est$par[5]; b3 = est$par[6]
b4 = est$par[7]; alf = est$par[8]
mu0 = matrix(c(0,0), 2, 1); Sigma0 = diag(100, 2)
Phi = matrix(c(phi1, phi2, 1, 0), 2)
Theta = matrix(c(phi1, phi2), 2)
Ups = matrix(c(b1, 0, b2, 0, b3, 0, 0, 0, 0, 0), 2, 5)
Gam = matrix(c(0, 0, 0, b4, alf), 1, 5)
cQ = cR
S = cR^2
kf = Kfilter2(num, y, A, mu0, Sigma0, Phi, Ups, Gam, Theta, cQ, cR, S, input)
res = ts(as.vector(kf$innov), start=start(cmort), freq=frequency(cmort))
sarima(res, 0,0,0, no.constant=TRUE)   # gives a full residual analysis

Finally, a similar and simpler analysis can be fit using a complete ARMAX model. In this case the model would be

    M_t = α + φ_1 M_{t−1} + φ_2 M_{t−2} + β_1 T_{t−1} + β_2 P_t + β_3 P_{t−4} + β_4 t + v_t,    (6.113)

where v_t ~ iid N(0, σ_v²). This model is different from (6.112) in that the mortality process is not detrended, but trend appears as an exogenous variable. In this case, we may use sarima to easily perform the regression and get the residual analysis as a byproduct.
trend = time(cmort) - mean(time(cmort))
u = ts.intersect(M=cmort, M1=lag(cmort,-1), M2=lag(cmort,-2), T1=lag(tempr,-1),
      P=part, P4=lag(part,-4), trend)
sarima(u[,1], 0,0,0, xreg=u[,2:7])   # could use lm, but it's more work
Coefficients:
      intercept     M1      M2      T1       P      P4    trend
        40.3838  0.315  0.2971 -0.1845  0.1113  0.0513 -0.5214
s.e.     4.5982  0.037  0.0394  0.0309  0.0177  0.0195  0.0956
sigma^2 estimated as 25.32
We note that the residuals look fine, and the model fit is similar to the fit of (6.112).

6.7 Bootstrapping State Space Models

Although in Section 6.3 we discussed the fact that under general conditions (which we assume to hold in this section) the MLEs of the parameters of a DLM are consistent and asymptotically normal, time series data are often of short or moderate length. Several researchers have found evidence that samples must be fairly large before asymptotic results are applicable (Dent and Min, 1978; Ansley and Newbold, 1980). Moreover, as we discussed in Example 3.36, problems occur if the parameters are near the boundary of the parameter space. In this section, we discuss an algorithm for bootstrapping state space models; this algorithm and its justification, including the non-Gaussian case, along with numerous examples, can be found in Stoffer and Wall (1991) and in Stoffer and Wall (2004). In view of Section 6.6, anything we do or say here about DLMs applies equally to ARMAX models.

Using the DLM given by (6.95)–(6.97) and Property 6.5, we write the innovations form of the filter as

    ε_t = y_t − A_t x_t^{t−1} − Γ u_t,    (6.114)
    Σ_t = A_t P_t^{t−1} A_t' + R,    (6.115)
    K_t = [ Φ P_t^{t−1} A_t' + Θ S ] Σ_t^{−1},    (6.116)
    x_{t+1}^t = Φ x_t^{t−1} + Υ u_{t+1} + K_t ε_t,    (6.117)
    P_{t+1}^t = Φ P_t^{t−1} Φ' − K_t Σ_t K_t' + Θ Q Θ'.    (6.118)

This form of the filter is just a rearrangement of the filter given in Property 6.5. In addition, we can rewrite the model to obtain its innovations form,

    x_{t+1}^t = Φ x_t^{t−1} + Υ u_{t+1} + K_t ε_t,    (6.119)
    y_t = A_t x_t^{t−1} + Γ u_t + ε_t.    (6.120)

This form of the model is a rewriting of (6.114) and (6.117), and it accommodates the bootstrapping algorithm.

As discussed in Example 6.5, although the innovations ε_t are uncorrelated, initially, Σ_t can be vastly different for different time points t. Thus, in a resampling procedure, we can either ignore the first few values of ε_t until Σ_t stabilizes or we can work with the standardized innovations

    e_t = Σ_t^{−1/2} ε_t,    (6.121)

so we are guaranteed these innovations have, at least, the same first two moments. In (6.121), Σ_t^{1/2} denotes the unique square root matrix of Σ_t defined by Σ_t^{1/2} Σ_t^{1/2} = Σ_t. In what follows, we base the bootstrap procedure on the standardized innovations, but we stress the fact that, even in this case, ignoring startup values might be necessary, as noted by Stoffer and Wall (1991).

The model coefficients and the correlation structure of the model are uniquely parameterized by a k × 1 parameter vector Θ_0; that is, Φ = Φ(Θ_0), Υ = Υ(Θ_0), Q = Q(Θ_0), A_t = A_t(Θ_0), Γ = Γ(Θ_0), and R = R(Θ_0). Recall the innovations form of the Gaussian likelihood (ignoring a constant) is

    −2 ln L_Y(Θ) = Σ_{t=1}^n [ ln|Σ_t(Θ)| + ε_t(Θ)' Σ_t(Θ)^{−1} ε_t(Θ) ]
                 = Σ_{t=1}^n [ ln|Σ_t(Θ)| + e_t(Θ)' e_t(Θ) ].    (6.122)

We stress the fact that it is not necessary for the model to be Gaussian to consider (6.122) as the criterion function to be used for parameter estimation.

Let Θ̂ denote the MLE of Θ_0, that is, Θ̂ = argmax_Θ L_Y(Θ), obtained by the methods discussed in Section 6.3. Let ε_t(Θ̂) and Σ_t(Θ̂) be the innovation values obtained by running the filter, (6.114)–(6.118), under Θ̂. Once this has been done, the nonparametric bootstrap procedure is accomplished by the following steps.[6.2]

(i) Construct the standardized innovations

    e_t(Θ̂) = Σ_t^{−1/2}(Θ̂) ε_t(Θ̂).

(ii) Sample, with replacement, n times from the set {e_1(Θ̂), ..., e_n(Θ̂)} to obtain {e_1^*(Θ̂), ..., e_n^*(Θ̂)}, a bootstrap sample of standardized innovations.

(iii) Construct a bootstrap data set {y_1^*, ..., y_n^*} as follows. Define the (p + q) × 1 vector ξ_t = (x_{t+1}^{t'}, y_t')'. Stacking (6.119) and (6.120) results in a vector first-order equation for ξ_t given by

[6.2] Nonparametric refers to the fact that we use the empirical distribution of the innovations rather than assuming they have a parametric form.

Fig. 6.9. Quarterly interest rate for Treasury bills (dashed line) and quarterly inflation rate (solid line) in the Consumer Price Index.

    ξ_t = F_t ξ_{t−1} + G u_t + H_t e_t,    (6.123)

where

    F_t = [ Φ  0 ; A_t  0 ],   G = [ Υ ; Γ ],   H_t = [ K_t Σ_t^{1/2} ; Σ_t^{1/2} ].

Thus, to construct the bootstrap data set, solve (6.123) using e_t^* in place of e_t(Θ̂). The exogenous variables u_t and the initial conditions of the Kalman filter remain fixed at their given values, and the parameter vector is held fixed at Θ̂.

(iv) Using the bootstrap data set y_{1:n}^*, construct a likelihood, L_{Y^*}(Θ), and obtain the MLE of Θ, say, Θ̂^*.

(v) Repeat steps 2 through 4, a large number, B, of times, obtaining a bootstrapped set of parameter estimates {Θ̂_b^*; b = 1, ..., B}. The finite sample distribution of Θ̂ − Θ_0 may be approximated by the distribution of Θ̂_b^* − Θ̂, b = 1, ..., B.

In the next example, we discuss the case of a linear regression model, but where the regression coefficients are stochastic and allowed to vary with time. The state space model provides a convenient setting for the analysis of such models.

Example 6.13 Stochastic Regression
Figure 6.9 shows the quarterly inflation rate (solid line), y_t, in the Consumer Price Index and the quarterly interest rate recorded for Treasury bills (dashed line), z_t, from the first quarter of 1953 through the second quarter of 1980, n = 110 observations. These data are taken from Newbold and Bos (1985).

In this example, we consider one analysis that was discussed in Newbold and Bos (1985, pp. 61–73), that focused on the first 50 observations and where quarterly inflation was modeled as being stochastically related to quarterly interest rate,

    y_t = α + β_t z_t + v_t,

Table 6.2. Comparison of Standard Errors
Parameter    MLE    Asymptotic SE    Bootstrap SE
φ           .865        .223             .463
α          −.686        .487             .557
b           .788        .226             .821
σ_w         .115        .107             .216
σ_v        1.135        .147             .340

where β_t is a stochastic regression coefficient, α is a fixed constant, and v_t is white noise with variance σ_v². The stochastic regression term, which comprises the state variable, is specified by a first-order autoregression,

    (β_t − b) = φ (β_{t−1} − b) + w_t,

where b is a constant, and w_t is white noise with variance σ_w². The noise processes, v_t and w_t, are assumed to be uncorrelated.

Using the notation of the state-space model (6.95) and (6.96), we have, in the state equation, x_t = β_t, Φ = φ, u_t ≡ 1, Υ = (1 − φ)b, Q = σ_w², and, in the observation equation, A_t = z_t, Γ = α, R = σ_v², and S = 0. The parameter vector is Θ = (φ, α, b, σ_w, σ_v)'. The results of the Newton–Raphson estimation procedure are listed in Table 6.2. Also shown in Table 6.2 are the corresponding standard errors obtained from B = 500 runs of the bootstrap. These standard errors are simply the standard deviations of the bootstrapped estimates, that is, the square root of Σ_{b=1}^B (Θ̂_{ib}^* − Θ̂_i)²/(B − 1), where Θ̂_i represents the MLE of the ith parameter, Θ_i, for i = 1, ..., 5.

The asymptotic standard errors listed in Table 6.2 are typically much smaller than those obtained from the bootstrap. For most of the cases, the bootstrapped standard errors are at least 50% larger than the corresponding asymptotic value. Also, asymptotic theory prescribes the use of normal theory when dealing with the parameter estimates. The bootstrap, however, allows us to investigate the small sample distribution of the estimators and, hence, provides more insight into the data analysis.

For example, Figure 6.10 shows the bootstrap distribution of the estimator of φ in the upper left-hand corner. This distribution is highly skewed with values concentrated around .8, but with a long tail to the left. Some quantiles are −.09 (5%), .11 (10%), .34 (25%), .73 (50%), .86 (75%), .96 (90%), .98 (95%), and they can be used to obtain confidence intervals. For example, a 90% confidence interval for φ would be approximated by (−.09, .96). This interval is ridiculously wide and includes 0 as a plausible value of φ; we will interpret this after we discuss the results of the estimation of σ_w.

Figure 6.10 shows the bootstrap distribution of σ̂_w in the lower right-hand corner. The distribution is concentrated at two locations, one at approximately σ̂_w = .25 (which is the median of the distribution of values away from 0) and the other at σ̂_w = 0.

Fig. 6.10. Joint and marginal bootstrap distributions, B = 500, of φ̂* and σ̂*_w. Only the values corresponding to φ̂* ≥ 0 are shown.

The cases in which σ̂_w ≈ 0 correspond to deterministic state dynamics. When σ_w = 0 and |φ| < 1, then β_t ≈ b for large t, so the approximately 25% of the cases in which σ̂_w ≈ 0 suggest a fixed state, or constant coefficient, model. The cases in which σ̂_w is away from zero would suggest a truly stochastic regression parameter. To investigate this matter further, the off-diagonals of Figure 6.10 show the joint bootstrapped estimates, ( φ̂*, σ̂*_w ), for positive values of φ̂*. The joint distribution suggests σ̂*_w > 0 corresponds to φ̂* ≈ 0. When φ = 0, the state dynamics are given by β_t = b + w_t. If, in addition, σ_w is small relative to b, the system is nearly deterministic; that is, β_t ≈ b. Considering these results, the bootstrap analysis leads us to conclude that the dynamics of the data are best described in terms of a fixed regression effect.

The following R code was used for this example. We note that the first few lines of the code set the relative tolerance for determining convergence of the numerical optimization and the number of bootstrap replications. Using the current settings may result in a long run time, so we suggest the tolerance and the number of bootstrap replicates be decreased on slower machines or for demonstration purposes; for example, setting tol=.001 and nboot=200 yields reasonable results. In this example, we fixed the first three values of the data for the resampling scheme.

library(plyr)                      # used for displaying progress
tol = sqrt(.Machine$double.eps)    # determines convergence of optimizer

nboot = 500                                # number of bootstrap replicates
y = window(qinfl, c(1953,1), c(1965,2))    # inflation
z = window(qintr, c(1953,1), c(1965,2))    # interest
num = length(y)
A = array(z, dim=c(1,1,num))
input = matrix(1,num,1)
# Function to Calculate Likelihood
Linn = function(para, y.data){   # pass data also
  phi = para[1]; alpha = para[2]
  b = para[3]; Ups = (1-phi)*b
  cQ = para[4]; cR = para[5]
  kf = Kfilter2(num, y.data, A, mu0, Sigma0, phi, Ups, alpha, 1, cQ, cR, 0, input)
  return(kf$like)
}
# Parameter Estimation
mu0 = 1;  Sigma0 = .01
init.par = c(phi=.84, alpha=-.77, b=.85, cQ=.12, cR=1.1)   # initial values
est = optim(init.par, Linn, NULL, y.data=y, method="BFGS", hessian=TRUE,
            control=list(trace=1, REPORT=1, reltol=tol))
SE = sqrt(diag(solve(est$hessian)))
phi = est$par[1]; alpha = est$par[2]
b = est$par[3]; Ups = (1-phi)*b
cQ = est$par[4]; cR = est$par[5]
round(cbind(estimate=est$par, SE), 3)
        estimate    SE
  phi      0.865 0.223
  alpha   -0.686 0.487
  b        0.788 0.226
  cQ       0.115 0.107
  cR       1.135 0.147
# BEGIN BOOTSTRAP
# Run the filter at the estimates
kf = Kfilter2(num, y, A, mu0, Sigma0, phi, Ups, alpha, 1, cQ, cR, 0, input)
# Pull out necessary values from the filter and initialize
xp = kf$xp;  innov = kf$innov;  sig = kf$sig;  K = kf$K
e = innov/sqrt(sig)
e.star = e                        # initialize values
y.star = y
xp.star = xp
k = 4:50                          # hold first 3 observations fixed
para.star = matrix(0, nboot, 5)   # to store estimates
init.par = c(.84, -.77, .85, .12, 1.1)
pr <- progress_text()             # displays progress
pr$init(nboot)
for (i in 1:nboot){
  pr$step()
  e.star[k] = sample(e[k], replace=TRUE)
  for (j in k){
    xp.star[j] = phi*xp.star[j-1] + Ups + K[j]*sqrt(sig[j])*e.star[j]
  }
  y.star[k] = z[k]*xp.star[k] + alpha + sqrt(sig[k])*e.star[k]
  est.star = optim(init.par, Linn, NULL, y.data=y.star, method="BFGS",
                   control=list(reltol=tol))
  para.star[i,] = cbind(est.star$par[1], est.star$par[2], est.star$par[3],
                        abs(est.star$par[4]), abs(est.star$par[5]))
}

# Some summary statistics
rmse = rep(NA,5)                  # SEs from the bootstrap
for(i in 1:5){
  rmse[i] = sqrt(sum((para.star[,i]-est$par[i])^2)/nboot)
  cat(i, rmse[i], "\n")
}
# Plot phi and sigw
phi  = para.star[,1]
sigw = abs(para.star[,4])
phi  = ifelse(phi<0, NA, phi)     # any phi < 0 not plotted
library(psych)                    # load psych package for scatter.hist
scatter.hist(sigw, phi, ylab=expression(phi), xlab=expression(sigma[~w]),
  smooth=FALSE, correl=FALSE, density=FALSE, ellipse=FALSE, title='',
  pch=19, col=gray(.1,alpha=.33), panel.first=grid(lty=2), cex.lab=1.2)

6.8 Smoothing Splines and the Kalman Smoother

There is a connection between smoothing splines, e.g., Eubank (1993), Green (1993), or Wahba (1990), and state space models. The basic idea of smoothing splines (recall Example 2.14) in discrete time is that we suppose the data y_t are generated by

  y_t = μ_t + ε_t ,

for t = 1, ..., n, where μ_t is a smooth function of t and ε_t is white noise. In cubic smoothing with knots at the time points t, μ_t is estimated by minimizing

  Σ_{t=1}^n [ y_t − μ_t ]² + λ Σ_{t=1}^n ( ∇² μ_t )²    (6.124)

with respect to μ_t, where λ > 0 is a smoothing parameter. The parameter λ controls the degree of smoothness, with larger values yielding smoother estimates. For example, if λ = 0, then the minimizer is the data itself, μ̂_t = y_t; consequently, the estimate will not be smooth. If λ = ∞, then the only way to minimize (6.124) is to choose the second term to be zero, i.e., ∇² μ_t = 0, in which case μ_t is of the form μ_t = α + β t,^{6.3} and we are in the setting of linear regression. Hence, the choice of λ > 0 is seen as a trade-off between a fit that goes through all the data points and a linear regression fit.

6.3 That the unique general solution to ∇² μ_t = 0 is of the form μ_t = α + β t follows from difference equation theory; e.g., see Mickens (1990).

Now, consider the model given by

  ∇² μ_t = w_t   and   y_t = μ_t + v_t ,    (6.125)

where w_t and v_t are independent white noise processes with var( w_t ) = σ_w² and var( v_t ) = σ_v². Rewrite (6.125) as

  [ μ_t ; μ_{t−1} ] = [ 2  −1 ; 1  0 ] [ μ_{t−1} ; μ_{t−2} ] + [ 1 ; 0 ] w_t   and   y_t = [ 1  0 ] [ μ_t ; μ_{t−1} ] + v_t ,    (6.126)

so that the state vector is x_t = ( μ_t, μ_{t−1} )'. It is clear then that (6.125) specifies a state space model.
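Before moving to estimation, it may help to see (6.124) solved directly. Stacking μ = ( μ_1, ..., μ_n )' and writing the penalty with the (n−2) × n second-difference matrix D, the criterion is || y − μ ||² + λ || D μ ||², whose minimizer is μ̂ = ( I_n + λ D'D )^{−1} y. The short sketch below is our own illustration and not part of the text's code; the series y and the value of λ are arbitrary choices made only for demonstration.

# direct minimizer of (6.124): mu-hat = (I + lambda t(D) D)^{-1} y
n = 50
D = diff(diag(n), differences=2)    # (n-2) x n second-difference matrix
lambda = 100                        # smoothing parameter picked for illustration
y = cumsum(rnorm(n)) + rnorm(n)     # any series of length n will do
mu.hat = solve(diag(n) + lambda*crossprod(D), y)

The Kalman smoother developed next recovers essentially this fit once the variance components have been estimated.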

Note that the model is similar to the local level model discussed in Example 6.5. In particular, the state process could be written as μ_t = μ_{t−1} + η_{t−1}, where η_t = η_{t−1} + w_t. An example of such a trajectory can be seen in Figure 6.11; note that the generated data in Figure 6.11 look like the global temperature data in Figure 1.2.

Next, we examine the problem of estimating the states, x_t, when the model parameters, θ = { σ_w², σ_v² }, are specified. For ease, we assume x_0 is fixed. Then, using the notation surrounding equations (6.61)–(6.62), the goal is to find the MLE of x_{1:n} = { x_1, ..., x_n } given y_{1:n} = { y_1, ..., y_n }; i.e., to maximize log p_θ( x_{1:n} | y_{1:n} ) with respect to the states. Because of the Gaussianity, the maximum (or mode) of the distribution is attained when the states are estimated by x_t^n, the conditional means. These values are, of course, the smoothers obtained via Property 6.2.

But log p_θ( x_{1:n} | y_{1:n} ) = log p_θ( x_{1:n}, y_{1:n} ) − log p_θ( y_{1:n} ), so maximizing the complete data likelihood, log p_θ( x_{1:n}, y_{1:n} ), with respect to x_{1:n} is an equivalent problem. Writing (6.62) in the notation of (6.125), we have

  −2 log p_θ( x_{1:n}, y_{1:n} ) ∝ σ_v^{−2} Σ_{t=1}^n ( y_t − μ_t )² + σ_w^{−2} Σ_{t=1}^n ( ∇² μ_t )² ,    (6.127)

where we have kept only the terms involving the states, μ_t. If we set λ = σ_v² / σ_w², we can write

  −2 log p_θ( x_{1:n}, y_{1:n} ) ∝ Σ_{t=1}^n ( y_t − μ_t )² + λ Σ_{t=1}^n ( ∇² μ_t )² ,    (6.128)

so that maximizing log p_θ( x_{1:n}, y_{1:n} ) with respect to the states is equivalent to minimizing (6.128), which is the original problem stated in (6.124).

In the general state space setting, we would estimate σ_w² and σ_v² via maximum likelihood as described in Section 6.3, and then obtain the smoothed state values by running Property 6.2 with the estimated variances, say σ̂_w² and σ̂_v². In this case, the estimated value of the smoothing parameter would be given by λ̂ = σ̂_v² / σ̂_w².

Example 6.14 Smoothing Splines
In this example, we generated the signal, or state process, μ_t, and observations y_t from the model (6.125) with n = 50, σ_w = .1, and σ_v = 1. The state is displayed in Figure 6.11 as a thick solid line, and the observations are displayed as points. We then estimated σ_w and σ_v using Newton–Raphson techniques and obtained σ̂_w = .08 and σ̂_v = .94. We then used Property 6.2 to generate the estimated smoothers, say μ̂_t^n, and those values are displayed in Figure 6.11 as a thick dashed line along with a corresponding 95% (pointwise) confidence band as thin dashed lines. Finally, we used the R function smooth.spline to fit a smoothing spline to the data based on the method of generalized cross-validation (gcv). The fitted spline is displayed in Figure 6.11 as a thin solid line, which is close to μ̂_t^n.

The R code to reproduce Figure 6.11 is given below.

Fig. 6.11. Display for Example 6.14: Simulated state process μ_t and observations y_t from the model (6.125) with n = 50, σ_w = .1, and σ_v = 1. Estimated smoother μ̂_{t|n} and corresponding 95% confidence band (dashed lines); gcv smoothing spline (thin solid line).

set.seed(123)
num = 50
w = rnorm(num,0,.1)
x = cumsum(cumsum(w))
y = x + rnorm(num,0,1)
plot.ts(x, ylab="", lwd=2, ylim=c(-1,8))
lines(y, type='o', col=8)
## State Space ##
Phi = matrix(c(2,1,-1,0),2);  A = matrix(c(1,0),1)
mu0 = matrix(0,2);  Sigma0 = diag(1,2)
Linn = function(para){
  sigw = para[1]; sigv = para[2]
  cQ = diag(c(sigw,0))
  kf = Kfilter0(num, y, A, mu0, Sigma0, Phi, cQ, sigv)
  return(kf$like)
}
## Estimation ##
init.par = c(.1, 1)
(est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE,
             control=list(trace=1, REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
# Summary of estimation
estimate = est$par;  u = cbind(estimate, SE)
rownames(u) = c("sigw","sigv");  u
# Smooth
sigw = est$par[1]
cQ = diag(c(sigw,0))
sigv = est$par[2]
ks = Ksmooth0(num, y, A, mu0, Sigma0, Phi, cQ, sigv)
xsmoo = ts(ks$xs[1,1,]);  psmoo = ts(ks$Ps[1,1,])
upp = xsmoo + 2*sqrt(psmoo);  low = xsmoo - 2*sqrt(psmoo)
lines(xsmoo, col=4, lty=2, lwd=3)
lines(upp, col=4, lty=2);  lines(low, col=4, lty=2)
lines(smooth.spline(y), lty=1, col=2)
legend("topleft", c("Observations","State"), pch=c(1,-1), lty=1, lwd=c(1,2), col=c(8,1))
legend("bottomright", c("Smoother","GCV Spline"), lty=c(2,1), lwd=c(3,1), col=c(4,2))
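As a small addition to the listing above (not part of the original code), the smoothing parameter implied by the state space fit, λ̂ = σ̂_v² / σ̂_w², can be computed from the estimates; with the values reported in the example (σ̂_w ≈ .08, σ̂_v ≈ .94), it is roughly 140. Note that this λ̂ is on the scale of (6.124) and is not directly comparable to the smoothing parameter reported by smooth.spline, which is scaled differently.

# implied smoothing parameter from the state space fit
(lambda.hat = est$par[2]^2 / est$par[1]^2)   # sigma_v-hat^2 / sigma_w-hat^2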

6.9 Hidden Markov Models and Switching Autoregression

In the introduction to this chapter, we mentioned that the state space model is characterized by two principles. First, there is a hidden state process, { x_t ; t = 0, 1, ... }, that is assumed to be Markovian. Second, the observations, { y_t ; t = 1, 2, ... }, are independent given the states. The principles were displayed in Figure 6.1 and written in terms of densities in (6.28) and (6.29).

We have been focusing primarily on linear Gaussian state space models, but there is an entire area that has developed around the case where the states x_t are a discrete-valued Markov chain, and that will be the focus in this section. The basic idea is that the value of the state at time t specifies the distribution of the observation at time t. These models were developed in Goldfeld and Quandt (1973) and Lindgren (1978). Changes can also be modeled in the classical regression setting by allowing the value of the state to determine the design matrix, as in Quandt (1972). An early application to speech recognition was considered by Juang and Rabiner (1985). An application of the idea of switching to the tracking of multiple targets was considered in Bar-Shalom (1978), who obtained approximations to Kalman filtering in terms of weighted averages of the innovations. As another example, some authors (for example, Hamilton, 1989, or McCulloch and Tsay, 1993) have explored the possibility that the dynamics of a country's economy might be different during expansion than during contraction.

In the Markov chain approach, we declare the dynamics of the system at time t to be generated by one of m possible regimes evolving according to a Markov chain over time. The case in which the particular regime is unknown to the observer comes under the heading of hidden Markov models (HMM), and the techniques related to analyzing these models are summarized in Rabiner and Juang (1986). Although the model satisfies the conditions for being a state space model, HMMs were developed in parallel. If the state process is discrete-valued, one typically uses the term "hidden Markov model," and if the state process is continuous-valued, one uses the term "state space model" or one of its variants. Texts that cover the theory and methods in whole or in part are Cappé, Moulines, & Rydén (2009) and Douc, Moulines, & Stoffer (2014). A recent introductory text that uses R is Zucchini & MacDonald (2009).

Here, we assume the states, x_t, are a Markov chain taking values in a finite state space { 1, ..., m }, with stationary distribution

  π_j = Pr( x_t = j ) ,    (6.129)

and stationary transition probabilities

  π_{ij} = Pr( x_{t+1} = j | x_t = i ) ,    (6.130)

for t = 0, 1, 2, ..., and i, j = 1, ..., m. Since the second component of the model is that the observations are conditionally independent, we need to specify the distributions, and we denote them by

  p_j( y_t ) = p( y_t | x_t = j ) .    (6.131)
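The two principles are easy to see in a quick simulation. The sketch below (our own illustration, not code from the text) generates a two-state chain using transition probabilities as in (6.130) and then draws observations that, given the states, are independent with state-dependent distributions as in (6.131); the Poisson choice and the particular parameter values are arbitrary and simply anticipate the next example.

set.seed(1)
m = 2;  n = 200
P   = matrix(c(.95, .05,
               .10, .90), m, byrow=TRUE)   # transition probabilities pi_ij, (6.130)
lam = c(10, 25)                            # state-dependent Poisson means, (6.131)
x = numeric(n);  x[1] = 1
for (t in 2:n) x[t] = sample(1:m, 1, prob=P[x[t-1],])   # hidden Markov state
y = rpois(n, lambda=lam[x])                # y_t | x_t = j ~ Poisson(lam_j)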

Fig. 6.12. Top: Series of annual counts of major earthquakes (magnitude 7 and above) in the world between 1900–2006. Bottom: Sample ACF and PACF of the counts.

Example 6.15 Poisson HMM – Number of Major Earthquakes
Consider the time series of annual counts of major earthquakes displayed in Figure 6.12 that were discussed in Zucchini & MacDonald (2009). A natural model for unbounded count data is a Poisson distribution, in which case the mean and variance are equal. However, the sample mean and variance of the data are x̄ = 19.4 and s² = 51.6, so this model is clearly inappropriate. It would be possible to take into account the overdispersion by using other distributions for counts, such as the negative binomial distribution or a mixture of Poisson distributions. This approach, however, ignores the sample ACF and PACF displayed in Figure 6.12, which indicate the observations are serially correlated and further suggest an AR(1)-type correlation structure.

A simple and convenient way to capture both the marginal distribution and the serial dependence is to consider a Poisson-HMM model. Let y_t denote the number of major earthquakes in year t, and consider the state, or latent variable, x_t to be a stationary two-state Markov chain taking values in { 1, 2 }. Using the notation in (6.129) and (6.130), we have π_{11} = 1 − π_{12} and π_{22} = 1 − π_{21}. The stationary distribution of this Markov chain is given by^{6.4}

  π_1 = π_{21} / ( π_{12} + π_{21} )   and   π_2 = π_{12} / ( π_{12} + π_{21} ) .

For j ∈ { 1, 2 }, denote λ_j > 0 as the parameter of a Poisson distribution,

  p_j( y ) = λ_j^y e^{−λ_j} / y! ,   y = 0, 1, ... .

6.4 The stationary distribution must satisfy π_j = Σ_i π_i π_{ij}.

Since the states are stationary, the marginal distribution of y_t is stationary and a mixture of Poissons,

  p_Θ( y_t ) = π_1 p_1( y_t ) + π_2 p_2( y_t ) ,

with Θ = { λ_1, λ_2 }. The mean of the stationary distribution is

  E( y_t ) = π_1 λ_1 + π_2 λ_2 ,    (6.132)

and the variance is^{6.5}

  var( y_t ) = E( y_t ) + π_1 π_2 ( λ_2 − λ_1 )² ≥ E( y_t ) ,    (6.133)

implying that the two-state Poisson HMM is overdispersed. Similar calculations (see Problem 6.21) show that the autocovariance function of y_t is given by

  γ_y( h ) = Σ_{i=1}^2 Σ_{j=1}^2 π_i ( π_{ij}^h − π_j ) λ_i λ_j = π_1 π_2 ( λ_2 − λ_1 )² ( 1 − π_{12} − π_{21} )^h .    (6.134)

Thus, a two-state Poisson-HMM has an exponentially decaying autocorrelation function, and this is consistent with the sample ACF seen in Figure 6.12. It is worthwhile to note that if we increase the number of states, more complex dependence structures may be obtained.

6.5 Recall var( U ) = E[ var( U | V ) ] + var[ E( U | V ) ].

As in the linear Gaussian case, we need filters and smoothers of the state in their own right, and additionally for estimation and prediction. We then write

  π_j( t | s ) = Pr( x_t = j | y_{1:s} ) .    (6.135)

Straightforward calculations (see Problem 6.22) give the filter equations:

Property 6.7 HMM Filter
For t = 1, ..., n,

  π_j( t | t−1 ) = Σ_{i=1}^m π_i( t−1 | t−1 ) π_{ij} ,    (6.136)

  π_j( t | t ) = π_j( t | t−1 ) p_j( y_t ) / Σ_{i=1}^m π_i( t | t−1 ) p_i( y_t ) ,    (6.137)

with initial condition π_j( 1 | 0 ) = π_j.

Let Θ denote the parameters of interest. Given data y_{1:n}, the likelihood is given by

  L_Y( Θ ) = Π_{t=1}^n p_Θ( y_t | y_{1:t−1} ) .

But, by the conditional independence,

  p_Θ( y_t | y_{1:t−1} ) = Σ_{j=1}^m Pr( x_t = j | y_{1:t−1} ) p( y_t | x_t = j, y_{1:t−1} )
                        = Σ_{j=1}^m π_j( t | t−1 ) p_j( y_t ) .

Consequently,

  ln L_Y( Θ ) = Σ_{t=1}^n ln ( Σ_{j=1}^m π_j( t | t−1 ) p_j( y_t ) ) .    (6.138)

Maximum likelihood can then proceed as in the linear Gaussian case discussed in Section 6.3.

In addition, the Baum–Welch (or EM) algorithm discussed in Section 6.3 applies here as well. First, the general complete data likelihood still has the form of (6.61), that is,

  ln p_Θ( x_{0:n}, y_{1:n} ) = ln p_Θ( x_0 ) + Σ_{t=1}^n ln p_Θ( x_t | x_{t−1} ) + Σ_{t=1}^n ln p_Θ( y_t | x_t ) .

It is more useful to define I_j( t ) = 1 if x_t = j and 0 otherwise, and I_{ij}( t ) = 1 if ( x_{t−1}, x_t ) = ( i, j ) and 0 otherwise, for i, j = 1, ..., m. Recall Pr[ I_j( t ) = 1 ] = π_j and Pr[ I_{ij}( t ) = 1 ] = π_i π_{ij}. Then the complete data likelihood can be written as (we drop Θ from some of the notation for convenience)

  ln p_Θ( x_{0:n}, y_{1:n} ) = Σ_{j=1}^m I_j( 0 ) ln π_j + Σ_{t=1}^n Σ_{i=1}^m Σ_{j=1}^m I_{ij}( t ) ln π_{ij} + Σ_{t=1}^n Σ_{j=1}^m I_j( t ) ln p_j( y_t ) ,    (6.139)

and, as before, we need to maximize Q( Θ | Θ' ) = E[ ln p_Θ( x_{0:n}, y_{1:n} ) | y_{1:n}, Θ' ]. In this case, it should be clear that in addition to the filter, (6.137), we will need

  π_j( t | n ) = E( I_j( t ) | y_{1:n} ) = Pr( x_t = j | y_{1:n} )    (6.140)

for the first and third terms, and

  π_{ij}( t | n ) = E( I_{ij}( t ) | y_{1:n} ) = Pr( x_t = i, x_{t+1} = j | y_{1:n} )    (6.141)

for the second term. In the evaluation of the second term, as will be seen, we must also evaluate

  φ_j( t ) = p( y_{t+1:n} | x_t = j ) .    (6.142)

Property 6.8 HMM Smoother
For t = n−1, ..., 0,

  π_j( t | n ) = π_j( t | t ) φ_j( t ) / Σ_{j=1}^m π_j( t | t ) φ_j( t ) ,    (6.143)

  π_{ij}( t | n ) = π_i( t | n ) π_{ij} p_j( y_{t+1} ) φ_j( t+1 ) / φ_i( t ) ,    (6.144)

  φ_i( t ) = Σ_{j=1}^m π_{ij} p_j( y_{t+1} ) φ_j( t+1 ) ,    (6.145)

where φ_j( n ) = 1 for j = 1, ..., m.

Proof: We leave the proof of (6.143) to the reader; see Problem 6.22. To verify (6.145), note that

  φ_i( t ) = Σ_{j=1}^m p( y_{t+1:n}, x_{t+1} = j | x_t = i )
           = Σ_{j=1}^m Pr( x_{t+1} = j | x_t = i ) p( y_{t+1} | x_{t+1} = j ) p( y_{t+2:n} | x_{t+1} = j )
           = Σ_{j=1}^m π_{ij} p_j( y_{t+1} ) φ_j( t+1 ) .

To verify (6.144), we have

  π_{ij}( t | n ) ∝ Pr( x_t = i, x_{t+1} = j, y_{t+2:n} | y_{1:t+1} )
                 = Pr( x_t = i | y_{1:t} ) Pr( x_{t+1} = j | x_t = i ) p( y_{t+1} | x_{t+1} = j ) p( y_{t+2:n} | x_{t+1} = j )
                 = π_i( t | t ) π_{ij} p_j( y_{t+1} ) φ_j( t+1 ) .

Finally, to find the constant of proportionality, say C_t, sum over j on both sides to get Σ_j π_{ij}( t | n ) = π_i( t | n ) and Σ_j π_{ij} p_j( y_{t+1} ) φ_j( t+1 ) = φ_i( t ). This means that π_i( t | n ) = C_t π_i( t | t ) φ_i( t ), and (6.144) follows. ∎

For the Baum–Welch (or EM) algorithm, given the current value of the parameters, say Θ', run the filter, Property 6.7, and the smoother, Property 6.8, and then, as is evident from (6.139), update the first two estimates as

  π̂_j = π'_j( 0 | n )   and   π̂_{ij} = Σ_{t=1}^n π'_{ij}( t | n ) / Σ_{k=1}^m Σ_{t=1}^n π'_{ik}( t | n ) .    (6.146)

Of course, the prime indicates that values have been obtained under Θ', and the hat denotes the update. Although not the MLE, it has been suggested by Lindgren (1978) that a natural estimate of the stationary distribution of the chain would be

  π̂_j = n^{−1} Σ_{t=1}^n π'_j( t | n ) ,

rather than the value given in (6.146).

Fig. 6.13. Top: Earthquake count data and estimated states. Bottom left: Smoothing probabilities. Bottom right: Histogram of the data with the two estimated Poisson densities superimposed (solid lines).

Finally, the third term in (6.139) will require knowing the distribution p_j( y_t ), and this will depend on the particular model. We discuss the Poisson distribution in Example 6.16 and the normal distribution in Example 6.17.

Example 6.16 Poisson HMM – Number of Major Earthquakes (cont)
To run the EM algorithm in this case, we still need to maximize the conditional expectation of the third term of (6.139). The conditional expectation of the third term at the current parameter value Θ' is

  Σ_{t=1}^n Σ_{j=1}^m π'_j( t | n ) ln p_j( y_t ) ,

where

  ln p_j( y_t ) ∝ y_t ln λ_j − λ_j .

Consequently, maximization with respect to λ_j yields

  λ̂_j = Σ_{t=1}^n π'_j( t | n ) y_t / Σ_{t=1}^n π'_j( t | n ) ,   j = 1, ..., m .
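Before turning to the packaged fit used below, a bare-bones implementation of Property 6.7, Property 6.8, and the updates (6.146) and λ̂_j may help fix ideas. The function below is our own sketch for a two-state Poisson HMM (the function and argument names are ours); for simplicity it keeps the initial state distribution fixed and runs a set number of iterations, whereas production code, such as the depmixS4 routines used next, scales the recursions to avoid numerical underflow and monitors the likelihood for convergence.

pois.hmm.em = function(y, lam, P, pi0=c(.5,.5), niter=50){
  # y = counts, lam = initial intensities, P = initial 2x2 transition matrix
  n = length(y);  m = 2
  for (it in 1:niter){
    py = cbind(dpois(y, lam[1]), dpois(y, lam[2]))   # p_j(y_t)
    pp = fp = matrix(0, n, m)                        # pi_j(t|t-1) and pi_j(t|t)
    pp[1,] = pi0
    fp[1,] = pp[1,]*py[1,]/sum(pp[1,]*py[1,])
    for (t in 2:n){
      pp[t,] = fp[t-1,] %*% P                        # (6.136)
      fp[t,] = pp[t,]*py[t,]/sum(pp[t,]*py[t,])      # (6.137)
    }
    phi = matrix(1, n, m)                            # phi_j(t), (6.145), with phi_j(n) = 1
    for (t in (n-1):1) phi[t,] = P %*% (py[t+1,]*phi[t+1,])
    sp = fp*phi/rowSums(fp*phi)                      # pi_j(t|n), (6.143)
    num = matrix(0, m, m)                            # sum over t of pi_ij(t|n), via (6.144)
    for (t in 1:(n-1)) num = num + outer(sp[t,], py[t+1,]*phi[t+1,]) * P / phi[t,]
    P   = num/rowSums(num)                           # transition update, (6.146)
    lam = colSums(sp*y)/colSums(sp)                  # intensity update, lambda_j-hat
  }
  list(lambda=lam, P=P, pi.smooth=sp)
}

For the earthquake counts, a call such as pois.hmm.em(as.numeric(EQcount), lam=c(15,25), P=matrix(c(.9,.1,.1,.9),2)) should move the intensities toward values comparable to the MLEs reported below, although no claim is made that this untuned sketch reproduces them exactly.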

We fit the model to the time series of earthquake counts using the R package depmixS4. The package, which uses the EM algorithm, does not provide standard errors, so we obtained them by a parametric bootstrap procedure; see Remillard (2011) for justification. The MLEs of the intensities, along with their standard errors (in parentheses), were ( λ̂_1, λ̂_2 ) = ( 15.4 (.7), 26.0 (1.1) ). The MLE of the transition matrix was [ π̂_11, π̂_12, π̂_21, π̂_22 ] = [ .93 (.04), .07 (.04), .12 (.09), .88 (.09) ]. Figure 6.13 displays the counts, the estimated state (displayed as points), and the smoothing distribution for the earthquakes data, modeled as a two-state Poisson HMM with parameters fitted using the MLEs. Finally, a histogram of the data is displayed along with the two estimated Poisson densities superimposed as solid lines. The R code for this example is as follows.

library(depmixS4)
model <- depmix(EQcount ~ 1, nstates=2, data=data.frame(EQcount), family=poisson())
set.seed(90210)
summary(fm <- fit(model))    # estimation results
##-- Get Parameters --##
u = as.vector(getpars(fm))
# ensure state 1 has the smaller lambda
if (u[7] <= u[8]) { para.mle = c(u[3:6], exp(u[7]), exp(u[8]))
 } else { para.mle = c(u[6:3], exp(u[8]), exp(u[7])) }
mtrans = matrix(para.mle[1:4], byrow=TRUE, nrow=2)
lams = para.mle[5:6]
pi1 = mtrans[2,1]/(2 - mtrans[1,1] - mtrans[2,2]);  pi2 = 1 - pi1
##-- Graphics --##
layout(matrix(c(1,2,1,3), 2))
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
# data and states
plot(EQcount, main="", ylab='EQcount', type='h', col=gray(.7))
text(EQcount, col=6*posterior(fm)[,1]-2, labels=posterior(fm)[,1], cex=.9)
# probability of state 2
plot(ts(posterior(fm)[,3], start=1900), ylab=expression(hat(pi)[~2]*'(t|n)'))
abline(h=.5, lty=2)
# histogram
hist(EQcount, breaks=30, prob=TRUE, main="")
xvals = seq(1,45)
u1 = pi1*dpois(xvals, lams[1])
u2 = pi2*dpois(xvals, lams[2])
lines(xvals, u1, col=4);  lines(xvals, u2, col=2)
##-- Bootstrap --##
# function to generate data
pois.HMM.generate_sample = function(n, m, lambda, Mtrans, StatDist=NULL){
  # n = data length, m = number of states, Mtrans = transition matrix,
  # StatDist = stationary distribution
  if (is.null(StatDist)) StatDist = solve(t(diag(m)-Mtrans+1), rep(1,m))
  mvect = 1:m
  state = numeric(n)
  state[1] = sample(mvect, 1, prob=StatDist)
  for (i in 2:n) state[i] = sample(mvect, 1, prob=Mtrans[state[i-1],])
  y = rpois(n, lambda=lambda[state])
  list(y=y, state=state)
}
# start it up
set.seed(10101101)

nboot = 100
nobs = length(EQcount)
para.star = matrix(NA, nrow=nboot, ncol=6)
for (j in 1:nboot){
  x.star = pois.HMM.generate_sample(n=nobs, m=2, lambda=lams, Mtrans=mtrans)$y
  model <- depmix(x.star ~ 1, nstates=2, data=data.frame(x.star), family=poisson())
  u = as.vector(getpars(fit(model, verbose=0)))
  # make sure state 1 is the one with the smaller intensity parameter
  if (u[7] <= u[8]) { para.star[j,] = c(u[3:6], exp(u[7]), exp(u[8]))
   } else { para.star[j,] = c(u[6:3], exp(u[8]), exp(u[7])) }
}
# bootstrapped standard errors
SE = sqrt(apply(para.star,2,var) + (apply(para.star,2,mean)-para.mle)^2)[c(1,4:6)]
names(SE) = c('seM11/M12', 'seM21/M22', 'seLam1', 'seLam2');  SE

Next, we present an example using a mixture of normal distributions.

Example 6.17 Normal HMM – S&P500 Weekly Returns
Estimation in the Gaussian case is similar to the Poisson case given in Example 6.16, except that now p_j( y_t ) is the normal density, ( y_t | x_t = j ) ∼ N( μ_j, σ_j² ), for j = 1, ..., m. Then, dealing with the third term in (6.139) in this case yields

  μ̂_j = Σ_{t=1}^n π'_j( t | n ) y_t / Σ_{t=1}^n π'_j( t | n )   and   σ̂_j² = Σ_{t=1}^n π'_j( t | n ) ( y_t − μ̂_j )² / Σ_{t=1}^n π'_j( t | n ) .

In this example, we fit a normal HMM using the R package depmixS4 to the weekly S&P 500 returns displayed in Figure 6.14. We chose a three-state model, and we leave it to the reader to investigate a two-state model (see Problem 6.24). Standard errors (shown in parentheses below) were obtained via a parametric bootstrap based on a simulation script provided with the package.

If we let P = { π_{ij} } denote the 3 × 3 matrix of transition probabilities, the fitted transition matrix was

  P̂ = [ .945 (.074)   .055 (.074)   .000 (.000)
        .739 (.275)   .000 (.000)   .261 (.275)
        .032 (.122)   .027 (.057)   .942 (.147) ] ,

and the three fitted normals were N( μ̂_1 = .004 (.173), σ̂_1 = .014 (.968) ), N( μ̂_2 = −.034 (.909), σ̂_2 = .009 (.777) ), and N( μ̂_3 = −.003 (.317), σ̂_3 = .044 (.910) ). The data, along with the predicted state (based on the smoothing distribution), are plotted in Figure 6.14. Note that regime 2 appears to represent a somewhat large-in-magnitude negative return, and may be a lone dip, or the start or end of a highly volatile period. States 1 and 3 represent clusters of regular or high volatility, respectively. Note that there is a large amount of uncertainty in the fitted normals, and in the transition matrix involving transitions from state 2 to states 1 or 3. The R code for this example is:

Fig. 6.14. Top: S&P 500 weekly returns with estimated regimes labeled as a number, 1, 2, or 3. The minimum value of −20% during the financial crisis has been truncated to improve the graphics. Bottom left: Sample ACF of the squared returns. Bottom right: Histogram of the data with the three estimated normal densities superimposed.

library(depmixS4)
y = ts(sp500w, start=2003, freq=52)   # make data depmix friendly
mod3 <- depmix(y~1, nstates=3, data=data.frame(y))
set.seed(2)
summary(fm3 <- fit(mod3))
##-- Graphics --##
layout(matrix(c(1,2, 1,3), 2), heights=c(1,.75))
par(mar=c(2.5,2.5,.5,.5), mgp=c(1.6,.6,0))
plot(y, main="", ylab='S&P500 Weekly Returns', col=gray(.7), ylim=c(-.11,.11))
culer = 4 - posterior(fm3)[,1];  culer[culer==3] = 4   # switch labels 1 and 3
text(y, col=culer, labels=4-posterior(fm3)[,1])
##-- MLEs --##
para.mle = as.vector(getpars(fm3)[-(1:3)])
permu = matrix(c(0,0,1,0,1,0,1,0,0), 3,3)   # for the label switch
(mtrans.mle = permu%*%round(t(matrix(para.mle[1:9],3,3)),3)%*%permu)
(norms.mle  = round(matrix(para.mle[10:15],2,3),3)%*%permu)
acf(y^2, xlim=c(.02,.5), ylim=c(-.09,.5), panel.first=grid(lty=2))
hist(y, 25, prob=TRUE, main='')
culer = c(1,2,4);  pi.hat = colSums(posterior(fm3)[-1,2:4])/length(y)
for (i in 1:3) {
  mu = norms.mle[1,i];  sig = norms.mle[2,i]
  x = seq(-.15,.12, by=.001)
  lines(x, pi.hat[4-i]*dnorm(x, mean=mu, sd=sig), col=culer[i])
}
##-- Bootstrap --##
set.seed(666);  n.obs = length(y);  n.boot = 100
para.star = matrix(NA, nrow=n.boot, ncol=15)
respst <- para.mle[10:15];  trst <- para.mle[1:9]
for ( nb in 1:n.boot ){

  mod <- simulate(mod3)
  y.star = as.vector(mod@response[[1]][[1]]@y)
  dfy = data.frame(y.star)
  mod.star <- depmix(y.star~1, data=dfy, respst=respst, trst=trst, nst=3)
  fm.star = fit(mod.star, emcontrol=em.control(tol=1e-5), verbose=FALSE)
  para.star[nb,] = as.vector(getpars(fm.star)[-(1:3)])
}
# bootstrapped standard errors
SE = sqrt(apply(para.star,2,var) + (apply(para.star,2,mean)-para.mle)^2)
(SE.mtrans.mle = permu%*%round(t(matrix(SE[1:9],3,3)),3)%*%permu)
(SE.norms.mle  = round(matrix(SE[10:15], 2,3),3)%*%permu)

It is worth mentioning that switching regression also fits into this framework. In this case, we would change μ_j in the model of Example 6.17 to depend on independent inputs, say z_{t1}, ..., z_{tr}, so that

  μ_{tj} = β_0^{(j)} + Σ_{i=1}^r β_i^{(j)} z_{ti} .

This type of model is easily handled using the depmixS4 R package.

By conditioning on the first few observations, it is also possible to include simple switching linear autoregression in this framework. In this case, we model the observations as an AR(p) with parameters depending on the state; that is,

  y_t = φ_0^{(x_t)} + Σ_{i=1}^p φ_i^{(x_t)} y_{t−i} + σ^{(x_t)} v_t ,    (6.147)

where v_t ∼ iid N(0, 1). The model is similar to the threshold model discussed in Section 5.4; however, the process is not self-exciting or influenced by an observed exogenous process. In (6.147), we are saying that the parameters are random, and the regimes are changing due to a latent Markov process. In a similar fashion to (6.131), we write the conditional distribution of the observations as

  p_j( y_t ) = p( y_t | y_{t−1:t−p}, x_t = j ) ,    (6.148)

and we note that, for t > p, p_j( y_t ) is the normal density g(·),

  p_j( y_t ) = g( y_t ; φ_0^{(j)} + Σ_{i=1}^p φ_i^{(j)} y_{t−i} , σ_{(j)}² ) .    (6.149)

As in (6.138), the conditional likelihood is given by

  ln L_Y( Θ | y_{1:p} ) = Σ_{t=p+1}^n ln ( Σ_{j=1}^m π_j( t | t−1 ) p_j( y_t ) ) ,

where Property 6.7 still applies, but with the updated evaluation of p_j( y_t ) given in (6.149). In addition, the EM algorithm may be used analogously by assessing the smoothers. The smoothers in this case are symbolically the same as given in Property 6.8, with the appropriate definition changes: p_j( y_t ) as given in (6.148) and φ_j( t ) = p( y_{t+1:n} | y_{t+1−p:t}, x_t = j ) for t > p.
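To connect (6.147)–(6.149) with the HMM filter before the influenza example, the following sketch (our own; the function and argument names are hypothetical, and this is not the MSwM code used in the example) evaluates the negative conditional log likelihood of a two-regime switching AR(p) by running Property 6.7 with p_j( y_t ) replaced by the normal density in (6.149). Such a function could, for instance, be handed to optim.

# negative of the conditional log likelihood ln L_Y(Theta | y_{1:p}) for a
# two-regime switching AR(p); phi = 2 x (p+1) matrix of (intercept, AR coefs),
# sig = the two regime standard deviations, P = 2 x 2 transition matrix
swar.nll = function(y, phi, sig, P, p=2){
  n  = length(y);  m = 2
  pf = rep(1/m, m)                     # pi_j(p|p), a diffuse start
  ll = 0
  for (t in (p+1):n){
    pp    = as.vector(pf %*% P)        # pi_j(t|t-1), (6.136)
    means = as.vector(phi[,1] + phi[,-1,drop=FALSE] %*% y[t-(1:p)])   # regime means, (6.149)
    pyt   = dnorm(y[t], mean=means, sd=sig)
    ll    = ll + log(sum(pp*pyt))      # contribution to the conditional likelihood
    pf    = pp*pyt/sum(pp*pyt)         # pi_j(t|t), (6.137)
  }
  -ll
}

For example, swar.nll(diff(flu), phi=rbind(c(0,.3,.1), c(.2,-.3,-.6)), sig=c(.02,.11), P=matrix(c(.9,.1,.7,.3), 2, byrow=TRUE)) evaluates the criterion at one (arbitrary) parameter configuration for the differenced flu series.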

Fig. 6.15. The differenced flu mortality data along with the estimated states (displayed as points). The smoothed state 2 probabilities are displayed in the bottom of the figure as a straight line; the filtered state 2 probabilities are displayed as vertical lines.

Example 6.18 Switching AR – Influenza Mortality
In Example 5.7, we discussed the monthly pneumonia and influenza mortality series shown in Figure 5.7. We pointed out the non-reversibility of the series, which rules out the possibility that the data are generated by a linear Gaussian process. In addition, note that the series is irregular, and while mortality is highest during the winter, the peak does not occur in the same month each year. Moreover, some seasons have very large peaks, indicating flu epidemics, whereas other seasons are mild. In addition, it can be seen from Figure 5.7 that there is a slight negative trend in the data set, indicating that flu prevention is getting better over the eleven-year period.

As in Example 5.7, we focus on the differenced data, which removes the trend. In this case, we denote y_t = ∇flu_t, where flu_t represents the data displayed in Figure 5.7. Since we already fit a threshold model to y_t, we might also consider a switching autoregressive model where there are two hidden regimes, one for epidemic periods and one for more mild periods. In this case, the model is given by

  y_t = φ_0^{(1)} + Σ_{j=1}^p φ_j^{(1)} y_{t−j} + σ^{(1)} v_t   for x_t = 1 ,
  y_t = φ_0^{(2)} + Σ_{j=1}^p φ_j^{(2)} y_{t−j} + σ^{(2)} v_t   for x_t = 2 ,    (6.150)

where v_t ∼ iid N(0, 1) and x_t is a latent, two-state Markov chain.

We used the R package MSwM to fit the model specified in (6.150), with p = 2. The results (standard errors in parentheses) were

  ŷ_t = .006 (.003) + .293 (.039) y_{t−1} + .097 (.031) y_{t−2} + .024 v_t   for x_t = 1 ,
  ŷ_t = .199 (.063) − .313 (.281) y_{t−1} − .604 (.276) y_{t−2} + .112 v_t   for x_t = 2 ,

with estimated transition matrix

  P̂ = [ .93  .07
        .70  .30 ] .

Figure 6.15 displays the data y_t = ∇flu_t along with the estimated states (displayed as points labeled 1 or 2). The smoothed state 2 probabilities are displayed in the bottom of the figure as a straight line, and the filtered state 2 probabilities are displayed in the same graph as vertical lines. The code for this example is as follows.

library(MSwM)
set.seed(90210)
dflu = diff(flu)
model = lm(dflu ~ 1)
mod = msmFit(model, k=2, p=2, sw=rep(TRUE,4))   # 2 regimes, AR(2)s
summary(mod)
plotProb(mod, which=3)

6.10 Dynamic Linear Models with Switching

In this section, we extend the hidden Markov model discussed in Section 6.9 to more general problems. As previously indicated, the problem of modeling changes in regimes for time series has been of interest in many different fields, and we have explored these ideas in Section 5.4 as well as in Section 6.9.

Generalizations of the state space model to include the possibility of changes occurring over time have been approached by allowing changes in the error covariances (Harrison and Stevens, 1976; Gordon and Smith, 1988, 1990) or by assigning mixture distributions to the observation errors v_t (Peña and Guttman, 1988). Approximations to filtering were derived in all of the aforementioned articles. An application to monitoring renal transplants was described in Smith and West (1983) and in Gordon and Smith (1990). Gerlach et al. (2000) considered an extension of the switching AR model to allow for level shifts and outliers in both the observations and innovations. An application of the idea of switching to the tracking of multiple targets has been considered in Bar-Shalom (1978), who obtained approximations to Kalman filtering in terms of weighted averages of the innovations. For a thorough coverage of these and related techniques, see Cappé, Moulines, & Rydén (2009) and Douc, Moulines, & Stoffer (2014).

In this section, we will concentrate on the method presented in Shumway and Stoffer (1991). One way of modeling change in an evolving time series is by assuming the dynamics of some underlying model change discontinuously at certain undetermined points in time. Our starting point is the DLM given by (6.1) and (6.2), namely,

  x_t = Φ x_{t−1} + w_t ,    (6.151)

to describe the p × 1 state dynamics, and

  y_t = A_t x_t + v_t ,    (6.152)

to describe the q × 1 observation dynamics. Recall w_t and v_t are Gaussian white noise sequences with var( w_t ) = Q, var( v_t ) = R, and cov( w_t, v_s ) = 0 for all s and t.

Example 6.19 Tracking Multiple Targets
The approach of Shumway and Stoffer (1991) was motivated primarily by the problem of tracking a large number of moving targets using a vector y_t of sensors. In this problem, we do not know at any given point in time which target any given sensor has detected. Hence, it is the structure of the measurement matrix A_t in (6.152) that is changing, and not the dynamics of the signal x_t or the noises, w_t or v_t. As an example, consider a 3 × 1 vector of satellite measurements, y_t = ( y_{1t}, y_{2t}, y_{3t} )', that are observations on some combination of a 3 × 1 vector of targets or signals, x_t = ( x_{1t}, x_{2t}, x_{3t} )'. For the measurement matrix

  A_t = [ 0 1 0
          1 0 0
          0 0 1 ] ,

for example, the first sensor, y_{1t}, observes the second target, x_{2t}; the second sensor, y_{2t}, observes the first target, x_{1t}; and the third sensor, y_{3t}, observes the third target, x_{3t}. All possible detection configurations will define a set of possible values for A_t, say { M_1, M_2, ..., M_m }, as a collection of plausible measurement matrices.

Example 6.20 Modeling Economic Change
As another example of the switching model presented in this section, consider the case in which the dynamics of the linear model change suddenly over the history of a given realization. For example, Lam (1990) has given the following generalization of Hamilton's (1989) model for detecting positive and negative growth periods in the economy. Suppose the data are generated by

  y_t = z_t + n_t ,    (6.153)

where z_t is an autoregressive series and n_t is a random walk with a drift that switches between two values, α_0 and α_0 + α_1. That is,

  n_t = n_{t−1} + α_0 + α_1 S_t ,    (6.154)

with S_t = 0 or 1, depending on whether the system is in state 1 or state 2. For the purpose of illustration, suppose

  z_t = φ_1 z_{t−1} + φ_2 z_{t−2} + w_t    (6.155)

is an AR(2) series with var( w_t ) = σ_w². Lam (1990) wrote (6.153) in a differenced form,

  ∇y_t = z_t − z_{t−1} + α_0 + α_1 S_t ,    (6.156)

which we may take as the observation equation (6.152) with state vector

  x_t = ( z_t, z_{t−1}, α_0, α_1 )'    (6.157)

and

  M_1 = [ 1, −1, 1, 0 ]   and   M_2 = [ 1, −1, 1, 1 ]    (6.158)

determining the two possible economic conditions. The state equation, (6.151), is of the form

  [ z_t     ]   [ φ_1  φ_2  0  0 ] [ z_{t−1} ]   [ w_t ]
  [ z_{t−1} ] = [ 1    0    0  0 ] [ z_{t−2} ] + [ 0   ]
  [ α_0     ]   [ 0    0    1  0 ] [ α_0     ]   [ 0   ]
  [ α_1     ]   [ 0    0    0  1 ] [ α_1     ]   [ 0   ]    (6.159)

The observation equation, (6.156), can be written as

  ∇y_t = A_t x_t + v_t ,    (6.160)

where we have included the possibility of observational noise, and where Pr( A_t = M_1 ) = 1 − Pr( A_t = M_2 ), with M_1 and M_2 given in (6.158).

To incorporate a reasonable switching structure for the measurement matrix into the DLM that is compatible with both practical situations previously described, we assume that the m possible configurations are states in a nonstationary, independent process defined by the time-varying probabilities

  π_j( t ) = Pr( A_t = M_j ) ,    (6.161)

for j = 1, ..., m and t = 1, 2, ..., n. Important information about the current state of the measurement process is given by the filtered probabilities of being in state j, defined as the conditional probabilities

  π_j( t | t ) = Pr( A_t = M_j | y_{1:t} ) ,    (6.162)

which also vary as a function of time. Recall that y_{s:s'} = { y_s, ..., y_{s'} }. The filtered probabilities (6.162) give the time-varying estimates of the probability of being in state j given the data to time t.

It will be important for us to obtain estimators of the configuration probabilities, π_j( t | t ), the predicted and filtered state estimators, x_t^{t−1} and x_t^t, and the corresponding error covariance matrices, P_t^{t−1} and P_t^t. Of course, the predictor and filter estimators will depend on the parameters, Θ, of the DLM. In many situations, the parameters will be unknown and we will have to estimate them. Our focus will be on maximum likelihood estimation, but other authors have taken a Bayesian approach that assigns priors to the parameters and then seeks posterior distributions of the model parameters; see, for example, Gordon and Smith (1990), Peña and Guttman (1988), or McCulloch and Tsay (1993).

We now establish the recursions for the filters associated with the state, x_t, and the switching process, A_t. As discussed in Section 6.3, the filters are also an essential part of the maximum likelihood procedure. The predictors, x_t^{t−1} = E( x_t | y_{1:t−1} ), and filters, x_t^t = E( x_t | y_{1:t} ), and their associated error variance–covariance matrices, P_t^{t−1} and P_t^t, are given by

  x_t^{t−1} = Φ x_{t−1}^{t−1} ,    (6.163)

  P_t^{t−1} = Φ P_{t−1}^{t−1} Φ' + Q ,    (6.164)

  x_t^t = x_t^{t−1} + Σ_{j=1}^m π_j( t | t ) K_{tj} ε_{tj} ,    (6.165)

  P_t^t = Σ_{j=1}^m π_j( t | t ) ( I − K_{tj} M_j ) P_t^{t−1} ,    (6.166)

  K_{tj} = P_t^{t−1} M_j' Σ_{tj}^{−1} ,    (6.167)

where the innovation values in (6.165) and (6.167) are

  ε_{tj} = y_t − M_j x_t^{t−1} ,    (6.168)

  Σ_{tj} = M_j P_t^{t−1} M_j' + R ,    (6.169)

for j = 1, ..., m.

Equations (6.163)–(6.167) exhibit the filter values as weighted linear combinations of the m innovation values, (6.168)–(6.169), corresponding to each of the possible measurement matrices. The equations are similar to the approximations introduced by Bar-Shalom and Tse (1975), by Gordon and Smith (1990), and by Peña and Guttman (1988).

To verify (6.165), let the indicator I( A_t = M_j ) = 1 when A_t = M_j, and zero otherwise. Then, using (6.20),

  x_t^t = E( x_t | y_{1:t} ) = E[ E( x_t | y_{1:t}, A_t ) | y_{1:t} ]
        = E[ Σ_{j=1}^m E( x_t | y_{1:t}, A_t = M_j ) I( A_t = M_j ) | y_{1:t} ]
        = E[ Σ_{j=1}^m { x_t^{t−1} + K_{tj}( y_t − M_j x_t^{t−1} ) } I( A_t = M_j ) | y_{1:t} ]
        = Σ_{j=1}^m π_j( t | t ) [ x_t^{t−1} + K_{tj}( y_t − M_j x_t^{t−1} ) ] ,

where K_{tj} is given by (6.167). Equation (6.166) is derived in a similar fashion; the other relationships, (6.163), (6.164), and (6.167), follow from straightforward applications of the Kalman filter results given in Property 6.1.

Next, we derive the filters π_j( t | t ). Let p_j( t | t−1 ) denote the conditional density of y_t given the past, y_{1:t−1}, and A_t = M_j, for j = 1, ..., m. Then,

  π_j( t | t ) = π_j( t ) p_j( t | t−1 ) / Σ_{k=1}^m π_k( t ) p_k( t | t−1 ) ,    (6.170)

where we assume the distribution π_j( t ), for j = 1, ..., m, has been specified before observing y_{1:t} (details follow as in Example 6.21 below).

If the investigator has no reason to prefer one state over another at time t, the choice of uniform priors, π_j( t ) = m^{−1}, for j = 1, ..., m, will suffice. Smoothness can be introduced by letting

  π_j( t ) = Σ_{i=1}^m π_i( t−1 | t−1 ) π_{ij} ,    (6.171)

where the non-negative weights π_{ij} are chosen so that Σ_j π_{ij} = 1. If the A_t process were Markov with transition probabilities π_{ij}, then (6.171) would be the update for the filter probability, as shown in the next example.

Example 6.21 Hidden Markov Chain Model
If { A_t } is a hidden Markov chain with stationary transition probabilities π_{ij} = Pr( A_t = M_j | A_{t−1} = M_i ), for i, j = 1, ..., m, we have

  π_j( t | t ) = p( A_t = M_j, y_t | y_{1:t−1} ) / p( y_t | y_{1:t−1} )
             = Pr( A_t = M_j | y_{1:t−1} ) p( y_t | A_t = M_j, y_{1:t−1} ) / p( y_t | y_{1:t−1} )
             = π_j( t | t−1 ) p_j( t | t−1 ) / Σ_{k=1}^m π_k( t | t−1 ) p_k( t | t−1 ) .    (6.172)

In the Markov case, the conditional probabilities π_j( t | t−1 ) = Pr( A_t = M_j | y_{1:t−1} ) in (6.172) replace the unconditional probabilities, π_j( t ) = Pr( A_t = M_j ), in (6.170).

To evaluate (6.172), we must be able to calculate π_j( t | t−1 ) and p_j( t | t−1 ). We will discuss the calculation of p_j( t | t−1 ) after this example. To derive π_j( t | t−1 ), note,

  π_j( t | t−1 ) = Pr( A_t = M_j | y_{1:t−1} )
               = Σ_{i=1}^m Pr( A_t = M_j, A_{t−1} = M_i | y_{1:t−1} )
               = Σ_{i=1}^m Pr( A_t = M_j | A_{t−1} = M_i ) Pr( A_{t−1} = M_i | y_{1:t−1} )
               = Σ_{i=1}^m π_{ij} π_i( t−1 | t−1 ) .    (6.173)

Expression (6.171) comes from equation (6.173), where, as previously noted, we replace π_j( t | t−1 ) by π_j( t ).

The difficulty in extending the approach here to the Markov case is the dependence among the y_t, which makes it necessary to enumerate over all possible histories to derive the filtering equations. This problem will be evident when we derive the conditional density p_j( t | t−1 ).

Equation (6.171) has π_j( t ) as a function of the past observations, y_{1:t−1}, which is inconsistent with our model assumption. Nevertheless, this seems to be a reasonable compromise that allows the data to modify the probabilities π_j( t ) without having to develop a highly computer-intensive technique.

As previously suggested, the computation of p_j( t | t−1 ), without some approximations, is highly computer-intensive. To evaluate p_j( t | t−1 ), consider the event

  { A_1 = M_{j_1}, ..., A_{t−1} = M_{j_{t−1}} } ,    (6.174)

for j_i = 1, ..., m and i = 1, ..., t−1, which specifies a specific set of measurement matrices through the past; we will write this event as A_{(t−1)} = M_{(ℓ)}. Because m^{t−1} possible outcomes exist for A_1, ..., A_{t−1}, the index ℓ runs through ℓ = 1, ..., m^{t−1}. Using this notation, we may write

  p_j( t | t−1 ) = Σ_{ℓ=1}^{m^{t−1}} Pr{ A_{(t−1)} = M_{(ℓ)} | y_{1:t−1} } p( y_t | y_{1:t−1}, A_t = M_j, A_{(t−1)} = M_{(ℓ)} )
               ≡ Σ_{ℓ=1}^{m^{t−1}} α( ℓ ) g( y_t ; μ_{tj}( ℓ ), Σ_{tj}( ℓ ) ) ,   j = 1, ..., m ,    (6.175)

where g( · ; μ, Σ ) represents the normal density with mean vector μ and variance–covariance matrix Σ. Thus, p_j( t | t−1 ) is a mixture of normals with non-negative weights α( ℓ ) = Pr{ A_{(t−1)} = M_{(ℓ)} | y_{1:t−1} } such that Σ_ℓ α( ℓ ) = 1, and with each normal distribution having mean vector

  μ_{tj}( ℓ ) = M_j x_t^{t−1}( ℓ ) = M_j E[ x_t | y_{1:t−1}, A_{(t−1)} = M_{(ℓ)} ]    (6.176)

and covariance matrix

  Σ_{tj}( ℓ ) = M_j P_t^{t−1}( ℓ ) M_j' + R .    (6.177)

This result follows because the conditional distribution of y_t in (6.175) is identical to the fixed measurement matrix case presented in Section 6.2. The values in (6.176) and (6.177), and hence the densities p_j( t | t−1 ), for j = 1, ..., m, can be obtained directly from the Kalman filter, Property 6.1, with the measurement matrices A_{(t−1)} fixed at M_{(ℓ)}.

Although p_j( t | t−1 ) is given explicitly in (6.175), its evaluation is highly computer intensive. For example, with m = 2 states and n = 20 observations, we have to filter over 2 + 2² + ··· + 2²⁰ possible sample paths (2²⁰ = 1,048,576). There are a few remedies to this problem. An algorithm that makes it possible to efficiently compute the most likely sequence of states given the data is known as the Viterbi algorithm, which is based on the well-known dynamic programming principle. Details may be found in Douc et al. (2014, §9.2). Another remedy is to trim (remove), at each t, highly improbable sample paths; that is, remove events in (6.174) with extremely small probability of occurring, and then evaluate p_j( t | t−1 ) as if the trimmed sample paths could not have occurred. Another rather simple alternative, as suggested by Gordon and Smith (1990) and Shumway and Stoffer (1991), is to approximate p_j( t | t−1 ) using the closest (in the sense of Kullback–Leibler distance) normal distribution.

In this case, the approximation leads to choosing the normal distribution with the same mean and variance associated with p_j( t | t−1 ); that is, we approximate p_j( t | t−1 ) by a normal with mean M_j x_t^{t−1} and variance Σ_{tj} given in (6.169).

To develop a procedure for maximum likelihood estimation, the joint density of the data is

  f( y_1, ..., y_n ) = Π_{t=1}^n f( y_t | y_{1:t−1} )
                    = Π_{t=1}^n Σ_{j=1}^m Pr( A_t = M_j | y_{1:t−1} ) p( y_t | y_{1:t−1}, A_t = M_j ) ,

and hence, the likelihood can be written as

  ln L_Y( Θ ) = Σ_{t=1}^n ln ( Σ_{j=1}^m π_j( t ) p_j( t | t−1 ) ) .    (6.178)

For the hidden Markov model, π_j( t ) would be replaced by π_j( t | t−1 ). In (6.178), we will use the normal approximation to p_j( t | t−1 ). That is, henceforth, we will consider p_j( t | t−1 ) as the normal, N( M_j x_t^{t−1}, Σ_{tj} ), density, where x_t^{t−1} is given in (6.163) and Σ_{tj} is given in (6.169). We may consider maximizing (6.178) directly as a function of the parameters Θ = { μ_0, Φ, Q, R } using a Newton method, or we may consider applying the EM algorithm to the complete data likelihood.

To apply the EM algorithm as in Section 6.3, we call x_{0:n}, A_{1:n}, and y_{1:n} the complete data, with likelihood given by

  −2 ln L_{X,A,Y}( Θ ) = ln | Σ_0 | + ( x_0 − μ_0 )' Σ_0^{−1} ( x_0 − μ_0 )
      + n ln | Q | + Σ_{t=1}^n ( x_t − Φ x_{t−1} )' Q^{−1} ( x_t − Φ x_{t−1} )
      − 2 Σ_{t=1}^n Σ_{j=1}^m I( A_t = M_j ) ln π_j( t ) + n ln | R |
      + Σ_{t=1}^n Σ_{j=1}^m I( A_t = M_j ) ( y_t − A_t x_t )' R^{−1} ( y_t − A_t x_t ) .    (6.179)

As discussed in Section 6.3, we require the minimization of the conditional expectation

  Q( Θ | Θ^{(k−1)} ) = E{ −2 ln L_{X,A,Y}( Θ ) | y_{1:n}, Θ^{(k−1)} } ,    (6.180)

with respect to Θ at each iteration, k = 1, 2, ... . The calculation and maximization of (6.180) is similar to the case of (6.63). In particular, with

  π_j( t | n ) = E[ I( A_t = M_j ) | y_{1:n} ] ,    (6.181)

we obtain on iteration k,

  π_j^{(k)}( t ) = π_j( t | n ) ,    (6.182)

  μ_0^{(k)} = x_0^n ,    (6.183)

  Φ^{(k)} = S_{10} S_{00}^{−1} ,    (6.184)

  Q^{(k)} = n^{−1} ( S_{11} − S_{10} S_{00}^{−1} S_{10}' ) ,    (6.185)

and

  R^{(k)} = n^{−1} Σ_{t=1}^n Σ_{j=1}^m π_j( t | n ) [ ( y_t − M_j x_t^n )( y_t − M_j x_t^n )' + M_j P_t^n M_j' ] ,    (6.186)

where S_{11}, S_{10}, and S_{00} are given in (6.65)–(6.67). As before, at iteration k, the filters and the smoothers are calculated using the current values of the parameters, Θ^{(k−1)}, and Σ_0 is held fixed. Filtering is accomplished by using (6.163)–(6.167). Smoothing is derived in a similar manner to the derivation of the filter, and one is led to the smoother given in Property 6.2 and Property 6.3, with one exception: the initial smoother covariance, (6.53), is now

  P_{n,n−1}^n = Σ_{j=1}^m π_j( n | n ) ( I − K_{nj} M_j ) Φ P_{n−1}^{n−1} .    (6.187)

Unfortunately, the computation of π_j( t | n ) is excessively complicated and requires integrating over mixtures of normal distributions. Shumway and Stoffer (1991) suggest approximating the smoother π_j( t | n ) by the filter π_j( t | t ), and find the approximation works well.

Example 6.22 Analysis of the Influenza Data
We use the results of this section to analyze the U.S. monthly pneumonia and influenza mortality data plotted in Figure 5.7. Letting y_t denote the observations at month t, we model y_t in terms of a structural component model coupled with a hidden Markov process that determines whether a flu epidemic exists.

The model consists of three structural components. The first component, x_{t1}, is an AR(2) process chosen to represent the periodic (seasonal) component of the data,

  x_{t1} = α_1 x_{t−1,1} + α_2 x_{t−2,1} + w_{t1} ,    (6.188)

where w_{t1} is white noise with var( w_{t1} ) = σ_1². The second component, x_{t2}, is an AR(1) process with a nonzero constant term, which is chosen to represent the sharp rise in the data during an epidemic,

  x_{t2} = β_0 + β_1 x_{t−1,2} + w_{t2} ,    (6.189)

where w_{t2} is white noise with var( w_{t2} ) = σ_2². The third component, x_{t3}, is a fixed trend component given by

  x_{t3} = x_{t−1,3} + w_{t3} ,    (6.190)

where var( w_{t3} ) = 0. The case in which var( w_{t3} ) > 0, which corresponds to a stochastic trend (random walk), was tried here, but the estimation became unstable and led us to fit a fixed, rather than stochastic, trend. Thus, in the final model, the trend component satisfies ∇x_{t3} = 0; recall that in Example 6.18 the data were also differenced once before fitting the model.

Throughout the years, periods of normal influenza mortality (state 1) are modeled as

  y_t = x_{t1} + x_{t3} + v_t ,    (6.191)

where the measurement error, v_t, is white noise with var( v_t ) = σ_v². When an epidemic occurs (state 2), mortality is modeled as

  y_t = x_{t1} + x_{t2} + x_{t3} + v_t .    (6.192)

The model specified in (6.188)–(6.192) can be written in the general state-space form. The state equation is

  [ x_{t1}     ]   [ α_1  α_2  0    0 ] [ x_{t−1,1} ]   [ 0   ]   [ w_{t1} ]
  [ x_{t−1,1} ] = [ 1    0    0    0 ] [ x_{t−2,1} ] + [ 0   ] + [ 0      ]
  [ x_{t2}     ]   [ 0    0    β_1  0 ] [ x_{t−1,2} ]   [ β_0 ]   [ w_{t2} ]
  [ x_{t3}     ]   [ 0    0    0    1 ] [ x_{t−1,3} ]   [ 0   ]   [ 0      ]    (6.193)

Of course, (6.193) can be written in the standard state-equation form as

  x_t = Φ x_{t−1} + Υ u_t + w_t ,    (6.194)

where x_t = ( x_{t1}, x_{t−1,1}, x_{t2}, x_{t3} )', Υ = ( 0, 0, β_0, 0 )', u_t ≡ 1, and Q is a 4 × 4 matrix with σ_1² as the (1,1)-element, σ_2² as the (3,3)-element, and the remaining elements set equal to zero. The observation equation is

  y_t = A_t x_t + v_t ,    (6.195)

where A_t is 1 × 4 and v_t is white noise with var( v_t ) = R = σ_v². We assume all components of variance, w_{t1}, w_{t2}, and v_t, are uncorrelated.

As discussed in (6.191) and (6.192), A_t can take one of two possible forms,

  A_t = M_1 = [ 1, 0, 0, 1 ]   no epidemic ,
  A_t = M_2 = [ 1, 0, 1, 1 ]   epidemic ,

corresponding to the two possible states of (1) no flu epidemic and (2) flu epidemic, such that Pr( A_t = M_1 ) = 1 − Pr( A_t = M_2 ). In this example, we will assume A_t is a hidden Markov chain, and hence we use the updating equations given in Example 6.21, (6.172) and (6.173), with transition probabilities π_{11} = π_{22} = .75 (and, thus, π_{12} = π_{21} = .25).

Parameter estimation was accomplished using a quasi-Newton–Raphson procedure to maximize the approximate log likelihood given in (6.178), with initial values of π_1( 1 | 0 ) = π_2( 1 | 0 ) = .5.

Table 6.3. Estimation Results for Influenza Data

               Initial Model     Final Model
  Parameter    Estimates         Estimates
  α_1          1.422 (.100)      1.406 (.079)
  α_2          −.634 (.089)      −.622 (.069)
  β_0          .276 (.056)       .210 (.025)
  β_1          −.312 (.218)      —
  σ_1          .023 (.003)       .023 (.005)
  σ_2          .108 (.017)       .112 (.017)
  σ_v          .002 (.009)       —
  Estimated standard errors in parentheses

values of $\pi_1(1\mid 0) = \pi_2(1\mid 0) = .5$. Table 6.3 shows the results of the estimation procedure. On the initial fit, two estimates are not significant, namely, $\hat\beta_1$ and $\hat\sigma_v$. When $\sigma_v^2 = 0$, there is no measurement error, and the variability in data is explained solely by the variance components of the state system, namely, $\sigma_1^2$ and $\sigma_2^2$. The case in which $\beta_1 = 0$ corresponds to a simple level shift during a flu epidemic. In the final model, with $\beta_1$ and $\sigma_v^2$ removed, the estimated level shift ($\hat\beta_0$) corresponds to an increase in mortality by about .2 per 1000 during a flu epidemic. The estimates for the final model are also listed in Table 6.3.

Figure 6.16(a) shows a plot of the data, $y_t$, for the ten-year period of 1969–1978 as well as an indicator that takes the value of 1 if $\hat\pi_1(t\mid t-1) \ge .5$, or 2 if $\hat\pi_2(t\mid t-1) > .5$. The estimated prediction probabilities do a reasonable job of predicting a flu epidemic, although the peak in 1972 is missed.

Figure 6.16(b) shows the estimated filtered values (that is, filtering is done using the parameter estimates) of the three components of the model, $x_{t1}^t$, $x_{t2}^t$, and $x_{t3}^t$. Except for initial instability (which is not shown), $\hat x_{t1}^t$ represents the seasonal (cyclic) aspect of the data, $\hat x_{t2}^t$ represents the spikes during a flu epidemic, and $\hat x_{t3}^t$ represents the slow decline in flu mortality over the ten-year period of 1969–1978.

One-month-ahead prediction, say, $\hat y_t^{t-1}$, is obtained as

$\hat y_t^{t-1} = M_1 \hat x_t^{t-1}$   if $\hat\pi_1(t\mid t-1) > \hat\pi_2(t\mid t-1)$,
$\hat y_t^{t-1} = M_2 \hat x_t^{t-1}$   if $\hat\pi_1(t\mid t-1) \le \hat\pi_2(t\mid t-1)$.

Of course, $\hat x_t^{t-1}$ is the estimated state prediction, obtained via the filter presented in (6.163)–(6.167) (with the addition of the constant term in the model) using the estimated parameters. The results are shown in Figure 6.16(c). The precision of the forecasts can be measured by the innovation variances, $\Sigma_{t1}$ when no epidemic is predicted, and $\Sigma_{t2}$ when an epidemic is predicted. These values become stable quickly, and when no epidemic is predicted, the estimated standard prediction error is approximately .02 (this is the square root of $\Sigma_{t1}$ for $t$ large); when a flu epidemic is predicted, the estimated standard prediction error is approximately .11.

The results of this analysis are impressive given the small number of parameters and the degree of approximation that was made to obtain a computationally

Fig. 6.16. (a) Influenza data, $y_t$ (line–points), and a prediction indicator (1 or 2) that an epidemic occurs in month $t$ given the data up to month $t-1$ (dashed line). (b) The three filtered structural components of influenza mortality: $\hat x_{t1}^t$ (cyclic trace), $\hat x_{t2}^t$ (spiked trace), and $\hat x_{t3}^t$ (negative linear trace). (c) One-month-ahead predictions shown as upper and lower limits $\hat y_t^{t-1} \pm 2\sqrt{\hat P_t^{t-1}}$ (gray swatch), of the number of pneumonia and influenza deaths, and $y_t$ (points).

simple method for fitting a complex model. Further evidence of the strength of this technique can be found in the example given in Shumway and Stoffer (1991).

The R code for the final model estimation is as follows.
y = as.matrix(flu); num = length(y); nstate = 4
M1 = as.matrix(cbind(1,0,0,1))  # obs matrix normal
M2 = as.matrix(cbind(1,0,1,1))  # obs matrix flu epi
prob = matrix(0,num,1); yp = y  # to store pi2(t|t-1) & y(t|t-1)
xfilter = array(0, dim=c(nstate,1,num))  # to store x(t|t)
# Function to Calculate Likelihood
Linn = function(para){
  alpha1 = para[1]; alpha2 = para[2]; beta0 = para[3]
  sQ1 = para[4]; sQ2 = para[5]; like = 0
  xf = matrix(0, nstate, 1)  # x filter
  xp = matrix(0, nstate, 1)  # x pred
  Pf = diag(.1, nstate)      # filter cov
  Pp = diag(.1, nstate)      # pred cov
  pi11 <- .75 -> pi22; pi12 <- .25 -> pi21; pif1 <- .5 -> pif2
  phi = matrix(0,nstate,nstate)

  phi[1,1] = alpha1; phi[1,2] = alpha2; phi[2,1] = 1; phi[4,4] = 1
  Ups = as.matrix(rbind(0,0,beta0,0))
  Q = matrix(0,nstate,nstate)
  Q[1,1] = sQ1^2; Q[3,3] = sQ2^2; R = 0   # R=0 in final model
  # begin filtering #
  for(i in 1:num){
    xp = phi%*%xf + Ups; Pp = phi%*%Pf%*%t(phi) + Q
    sig1 = as.numeric(M1%*%Pp%*%t(M1) + R)
    sig2 = as.numeric(M2%*%Pp%*%t(M2) + R)
    k1 = Pp%*%t(M1)/sig1; k2 = Pp%*%t(M2)/sig2
    e1 = y[i]-M1%*%xp; e2 = y[i]-M2%*%xp
    pip1 = pif1*pi11 + pif2*pi21; pip2 = pif1*pi12 + pif2*pi22
    den1 = (1/sqrt(sig1))*exp(-.5*e1^2/sig1)
    den2 = (1/sqrt(sig2))*exp(-.5*e2^2/sig2)
    denm = pip1*den1 + pip2*den2
    pif1 = pip1*den1/denm; pif2 = pip2*den2/denm
    pif1 = as.numeric(pif1); pif2 = as.numeric(pif2)
    e1 = as.numeric(e1); e2 = as.numeric(e2)
    xf = xp + pif1*k1*e1 + pif2*k2*e2
    eye = diag(1, nstate)
    Pf = pif1*(eye-k1%*%M1)%*%Pp + pif2*(eye-k2%*%M2)%*%Pp
    like = like - log(pip1*den1 + pip2*den2)
    prob[i]<<-pip2; xfilter[,,i]<<-xf; innov.sig<<-c(sig1,sig2)
    yp[i]<<-ifelse(pip1 > pip2, M1%*%xp, M2%*%xp)
  }
  return(like)
}
# Estimation
alpha1 = 1.4; alpha2 = -.5; beta0 = .3; sQ1 = .1; sQ2 = .1
init.par = c(alpha1, alpha2, beta0, sQ1, sQ2)
(est = optim(init.par, Linn, NULL, method='BFGS', hessian=TRUE, control=list(trace=1,REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
u = cbind(estimate=est$par, SE)
rownames(u) = c('alpha1','alpha2','beta0','sQ1','sQ2'); u
          estimate          SE
alpha1  1.40570967 0.078587727
alpha2 -0.62198715 0.068733109
beta0   0.21049042 0.024625302
sQ1     0.02310306 0.001635291
sQ2     0.11217287 0.016684663
# Graphics
predepi = ifelse(prob<.5, 0, 1); k = 6:length(y)
Time = time(flu)[k]
regime = predepi[k]+1
par(mfrow=c(3,1), mar=c(2,3,1,1)+.1)
plot(Time, y[k], type="n", ylab="")
grid(lty=2); lines(Time, y[k], col=gray(.7))
text(Time, y[k], col=regime, labels=regime, cex=1.1)
text(1979,.95,"(a)")
plot(Time, xfilter[1,,k], type="n", ylim=c(-.1,.4), ylab="")
grid(lty=2); lines(Time, xfilter[1,,k])
lines(Time, xfilter[3,,k]); lines(Time, xfilter[4,,k])
text(1979,.35,"(b)")
plot(Time, y[k], type="n", ylim=c(.1,.9), ylab="")
grid(lty=2); points(Time, y[k], pch=19)
prde1 = 2*sqrt(innov.sig[1]); prde2 = 2*sqrt(innov.sig[2])
prde = ifelse(predepi[k]<.5, prde1, prde2)

xx = c(Time, rev(Time))
yy = c(yp[k]-prde, rev(yp[k]+prde))
polygon(xx, yy, border=8, col=gray(.6, alpha=.3))
text(1979,.85,"(c)")

6.11 Stochastic Volatility

Stochastic volatility (SV) models are an alternative to GARCH-type models that were presented in Chapter 5. Throughout this section, we let $r_t$ denote the returns of some financial asset. Most models for return data used in practice are of a multiplicative form that we have seen in Section 5.3,

$r_t = \sigma_t \varepsilon_t$,   (6.196)

where $\varepsilon_t$ is an iid sequence and the volatility process, $\sigma_t$, is a non-negative stochastic process such that $\varepsilon_t$ is independent of $\sigma_s$ for all $s \le t$. It is often assumed that $\varepsilon_t$ has zero mean and unit variance.

In SV models, the volatility is a nonlinear transform of a hidden linear autoregressive process where the hidden volatility process, $x_t = \log \sigma_t^2$, follows a first order autoregression,

$x_t = \phi x_{t-1} + w_t$,   (6.197a)
$r_t = \beta \exp(x_t/2)\, \varepsilon_t$,   (6.197b)

where $w_t \sim \mathrm{iid\ N}(0, \sigma_w^2)$ and $\varepsilon_t$ is iid noise having finite moments. The error processes $\varepsilon_t$ and $w_t$ are assumed to be mutually independent and $|\phi| < 1$. As $w_t$ is normally distributed, $x_t$ is also normally distributed. All moments of $\varepsilon_t$ exist, so that all moments of $r_t$ in (6.197) exist as well. Assuming that $x_0 \sim \mathrm{N}(0, \sigma_w^2/(1-\phi^2))$ [the stationary distribution], the kurtosis of $r_t$ is given by^{6.6}

$\kappa_4(r_t) = \kappa_4(\varepsilon_t) \exp(\sigma_x^2)$,   (6.198)

where $\sigma_x^2 = \sigma_w^2/(1-\phi^2)$ is the (stationary) variance of $x_t$. Thus $\kappa_4(r_t) > \kappa_4(\varepsilon_t)$, so that if $\varepsilon_t \sim \mathrm{iid\ N}(0,1)$, the distribution of $r_t$ is leptokurtic. The autocorrelation function of $\{r_t^{2m};\ t = 1, 2, \ldots\}$ for any integer $m$ is given by (see Problem 6.29)

$\mathrm{corr}(r_{t+h}^{2m}, r_t^{2m}) = \dfrac{\exp(m^2 \sigma_x^2 \phi^h) - 1}{\kappa_{4m}(\varepsilon_t)\exp(m^2 \sigma_x^2) - 1}$.   (6.199)

The decay rate of the autocorrelation function is faster than exponential at small time lags and then stabilizes to $\phi$ for large lags.

Sometimes it is easier to work with the linear form of the model where we define $y_t = \log r_t^2$ and $v_t = \log \varepsilon_t^2$,

^{6.6} For an integer $m$ and a random variable $U$, $\kappa_m(U) := \mathrm{E}[|U|^m]/(\mathrm{E}[|U|^2])^{m/2}$. Typically, $\kappa_4$ is called kurtosis and $\kappa_3$ skewness.

in which case we may write

$y_t = \alpha + x_t + v_t$.   (6.200)

A constant is usually needed in either the state equation or the observation equation (but not typically both), so we write the state equation as

$x_t = \phi_0 + \phi_1 x_{t-1} + w_t$,   (6.201)

where $w_t$ is white Gaussian noise with variance $\sigma_w^2$. The constant $\phi_0$ is sometimes referred to as the leverage effect. Together, (6.200) and (6.201) make up the stochastic volatility model due to Taylor (1982).

If $\varepsilon_t^2$ had a log-normal distribution, (6.200)–(6.201) would form a Gaussian state-space model, and we could then use standard DLM results to fit the model to data. Unfortunately, that assumption does not seem to work well. Instead, one often keeps the ARCH normality assumption on $\varepsilon_t \sim \mathrm{iid\ N}(0,1)$, in which case, $v_t = \log \varepsilon_t^2$ is distributed as the log of a chi-squared random variable with one degree of freedom. This density is given by

$f(v) = \dfrac{1}{\sqrt{2\pi}}\exp\left\{-\tfrac{1}{2}\left(e^v - v\right)\right\}, \quad -\infty < v < \infty$.   (6.202)

The mean of the distribution is $-(\gamma + \log 2)$, where $\gamma \approx 0.5772$ is Euler's constant, and the variance of the distribution is $\pi^2/2$. It is a highly skewed density (see Figure 6.18) but it is not flexible because there are no free parameters to be estimated.

Various approaches to the fitting of stochastic volatility models have been examined; these methods include a wide range of assumptions on the observational noise process. A good summary of the proposed techniques, both Bayesian (via MCMC) and non-Bayesian approaches (such as quasi-maximum likelihood estimation and the EM algorithm), can be found in Jacquier et al. (1994), and Shephard (1996). Simulation methods for classical inference applied to stochastic volatility models are discussed in Danielson (1994) and Sandmann and Koopman (1998).

Kim, Shephard and Chib (1998) proposed modeling the log of a chi-squared random variable by a mixture of seven normals to approximate the first four moments of the observational error distribution; the mixture is fixed and no additional model parameters are added by using this technique. The basic model assumption that $\varepsilon_t$ is Gaussian is unrealistic for most applications. In an effort to keep matters simple but more general (in that we allow the observational error dynamics to depend on parameters that will be fitted), our method of fitting stochastic volatility models is to retain the Gaussian state equation (6.201), but to write the observation equation, as

$y_t = \alpha + x_t + \eta_t$,   (6.203)

where $\eta_t$ is white noise, whose distribution is a mixture of two normals, one centered at zero. In particular, we write

$\eta_t = I_t z_{t0} + (1 - I_t) z_{t1}$,   (6.204)

where $I_t$ is an iid Bernoulli process, $\Pr\{I_t = 0\} = \pi_0$, $\Pr\{I_t = 1\} = \pi_1$ ($\pi_0 + \pi_1 = 1$), $z_{t0} \sim \mathrm{iid\ N}(0, \sigma_0^2)$, and $z_{t1} \sim \mathrm{iid\ N}(\mu_1, \sigma_1^2)$.
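To fix ideas, the following minimal R sketch (not part of the text's example code) simulates the SV model (6.197) and forms the linearized observations (6.200); the parameter values and object names are illustrative assumptions only.
set.seed(1)
n = 1000
phi1 = .95; sig_w = .2; alpha = -9                      # assumed illustrative values
x = numeric(n)
x[1] = rnorm(1, 0, sig_w/sqrt(1-phi1^2))                # start at the stationary distribution
for (t in 2:n) x[t] = phi1*x[t-1] + rnorm(1, 0, sig_w)  # hidden log-volatility AR(1)
eps = rnorm(n)                                          # epsilon_t ~ iid N(0,1)
r = exp((alpha + x)/2)*eps                              # returns, with log sigma_t^2 = alpha + x_t
y = log(r^2)                                            # y_t = alpha + x_t + log(eps_t^2)
par(mfrow=c(2,1)); plot.ts(r); plot.ts(y)
A histogram of y - alpha - x from this simulation will resemble the skewed density in (6.202), which is the motivation for the two-component normal mixture in (6.204).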

The advantage to this model is that it is easy to fit because it uses normality. In fact, the model equations (6.201) and (6.203)–(6.204) are similar to those presented in Peña and Guttman (1988), who used the idea to obtain a robust Kalman filter, and, as previously mentioned, in Kim et al. (1998). The material presented in Section 6.10 applies here, and in particular, the filtering equations for this model are

$x_{t+1}^t = \phi_0 + \phi_1 x_t^{t-1} + \sum_{j=0}^{1} \pi_{tj} K_{tj} \epsilon_{tj}$,   (6.205)

$P_{t+1}^t = \phi_1^2 P_t^{t-1} + \sigma_w^2 - \sum_{j=0}^{1} \pi_{tj} K_{tj}^2 \Sigma_{tj}$,   (6.206)

$\epsilon_{t0} = y_t - \alpha - x_t^{t-1}, \qquad \epsilon_{t1} = y_t - \alpha - x_t^{t-1} - \mu_1$,   (6.207)

$\Sigma_{t0} = P_t^{t-1} + \sigma_0^2, \qquad \Sigma_{t1} = P_t^{t-1} + \sigma_1^2$,   (6.208)

$K_{t0} = \phi_1 P_t^{t-1}/\Sigma_{t0}, \qquad K_{t1} = \phi_1 P_t^{t-1}/\Sigma_{t1}$.   (6.209)

To complete the filtering, we must be able to assess the probabilities $\pi_{t1} = \Pr(I_t = 1 \mid y_{1:t})$, for $t = 1, \ldots, n$; of course, $\pi_{t0} = 1 - \pi_{t1}$. Let $p_j(t\mid t-1)$ denote the conditional density of $y_t$ given the past $y_{1:t-1}$, and $I_t = j$ for $j = 0, 1$. Then,

$\pi_{t1} = \dfrac{\pi_1 p_1(t\mid t-1)}{\pi_0 p_0(t\mid t-1) + \pi_1 p_1(t\mid t-1)}$,   (6.210)

where we assume the distribution $\pi_j$, for $j = 0, 1$ has been specified a priori. If the investigator has no reason to prefer one state over another the choice of uniform priors, $\pi_1 = 1/2$, will suffice. Unfortunately, it is computationally difficult to obtain the exact values of $p_j(t\mid t-1)$; although we can give an explicit expression of $p_j(t\mid t-1)$, the actual computation of the conditional density is prohibitive. A viable approximation, however, is to choose $p_j(t\mid t-1)$ to be the normal density, $\mathrm{N}(x_t^{t-1} + \mu_j, \Sigma_{tj})$, for $j = 0, 1$ and $\mu_0 = 0$; see Section 6.10 for details.

The innovations filter given in (6.205)–(6.210) can be derived from the Kalman filter by a simple conditioning argument; e.g., to derive (6.205), write

$\mathrm{E}(x_{t+1} \mid y_{1:t}) = \sum_{j=0}^{1} \mathrm{E}(x_{t+1} \mid y_{1:t}, I_t = j)\Pr(I_t = j \mid y_{1:t})$
$\qquad = \sum_{j=0}^{1} \left(\phi_0 + \phi_1 x_t^{t-1} + K_{tj}\epsilon_{tj}\right)\pi_{tj}$
$\qquad = \phi_0 + \phi_1 x_t^{t-1} + \sum_{j=0}^{1} \pi_{tj} K_{tj} \epsilon_{tj}$.

Estimation of the parameters, $\Theta = (\phi_0, \phi_1, \sigma_0^2, \mu_1, \sigma_1^2, \sigma_w^2)'$, is accomplished via MLE based on the likelihood given by

Fig. 6.17. Approximately four hundred observations of $r_t$, the daily returns of the NYSE surrounding the crash of October 19, 1987. Also displayed is the corresponding one-step-ahead predicted log volatility, $\hat x_t^{t-1}$, where $x_t = \log \sigma_t^2$, scaled by .1 to fit on the plot.

$\ln L_Y(\Theta) = \sum_{t=1}^{n} \ln\left( \sum_{j=0}^{1} \pi_j\, p_j(t\mid t-1) \right)$,   (6.211)

where the density $p_j(t\mid t-1)$ is approximated by the normal density, $\mathrm{N}(x_t^{t-1} + \mu_j, \Sigma_{tj})$, previously mentioned. We may consider maximizing (6.211) directly as a function of the parameters $\Theta$ using a Newton method, or we may consider applying the EM algorithm to the complete data likelihood.

Example 6.23 Analysis of the New York Stock Exchange Returns
Figure 6.17 shows the returns, $r_t$, for about 400 of the 2000 trading days of the NYSE. Model (6.201) and (6.203)–(6.204), with $\pi_1$ fixed at .5, was fit to the data using a quasi-Newton–Raphson method to maximize (6.211). The results are given in Table 6.4. Figure 6.18 compares the density of the log of a $\chi^2_1$ with the fitted normal mixture; we note the data indicate a substantial amount of probability in the upper tail that the log-$\chi^2_1$ distribution misses.

Finally, Figure 6.17 also displays the one-step-ahead predicted log volatility, $\hat x_t^{t-1}$, where $x_t = \log \sigma_t^2$, surrounding the crash of October 19, 1987. The analysis indicates that $\phi_0$ is not needed. The R code when $\phi_0$ is included in the model is as follows.
y = log(nyse^2)
num = length(y)
# Initial Parameters
phi0 = 0; phi1 = .95; sQ = .2; alpha = mean(y)
sR0 = 1; mu1 = -3; sR1 = 2
init.par = c(phi0, phi1, sQ, alpha, sR0, mu1, sR1)
# Innovations Likelihood
Linn = function(para){
  phi0 = para[1]; phi1 = para[2]; sQ = para[3]; alpha = para[4]
  sR0 = para[5]; mu1 = para[6]; sR1 = para[7]
  sv = SVfilter(num, y, phi0, phi1, sQ, alpha, sR0, mu1, sR1)

Table 6.4. Estimation Results for the NYSE Fit
  Parameter   Estimate   Estimated Standard Error
  φ_0         −.006      .016†
  φ_1         .988       .007
  σ_w         .091       .027
  α           −9.613     1.269
  σ_0         1.220      .065
  μ_1         −2.292     .205
  σ_1         2.683      .105
  † not significant

  return(sv$like)
}
# Estimation
(est = optim(init.par, Linn, NULL, method='BFGS', hessian=TRUE, control=list(trace=1,REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
u = cbind(estimates=est$par, SE)
rownames(u) = c('phi0','phi1','sQ','alpha','sigv0','mu1','sigv1'); u
# Graphics (need filters at the estimated parameters)
phi0 = est$par[1]; phi1 = est$par[2]; sQ = est$par[3]; alpha = est$par[4]
sR0 = est$par[5]; mu1 = est$par[6]; sR1 = est$par[7]
sv = SVfilter(num, y, phi0, phi1, sQ, alpha, sR0, mu1, sR1)
# densities plot (f is chi-sq, fm is fitted mixture)
x = seq(-15, 6, by=.01)
f = exp(-.5*(exp(x)-x))/(sqrt(2*pi))
f0 = exp(-.5*(x^2)/sR0^2)/(sR0*sqrt(2*pi))
f1 = exp(-.5*(x-mu1)^2/sR1^2)/(sR1*sqrt(2*pi))
fm = (f0+f1)/2
plot(x, f, type='l'); lines(x, fm, lty=2, lwd=2)
dev.new(); Time = 701:1100
plot(Time, nyse[Time], type='l', col=4, lwd=2, ylab='', xlab='', ylim=c(-.18,.12))
lines(Time, sv$xp[Time]/10, lwd=2, col=6)

It is possible to use the bootstrap procedure described in Section 6.7 for the stochastic volatility model, with some minor changes. The following procedure was described in Stoffer and Wall (2004). We develop a vector first-order equation, as was done in (6.123). First, using (6.207), and noting that $y_t = \pi_{t0} y_t + \pi_{t1} y_t$, we may write

$y_t = \alpha + x_t^{t-1} + \pi_{t0}\epsilon_{t0} + \pi_{t1}(\epsilon_{t1} + \mu_1)$.   (6.212)

Consider the standardized innovations

$e_{tj} = \Sigma_{tj}^{-1/2}\epsilon_{tj}, \quad j = 0, 1$,   (6.213)

and define the $2\times 1$ vector

$e_t = \begin{pmatrix} e_{t0} \\ e_{t1} \end{pmatrix}$.

Fig. 6.18. Density of the log of a $\chi^2_1$ as given by (6.202) (solid line) and the fitted normal mixture (dashed line) from Example 6.23.

Also, define the $2\times 1$ vector

$\xi_t = \begin{pmatrix} x_{t+1}^t \\ y_t \end{pmatrix}$.

Combining (6.205) and (6.212) results in a vector first-order equation for $\xi_t$ given by

$\xi_t = F \xi_{t-1} + G_t + H_t e_t$,   (6.214)

where

$F = \begin{pmatrix} \phi_1 & 0 \\ 1 & 0 \end{pmatrix}, \quad G_t = \begin{pmatrix} \phi_0 \\ \alpha + \pi_{t1}\mu_1 \end{pmatrix}, \quad H_t = \begin{pmatrix} \pi_{t0} K_{t0}\Sigma_{t0}^{1/2} & \pi_{t1} K_{t1}\Sigma_{t1}^{1/2} \\ \pi_{t0}\Sigma_{t0}^{1/2} & \pi_{t1}\Sigma_{t1}^{1/2} \end{pmatrix}$.

Hence, the steps in bootstrapping for this case are the same as steps (i) through (v) described in Section 6.7, but with (6.123) replaced by the following first-order equation:

$\xi_t^* = F(\hat\Theta)\,\xi_{t-1}^* + G_t(\hat\Theta;\hat\pi_{t1}) + H_t(\hat\Theta;\hat\pi_{t1})\,e_t^*$,   (6.215)

where $\hat\Theta = (\hat\phi_0, \hat\phi_1, \hat\sigma_w^2, \hat\alpha, \hat\mu_1, \hat\sigma_0^2, \hat\sigma_1^2)'$ is the MLE of $\Theta$, and $\hat\pi_{t1}$ is estimated via (6.210), replacing $p_1(t\mid t-1)$ and $p_0(t\mid t-1)$ by their respective estimated normal densities ($\hat\pi_{t0} = 1 - \hat\pi_{t1}$).

Example 6.24 Analysis of the U.S. GNP Growth Rate
In Example 5.4, we fit an ARCH model to the U.S. GNP growth rate. In this example, we will fit a stochastic volatility model to the residuals from the AR(1) fit on the growth rate (see Example 3.39). Figure 6.19 shows the log of the squared residuals, say $y_t$, from the fit on the U.S. GNP series. The stochastic volatility model (6.200)–(6.204) was then fit to $y_t$. Table 6.5 shows the MLEs of the model parameters along with their asymptotic SEs assuming the model is correct. Also displayed in Table 6.5 are the SEs of $B = 500$ bootstrapped samples. There is little agreement between most of the asymptotic values and the bootstrapped values. The interest here, however, is not so much in the SEs, but in the actual sampling distribution

Fig. 6.19. Results for Example 6.24: Log of the squared residuals from an AR(1) fit on GNP growth rate. Bootstrap histogram and asymptotic distribution of $\hat\phi_1$.

of the estimates. For example, Figure 6.19 compares the bootstrap histogram and asymptotic normal distribution of $\hat\phi_1$. In this case, the bootstrap distribution exhibits positive kurtosis and skewness which is missed by the assumption of asymptotic normality.

The R code for this example is as follows. We held $\phi_0$ at 0 for this analysis because it was not significantly different from 0 in an initial analysis.
n.boot = 500                       # number of bootstrap replicates
tol = sqrt(.Machine$double.eps)    # convergence tolerance
gnpgr = diff(log(gnp))
fit = arima(gnpgr, order=c(1,0,0))
y = as.matrix(log(resid(fit)^2))
num = length(y)
plot.ts(y, ylab='')
# Initial Parameters
phi1 = .9; sQ = .5; alpha = mean(y); sR0 = 1; mu1 = -3; sR1 = 2.5
init.par = c(phi1, sQ, alpha, sR0, mu1, sR1)
# Innovations Likelihood
Linn = function(para, y.data){
  phi1 = para[1]; sQ = para[2]; alpha = para[3]
  sR0 = para[4]; mu1 = para[5]; sR1 = para[6]
  sv = SVfilter(num, y.data, 0, phi1, sQ, alpha, sR0, mu1, sR1)
  return(sv$like)
}
# Estimation
(est = optim(init.par, Linn, NULL, y.data=y, method='BFGS', hessian=TRUE, control=list(trace=1,REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
u = rbind(estimates=est$par, SE)
colnames(u) = c('phi1','sQ','alpha','sig0','mu1','sig1'); round(u, 3)
            phi1    sQ  alpha  sig0    mu1  sig1
estimates  0.884 0.381 -9.654 0.835 -2.350 2.453
SE         0.109 0.221  0.343 0.204  0.495 0.293
# Bootstrap
para.star = matrix(0, n.boot, 6)   # to store parameter estimates
for (jb in 1:n.boot){

Table 6.5. Estimates and Standard Errors for GNP Example
  Parameter   MLE      Asymptotic SE   Bootstrap SE†
  φ_1         0.884    0.109           0.057
  σ_w         0.381    0.221           0.324
  α           −9.654   0.343           1.529
  σ_0         0.835    0.204           0.527
  μ_1         −2.350   0.495           0.410
  σ_1         2.453    0.293           0.375
  † Based on 500 bootstrapped samples.

  cat('iteration:', jb, '\n')
  phi1 = est$par[1]; sQ = est$par[2]; alpha = est$par[3]
  sR0 = est$par[4]; mu1 = est$par[5]; sR1 = est$par[6]
  Q = sQ^2; R0 = sR0^2; R1 = sR1^2
  sv = SVfilter(num, y, 0, phi1, sQ, alpha, sR0, mu1, sR1)
  sig0 = sv$Pp+R0; sig1 = sv$Pp+R1
  K0 = sv$Pp/sig0; K1 = sv$Pp/sig1
  inn0 = y-sv$xp-alpha; inn1 = y-sv$xp-mu1-alpha
  den1 = (1/sqrt(sig1))*exp(-.5*inn1^2/sig1)
  den0 = (1/sqrt(sig0))*exp(-.5*inn0^2/sig0)
  fpi1 = den1/(den0+den1)
  # start resampling at t=4
  e0 = inn0/sqrt(sig0); e1 = inn1/sqrt(sig1)
  indx = sample(4:num, replace=TRUE)
  sinn = cbind(c(e0[1:3], e0[indx]), c(e1[1:3], e1[indx]))
  eF = matrix(c(phi1, 1, 0, 0), 2, 2)
  xi = cbind(sv$xp, y)   # initialize
  for (i in 4:num){      # generate boot sample
    G = matrix(c(0, alpha+fpi1[i]*mu1), 2, 1)
    h21 = (1-fpi1[i])*sqrt(sig0[i]); h11 = h21*K0[i]
    h22 = fpi1[i]*sqrt(sig1[i]); h12 = h22*K1[i]
    H = matrix(c(h11,h21,h12,h22), 2, 2)
    xi[i,] = t(eF%*%as.matrix(xi[i-1,],2) + G + H%*%as.matrix(sinn[i,],2))
  }
  # Estimates from boot data
  y.star = xi[,2]
  phi1=.9; sQ=.5; alpha=mean(y.star); sR0=1; mu1=-3; sR1=2.5   # same as for data
  init.par = c(phi1, sQ, alpha, sR0, mu1, sR1)
  est.star = optim(init.par, Linn, NULL, y.data=y.star, method='BFGS', control=list(reltol=tol))
  para.star[jb,] = cbind(est.star$par[1], abs(est.star$par[2]), est.star$par[3], abs(est.star$par[4]), est.star$par[5], abs(est.star$par[6]))
}
# Some summary statistics and graphics
rmse = rep(NA, 6)   # SEs from the bootstrap
for(i in 1:6){
  rmse[i] = sqrt(sum((para.star[,i]-est$par[i])^2)/n.boot)
  cat(i, rmse[i], '\n')
}
dev.new(); phi = para.star[,1]
hist(phi, 15, prob=TRUE, main='', xlim=c(.4,1.2), xlab='')
xx = seq(.4, 1.2, by=.01)
lines(xx, dnorm(xx, mean=u[1,1], sd=u[2,1]), lty='dashed', lwd=2)
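Beyond the bootstrap SEs reported in Table 6.5, the sampled parameter values themselves can be summarized directly. The following lines are a small sketch (not in the text's code) that assumes the matrix para.star produced by the loop above is still in the workspace.
# percentile summary of the bootstrap draws (assumes para.star exists)
colnames(para.star) = c('phi1','sQ','alpha','sig0','mu1','sig1')
round(t(apply(para.star, 2, quantile, probs=c(.025,.5,.975))), 3)
Intervals formed this way reflect the skewness of the bootstrap distribution seen in Figure 6.19, which symmetric normal-theory intervals would miss.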

6.12 Bayesian Analysis of State Space Models

We now consider some Bayesian approaches to fitting linear Gaussian state space models via Markov chain Monte Carlo (MCMC) methods. We assume that the model is given by (6.1)–(6.2); inputs are allowed in the model, but we do not display them for the sake of brevity. In this case, Frühwirth-Schnatter (1994) and Carter and Kohn (1994) established the MCMC procedure that we will discuss here. A comprehensive text that we highly recommend for this case is Petris et al. (2009) and the corresponding R package dlm. For nonlinear and non-Gaussian models, the reader is referred to Douc, Moulines, & Stoffer (2014). As in previous sections, we have $n$ observations denoted by $y_{1:n} = \{y_1, \ldots, y_n\}$, whereas the states are denoted as $x_{0:n} = \{x_0, x_1, \ldots, x_n\}$, with $x_0$ being the initial state.

MCMC methods refer to Monte Carlo integration methods that use a Markovian updating scheme to sample from intractable posterior distributions. The most common MCMC method is the Gibbs sampler, which is essentially a modification of the Metropolis algorithm (Metropolis et al., 1953) developed by Hastings (1970) in the statistical setting and by Geman and Geman (1984) in the context of image restoration. Later, Tanner and Wong (1987) used the ideas in their substitution sampling approach, and Gelfand and Smith (1990) developed the Gibbs sampler for a wide class of parametric models. The basic strategy is to use conditional distributions to set up a Markov chain to obtain samples from a joint distribution. The following simple case demonstrates this idea.

Example 6.25 Gibbs Sampling for the Bivariate Normal
Suppose we wish to obtain samples from a bivariate normal distribution,

$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathrm{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$,

where $|\rho| < 1$, but we can only generate samples from a univariate normal.

• The univariate conditionals are [see (B.9)–(B.10)]

$(X \mid Y = y) \sim \mathrm{N}(\rho y, 1 - \rho^2)$ and $(Y \mid X = x) \sim \mathrm{N}(\rho x, 1 - \rho^2)$,

and we can simulate from these distributions.

• Construct a Markov chain: Pick $X^{(0)} = x_0$, and then iterate the process $X^{(0)} \mapsto Y^{(0)} \mapsto X^{(1)} \mapsto Y^{(1)} \mapsto \cdots \mapsto X^{(k)} \mapsto Y^{(k)} \mapsto \cdots$, where

$X^{(k)} \mid (Y^{(k-1)} = y_{k-1}) \sim \mathrm{N}(\rho y_{k-1}, 1 - \rho^2)$,
$Y^{(k)} \mid (X^{(k)} = x_k) \sim \mathrm{N}(\rho x_k, 1 - \rho^2)$.

• The joint distribution of $(X^{(k)}, Y^{(k)})$ is (see Problem 3.2)

$\begin{pmatrix} X^{(k)} \\ Y^{(k)} \end{pmatrix} \sim \mathrm{N}\left( \begin{pmatrix} \rho^{2k} x_0 \\ \rho^{2k+1} x_0 \end{pmatrix}, \begin{pmatrix} 1 - \rho^{4k} & \rho(1 - \rho^{4k}) \\ \rho(1 - \rho^{4k}) & 1 - \rho^{4k+2} \end{pmatrix} \right)$.

• Thus, for any starting value, $(X^{(k)}, Y^{(k)}) \to_d (X, Y)$ as $k \to \infty$; the speed depends on $\rho$. Then one would run the chain and throw away the initial $n_0$ sampled values (burnin) and retain the rest.

For state space models, the main objective is to obtain the posterior density of the parameters $p(\Theta \mid y_{1:n})$ or $p(x_{0:n} \mid y_{1:n})$ if the states are meaningful. For example, the states do not have any meaning for an ARMA model, but they are important for a stochastic volatility model. It is generally easier to get samples from the full posterior $p(\Theta, x_{0:n} \mid y_{1:n})$ and then marginalize ("average") to obtain $p(\Theta \mid y_{1:n})$ or $p(x_{0:n} \mid y_{1:n})$. As previously mentioned, the most popular method is to run a full Gibbs sampler, alternating between sampling model parameters and latent state sequences from their respective full conditional distributions.

Procedure 6.1 Gibbs Sampler for State Space Models
(i) Draw $\Theta' \sim p(\Theta \mid x_{0:n}, y_{1:n})$
(ii) Draw $x_{0:n}' \sim p(x_{0:n} \mid \Theta', y_{1:n})$

Procedure 6.1-(i) is generally much easier because it conditions on the complete data $\{x_{0:n}, y_{1:n}\}$, which we saw in Section 6.3 can simplify the problem. Procedure 6.1-(ii) amounts to sampling from the joint smoothing distribution of the latent state sequence and is generally difficult. For linear Gaussian models, however, both parts of Procedure 6.1 are relatively easy to perform.

To accomplish Procedure 6.1-(i), note that

$p(\Theta \mid x_{0:n}, y_{1:n}) \propto \pi(\Theta)\, p(x_0 \mid \Theta) \prod_{t=1}^{n} p(x_t \mid x_{t-1}, \Theta)\, p(y_t \mid x_t, \Theta)$,   (6.216)

where $\pi(\Theta)$ is the prior on the parameters. The prior often depends on "hyperparameters" that add another level to the hierarchy. For simplicity, these hyperparameters are assumed to be known. The parameters are typically conditionally independent with distributions from standard parametric families (at least as long as the prior distribution is conjugate relative to the Bayesian model specification). For non-conjugate models, one option is to replace Procedure 6.1-(i) with a Metropolis-Hastings step, which is feasible since the complete data density $p(\Theta, x_{0:n}, y_{1:n})$ can be evaluated pointwise. For example, in the univariate model

$x_t = \phi x_{t-1} + w_t$ and $y_t = x_t + v_t$,

where $w_t \sim \mathrm{iid\ N}(0, \sigma_w^2)$ independent of $v_t \sim \mathrm{iid\ N}(0, \sigma_v^2)$, we can use the normal and inverse gamma (IG) distributions for priors. In this case, the priors on the variance components are chosen from a conjugate family, that is, $\sigma_w^2 \sim \mathrm{IG}(a_0/2, b_0/2)$ independent of $\sigma_v^2 \sim \mathrm{IG}(c_0/2, d_0/2)$, where IG denotes the inverse (reciprocal) gamma distribution. Then, for example, if the prior on $\phi$ is Gaussian, $\phi \sim \mathrm{N}(\mu_\phi, \sigma_\phi^2)$, then $\phi \mid \sigma_w, x_{0:n}, y_{1:n} \sim \mathrm{N}(Bb, B)$, where

$B = \left( \frac{1}{\sigma_\phi^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n} x_{t-1}^2 \right)^{-1}, \qquad b = \frac{\mu_\phi}{\sigma_\phi^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n} x_t x_{t-1}$,

and

$\sigma_w^2 \mid \phi, x_{0:n}, y_{1:n} \sim \mathrm{IG}\left\{ \tfrac{1}{2}(a_0 + n),\ \tfrac{1}{2}\Big[ b_0 + \sum_{t=1}^{n}(x_t - \phi x_{t-1})^2 \Big] \right\}$,
$\sigma_v^2 \mid x_{0:n}, y_{1:n} \sim \mathrm{IG}\left\{ \tfrac{1}{2}(c_0 + n),\ \tfrac{1}{2}\Big[ d_0 + \sum_{t=1}^{n}(y_t - x_t)^2 \Big] \right\}$.

For Procedure 6.1-(ii), the goal is to sample the entire set of state vectors, $x_{0:n}$, from the posterior density $p_\Theta(x_{0:n} \mid y_{1:n})$, where $\Theta$ is a fixed set of parameters obtained from the previous step. We will write the posterior as $p_\Theta(x_{0:n} \mid y_{1:n})$ to save space. Because of the Markov structure, we can write,

$p_\Theta(x_{0:n} \mid y_{1:n}) = p_\Theta(x_n \mid y_{1:n})\, p_\Theta(x_{n-1} \mid x_n, y_{1:n-1}) \cdots p_\Theta(x_0 \mid x_1)$.   (6.217)

In view of (6.217), it is possible to sample the entire set of state vectors, $x_{0:n}$, by sequentially simulating the individual states backward. This process yields a simulation method that Frühwirth-Schnatter (1994) called the forward-filtering, backward-sampling (FFBS) algorithm. From (6.217), we see that we must obtain the densities

$p_\Theta(x_t \mid x_{t+1}, y_{1:t}) \propto p_\Theta(x_t \mid y_{1:t})\, p_\Theta(x_{t+1} \mid x_t)$.

In particular, we know that $x_t \mid y_{1:t} \sim \mathrm{N}(x_t^t, P_t^t)$ and $x_{t+1} \mid x_t \sim \mathrm{N}(\Phi x_t, Q)$. And because the processes are Gaussian, we need only obtain the conditional means and variances, say, $m_t = \mathrm{E}(x_t \mid y_{1:t}, x_{t+1})$ and $V_t = \mathrm{var}(x_t \mid y_{1:t}, x_{t+1})$. In particular,

$m_t = x_t^t + J_t\,(x_{t+1} - x_{t+1}^t)$, and $V_t = P_t^t - J_t P_{t+1}^t J_t'$,   (6.218)

for $t = n-1, n-2, \ldots, 0$, where $J_t$ is defined in (6.47). We note that $m_t$ has already been derived in (6.48). To derive $m_t$ and $V_t$ using standard normal theory, use a strategy similar to the derivation of the filter in Property 6.1. That is,

$\begin{pmatrix} x_t \\ x_{t+1} \end{pmatrix}\,\Big|\, y_{1:t} \sim \mathrm{N}\left( \begin{pmatrix} x_t^t \\ x_{t+1}^t \end{pmatrix}, \begin{pmatrix} P_t^t & P_t^t\Phi' \\ \Phi P_t^t & P_{t+1}^t \end{pmatrix} \right)$;

now use (B.9), (B.10), and the definition of $J_t$ in (6.47). Also, recall the proof of Property 6.3 wherein we noted the off-diagonal $P_{t+1,t}^t = \Phi P_t^t$.

Hence, given $\Theta$, the algorithm is to first sample $x_n$ from a $\mathrm{N}_p(x_n^n, P_n^n)$, where $x_n^n$ and $P_n^n$ are obtained from the Kalman filter, Property 6.1, and then sample $x_t$ from a $\mathrm{N}_p(m_t, V_t)$, for $t = n-1, n-2, \ldots, 0$, where the conditioning value of $x_{t+1}$ is the value previously sampled.
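Before turning to the state space examples, the two-step updating scheme of Example 6.25 is easy to check by direct simulation. The following R lines are only an illustrative sketch (the value of rho, the chain length, and the burn-in are arbitrary assumptions); after burn-in, the retained draws should have sample means near zero and sample correlation near rho.
set.seed(90210)
rho = .9; niter = 5000; burn = 500        # assumed illustrative values
x = y = numeric(niter)                    # x[1] = 0 plays the role of X^(0)
for (k in 2:niter){
  x[k] = rnorm(1, rho*y[k-1], sqrt(1-rho^2))   # draw X^(k) | Y^(k-1)
  y[k] = rnorm(1, rho*x[k],   sqrt(1-rho^2))   # draw Y^(k) | X^(k)
}
draws = cbind(x, y)[-(1:burn), ]
colMeans(draws); cor(draws)[1,2]
The same alternation between full conditionals is what Procedure 6.1 does for state space models, with FFBS supplying the draw of the entire state sequence.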

Fig. 6.20. Display for Example 6.26. Left: Generated states, $x_t$, and data $y_t$; contours of the likelihood (solid line) of the data and sampled posterior values as points. Right: Marginal sampled posteriors and posterior means (vertical lines) of each variance component. The true values are $\sigma_w^2 = .5$ and $\sigma_v^2 = 1$.

Example 6.26 Local Level Model
In this example, we consider the local level model previously discussed in Example 6.4. Here, we consider the model

$y_t = x_t + v_t$ and $x_t = x_{t-1} + w_t$,

where $v_t \sim \mathrm{iid\ N}(0, \sigma_v^2 = 1)$ independent of $w_t \sim \mathrm{iid\ N}(0, \sigma_w^2 = .5)$. This is the univariate model we just discussed, but where $\phi = 1$. In this case, we used IG priors for each of the variance components.

For the prior distributions, all parameters ($a_0$, $b_0$, $c_0$, $d_0$) were set to .02. We generated 1010 samples, using the first 10 as burn-in. Figure 6.20 displays the simulated data and states, the contours of the likelihood of the data, the sampled posterior values as points, and the marginal sampled posteriors of each variance component along with the posterior means. Figure 6.21 compares the actual smoother $x_t^n$ with the posterior mean of the sampled smoothed values. In addition, a pointwise 95% credible interval is displayed as a filled area. The following code was used in this example.
##-- Notation --##
#  y(t) = x(t) + v(t);    v(t) ~ iid N(0,V)
#  x(t) = x(t-1) + w(t);  w(t) ~ iid N(0,W)
#  priors:  x(0) ~ N(m0,C0);  V ~ IG(a,b);  W ~ IG(c,d)
#  FFBS:  x(t|t) ~ N(m,C);  x(t|n) ~ N(mm,CC);  x(t|t+1) ~ N(a,R)
##--

Fig. 6.21. Display for Example 6.26: True smoother, $x_t^n$, the data $y_t$, and the posterior mean of the sampled smoother values; the filled in area shows 2.5% to 97.5%-tiles of the draws.

ffbs = function(y,V,W,m0,C0){
  n = length(y); a = rep(0,n); R = rep(0,n)
  m = rep(0,n); C = rep(0,n); B = rep(0,n-1)
  H = rep(0,n-1); mm = rep(0,n); CC = rep(0,n)
  x = rep(0,n); llike = 0.0
  for (t in 1:n){
    if(t==1){ a[1] = m0; R[1] = C0 + W
    }else{ a[t] = m[t-1]; R[t] = C[t-1] + W }
    f = a[t]
    Q = R[t] + V
    A = R[t]/Q
    m[t] = a[t]+A*(y[t]-f)
    C[t] = R[t]-Q*A**2
    B[t-1] = C[t-1]/R[t]
    H[t-1] = C[t-1]-R[t]*B[t-1]**2
    llike = llike + dnorm(y[t],f,sqrt(Q),log=TRUE)
  }
  mm[n] = m[n]; CC[n] = C[n]
  x[n] = rnorm(1,m[n],sqrt(C[n]))
  for (t in (n-1):1){
    mm[t] = m[t] + C[t]/R[t+1]*(mm[t+1]-a[t+1])
    CC[t] = C[t] - (C[t]^2)/(R[t+1]^2)*(R[t+1]-CC[t+1])
    x[t] = rnorm(1,m[t]+B[t]*(x[t+1]-a[t+1]),sqrt(H[t]))
  }
  return(list(x=x,m=m,C=C,mm=mm,CC=CC,llike=llike))
}
# Simulate states and data
set.seed(1); W = 0.5; V = 1.0
n = 100; m0 = 0.0; C0 = 10.0; x0 = 0
w = rnorm(n,0,sqrt(W))
v = rnorm(n,0,sqrt(V))
x = y = rep(0,n)
x[1] = x0 + w[1]
y[1] = x[1] + v[1]
for (t in 2:n){
  x[t] = x[t-1] + w[t]

  y[t] = x[t] + v[t]
}
# actual smoother (for plotting)
ks = Ksmooth0(num=n, y, A=1, m0, C0, Phi=1, cQ=sqrt(W), cR=sqrt(V))
xsmooth = as.vector(ks$xs)
#
run = ffbs(y,V,W,m0,C0)
m = run$m; C = run$C; mm = run$mm
CC = run$CC; L1 = m-2*C; U1 = m+2*C
L2 = mm-2*CC; U2 = mm+2*CC
N = 50
Vs = seq(0.1,2,length=N)
Ws = seq(0.1,2,length=N)
likes = matrix(0,N,N)
for (i in 1:N){
  for (j in 1:N){
    V = Vs[i]
    W = Ws[j]
    run = ffbs(y,V,W,m0,C0)
    likes[i,j] = run$llike
  }
}
# Hyperparameters
a = 0.01; b = 0.01; c = 0.01; d = 0.01
# MCMC step
set.seed(90210)
burn = 10; M = 1000
niter = burn + M
V1 = V; W1 = W
draws = NULL
all_draws = NULL
for (iter in 1:niter){
  run = ffbs(y,V1,W1,m0,C0)
  x = run$x
  V1 = 1/rgamma(1, a+n/2, b+sum((y-x)^2)/2)
  W1 = 1/rgamma(1, c+(n-1)/2, d+sum(diff(x)^2)/2)
  draws = rbind(draws, c(V1,W1,x))
}
all_draws = draws[,1:2]
q025 = function(x){quantile(x,0.025)}
q975 = function(x){quantile(x,0.975)}
draws = draws[(burn+1):(niter),]
xs = draws[,3:(n+2)]
lx = apply(xs,2,q025)
mx = apply(xs,2,mean)
ux = apply(xs,2,q975)
## plot of the data
par(mfrow=c(2,2), mgp=c(1.6,.6,0), mar=c(3,3.2,1,1))
ts.plot(ts(x), ts(y), ylab='', col=c(1,8), lwd=2)
points(y)
legend(0, 11, legend=c("x(t)","y(t)"), lty=1, col=c(1,8), lwd=2, bty="n", pch=c(-1,1))
contour(Vs, Ws, exp(likes), xlab=expression(sigma[v]^2), ylab=expression(sigma[w]^2), drawlabels=FALSE, ylim=c(0,1.2))
points(draws[,1:2], pch=16, col=rgb(.9,0,0,0.3), cex=.7)
hist(draws[,1], ylab="Density", main="", xlab=expression(sigma[v]^2))
abline(v=mean(draws[,1]), col=3, lwd=3)
hist(draws[,2], main="", ylab="Density", xlab=expression(sigma[w]^2))
abline(v=mean(draws[,2]), col=3, lwd=3)
## plot states

par(mgp=c(1.6,.6,0), mar=c(2,1,.5,0)+.5)
plot(ts(mx), ylab='', type='n', ylim=c(min(y),max(y)))
grid(lty=2); points(y)
lines(xsmooth, lwd=4, col=rgb(1,0,1,alpha=.4))
lines(mx, col=4)
xx = c(1:100, 100:1)
yy = c(lx, rev(ux))
polygon(xx, yy, border=NA, col=gray(.6,alpha=.2))
lines(y, col=gray(.4))
legend('topleft', c('true smoother','data','posterior mean','95% of draws'), lty=1, lwd=c(3,1,1,10), pch=c(-1,1,-1,-1), col=c(6, gray(.4), 4, gray(.6, alpha=.5)), bg='white')

Next, we consider a more complicated model.

Example 6.27 Structural Model
Consider the Johnson & Johnson quarterly earnings per share series that was discussed in Example 6.10. Recall that the model is

$y_t = \begin{pmatrix} 1 & 1 & 0 & 0 \end{pmatrix} x_t + v_t$,

$x_t = \begin{pmatrix} T_t \\ S_t \\ S_{t-1} \\ S_{t-2} \end{pmatrix} = \begin{pmatrix} \phi & 0 & 0 & 0 \\ 0 & -1 & -1 & -1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} T_{t-1} \\ S_{t-1} \\ S_{t-2} \\ S_{t-3} \end{pmatrix} + \begin{pmatrix} w_{t1} \\ w_{t2} \\ 0 \\ 0 \end{pmatrix}$,

where $R = \sigma_v^2$ and

$Q = \begin{pmatrix} \sigma_{w,11}^2 & 0 & 0 & 0 \\ 0 & \sigma_{w,22}^2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$.

The parameters to be estimated are the transition parameter associated with the growth rate, $\phi > 1$, the observation noise variance, $\sigma_v^2$, and the state noise variances associated with the trend and the seasonal components, $\sigma_{w,11}^2$ and $\sigma_{w,22}^2$, respectively.

In this case, sampling from $p(x_{0:n} \mid \Theta, y_{1:n})$ follows directly from (6.217)–(6.218). Next, we discuss how to sample from $p(\Theta \mid x_{0:n}, y_{1:n})$. For the transition parameter, write $\phi = 1 + \beta$, where $0 < \beta < 1$; recall that in Example 6.10, $\phi$ was estimated to be 1.035, which indicated a growth rate, $\beta$, of 3.5%. Note that the trend component may be rewritten as

$\nabla T_t = T_t - T_{t-1} = \beta T_{t-1} + w_{t1}$.

Consequently, conditional on the states, the parameter $\beta$ is the slope in the linear regression (through the origin) of $\nabla T_t$ on $T_{t-1}$, for $t = 1, \ldots, n$, and $w_{t1}$ is the error. As is typical, we put a Normal–Inverse Gamma (IG) prior on $(\beta, \sigma_{w,11}^2)$, i.e., $\beta \mid \sigma_{w,11}^2 \sim \mathrm{N}(b_0, \sigma_{w,11}^2 B_0)$ and $\sigma_{w,11}^2 \sim \mathrm{IG}(n_0/2, n_0 s_0^2/2)$, with known hyperparameters $n_0$, $s_0^2$, $b_0$, $B_0$.
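Because the example code given below actually samples β through an OLS/t-distribution device, it may help to see what a draw from the conjugate Normal–IG posterior just described looks like. The following R function is a hedged sketch only: its name, its default hyperparameter values, and the assumption that a sampled trend vector Tt is available are all illustrative, not part of the text's code.
# Sketch: one draw of (sigma^2_{w,11}, beta) from the Normal-IG posterior for the
# regression through the origin  dT_t = beta * T_{t-1} + w_{t1}, given states Tt.
draw_beta_sig11 = function(Tt, b0 = 0, B0 = 10, n0 = 10, s20 = .05){
  dT   = diff(Tt)                                  # response: del T_t
  Tlag = Tt[-length(Tt)]                           # regressor: T_{t-1}
  n    = length(dT)
  Bn   = 1/(sum(Tlag^2) + 1/B0)                    # posterior scale for beta
  bn   = Bn*(sum(Tlag*dT) + b0/B0)                 # posterior mean for beta
  rate = (n0*s20 + sum(dT^2) + b0^2/B0 - bn^2/Bn)/2
  sig2 = 1/rgamma(1, (n0 + n)/2, rate)             # sigma^2_{w,11} | states
  beta = rnorm(1, bn, sqrt(sig2*Bn))               # beta | sigma^2, states
  c(phi = 1 + beta, sig2w11 = sig2)
}
Within a sweep of Procedure 6.1 one would, hypothetically, call out = draw_beta_sig11(xb[,1]) after the FFBS step, set the (1,1) element of the transition matrix to out['phi'], and then draw the remaining variance components.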

Fig. 6.22. Parameter estimation results for Example 6.27. The top row displays the traces of 1000 draws after burn-in. The middle row displays the ACF of the traces. The sampled posteriors are displayed in the last row (the mean is marked by a solid vertical line).

We also used IG priors for the other two variance components, $\sigma_v^2$ and $\sigma_{w,22}^2$. In this case, if the prior $\sigma_v^2 \sim \mathrm{IG}(n_0/2, n_0 s_0^2/2)$, then the posterior is

$\sigma_v^2 \mid x_{0:n}, y_{1:n} \sim \mathrm{IG}(n_v/2,\ n_v s_v^2/2)$,

where $n_v = n_0 + n$, and $n_v s_v^2 = n_0 s_0^2 + \sum_{t=1}^{n}(y_t - T_t - S_t)^2$. Similarly, if the prior $\sigma_{w,22}^2 \sim \mathrm{IG}(n_0/2, n_0 s_0^2/2)$, then the posterior is

$\sigma_{w,22}^2 \mid x_{0:n}, y_{1:n} \sim \mathrm{IG}(n_w/2,\ n_w s_w^2/2)$,

where $n_w = n_0 + (n - 3)$, and $n_w s_w^2 = n_0 s_0^2 + \sum_{t=1}^{n-3}(S_{t+3} + S_{t+2} + S_{t+1} + S_t)^2$.

Figure 6.22 displays the results of the posterior estimates of the parameters. The top row of the figure displays the traces of 1000 draws, after a burn-in of 100, with a step size of 10 (i.e., every 10th sampled value is retained). The middle row of the figure displays the ACF of the traces, and the sampled posteriors are displayed in the last row of the figure. The results of this analysis are comparable to the results obtained in Example 6.10; the posterior mean and median for $\phi$ indicate a 3.7% growth rate in the Johnson & Johnson quarterly earnings over this time period. Figure 6.23 displays the smoothers of trend ($T_t$) and trend plus season ($T_t + S_t$) along with 99% credible intervals. Again, these results are comparable to the results obtained in Example 6.10. The R code for this example is as follows:

Fig. 6.23. Example 6.27 smoother estimates of trend ($T_t$) and trend plus season ($T_t + S_t$) along with corresponding 99% credible intervals.

# used to view progress (install it if you don't have it)
library(plyr)
y = jj
### setup - model and initial parameters
set.seed(90210)
n = length(y)
F = c(1,1,0,0)         # this is A
G = diag(0,4)          # G is Phi
G[1,1] = 1.03
G[2,] = c(0,-1,-1,-1); G[3,] = c(0,1,0,0); G[4,] = c(0,0,1,0)
a1 = rbind(.7,0,0,0)   # this is mu0
R1 = diag(.04,4)       # this is Sigma0
V = .1
W11 = .1
W22 = .1
##-- FFBS --##
ffbs = function(y,F,G,V,W11,W22,a1,R1){
  n = length(y)
  Ws = diag(c(W11,W22,1,1))   # this is Q with 1s as a device only
  iW = diag(1/diag(Ws),4)
  a = matrix(0,n,4)           # this is m_t
  R = array(0,c(n,4,4))       # this is V_t
  m = matrix(0,n,4)
  C = array(0,c(n,4,4))
  a[1,] = a1[,1]
  R[1,,] = R1
  f = t(F)%*%a[1,]
  Q = t(F)%*%R[1,,]%*%F + V
  A = R[1,,]%*%F/Q[1,1]
  m[1,] = a[1,]+A%*%(y[1]-f)
  C[1,,] = R[1,,]-A%*%t(A)*Q[1,1]

  for (t in 2:n){
    a[t,] = G%*%m[t-1,]
    R[t,,] = G%*%C[t-1,,]%*%t(G) + Ws
    f = t(F)%*%a[t,]
    Q = t(F)%*%R[t,,]%*%F + V
    A = R[t,,]%*%F/Q[1,1]
    m[t,] = a[t,] + A%*%(y[t]-f)
    C[t,,] = R[t,,] - A%*%t(A)*Q[1,1]
  }
  xb = matrix(0,n,4)
  xb[n,] = m[n,] + t(chol(C[n,,]))%*%rnorm(4)
  for (t in (n-1):1){
    iC = solve(C[t,,])
    CCC = solve(t(G)%*%iW%*%G + iC)
    mmm = CCC%*%(t(G)%*%iW%*%xb[t+1,] + iC%*%m[t,])
    xb[t,] = mmm + t(chol(CCC))%*%rnorm(4)
  }
  return(xb)
}
##-- Prior hyperparameters --##
# b0 = 0     # mean for beta = phi - 1
# B0 = Inf   # var for beta (non-informative => use OLS for sampling beta)
n0 = 10      # use same for all -- the prior is 1/Gamma(n0/2, n0*s20/2)
s20v = .001  # for V
s20w = .05   # for Ws
##-- MCMC scheme --##
set.seed(90210)
burnin = 100
step = 10
M = 1000
niter = burnin + step*M
pars = matrix(0,niter,4)
xbs = array(0,c(niter,n,4))
pr <- progress_text()   # displays progress
pr$init(niter)
for (iter in 1:niter){
  xb = ffbs(y,F,G,V,W11,W22,a1,R1)
  u = xb[,1]
  yu = diff(u); xu = u[-n]   # for phihat and se(phihat)
  regu = lm(yu~0+xu)         # est of beta = phi-1
  phies = as.vector(coef(summary(regu)))[1:2] + c(1,0)   # phi estimate and SE
  dft = df.residual(regu)
  G[1,1] = phies[1] + rt(1,dft)*phies[2]   # use a t
  V = 1/rgamma(1, (n0+n)/2, (n0*s20v/2) + sum((y-xb[,1]-xb[,2])^2)/2)
  W11 = 1/rgamma(1, (n0+n-1)/2, (n0*s20w/2) + sum((xb[-1,1]-phies[1]*xb[-n,1])^2)/2)
  W22 = 1/rgamma(1, (n0+n-3)/2, (n0*s20w/2) + sum((xb[4:n,2] + xb[3:(n-1),2] + xb[2:(n-2),2] + xb[1:(n-3),2])^2)/2)
  xbs[iter,,] = xb
  pars[iter,] = c(G[1,1], sqrt(V), sqrt(W11), sqrt(W22))
  pr$step()
}
# Plot results
ind = seq(burnin+1, niter, by=step)
names = c(expression(phi), expression(sigma[v]), expression(sigma[w~11]), expression(sigma[w~22]))
dev.new(height=5)
par(mfcol=c(3,4), mar=c(2,2,.25,0)+.75, mgp=c(1.6,.6,0), oma=c(0,0,1,0))
for (i in 1:4){
  plot.ts(pars[ind,i], xlab="iterations", ylab="trace", main="")

  mtext(names[i], side=3, line=.5, cex=1)
  acf(pars[ind,i], main="", lag.max=25, xlim=c(1,25), ylim=c(-.4,.4))
  hist(pars[ind,i], main="", xlab="")
  abline(v=mean(pars[ind,i]), lwd=2, col=3)
}
par(mfrow=c(2,1), mar=c(2,2,0,0)+.7, mgp=c(1.6,.6,0))
mxb = cbind(apply(xbs[ind,,1],2,mean), apply(xbs[,,2],2,mean))
lxb = cbind(apply(xbs[ind,,1],2,quantile,0.005), apply(xbs[ind,,2],2,quantile,0.005))
uxb = cbind(apply(xbs[ind,,1],2,quantile,0.995), apply(xbs[ind,,2],2,quantile,0.995))
mxb = ts(cbind(mxb,rowSums(mxb)), start = tsp(jj)[1], freq=4)
lxb = ts(cbind(lxb,rowSums(lxb)), start = tsp(jj)[1], freq=4)
uxb = ts(cbind(uxb,rowSums(uxb)), start = tsp(jj)[1], freq=4)
names = c('Trend','Season','Trend + Season')
L = min(lxb[,1])-.01; U = max(uxb[,1])+.01
plot(mxb[,1], ylab=names[1], ylim=c(L,U), type='n')
grid(lty=2); lines(mxb[,1])
xx = c(time(jj), rev(time(jj)))
yy = c(lxb[,1], rev(uxb[,1]))
polygon(xx, yy, border=NA, col=gray(.4, alpha=.2))
L = min(lxb[,3])-.01; U = max(uxb[,3])+.01
plot(mxb[,3], ylab=names[3], ylim=c(L,U), type='n')
grid(lty=2); lines(mxb[,3])
xx = c(time(jj), rev(time(jj)))
yy = c(lxb[,3], rev(uxb[,3]))
polygon(xx, yy, border=NA, col=gray(.4, alpha=.2))

Problems

Section 6.1

6.1 Consider a system process given by

$x_t = -.9\, x_{t-2} + w_t, \quad t = 1, \ldots, n$,

where $x_0 \sim \mathrm{N}(0, \sigma_0^2)$, $x_{-1} \sim \mathrm{N}(0, \sigma_1^2)$, and $w_t$ is Gaussian white noise with variance $\sigma_w^2$. The system process is observed with noise, say,

$y_t = x_t + v_t$,

where $v_t$ is Gaussian white noise with variance $\sigma_v^2$. Further, suppose $x_0$, $x_{-1}$, $\{w_t\}$ and $\{v_t\}$ are independent.
(a) Write the system and observation equations in the form of a state space model.
(b) Find the values of $\sigma_0^2$ and $\sigma_1^2$ that make the observations, $y_t$, stationary.
(c) Generate $n = 100$ observations with $\sigma_w = 1$, $\sigma_v = 1$ and using the values of $\sigma_0^2$ and $\sigma_1^2$ found in (b). Do a time plot of $x_t$ and of $y_t$ and compare the two processes. Also, compare the sample ACF and PACF of $x_t$ and of $y_t$.
(d) Repeat (c), but with $\sigma_v = 10$.

6.2 Consider the state-space model presented in Example 6.3. Let $x_t^{t-1} = \mathrm{E}(x_t \mid y_{t-1}, \ldots, y_1)$ and let $P_t^{t-1} = \mathrm{E}(x_t - x_t^{t-1})^2$. The innovation sequence or residuals are $\epsilon_t = y_t - y_t^{t-1}$, where $y_t^{t-1} = \mathrm{E}(y_t \mid y_{t-1}, \ldots, y_1)$. Find $\mathrm{cov}(\epsilon_s, \epsilon_t)$ in terms of $x_t^{t-1}$ and $P_t^{t-1}$ for (i) $s \ne t$ and (ii) $s = t$.

Section 6.2

6.3 Simulate $n = 100$ observations from the following state-space model:

$x_t = .8\, x_{t-1} + w_t$ and $y_t = x_t + v_t$,

where $x_0 \sim \mathrm{N}(0, 2.78)$, $w_t \sim \mathrm{iid\ N}(0, 1)$, and $v_t \sim \mathrm{iid\ N}(0, 1)$ are all mutually independent. Compute and plot the data, $y_t$, the one-step-ahead predictors, $y_t^{t-1}$, along with the root mean square prediction errors, $\mathrm{E}^{1/2}(y_t - y_t^{t-1})^2$, using Example 6.5 as a guide.

6.4 Suppose the vector $z = (x', y')'$, where $x$ ($p \times 1$) and $y$ ($q \times 1$) are jointly distributed with mean vectors $\mu_x$ and $\mu_y$ and with covariance matrix

$\mathrm{cov}(z) = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$.

Consider projecting $x$ on $M = \overline{\mathrm{sp}}\{1, y\}$, say, $\hat x = b + By$.
(a) Show the orthogonality conditions can be written as

$\mathrm{E}(x - b - By) = 0$,
$\mathrm{E}[(x - b - By)\, y'] = 0$,

leading to the solutions

$b = \mu_x - B\mu_y$ and $B = \Sigma_{xy}\Sigma_{yy}^{-1}$.

(b) Prove the mean square error matrix is

$MSE = \mathrm{E}[(x - b - By)\, x'] = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$.

(c) How can these results be used to justify the claim that, in the absence of normality, Property 6.1 yields the best linear estimate of the state $x_t$ given the data $Y_t$, namely, $x_t^t$, and its corresponding MSE, namely, $P_t^t$?

6.5 Projection Theorem Derivation of Property 6.2. Throughout this problem, we use the notation of Property 6.2 and of the Projection Theorem given in Appendix B, where $H$ is $L^2$. If $L_{k+1} = \overline{\mathrm{sp}}\{y_1, \ldots, y_{k+1}\}$, and $V_{k+1} = \overline{\mathrm{sp}}\{y_{k+1} - y_{k+1}^k\}$, for $k = 0, 1, \ldots, n-1$, where $y_{k+1}^k$ is the projection of $y_{k+1}$ on $L_k$, then, $L_{k+1} = L_k \oplus V_{k+1}$. We assume $P_0^0 > 0$ and $R > 0$.

(a) Show the projection of $x_k$ on $L_{k+1}$, that is, $x_k^{k+1}$, is given by

$x_k^{k+1} = x_k^k + H_{k+1}(y_{k+1} - y_{k+1}^k)$,

where $H_{k+1}$ can be determined by the orthogonality property

$\mathrm{E}\left\{ \left(x_k - H_{k+1}(y_{k+1} - y_{k+1}^k)\right)\left(y_{k+1} - y_{k+1}^k\right)' \right\} = 0$.

Show

$H_{k+1} = P_k^k \Phi' A_{k+1}' \left[ A_{k+1} P_{k+1}^k A_{k+1}' + R \right]^{-1}$.

(b) Define $J_k = P_k^k \Phi' \left[P_{k+1}^k\right]^{-1}$, and show

$x_k^{k+1} = x_k^k + J_k\left(x_{k+1}^{k+1} - x_{k+1}^k\right)$.

(c) Repeating the process, show

$x_k^{k+2} = x_k^k + J_k\left(x_{k+1}^{k+1} - x_{k+1}^k\right) + H_{k+2}\left(y_{k+2} - y_{k+2}^{k+1}\right)$,

solving for $H_{k+2}$. Simplify and show

$x_k^{k+2} = x_k^k + J_k\left(x_{k+1}^{k+2} - x_{k+1}^k\right)$.

(d) Using induction, conclude

$x_k^n = x_k^k + J_k\left(x_{k+1}^n - x_{k+1}^k\right)$,

which yields the smoother with $k = t - 1$.

Section 6.3

6.6 Consider the univariate state-space model given by state conditions $x_0 = w_0$, $x_t = x_{t-1} + w_t$ and observations $y_t = x_t + v_t$, $t = 1, 2, \ldots$, where $w_t$ and $v_t$ are independent, Gaussian, white noise processes with $\mathrm{var}(w_t) = \sigma_w^2$ and $\mathrm{var}(v_t) = \sigma_v^2$.
(a) Show that $y_t$ follows an IMA(1,1) model, that is, $\nabla y_t$ follows an MA(1) model.
(b) Fit the model specified in part (a) to the logarithm of the glacial varve series and compare the results to those presented in Example 3.33.

6.7 Consider the model

$y_t = x_t + v_t$,

where $v_t$ is Gaussian white noise with variance $\sigma_v^2$, $x_t$ are independent Gaussian random variables with mean zero and $\mathrm{var}(x_t) = r_t\sigma_x^2$, with $x_t$ independent of $v_t$, and $r_1, \ldots, r_n$ are known constants. Show that applying the EM algorithm to the problem of estimating $\sigma_x^2$ and $\sigma_v^2$ leads to updates (represented by hats)

$\hat\sigma_x^2 = \frac{1}{n}\sum_{t=1}^{n}\frac{\sigma_t^2 + \mu_t^2}{r_t} \quad \text{and} \quad \hat\sigma_v^2 = \frac{1}{n}\sum_{t=1}^{n}\left[(y_t - \mu_t)^2 + \sigma_t^2\right]$,

where, based on the current estimates (represented by tildes),

$\mu_t = \frac{r_t\tilde\sigma_x^2}{r_t\tilde\sigma_x^2 + \tilde\sigma_v^2}\, y_t \quad \text{and} \quad \sigma_t^2 = \frac{r_t\tilde\sigma_x^2\tilde\sigma_v^2}{r_t\tilde\sigma_x^2 + \tilde\sigma_v^2}$.

6.8 To explore the stability of the filter, consider a univariate state-space model. That is, for $t = 1, 2, \ldots$, the observations are $y_t = x_t + v_t$ and the state equation is $x_t = \phi x_{t-1} + w_t$, where $\sigma_w = \sigma_v = 1$ and $|\phi| < 1$. The initial state, $x_0$, has zero mean and variance one.
(a) Exhibit the recursion for $P_t^{t-1}$ in Property 6.1 in terms of $P_{t-1}^{t-2}$.
(b) Use the result of (a) to verify $P_t^{t-1}$ approaches a limit $P$ ($t \to \infty$) that is the positive solution of $P^2 - \phi^2 P - 1 = 0$.
(c) With $K = \lim_{t\to\infty} K_t$ as given in Property 6.1, show $|1 - K| < 1$.
(d) Show, in steady-state, the one-step-ahead predictor, $y_{n+1}^n = \mathrm{E}(y_{n+1} \mid y_n, y_{n-1}, \ldots)$, of a future observation satisfies

$y_{n+1}^n = \sum_{j=1}^{\infty} \phi^j K (1 - K)^{j-1} y_{n+1-j}$.

6.9 In Section 6.3, we discussed that it is possible to obtain a recursion for the gradient vector, $-\partial \ln L_Y(\Theta)/\partial\Theta$. Assume the model is given by (6.1) and (6.2) and $A_t$ is a known design matrix that does not depend on $\Theta$, in which case Property 6.1 applies. For the gradient vector, show

$-\partial \ln L_Y(\Theta)/\partial\Theta_i = \sum_{t=1}^{n}\left\{ \epsilon_t'\Sigma_t^{-1}\frac{\partial\epsilon_t}{\partial\Theta_i} - \frac{1}{2}\epsilon_t'\Sigma_t^{-1}\frac{\partial\Sigma_t}{\partial\Theta_i}\Sigma_t^{-1}\epsilon_t + \frac{1}{2}\mathrm{tr}\left(\Sigma_t^{-1}\frac{\partial\Sigma_t}{\partial\Theta_i}\right)\right\}$,

where the dependence of the innovation values on $\Theta$ is understood. In addition, with the general definition $\partial_i g = \partial g(\Theta)/\partial\Theta_i$, show the following recursions, for $t = 2, \ldots, n$, apply:

(i) $\partial_i\epsilon_t = -A_t\,\partial_i x_t^{t-1}$,
(ii) $\partial_i x_t^{t-1} = \partial_i\Phi\, x_{t-1}^{t-2} + \Phi\,\partial_i x_{t-1}^{t-2} + \partial_i K_{t-1}\,\epsilon_{t-1} + K_{t-1}\,\partial_i\epsilon_{t-1}$,
(iii) $\partial_i\Sigma_t = A_t\,\partial_i P_t^{t-1}\, A_t' + \partial_i R$,
(iv) $\partial_i K_t = \left[\partial_i\Phi\, P_t^{t-1} A_t' + \Phi\,\partial_i P_t^{t-1}\, A_t' - K_t\,\partial_i\Sigma_t\right]\Sigma_t^{-1}$,
(v) $\partial_i P_t^{t-1} = \partial_i\Phi\, P_{t-1}^{t-2}\Phi' + \Phi\,\partial_i P_{t-1}^{t-2}\,\Phi' + \Phi P_{t-1}^{t-2}\,\partial_i\Phi' + \partial_i Q - \partial_i K_{t-1}\,\Sigma_{t-1}K_{t-1}' - K_{t-1}\,\partial_i\Sigma_{t-1}\,K_{t-1}' - K_{t-1}\Sigma_{t-1}\,\partial_i K_{t-1}'$,

using the fact that $P_t^{t-1} = \Phi P_{t-1}^{t-2}\Phi' + Q - K_{t-1}\Sigma_{t-1}K_{t-1}'$.

6.10 Continuing with the previous problem, consider the evaluation of the Hessian matrix and the numerical evaluation of the asymptotic variance–covariance matrix of the parameter estimates. The information matrix satisfies

$\mathrm{E}\left\{ -\frac{\partial^2\ln L_Y(\Theta)}{\partial\Theta\,\partial\Theta'}\right\} = \mathrm{E}\left\{ \left(\frac{\partial\ln L_Y(\Theta)}{\partial\Theta}\right)\left(\frac{\partial\ln L_Y(\Theta)}{\partial\Theta}\right)'\right\}$;

see Anderson (1984, Section 4.4), for example. Show the $(i, j)$-th element of the information matrix, say, $I_{ij}(\Theta) = \mathrm{E}\{-\partial^2\ln L_Y(\Theta)/\partial\Theta_i\partial\Theta_j\}$, is

$I_{ij}(\Theta) = \sum_{t=1}^{n}\mathrm{E}\left\{ \partial_i\epsilon_t'\,\Sigma_t^{-1}\,\partial_j\epsilon_t + \frac{1}{2}\mathrm{tr}\left(\Sigma_t^{-1}\,\partial_i\Sigma_t\,\Sigma_t^{-1}\,\partial_j\Sigma_t\right) + \frac{1}{4}\mathrm{tr}\left(\Sigma_t^{-1}\,\partial_i\Sigma_t\right)\mathrm{tr}\left(\Sigma_t^{-1}\,\partial_j\Sigma_t\right)\right\}$.

Consequently, an approximate Hessian matrix can be obtained from the sample by dropping the expectation, E, in the above result and using only the recursions needed to calculate the gradient vector.

Section 6.4

6.11 As an example of the way the state-space model handles the missing data problem, suppose the first-order autoregressive process

$x_t = \phi x_{t-1} + w_t$

has an observation missing at $t = m$, leading to the observations $y_t = A_t x_t$, where $A_t = 1$ for all $t$, except $t = m$ wherein $A_t = 0$. Assume $x_0 = 0$ with variance $\sigma_w^2/(1-\phi^2)$, where the variance of $w_t$ is $\sigma_w^2$. Show the Kalman smoother estimators in this case are

$x_t^n = \begin{cases} \phi y_1, & t = 0, \\ \dfrac{\phi}{1+\phi^2}\,(y_{m-1} + y_{m+1}), & t = m, \\ y_t, & t \ne 0, m, \end{cases}$

with mean square covariances determined by

$P_t^n = \begin{cases} \sigma_w^2, & t = 0, \\ \sigma_w^2/(1+\phi^2), & t = m, \\ 0, & t \ne 0, m. \end{cases}$

6.12 The data set ar1miss is $n = 100$ observations generated from an AR(1) process, $x_t = \phi x_{t-1} + w_t$, with $\phi = .9$ and $\sigma_w = 1$, where 10% of the data have been deleted at random (replaced with NA). Use the results of Problem 6.11 to estimate the parameters of the model, $\phi$ and $\sigma_w$, using the EM algorithm, and then estimate the missing values.

Section 6.5

6.13 Redo Example 6.10 on the logged Johnson & Johnson quarterly earnings per share.

6.14 Fit a structural model to quarterly unemployment as follows. Use the data in unemp, which are monthly. The series can be made quarterly by aggregating and averaging: y = aggregate(unemp, nfrequency=4, FUN=mean), so that y is the quarterly average unemployment. Use Example 6.10 as a guide.

Section 6.6

6.15 (a) Fit an AR(2) to the recruitment series, $R_t$ in rec, and consider a lag-plot of the residuals from the fit versus the SOI series, $S_t$ in soi, at various lags, $S_{t-h}$, for $h = 0, 1, \ldots, 5$. Use the lag-plot to argue that $S_{t-5}$ is reasonable to include as an exogenous variable.
(b) Fit an ARX(2) to $R_t$ using $S_{t-5}$ as an exogenous variable and comment on the results; include an examination of the innovations.

6.16 Use Property 6.6 to complete the following exercises.
(a) Write a univariate AR(1) model, $y_t = \phi y_{t-1} + v_t$, in state-space form. Verify your answer is indeed an AR(1).
(b) Repeat (a) for an MA(1) model, $y_t = v_t + \theta v_{t-1}$.
(c) Write an IMA(1,1) model, $y_t = y_{t-1} + v_t + \theta v_{t-1}$, in state-space form.

6.17 Verify Property 6.5.

6.18 Verify Property 6.6.

Section 6.7

6.19 Repeat the bootstrap analysis of Example 6.13 on the entire three-month Treasury bills and rate of inflation data set of 110 observations. Do the conclusions of Example 6.13—that the dynamics of the data are best described in terms of a fixed, rather than stochastic, regression—still hold?

Section 6.8

6.20 Let $y_t$ represent the global temperature series (globtemp) shown in Figure 1.2.
(a) Fit a smoothing spline using gcv (the default) to $y_t$ and plot the result superimposed on the data. Repeat the fit using spar=.7; the gcv method yields spar=.5 approximately. (Example 2.14 on page 70 may help. Also in R, see the help file ?smooth.spline.)
(b) Write the model $y_t = x_t + v_t$ with $\nabla^2 x_t = w_t$, in state-space form. Fit this state-space model to $y_t$, and exhibit a time plot of the estimated smoother, $\hat x_t^n$, and the corresponding error limits, $\hat x_t^n \pm 2\sqrt{\hat P_t^n}$, superimposed on the data.
(c) Superimpose all the fits from parts (a) and (b) [include the error bounds] on the data and briefly compare and contrast the results.

Section 6.9

6.21 Verify (6.132), (6.133), and (6.134).

6.22 Prove Property 6.7 and verify (6.143).

6.23 Fit a Poisson-HMM to the dataset polio from the gamlss.data package. The data are reported polio cases in the U.S. for the years 1970 to 1983. To get started, install the package and then type
library(gamlss.data)    # load package
plot(polio, type='s')   # view the data

6.24 Fit a two-state HMM model to the weekly S&P 500 returns that were analyzed in Example 6.17 and compare the results.

Section 6.10

6.25 Fit the switching model described in Example 6.20 to the growth rate of GNP. The data are in gnp and, in the notation of the example, y_t is log-GNP and \nabla y_t is the growth rate. Use the code in Example 6.22 as a guide.

6.26 Argue that a switching model is reasonable in explaining the behavior of the number of sunspots (see Figure 4.22) and then fit a switching model to the sunspot data.

Section 6.11

6.27 Fit a stochastic volatility model to the returns of one (or more) of the four financial time series available in the R datasets package as EuStockMarkets.

6.28 Fit a stochastic volatility model to the residuals of the GNP (gnp) returns analyzed in Example 3.39.

6.29 We consider the stochastic volatility model (6.197).
(a) Show that for any integer m,

E[r_t^{2m}] = \beta^{2m} E[\epsilon_t^{2m}] \exp(m^2 \sigma_x^2 / 2),

where \sigma_x^2 = \sigma^2/(1-\phi^2).
(b) Show (6.198).
(c) Show that for any positive integer h, var(x_{t+h} + x_t) = 2\sigma_x^2 (1 + \phi^h).
(d) Show that

cov(r_{t+h}^{2m}, r_t^{2m}) = \beta^{4m} E[\epsilon_t^{2m}]^2 \big[ \exp\big(m^2\sigma_x^2(1+\phi^h)\big) - \exp\big(m^2\sigma_x^2\big) \big].

(e) Establish (6.199).

Section 6.12

6.30 Verify the distributional statements made in Example 6.25.

6.31 Repeat Example 6.27 on the log of the Johnson & Johnson data.

6.32 Fit an AR(1) to the returns of the US GNP (gnp) using a Bayesian approach via MCMC.
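For Problem 6.27 above, the returns can be formed directly from the base R datasets object; a minimal sketch for one of the four series (the DAX, chosen only for illustration) is

r.dax = diff(log(EuStockMarkets[, "DAX"]))   # log returns of the DAX index
plot(r.dax)
acf(r.dax^2)    # volatility clustering shows up in the squared returns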


Chapter 7
Statistical Methods in the Frequency Domain

In previous chapters, we saw many applied time series problems that involved relating series to each other or evaluating the effects of treatments or design parameters that arise when time-varying phenomena are subjected to periodic stimuli. In many cases, the nature of the physical or biological phenomena under study is best described by their Fourier components rather than by the difference equations involved in ARIMA or state-space models. The fundamental tools we use in studying periodic phenomena are the discrete Fourier transforms (DFTs) of the processes and their statistical properties. Hence, in Section 7.2, we review the properties of the DFT of a multivariate time series and discuss various approximations to the likelihood function based on the large-sample properties and the properties of the complex multivariate normal distribution. This enables extension of the classical techniques, such as ANOVA and principal component analysis, to the multivariate time series case, which is the focus of this chapter.

7.1 Introduction

An extremely important class of problems in classical statistics develops when we are interested in relating a collection of input series to some output series. For example, in Chapter 2, we have previously considered relating temperature and various pollutant levels to daily mortality, but we have not investigated the frequencies that appear to be driving the relation and have not looked at the possibility of leading or lagging effects. In Chapter 4, we isolated a definite lag structure that could be used to relate sea surface temperature to the number of new recruits. In Problem 5.10, the possible driving processes that could be used to explain inflow to Lake Shasta were hypothesized in terms of the possible inputs precipitation, cloud cover, temperature, and other variables. Identifying the combination of input factors that produce the best prediction for inflow is an example of multiple regression in the frequency domain, with the models treated theoretically by considering the regression, conditional on the random input processes.

Fig. 7.1. Mean response of subjects to various combinations of periodic stimuli measured at the cortex (primary somatosensory, contralateral). In the first column, the subjects are awake, in the second column the subjects are under mild anesthesia. In the first row, the stimulus is a brush on the hand, the second row involves the application of heat, and the third row involves a low level shock.

A situation somewhat different from that above would be one in which the input series are regarded as fixed and known. In this case, we have a model analogous to that occurring in analysis of variance, in which the analysis now can be performed on a frequency by frequency basis. This analysis works especially well when the inputs are dummy variables, depending on some configuration of treatment and other design effects, and when effects are largely dependent on periodic stimuli. As an example, we will look at a designed experiment measuring the fMRI brain responses of a number of awake and mildly anesthetized subjects to several levels of periodic brushing, heat, and shock effects. Some limited data from this experiment have been discussed previously in Example 1.6. Figure 7.1 shows mean responses to various levels of periodic heat, brushing, and shock stimuli for subjects awake and subjects under mild anesthesia. The stimuli were periodic in nature, applied alternately for 32 seconds (16 points) and then stopped for 32 seconds. The periodic input signal comes through under all three design conditions when the subjects are awake, but is somewhat attenuated under anesthesia. The mean shock level response hardly shows the input signal; shock levels were designed to simulate surgical incision without inflicting tissue damage. The means in Figure 7.1 are from a single location. Actually, for each individual, some nine series were recorded at various locations in the brain. It is natural to consider testing the effects of brushing, heat, and shock under the two

levels of consciousness, using a time series generalization of analysis of variance. The R code used to generate Figure 7.1 is:
x = matrix(0, 128, 6)
for (i in 1:6) { x[,i] = rowMeans(fmri[[i]]) }
colnames(x) = c("Brush", "Heat", "Shock", "Brush", "Heat", "Shock")
plot.ts(x, main="")
mtext("Awake", side=3, line=1.2, adj=.05, cex=1.2)
mtext("Sedated", side=3, line=1.2, adj=.85, cex=1.2)

A generalization to random coefficient regression is also considered, paralleling the univariate approach to signal extraction and detection presented in Section 4.9. This method enables a treatment of multivariate ridge-type regressions and inversion problems. Also, the usual random effects analysis of variance in the frequency domain becomes a special case of the random coefficient model.

The extension of frequency domain methodology to more classical approaches to multivariate discrimination and clustering is of interest in the frequency dependent case. Many time series differ in their means and in their autocovariance functions, making the use of both the mean function and the spectral density matrices relevant. As an example of such data, consider the bivariate series consisting of the P and S components derived from several earthquakes and explosions, such as those shown in Figure 7.2, where the P and S components, representing different arrivals, have been separated from the first and second halves, respectively, of waveforms like those shown originally in Figure 1.7.

Two earthquakes and two explosions from a set of eight earthquakes and explosions are shown in Figure 7.2, and some essential differences exist that might be used to characterize the two classes of events. Also, the frequency content of the two components of the earthquakes appears to be lower than that of the explosions, and the relative amplitudes of the two classes appear to differ. For example, the ratio of the S to P amplitudes in the earthquake group is much higher for this restricted subset. Spectral differences were also noticed in Chapter 4, where the explosion processes had a stronger high-frequency component relative to the low-frequency contributions. Examples like these are typical of applications in which the essential differences between multivariate time series can be expressed by the behavior of either the frequency-dependent mean value functions or the spectral matrix. In discriminant analysis, these types of differences are exploited to develop combinations of linear and quadratic classification criteria. Such functions can then be used to classify events of unknown origin, such as the Novaya Zemlya event shown in Figure 7.2, which tends to bear a visual resemblance to the explosion group. The R code used to produce Figure 7.2 is:
attach(eqexp)   # so you can use the names of the series
P = 1:1024; S = P+1024
x = cbind(EQ5[P], EQ6[P], EX5[P], EX6[P], NZ[P], EQ5[S], EQ6[S], EX5[S], EX6[S], NZ[S])
x.name = c("EQ5","EQ6","EX5","EX6","NZ")
colnames(x) = c(x.name, x.name)
plot.ts(x, main="")
mtext("P waves", side=3, line=1.2, adj=.05, cex=1.2)
mtext("S waves", side=3, line=1.2, adj=.85, cex=1.2)
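The amplitude comparison mentioned above can be checked quickly; the rough sketch below assumes the astsa data frame eqexp, with the P phase in rows 1-1024 and the S phase in rows 1025-2048 of each column, and uses the sample standard deviation as a crude amplitude measure.

library(astsa)
P = 1:1024; S = P + 1024
ratio = apply(eqexp, 2, function(z) sd(z[S]) / sd(z[P]))   # S/P amplitude ratio per event
round(ratio, 2)   # earthquake columns tend to show larger ratios than explosion columns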

Fig. 7.2. Various bivariate earthquakes (EQ) and explosions (EX) recorded at 40 pts/sec compared with an event NZ (Novaya Zemlya) of unknown origin. Compressional waves, also known as primary or P waves, travel fastest in the Earth's crust and are first to arrive. Shear waves propagate more slowly through the Earth and arrive second; hence they are called secondary or S waves.

Finally, for multivariate processes, the structure of the spectral matrix is also of great interest. We might reduce the dimension of the underlying process to a smaller set of input processes that explain most of the variability in the cross-spectral matrix as a function of frequency. Principal component analysis can be used to decompose the spectral matrix into a smaller subset of component factors that explain decreasing amounts of power. For example, the hydrological data might be explained in terms of a component process that weights heavily on precipitation and inflow and one that weights heavily on temperature and cloud cover. Perhaps these two components could explain most of the power in the spectral matrix at a given frequency. The ideas behind principal component analysis can also be generalized to include an optimal scaling methodology for categorical data called the spectral envelope (see Stoffer et al., 1993).

7.2 Spectral Matrices and Likelihood Functions

We have previously argued for an approximation to the log likelihood based on the joint distribution of the DFTs in (4.85), where we used the approximation as an aid in estimating parameters for certain parameterized spectra. In this chapter, we make heavy use of the fact that the sine and cosine transforms of the p x 1 vector process

x_t = (x_{t1}, x_{t2}, \dots, x_{tp})', with mean E x_t = \mu_t, say, with DFT^{7.1}

X(\omega_k) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i \omega_k t} = X_c(\omega_k) - i X_s(\omega_k)     (7.1)

and mean

M(\omega_k) = n^{-1/2} \sum_{t=1}^{n} \mu_t e^{-2\pi i \omega_k t} = M_c(\omega_k) - i M_s(\omega_k)     (7.2)

will be approximately uncorrelated, where we evaluate at the usual Fourier frequencies {\omega_k = k/n, 0 < |\omega_k| < 1/2}. By Theorem C.6, the approximate 2p x 2p covariance matrix of the cosine and sine transforms, say, X(\omega_k) = (X_c(\omega_k)', X_s(\omega_k)')', is

\Sigma(\omega_k) = \tfrac{1}{2} \begin{pmatrix} C(\omega_k) & -Q(\omega_k) \\ Q(\omega_k) & C(\omega_k) \end{pmatrix},     (7.3)

and the real and imaginary parts are jointly normal. This result implies, by the results stated in Appendix C, the density function of the vector DFT, say, X(\omega_k), can be approximated as

p(X(\omega_k)) \approx |f(\omega_k)|^{-1} \exp\big\{ -\big(X(\omega_k) - M(\omega_k)\big)^{*} f^{-1}(\omega_k) \big(X(\omega_k) - M(\omega_k)\big) \big\},

where the spectral matrix is the usual

f(\omega_k) = C(\omega_k) - i Q(\omega_k).     (7.4)

Certain computations that we do in the section on discriminant analysis will involve approximating the joint likelihood by the product of densities like the one given above over subsets of the frequency band 0 < \omega_k < 1/2.

To use the likelihood function for estimating the spectral matrix, for example, we appeal to the limiting result implied by Theorem C.7 and again choose L frequencies in the neighborhood of some target frequency \omega_k, say, \omega_k \pm \ell/n, for \ell = 1, \dots, m, and L = 2m + 1. Then, let X_\ell = X(\omega_k + \ell/n) denote the indexed values, and note the DFTs of the mean adjusted vector process are approximately jointly normal with mean zero and complex covariance matrix f = f(\omega_k). Then, write the log likelihood over the L sub-frequencies as

\ln L(X_{-m}, \dots, X_{m}; f(\omega_k)) \approx -L \ln |f(\omega_k)| - \sum_{\ell=-m}^{m} (X_\ell - M_\ell)^{*} f^{-1}(\omega_k) (X_\ell - M_\ell).     (7.5)

The use of spectral approximations to the likelihood has been fairly standard, beginning with the work of Whittle (1961) and continuing in Brillinger (1981) and Hannan

7.1 In previous chapters, the DFT of a process x_t was denoted by d_x(\omega_k). In this chapter, we will consider the Fourier transforms of many different processes and so, to avoid the overuse of subscripts and to ease the notation, we use a capital letter, e.g., X(\omega_k), to denote the DFT of x_t. This notation is standard in the digital signal processing (DSP) literature.

Fig. 7.3. Monthly values of weather and inflow at Lake Shasta (climhyd).

(1970). Assuming the mean adjusted series are available, i.e., M_\ell is known, we obtain the maximum likelihood estimator for f, namely,

\hat{f}(\omega_k) = L^{-1} \sum_{\ell=-m}^{m} (X_\ell - M_\ell)(X_\ell - M_\ell)^{*};     (7.6)

see Problem 7.2.

7.3 Regression for Jointly Stationary Series

In Section 4.7, we considered a model of the form

y_t = \sum_{r=-\infty}^{\infty} \beta_{1r} x_{t-r,1} + v_t,     (7.7)

where x_{t1} is a single observed input series and y_t is the observed output series, and we are interested in estimating the filter coefficients \beta_{1r} relating the adjacent lagged values of x_{t1} to the output series y_t. In the case of the SOI and Recruitment series, we identified the El Niño driving series as the input, x_{t1}, and the Recruitment series as the output, y_t. In general, more than a single plausible input series may exist. For example, the Lake Shasta inflow hydrological data (climhyd) shown in Figure 7.3 suggests there may be at least five possible series driving the inflow; see Example 7.1 for more details. Hence, we may envision a q x 1 input vector of driving series,

say, x_t = (x_{t1}, x_{t2}, \dots, x_{tq})', and a q x 1 vector of regression functions \beta_r = (\beta_{1r}, \beta_{2r}, \dots, \beta_{qr})', which are related as

y_t = \sum_{r=-\infty}^{\infty} \beta_r' x_{t-r} + v_t = \sum_{j=1}^{q} \sum_{r=-\infty}^{\infty} \beta_{jr} x_{t-r,j} + v_t,     (7.8)

which shows that the output is a sum of linearly filtered versions of the input processes and a stationary noise process v_t, assumed to be uncorrelated with x_t. Each filtered component in the sum over j gives the contribution of lagged values of the j-th input series to the output series. We assume the regression functions \beta_{jr} are fixed and unknown.

The model given by (7.8) is useful under several different scenarios, corresponding to a number of different assumptions that can be made about the components. Assuming the input and output processes are jointly stationary with zero means leads to the conventional regression analysis given in this section. The analysis depends on theory that assumes we observe the output process y_t conditional on fixed values of the input vector x_t; this is the same as the assumptions made in conventional regression analysis. Assumptions considered later involve letting the coefficient vector \beta_t be a random unknown signal vector that can be estimated by Bayesian arguments, using the conditional expectation given the data. The answers to this approach, given in Section 7.5, allow signal extraction and deconvolution problems to be handled. Assuming the inputs are fixed allows various experimental designs and analysis of variance to be done for both fixed and random effects models. Estimation of the frequency-dependent random effects variance components in the analysis of variance model is also considered in Section 7.5.

For the approach in this section, assume the inputs and outputs have zero means and are jointly stationary, with the (q+1) x 1 vector process (x_t', y_t)' of inputs x_t and outputs y_t assumed to have a spectral matrix of the form

f(\omega) = \begin{pmatrix} f_{xx}(\omega) & f_{xy}(\omega) \\ f_{yx}(\omega) & f_{yy}(\omega) \end{pmatrix},     (7.9)

where f_{yx}(\omega) = (f_{yx_1}(\omega), f_{yx_2}(\omega), \dots, f_{yx_q}(\omega))' is the q x 1 vector of cross-spectra relating the q inputs to the output and f_{xx}(\omega) is the q x q spectral matrix of the inputs. Generally, we observe the inputs and search for the vector of regression functions \beta_t relating the inputs to the outputs. We assume all autocovariance functions satisfy the absolute summability conditions of the form

\sum_{h=-\infty}^{\infty} |h|\, |\gamma_{jk}(h)| < \infty     (7.10)

(j, k = 1, \dots, q+1), where \gamma_{jk}(h) is the autocovariance corresponding to the cross-spectrum f_{jk}(\omega) in (7.9). We also need to assume a linear process of the form (C.35) as a condition for using Theorem C.7 on the joint distribution of the discrete Fourier transforms in the neighborhood of some fixed frequency.

Estimation of the Regression Function

In order to estimate the regression function \beta_r, the Projection Theorem (Appendix B) applied to minimizing

MSE = E\Big[ \big( y_t - \sum_{r=-\infty}^{\infty} \beta_r' x_{t-r} \big)^2 \Big]     (7.11)

leads to the orthogonality conditions

E\Big[ \big( y_t - \sum_{r=-\infty}^{\infty} \beta_r' x_{t-r} \big)\, x_{t-s}' \Big] = 0'     (7.12)

for all s = 0, \pm 1, \pm 2, \dots, where 0' denotes the 1 x q zero vector. Taking the expectations inside and substituting the definitions of the autocovariance functions leads to the normal equations

\sum_{r=-\infty}^{\infty} \beta_r' \Gamma_{xx}(s-r) = \gamma_{yx}'(s),     (7.13)

for s = 0, \pm 1, \pm 2, \dots, where \Gamma_{xx}(s) denotes the q x q autocovariance matrix of the vector series x_t at lag s and \gamma_{yx}(s) = (\gamma_{yx_1}(s), \dots, \gamma_{yx_q}(s))' is a q x 1 vector containing the lagged covariances between y_t and x_t. Again, a frequency domain approximate solution is easier in this case because the computations can be done frequency by frequency using cross-spectra that can be estimated from sample data using the DFT. In order to develop the frequency domain solution, substitute the spectral representations into the normal equations, using the same approach as used in the simple case derived in Section 4.7. This approach yields

\sum_{r=-\infty}^{\infty} \beta_r' \int_{-1/2}^{1/2} f_{xx}(\omega) e^{2\pi i \omega (s-r)}\, d\omega = \gamma_{yx}'(s).

Now, because \gamma_{yx}'(s) is the Fourier transform of the cross-spectral vector f_{yx}(\omega) = f_{xy}^{*}(\omega), we might write the system of equations in the frequency domain, using the uniqueness of the Fourier transform, as

B'(\omega) f_{xx}(\omega) = f_{xy}^{*}(\omega),     (7.14)

where f_{xx}(\omega) is the q x q spectral matrix of the inputs and B(\omega) is the q x 1 vector Fourier transform of \beta_t. Multiplying (7.14) on the right by f_{xx}^{-1}(\omega), assuming f_{xx}(\omega) is nonsingular at \omega, leads to the frequency domain estimator

B'(\omega) = f_{xy}^{*}(\omega) f_{xx}^{-1}(\omega).     (7.15)

Note, (7.15) implies the regression function would take the form

\beta_t = \int_{-1/2}^{1/2} B(\omega) e^{2\pi i \omega t}\, d\omega.     (7.16)
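For orientation, it may help to record the single-input special case (q = 1) of (7.14)-(7.16), which reduces to scalar quantities and parallels the simple case from Section 4.7 referred to above; here f_{xy}^{*}(\omega) is just f_{yx}(\omega), so

B(\omega) = \frac{f_{yx}(\omega)}{f_{xx}(\omega)}, \qquad \beta_t = \int_{-1/2}^{1/2} \frac{f_{yx}(\omega)}{f_{xx}(\omega)}\, e^{2\pi i \omega t}\, d\omega.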

As before, it is conventional to introduce the DFT as the approximate estimator for the integral (7.16) and write

\beta_t^{M} = M^{-1} \sum_{k=0}^{M-1} B(\omega_k) e^{2\pi i \omega_k t},     (7.17)

where \omega_k = k/M, M << n. The approximation was shown in Problem 4.35 to hold exactly as long as \beta_t = 0 for |t| \geq M/2 and to have a mean-squared-error bounded by a function of the zero-lag autocovariance and the absolute sum of the neglected coefficients.

The mean-squared error (7.11) can be written using the orthogonality principle, giving

MSE = \int_{-1/2}^{1/2} f_{y \cdot x}(\omega)\, d\omega,     (7.18)

where

f_{y \cdot x}(\omega) = f_{yy}(\omega) - f_{xy}^{*}(\omega) f_{xx}^{-1}(\omega) f_{xy}(\omega)     (7.19)

denotes the residual or error spectrum. The resemblance of (7.19) to the usual equations in regression analysis is striking. It is useful to pursue the multiple regression analogy further by noting a squared multiple coherence can be defined as

\rho_{y \cdot x}^{2}(\omega) = \frac{ f_{xy}^{*}(\omega) f_{xx}^{-1}(\omega) f_{xy}(\omega) }{ f_{yy}(\omega) }.     (7.20)

This expression leads to the mean squared error in the form

MSE = \int_{-1/2}^{1/2} f_{yy}(\omega) \big[ 1 - \rho_{y \cdot x}^{2}(\omega) \big]\, d\omega,     (7.21)

and we have an interpretation of \rho_{y \cdot x}^{2}(\omega) as the proportion of power accounted for by the lagged regression on x_t at frequency \omega. If \rho_{y \cdot x}^{2}(\omega) = 0 for all \omega, we have

MSE = \int_{-1/2}^{1/2} f_{yy}(\omega)\, d\omega = E[y_t^2],

which is the mean squared error when no predictive power exists. As long as f_{xx}(\omega) is positive definite at all frequencies, MSE \geq 0, and we will have

0 \leq \rho_{y \cdot x}^{2}(\omega) \leq 1     (7.22)

for all \omega. If the multiple coherence is unity for all frequencies, the mean squared error in (7.21) is zero and the output series is perfectly predicted by a linearly filtered combination of the inputs. Problem 7.3 shows the ordinary squared coherence between the series y_t and the linearly filtered combinations of the inputs appearing in (7.11) is exactly (7.20).
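As a concrete illustration of (7.20), the squared multiple coherence can be computed directly from an estimated spectral matrix. The sketch below uses the Lake Shasta data introduced earlier and assumes that astsa's mvspec returns the array of spectral-matrix estimates in the fxx component (dimension p x p x number of frequencies); Example 7.1 obtains the same type of quantity through stoch.reg.

library(astsa)
Y = climhyd
Y[, "Inflow"] = log(Y[, "Inflow"])        # transformed output
Y[, "Precip"] = sqrt(Y[, "Precip"])       # transformed input
Y = Y[, c("Temp", "Precip", "Inflow")]    # q = 2 inputs, output last
Yspec = mvspec(Y, spans = 25, taper = .1, plot = FALSE)
q = 2
rho2 = rep(NA, length(Yspec$freq))
for (k in seq_along(Yspec$freq)) {
  fxx = Yspec$fxx[1:q, 1:q, k]            # input spectral matrix at frequency k
  fxy = Yspec$fxx[1:q, q + 1, k]          # cross-spectra between inputs and inflow
  fyy = Re(Yspec$fxx[q + 1, q + 1, k])    # inflow spectrum
  rho2[k] = Re(Conj(fxy) %*% solve(fxx, fxy)) / fyy   # (7.20) at frequency k
}
plot(Yspec$freq, rho2, type = "l", xlab = "Frequency", ylab = "Sq Coherence")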

Estimation Using Sampled Data

Clearly, the matrices of spectra and cross-spectra will not ordinarily be known, so the regression computations need to be based on sampled data. We assume, therefore, the inputs x_{t1}, x_{t2}, \dots, x_{tq} and output y_t series are available at the time points t = 1, 2, \dots, n, as in Chapter 4. In order to develop reasonable estimates for the spectral quantities, some replication must be assumed. Often, only one replication of each of the inputs and the output will exist, so it is necessary to assume a band exists over which the spectra and cross-spectra are approximately equal to f_{xx}(\omega) and f_{xy}(\omega), respectively. Then, let Y(\omega_k + \ell/n) and X(\omega_k + \ell/n) be the DFTs of y_t and x_t over the band, say, at frequencies of the form

\omega_k \pm \ell/n, \quad \ell = 1, \dots, m,

where L = 2m + 1 as before. Then, simply substitute the sample spectral matrix

\hat{f}_{xx}(\omega) = L^{-1} \sum_{\ell=-m}^{m} X(\omega_k + \ell/n)\, X^{*}(\omega_k + \ell/n)     (7.23)

and the vector of sample cross-spectra

\hat{f}_{xy}(\omega) = L^{-1} \sum_{\ell=-m}^{m} X(\omega_k + \ell/n)\, \overline{Y(\omega_k + \ell/n)}     (7.24)

for the respective terms in (7.15) to get the regression estimator \hat{B}(\omega). For the regression estimator (7.17), we may use

\hat{\beta}_t^{M} = M^{-1} \sum_{k=0}^{M-1} \hat{f}_{xx}^{-1}(\omega_k)\, \hat{f}_{xy}(\omega_k)\, e^{2\pi i \omega_k t},     (7.25)

for t = 0, \pm 1, \pm 2, \dots, \pm(M/2 - 1), as the estimated regression function.

Tests of Hypotheses

The estimated squared multiple coherence, corresponding to the theoretical coherence (7.20), becomes

\hat{\rho}_{y \cdot x}^{2}(\omega) = \frac{ \hat{f}_{xy}^{*}(\omega) \hat{f}_{xx}^{-1}(\omega) \hat{f}_{xy}(\omega) }{ \hat{f}_{yy}(\omega) }.     (7.26)

We may obtain a distributional result for the multiple coherence function analogous to that obtained in the univariate case by writing the multiple regression model in the frequency domain, as was done in Section 4.5. We obtain the statistic

F_{2q, 2(L-q)} = \frac{(L-q)}{q}\, \frac{ \hat{\rho}_{y \cdot x}^{2}(\omega) }{ 1 - \hat{\rho}_{y \cdot x}^{2}(\omega) },     (7.27)

Table 7.1. ANOPOW for the Partitioned Regression Model

Source                                    Power                            Degrees of Freedom
x_{t,q_1+1}, ..., x_{t,q_1+q_2}           SSR(\omega) (7.34)               2q_2
Error                                     SSE(\omega) (7.35)               2(L - q_1 - q_2)
Total                                     L \hat{f}_{y \cdot 1}(\omega)    2(L - q_1)

which has an F-distribution with 2q and 2(L-q) degrees of freedom under the null hypothesis that \rho_{y \cdot x}^{2}(\omega) = 0, or equivalently, that B(\omega) = 0, in the model

Y(\omega_k + \ell/n) = B'(\omega) X(\omega_k + \ell/n) + V(\omega_k + \ell/n),     (7.28)

where the spectral density of the error V(\omega_k + \ell/n) is f_{y \cdot x}(\omega). Problem 7.4 sketches a derivation of this result.

A second kind of hypothesis of interest is one that might be used to test whether a full model with q inputs is significantly better than some submodel with q_1 < q components. In the time domain, this hypothesis implies, for a partition of the vector of inputs into q_1 and q_2 components (q_1 + q_2 = q), say, x_t = (x_{1t}', x_{2t}')', and the similarly partitioned vector of regression functions \beta_t = (\beta_{1t}', \beta_{2t}')', we would be interested in testing whether \beta_{2t} = 0 in the partitioned regression model

y_t = \sum_{r=-\infty}^{\infty} \beta_{1r}' x_{1,t-r} + \sum_{r=-\infty}^{\infty} \beta_{2r}' x_{2,t-r} + v_t.     (7.29)

Rewriting the regression model (7.29) in the frequency domain in a form that is similar to (7.28) establishes that, under the partitions of the spectral matrix into its q_i x q_j (i, j = 1, 2) submatrices, say,

\hat{f}_{xx}(\omega) = \begin{pmatrix} \hat{f}_{11}(\omega) & \hat{f}_{12}(\omega) \\ \hat{f}_{21}(\omega) & \hat{f}_{22}(\omega) \end{pmatrix},     (7.30)

and the cross-spectral vector into its q_i x 1 (i = 1, 2) subvectors,

\hat{f}_{xy}(\omega) = \begin{pmatrix} \hat{f}_{1y}(\omega) \\ \hat{f}_{2y}(\omega) \end{pmatrix},     (7.31)

we may test the hypothesis \beta_{2t} = 0 at frequency \omega by comparing the estimated residual power

\hat{f}_{y \cdot x}(\omega) = \hat{f}_{yy}(\omega) - \hat{f}_{xy}^{*}(\omega) \hat{f}_{xx}^{-1}(\omega) \hat{f}_{xy}(\omega)     (7.32)

under the full model with that under the reduced model, given by

\hat{f}_{y \cdot 1}(\omega) = \hat{f}_{yy}(\omega) - \hat{f}_{1y}^{*}(\omega) \hat{f}_{11}^{-1}(\omega) \hat{f}_{1y}(\omega).     (7.33)

The power due to regression can be written as

SSR(\omega) = L \big[ \hat{f}_{y \cdot 1}(\omega) - \hat{f}_{y \cdot x}(\omega) \big],     (7.34)

Fig. 7.4. Squared coherency between Lake Shasta inflow and (a) temperature; (b) dew point; (c) cloud cover; (d) wind speed; (e) precipitation. The multiple coherency between inflow and temperature and precipitation jointly is displayed in (f). In each case, the .001 threshold is exhibited as a horizontal line.

with the usual error power given by

SSE(\omega) = L \hat{f}_{y \cdot x}(\omega).     (7.35)

The test of no regression proceeds using the F-statistic

F_{2q_2, 2(L-q)} = \frac{(N-q)}{q_2}\, \frac{SSR(\omega)}{SSE(\omega)} = \frac{(L-q)}{q_2}\, \frac{SSR(\omega)}{SSE(\omega)}.     (7.36)

The distribution of this F-statistic with 2q_2 numerator degrees of freedom and 2(L-q) denominator degrees of freedom follows from an argument paralleling that given in Chapter 4 for the case of a single input. The test results can be summarized in an Analysis of Power (ANOPOW) table that parallels the usual analysis of variance (ANOVA) table. Table 7.1 shows the components of power for testing \beta_{2t} = 0 at a particular frequency \omega. The ratio of the two components, divided by their respective degrees of freedom, just yields the F-statistic (7.36) used for testing whether the q_2 series add significantly to the predictive power of the regression on the q_1 series.

Example 7.1 Predicting Lake Shasta Inflow
We illustrate some of the preceding ideas by considering the problem of predicting the transformed (logged) inflow series shown in Figure 7.3 from some combination of the inputs. First, look for the best single input predictor using the squared coherence function (7.26). The results, exhibited in Figure 7.4(a)-(e), show transformed

Fig. 7.5. Partial F-statistics [top] for testing whether temperature adds to the ability to predict Lake Shasta inflow when precipitation is included in the model. The dashed line indicates the .001 FDR level and the solid line represents the corresponding quantile of the null F distribution. Multiple impulse response functions for the regression relations of temperature [middle] and precipitation [bottom].

(square root) precipitation produces the most consistently high squared coherence values at all frequencies (L = 25), with the seasonal period contributing most significantly. Other inputs, with the exception of wind speed, also appear to be plausible contributors. Figure 7.4(a)-(e) shows a .001 threshold corresponding to the F-statistic, separately, for each possible predictor of inflow.

Next, we focus on the analysis with two predictor series, temperature and transformed precipitation. The additional contribution of temperature to the model seems somewhat marginal because the multiple coherence (7.26), shown in Figure 7.4(f), seems only slightly better than the univariate coherence with precipitation shown in Figure 7.4(e). The top of Figure 7.5 shows the partial F-statistic, (7.36), for testing if temperature is predictive of inflow when precipitation is in the model. In addition, threshold values corresponding to a false discovery rate (FDR) of .001 (see Benjamini & Hochberg, 1995) and the corresponding null F quantile are displayed in that figure.

Although the contribution of temperature is marginal, it is instructive to produce the multiple regression functions, using (7.25), to see if a simple model for inflow exists that would involve some regression combination of inputs temperature and

precipitation that would be useful for predicting inflow to Lake Shasta. With this in mind, denoting the possible inputs P_t for transformed precipitation and T_t for transformed temperature, the regression functions have been plotted in the lower two panels of Figure 7.5 using a value of M = 100 for each of the two inputs. In that figure, the time index runs over both positive and negative values and is centered at time t = 0. Hence, the relation with temperature seems to be instantaneous and positive, and an exponentially decaying relation to precipitation exists that has been noticed previously in the analysis in Problem 4.37. The plots suggest a transfer function model of the general form fitted to the Recruitment and SOI series in Example 5.8. We might propose fitting the inflow output, say, I_t, using the model

I_t = \alpha_0 + \alpha_2 T_t + \frac{\delta_0}{1 - \omega_1 B} P_t + \eta_t,

which, without the temperature component, is the transfer function model considered in that section. The R code for this example is as follows.
plot.ts(climhyd)       # Figure 7.3
Y = climhyd            # Y holds the transformed series
Y[,6] = log(Y[,6])     # log inflow
Y[,5] = sqrt(Y[,5])    # sqrt precipitation
L = 25; M = 100; alpha = .001; fdr = .001
nq = 2                 # number of inputs (Temp and Precip)
# Spectral Matrix
Yspec = mvspec(Y, spans=L, kernel="daniell", detrend=TRUE, demean=FALSE, taper=.1)
n = Yspec$n.used          # effective sample size
Fr = Yspec$freq           # fundamental freqs
n.freq = length(Fr)       # number of frequencies
Yspec$bandwidth*sqrt(12)  # = 0.050 - the bandwidth
# Coherencies
Fq = qf(1-alpha, 2, L-2)
cn = Fq/(L-1+Fq)
plt.name = c("(a)","(b)","(c)","(d)","(e)","(f)")
dev.new()
par(mfrow=c(2,3), cex.lab=1.2)
# The coherencies are listed as 1,2,...,15=choose(6,2)
for (i in 11:15){
  plot(Fr, Yspec$coh[,i], type="l", ylab="Sq Coherence", xlab="Frequency",
       ylim=c(0,1), main=c("Inflow with", names(climhyd[i-10])))
  abline(h = cn)
  text(.45,.98, plt.name[i-10], cex=1.2)
}
# Multiple Coherency
coh.15 = stoch.reg(Y, cols.full = c(1,5), cols.red = NULL, alpha, L, M, plot.which = "coh")
text(.45 ,.98, plt.name[6], cex=1.2)
title(main = c("Inflow with", "Temp and Precip"))
# Partial F (called eF; avoid use of F alone)
numer.df = 2*nq; denom.df = Yspec$df-2*nq
dev.new()
par(mfrow=c(3,1), mar=c(3,3,2,1)+.5, mgp = c(1.5,0.4,0), cex.lab=1.2)
out.15 = stoch.reg(Y, cols.full = c(1,5), cols.red = 5, alpha, L, M, plot.which = "F.stat")
eF = out.15$eF
pvals = pf(eF, numer.df, denom.df, lower.tail = FALSE)
pID = FDR(pvals, fdr); abline(h=c(eF[pID]), lty=2)
title(main = "Partial F Statistic")

# Regression Coefficients
S = seq(from = -M/2+1, to = M/2 - 1, length = M-1)
plot(S, coh.15$Betahat[,1], type = "h", xlab = "", ylab = names(climhyd[1]),
     ylim = c(-.025, .055), lwd=2)
abline(h=0); title(main = "Impulse Response Functions")
plot(S, coh.15$Betahat[,2], type = "h", xlab = "Index", ylab = names(climhyd[5]),
     ylim = c(-.015, .055), lwd=2)
abline(h=0)

7.4 Regression with Deterministic Inputs

The previous section considered the case in which the input and output series were jointly stationary, but there are many circumstances in which we might want to assume that the input functions are fixed and have a known functional form. This happens in the analysis of data from designed experiments. For example, we may want to take a collection of earthquakes and explosions such as are shown in Figure 7.2 and test whether the mean functions are the same for either the P or S components or, perhaps, for them jointly. In certain other signal detection problems using arrays, the inputs are used as dummy variables to express lags corresponding to the arrival times of the signal at various elements, under a model corresponding to that of a plane wave from a fixed source propagating across the array. In Figure 7.1, we plotted the mean responses of the cortex as a function of various underlying design configurations corresponding to various stimuli applied to awake and mildly anesthetized subjects.

It is necessary to introduce a replicated version of the underlying model to handle even the univariate situation, and we replace (7.8) by

y_{jt} = \sum_{r=-\infty}^{\infty} \beta_r' z_{j,t-r} + v_{jt}     (7.37)

for j = 1, 2, \dots, N series, where we assume the vector of known deterministic inputs, z_{jt} = (z_{jt1}, \dots, z_{jtq})', satisfies

\sum_{t=-\infty}^{\infty} |t|\, |z_{jtk}| < \infty

for j = 1, \dots, N replicates of an underlying process involving k = 1, \dots, q regression functions. The model can also be treated under the assumption that the deterministic functions satisfy Grenander's conditions, as in Hannan (1970), but we do not need those conditions here and simply follow the approach in Shumway (1983, 1988).

It will sometimes be convenient in what follows to represent the model in matrix notation, writing (7.37) as

y_t = \sum_{r=-\infty}^{\infty} z_{t-r} \beta_r + v_t,     (7.38)

where z_t = (z_{1t}, \dots, z_{Nt})' are the N x q matrices of independent inputs and y_t and v_t are the N x 1 output and error vectors. The error vector v_t = (v_{1t}, \dots, v_{Nt})' is

Fig. 7.6. Three series for a nuclear explosion detonated 25 km south of Christmas Island and the delayed average or beam. The time scale is 10 points per second.

assumed to be a multivariate, zero-mean, stationary, normal process with spectral matrix f_v(\omega) I_N that is proportional to the N x N identity matrix. That is, we assume the error series v_{jt} are independently and identically distributed with spectral densities f_v(\omega).

Example 7.2 An Infrasonic Signal from a Nuclear Explosion
Often, we will observe a common signal, say, \beta_t, on an array of sensors, with the response at the j-th sensor denoted by y_{jt}, j = 1, \dots, N. For example, Figure 7.6 shows an infrasonic or low-frequency acoustic signal from a nuclear explosion, as observed on a small triangular array of N = 3 acoustic sensors. These signals appear at slightly different times. Because of the way signals propagate, a plane wave signal of this kind, from a given source, traveling at a given velocity, will arrive at elements in the array at predictable time delays. In the case of the infrasonic signal in Figure 7.6, the delays were approximated by computing the cross-correlation between elements and simply reading off the time delay corresponding to the maximum. For a detailed discussion of the statistical analysis of array signals, see Shumway et al. (1999).

A simple additive signal-plus-noise model of the form

y_{jt} = \beta_{t - \tau_j} + v_{jt}     (7.39)

can be assumed, where \tau_j, j = 1, 2, \dots, N, are the time delays that determine the start point of the signal at each element of the array. The model (7.39) is written in the form (7.37) by letting z_{jt} = \delta_{t - \tau_j}, where \delta_t = 1 when t = 0 and is zero otherwise. In this case, we are interested in both the problem of detecting the presence of the signal and in estimating its waveform, \beta_t. In this case, a plausible estimator of the waveform would be the unbiased beam, say,

\hat{\beta}_t = \frac{1}{N} \sum_{j=1}^{N} y_{j, t + \tau_j},     (7.40)

where the time delays in this case were measured as \tau_1 = 17, \tau_2 = 0, and \tau_3 = -22 from the cross-correlation function. The bottom panel of Figure 7.6 shows the computed beam in this case; the noise in the individual channels has been reduced and the essential characteristics of the common signal are retained in the average. The R code for this example is
attach(beamd)
tau = rep(0,3)
u = ccf(sensor1, sensor2, plot=FALSE)
tau[1] = u$lag[which.max(u$acf)]    #  17
u = ccf(sensor3, sensor2, plot=FALSE)
tau[3] = u$lag[which.max(u$acf)]    # -22
Y = ts.union(lag(sensor1,tau[1]), lag(sensor2, tau[2]), lag(sensor3, tau[3]))
Y = ts.union(Y, rowMeans(Y))
colnames(Y) = c('sensor1', 'sensor2', 'sensor3', 'beam')
plot.ts(Y)

The above discussion and example serve to motivate a more detailed look at the estimation and detection problems in the case in which the input series z_{jt} are fixed and known. We consider the modifications needed for this case in the following sections.

Estimation of the Regression Relation

Because the regression model (7.37) involves fixed functions, we may parallel the usual approach using the Gauss-Markov theorem to search for linear-filtered estimators of the form

\hat{\beta}_t = \sum_{j=1}^{N} \sum_{r=-\infty}^{\infty} h_{jr} y_{j, t-r},     (7.41)

where h_{jt} = (h_{jt1}, \dots, h_{jtq})' is a vector of filter coefficients, determined so the estimators are unbiased and have minimum variance. The equivalent matrix form is

\hat{\beta}_t = \sum_{r=-\infty}^{\infty} h_r y_{t-r},     (7.42)

where h_t = (h_{1t}, \dots, h_{Nt}) is a q x N matrix of filter functions. The matrix form resembles the usual classical regression case and is more convenient for extending the Gauss-Markov Theorem to lagged regression. The unbiased condition is considered in Problem 7.6. It can be shown (see Shumway and Dean, 1968) that h_{js} can be taken as the Fourier transform of

H_j(\omega) = S_z^{-1}(\omega) Z_j(\omega),     (7.43)

where

Z_j(\omega) = \sum_{t=-\infty}^{\infty} z_{jt} e^{-2\pi i \omega t}     (7.44)

is the infinite Fourier transform of z_{jt}. The matrix

S_z(\omega) = \sum_{j=1}^{N} \overline{Z_j(\omega)}\, Z_j(\omega)'     (7.45)

can be written in the form

S_z(\omega) = Z^{*}(\omega) Z(\omega),     (7.46)

where the N x q matrix Z(\omega) is defined by Z(\omega) = (Z_1(\omega), \dots, Z_N(\omega))'. In matrix notation, the Fourier transform of the optimal filter becomes

H(\omega) = S_z^{-1}(\omega) Z^{*}(\omega),     (7.47)

where H(\omega) = (H_1(\omega), \dots, H_N(\omega)) is the q x N matrix of frequency response functions. The optimal filter then becomes the Fourier transform

h_t = \int_{-1/2}^{1/2} H(\omega) e^{2\pi i \omega t}\, d\omega.     (7.48)

If the transform is not tractable to compute, an approximation analogous to (7.25) may be used.

Example 7.3 Estimation of the Infrasonic Signal in Example 7.2
We consider the problem of producing a best linearly filtered unbiased estimator for the infrasonic signal in Example 7.2. In this case, q = 1 and (7.44) becomes

Z_j(\omega) = \sum_{t=-\infty}^{\infty} \delta_{t - \tau_j} e^{-2\pi i \omega t} = e^{-2\pi i \omega \tau_j}

and S_z(\omega) = N. Hence, we have

H_j(\omega) = \frac{1}{N} e^{2\pi i \omega \tau_j}.

Using (7.48), we obtain h_{jt} = \frac{1}{N}\delta_{t + \tau_j}. Substituting in (7.41), we obtain the best linear unbiased estimator as the beam, computed as in (7.40).

Tests of Hypotheses

We consider first testing the hypothesis that the complete vector \beta_t is zero, i.e., that the vector signal is absent. We develop a test at each frequency \omega by taking single adjacent frequencies of the form \omega_k = k/n, as in the initial section. We may approximate the DFT of the observed vector in the model (7.37) using a representation of the form

Y_j(\omega_k) = B'(\omega_k) Z_j(\omega_k) + V_j(\omega_k)     (7.49)

for j = 1, \dots, N, where the error terms will be uncorrelated with common variance f(\omega_k), the spectral density of the error term. The independent variables Z_j(\omega_k) can

either be the infinite Fourier transform, or they can be approximated by the DFT. Hence, we can obtain the matrix version of a complex regression model, written in the form

Y(\omega_k) = Z(\omega_k) B(\omega_k) + V(\omega_k),     (7.50)

where the N x q matrix Z(\omega_k) has been defined previously below (7.46) and Y(\omega_k) and V(\omega_k) are N x 1 vectors, with the error vector V(\omega_k) having mean zero and covariance matrix f(\omega_k) I_N. The usual regression arguments show that the maximum likelihood estimator for the regression coefficient will be

\hat{B}(\omega_k) = S_z^{-1}(\omega_k)\, s_{zy}(\omega_k),     (7.51)

where S_z(\omega_k) is given by (7.46) and

s_{zy}(\omega_k) = Z^{*}(\omega_k) Y(\omega_k) = \sum_{j=1}^{N} \overline{Z_j(\omega_k)}\, Y_j(\omega_k).     (7.52)

Also, the maximum likelihood estimator for the error spectral matrix is proportional to

s_{y \cdot z}^{2}(\omega_k) = \sum_{j=1}^{N} | Y_j(\omega_k) - \hat{B}'(\omega_k) Z_j(\omega_k) |^{2}
  = Y^{*}(\omega_k) Y(\omega_k) - Y^{*}(\omega_k) Z(\omega_k) [ Z^{*}(\omega_k) Z(\omega_k) ]^{-1} Z^{*}(\omega_k) Y(\omega_k)
  = s_{y}^{2}(\omega_k) - s_{zy}^{*}(\omega_k) S_z^{-1}(\omega_k) s_{zy}(\omega_k),     (7.53)

where

s_{y}^{2}(\omega_k) = \sum_{j=1}^{N} | Y_j(\omega_k) |^{2}.     (7.54)

Under the null hypothesis that the regression coefficient B(\omega_k) = 0, the estimator for the error power is just s_{y}^{2}(\omega_k). If smoothing is needed, we may replace (7.53) and (7.54) by smoothed components over the frequencies \omega_k + \ell/n, for \ell = -m, \dots, m and L = 2m + 1, close to \omega. In that case, we obtain the regression and error spectral components as

SSR(\omega) = \sum_{\ell=-m}^{m} s_{zy}^{*}(\omega_k + \ell/n)\, S_z^{-1}(\omega_k + \ell/n)\, s_{zy}(\omega_k + \ell/n)     (7.55)

and

SSE(\omega) = \sum_{\ell=-m}^{m} s_{y \cdot z}^{2}(\omega_k + \ell/n).     (7.56)

The F-statistic for testing no regression relation is

F_{2Lq, 2L(N-q)} = \frac{(N-q)}{q}\, \frac{SSR(\omega)}{SSE(\omega)}.     (7.57)

Table 7.2. Analysis of Power (ANOPOW) for Testing No Contribution
from the Independent Series at Frequency \omega in the Fixed Input Case

Source        Power                 Degrees of Freedom
Regression    SSR(\omega) (7.55)    2Lq
Error         SSE(\omega) (7.56)    2L(N - q)
Total         SST(\omega)           2LN

The analysis of power pertaining to this situation appears in Table 7.2.

In the fixed regression case, the partitioned hypothesis is the analog of \beta_{2t} = 0 in (7.27), with x_{1t}, x_{2t} replaced by z_{1t}, z_{2t}. Here, we partition S_z(\omega) into its q_i x q_j (i, j = 1, 2) submatrices, say,

S_z(\omega_k) = \begin{pmatrix} S_{11}(\omega_k) & S_{12}(\omega_k) \\ S_{21}(\omega_k) & S_{22}(\omega_k) \end{pmatrix},     (7.58)

and the cross-spectral vector into its q_i x 1 subvectors, for i = 1, 2,

s_{zy}(\omega_k) = \begin{pmatrix} s_{1y}(\omega_k) \\ s_{2y}(\omega_k) \end{pmatrix}.     (7.59)

Here, we test the hypothesis \beta_{2t} = 0 at frequency \omega by comparing the residual power (7.53) under the full model with the residual power under the reduced model, given by

s_{y \cdot 1}^{2}(\omega_k) = s_{y}^{2}(\omega_k) - s_{1y}^{*}(\omega_k) S_{11}^{-1}(\omega_k) s_{1y}(\omega_k).     (7.60)

Again, it is desirable to add over adjacent frequencies with roughly comparable spectra so the regression and error power components can be taken as

SSR(\omega) = \sum_{\ell=-m}^{m} \big[ s_{y \cdot 1}^{2}(\omega_k + \ell/n) - s_{y \cdot z}^{2}(\omega_k + \ell/n) \big]     (7.61)

and

SSE(\omega) = \sum_{\ell=-m}^{m} s_{y \cdot z}^{2}(\omega_k + \ell/n).     (7.62)

The information can again be summarized as in Table 7.3, where the ratio of mean power regression and error components leads to the F-statistic

F_{2Lq_2, 2L(N-q)} = \frac{(N-q)}{q_2}\, \frac{SSR(\omega)}{SSE(\omega)}.     (7.63)

We illustrate the analysis of power procedure using the infrasonic signal detection procedure of Example 7.2.

Table 7.3. Analysis of Power (ANOPOW) for Testing No Contribution
from the Last q_2 Inputs in the Fixed Input Case

Source        Power                 Degrees of Freedom
Regression    SSR(\omega) (7.61)    2Lq_2
Error         SSE(\omega) (7.62)    2L(N - q)
Total         SST(\omega)           2L(N - q_1)

Example 7.4 Detecting the Infrasonic Signal Using ANOPOW
We consider the problem of detecting the common signal for the three infrasonic series shown in Figure 7.6. The presence of the signal is obvious in the waveforms shown, so the test here mainly confirms the statistical significance and isolates the frequencies containing the strongest signal components. Each series contained n = 2048 points, sampled at 10 points per second. We use the model in (7.39), so Z_j(\omega) = e^{-2\pi i \omega \tau_j} and S_z(\omega) = N as in Example 7.3, with s_{zy}(\omega_k) given as

s_{zy}(\omega_k) = \sum_{j=1}^{N} e^{2\pi i \omega_k \tau_j}\, Y_j(\omega_k),

using (7.45) and (7.52). The above expression can be interpreted as being proportional to the weighted mean, or beam, computed in frequency, and we introduce the notation

B_w(\omega_k) = \frac{1}{N} \sum_{j=1}^{N} e^{2\pi i \omega_k \tau_j}\, Y_j(\omega_k)     (7.64)

for that term. Substituting for the power components in Table 7.3 yields

s_{zy}^{*}(\omega_k) S_z^{-1}(\omega_k) s_{zy}(\omega_k) = N | B_w(\omega_k) |^{2}

and

s_{y \cdot z}^{2}(\omega_k) = \sum_{j=1}^{N} | Y_j(\omega_k) |^{2} - N | B_w(\omega_k) |^{2} = \sum_{j=1}^{N} | Y_j(\omega_k) - B_w(\omega_k) |^{2}

for the regression signal and error components, respectively. Because only three elements in the array and a reasonable number of points in time exist, it seems advisable to employ some smoothing over frequency to obtain additional degrees of freedom. In this case, L = 9, yielding 2(9) = 18 degrees of freedom for the numerator and 2(9)(3-1) = 36 degrees of freedom for the denominator of the F-statistic (7.57). The top of Figure 7.7 shows the analysis of power components due to error and the total power. The power is maximum at about .002 cycles per point or about .02 cycles per second. The F-statistic is compared with the .001 FDR and the corresponding null significance in the bottom panel and has the strongest detection at about .02 cycles

Fig. 7.7. Analysis of power for infrasound array on a log scale (top panel) with SST(\omega) shown as a solid line and SSE(\omega) as a dashed line. The F-statistics (bottom panel) showing detections, with the dashed line based on an FDR level of .001 and the solid line the corresponding null quantile.

per second. Little power of consequence appears to exist elsewhere; however, there is some marginally significant signal power near the .5 cycles per second frequency band. The R code for this example is as follows.
attach(beamd)
L = 9; fdr = .001; N = 3
Y = cbind(beamd, beam=rowMeans(beamd))
n = nextn(nrow(Y))
Y.fft = mvfft(as.ts(Y))/sqrt(n)
Df = Y.fft[,1:3]   # fft of the data
Bf = Y.fft[,4]     # beam fft
ssr = N*Re(Bf*Conj(Bf))               # raw signal spectrum
sse = Re(rowSums(Df*Conj(Df))) - ssr  # raw error spectrum
# Smooth
SSE = filter(sse, sides=2, filter=rep(1/L,L), circular=TRUE)
SSR = filter(ssr, sides=2, filter=rep(1/L,L), circular=TRUE)
SST = SSE + SSR
par(mfrow=c(2,1), mar=c(4,4,2,1)+.1)
Fr = 0:(n-1)/n   # the fundamental frequencies
nFr = 1:200      # number of freqs to plot
plot(Fr[nFr], SST[nFr], type="l", ylab="log Power", xlab="", main="Sum of Squares", log="y")
lines(Fr[nFr], SSE[nFr], type="l", lty=2)
eF = (N-1)*SSR/SSE; df1 = 2*L; df2 = 2*L*(N-1)
pvals = pf(eF, df1, df2, lower=FALSE)   # p values for FDR
pID = FDR(pvals, fdr); Fq = qf(1-fdr, df1, df2)
plot(Fr[nFr], eF[nFr], type="l", ylab="F-statistic", xlab="Frequency", main="F Statistic")
abline(h=c(Fq, eF[pID]), lty=1:2)

Although there are examples of detecting multiple regression functions of the general type considered above (see, for example, Shumway, 1983), we do not consider additional examples of partitioning in the fixed input case here. The reason is that several examples exist in the section on designed experiments that illustrate the partitioned approach.

7.5 Random Coefficient Regression

The lagged regression models considered so far have assumed the input process is either stochastic or fixed and the components of the vector of regression functions \beta_t are fixed and unknown parameters to be estimated. There are many cases in time series analysis in which it is more natural to regard the regression vector as an unknown stochastic signal. For example, we have studied the state-space model in Chapter 6, where the state equation can be considered as involving a random parameter vector that is essentially a multivariate autoregressive process. In Section 4.8, we considered estimating the univariate regression function \beta_t as a signal extraction problem.

In this section, we consider a random coefficient regression model of (7.38) in the equivalent form

y_t = \sum_{r=-\infty}^{\infty} z_{t-r} \beta_r + v_t,     (7.65)

where y_t = (y_{1t}, \dots, y_{Nt})' is the N x 1 response vector and z_t = (z_{1t}, \dots, z_{Nt})' are the N x q matrices containing the fixed input processes. Here, the components of the q x 1 regression vector \beta_t are zero-mean, uncorrelated, stationary series with common spectral matrix f_\beta(\omega) I_q, and the error series v_t have zero means and spectral matrix f_v(\omega) I_N, where I_N is the N x N identity matrix. Then, defining the N x q matrix Z(\omega) = (Z_1(\omega), Z_2(\omega), \dots, Z_N(\omega))' of Fourier transforms of z_t, as in (7.44), it is easy to show the spectral matrix of the response vector y_t is given by

f_y(\omega) = f_\beta(\omega) Z(\omega) Z^{*}(\omega) + f_v(\omega) I_N.     (7.66)

The regression model with a stochastic stationary signal component is a general version of the simple additive noise model

y_t = \beta_t + v_t,

considered by Wiener (1949) and Kolmogorov (1941), who derived the minimum mean squared error estimators for \beta_t, as in Section 4.8. The more general multivariate version (7.65) represents the series as a convolution of the signal vector \beta_t and a known set of vector input series contained in the matrix z_t. Restricting the covariance matrices of signal and noise to diagonal form is consistent with what is done in statistics using random effects models, which we consider here in a later section. The problem of estimating the regression function \beta_t is often called deconvolution in the engineering and geophysical literature.
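As a small numerical illustration of (7.66), consider the delay model of Example 7.2, where Z_j(\omega) = e^{-2\pi i \omega \tau_j}. The sketch below evaluates the implied spectral matrix of the three sensors at a single frequency, using assumed values of f_\beta(\omega) and f_v(\omega) purely for illustration.

fb = 4; fv = 1                       # assumed signal and noise spectra at this frequency
tau = c(17, 0, -22)                  # delays (in points) from Example 7.2
omega = .002                         # cycles per point, near the signal peak in Example 7.4
Z = exp(-2i * pi * omega * tau)      # Z_j(omega) = exp(-2*pi*i*omega*tau_j)
fy = fb * outer(Z, Conj(Z)) + fv * diag(3)   # (7.66): f_beta Z Z* + f_v I_N
round(fy, 3)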

Estimation of the Regression Relation

The regression function \beta_t can be estimated by a general filter of the form (7.42), where we write that estimator in matrix form

\hat{\beta}_t = \sum_{r=-\infty}^{\infty} h_r y_{t-r},     (7.67)

where h_t = (h_{1t}, \dots, h_{Nt}), and apply the orthogonality principle, as in Section 4.8. A generalization of the argument in that section (see Problem 7.7) leads to the estimator

H(\omega) = [ S_z(\omega) + \theta(\omega) I_q ]^{-1} Z^{*}(\omega)     (7.68)

for the Fourier transform of the minimum mean-squared error filter, where the parameter

\theta(\omega) = \frac{ f_v(\omega) }{ f_\beta(\omega) }     (7.69)

is the inverse of the signal-to-noise ratio. It is clear from the frequency domain version of the linear model (7.50) that the comparable version of the estimator (7.51) can be written as

\hat{B}(\omega) = [ S_z(\omega) + \theta(\omega) I_q ]^{-1} s_{zy}(\omega).     (7.70)

This version exhibits the estimator in the stochastic regressor case as the usual estimator, with a ridge correction, \theta(\omega), that is proportional to the inverse of the signal-to-noise ratio. The mean-squared covariance of the estimator is shown to be

E[ (\hat{B} - B)(\hat{B} - B)^{*} ] = f_v(\omega) [ S_z(\omega) + \theta(\omega) I_q ]^{-1},     (7.71)

which again exhibits the close connection between this case and the variance of the estimator (7.51), which can be shown to be f_v(\omega) S_z^{-1}(\omega).

Example 7.5 Estimating the Random Infrasonic Signal
In Example 7.4, we have already determined the components needed in (7.68) and (7.69) to obtain the estimators for the random signal. The Fourier transform of the optimum filter at series j has the form

H_j(\omega) = \frac{ e^{2\pi i \omega \tau_j} }{ \theta(\omega) + N }     (7.72)

with the mean-squared error given by f_v(\omega)/[N + \theta(\omega)] from (7.71). The net effect of applying the filters will be the same as filtering the beam with the frequency response function

H_0(\omega) = \frac{N}{\theta(\omega) + N} = \frac{ N f_\beta(\omega) }{ N f_\beta(\omega) + f_v(\omega) },     (7.73)

where the last form is more convenient in cases in which portions of the signal spectrum are essentially zero.
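To see what (7.73) does, one can plot the gain as a function of frequency for an assumed signal-to-noise profile; the sketch below uses made-up numbers rather than estimates from the data, and simply shows the filter passing the beam nearly unchanged where the signal dominates and shrinking it toward zero elsewhere.

N = 3
omega = seq(0, .5, length.out = 500)    # frequency in cycles per point
snr = 10 * exp(-(omega / .01)^2)        # assumed f_beta(omega)/f_v(omega)
H0 = N * snr / (N * snr + 1)            # (7.73), since theta(omega) = 1/snr
plot(omega, H0, type = "l", xlab = "Frequency", ylab = "Gain")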

The optimal filters h_t have frequency response functions that depend on the signal spectrum f_\beta(\omega) and noise spectrum f_v(\omega), so we will need estimators for these parameters to apply the optimal filters. Sometimes, there will be values, suggested from experience, for the signal-to-noise ratio 1/\theta(\omega) as a function of frequency. The analogy between the model here and the usual variance components model in statistics, however, suggests we try an approach along those lines, as in the next section.

Detection and Parameter Estimation

The analogy to the usual variance components situation suggests looking at the regression and error components of Table 7.2 under the stochastic signal assumptions. We consider the components of (7.55) and (7.56) at a single frequency \omega_k. In order to estimate the spectral components f_\beta(\omega) and f_v(\omega), we reconsider the linear model (7.50) under the assumption that B(\omega_k) is a random process with spectral matrix f_\beta(\omega_k) I_q. Then, the spectral matrix of the observed process is (7.66), evaluated at frequency \omega_k.

Consider first the component of the regression power, defined as

SSR(\omega_k) = s_{zy}^{*}(\omega_k) S_z^{-1}(\omega_k) s_{zy}(\omega_k)
            = Y^{*}(\omega_k) Z(\omega_k) S_z^{-1}(\omega_k) Z^{*}(\omega_k) Y(\omega_k).

A computation shows

E[ SSR(\omega_k) ] = f_\beta(\omega_k)\, \mathrm{tr}\{ S_z(\omega_k) \} + q f_v(\omega_k),

where tr denotes the trace of a matrix. If we can find a set of frequencies of the form \omega_k + \ell/n, where the spectra and the Fourier transforms S_z(\omega_k + \ell/n) \approx S_z(\omega) are relatively constant, the expectation of the averaged values in (7.55) yields

E[ SSR(\omega) ] = L f_\beta(\omega)\, \mathrm{tr}[ S_z(\omega) ] + Lq f_v(\omega).     (7.74)

A similar computation establishes

E[ SSE(\omega) ] = L(N - q) f_v(\omega).     (7.75)

We may obtain approximately unbiased estimators for the spectra f_\beta(\omega) and f_v(\omega) by replacing the expected power components by their observed values and solving (7.74) and (7.75).

7.6 Analysis of Designed Experiments

An important special case (see Brillinger, 1973, 1980) of the regression model (7.49) occurs when the regression (7.38) is of the form

y_t = z \beta_t + v_t,     (7.76)

7.6 Analysis of Designed Experiments

An important special case (see Brillinger, 1973, 1980) of the regression model (7.49) occurs when the regression (7.38) is of the form
y_t = z \beta_t + v_t,   (7.76)
where z = (z_1, z_2, \ldots, z_N)' is a matrix that determines what is observed by the j-th series; i.e.,
y_{jt} = z_j' \beta_t + v_{jt}.   (7.77)
In this case, the matrix of independent variables is constant, and we will have the frequency domain model
Y(\omega_k) = Z B(\omega_k) + V(\omega_k)   (7.78)
corresponding to (7.50), where the matrix Z(\omega_k) was a function of frequency \omega_k. The matrix is purely real in this case, but the equations (7.51)-(7.57) can be applied with Z(\omega_k) replaced by the constant matrix Z.

Equality of Means

A typical general problem that we encounter in analyzing real data is a simple equality of means test in which there might be a collection of time series y_{ijt}, i = 1, \ldots, I, j = 1, \ldots, N_i, belonging to I possible groups, with N_i series in group i. To test equality of means, we may write the regression model in the form
y_{ijt} = \mu_t + \alpha_{it} + v_{ijt},   (7.79)
where \mu_t denotes the overall mean, \alpha_{it} denotes the effect of the i-th group at time t, and we require that \sum_i \alpha_{it} = 0 for all t. In this case, the full model can be written in the general regression notation as
y_{ijt} = z_{ij}' \beta_t + v_{ijt},
where
\beta_t = (\mu_t, \alpha_{1t}, \alpha_{2t}, \ldots, \alpha_{I-1,t})'
denotes the regression vector, subject to the constraint. The reduced model becomes
y_{ijt} = \mu_t + v_{ijt}   (7.80)
under the assumption that the group means are equal. In the full model, there are I possible values for the I x 1 design vectors z_{ij}; the first component is always one for the mean, and the rest have a one in the position corresponding to group i, for i = 1, \ldots, I-1, and zeros elsewhere. The vector for the last group has -1 in each of positions 2, 3, \ldots, I. Under the reduced model, each z_{ij} is a single column of ones. The rest of the analysis follows the approach summarized in (7.51)-(7.57).
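For concreteness, the design vectors just described can be generated mechanically; the sketch below builds the full-model and reduced-model design matrices for a hypothetical case with I = 3 groups and group sizes 5, 4, and 5 (the values are illustrative only).
# A sketch of the full and reduced design matrices for I = 3 groups; the group
# sizes below are hypothetical.
I   = 3
N.i = c(5, 4, 5)
zi  = rbind(diag(1, I - 1), rep(-1, I - 1))        # one row of group codes per group
Z.full    = cbind(1, zi)[rep(1:I, times = N.i), ]  # rows z_ij' of the full model
Z.reduced = matrix(1, sum(N.i), 1)                 # reduced model: a column of ones
head(Z.full)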

In this particular case, the power components in Table 7.3 (before smoothing) simplify to
SSR(\omega_k) = \sum_{i=1}^{I} \sum_{j=1}^{N_i} |Y_{i\cdot}(\omega_k) - Y_{\cdot\cdot}(\omega_k)|^2   (7.81)
and
SSE(\omega_k) = \sum_{i=1}^{I} \sum_{j=1}^{N_i} |Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k)|^2,   (7.82)
which are analogous to the usual sums of squares in the analysis of variance. Note that a dot (\cdot) stands for a mean, taken over the appropriate subscript, so the regression power component SSR(\omega_k) is basically the power in the residuals of the group means about the overall mean, and the error power component SSE(\omega_k) reflects the departures of the individual series from their group means. Smoothing each component over L frequencies leads to the usual F-statistic (7.63) with 2L(I-1) and 2L(\sum_i N_i - I) degrees of freedom at each frequency of interest.

Example 7.6 Means Test for the fMRI Data
Figure 7.1 showed the mean responses of subjects to various levels of periodic stimulation while awake and while under anesthesia, as collected in a pain perception experiment of Antognini et al. (1997). Three types of periodic stimuli were presented to awake and anesthetized subjects, namely brushing, heat, and shock. The periodicity was introduced by applying the stimuli in on-off sequences lasting 32 seconds each, and the sampling rate was one point every two seconds. The blood oxygenation level (BOLD) signal intensity (Ogawa et al., 1990) was measured at nine locations in the brain. Areas of activation were determined using a technique first described by Bandettini et al. (1993). The specific brain locations where the signal was measured were Cortex 1: Primary Somatosensory, Contralateral; Cortex 2: Primary Somatosensory, Ipsilateral; Cortex 3: Secondary Somatosensory, Contralateral; Cortex 4: Secondary Somatosensory, Ipsilateral; Caudate; Thalamus 1: Contralateral; Thalamus 2: Ipsilateral; Cerebellum 1: Contralateral; and Cerebellum 2: Ipsilateral. Figure 7.1 shows the mean response of subjects at Cortex 1 for each of the six treatment combinations, 1: Awake-Brush (5 subjects), 2: Awake-Heat (4 subjects), 3: Awake-Shock (5 subjects), 4: Low-Brush (3 subjects), 5: Low-Heat (5 subjects), and 6: Low-Shock (4 subjects). The objective of this first analysis is to test equality of these six group means, paying special attention to the 64-second period band (1/64 cycles per second) expected from the periodic driving stimuli. Because a test of equality is needed at each of the nine brain locations, we took alpha = .001 to control the overall error rate. Figure 7.8 shows the F-statistics, computed from (7.63) with L = 3, and we see substantial signals for the four cortex locations and for the second cerebellum trace, but the effects are nonsignificant in the caudate and thalamus regions. Hence, we will retain the four cortex locations and the second cerebellum location for further analysis. The R code for this example is as follows.
n = 128                # length of series
n.freq = 1 + n/2       # number of frequencies
Fr = (0:(n.freq-1))/n  # the frequencies
N = c(5,4,5,3,5,4)     # number of series for each cell
n.subject = sum(N)     # number of subjects (26)
n.trt = 6              # number of treatments
L = 3                  # for smoothing
num.df = 2*L*(n.trt-1) # df for F test

Fig. 7.8. Frequency-dependent equality of means tests for fMRI data at 9 brain locations; L = 3 and critical value F_{.001}(30, 120) = 2.26. (Each panel plots the F statistic against frequency for one location.)

den.df = 2*L*(n.subject-n.trt)
# Design Matrix (Z):
Z1 = outer(rep(1,N[1]), c(1,1,0,0,0,0))
Z2 = outer(rep(1,N[2]), c(1,0,1,0,0,0))
Z3 = outer(rep(1,N[3]), c(1,0,0,1,0,0))
Z4 = outer(rep(1,N[4]), c(1,0,0,0,1,0))
Z5 = outer(rep(1,N[5]), c(1,0,0,0,0,1))
Z6 = outer(rep(1,N[6]), c(1,-1,-1,-1,-1,-1))
Z = rbind(Z1, Z2, Z3, Z4, Z5, Z6)
ZZ = t(Z)%*%Z
SSEF <- rep(NA, n) -> SSER
HatF = Z%*%solve(ZZ, t(Z))
HatR = Z[,1]%*%t(Z[,1])/ZZ[1,1]
par(mfrow=c(3,3), mar=c(3.5,4,0,0), oma=c(0,0,2,2), mgp=c(1.6,.6,0))
loc.name = c("Cortex 1","Cortex 2","Cortex 3","Cortex 4","Caudate","Thalamus 1","Thalamus 2","Cerebellum 1","Cerebellum 2")
for(Loc in 1:9) {
  i = n.trt*(Loc-1)
  Y = cbind(fmri[[i+1]], fmri[[i+2]], fmri[[i+3]], fmri[[i+4]], fmri[[i+5]], fmri[[i+6]])
  Y = mvfft(spec.taper(Y, p=.5))/sqrt(n)  # Y is now 26 x 128 FFTs
  Y = t(Y)
  # Calculation of Error Spectra
  for (k in 1:n) {
    SSY = Re(Conj(t(Y[,k]))%*%Y[,k])
    SSReg = Re(Conj(t(Y[,k]))%*%HatF%*%Y[,k])
    SSEF[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%HatR%*%Y[,k])

    SSER[k] = SSY - SSReg
  }
  # Smooth
  sSSEF = filter(SSEF, rep(1/L, L), circular = TRUE)
  sSSER = filter(SSER, rep(1/L, L), circular = TRUE)
  eF = (den.df/num.df)*(sSSER-sSSEF)/sSSEF
  plot(Fr, eF[1:n.freq], type="l", xlab="Frequency", ylab="F Statistic", ylim=c(0,7))
  abline(h=qf(.999, num.df, den.df), lty=2)
  text(.25, 6.5, loc.name[Loc], cex=1.2)
}

An Analysis of Variance Model

The arrangement of treatments for the fMRI data in Figure 7.1 suggests more information might be available than was obtained from the simple equality of means test. Separate effects caused by state of consciousness as well as by the separate treatments brush, heat, and shock might exist. The reduced signal present in the low shock mean suggests a possible interaction between the treatments and the level of consciousness. The arrangement in the classical two-way table suggests looking at the analog of the two-factor analysis of variance as a function of frequency. In this case, we would obtain a different version of the regression model (7.79), of the form
y_{ijkt} = \mu_t + \alpha_{it} + \beta_{jt} + \gamma_{ijt} + v_{ijkt}   (7.83)
for the k-th individual undergoing the i-th level of some factor A and the j-th level of some other factor B, with i = 1, \ldots, I, j = 1, \ldots, J, k = 1, \ldots, n_{ij}. The number of individuals in each cell can be different, as for the fMRI data in the next example. In the above model, we assume the response can be modeled as the sum of a mean, \mu_t, a row effect (type of stimulus), \alpha_{it}, a column effect (level of consciousness), \beta_{jt}, and an interaction, \gamma_{ijt}, with the usual restrictions
\sum_i \alpha_{it} = \sum_j \beta_{jt} = \sum_i \gamma_{ijt} = \sum_j \gamma_{ijt} = 0
required for a full rank design matrix Z in the overall regression model (7.78). If the number of observations in each cell were the same, the usual simple analogous version of the power components (7.81) and (7.82) would exist for testing various hypotheses. In the case of (7.83), we are interested in testing hypotheses obtained by dropping one set of terms at a time out of (7.83), so an A factor (testing \alpha_{it} = 0), a B factor (testing \beta_{jt} = 0), and an interaction term (testing \gamma_{ijt} = 0) will appear as components in the analysis of power. Because of the unequal numbers of observations in each cell, we often put the model in the form of the regression model (7.76)-(7.78).
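Under these sum-to-zero restrictions the full-rank cell coding can be generated automatically rather than typed by hand; the sketch below reproduces the six cell rows used for the fMRI analysis (compare Table 7.4 and the design matrix in the Example 7.7 code below).
# A sketch: generating the full-rank cell coding of (7.83) from sum-to-zero
# contrasts; the rows match Table 7.4 and the Z1-Z6 vectors in Example 7.7.
stim  = factor(rep(c("Brush", "Heat", "Shock"), times = 2))
consc = factor(rep(c("Awake", "Low"), each = 3))
Zcell = model.matrix(~ stim*consc,
          contrasts.arg = list(stim = "contr.sum", consc = "contr.sum"))
Zcell   # columns: mu, alpha_1, alpha_2, beta_1, gamma_11, gamma_21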

Example 7.7 Analysis of Power Tests for the fMRI Series
For the fMRI data given as the means in Figure 7.1, a model of the form (7.83) is plausible and will yield more detailed information than the simple equality of means test described earlier. The results of that test, shown in Figure 7.8, were that the means were different for the four cortex locations and for the second cerebellum location. We may examine these differences further by testing whether the mean differences are due to the nature of the stimulus, to the consciousness level, or perhaps to an interaction between the two factors. Unequal numbers of observations exist in the cells that contributed the means in Figure 7.1. For the regression vector
(\mu_t, \alpha_{1t}, \alpha_{2t}, \beta_{1t}, \gamma_{11t}, \gamma_{21t})',
the rows of the design matrix are as specified in Table 7.4. Note the restrictions given above for the parameters.
The results of testing the three hypotheses are shown in Figure 7.9 for the four cortex locations and the cerebellum, the components that showed some significant differences in the means in Figure 7.8. Again, the regression power components were smoothed over L = 3 frequencies. Appealing to the ANOPOW results summarized in Table 7.3 for each of the subhypotheses, q_2 = 1 when the consciousness effect is dropped and q_2 = 2 when either the stimulus effect or the interaction terms are dropped. Hence, 2Lq_2 = 6 or 12 for the two cases, with N = \sum_{ij} n_{ij} = 26 total observations. Here, the state of consciousness (Awake, Sedated) has the major effect at the signal frequency. The level of stimulus was less significant at the signal frequency. A significant interaction occurred, however, at the ipsilateral component of the primary somatosensory cortex location.

Fig. 7.9. Analysis of power for fMRI data at five locations; L = 3 and critical values F_{.001}(12, 120) = 3.02 for the stimulus and interaction effects and F_{.001}(6, 120) = 4.04 for the consciousness effect. (Rows of panels correspond to Cortex 1-4 and Cerebellum 2; columns to Stimulus, Consciousness, and Interaction.)

The R code for this example is similar to that of Example 7.6.
n = 128
n.freq = 1 + n/2
Fr = (0:(n.freq-1))/n
nFr = 1:(n.freq/2)
N = c(5,4,5,3,5,4)
n.subject = sum(N)
n.para = 6                # number of parameters
L = 3                     # for smoothing
df.stm = 2*L*(3-1)        # stimulus (3 levels: Brush, Heat, Shock)
df.con = 2*L*(2-1)        # conscious (2 levels: Awake, Sedated)
df.int = 2*L*(3-1)*(2-1)  # interaction
den.df = 2*L*(n.subject-n.para)  # df for full model
# Design Matrix:          mu a1 a2  b g1 g2
Z1 = outer(rep(1,N[1]), c(1,  1,  0,  1,  1,  0))
Z2 = outer(rep(1,N[2]), c(1,  0,  1,  1,  0,  1))
Z3 = outer(rep(1,N[3]), c(1, -1, -1,  1, -1, -1))
Z4 = outer(rep(1,N[4]), c(1,  1,  0, -1, -1,  0))
Z5 = outer(rep(1,N[5]), c(1,  0,  1, -1,  0, -1))
Z6 = outer(rep(1,N[6]), c(1, -1, -1, -1,  1,  1))
Z = rbind(Z1, Z2, Z3, Z4, Z5, Z6)
ZZ = t(Z)%*%Z
rep(NA, n) -> SSEF -> SSE.stm -> SSE.con -> SSE.int
HatF = Z%*%solve(ZZ, t(Z))
Hat.stm = Z[,-(2:3)]%*%solve(ZZ[-(2:3),-(2:3)], t(Z[,-(2:3)]))
Hat.con = Z[,-4]%*%solve(ZZ[-4,-4], t(Z[,-4]))
Hat.int = Z[,-(5:6)]%*%solve(ZZ[-(5:6),-(5:6)], t(Z[,-(5:6)]))
par(mfrow=c(5,3), mar=c(3.5,4,0,0), oma=c(0,0,2,2), mgp=c(1.6,.6,0))
loc.name = c("Cortex 1","Cortex 2","Cortex 3","Cortex 4","Caudate","Thalamus 1","Thalamus 2","Cerebellum 1","Cerebellum 2")
for(Loc in c(1:4,9)) {   # only locations 1 to 4 and 9 are used
  i = 6*(Loc-1)
  Y = cbind(fmri[[i+1]], fmri[[i+2]], fmri[[i+3]], fmri[[i+4]], fmri[[i+5]], fmri[[i+6]])
  Y = mvfft(spec.taper(Y, p=.5))/sqrt(n); Y = t(Y)
  for (k in 1:n) {
    SSY = Re(Conj(t(Y[,k]))%*%Y[,k])
    SSReg = Re(Conj(t(Y[,k]))%*%HatF%*%Y[,k])
    SSEF[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%Hat.stm%*%Y[,k])
    SSE.stm[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%Hat.con%*%Y[,k])
    SSE.con[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%Hat.int%*%Y[,k])
    SSE.int[k] = SSY - SSReg
  }
  # Smooth
  sSSEF = filter(SSEF, rep(1/L, L), circular = TRUE)
  sSSE.stm = filter(SSE.stm, rep(1/L, L), circular = TRUE)
  sSSE.con = filter(SSE.con, rep(1/L, L), circular = TRUE)
  sSSE.int = filter(SSE.int, rep(1/L, L), circular = TRUE)
  eF.stm = (den.df/df.stm)*(sSSE.stm-sSSEF)/sSSEF
  eF.con = (den.df/df.con)*(sSSE.con-sSSEF)/sSSEF
  eF.int = (den.df/df.int)*(sSSE.int-sSSEF)/sSSEF
  plot(Fr[nFr], eF.stm[nFr], type="l", xlab="Frequency", ylab="F Statistic", ylim=c(0,12))
  abline(h=qf(.999, df.stm, den.df), lty=2)

  if(Loc==1) mtext("Stimulus", side=3, line=.3, cex=1)
  mtext(loc.name[Loc], side=2, line=3, cex=.9)
  plot(Fr[nFr], eF.con[nFr], type="l", xlab="Frequency", ylab="F Statistic", ylim=c(0,12))
  abline(h=qf(.999, df.con, den.df), lty=2)
  if(Loc==1) mtext("Consciousness", side=3, line=.3, cex=1)
  plot(Fr[nFr], eF.int[nFr], type="l", xlab="Frequency", ylab="F Statistic", ylim=c(0,12))
  abline(h=qf(.999, df.int, den.df), lty=2)
  if(Loc==1) mtext("Interaction", side=3, line=.3, cex=1)
}

Table 7.4. Rows of the Design Matrix for Example 7.7
(number of observations per cell in parentheses)
            Awake                       Low Anesthesia
Brush   1  1  0  1  1  0   (5)      1  1  0  -1  -1   0   (3)
Heat    1  0  1  1  0  1   (4)      1  0  1  -1   0  -1   (5)
Shock   1 -1 -1  1 -1 -1   (5)      1 -1 -1  -1   1   1   (4)

Simultaneous Inference

In the previous examples involving the fMRI data, it would be helpful to focus on the components that contributed most to the rejection of the equal means hypothesis. One way to accomplish this is to develop a test for the significance of an arbitrary linear compound of the form
\Psi(\omega_k) = A^*(\omega_k) B(\omega_k),   (7.84)
where the components of the vector A(\omega_k) = (A_1(\omega_k), A_2(\omega_k), \ldots, A_q(\omega_k))' are chosen in such a way as to isolate particular linear functions of parameters in the regression vector B(\omega_k) of the regression model (7.78). This argument suggests developing a test of the hypothesis \Psi(\omega_k) = 0 for all possible values of the linear coefficients in the compound (7.84), as is done in the conventional analysis of variance approach (see, for example, Scheffé, 1959).
Recalling the material involving the regression models of the form (7.50), the linear compound (7.84) can be estimated by
\hat\Psi(\omega_k) = A^*(\omega_k) \hat B(\omega_k),   (7.85)
where \hat B(\omega_k) is the estimated vector of regression coefficients given by (7.51), which is independent of the error spectrum s^2_{y\cdot z}(\omega_k) in (7.53). It is possible to show that the maximum of the ratio
F(A) = \frac{(N-q)\,|\hat\Psi(\omega_k) - \Psi(\omega_k)|^2}{q\, s^2_{y\cdot z}(\omega_k)\, Q(A)},   (7.86)
where
Q(A) = A^*(\omega_k) S_z^{-1}(\omega_k) A(\omega_k),   (7.87)
is bounded by a statistic that has an F-distribution with 2q and 2(N - q) degrees of freedom.
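The sketch below evaluates (7.85)-(7.87) at a single frequency for a given contrast; the inputs (the estimated coefficient vector, error power, and regressor spectral matrix) are assumed to have been computed elsewhere, and the true compound is taken to be zero.
# A sketch of the compound F ratio (7.86) at one frequency; B.hat, s2, and Sz are
# assumed to be available from the regression fit (7.51)-(7.53).
F.compound = function(A, B.hat, s2, Sz, N, q) {
  Psi.hat = drop(Conj(t(A))%*%B.hat)              # estimated compound, (7.85)
  Q       = Re(drop(Conj(t(A))%*%solve(Sz, A)))   # Q(A) in (7.87)
  (N - q)*Mod(Psi.hat)^2/(q*Re(s2)*Q)             # (7.86) with Psi(omega_k) = 0
}
# compare with the bound qf(1 - alpha, 2*q, 2*(N - q))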

Testing the hypothesis that the compound has a particular value, usually \Psi(\omega_k) = 0, then proceeds naturally by comparing the statistic (7.86), evaluated at the hypothesized value, with the \alpha level point on an F_{2q, 2(N-q)} distribution. We can choose an infinite number of compounds of the form (7.84) and the test will still be valid at level \alpha. As before, arguing that the error spectrum is relatively constant over a band enables us to smooth the numerator and denominator of (7.86) separately over L frequencies, so the distribution involving the smoothed components is F_{2Lq, 2L(N-q)}.

Example 7.8 Simultaneous Inference for the fMRI Series
As an example, consider the previous tests for significance of the fMRI factors, in which we have indicated that the primary effects are among the stimuli but have not investigated which of the stimuli, heat, brushing, or shock, had the most effect. To analyze this further, consider the means model (7.79) and a 1 x 6 contrast vector of the form
\hat\Psi(\omega_k) = A^*(\omega_k) \hat B(\omega_k) = \sum_{i=1}^{6} A_i^*(\omega_k) Y_{i\cdot}(\omega_k),   (7.88)
where the means are easily shown to be the regression coefficients in this particular case. The means are ordered by columns; the first three means are the three levels of stimuli for the awake state, and the last three means are the levels for the anesthetized state. In this special case, the denominator terms are
Q = \sum_{i=1}^{6} \frac{|A_i(\omega_k)|^2}{N_i},   (7.89)
with SSE(\omega_k) available in (7.82). In order to evaluate the effect of a particular stimulus, say brushing, over the two levels of consciousness, we may take A_1(\omega_k) = A_4(\omega_k) = 1 for the two brush levels and zero otherwise. From Figure 7.10, we see that, at the first and third cortex locations, brush and heat are both significant, whereas the fourth cortex shows only brush and the second cerebellum shows only heat. Shock appears to be transmitted relatively weakly, when averaged over the awake and mildly anesthetized states.
The R code for this example is as follows.
n = 128; n.freq = 1 + n/2
Fr = (0:(n.freq-1))/n; nFr = 1:(n.freq/2)
N = c(5,4,5,3,5,4); n.subject = sum(N); L = 3
# Design Matrix
Z1 = outer(rep(1,N[1]), c(1,0,0,0,0,0))
Z2 = outer(rep(1,N[2]), c(0,1,0,0,0,0))
Z3 = outer(rep(1,N[3]), c(0,0,1,0,0,0))
Z4 = outer(rep(1,N[4]), c(0,0,0,1,0,0))
Z5 = outer(rep(1,N[5]), c(0,0,0,0,1,0))
Z6 = outer(rep(1,N[6]), c(0,0,0,0,0,1))
Z = rbind(Z1, Z2, Z3, Z4, Z5, Z6); ZZ = t(Z)%*%Z
# Contrasts: 6 by 3
A = rbind(diag(1,3), diag(1,3))
nq = nrow(A); num.df = 2*L*nq; den.df = 2*L*(n.subject-nq)

HatF = Z%*%solve(ZZ, t(Z))   # full model
rep(NA, n) -> SSEF -> SSER; eF = matrix(0, n, 3)
par(mfrow=c(5,3), mar=c(3.5,4,0,0), oma=c(0,0,2,2), mgp=c(1.6,.6,0))
loc.name = c("Cortex 1","Cortex 2","Cortex 3","Cortex 4","Caudate","Thalamus 1","Thalamus 2","Cerebellum 1","Cerebellum 2")
cond.name = c("Brush", "Heat", "Shock")
for(Loc in c(1:4,9)) {
  i = 6*(Loc-1)
  Y = cbind(fmri[[i+1]], fmri[[i+2]], fmri[[i+3]], fmri[[i+4]], fmri[[i+5]], fmri[[i+6]])
  Y = mvfft(spec.taper(Y, p=.5))/sqrt(n); Y = t(Y)
  for (cond in 1:3){
    Q = t(A[,cond])%*%solve(ZZ, A[,cond])
    HR = A[,cond]%*%solve(ZZ, t(Z))
    for (k in 1:n){
      SSY = Re(Conj(t(Y[,k]))%*%Y[,k])
      SSReg = Re(Conj(t(Y[,k]))%*%HatF%*%Y[,k])
      SSEF[k] = (SSY-SSReg)*Q
      SSReg = HR%*%Y[,k]
      SSER[k] = Re(SSReg*Conj(SSReg))
    }
    # Smooth
    sSSEF = filter(SSEF, rep(1/L, L), circular = TRUE)
    sSSER = filter(SSER, rep(1/L, L), circular = TRUE)
    eF[,cond] = (den.df/num.df)*(sSSER/sSSEF)
  }
  plot(Fr[nFr], eF[nFr,1], type="l", xlab="Frequency", ylab="F Statistic", ylim=c(0,5))
  abline(h=qf(.999, num.df, den.df), lty=2)
  if(Loc==1) mtext("Brush", side=3, line=.3, cex=1)
  mtext(loc.name[Loc], side=2, line=3, cex=.9)
  plot(Fr[nFr], eF[nFr,2], type="l", xlab="Frequency", ylab="F Statistic", ylim=c(0,5))
  abline(h=qf(.999, num.df, den.df), lty=2)
  if(Loc==1) mtext("Heat", side=3, line=.3, cex=1)
  plot(Fr[nFr], eF[nFr,3], type="l", xlab="Frequency", ylab="F Statistic", ylim=c(0,5))
  abline(h=qf(.999, num.df, den.df), lty=2)
  if(Loc==1) mtext("Shock", side=3, line=.3, cex=1)
}

Multivariate Tests

Although it is possible to develop multivariate regression along lines analogous to the usual real-valued case, we will only look at tests involving equality of group means and of spectral matrices, because these tests appear to be used most often in applications. For these results, consider the p-variate time series y_{ijt} = (y_{ijt1}, \ldots, y_{ijtp})' to have arisen from observations on j = 1, \ldots, N_i individuals in group i, all having mean \mu_{it} and stationary autocovariance matrix \Gamma_i(h). Denote the DFTs of the group mean vectors as Y_{i\cdot}(\omega_k) and the p x p spectral matrices as \hat f_i(\omega_k) for the i = 1, 2, \ldots, I groups. Assume the same general properties as for the vector series considered in Section 7.3. In the multivariate case, we obtain the analogous versions of (7.81) and (7.82) as the between cross-power and within cross-power matrices

Fig. 7.10. Power in simultaneous linear compounds at five locations, enhancing brush, heat, and shock effects; L = 3, F_{.001}(36, 120) = 2.16. (Rows of panels correspond to Cortex 1-4 and Cerebellum 2; columns to Brush, Heat, and Shock.)

SPR(\omega_k) = \sum_{i=1}^{I} \sum_{j=1}^{N_i} \big(Y_{i\cdot}(\omega_k) - Y_{\cdot\cdot}(\omega_k)\big)\big(Y_{i\cdot}(\omega_k) - Y_{\cdot\cdot}(\omega_k)\big)^*   (7.90)
and
SPE(\omega_k) = \sum_{i=1}^{I} \sum_{j=1}^{N_i} \big(Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k)\big)\big(Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k)\big)^*.   (7.91)
The equality of means test is rejected using the fact that the likelihood ratio test yields a monotone function of
\Lambda(\omega_k) = \frac{|SPE(\omega_k)|}{|SPE(\omega_k) + SPR(\omega_k)|}.   (7.92)
Khatri (1965) and Hannan (1970) give the approximate distribution of the statistic
\chi^2_{2(I-1)p} = -2\Big(\sum_i N_i - I - p - 1\Big) \log \Lambda(\omega_k)   (7.93)
as chi-squared with 2(I-1)p degrees of freedom when the group means are equal.
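A direct implementation of (7.92)-(7.93) at a single frequency is sketched below; the between and within cross-power matrices are assumed to have been accumulated already, and the determinant of a Hermitian matrix is computed from its eigenvalues to avoid relying on det() for complex matrices.
# A sketch of the equality-of-means statistic (7.92)-(7.93) at one frequency;
# SPR and SPE are p x p Hermitian cross-power matrices computed elsewhere.
eq.means.chisq = function(SPR, SPE, N.i, p) {
  det.h  = function(A) Re(prod(eigen(A, only.values = TRUE)$values))
  Lambda = det.h(SPE)/det.h(SPE + SPR)              # (7.92)
  I      = length(N.i)
  chisq  = -2*(sum(N.i) - I - p - 1)*log(Lambda)    # (7.93)
  df     = 2*(I - 1)*p
  c(chisq = chisq, p.value = pchisq(chisq, df, lower.tail = FALSE))
}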

The case of I = 2 groups reduces to Hotelling's T^2, as has been shown by Giri (1965), where
T^2 = \frac{N_1 N_2}{N_1 + N_2} \big[Y_{1\cdot}(\omega_k) - Y_{2\cdot}(\omega_k)\big]^* \hat f_v^{-1}(\omega_k) \big[Y_{1\cdot}(\omega_k) - Y_{2\cdot}(\omega_k)\big],   (7.94)
where
\hat f_v(\omega_k) = \frac{SPE(\omega_k)}{\sum_i N_i - I}   (7.95)
is the pooled error spectrum given in (7.91), with I = 2. The test statistic, in this case, is
F_{2p,\, 2(N_1+N_2-p-1)} = \frac{(N_1 + N_2 - p - 1)}{(N_1 + N_2 - 2)\, p}\, T^2,   (7.96)
which was shown by Giri (1965) to have the indicated limiting F-distribution with 2p and 2(N_1 + N_2 - p - 1) degrees of freedom when the means are the same. The classical t-test for inequality of two univariate means is just (7.95) and (7.96) with p = 1.
Testing equality of the spectral matrices is also of interest, not only for discrimination and pattern recognition, as considered in the next section, but also as a test of whether the equality of means test, which assumes equal spectral matrices, is valid. The test evolves from the likelihood ratio criterion, which compares the single group spectral matrices
\hat f_i(\omega_k) = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} \big(Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k)\big)\big(Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k)\big)^*   (7.97)
with the pooled spectral matrix (7.95). A modification of the likelihood ratio test, which incorporates the degrees of freedom M_i = N_i - 1 and M = \sum_i M_i rather than the sample sizes into the likelihood ratio statistic, uses
L'(\omega_k) = \frac{M^{pM} \prod_{i=1}^{I} |M_i \hat f_i(\omega_k)|^{M_i}}{\prod_{i=1}^{I} M_i^{p M_i}\, |M \hat f_v(\omega_k)|^{M}}.   (7.98)
Krishnaiah et al. (1976) have given the moments of L'(\omega_k) and have calculated 95% critical points for p = 3, 4 using a Pearson Type I approximation. For reasonably large samples involving smoothed spectral estimators, the approximation involving the first term of the usual chi-squared series will suffice, and Shumway (1982) has given
\chi^2_{(I-1)p^2} = -2r \log L'(\omega_k),   (7.99)
where
r = 1 - \frac{(p+1)(p-1)}{6p(I-1)} \Big( \sum_i M_i^{-1} - M^{-1} \Big),   (7.100)
with an approximate chi-squared distribution with (I-1)p^2 degrees of freedom when the spectral matrices are equal. Introduction of smoothing over L frequencies leads to replacing M_i and M by L M_i and L M in the equations above.

Of course, it is often of great interest to use the above result for testing equality of two univariate spectra, and it is obvious from the material in Chapter 4 that
F_{2LM_1,\, 2LM_2} = \frac{\hat f_1(\omega)}{\hat f_2(\omega)}   (7.101)
will have the requisite F-distribution with 2LM_1 and 2LM_2 degrees of freedom when the spectra are smoothed over L frequencies.

Example 7.9 Equality of Means and Spectral Matrices
An interesting problem arises when attempting to develop a methodology for discriminating between waveforms originating from explosions and those that came from the more commonly occurring earthquakes. Figure 7.2 shows a small subset of a larger population of bivariate series consisting of two phases from each of eight earthquakes and eight explosions. If the large-sample approximations to normality hold for the DFTs of these series, it is of interest to know whether the differences between the two classes are better represented by the mean functions or by the spectral matrices. The tests described above can be applied to look at these two questions.
The upper left panel of Figure 7.11 shows the test statistic (7.96), with the straight line denoting the critical level for alpha = .001, i.e., F_{.001}(4, 26) = 7.36, for equal means using L = 1. The test statistic remains well below its critical value at all frequencies, implying that the means of the two classes of series are not significantly different. Checking Figure 7.2 shows little reason to suspect that either the earthquakes or explosions have a nonzero mean signal.
Checking the equality of the spectra and the spectral matrices, however, leads to a different conclusion. Some smoothing (L = 21) is useful here, and univariate tests on both the P and S components using (7.101) and N_1 = N_2 = 8 lead to strong rejections of the equal spectra hypotheses. The rejection seems stronger for the S component, and we might tentatively identify that component as being dominant. Testing equality of the spectral matrices using (7.99) and \chi^2_{.001}(4) = 18.47 shows a similarly strong rejection of the equality of spectral matrices. We use these results to suggest optimal discriminant functions based on spectral differences in the next section.
The R code for this example is as follows. We make use of the recycling feature of R and the fact that the data are bivariate to produce simple code specific to this problem in order to avoid having to use multiple arrays.
P = 1:1024; S = P+1024; N = 8; n = 1024; p.dim = 2; m = 10; L = 2*m+1
eq.P = as.ts(eqexp[P,1:8]);  eq.S = as.ts(eqexp[S,1:8])
eq.m = cbind(rowMeans(eq.P), rowMeans(eq.S))
ex.P = as.ts(eqexp[P,9:16]); ex.S = as.ts(eqexp[S,9:16])
ex.m = cbind(rowMeans(ex.P), rowMeans(ex.S))
m.diff = mvfft(eq.m - ex.m)/sqrt(n)
eq.Pf = mvfft(eq.P-eq.m[,1])/sqrt(n)
eq.Sf = mvfft(eq.S-eq.m[,2])/sqrt(n)
ex.Pf = mvfft(ex.P-ex.m[,1])/sqrt(n)
ex.Sf = mvfft(ex.S-ex.m[,2])/sqrt(n)
fv11 = (rowSums(eq.Pf*Conj(eq.Pf))+rowSums(ex.Pf*Conj(ex.Pf)))/(2*(N-1))
fv12 = (rowSums(eq.Pf*Conj(eq.Sf))+rowSums(ex.Pf*Conj(ex.Sf)))/(2*(N-1))
fv22 = (rowSums(eq.Sf*Conj(eq.Sf))+rowSums(ex.Sf*Conj(ex.Sf)))/(2*(N-1))
fv21 = Conj(fv12)
# Equal Means
T2 = rep(NA, 512)
for (k in 1:512){

Fig. 7.11. Tests for equality of means, spectra, and spectral matrices for the earthquake and explosion data; p = 2, L = 21, n = 1024 points at 40 points per second. (Panels: Equal Means, Equal P-Spectra, Equal S-Spectra, and Equal Spectral Matrices, each plotted against frequency in Hz.)

  fvk = matrix(c(fv11[k], fv21[k], fv12[k], fv22[k]), 2, 2)
  dk = as.matrix(m.diff[k,])
  T2[k] = Re((N/2)*Conj(t(dk))%*%solve(fvk,dk))
}
eF = T2*(2*N-p.dim-1)/(2*p.dim*(N-1))
par(mfrow=c(2,2), mar=c(3,3,2,1), mgp=c(1.6,.6,0), cex.main=1.1)
freq = 40*(0:511)/n  # Hz
plot(freq, eF, type="l", xlab="Frequency (Hz)", ylab="F Statistic", main="Equal Means")
abline(h = qf(.999, 2*p.dim, 2*(2*N-p.dim-1)))
# Equal P
kd = kernel("daniell", m)
u = Re(rowSums(eq.Pf*Conj(eq.Pf))/(N-1))
feq.P = kernapply(u, kd, circular=TRUE)
u = Re(rowSums(ex.Pf*Conj(ex.Pf))/(N-1))
fex.P = kernapply(u, kd, circular=TRUE)
plot(freq, feq.P[1:512]/fex.P[1:512], type="l", xlab="Frequency (Hz)", ylab="F Statistic", main="Equal P-Spectra")
abline(h=qf(.999, 2*L*(N-1), 2*L*(N-1)))
# Equal S
u = Re(rowSums(eq.Sf*Conj(eq.Sf))/(N-1))
feq.S = kernapply(u, kd, circular=TRUE)
u = Re(rowSums(ex.Sf*Conj(ex.Sf))/(N-1))
fex.S = kernapply(u, kd, circular=TRUE)
plot(freq, feq.S[1:512]/fex.S[1:512], type="l", xlab="Frequency (Hz)", ylab="F Statistic", main="Equal S-Spectra")
abline(h=qf(.999, 2*L*(N-1), 2*L*(N-1)))
# Equal Spectral Matrices
u = rowSums(eq.Pf*Conj(eq.Sf))/(N-1)

feq.PS = kernapply(u, kd, circular=TRUE)
u = rowSums(ex.Pf*Conj(ex.Sf))/(N-1)
fex.PS = kernapply(u, kd, circular=TRUE)
fv11 = kernapply(fv11, kd, circular=TRUE)
fv22 = kernapply(fv22, kd, circular=TRUE)
fv12 = kernapply(fv12, kd, circular=TRUE)
Mi = L*(N-1); M = 2*Mi
TS = rep(NA, 512)
for (k in 1:512){
  det.feq.k = Re(feq.P[k]*feq.S[k] - feq.PS[k]*Conj(feq.PS[k]))
  det.fex.k = Re(fex.P[k]*fex.S[k] - fex.PS[k]*Conj(fex.PS[k]))
  det.fv.k = Re(fv11[k]*fv22[k] - fv12[k]*Conj(fv12[k]))
  log.n1 = log(M)*(M*p.dim); log.d1 = log(Mi)*(2*Mi*p.dim)
  log.n2 = log(Mi)*2 + log(det.feq.k)*Mi + log(det.fex.k)*Mi
  log.d2 = (log(M)+log(det.fv.k))*M
  r = 1 - ((p.dim+1)*(p.dim-1)/(6*p.dim*(2-1)))*(2/Mi - 1/M)
  TS[k] = -2*r*(log.n1+log.n2-log.d1-log.d2)
}
plot(freq, TS, type="l", xlab="Frequency (Hz)", ylab="Chi-Sq Statistic", main="Equal Spectral Matrices")
abline(h = qchisq(.9999, p.dim^2))

7.7 Discriminant and Cluster Analysis

The extension of classical pattern-recognition techniques to experimental time series is a problem of great practical interest. A series of observations indexed in time often produces a pattern that may form a basis for discriminating between different classes of events. As an example, consider Figure 7.2, which shows regional (100-2000 km) recordings of several typical Scandinavian earthquakes and mining explosions measured by stations in Scandinavia. A listing of the events is given in Kakizawa et al. (1998). The problem of discriminating between mining explosions and earthquakes is a reasonable proxy for the problem of discriminating between nuclear explosions and earthquakes, and this latter problem is of critical importance for monitoring a comprehensive test-ban treaty.
Time series classification problems are not restricted to geophysical applications, but occur under many and varied circumstances in other fields. Traditionally, the detection of a signal embedded in a noise series has been analyzed in the engineering literature by statistical pattern recognition techniques (see Problem 7.10 and Problem 7.11).
The historical approaches to the problem of discriminating among different classes of time series can be divided into two distinct categories. The optimality approach, as found in the engineering and statistics literature, makes specific Gaussian assumptions about the probability density functions of the separate groups and then develops solutions that satisfy well-defined minimum error criteria. Typically, in the time series case, we might assume the difference between classes is expressed through differences in the theoretical mean and covariance functions, and use likelihood methods to develop an optimal classification function. A second class of techniques, which might be described as a feature extraction approach, proceeds more heuristically by looking at quantities that tend to be good visual discriminators for well-separated populations and have some basis in physical theory or intuition. Less attention is paid to finding functions that are approximations to some well-defined optimality criterion.

As in the case of regression, both time domain and frequency domain approaches to discrimination exist. For relatively short univariate series, a time domain approach that follows conventional multivariate discriminant analysis, as described in standard multivariate texts such as Anderson (1984) or Johnson and Wichern (1992), may be preferable. We might even characterize differences by the autocovariance functions generated by different ARMA or state-space models. For longer multivariate time series that can be regarded as stationary after the common mean has been subtracted, the frequency domain approach will be easier computationally, because the np-dimensional vector in the time domain, represented here as x = (x_1', x_2', \ldots, x_n')', with x_t = (x_{t1}, \ldots, x_{tp})', will be reduced to separate computations made on the p-dimensional DFTs. This happens because of the approximate independence of the DFTs, X(\omega_k), 0 \le \omega_k \le 1, a property that we have often used in preceding chapters.
Finally, the grouping properties of measures like the discrimination information and likelihood-based statistics can be used to develop measures of disparity for clustering multivariate time series. In this section, we define a measure of disparity between two multivariate time series in terms of the spectral matrices of the two processes, and then apply hierarchical clustering and partitioning techniques to identify natural groupings within the bivariate earthquake and explosion populations.

The General Discrimination Problem

The general problem of classifying a vector time series x occurs in the following way. We observe a time series x known to belong to one of g populations, denoted by \Pi_1, \Pi_2, \ldots, \Pi_g. The general problem is to assign or classify this observation into one of the g groups in some optimal fashion. An example might be the g = 2 populations of earthquakes and explosions shown in Figure 7.2. We would like to classify the unknown event, shown as NZ in the bottom two panels, as belonging to either the earthquake (\Pi_1) or explosion (\Pi_2) population. To solve this problem, we need an optimality criterion that leads to a statistic T(x) that can be used to assign the NZ event to one of the two populations. To measure the success of the classification, we also need to evaluate errors that can be expected in the future, namely the number of earthquakes classified as explosions (false alarms) and the number of explosions classified as earthquakes (missed signals).
The problem can be formulated by assuming the observed series x has a probability density p_i(x) when the observed series is from population \Pi_i, for i = 1, \ldots, g. Then, partition the space spanned by the np-dimensional process x into g mutually exclusive regions R_1, R_2, \ldots, R_g such that, if x falls in R_i, we assign x to population \Pi_i. The misclassification probability is defined as the probability of classifying the observation into population \Pi_j when it belongs to \Pi_i, for j \ne i, and would be given by the expression
P(j \mid i) = \int_{R_j} p_i(x)\, dx.   (7.102)

The overall total error probability depends also on the prior probabilities, say \pi_1, \pi_2, \ldots, \pi_g, of belonging to one of the g groups. For example, the probability that an observation x originates from \Pi_i and is then classified into \Pi_j is obviously \pi_i P(j \mid i), and the total error probability becomes
P_e = \sum_{i=1}^{g} \pi_i \sum_{j \ne i} P(j \mid i).   (7.103)
Although costs have not been incorporated into (7.103), it is easy to do so by multiplying P(j \mid i) by C(j \mid i), the cost of assigning a series from population \Pi_i to \Pi_j.
The overall error P_e is minimized by classifying x into \Pi_i if
\frac{p_i(x)}{p_j(x)} > \frac{\pi_j}{\pi_i}   (7.104)
for all j \ne i (see, for example, Anderson, 1984). A quantity of interest, from the Bayesian perspective, is the posterior probability that an observation belongs to population \Pi_i, conditional on observing x, say,
P(\Pi_i \mid x) = \frac{\pi_i p_i(x)}{\sum_j \pi_j p_j(x)}.   (7.105)
The procedure that classifies x into the population \Pi_i for which the posterior probability is largest is equivalent to that implied by the criterion (7.104). The posterior probabilities give an intuitive idea of the relative odds of belonging to each of the plausible populations.
Many situations occur, such as in the classification of earthquakes and explosions, in which there are only g = 2 populations of interest. For two populations, the Neyman-Pearson lemma implies, in the absence of prior probabilities, that classifying an observation into \Pi_1 when
\frac{p_1(x)}{p_2(x)} > K   (7.106)
minimizes each of the error probabilities for a fixed value of the other. The rule is identical to the Bayes rule (7.104) when K = \pi_2/\pi_1.
The theory given above takes a simple form when the vector x has a p-variate normal distribution with mean vector \mu_j and covariance matrix \Sigma_j under \Pi_j, for j = 1, 2, \ldots, g. In this case, simply use
p_j(x) = (2\pi)^{-p/2} |\Sigma_j|^{-1/2} \exp\Big\{ -\tfrac12 (x - \mu_j)' \Sigma_j^{-1} (x - \mu_j) \Big\}.   (7.107)
The classification functions are conveniently expressed by quantities that are proportional to the logarithms of the densities, say,
g_j(x) = -\tfrac12 \ln|\Sigma_j| - \tfrac12 x' \Sigma_j^{-1} x + \mu_j' \Sigma_j^{-1} x - \tfrac12 \mu_j' \Sigma_j^{-1} \mu_j + \ln \pi_j.   (7.108)
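For a concrete sense of (7.105) and (7.108), the sketch below evaluates the Gaussian classification scores and the implied posterior probabilities for two hypothetical bivariate groups; the means, covariance matrix, priors, and observation are invented for illustration, and the constant -(p/2) ln 2\pi is dropped, as noted below.
# A sketch of the classification functions (7.108) and posterior probabilities
# (7.105) for hypothetical group parameters.
g.fun = function(x, mu, Sigma, prior) {
  Sinv = solve(Sigma)
  drop(-0.5*log(det(Sigma)) - 0.5*t(x)%*%Sinv%*%x +
        t(mu)%*%Sinv%*%x - 0.5*t(mu)%*%Sinv%*%mu + log(prior))
}
mu1 = c(0, 0); mu2 = c(1, 1); Sig = diag(2)       # hypothetical parameters
x = c(.8, .2)                                     # observation to classify
scores = c(g.fun(x, mu1, Sig, .5), g.fun(x, mu2, Sig, .5))
post = exp(scores - max(scores)); post/sum(post)  # posterior probabilities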

In expressions involving the log likelihood, we will generally ignore terms involving the constant -\tfrac{p}{2} \ln 2\pi. For this case, we may assign an observation x to population \Pi_i whenever
g_i(x) > g_j(x)   (7.109)
for j \ne i, j = 1, \ldots, g, and the posterior probability (7.105) has the form
P(\Pi_i \mid x) = \frac{\exp\{g_i(x)\}}{\sum_j \exp\{g_j(x)\}}.
A common situation occurring in applications involves classification for g = 2 groups under the assumption of multivariate normality and equal covariance matrices; i.e., \Sigma_1 = \Sigma_2 = \Sigma. Then, the criterion (7.109) can be expressed in terms of the linear discriminant function
d_l(x) = g_1(x) - g_2(x) = (\mu_1 - \mu_2)' \Sigma^{-1} x - \tfrac12 (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 + \mu_2) + \ln\frac{\pi_1}{\pi_2},   (7.110)
where we classify into \Pi_1 or \Pi_2 according to whether d_l(x) \ge 0 or d_l(x) < 0. The linear discriminant function is clearly a combination of normal variables and, for the case \pi_1 = \pi_2 = .5, will have mean D^2/2 under \Pi_1 and mean -D^2/2 under \Pi_2, with variance D^2 under both hypotheses, where
D^2 = (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)   (7.111)
is the Mahalanobis distance between the mean vectors \mu_1 and \mu_2. In this case, the two misclassification probabilities (7.102) are
P(1 \mid 2) = P(2 \mid 1) = \Phi\Big(-\frac{D}{2}\Big),   (7.112)
and the performance is directly related to the Mahalanobis distance (7.111).
For the case in which the covariance matrices cannot be assumed to be the same, the discriminant function takes a different form, with the difference g_1(x) - g_2(x) taking the form
d_q(x) = -\tfrac12 \ln\frac{|\Sigma_1|}{|\Sigma_2|} - \tfrac12 x'(\Sigma_1^{-1} - \Sigma_2^{-1})x + (\mu_1' \Sigma_1^{-1} - \mu_2' \Sigma_2^{-1})\, x + \ln\frac{\pi_1}{\pi_2}   (7.113)
for g = 2 groups. This discriminant function differs from the equal covariance case in the linear term and in a nonlinear quadratic term involving the differing covariance matrices. The distribution theory is not tractable for the quadratic case, so no convenient expression like (7.112) is available for the error probabilities of the quadratic discriminant function.
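The relations (7.110)-(7.112) are easy to verify numerically; in the sketch below the two mean vectors, the common covariance matrix, and the equal priors are all hypothetical.
# A sketch of the linear discriminant (7.110), the Mahalanobis distance (7.111),
# and the common misclassification probability (7.112) for hypothetical groups.
mu1 = c(0, 0); mu2 = c(2, 1)
Sig = matrix(c(2, .5, .5, 1), 2, 2)
Sinv = solve(Sig)
d.l = function(x) drop(t(mu1 - mu2)%*%Sinv%*%x -
        0.5*t(mu1 - mu2)%*%Sinv%*%(mu1 + mu2))   # equal priors: log(pi1/pi2) = 0
D2 = drop(t(mu1 - mu2)%*%Sinv%*%(mu1 - mu2))     # squared Mahalanobis distance (7.111)
pnorm(-sqrt(D2)/2)                               # P(1|2) = P(2|1) from (7.112)
d.l(c(1, 1))                                     # negative here, so x = (1,1)' is assigned to Pi_2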

A difficulty in applying the above theory to real data is that the group mean vectors \mu_j and covariance matrices \Sigma_j are seldom known. Some engineering problems, such as the detection of a signal in white noise, assume the means and covariance parameters are known exactly, and this can lead to an optimal solution (see Problems 7.14 and 7.15).
In the classical multivariate situation, it is possible to collect a sample of N_i training vectors from group \Pi_i, say x_{ij}, for j = 1, \ldots, N_i, and use them to estimate the mean vectors and covariance matrices for each of the groups i = 1, 2, \ldots, g; i.e., simply choose the sample means x_{i\cdot} and
S_i = (N_i - 1)^{-1} \sum_{j=1}^{N_i} (x_{ij} - x_{i\cdot})(x_{ij} - x_{i\cdot})'   (7.114)
as the estimators for \mu_i and \Sigma_i, respectively. In the case in which the covariance matrices are assumed to be equal, simply use the pooled estimator
S = \Big(\sum_i N_i - g\Big)^{-1} \sum_i (N_i - 1) S_i.   (7.115)
For the case of a linear discriminant function, we may use
\hat g_i(x) = x_{i\cdot}' S^{-1} x - \tfrac12 x_{i\cdot}' S^{-1} x_{i\cdot} + \log \pi_i   (7.116)
as a simple estimator for g_i(x). For large samples, x_{i\cdot} and S converge to \mu_i and \Sigma in probability, so \hat g_i(x) converges in distribution to g_i(x) in that case. The procedure works reasonably well for the case in which the N_i, i = 1, \ldots, g, are large relative to the length of the series n, a case that is relatively rare in time series analysis. For this reason, we will resort to using spectral approximations for the case in which data are given as long time series.
The performance of sample discriminant functions can be evaluated in several different ways. If the population parameters are known, (7.111) and (7.112) can be evaluated directly. If the parameters are estimated, the estimated Mahalanobis distance \hat D^2 can be substituted for the theoretical value in very large samples. Another approach is to calculate the apparent error rates obtained by applying the classification procedure to the training samples. If n_{ij} denotes the number of observations from population \Pi_j classified into \Pi_i, the sample error rates can be estimated by the ratio
\hat P(i \mid j) = \frac{n_{ij}}{\sum_i n_{ij}}   (7.117)
for i \ne j. If the training samples are not large, this procedure may be biased, and a resampling option like cross-validation or the bootstrap can be employed. A simple version of cross-validation is the jackknife procedure proposed by Lachenbruch and Mickey (1968), which holds out the observation to be classified and derives the classification function from the remaining observations. Repeating this procedure for each member of the training sample and computing (7.117) for the holdout samples leads to better estimators of the error rates.
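The holdout calculation can be delegated to lda() in the MASS package, which performs the leave-one-out classification directly; the two-group feature vectors below are simulated solely to illustrate the error-rate table behind (7.117).
# A sketch of jackknifed (leave-one-out) error rates in the spirit of (7.117),
# using simulated two-group feature vectors.
library(MASS)
set.seed(1)
x     = rbind(matrix(rnorm(40, mean = 0),   20, 2),
              matrix(rnorm(40, mean = 1.5), 20, 2))
group = factor(rep(1:2, each = 20))
cv    = lda(x, group, CV = TRUE)   # holdout classification of each observation
table(group, cv$class)             # off-diagonal counts n_ij estimate the error rates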

Fig. 7.12. Classification of earthquakes and explosions based on linear discriminant analysis using the magnitude features. (The panel plots log mag(S) against log mag(P) for the eight earthquakes, eight explosions, and the unknown NZ event.)

Example 7.10 Discriminant Analysis Using Amplitudes
We can give a simple example of applying the above procedures to the logarithms of the amplitudes of the separate P and S components of the original earthquake and explosion traces. The logarithms (base 10) of the maximum peak-to-peak amplitudes of the P and S components, denoted by \log_{10} P and \log_{10} S, can be considered as two-dimensional feature vectors, say x = (x_1, x_2)' = (\log_{10} P, \log_{10} S)', from a bivariate normal population with differing means and covariances. The original data, from Kakizawa et al. (1998), are shown in Figure 7.12. The figure includes the Novaya Zemlya (NZ) event of unknown origin. The tendency of the earthquakes to have higher values for \log_{10} S, relative to \log_{10} P, has been noted by many, and the use of the logarithm of the ratio, i.e., \log_{10} P - \log_{10} S, in some references (see Lay, 1997, pp. 40-41) is a tacit indicator that a linear function of the two parameters will be a useful discriminant.
The sample means x_{1\cdot} = (.346, 1.024)' and x_{2\cdot} = (.922, .993)' and covariance matrices
S_1 = \begin{pmatrix} .026 & -.007 \\ -.007 & .010 \end{pmatrix}  and  S_2 = \begin{pmatrix} .025 & -.001 \\ -.001 & .010 \end{pmatrix}
are immediate from (7.114), with the pooled covariance matrix
S = \begin{pmatrix} .026 & -.004 \\ -.004 & .010 \end{pmatrix}
given by (7.115). Although the covariance matrices are not equal, we try the linear discriminant function anyway, which yields (with equal prior probabilities \pi_1 = \pi_2 = .5) the sample discriminant functions
\hat g_1(x) = 30.668 x_1 + 111.411 x_2 - 62.401

and
\hat g_2(x) = 54.048 x_1 + 117.255 x_2 - 83.142
from (7.116), with the estimated linear discriminant function (7.110) given by
\hat d_l(x) = -23.380 x_1 - 5.843 x_2 + 20.740.
The jackknifed posterior probabilities of being an earthquake for the earthquake group ranged from .621 to 1.000, whereas the explosion probabilities for the explosion group ranged from .717 to 1.000. The unknown event, NZ, was classified as an explosion, with posterior probability .960.
The R code for this example is as follows.
P = 1:1024; S = P+1024
mag.P = log10(apply(eqexp[P,], 2, max) - apply(eqexp[P,], 2, min))
mag.S = log10(apply(eqexp[S,], 2, max) - apply(eqexp[S,], 2, min))
eq.P = mag.P[1:8]; eq.S = mag.S[1:8]
ex.P = mag.P[9:16]; ex.S = mag.S[9:16]
NZ.P = mag.P[17]; NZ.S = mag.S[17]
# Compute linear discriminant function
cov.eq = var(cbind(eq.P, eq.S))
cov.ex = var(cbind(ex.P, ex.S))
cov.pooled = (cov.ex + cov.eq)/2
means.eq = col