Causal Inference: The Mixtape



Copyright © 2018 Scott Cunningham

Published by tufte-latex.googlecode.com. Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “as is” basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.

First printing, April 2018

Contents

Introduction 13
Probability theory and statistics review 23
Properties of Regression 35
Directed acyclical graphs 67
Potential outcomes causal model 81
Matching and subclassification 105
Regression discontinuity 153
Instrumental variables 205
Panel data 245
Differences-in-differences 263
Synthetic control 287

Conclusion 315
Bibliography 317

List of Figures

1 xkcd 16
2 Wright’s graphical demonstration of the identification problem 21
3 Graphical representation of bivariate regression from y on x 46
4 Distribution of residuals around regression line 47
5 Distribution of coefficients from Monte Carlo simulation. 54
6 Regression anatomy display. 61
7 Top left figure: Non-star sample scatter plot of beauty (vertical axis) and talent (horizontal axis). Top right figure: Star sample scatter plot of beauty and talent. Bottom left figure: Entire (stars and non-stars combined) sample scatter plot of beauty and talent. 79
8 Regression of kindergarten percentile scores onto treatments [Krueger, 1999]. 98
9 Regression of first grade percentile scores onto treatments [Krueger, 1999]. 99
10 Regression of first grade percentile scores onto treatments for K-3 with imputed test scores for all post-kindergarten ages [Krueger, 1999]. 100
11 Switching of students into and out of the treatment arms between first and second grade [Krueger, 1999]. 101
12 IV reduced form approach compared to the OLS approach [Krueger, 1999]. 102
13 Lung cancer at autopsy trends 107
14 Smoking and Lung Cancer 109
15 Covariate distribution by job trainings and control. 123
16 Covariate distribution by job trainings and matched sample. 126
17 Lalonde [1986] Table 5(a) 136
18 Lalonde [1986] Table 5(b) 137
19 Dehejia and Wahba [1999] Figure 1, overlap in the propensity scores (using PSID) 138
20 Dehejia and Wahba [1999] Figure 2, overlap in the propensity scores (using CPS) 139

21 Dehejia and Wahba [1999] Table 3 results. 139
22 Dehejia and Wahba [1999] Table 4, covariate balance 140
23 Histogram of propensity score by treatment status 143
24 Angrist and Lavy [1999] descriptive statistics 157
25 Angrist and Lavy [1999] descriptive statistics for the discontinuity sample. 157
26 Maimonides’ Rule vs. actual class size [Angrist and Lavy, 1999]. 158
27 Average reading scores vs. enrollment size [Angrist and Lavy, 1999]. 159
28 Reading score residual and class size function by enrollment count [Angrist and Lavy, 1999]. 160
29 Math score residual and class size function by enrollment count [Angrist and Lavy, 1999]. 160
30 OLS regressions [Angrist and Lavy, 1999]. 161
31 First stage regression [Angrist and Lavy, 1999]. 162
32 Second stage regressions [Angrist and Lavy, 1999]. 162
33 Sharp vs. Fuzzy RDD [van der Klaauw, 2002]. 164
34 Dashed lines are extrapolations 165
35 Display of observations from simulation. 167
36 Display of observations discontinuity simulation. 168
37 Simulated nonlinear data from Stata 169
38 Illustration of a boundary problem 171
39 Insurance status and age 173
40 Card et al. [2008] Table 1 175
41 Investigating the CPS for discontinuities at age 65 [Card et al., 2008] 176
42 Investigating the NHIS for the impact of Medicare on care and utilization [Card et al., 2008] 177
43 Changes in hospitalizations [Card et al., 2008] 178
44 Mortality and Medicare [Card et al., 2009] 178
45 Imbens and Lemieux [2008], Figure 3. Horizontal axis is the running variable. Vertical axis is the conditional probability of treatment at each value of the running variable. 179
46 Potential and observed outcome regressions [Imbens and Lemieux, 2008] 179
47 Panel C is density of income when there is no pre-announcement and no manipulation. Panel D is the density of income when there is pre-announcement and manipulation. From McCrary [2008]. 182
48 Panels refer to (top left to bottom right) district characteristics: real income, percent high school degree, percent black, and percent eligible to vote. Circles represent the average characteristic within intervals of 0.01 in Democratic vote share. The continuous line represents the predicted values from a fourth-order polynomial in vote share fitted separately for points above and below the 50 percent threshold. The dotted line represents the 95 percent confidence interval. 184
49 Example of outcome plotted against the running variable. 185

50 Example of covariate plotted against the running variable. 185
51 McCrary density test, NHIS data, SNAP eligibility against a running variable based on income and family size. 186
52 Lee, Moretti and Butler (2004)’s Table 1. Main results. 189
53 Lee et al. [2004], Figure I 195
54 Reproduction of Lee et al. [2004] Figure I using cmogram with quadratic fit and confidence intervals 196
55 Reproduction of Lee et al. [2004] Figure I using cmogram with linear fit 197
56 Reproduction of Lee et al. [2004] Figure I using cmogram with lowess fit 198
57 Carrell et al. [2011] Figure 3 199
58 Carrell et al. [2011] Table 3 199
59 Local linear nonparametric regressions 200
60 Local linear nonparametric regressions 201
61 RKD kinks from Card et al. [2015] 202
62 Base year earnings and benefits for single individuals from Card et al. [2015] 203
63 Log(duration unemployed) and benefits for single individuals from Card et al. [2015] 204
64 Pseudoephedrine (top) vs d-methamphetamine (bottom) 220
65 Figure 3 from Cunningham and Finlay [2012] showing changing street prices following both supply shocks. 221
66 Figure 5 from Cunningham and Finlay [2012] showing first stage. 222
67 Figure 4 from Cunningham and Finlay [2012] showing reduced form effect of interventions on children removed from families and placed into foster care. 222
68 Table 3 from Cunningham and Finlay [2012] showing OLS and 2SLS estimates of meth on foster care admissions. 223
69 Angrist and Krueger [1991] explanation of their instrumental variable. 225
70 Angrist and Krueger [1991] first stage relationship between quarter of birth and schooling. 226
71 Angrist and Krueger [1991] reduced form visualization of the relationship between quarter of birth and log weekly earnings. 226
72 Angrist and Krueger [1991] first stage for different outcomes. 227
73 Angrist and Krueger [1991] OLS and 2SLS results for the effect of education on log weekly earnings. 229
74 Bound et al. [1995] OLS and 2SLS results for the effect of education on log weekly earnings. 231
75 Bound et al. [1995] OLS and 2SLS results for the effect of education on log weekly earnings with the 100+ weak instruments. 232

76 Table 3 from Cornwell and Trumbull [1994] 252
77 Table 2 from Draca et al. [2011] 253
78 Table 2 from Cornwell and Rupert [1997] 255
79 Table 3 from Cornwell and Rupert [1997] 256
80 NJ and PA 266
81 Distribution of wages for NJ and PA in November 1992 267
82 Simple DD using sample averages 268
83 DD regression diagram 270
84 Checking the pre-treatment trends for parallelism 271
85 Autor [2003] leads and lags in dynamic DD model 271
86 Gruber [1994] Table 3 275
87 Internet diffusion and music spending 277
88 Comparison of Internet user and non-user groups 277
89 Theoretical predictions of abortion legalization on age profiles of gonorrhea incidence 280
90 Differences in black female gonorrhea incidence between repeal and Roe cohorts. 280
91 Coefficients and standard errors from DD regression equation 281
92 Subset of coefficients (year-repeal interactions) for the DDD model, Table 3 of Cunningham and Cornwell [2013]. 283
93 I ♥ Federalism bumpersticker (for the natural experiments) 285
94 California cigarette sales vs the rest of the country 292
95 California cigarette sales vs synthetic California 293
96 Balance table 293
97 California cigarette sales vs synthetic California 294
98 Placebo distribution 295
99 Placebo distribution 296
100 Placebo distribution 297
101 West Germany GDP vs. Other Countries 297
102 Synthetic control graph: West Germany vs Synthetic West Germany 298
103 Synthetic control graph: Differences between West Germany and Synthetic West Germany 298
104 Synthetic control graph: Placebo Date 299
105 Prison capacity (operational capacity) expansion 301
106 African-American male incarceration rates 302
107 African-American male incarceration 304
108 Gap between actual Texas and synthetic Texas 305
109 Histogram of the distribution of ratios of post-RMSPE to pre-RMSPE. Texas is one of the ones in the far right tail. 310
110 Histogram of the distribution of ratios of post-RMSPE to pre-RMSPE. Texas is one of the ones in the far right tail. 311
111 Placebo distribution. Texas is the black line. 313

List of Tables

1 Examples of Discrete and Continuous Random Processes. 23
2 Total number of ways to get a 7 with two six-sided dice. 25
3 Total number of ways to get a 3 using two six-sided dice. 25
4 Two way contingency table. 28
5 Sum of deviations equalling zero 37
6 Simulated data showing the sum of residuals equals zero 48
7 Monte Carlo simulation of OLS 54
8 Regressions illustrating confounding bias with simulated gender disparity 77
9 Regressions illustrating collider bias 78
10 Yule regressions [Yule, 1899]. 83
11 Potential outcomes for ten patients receiving surgery Y_1 or chemo Y_0. 88
12 Post-treatment observed lifespans in years for surgery D = 1 versus chemotherapy D = 0. 89
13 Krueger regressions [Krueger, 1999]. 102
14 Death rates per 1,000 person-years [Cochran, 1968] 110
15 Mean ages, years [Cochran, 1968]. 111
16 Subclassification example. 112
17 Adjusted mortality rates using 3 age groups [Cochran, 1968]. 113
18 Subclassification example of Titanic survival for large K 118
19 Training example with exact matching 121
20 Training example with exact matching (including matched sample) 124
21 Another matching example (this time to illustrate bias correction) 130
22 Nearest neighbor matched sample 131
23 Nearest neighbor matched sample with fitted values for bias correction 132
24 Completed matching example with single covariate 138
25 Distribution of propensity score for treatment group. 142
26 Distribution of propensity score for CPS Control group. 142

27 Balance in covariates after coarsened exact matching. 151
28 OLS and 2SLS regressions of Log Earnings on Schooling 238
29 OLS and 2SLS regressions of Log Quantity on Log Price with wave height instrument 241
30 OLS and 2SLS regressions of Log Quantity on Log Price with windspeed instrument 242
31 POLS, FE and Demeaned OLS Estimates of the Determinants of Log Hourly Price for a Panel of Sex Workers 259
32 Compared to what? Different cities 264
33 Compared to what? Before and After 264
34 Compared to what? Subtract each city’s differences 265
35 Differences-in-differences-in-differences 273
36 Synthetic control weights 305

To my son, Miles, one of my favorite people. I love you. You’ve tagged my head and heart.

“It’s like the mo money we come across, the mo problems we see.” – Notorious B.I.G.

Introduction

“Am I the only [person] who cares about mixtapes?” – Chance the Rapper

I like to think of causal inference as the space between theory and estimation. It’s where we test primarily social scientific hypotheses in the wild. Some date the beginning of modern causal inference with Fisher [1935], Haavelmo [1943], Rubin [1974] or applied labor economics studies; but whenever you consider its start, causal inference is now a distinct field within econometrics. It’s sometimes listed as a lengthy chapter on “program evaluation” [Wooldridge, 2010], or given entire book-length treatments. To name just a few textbooks in the growing area, there’s Angrist and Pischke [2009], Morgan and Winship [2014], Imbens and Rubin [2015] and probably a half dozen others, not to mention numerous, lengthy treatments of specific strategies such as Imbens and Lemieux [2008] and Angrist and Krueger [2001]. The field is crowded and getting more crowded every year.

So why does my book exist? I believe there’s some holes in the market, and this book is an attempt to fill them. For one, none of the materials out there at present are exactly what I need when I teach my own class on causal inference. When I teach that class, I use Morgan and Winship [2014], Angrist and Pischke [2009], and a bunch of other stuff I’ve cobbled together. No single book at present has everything I need or am looking for. Imbens and Rubin [2015] covers the potential outcomes model, experimental design, matching and instrumental variables, but does not contain anything about directed acyclical graphical models, regression discontinuity, panel data or synthetic control. [Footnote 1: But hopefully volume 2 will build on volume 1 and continue to build out this material, at which point my book becomes obsolete.] Morgan and Winship [2014] covers DAGs, the potential outcomes model, and instrumental variables, but is missing adequate material on regression discontinuity, panel data and synthetic control. Angrist and Pischke [2009] is very close, but does not include anything on synthetic control nor the graphical models that I find so useful. But maybe most importantly, Imbens and Rubin [2015], Angrist and Pischke [2009] and Morgan and Winship [2014]

do not provide enough practical guidance in Stata, which I believe is invaluable for learning and becoming confident in this area.

This book was written for a few different people. It was written first and foremost for practitioners, which is why it includes easy to download datasets and programs. [Footnote 2: Although Angrist and Pischke [2009] provides an online data warehouse from dozens of papers, I find that students need more pedagogical walkthroughs and replications for these ideas to become concrete and familiar.] It’s why I have made several efforts to review papers as well as replicate them as much as possible. I want readers to both understand this field, but also importantly, to feel empowered to apply these methods and techniques to their own research questions.

Another person I have in mind is the experienced social scientist wanting to retool. Maybe these are people with more of a theoretical bent or background, or maybe it’s people who simply have some holes in their human capital. This book, I hope, can help guide them through the modern theories of causality so common in the social sciences, as well as provide a calculus in directed acyclical graphical models that can help connect their knowledge of theory with econometric identification.

Finally, this book is written for people very early in their careers, be it undergraduates, graduate students, or newly minted PhDs. My hope is that this book can give you a jump start so that you don’t have to, like many of us had to, meander through a somewhat labyrinthine path to these methods.

Giving it away

“Did what I wanted, didn’t care about a hater / Delivered my tape to the world as a caterer” – Lil Yachty

For now, I have chosen to give this book away, for several reasons. First, the most personal ones. I derive satisfaction in knowing that I can take what I’ve learned, and my personal philosophies about this material, including how it’s taught, and give it away to people. This is probably because I remain deep down a teacher who cares about education. I love helping students discover; I love sharing in that discovery. And if someone is traveling the same windy path that I traveled, then why not help them by sharing what I’ve learned and now believe about this field? I could sell it, and maybe one day I will, but for the moment I’ve decided to give it away – at least, the first few versions.

The second reason, which supports the first, is something that Al Roth once told me. He had done me a favor, which I could never repay, and I told him that. To which he said: “Scott, intergenerational favors aren’t supposed to be repaid, they’re

supposed to be passed forward to the next generation.”

I’ve given a lot of thought to what he said [Footnote 3: I give a lot of thought to anything and everything that Roth says or has ever said, actually.], and if you’ll indulge me, I’d like to share my response to what Roth said to me. Every person must decide what their values are, how they want to live their life, and what they want to say about the life they were given to live when they look back on it. Economic models take preferences as given and unchanging [Becker, 1993], but I have found that figuring out one’s preferences is the hard work of being a moral person. Love for others, love for my students, love for my senior mentors, love for my friends, love for my peers, love for junior faculty, love for graduate students, love for my family – these are the things that motivate me. I want my life to be of service to others, but I’m a teacher and a researcher, not Mother Theresa. So I have to figure out what it means to be of service to others as a teacher and a researcher, given that is a major part of my life. Each of us has to figure out what it means to be a neighbor with the resources we’ve been given and the path we’ve chosen. So, somewhat inspired by Roth and various senior mentors’ generosities towards me, I decided that at least for now giving away the book is one very small way to live consistently with these values.

Plus, and maybe this is actually one of the more important reasons, I figure if I give away the book, then you, the reader, will be patient with me as I take my time to improve the book.

Not everything is in this book. I see it as foundational, not comprehensive. A useful starting point, not an ending point. If you master the material in this book, then you’ll have a solid foundation to build on. You might explore the exciting new area of causal inference and machine learning by Athey, Imbens and others, structural econometrics [Rust, 1987], synthetic control with multiple treatments [Cavallo et al., 2013], randomized controlled trials and field experiments, and the seemingly never-ending innovations and updates in econometrics.

Another more playful reason I am giving it away is because I find Chance the Rapper’s mentality when it comes to mixtapes infectious. A mixtape is a collection of one’s favorite songs given away to friends in the hopes they’ll have fun listening to it. Consider this my mixtape of research designs that I hope you’ll find fun, interesting, and powerful for estimating causal effects. It’s not everything you need to know; more like the seminal things you should know as of this book’s writing. There’s far more to learn, and I consider this to be the first book you need, not the only book you need. This book is meant to be a complement to books like Angrist and Pischke [2009] rather than a substitute.

How I got here

“Started from the bottom now we’re here” – Drake

Figure 1: xkcd

It may be interesting to hear how I got to the point of wanting to write this book. The TL;DR version is that I followed a windy path from poetry to economics to research. I fell in love with economics, then research, and causal inference was a constant throughout all of it. But now the longer version.

I majored in English at the University of Tennessee at Knoxville and graduated with a serious ambition of becoming a professional poet. But, while I had been successful writing poetry in college, I quickly realized that the road to success beyond that point was probably not realistic. I was newly married with a baby on the way, and working as a qualitative research analyst doing market research, and slowly I stopped writing poetry altogether. [Footnote 4: Rilke said you should quit writing poetry when you can imagine yourself living without it [Rilke, 2012]. I could imagine living without poetry, so I took his advice and quit. I have no regrets whatsoever. Interestingly, when I later found economics, I believed I would never be happy unless I was a professional economist doing research on the topics I found interesting. So I like to think I followed Rilke’s advice on multiple levels.]

My job as a qualitative research analyst was eye opening in part because it was my first exposure to empiricism. My job was to do “grounded theory” – a kind of inductive approach to generating explanations of human behavior based on focus groups and in-depth interviews, as well as other ethnographic methods. I approached each project as an opportunity to understand why people did the things they did (even if what they did was buy detergent or pick a cable provider). While the job inspired me to develop my own theories about human behavior, it didn’t provide me a way of falsifying those theories.
I lacked a background in the social sciences, so I would spend my evenings downloading and reading articles from the Internet. I don’t remember how I ended up there, but one night I was on the University of Chicago Law and Economics working paper series when a speech by Gary Becker caught my eye. It was his Nobel

Prize acceptance speech on how economics applied to all of human behavior [Becker, 1993], and reading it changed my life. I thought economics was about stock markets and banks until I read that speech. I didn’t know economics was an engine that one could use to analyze all of human behavior. This was overwhelmingly exciting, and a seed had been planted.

But it wasn’t until I read Lott and Mustard [1997] that I became truly enamored with economics. I had no idea that there was an empirical component where economists sought to estimate causal effects with quantitative data. One of the authors in Lott and Mustard [1997] was David Mustard, then an Associate Professor of economics at the University of Georgia, and one of Becker’s former students. I decided that I wanted to study with Mustard, and therefore applied for University of Georgia’s doctoral program in economics. I moved to Athens, Georgia with my wife, Paige, and our infant son, Miles, and started classes in the fall of 2002.

After passing my prelims, I took Mustard’s labor economics field class, and learned about the kinds of topics that occupied the lives of labor economists. These topics included the returns to education, inequality, racial discrimination, crime and many other fascinating and important topics. We read many, many empirical papers in that class, and afterwards I knew that I would need a strong background in econometrics to do the kind of empirical work I desired to do. And since econometrics was the most important area I could ever learn, I decided to make it my main field of study. This led to me working with Christopher Cornwell, an econometrician at Georgia from whom I learned a lot. He became my most important mentor, as well as a coauthor and friend. Without him, I wouldn’t be where I am today.

Econometrics was difficult. I won’t pretend I was a prodigy. I took all the econometrics courses offered at the University of Georgia, and some more than once. They included probability and statistics, cross-section, panel data, time series, and qualitative dependent variables. But while I passed my field exam in econometrics, I failed to understand econometrics at deeper, more basic levels. You might say I lost sight of the forest for the trees.

I noticed something while I was writing the third chapter of my dissertation that I hadn’t noticed before. My third chapter was an investigation of the effect of abortion legalization on longrun risky sexual behavior [Cunningham and Cornwell, 2013]. It was a revisiting of Donohue and Levitt [2001]. One of the books I read in preparation of the study was Levine [2004], which in addition to reviewing the theory of and empirical studies on abortion had a little table explaining the differences-in-differences identification strategy. The University

of Georgia had a traditional econometrics pedagogy, and most of my field courses were theoretical (e.g., Public and Industrial Organization), so I never really had heard the phrase “identification strategy”, let alone “causal inference”. That simple difference-in-differences table was eye-opening. I saw how econometric modeling could be used to isolate the causal effects of some treatment, and that put me on a new research trajectory.

Optimization Makes Everything Endogenous

“I gotta get mine, you gotta get yours” – MC Breed

Causal inference is often accused of being a-theoretical, but nothing could be further from the truth [Imbens, 2009, Deaton and Cartwright, 2018]. Economic theory is required in order to justify a credible claim of causal inference. And economic theory also highlights why causal inference is necessarily a thorny task. Let me explain.

There’s broadly thought to be two types of data: experimental data and non-experimental data. The latter is also sometimes called observational data. Experimental data is collected in something akin to a laboratory environment. In a traditional experiment, the researcher participates actively in the process being recorded. It’s more difficult to obtain data like this in the social sciences due to feasibility, financial cost or moral objections, although it is more common now than was once the case. Examples include the Oregon Medicaid Experiment, the RAND health insurance experiment, the field experiment movement inspired by Michael Kremer, Esther Duflo and John List, and many others. Observational data is usually collected through surveys in a retrospective manner, or as the byproduct of some other business activity (“big data”). That is, in observational studies, you collect data about what has happened previously, as opposed to collecting data as it happens. The researcher is also a passive actor in this process. She observes actions and results, but is not in a position to interfere with the outcomes. This is the most common form of data that social scientists use.

Economic theory tells us we should be suspicious of correlations found in observational data. In observational data, correlations are almost certainly not reflecting a causal relationship because the variables were endogenously chosen by people who were making decisions they thought were best. In pursuing some goal while facing constraints, they chose certain things that created a spurious correlation with other things. The reason we think this is because of what we learn from the potential outcomes model: a correlation, in order to be a measure of a causal effect, must be completely independent of the potential outcomes under consideration. Yet if the person is making some choice based on what she thinks is best, then it necessarily violates this independence condition. Economic theory predicts choices will be endogenous, and thus naive correlations are misleading. But theory, combined with intimate knowledge of the institutional details surrounding the phenomena under consideration, can be used to recover causal effects.

We can estimate causal effects, but only with assumptions and data. Now we are veering into the realm of epistemology. Identifying causal effects involves assumptions, but it also requires a particular kind of belief about the work of scientists. Credible and valuable research requires that we believe it is more important to do our work correctly than to try and achieve a certain outcome (e.g., confirmation bias, statistical significance, stars). The foundations of scientific knowledge are scientific methodologies. Science does not collect evidence in order to prove what we want to be true or what people want others to believe. That is a form of propaganda, not science. Rather, scientific methodologies are devices for forming a particular kind of belief. Scientific methodologies allow us to accept unexpected, and sometimes, undesirable answers. They are process oriented, not outcome oriented. And without these values, causal methodologies are also not credible.

Example: Identifying price elasticity of demand

“Oh, they just don’t understand / Too much pressure, it’s supply demand.” – Zhu & Alunageorge

One of the cornerstones of scientific methodologies is empirical analysis. By empirical analysis, I mean the use of data to test a theory or to estimate a relationship between variables. [Footnote 5: It is not the only cornerstone, nor even necessarily the most important cornerstone, but empirical analysis has always played an important role in scientific work.] The first step in conducting an empirical economic analysis is the careful formulation of the question we would like to answer. In some cases, we would like to develop and test a formal economic model that describes mathematically a certain relationship, behavior or process of interest. Those models are valuable insofar as they both describe the phenomena of interest as well as make falsifiable (testable) predictions. A prediction is falsifiable insofar as we can evaluate, and potentially reject, the prediction with data. You can also obtain a starting point for empirical analysis less formally through an intuitive and less formal reasoning process. But economics favors formalism and deductive methods.

After we have specified an economic model, we turn it into what is called an econometric model that we can estimate directly with data. [Footnote 6: The economic model is the framework with which we describe the relationships we are interested in, the intuition for our results and the hypotheses we would like to test.] [Footnote 7: Economic models are abstract, not realistic, representations of the world. George Box, the statistician, once quipped that “all models are wrong, but some are useful.”] One clear issue we immediately face is regarding the functional form of the model, or how to describe the relationships of the variables we are interested in through an equation. Another important issue is how we will deal with variables that cannot be directly or reasonably observed by the researcher, or that cannot be measured very well, but which play an important role in our economic model.

A generically important contribution to our understanding of causal inference is the notion of comparative statics. Comparative statics are theoretical descriptions of causal effects contained within the model. These kinds of comparative statics are always based on the idea of ceteris paribus – holding all else constant. When we are trying to describe the causal effect of some intervention, we are always assuming that the other relevant variables in the model are not changing. If they were changing, then they would confound our estimation.

One of the things implied by ceteris paribus that comes up repeatedly in this book is the idea of covariate balance. If we say that everything is the same except for the movement of one variable, then everything is the same on both sides of that variable’s changing value. Thus, when we invoke ceteris paribus, we are implicitly invoking covariate balance – both the observable and unobservable covariates. To illustrate this idea, let’s begin with a basic economic model: supply and demand equilibrium and the problems it creates for estimating the price elasticity of demand.
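Before turning to that example, the two ideas above – endogenous choices violating independence of the potential outcomes, and covariate balance under ceteris paribus – can be made concrete in a short simulation. The book’s empirical work uses Stata; the sketch below is my own complementary illustration in Python, with all variable names (ability, the 0/1 treatment indicator, the true effect of 1) chosen for exposition rather than taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A covariate that shifts the potential outcomes.
ability = rng.normal(0, 1, n)
y0 = 10 + 2 * ability + rng.normal(0, 1, n)  # outcome without treatment
y1 = y0 + 1                                  # true causal effect is exactly 1

# Endogenous choice: higher-ability people are more likely to opt in,
# so treatment is NOT independent of the potential outcomes.
d_choice = (ability + rng.normal(0, 1, n) > 0).astype(int)
y_choice = np.where(d_choice == 1, y1, y0)
naive = y_choice[d_choice == 1].mean() - y_choice[d_choice == 0].mean()

# Randomized assignment: independent of the potential outcomes.
d_rand = rng.integers(0, 2, n)
y_rand = np.where(d_rand == 1, y1, y0)
random_diff = y_rand[d_rand == 1].mean() - y_rand[d_rand == 0].mean()

# Covariate balance: mean ability by treatment arm.
bal_choice = ability[d_choice == 1].mean() - ability[d_choice == 0].mean()
bal_rand = ability[d_rand == 1].mean() - ability[d_rand == 0].mean()

print(f"naive difference under self-selection: {naive:.2f}")       # far above 1
print(f"difference under randomization:        {random_diff:.2f}")  # close to 1
print(f"ability gap, self-selected arms:       {bal_choice:.2f}")   # large
print(f"ability gap, randomized arms:          {bal_rand:.2f}")     # near 0
```

The self-selected comparison is badly biased precisely because the chooser’s behavior ties treatment to the potential outcomes, and the covariate imbalance across the self-selected arms is the visible symptom of that broken ceteris paribus.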
Policy-makers and business managers have a natural interest in learning the price elasticity of demand. Knowing this can help firms maximize profits, help governments choose optimal taxes, as well as identify the conditions under which quantity restrictions are preferred [Becker et al., 2006]. But the problem is that we do not observe demand curves, because demand curves are theoretical objects. More specifically, a demand curve is a collection of paired potential outcomes of price and quantity. We observe equilibrium values of price and quantity, not the potential prices and potential quantities along the entire demand curve. Only by tracing out the potential outcomes along a demand curve can we calculate the elasticity.

To see this, consider this graphic from Philip Wright’s Appendix B [Wright, 1928], which we’ll discuss in greater detail later (Figure 2). The price elasticity of demand is the ratio of percentage changes in quantity to price for a single demand curve. Yet, when there are shifts in supply and demand, a sequence of quantity and price pairs emerges in history which reflect neither the demand curve nor the supply

curve. In fact, connecting the points does not reflect any meaningful object.

[Figure 2: Wright's graphical demonstration of the identification problem. The exhibit reproduces Appendix B (p. 296): "Price-output Data Fail to Reveal Either Supply or Demand Curve." (A) conditions which affect cost without affecting demand conditions; (B) conditions which affect demand without affecting cost conditions.]

The price elasticity of demand is the solution to the following equation:

ε = ∂ log Q / ∂ log P

But in this example, the change in P is exogenous. For instance, it holds supply fixed, the prices of other goods fixed, income fixed, preferences fixed, input costs fixed, etc. In order to estimate the price elasticity of demand, we need changes in P that are completely and utterly independent of the otherwise normal determinants of supply and the other determinants of demand. Otherwise we get shifts in either supply or demand, which create new pairs of data for which any correlation between P and Q will not be a measure of the elasticity of demand.

Nevertheless, the elasticity is an important object, and we need to know it. So given this theoretical object, we must write out an econometric model as a starting point. One possible example of an econometric model would be a linear demand function:

log Q_d = α + δ log P + γ X_d + u

where α is the intercept, δ is the elasticity of demand, X_d is a matrix of factors that determine demand, like the prices of other goods or income, γ is the coefficient on the relationship between X_d and Q_d, and u is the error term. (More on the error term later.)
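To make the identification problem concrete, here is a small simulation sketch, written in Python rather than the book's Stata, with structural parameters invented purely for illustration. Price and quantity are jointly determined by demand and supply shocks, and the regression of equilibrium log quantity on equilibrium log price recovers neither the demand elasticity nor the supply slope:

```python
import random

random.seed(0)
n = 100_000

# Hypothetical structural parameters (illustrative, not from the text):
#   demand: log Q = 5.0 - 1.0 * log P + u   (true demand elasticity: -1.0)
#   supply: log Q = 1.0 + 0.5 * log P + v
log_p, log_q = [], []
for _ in range(n):
    u = random.gauss(0, 1)  # demand shifter
    v = random.gauss(0, 1)  # supply shifter
    p = (5.0 - 1.0 + u - v) / 1.5  # equilibrium log price: demand = supply
    q = 1.0 + 0.5 * p + v          # equilibrium log quantity
    log_p.append(p)
    log_q.append(q)

# OLS slope of log Q on log P: cov(p, q) / var(p)
mp = sum(log_p) / n
mq = sum(log_q) / n
cov = sum((p - mp) * (q - mq) for p, q in zip(log_p, log_q)) / n
var = sum((p - mp) ** 2 for p in log_p) / n
slope = cov / var
print(round(slope, 2))  # roughly -0.25: neither the demand (-1.0) nor supply (0.5) slope
```

The equilibrium points trace out neither curve, which is exactly the point of Wright's figure.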

Foreshadowing the rest of the lecture notes: to estimate the price elasticity of demand, we need two things. First, we need numerous rows of data on price and quantity. Second, we need the variation in price in our imaginary dataset to be independent of u. We call this kind of independence exogeneity. Without both, we cannot recover the price elasticity of demand, and therefore any decision that requires that information will be based on flawed or incomplete data.

In conclusion, simply finding an association between two variables might be suggestive of a causal effect, but it also might not. Correlation doesn't mean causation unless key assumptions hold.

Before we start digging into causal inference, we need to lay down a foundation in simple regression models. We also need to introduce a simple program necessary to do some of the Stata examples. I have uploaded numerous datasets to my website which you will use to perform procedures and replicate examples from the book. That file can be downloaded from . Once it's downloaded, simply copy and paste the file into your personal Stata ado subdirectory. To find that path, type sysdir at the Stata command line. This will tell you the location of your personal ado subdirectory. If you copy scuse.ado into this subdirectory, then you can call all the datasets used in the book. You're all set to go forward!

Probability theory and statistics review

"Numbers is hardly real and they never have feelings / But you push too hard, even numbers got limits" – Mos Def

Basic probability theory. We begin with some definitions. A random process is a process that can be repeated many times with different outcomes. The sample space is the set of all possible outcomes of a random process. We distinguish between discrete and continuous random processes in the following table.

Table 1: Examples of Discrete and Continuous Random Processes.

Description      Type          Potential outcomes
Coin             Discrete      Heads, Tails
6-sided die      Discrete      1, 2, 3, 4, 5, 6
Deck of cards    Discrete      2, 3, . . . , King, Ace
Housing prices   Continuous    P ≥ 0

We define independent events two ways. The first definition refers to logical independence: two events occur, but there is no reason to believe that the two events affect one another. When one assumes they do, the logical fallacy is called post hoc ergo propter hoc. The second definition is statistical independence. We'll illustrate the latter with an example from the idea of sampling with and without replacement. Let's use a randomly shuffled deck of cards as an example. For a deck of 52 cards, what is the probability that the first card will be an Ace?

Pr(Ace) = Count of Aces / Sample Space = 4/52 = 1/13 = 0.077

There are 52 possible outcomes, which is the sample space: the set of all possible outcomes of the random process. Of those 52 possible outcomes, we are concerned with the frequency of an Ace occurring. There are four Aces in the deck, so 4/52 = 0.077.

Assume that the first card was an Ace. Now we ask the question again. If we shuffle the deck, what is the probability the next card drawn is an Ace? It is no longer 1/13, because we did not "sample

with replacement". We sampled without replacement. Thus the new probability is

Pr(Card 2 = Ace | Card 1 = Ace) = 3/51 = 0.059

Under sampling without replacement, the two events (an Ace on Card 1, and an Ace on Card 2 if Card 1 was an Ace) aren't independent events. To make the two events independent, you would have to put the Ace back and shuffle the deck. So two events, A and B, are independent if and only if (iff):

Pr(A | B) = Pr(A)

An example of two independent events would be rolling a 5 with one die after having rolled a 3 with another die. The two events are independent, so the probability of rolling a 5 is always 0.17 regardless of what we rolled on the first die. (The probability of rolling a 5 using one die is 1/6 = 0.167.)

But what if we want to know the probability of some event occurring that requires multiple events, first, to occur? For instance, let's say we're talking about the Cleveland Cavaliers winning the 2016 NBA championship. In 2016, the Golden State Warriors were up 3-0 in a best-of-seven playoff. What had to happen for the Warriors to lose the playoff? The Cavaliers had to win four in a row. In this instance, to find the probability, we have to take the product of all marginal probabilities, or Pr(·)^n, where Pr(·) is the marginal probability of one event occurring, and n is the number of repetitions of that one event. If the probability of each win is 0.5, and each game is independent, then it is the product of each game's probability of winning:

Win probability = Pr(W, W, W, W) = (0.5)^4 = 0.0625

Another example may be helpful. Let's say a batter has a 0.3 probability of getting a hit. Then what is the probability of him getting two hits in a row? The two-hit probability is Pr(HH) = 0.3^2 = 0.09, and the three-hit probability is Pr(HHH) = 0.3^3 = 0.027. Or to keep with our poker example, what is the probability of being dealt pocket Aces? It's 4/52 × 3/51 = 0.0045, or 0.45%.
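These product-rule calculations are simple enough to verify directly. The following snippet (Python, purely illustrative; the book's own examples use Stata) reproduces each number from the text:

```python
# Quick checks of the probability calculations above (pure Python).
p_cavs = 0.5 ** 4                    # four straight wins, 0.5 each, independent
p_two_hits = 0.3 ** 2                # batter with a 0.3 hit probability
p_three_hits = 0.3 ** 3
p_pocket_aces = (4 / 52) * (3 / 51)  # drawing without replacement

print(p_cavs)                   # 0.0625
print(round(p_two_hits, 2))     # 0.09
print(round(p_three_hits, 3))   # 0.027
print(round(p_pocket_aces, 4))  # 0.0045
```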
Let's now formalize what we've been saying for a more generalized case. For independent events, calculating joint probabilities is simply a matter of multiplying the marginal probabilities:

Pr(A, B) = Pr(A) Pr(B)

where Pr(A, B) is the joint probability of both A and B occurring, and Pr(A) is the marginal probability of event A occurring.

Now for a slightly more difficult application. What is the probability of rolling a 7 using two dice, and is it the same as the

probability of rolling a 3? To answer this, let's compare the two. We'll use a table to help explain the intuition. First, let's look at all the ways to get a 7 using two six-sided dice. There are 36 total possible outcomes (6^2 = 36) when rolling two dice. In Table 2 we see that there are six different ways to roll a 7 using only two dice. So the probability of rolling a 7 is 6/36 = 16.67%. Next, let's look at all the ways to get a 3 using two six-sided dice. Table 3 shows that there are only two ways to get a 3 rolling two six-sided dice. So the probability of rolling a 3 is 2/36 = 5.56%. So, no, the probabilities are different.

Table 2: Total number of ways to get a 7 with two six-sided dice.

Die 1   Die 2   Outcome
1       6       7
2       5       7
3       4       7
4       3       7
5       2       7
6       1       7

Table 3: Total number of ways to get a 3 using two six-sided dice.

Die 1   Die 2   Outcome
1       2       3
2       1       3

Events and Conditional Probability. First, before we talk about the three ways of representing a probability, I'd like to introduce some new terminology and concepts: events and conditional probabilities. Let A be some event. And let B be some other event. For two events, there are four possibilities.

1. A and B: Both A and B occur.
2. ~A and B: A does not occur, but B occurs.
3. A and ~B: A occurs, but B does not occur.
4. ~A and ~B: Neither A nor B occurs.

I'll use a couple of different examples to illustrate how to represent a probability.

Probability tree. Let's think about a situation where you are trying to get your driver's license. Suppose that in order to get a driver's license, you have to pass both the written exam and the driving exam. However, if you fail the written exam, you're not allowed to take the driving exam. We can represent these two events in a probability tree.

Written exam
├── Fail (0.1): no driver's license. Pr(Fail) = 0.1
└── Pass (0.9) → Car exam
    ├── Fail (0.4): Pr(Pass ∩ Fail) = 0.9 · 0.4 = 0.36
    └── Pass (0.6): Pr(Pass ∩ Pass) = 0.9 · 0.6 = 0.54

Probability trees are intuitive and easy to interpret. First, we see that the probability of passing the written exam is 0.9 and the probability of failing it is 0.1. Second, at every branching off from a node, we can further see that the probabilities sum to 1.0. For example, the probability of failing the written exam (0.1) plus the probability of passing it (0.9) equals 1.0. The probability of failing the car exam (0.4) plus the probability of passing the car exam (0.6) is 1.0. And finally, the joint probabilities also all sum to 1.0. This is called the law of total probability: the probability of A is equal to the sum of the joint probabilities of A with all of the B_n events occurring.

Pr(A) = Σ_n Pr(A ∩ B_n)

We also see the concept of a conditional probability in this tree. For instance, the probability of failing the car exam conditional on passing the written exam is represented as Pr(Fail | Pass) = 0.4.

Venn diagram. A second way to represent multiple events occurring is with a Venn diagram. Venn diagrams were first conceived by John Venn in 1880 and are used to teach elementary set theory, as well as to express set relationships in probability and statistics. This example will involve two sets, A and B.

Let's return to our earlier example of your team making the playoffs, which determines whether your coach is rehired. Here we remind ourselves of our terms: A and B are events, and U is the universal set of which A and B are subsets. Let A be the probability that

your team makes the playoffs and B is the probability your coach is rehired. Let Pr(A) = 0.6 and let Pr(B) = 0.8. Let the probability that both A and B occur be Pr(A, B) = 0.5.

Note that A + ~A = U, where ~A is the complement of A. The complement means that it is everything in the universal set that is not A. The same is said of B: B + ~B = U. Therefore:

A + ~A = B + ~B

We can rewrite out the following definitions:

A = B + ~B - ~A
B = A + ~A - ~B

Additionally, whenever we want to describe a set of events in which either A or B could occur, it is: A ∪ B. So, here again we can also describe the shaded region as the union of the ~A and ~B sets, ~A ∪ ~B, where ~A and ~B occurring is the outer area of A ∪ B. So again:

A ∪ B + ~A ∩ ~B = 1

Finally, consider the joint sets: those subsets wherein both A and B occur. Using set notation, we can calculate the probabilities, because for a Venn diagram with overlapping circles (events), there are four possible outcomes expressed as joint sets. Notice these relationships:

A ∪ B = A ∩ ~B + A ∩ B + ~A ∩ B
A = A ∩ ~B + A ∩ B
B = A ∩ B + ~A ∩ B

Now it is just simple addition to find all missing values. Recall that A is your team making the playoffs and Pr(A) = 0.6. And B is the probability the coach is rehired, Pr(B) = 0.8. Also, Pr(A, B) = 0.5, which is the probability of both A and B occurring. Then we have:

A = A ∩ ~B + A ∩ B
A ∩ ~B = A - A ∩ B
Pr(A, ~B) = Pr(A) - Pr(A, B)
Pr(A, ~B) = 0.6 - 0.5
Pr(A, ~B) = 0.1

When working with set objects, it is important to understand that probabilities should be measured by considering the share of a larger subset, say A, that some subset takes, such as A ∩ B. When

we write down the probability that A ∩ B occurs at all, it is with regard to U. But what if we were to ask the question, "What share of A is due to A ∩ B?" Notice, then, that we would need to do this:

? = A ∩ B ÷ A
? = 0.5 ÷ 0.6
? = 0.83

I left the left-hand side intentionally undefined so as to focus on the calculation itself. But now let's define what we want to calculate: "In a world where A has occurred, what is the probability that B will also occur?" This is:

Pr(B | A) = Pr(A, B) / Pr(A) = 0.5/0.6 = 0.83
Pr(A | B) = Pr(A, B) / Pr(B) = 0.5/0.8 = 0.63

Notice, these conditional probabilities are not as easily seen in the Venn diagram. We are essentially asking what percent of a subset, e.g., Pr(A), is due to the joint, e.g., Pr(A, B). That is the notion of the conditional probability.

Contingency tables. The last way that we can represent events is with a contingency table. Contingency tables are also sometimes called two-way tables. Table 4 is an example of a contingency table. We continue our example about the coach.

Table 4: Two-way contingency table.

Event labels        Coach is not rehired (~B)   Coach is rehired (B)   Total
(A) team playoffs   Pr(A, ~B) = 0.1             Pr(A, B) = 0.5         Pr(A) = 0.6
(~A) no playoffs    Pr(~A, ~B) = 0.1            Pr(~A, B) = 0.3        Pr(~A) = 0.4
Total               Pr(~B) = 0.2                Pr(B) = 0.8            1.0

Recall that Pr(A) = 0.6, Pr(B) = 0.8, and Pr(A, B) = 0.5. All probabilities must sum correctly. Note that to calculate conditional probabilities, we must ask the frequency of the element in question (e.g., Pr(A, B)) relative to some other larger event (e.g., Pr(A)).
So if we want to ask, "What is the conditional probability of B given A?", it's:

Pr(B | A) = Pr(A, B) / Pr(A) = 0.5/0.6 = 0.83

And to ask the frequency of A ∩ B in a world where B occurs is to ask the following:

Pr(A | B) = Pr(A, B) / Pr(B) = 0.5/0.8 = 0.63
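A quick sketch (Python, illustrative rather than from the book, using only the numbers in Table 4) confirms both conditional probabilities:

```python
# Joint probabilities from Table 4 (A: playoffs, B: coach rehired)
pr = {("A", "B"): 0.5, ("A", "~B"): 0.1, ("~A", "B"): 0.3, ("~A", "~B"): 0.1}

# Marginals via the law of total probability
pr_A = pr[("A", "B")] + pr[("A", "~B")]  # 0.6
pr_B = pr[("A", "B")] + pr[("~A", "B")]  # 0.8

pr_B_given_A = pr[("A", "B")] / pr_A     # 0.5 / 0.6
pr_A_given_B = pr[("A", "B")] / pr_B     # 0.5 / 0.8

print(round(pr_B_given_A, 3))  # 0.833
print(round(pr_A_given_B, 3))  # 0.625
```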

So, we can use what we have done so far to write out a definition of joint probability. Let's start with a definition of conditional probability first. Given two events, A and B:

Pr(A | B) = Pr(A, B) / Pr(B)        (1)
Pr(B | A) = Pr(B, A) / Pr(A)        (2)
Pr(A, B) = Pr(B, A)                 (3)
Pr(A) = Pr(A, ~B) + Pr(A, B)        (4)
Pr(B) = Pr(A, B) + Pr(~A, B)        (5)

Using equations 1 and 2, I can simply write down a definition of joint probabilities:

Pr(A, B) = Pr(A | B) Pr(B)          (6)
Pr(B, A) = Pr(B | A) Pr(A)          (7)

And this is simply the formula for joint probability. Given equation 3, and using the definitions of Pr(A, B) and Pr(B, A), I can also rearrange terms, make a substitution, and rewrite it as:

Pr(A | B) Pr(B) = Pr(B | A) Pr(A)
Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)     (8)

Equation 8 is sometimes called the naive version of Bayes' rule. We will now decompose this more fully, though, by substituting equation 5 into equation 8:

Pr(A | B) = Pr(B | A) Pr(A) / (Pr(A, B) + Pr(~A, B))     (9)

Substituting equation 6 into the denominator of equation 9 yields:

Pr(A | B) = Pr(B | A) Pr(A) / (Pr(B | A) Pr(A) + Pr(~A, B))     (10)

Finally, we note that, using the definition of joint probability, Pr(~A, B) = Pr(B | ~A) Pr(~A), which we substitute into the denominator of equation 10 to get:

Pr(A | B) = Pr(B | A) Pr(A) / (Pr(B | A) Pr(A) + Pr(B | ~A) Pr(~A))     (11)

That's a mouthful of substitutions, so what does equation 11 mean? This is the Bayesian decomposition version of Bayes' rule. Let's use our example again of a team making the playoffs. A is your team makes the playoffs and B is your coach gets rehired. And A ∩ B is the

joint probability that both events occur. We can make each calculation using the contingency table. The question here is: "If the coach is rehired, what's the probability that my team made the playoffs?" Or, formally, Pr(A | B). We can use the Bayesian decomposition to find what this equals.

Pr(A | B) = Pr(B | A) Pr(A) / (Pr(B | A) Pr(A) + Pr(B | ~A) Pr(~A))
          = (0.83 · 0.6) / (0.83 · 0.6 + 0.75 · 0.4)
          = 0.498 / (0.498 + 0.3)
          = 0.498 / 0.798
          = 0.624

Check this against the contingency table using the definition of joint probability:

Pr(A | B) = Pr(A, B) / Pr(B) = 0.5 / 0.8 = 0.625

Why are they different? Because 0.83 is a rounded approximation of Pr(B | A), which is technically 0.8333 with the 3 repeating. So, if my coach is rehired, there is a 63 percent chance we will win.

Monty Hall example. Let's use a different example. This is a fun one, because most people find it counterintuitive. It even used to stump mathematicians and statisticians. But Bayes' rule makes the answer very clear. (There's a fun story in which someone posed this question to the columnist Marilyn vos Savant, and she got it right. People wrote in, calling her stupid, but it turns out she was right. You can read the story here.)

Let's assume three closed doors: Door 1 (D1), Door 2 (D2), and Door 3 (D3). Behind one of the doors is a million dollars. Behind each of the other two doors is a goat. Monty Hall, the game-show host in this example, asks the contestants to pick a door. After they have picked the door, but before he opens their door, he opens one of the other two doors to reveal a goat. He then asks the contestant, "Would you like to switch doors?"

Many people say it makes no sense to change doors, because (they say) there's an equal chance that the million dollars is behind either door. Therefore, why switch?
There’s a 50 / 50 chance it’s behind the door picked and there’s a / 50 chance it’s behind the 50 remaining door, so it makes no rational sense to switch. Right? Yet, a little intuition should tell you that’s not the right answer, because it would seem that when Monty Hall opened that third door, he told us something . But what did he tell us exactly? Let’s formalize the problem using our probability notation. As- sume that you chose door 1 , D . What was the probability that 1

D1 had a million dollars when you made that choice? The probability that D1 has a million dollars at the start of the game is 1/3, because the sample space is 3 doors, of which one has a million dollars. We will call the event that D1 has the million dollars A1; thus Pr(A1) = 1/3. Also, by the law of total probability, Pr(~A1) = 2/3. Let's say that Monty Hall had opened door 2, D2, to reveal a goat. Then he asked, "Would you like to change to door number 3?"

We need to know the probability that Door 3 has the million dollars and compare that to Door 1's probability. We will call the opening of Door 2 event B. We will call the probability that the million dollars is behind door i, A_i. We now write out the question just asked formally and decompose it using the Bayesian decomposition. We are ultimately interested in knowing, "What is the probability that Door 1 has a million dollars (event A1), given that Monty Hall opened Door 2 (event B)?", which is a conditional probability question. Let's write out that conditional probability now using the Bayesian decomposition from equation 11:

Pr(A1 | B) = Pr(B | A1) Pr(A1) / (Pr(B | A1) Pr(A1) + Pr(B | A2) Pr(A2) + Pr(B | A3) Pr(A3))     (12)

There are basically two kinds of probabilities on the right-hand side. There's the marginal probability that the million dollars is behind a given door, Pr(A_i). And there's the conditional probability that Monty Hall would open Door 2 given that the million dollars is behind Door A_i, Pr(B | A_i).

The marginal probability that Door 1 has the million dollars without any additional information is 1/3. We call this the prior probability, or prior belief. It may also be called the unconditional probability.

The conditional probability, Pr(B | A_i), requires a little more careful thinking. Take the first conditional probability, Pr(B | A1).
In a world where Door 1 has the million dollars, what's the probability that Monty Hall would open door number 2? Think about it for a second.

Let's think about the second conditional probability, Pr(B | A2). In a world where the money is behind Door 2, what's the probability that Monty Hall would open Door 2? Think about it, too, for a second.

And then the last conditional probability, Pr(B | A3). In a world where the money is behind Door 3, what's the probability that Monty Hall will open Door 2?

Each of these conditional probabilities requires thinking carefully about the feasibility of the events in question. Let's examine the easiest question: Pr(B | A2). In a world where the money is behind Door 2, how likely is it for Monty Hall to open that same door, Door 2? Keep in mind: this is a game show. So that gives you some idea about how the game-show host will behave. Do you think Monty

Hall would open a door that had the million dollars behind it? After all, isn't he opening doors that don't have it in order to ask you whether you should switch? It makes no sense to think he'd ever open a door that actually had the money behind it. Historically, even, he always opens a door with a goat. So don't you think he's only opening doors with goats? Let's see what happens if we take that intuition to its logical extreme and conclude that Monty Hall never opens a door if it has a million dollars behind it. He only opens doors if those doors have a goat. Under that assumption, we can proceed to estimate Pr(A1 | B) by substituting values for Pr(B | A_i) and Pr(A_i) into the right-hand side of equation 12.

What then is Pr(B | A1)? That is, in a world where you have chosen Door 1, and the money is behind Door 1, what is the probability that he would open Door 2? There are two doors he could open if the money is behind Door 1: he could open either Door 2 or Door 3, as both have a goat. So Pr(B | A1) = 0.5.

What about the second conditional probability, Pr(B | A2)? In a world where the money is behind Door 2, what's the probability he will open it? Under our assumption that he never opens the door if it has a million dollars, we know this probability is 0.0.

And finally, what about the third probability, Pr(B | A3)? What is the probability he opens Door 2 given that the money is behind Door 3? Now consider this one carefully: you have already chosen Door 1, so he can't open that one. And he can't open Door 3, because that has the money. The only door, therefore, he could open is Door 2. Thus, this probability is 1.0. And all the Pr(A_i) = 1/3, allowing us to solve for the conditional probability on the left-hand side through substitution, multiplication, and division.
Pr(A1 | B) = (1/2 · 1/3) / (1/2 · 1/3 + 0 · 1/3 + 1.0 · 1/3)
           = (1/6) / (1/6 + 2/6)
           = 1/3

Now, let's consider the probability that Door 3 has the million dollars, as that's the only door we can switch to, since Door 2 has been opened for us already.

Pr(A3 | B) = Pr(B | A3) Pr(A3) / (Pr(B | A1) Pr(A1) + Pr(B | A2) Pr(A2) + Pr(B | A3) Pr(A3))
           = (1.0 · 1/3) / (1/2 · 1/3 + 0 · 1/3 + 1.0 · 1/3)
           = 2/3
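If the algebra feels slippery, the 1/3 versus 2/3 answer can also be checked by brute-force simulation. This is an illustrative Python sketch, not from the book, built on the same assumption stated above: Monty never opens your door and never opens the prize door.

```python
import random

random.seed(1)

def monty_hall_trial(switch):
    """One round of the game; returns True if the final pick wins."""
    doors = [1, 2, 3]
    prize = random.choice(doors)
    pick = 1  # you always pick Door 1, as in the text
    # Monty opens a door that is neither your pick nor the prize
    opened = random.choice([d for d in doors if d != pick and d != prize])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

n = 100_000
stay_wins = sum(monty_hall_trial(switch=False) for _ in range(n)) / n
switch_wins = sum(monty_hall_trial(switch=True) for _ in range(n)) / n
print(round(stay_wins, 2), round(switch_wins, 2))  # about 0.33 and 0.67
```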

Ah hah. Now isn't that just a little bit surprising? The probability that you are holding the correct door is 1/3, just as it was before Monty Hall opened Door 2. But now the probability that it is behind Door 3 has increased from its prior probability. The prior probability, Pr(A3) = 1/3, changed to a new conditional probability value, Pr(A3 | B) = 2/3. This new conditional probability is called the posterior probability, or posterior belief. (It's okay to giggle every time you say "posterior". I do!) Given the information you learned from witnessing B, we correctly updated our beliefs about the likelihood that Door 3 had the million dollars.

Exercises. Driving while intoxicated is defined as operating a motor vehicle with a blood alcohol content (BAC) at or above 0.08%. Standardized field sobriety tests (SFSTs) are often used as tools by officers in the field to determine if an arrest followed by a breath test is justified. However, breath tests are often not available in court for a variety of reasons, and under those circumstances the SFSTs are frequently used as an indication of impairment, and sometimes as an indicator that the subject has a BAC at or above 0.08%.

Stuster and Burns [1998] conducted an experiment to estimate the accuracy of SFSTs. Seven San Diego Police Department officers administered SFSTs to those stopped for suspicion of driving under the influence of alcohol. The officers were instructed to carry out the SFSTs on the subjects, and then to note an estimated BAC based only on the SFST results. (In case you're interested, the SFST consists of three tests: the walk-and-turn test, the one-leg-stand test, and the horizontal gaze nystagmus test.) Subjects driving appropriately were not stopped or tested. However, "poor drivers" were included because they attracted the attention of the officers. The officers were asked to estimate the BAC values using SFSTs only.
Some of the subjects were arrested and given a breath test. The criteria used by the officers for estimating BAC were not described in the Stuster and Burns [1998] study, and several studies have concluded that officers were using the SFSTs to then subjectively guess at the subjects' BAC. (The data collected included gender and age, but not race, body weight, presence of prior injuries, or other factors that might influence the SFSTs or the measured BAC.) There were 297 subjects in the original data. The raw data is reproduced below.

                 MBAC ≥ 0.08%   MBAC < 0.08%
EBAC ≥ 0.08%     n = 210        n = 24
EBAC < 0.08%     n = 59         n = 4

1. Represent the events above using a probability tree, two-way tables, and a Venn diagram. Calculate the marginal probability of each event, the joint probability, and the conditional probability.

2. Let F be the event where a driver fails the SFST with an estimated BAC (EBAC) at or above 0.08, and ~F be an event where the

driver passes (EBAC < 0.08). Let I be the event wherein a driver is impaired by alcohol, with an actual or measured BAC (MBAC) at or above 0.08%, and let ~I be the event where MBAC < 0.08. Use Bayes' rule to decompose the conditional probability, Pr(I | F), into the correct expression. Label the prior beliefs, posterior beliefs, false and true positives, and false and true negatives, if and where they apply. Show that the posterior belief calculated using Bayes' rule is equal to the value that could be directly calculated using the sample information above. Interpret the posterior belief in plain language. What does it mean?

3. Assume that, because of concerns about profiling, a new policy is enacted: police must randomly pull over automobiles for suspected driving while intoxicated and apply the SFST to all drivers. Using survey information, such as from Gallup or some other reputable survey, what percent of the US population drinks and drives? If the test statistics from the sample are correct, then how likely is it that someone who fails the SFST is impaired under this new policy?
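For exercise 3, the mechanics of the base-rate calculation look like the following sketch (Python, illustrative). Every number here is a placeholder that you should replace with the survey share and the error rates you derive from the table above; none of these inputs come from the text:

```python
# Hypothetical inputs purely for illustration -- substitute your own.
prior = 0.01           # assumed share of randomly stopped drivers who are impaired
pr_F_given_I = 0.9     # assumed true positive rate of the SFST
pr_F_given_not_I = 0.1 # assumed false positive rate of the SFST

# Bayesian decomposition (equation 11) applied to Pr(I | F)
posterior = (pr_F_given_I * prior) / (
    pr_F_given_I * prior + pr_F_given_not_I * (1 - prior)
)
print(round(posterior, 3))  # with these made-up inputs: 0.083
```

Notice how a small prior drags the posterior far below the test's true positive rate, which is the whole point of the exercise.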

Properties of Regression

"I like cool people, but why should I care? Because I'm busy tryna fit a line with the least squares" – J-Wong

Summation operator. Now we move on to a review of the least squares estimator. (This chapter draws heavily from Wooldridge [2010] and Wooldridge [2015]. All errors are my own.) Before we begin, let's introduce some new notation, starting with the summation operator. The Greek letter Σ (the capital Sigma) denotes the summation operator. Let x_1, x_2, ..., x_n be a sequence of numbers. We can compactly write a sum of numbers using the summation operator as:

Σ_{i=1}^{n} x_i ≡ x_1 + x_2 + ... + x_n

The letter i is called the index of summation. Other letters, such as j or k, are sometimes used as indices of summation. The subscripted variable simply represents a specific value of a random variable, x. The numbers 1 and n are the lower limit and upper limit of the summation. The expression Σ_{i=1}^{n} x_i can be stated in words as "sum the numbers x_i for all values of i from 1 to n". An example can help clarify:

Σ_{i=6}^{9} x_i = x_6 + x_7 + x_8 + x_9

The summation operator has three properties. The first property is called the constant rule. Formally, for any constant c:

Σ_{i=1}^{n} c = nc     (13)

Let's consider an example. Say that we are given:

Σ_{i=1}^{3} 5 = (5 + 5 + 5) = 3 · 5 = 15

A second property of the summation operator is:

Σ_{i=1}^{n} c x_i = c Σ_{i=1}^{n} x_i     (14)

Again, let's use an example. Say we are given:

Σ_{i=1}^{3} 5x_i = 5x_1 + 5x_2 + 5x_3 = 5(x_1 + x_2 + x_3) = 5 Σ_{i=1}^{3} x_i

We can apply both of these properties to get the following third property. For any constants a and b:

Σ_{i=1}^{n} (a x_i + b y_i) = a Σ_{i=1}^{n} x_i + b Σ_{i=1}^{n} y_i

Before leaving the summation operator, it is useful to also note things which are not properties of this operator. Two things which summation operators cannot do:

Σ_{i=1}^{n} (x_i / y_i) ≠ (Σ_{i=1}^{n} x_i) / (Σ_{i=1}^{n} y_i)

Σ_{i=1}^{n} x_i² ≠ (Σ_{i=1}^{n} x_i)²

We can use the summation operator to make a number of calculations, some of which we will do repeatedly over the course of this book. For instance, we can use the summation operator to calculate the average:

x̄ = (1/n) Σ_{i=1}^{n} x_i = (x_1 + x_2 + ... + x_n) / n     (15)

where x̄ is the average (mean) of the random variable x_i. Another calculation we can make is a random variable's deviations from its own mean. The sum of the deviations from the mean is always equal to zero:

Σ_{i=1}^{n} (x_i - x̄) = 0     (16)

Let's illustrate this with an example in Table 5.

Consider a sequence of two numbers {y_1, y_2, ..., y_n} and {x_1, x_2, ..., x_n}. Then we may consider double summations over possible values of x's and y's. For example, consider the case where n = m = 2.

Table 5: Sum of deviations equalling zero

$x$    $x - \overline{x}$
10     2
4      −4
13     5
5      −3
Mean = 8   Sum = 0

$\sum_{i=1}^{2}\sum_{j=1}^{2} x_i y_j$ is equal to $x_1y_1 + x_1y_2 + x_2y_1 + x_2y_2$. This is because:

$$x_1y_1 + x_1y_2 + x_2y_1 + x_2y_2 = x_1(y_1 + y_2) + x_2(y_1 + y_2)$$
$$= \sum_{i=1}^{2} x_i (y_1 + y_2)$$
$$= \sum_{i=1}^{2} x_i \bigg(\sum_{j=1}^{2} y_j\bigg)$$
$$= \sum_{i=1}^{2} \bigg(\sum_{j=1}^{2} x_i y_j\bigg)$$
$$= \sum_{i=1}^{2} \sum_{j=1}^{2} x_i y_j$$

One result that will be very useful throughout the semester is:

$$\sum_{i=1}^{n} (x_i - \overline{x})^2 = \sum_{i=1}^{n} x_i^2 - n\overline{x}^2 \qquad (17)$$

An overly long, step-by-step proof is below. Note that the summation index is suppressed after the first line for brevity's sake.

$$\sum_{i=1}^{n} (x_i - \overline{x})^2 = \sum_{i=1}^{n} (x_i^2 - 2x_i\overline{x} + \overline{x}^2)$$
$$= \sum x_i^2 - 2\overline{x}\sum x_i + n\overline{x}^2$$
$$= \sum x_i^2 - 2n\overline{x}\bigg(\frac{1}{n}\sum x_i\bigg) + n\overline{x}^2$$
$$= \sum x_i^2 - 2n\overline{x}^2 + n\overline{x}^2$$
$$= \sum x_i^2 - n\overline{x}^2$$
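Equation (17) is easy to confirm on the Table 5 numbers. A minimal Python check:

```python
# Verify equation (17): sum of squared deviations equals
# sum of squares minus n times the squared mean.
xs = [10, 4, 13, 5]
n = len(xs)
x_bar = sum(xs) / n

lhs = sum((x - x_bar) ** 2 for x in xs)
rhs = sum(x ** 2 for x in xs) - n * x_bar ** 2
print(lhs, rhs)  # 54.0 54.0
```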

A more general version of this result is:

$$\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y}) = \sum_{i=1}^{n} x_i(y_i - \overline{y}) = \sum_{i=1}^{n} x_i y_i - n\,\overline{x}\,\overline{y} \qquad (18)$$

Or:

$$\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y}) = \sum_{i=1}^{n} (x_i - \overline{x})y_i = \sum_{i=1}^{n} x_i y_i - n\,\overline{x}\,\overline{y} \qquad (19)$$

Expected value   The expected value of a random variable, also called the expectation and sometimes the population mean, is simply the weighted average of the possible values that the variable can take, with the weights being given by the probability of each value occurring in the population. Suppose that the variable $X$ can take on values $x_1, x_2, \dots, x_k$, each with probability $f(x_1), f(x_2), \dots, f(x_k)$ respectively. Then we define the expected value of $X$ as:

$$E(X) = x_1 f(x_1) + x_2 f(x_2) + \dots + x_k f(x_k) = \sum_{j=1}^{k} x_j f(x_j) \qquad (20)$$

Let's look at a numerical example. Say $X$ takes on the values −1, 0 and 2 with probabilities 0.3, 0.3 and 0.4, respectively. [Sidenote 16: Recall that the law of total probability requires that all marginal probabilities sum to unity.] Then the expected value of $X$ equals:

$$E(X) = (-1)(0.3) + (0)(0.3) + (2)(0.4) = 0.5$$

In fact you could take the expectation of a function of that variable, too, such as $X^2$. Note that $X^2$ takes only the values 1, 0 and 4, with probabilities 0.3, 0.3 and 0.4. Calculating the expected value of $X^2$ therefore is:

$$E(X^2) = (-1)^2(0.3) + (0)^2(0.3) + (2)^2(0.4) = 1.9$$

The first property of expected value is that for any constant $c$, $E(c) = c$. The second property is that for any two constants $a$ and $b$, $E(aX + b) = E(aX) + E(b) = aE(X) + b$. And the third property is that if we have numerous constants, $a_1, \dots, a_n$, and many random variables, $X_1, \dots, X_n$, then the following is true:

$$E(a_1 X_1 + \dots + a_n X_n) = a_1 E(X_1) + \dots + a_n E(X_n)$$
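The numerical example above is just a weighted sum over the probability mass function, which takes two lines of Python to reproduce:

```python
# Reproduce the text's expected-value example: X takes values -1, 0, 2
# with probabilities 0.3, 0.3, 0.4.
pmf = {-1: 0.3, 0: 0.3, 2: 0.4}

EX  = sum(x * p for x, p in pmf.items())        # equation (20)
EX2 = sum(x**2 * p for x, p in pmf.items())     # expectation of a function of X
print(round(EX, 10))   # 0.5
print(round(EX2, 10))  # 1.9
```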

We can also express this using the expectation operator:

$$E\bigg(\sum_{i=1}^{n} a_i X_i\bigg) = \sum_{i=1}^{n} a_i E(X_i)$$

And in the special case where $a_i = 1$, then

$$E\bigg(\sum_{i=1}^{n} X_i\bigg) = \sum_{i=1}^{n} E(X_i)$$

Variance   The expectation operator, $E(\cdot)$, is a population concept. It refers to the whole group of interest, not just the sample we have available to us. Its intuition is loosely similar to the average of a random variable in the population. Some additional properties for the expectation operator can be explained assuming two random variables, $W$ and $H$:

$$E(aW + b) = aE(W) + b \text{ for any constants } a, b$$
$$E(W + H) = E(W) + E(H)$$
$$E\big(W - E(W)\big) = 0$$

Consider the variance of a random variable, $W$:

$$V(W) = \sigma^2 = E\big[(W - E(W))^2\big] \text{ in the population}$$

We can show that:

$$V(W) = E(W^2) - E(W)^2 \qquad (21)$$

In a given sample of data, we can estimate the variance by the following calculation:

$$\widehat{S}^2 = (n-1)^{-1} \sum_{i=1}^{n} (x_i - \overline{x})^2$$

where we divide by $n - 1$ because we are making a degrees-of-freedom adjustment from estimating the mean. But in large samples, this degrees-of-freedom adjustment has no practical effect on the value of $\widehat{S}^2$. [Sidenote 17: Whenever possible, I try to use the "hat" to represent an estimated statistic. Hence $\widehat{S}^2$ instead of just $S^2$. But it is probably more common to see the sample variance represented as $S^2$.]

A few more properties of variance. First, the variance of a line is:

$$V(aX + b) = a^2 V(X)$$

And the variance of a constant is zero (i.e., $V(c) = 0$ for any constant $c$). The variance of the sum of two random variables is equal to:

$$V(X + Y) = V(X) + V(Y) + 2\big(E(XY) - E(X)E(Y)\big) \qquad (22)$$

If the two variables are independent, then $E(XY) = E(X)E(Y)$ and $V(X + Y)$ is just equal to the sum $V(X) + V(Y)$.
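Both the "shortcut" variance formula (21) and the $n-1$ sample-variance adjustment can be checked numerically. A small Python sketch, reusing the discrete distribution and the Table 5 numbers from earlier in the chapter:

```python
# Check V(W) = E(W^2) - E(W)^2 on the discrete pmf from the text,
# then compute the (n-1)-adjusted sample variance of the Table 5 numbers.
pmf = {-1: 0.3, 0: 0.3, 2: 0.4}
EW  = sum(w * p for w, p in pmf.items())
EW2 = sum(w**2 * p for w, p in pmf.items())

var_direct   = sum((w - EW)**2 * p for w, p in pmf.items())
var_shortcut = EW2 - EW**2                     # equation (21)
print(abs(var_direct - var_shortcut) < 1e-12)  # True

xs = [10, 4, 13, 5]
n = len(xs)
x_bar = sum(xs) / n
s2 = sum((x - x_bar)**2 for x in xs) / (n - 1)  # degrees-of-freedom adjustment
print(s2)  # 18.0
```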

Covariance   The last part of equation 22 is called the covariance. The covariance measures the amount of linear dependence between two random variables. We represent it with the $C(X,Y)$ operator. $C(X,Y) > 0$ indicates that two variables move in the same direction, whereas $C(X,Y) < 0$ indicates they move in opposite directions. Thus we can rewrite equation 22 as:

$$V(X + Y) = V(X) + V(Y) + 2C(X,Y)$$

While it's tempting to say that a zero covariance means two random variables are unrelated, that is incorrect. They could have a nonlinear relationship. The definition of covariance is

$$C(X,Y) = E(XY) - E(X)E(Y) \qquad (23)$$

As we said, if $X$ and $Y$ are independent, then $C(X,Y) = 0$ in the population. [Sidenote 18: It may be redundant to keep saying this, but since we've been talking about only the population this whole time, I wanted to stress it again for the reader.] The covariance between two linear functions is:

$$C(a_1 + b_1 X,\; a_2 + b_2 Y) = b_1 b_2\, C(X,Y)$$

The two constants, $a_1$ and $a_2$, zero out because their mean is themselves and so the difference equals zero.

Interpreting the magnitude of the covariance can be tricky. For that, we are better served looking at correlation. We define correlation as follows. Let $W = \frac{X - E(X)}{\sqrt{V(X)}}$ and $Z = \frac{Y - E(Y)}{\sqrt{V(Y)}}$. Then:

$$Corr(W, Z) = \frac{C(X,Y)}{\sqrt{V(X)V(Y)}} \qquad (24)$$

The correlation coefficient is bounded between –1 and 1. A positive (negative) correlation indicates that the variables move in the same (opposite) ways. The closer to 1 or –1, the stronger the linear relationship is.

Population model   We begin with cross-sectional analysis. We will also assume that we can collect a random sample from the population of interest. Assume there are two variables, $x$ and $y$, and we would like to see how $y$ varies with changes in $x$. [Sidenote 19: Notice – this is not necessarily causal language. We are speaking first and generally just in terms of two random variables systematically moving together in some measurable way.]

There are three questions that immediately come up. One, what if $y$ is affected by factors other than $x$? How will we handle that? Two, what is the functional form connecting these two variables? Three, if we are interested in the causal effect of $x$ on $y$, then how can we distinguish that from mere correlation? Let's start with a specific model.

$$y = \beta_0 + \beta_1 x + u \qquad (25)$$

This model is assumed to hold in the population. Equation 25 defines a linear bivariate regression model. For causal inference, the terms
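One way to internalize the difference between covariance and correlation is that rescaling a variable rescales its covariance with anything else, but leaves the correlation untouched. A small simulated sketch in Python (hypothetical data; variable names are my own):

```python
import random

# Covariance is scale-dependent; correlation (equation 24) is not.
random.seed(1)
n = 10_000
x = [random.gauss(0, 3) for _ in range(n)]
y = [2 * xi + random.gauss(0, 6) for xi in x]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def corr(a, b):
    return cov(a, b) / (cov(a, a) * cov(b, b)) ** 0.5

print(round(corr(x, y), 2))                   # about 0.71 in this simulation
x10 = [10 * xi for xi in x]                   # rescale x by 10
print(abs(corr(x10, y) - corr(x, y)) < 1e-9)  # True: correlation unchanged
print(round(cov(x10, y) / cov(x, y), 6))      # 10.0: covariance scales
```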

on the left-hand side are usually thought of as the effect, and the terms on the right-hand side are thought of as the causes.

Equation 25 explicitly allows for other factors to affect $y$ by including a random variable called the error term, $u$. This equation also explicitly models the functional form by assuming that $y$ is linearly dependent on $x$. We call the $\beta_0$ coefficient the intercept parameter, and we call the $\beta_1$ coefficient the slope parameter. These, note, describe a population, and our goal in empirical work is to estimate their values. I will emphasize this several times throughout this book: we never directly observe these parameters, because they are not data. What we can do, though, is estimate these parameters using data and assumptions. We just have to have credible assumptions to accurately estimate these parameters with data. We will return to this point later. In this simple regression framework, all unobserved variables are subsumed by the error term.

First, we make a simplifying assumption without loss of generality. Let the expected value of $u$ be zero in the population. Formally:

$$E(u) = 0 \qquad (26)$$

where $E(\cdot)$ is the expected value operator discussed earlier. Normalizing ability to be zero in the population is harmless. Why? Because the presence of $\beta_0$ (the intercept term) always allows us this flexibility. If the average of $u$ is different from zero – for instance, say that it's $\alpha_0$ – then we just adjust the intercept. Adjusting the intercept has no effect on the $\beta_1$ slope parameter, though:

$$y = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0)$$

where $\alpha_0 = E(u)$. The new error term is $u - \alpha_0$ and the new intercept term is $\beta_0 + \alpha_0$. But while those two terms changed, notice what did not change: the slope, $\beta_1$.

Mean independence   An assumption that meshes well with our elementary treatment of statistics involves the mean of the error term for each "slice" of the population determined by values of $x$:

$$E(u \mid x) = E(u) \text{ for all values } x \qquad (27)$$

where $E(u \mid x)$ means the "expected value of $u$ given $x$". If equation 27 holds, then we say that $u$ is mean independent of $x$. An example might help here. Let's say we are estimating the effect of schooling on wages, and $u$ is unobserved ability. Mean independence requires that

$$E[\text{ability} \mid x = 8] = E[\text{ability} \mid x = 12] = E[\text{ability} \mid x = 16]$$

so that the average ability is the same in the different portions of the population with an 8th grade education, a 12th grade education and a college

education. Because people choose education, though, based partly on that unobserved ability, equation 27 is almost certainly violated in this actual example.

Combining this new assumption, $E[u \mid x] = E[u]$ (a non-trivial assumption to make), with $E[u] = 0$ (a normalization and trivial assumption), you get the following new assumption:

$$E(u \mid x) = 0, \text{ for all values } x \qquad (28)$$

Equation 28 is called the zero conditional mean assumption and is a key identifying assumption in regression models. Because the conditional expected value is a linear operator, $E(u \mid x) = 0$ implies

$$E(y \mid x) = \beta_0 + \beta_1 x$$

which shows the population regression function is a linear function of $x$, or what Angrist and Pischke [2009] call the conditional expectation function. [Sidenote 20: Notice that the conditional expectation passed through the linear function, leaving a constant (because of the first property of the expectation operator) and a constant times $x$. This is because the conditional expectation of $E[X \mid X] = X$. This leaves us with $E[u \mid X]$, which under zero conditional mean is equal to zero.] This relationship is crucial for the intuition of the parameter $\beta_1$ as a causal parameter.

Least Squares   Given data on $x$ and $y$, how can we estimate the population parameters, $\beta_0$ and $\beta_1$? Let $\{(x_i, y_i) : i = 1, 2, \dots, n\}$ be a random sample of size $n$ (the number of observations) from the population. Plug any observation into the population equation:

$$y_i = \beta_0 + \beta_1 x_i + u_i$$

where $i$ indicates a particular observation. We observe $y_i$ and $x_i$ but not $u_i$. We just know that $u_i$ is there. We then use the two population restrictions that we discussed earlier:

$$E(u) = 0$$
$$C(x, u) = 0$$

to obtain estimating equations for $\beta_0$ and $\beta_1$. We talked about the first condition already. The second one, though, means that $x$ and $u$ are uncorrelated, because recall that covariance is the numerator of the correlation equation (equation 24). Both of these conditions are implied by equation 28:

$$E(u \mid x) = 0$$

With $E(xu) = 0$, we get $E(u) = 0$ and $C(x, u) = 0$. Notice that if $x$ and $u$ are independent, then $C(x, u) = 0$, though the converse need not hold. [Sidenote 21: See equation 23.] Next we plug in for $u$, which is equal to $y - \beta_0 - \beta_1 x$:

$$E(y - \beta_0 - \beta_1 x) = 0$$
$$E\big(x[y - \beta_0 - \beta_1 x]\big) = 0$$

These are the two conditions in the population that effectively determine $\beta_0$ and $\beta_1$. And again, note that the notation here describes population concepts. We don't have access to populations, though we do have their sample counterparts:

$$\frac{1}{n}\sum_{i=1}^{n} \big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big) = 0 \qquad (29)$$

$$\frac{1}{n}\sum_{i=1}^{n} x_i\big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big) = 0 \qquad (30)$$

where $\widehat{\beta}_0$ and $\widehat{\beta}_1$ are the estimates from the data. [Sidenote 22: Notice that we are dividing by $n$, not $n - 1$. There is no degrees-of-freedom correction, in other words, when using samples to calculate means. There is a degrees-of-freedom correction when we start calculating higher moments.] These are two linear equations in the two unknowns $\widehat{\beta}_0$ and $\widehat{\beta}_1$. Recall the properties of the summation operator as we work through the following sample properties of these two equations. We begin with equation 29 and pass the summation operator through:

$$\frac{1}{n}\sum_{i=1}^{n} \big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big) = \frac{1}{n}\sum_{i=1}^{n} y_i - \widehat{\beta}_0 - \widehat{\beta}_1\bigg(\frac{1}{n}\sum_{i=1}^{n} x_i\bigg) = \overline{y} - \widehat{\beta}_0 - \widehat{\beta}_1 \overline{x}$$

where $\overline{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, which is the average of the $n$ numbers $\{y_i : 1, \dots, n\}$. For emphasis we will call $\overline{y}$ the sample average. We have already shown that the first equation equals zero (equation 29), so this implies $\overline{y} = \widehat{\beta}_0 + \widehat{\beta}_1 \overline{x}$. So we now use this equation to write the intercept in terms of the slope:

$$\widehat{\beta}_0 = \overline{y} - \widehat{\beta}_1 \overline{x}$$

We now plug $\widehat{\beta}_0$ into the second equation, $\sum_{i=1}^{n} x_i(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i) = 0$. This gives us the following (with some simple algebraic manipulation):

$$\sum_{i=1}^{n} x_i\big(y_i - [\overline{y} - \widehat{\beta}_1 \overline{x}] - \widehat{\beta}_1 x_i\big) = 0$$

$$\sum_{i=1}^{n} x_i(y_i - \overline{y}) = \widehat{\beta}_1 \sum_{i=1}^{n} x_i(x_i - \overline{x})$$

So the equation to solve is [Sidenote 23: Recall from much earlier that $\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}) = \sum_{i=1}^{n} x_i(y_i - \overline{y})$ and $\sum_{i=1}^{n}(x_i - \overline{x})^2 = \sum_{i=1}^{n} x_i(x_i - \overline{x})$.]

$$\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}) = \widehat{\beta}_1 \sum_{i=1}^{n}(x_i - \overline{x})^2$$

If $\sum_{i=1}^{n}(x_i - \overline{x})^2 \neq 0$, we can write:

$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2} = \frac{\text{Sample covariance}(x_i, y_i)}{\text{Sample variance}(x_i)} \qquad (31)$$
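Equation (31), together with the intercept formula $\widehat{\beta}_0 = \overline{y} - \widehat{\beta}_1 \overline{x}$, is everything needed to compute OLS by hand. A minimal Python sketch (function and variable names are my own; the book does these computations in Stata):

```python
# Compute the OLS slope (equation 31) and intercept from a sample.
def ols(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# On exactly linear data (no error term), OLS recovers the line exactly.
x = [1, 2, 3, 4, 5]
y = [3 + 2 * xi for xi in x]
b0, b1 = ols(x, y)
print(b0, b1)  # 3.0 2.0
```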

The previous formula for $\widehat{\beta}_1$ is important because it shows us how to take data that we have and compute the slope estimate. The estimate, $\widehat{\beta}_1$, is commonly referred to as the ordinary least squares (OLS) slope estimate. It can be computed whenever the sample variance of $x_i$ isn't zero. In other words, it can be computed when $x_i$ is not constant across all values of $i$. The intuition is that the variation in $x$ is what permits us to identify its impact on $y$. This also means, though, that we cannot determine the slope in a relationship if we observe a sample where everyone has the same years of schooling, or whatever causal variable we are interested in.

Once we have calculated $\widehat{\beta}_1$, we can compute the intercept value as $\widehat{\beta}_0 = \overline{y} - \widehat{\beta}_1 \overline{x}$. This is the OLS intercept estimate because it is calculated using sample averages. Notice that it is straightforward because $\widehat{\beta}_0$ is linear in $\widehat{\beta}_1$. With computers and statistical programming languages and software, we let our computers do these calculations because even when $n$ is small, these calculations are quite tedious. [Sidenote 24: Back in the old days, though? Let's be glad that the old days of calculating OLS estimates by hand are long gone.]

For any candidate estimates, $\widehat{\beta}_0, \widehat{\beta}_1$, we define a fitted value for each $i$ as:

$$\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i$$

Recall that $i = \{1, \dots, n\}$, so we have $n$ of these equations. This is the value we predict for $y_i$ given that $x = x_i$. But there is prediction error because $\widehat{y}_i \neq y_i$. We call that mistake the residual, and here use the $\widehat{u}_i$ notation for it. So the residual equals:

$$\widehat{u}_i = y_i - \widehat{y}_i$$
$$\widehat{u}_i = y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i$$

Suppose we measure the size of the mistake, for each $i$, by squaring it. Squaring it will, after all, eliminate all negative values of the mistake so that everything is a positive value. This becomes useful when summing the mistakes if we don't want positive and negative values to cancel one another out. So let's do that: square each mistake and add them all up to get $\sum_{i=1}^{n}\widehat{u}_i^2$:

$$\sum_{i=1}^{n}\widehat{u}_i^2 = \sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = \sum_{i=1}^{n}\big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big)^2$$

This equation is called the sum of squared residuals because the residual is $\widehat{u}_i = y_i - \widehat{y}_i$. But the residual is based on estimates of the slope and the intercept. We can imagine any number of estimates of those values. But what if our goal is to minimize the sum of squared residuals by choosing $\widehat{\beta}_0$ and $\widehat{\beta}_1$? Using calculus, it can be shown that

the solutions to that problem yield parameter estimates that are the same as what we obtained before.

Once we have the numbers $\widehat{\beta}_0$ and $\widehat{\beta}_1$ for a given dataset, we write the OLS regression line:

$$\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 x \qquad (32)$$

Let's consider an example in Stata.

set seed 1
clear
set obs 10000
gen x = rnormal()
gen u = rnormal()
gen y = 5.5*x + 12*u
reg y x
predict yhat1
gen yhat2 = 0.0732608 + 5.685033*x
sum yhat*
predict uhat1, residual
gen uhat2 = y - yhat2
sum uhat*
twoway (lfit y x, lcolor(black) lwidth(medium)) (scatter y x, mcolor(black) msize(tiny) msymbol(point)), title(OLS Regression Line)
rvfplot, yline(0)

Run the previous lines verbatim in Stata. Notice that the estimated coefficients – the y-intercept and slope parameter – are represented in blue and red below in Figure 3.

Recall that we defined the fitted value as $\widehat{y}_i$ and the residual as $\widehat{u}_i = y_i - \widehat{y}_i$. Notice that the scatter plot relationship between the residuals and the fitted values creates a spherical pattern, suggesting that they are uncorrelated (Figure 4).

Once we have the estimated coefficients and the OLS regression line, we can predict $y$ (the outcome) for any (sensible) value of $x$. So if we plug in certain values of $x$, we can immediately calculate what $y$ will probably be, with some error. The value of OLS here lies in how large that error is: OLS minimizes the error for a linear function. In fact, it is the best such guess at $y$ among all linear estimators because it minimizes the prediction error. There's always prediction error, in other words, with any estimator, but OLS is the least worst.

Notice that the intercept is the predicted value of $y$ if and when $x = 0$. Here that value is 0.0732608. It's a little hard to read, but that's because $x$ and $u$ were random draws and so there's a value of
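For readers without Stata, the same exercise can be sketched in Python. This is a rough, hypothetical translation: the random-number seeds differ across programs, so the estimates will be close to, but not identical to, the Stata output quoted in the text.

```python
import random

# Rough Python analogue of the Stata example: y = 5.5*x + 12*u with
# standard-normal x and u, then OLS via equation (31).
random.seed(1)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
y = [5.5 * xi + 12 * ui for xi, ui in zip(x, u)]

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
print(round(b1, 2))  # close to the true slope of 5.5
print(round(b0, 2))  # close to the true intercept of 0
```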

[Figure 3: Graphical representation of bivariate regression of y on x. The OLS regression line is plotted through the scatter of y against x, with y-intercept = 0.0732609 and slope = 5.685033.]

$\widehat{y}$ close to, but not exactly, zero at $x = 0$. [Sidenote 25: This is because on average $u$ and $x$ are independent in the population, even if in the sample they aren't exactly. Sample characteristics tend to be slightly different from population properties because of sampling error.] The slope allows us to predict changes in $y$ for any reasonable change in $x$ according to:

$$\Delta \widehat{y} = \widehat{\beta}_1 \Delta x$$

And if $\Delta x = 1$, then $x$ increases by one unit, and so $\Delta \widehat{y} = 5.685033$ in our numerical example because $\widehat{\beta}_1 = 5.685033$.

Now that we have calculated $\widehat{\beta}_0$ and $\widehat{\beta}_1$, we get the OLS fitted values by plugging the $x_i$ into the following equation for $i = 1, \dots, n$:

$$\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i$$

The OLS residuals are also calculated by:

$$\widehat{u}_i = y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i$$

Most residuals will be different from zero (i.e., they do not lie on the regression line). You can see this in Figure 3. Most of the points are not on the regression line. Some residuals are positive, and some are negative. A positive residual indicates that the regression line (and hence the predicted values) underestimates the true value of $y_i$. And if the residual is negative, then the regression line overestimates it.

Algebraic Properties of OLS   Remember how we obtained $\widehat{\beta}_0$ and $\widehat{\beta}_1$? When an intercept is included, we have:

$$\sum_{i=1}^{n}\big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big) = 0$$

[Figure 4: Distribution of residuals around the regression line – residuals plotted against fitted values, with a horizontal line at zero.]

The OLS residuals always add up to zero, by construction:

$$\sum_{i=1}^{n}\widehat{u}_i = 0 \qquad (33)$$

Sometimes seeing is believing, so let's look at this together. Type the following into Stata verbatim.

clear
set seed 1234
set obs 10
gen x = 9*rnormal()
gen u = 36*rnormal()
gen y = 3 + 2*x + u
reg y x
predict yhat
predict residuals, residual
su residuals
list
collapse (sum) x u y yhat residuals
list

Output from this can be summarized in the following table (Table 6).

[Table 6: Simulated data showing the sum of residuals equals zero. Columns: observation number, $x$, $u$, $y$, $\widehat{y}$, $\widehat{u}$, $\widehat{y}\widehat{u}$ and $x\widehat{u}$ for the ten simulated observations, with column sums in the last row. The $\widehat{u}$, $\widehat{y}\widehat{u}$ and $x\widehat{u}$ columns each sum to approximately zero, while the $u$ and $\widehat{y}$ columns do not.]

Notice the difference between the $u$ and $\widehat{u}$ columns. When we sum these ten lines, neither the error term nor the fitted values of $y$ sum to zero. But the residuals do sum to zero. This is, as we said, one of the algebraic properties of OLS – the coefficients were optimally chosen to ensure that the residuals sum to zero.

Because $y_i = \widehat{y}_i + \widehat{u}_i$ by definition (which we can also see in the above table), we can take the sample average of both sides:

$$\frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} \widehat{y}_i + \frac{1}{n}\sum_{i=1}^{n} \widehat{u}_i$$

and so $\overline{y} = \overline{\widehat{y}}$ because the residuals sum to zero. Similarly, the way that we obtained our estimates yields

$$\sum_{i=1}^{n} x_i\big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big) = 0$$

The sample covariance (and therefore the sample correlation) between the explanatory variable and the residuals is always zero (see Table 6):

$$\sum_{i=1}^{n} x_i \widehat{u}_i = 0$$

Because the $\widehat{y}_i$ are linear functions of the $x_i$, the fitted values and residuals are uncorrelated too (see Table 6):

$$\sum_{i=1}^{n} \widehat{y}_i \widehat{u}_i = 0$$

Both properties hold by construction. In other words, $\widehat{\beta}_0$ and $\widehat{\beta}_1$ were selected to make them true. [Sidenote 26: Using the Stata code from Table 6, you can show all these algebraic properties yourself. I encourage you to do so by creating new variables equalling the product of these terms and collapsing as we did with the other variables. This will help you believe these algebraic properties hold.]

A third property is that if we plug in the average for $x$, we predict the sample average for $y$. That is, the point $(\overline{x}, \overline{y})$ is on the OLS regression line, or:

$$\overline{y} = \widehat{\beta}_0 + \widehat{\beta}_1 \overline{x}$$
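These algebraic properties can be verified in a few lines without Stata. A Python sketch on hypothetical simulated data, mirroring the exercise above:

```python
import random

# Verify the algebraic properties of OLS: the residuals, x*residuals,
# and yhat*residuals all sum to (numerically) zero by construction.
random.seed(1234)
n = 10
x = [random.gauss(0, 9) for _ in range(n)]
u = [random.gauss(0, 36) for _ in range(n)]
y = [3 + 2 * xi + ui for xi, ui in zip(x, u)]

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
yhat = [b0 + b1 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]

print(abs(sum(resid)) < 1e-6)                                # True
print(abs(sum(xi * r for xi, r in zip(x, resid))) < 1e-6)    # True
print(abs(sum(yh * r for yh, r in zip(yhat, resid))) < 1e-6) # True
```

Note that the error draws `u` themselves do not sum to zero; only the residuals do.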

Goodness of Fit   For each observation, we write

$$y_i = \widehat{y}_i + \widehat{u}_i$$

Define the total (SST), explained (SSE) and residual (SSR) sums of squares as

$$SST = \sum_{i=1}^{n} (y_i - \overline{y})^2 \qquad (34)$$

$$SSE = \sum_{i=1}^{n} (\widehat{y}_i - \overline{y})^2 \qquad (35)$$

$$SSR = \sum_{i=1}^{n} \widehat{u}_i^2 \qquad (36)$$

These are sample variances when divided by $n - 1$. [Sidenote 27: Recall the earlier discussion about the degrees-of-freedom correction.] $\frac{SST}{n-1}$ is the sample variance of $y_i$, $\frac{SSE}{n-1}$ is the sample variance of $\widehat{y}_i$, and $\frac{SSR}{n-1}$ is the sample variance of $\widehat{u}_i$. With some simple manipulation, rewrite equation 34:

$$SST = \sum_{i=1}^{n} (y_i - \overline{y})^2 = \sum_{i=1}^{n} \big[(y_i - \widehat{y}_i) + (\widehat{y}_i - \overline{y})\big]^2 = \sum_{i=1}^{n} \big[\widehat{u}_i + (\widehat{y}_i - \overline{y})\big]^2$$

And then, using the fact that the fitted values are uncorrelated with the residuals, we can show that:

$$SST = SSE + SSR$$

Assuming $SST > 0$, we can define the fraction of the total variation in $y_i$ that is explained by $x_i$ (or the OLS regression line) as

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$

which is called the R-squared of the regression. It can be shown to be equal to the square of the correlation between $y_i$ and $\widehat{y}_i$. Therefore $0 \leq R^2 \leq 1$. An R-squared of zero means no linear relationship between $y_i$ and $x_i$, and an R-squared of one means a perfect linear relationship (e.g., $y_i = x_i + 2$). As $R^2$ increases, the $y_i$ fall closer and closer to the OLS regression line.

You don't want to fixate on $R^2$ in causal inference, though. It's a useful summary measure, but it does not tell us about causality. Remember, we aren't trying to explain $y$; we are trying to estimate causal effects. The $R^2$ tells us how much of the variation in $y_i$ is explained by the explanatory variables. But if we are interested in the causal effect of a single variable, $R^2$ is irrelevant. For causal inference, we need equation 28.
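The decomposition $SST = SSE + SSR$ and the bounds on $R^2$ are easy to confirm numerically. A Python sketch on hypothetical simulated data:

```python
import random

# Verify SST = SSE + SSR and 0 <= R^2 <= 1 on simulated data.
random.seed(42)
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [3 + 2 * xi + random.gauss(0, 1) for xi in x]

x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                    # equation (34)
sse = sum((yh - y_bar) ** 2 for yh in yhat)                 # equation (35)
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))        # equation (36)

print(abs(sst - (sse + ssr)) < 1e-6)  # True: the decomposition holds
r2 = sse / sst
print(0 <= r2 <= 1)                   # True
```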

Expected Value of OLS   Up to now, we motivated simple regression using a population model. But our analysis has been purely algebraic, based on a sample of data. So residuals always average to zero when we apply OLS to a sample, regardless of any underlying model. But now our job gets tougher. Now we have to study the statistical properties of the OLS estimator, referring to a population model and assuming random sampling. [Sidenote 28: This section is a review of a traditional econometrics pedagogy. We cover it for the sake of completeness, as traditionally econometricians motivated their discussion of causality through ideas like unbiasedness and consistency.]

Mathematical statistics is concerned with questions like "how do our estimators behave across different samples of data?" On average, for instance, will we get the right answer if we could repeatedly sample? We need to find the expected value of the OLS estimators – in effect the average outcome across all possible random samples – and determine if we are right on average. This leads naturally to a characteristic called unbiasedness, which is a desirable characteristic of all estimators:

$$E\big(\widehat{\beta}\big) = \beta \qquad (37)$$

Remember, our objective is to estimate $\beta_1$, which is the slope population parameter that describes the relationship between $y$ and $x$. Our estimate, $\widehat{\beta}_1$, is an estimator of that parameter obtained for a specific sample. Different samples will generate different estimates ($\widehat{\beta}_1$) for the "true" (and unobserved) $\beta_1$. Unbiasedness is the idea that if we could take as many random samples on $Y$ as we want from the population and compute an estimate each time, the average of these estimates would be equal to $\beta_1$.

There are several assumptions required for OLS to be unbiased. We will review those now. The first assumption is called "linear in the parameters". Assume a population model of:

$$y = \beta_0 + \beta_1 x + u$$

where $\beta_0$ and $\beta_1$ are the unknown population parameters. We view $x$ and $u$ as outcomes of random variables generated by some data generating process. Thus, since $y$ is a function of $x$ and $u$, both of which are random, $y$ is also random. Stating this assumption formally shows our goal is to estimate $\beta_0$ and $\beta_1$.

Our second assumption is "random sampling". We have a random sample of size $n$, $\{(x_i, y_i) : i = 1, \dots, n\}$, following the population model. We know how to use this data to estimate $\beta_0$ and $\beta_1$ by OLS. Because each $i$ is a draw from the population, we can write, for each $i$:

$$y_i = \beta_0 + \beta_1 x_i + u_i$$

Notice that $u_i$ here is the unobserved error for observation $i$. It is not the residual that we compute from the data.

The third assumption is called "sample variation in the explanatory variable". That is, the sample outcomes on $x_i$ are not all the same value. This is the same as saying the sample variance of $x$ is not zero. In practice, this is no assumption at all. If the $x_i$ are all the same value (i.e., constant), we cannot learn how $x$ affects $y$ in the population. Recall that OLS is the covariance of $x$ and $y$ divided by the variance in $x$, and so if $x$ is constant, then we are dividing by zero, and the OLS estimator is undefined.

The fourth assumption is where our assumptions start to have real teeth. It is called the "zero conditional mean" assumption and is probably the most critical assumption in causal inference. In the population, the error term has zero mean given any value of the explanatory variable:

$$E(u \mid x) = E(u) = 0$$

This is the key assumption for showing that OLS is unbiased, with the zero value being of no importance once we assume $E(u \mid x)$ does not change with $x$. Note that we can compute the OLS estimates whether or not this assumption holds, or even if there is an underlying population model. [Sidenote 29: We will focus on $\widehat{\beta}_1$. There are a few approaches to showing unbiasedness. One explicitly computes the expected value of $\widehat{\beta}_1$ conditional on $\{x_i : i = 1, \dots, n\}$. Even though this is the more proper way to understand the problem, technically we can obtain the same results by treating the conditioning variables as if they were fixed in repeated samples – that is, treating the $x_i$ as nonrandom in the derivation. So the randomness in $\widehat{\beta}_1$ comes through the $u_i$ (equivalently, the $y_i$). Nevertheless, it is important to remember that the $x_i$ are random variables and that we are taking expectations conditional on knowing them. The approach we're taking is sometimes called "fixed in repeated samples", and while not realistic in most cases, it gets us to the same place. We use it as a simplifying device because ultimately this chapter is just meant to help you understand this traditional pedagogy better.]

So, how do we show $\widehat{\beta}_1$ is an unbiased estimate of $\beta_1$ (equation 37)? We need to show that, under the four assumptions we just outlined, the expected value of $\widehat{\beta}_1$, when averaged across random samples, will center on $\beta_1$. In other words, unbiasedness has to be understood as related to repeated sampling. We will discuss the answer as a series of steps.

Step 1: Write down a formula for $\widehat{\beta}_1$. It is convenient to use the $\frac{C(x,y)}{V(x)}$ form:

$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})y_i}{\sum_{i=1}^{n}(x_i - \overline{x})^2}$$

Now get rid of some of this notational clutter by defining $SST_x = \sum_{i=1}^{n}(x_i - \overline{x})^2$ (i.e., the total variation in the $x_i$). Rewrite as:

$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})y_i}{SST_x}$$

Step 2: Replace each $y_i$ with $y_i = \beta_0 + \beta_1 x_i + u_i$, which uses the first linear assumption and the fact that we have sampled data (our second assumption).

The numerator becomes:

$$\sum_{i=1}^{n}(x_i - \overline{x})y_i = \sum_{i=1}^{n}(x_i - \overline{x})(\beta_0 + \beta_1 x_i + u_i)$$
$$= \beta_0 \sum_{i=1}^{n}(x_i - \overline{x}) + \beta_1 \sum_{i=1}^{n}(x_i - \overline{x})x_i + \sum_{i=1}^{n}(x_i - \overline{x})u_i$$
$$= 0 + \beta_1 \sum_{i=1}^{n}(x_i - \overline{x})^2 + \sum_{i=1}^{n}(x_i - \overline{x})u_i$$
$$= \beta_1 SST_x + \sum_{i=1}^{n}(x_i - \overline{x})u_i$$

Note that we used $\sum_{i=1}^{n}(x_i - \overline{x}) = 0$ and $\sum_{i=1}^{n}(x_i - \overline{x})x_i = \sum_{i=1}^{n}(x_i - \overline{x})^2$ to do this. [Sidenote 30: Told you we would use this result a lot.]

We have shown that:

$$\widehat{\beta}_1 = \frac{\beta_1 SST_x + \sum_{i=1}^{n}(x_i - \overline{x})u_i}{SST_x} = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \overline{x})u_i}{SST_x}$$

Note how the last piece is the slope coefficient from the OLS regression of $u_i$ on $x_i$, $i: 1, \dots, n$. [Sidenote 31: I find it interesting that we see so many $\frac{cov(x,\cdot)}{var(x)}$ terms when working with regression. They show up constantly. Keep your eyes peeled.] We cannot do this regression because the $u_i$ are not observed. Now define $w_i = \frac{x_i - \overline{x}}{SST_x}$ so that we have the following:

$$\widehat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i$$

Note the following things that this showed: first, $\widehat{\beta}_1$ is a linear function of the unobserved errors, $u_i$. The $w_i$ are all functions of $\{x_1, \dots, x_n\}$. Second, the random difference between $\beta_1$ and the estimate of it, $\widehat{\beta}_1$, is due to this linear function of the unobservables.

Step 3: Find $E\big(\widehat{\beta}_1\big)$. Under the random sampling assumption and the zero conditional mean assumption, $E(u_i \mid x_1, \dots, x_n) = 0$. That means, conditional on each of the $x$ variables:

$$E(w_i u_i \mid x_1, \dots, x_n) = w_i E(u_i \mid x_1, \dots, x_n) = 0$$

because $w_i$ is a function of $\{x_1, \dots, x_n\}$. This would not be true if, in the population, $u$ and $x$ were correlated. Now we can complete the proof: conditional on $\{x_1, \dots, x_n\}$,

$$\begin{aligned}
E(\hat{\beta}_1) &= E\bigg(\beta_1 + \sum_{i=1}^n w_i u_i\bigg) \\
&= \beta_1 + \sum_{i=1}^n E(w_i u_i) \\
&= \beta_1 + \sum_{i=1}^n w_i E(u_i) \\
&= \beta_1 + 0 \\
&= \beta_1
\end{aligned}$$

Remember, $\beta_1$ is the fixed constant in the population. The estimator, $\hat{\beta}_1$, varies across samples and is the random outcome: before we collect our data, we do not know what $\hat{\beta}_1$ will be. Under the four aforementioned assumptions, $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.

I find it helpful to be concrete when we work through exercises like this. So let's visualize this in Stata by creating a Monte Carlo simulation. We have the following population model:

$$y = 3 + 2x + u \qquad (38)$$

where $x \sim \text{Normal}(0, 9)$ and $u \sim \text{Normal}(0, 36)$. Also, $x$ and $u$ are independent. The following Monte Carlo simulation will estimate OLS on a sample of data 1,000 times. The true $\beta$ parameter equals 2. But what will the average $\hat{\beta}$ equal when we use repeated sampling?

    clear all
    program define ols, rclass
        version 14.2
        syntax [, obs(integer 1) mu(real 0) sigma(real 1) ]
        clear
        drop _all
        set obs 10000
        gen x = 9*rnormal()
        gen u = 36*rnormal()
        gen y = 3 + 2*x + u
        reg y x
    end

    simulate beta=_b[x], reps(1000): ols
    su
    hist beta

Table 7 gives us the mean value of $\hat{\beta}_1$ over the 1,000 repetitions (repeated sampling). While each sample had a different estimate, the

average for $\hat{\beta}_1$ was 2.000737, which is close to the true value of 2 (see Equation 38).

    Table 7: Monte Carlo simulation of OLS

    Variable    Obs      Mean        St. Dev.
    beta        1,000    2.000737    .0409954

The standard deviation in this estimator was 0.0409954, which is close to the standard error recorded in the regression itself. (The standard error I found from running on one sample of data was 0.0393758.) Thus we see that the estimate is the mean value of the coefficient from repeated sampling, and the standard error is the standard deviation from that repeated estimation. We can see the distribution of these coefficient estimates in Figure 5.

[Figure 5: Distribution of coefficients from Monte Carlo simulation. Histogram of _b[x] centered at 2, ranging from roughly 1.85 to 2.1.]

The problem is, we don't know which kind of sample we have. Do we have one of the "almost exactly 2" samples, or do we have one of the "pretty different from 2" samples? We can never know whether we are close to the population value. We hope that our sample is "typical" and produces a slope estimate close to $\beta_1$, but we can't know. Unbiasedness is a property of the procedure, or rule; it is not a property of the estimate itself. For example, say we estimated an 8.2% return to schooling. It is tempting to say 8.2% is an "unbiased estimate" of the return to schooling, but that's technically incorrect. The rule used to get $\hat{\beta}_1 = 0.082$ is unbiased (if we believe that $u$ is unrelated to schooling) – not the actual estimate itself.
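For readers working outside Stata, here is a rough Python analogue of the simulation above. It is a sketch, not the book's code: the random seed is my choice, and the OLS slope is computed directly as a covariance over a variance.

```python
import numpy as np

# Monte Carlo analogue (a sketch, not the book's code): draw many samples from
# y = 3 + 2x + u and collect the OLS slope estimate from each sample.
rng = np.random.default_rng(42)

def ols_slope(n=10_000):
    x = 9 * rng.normal(size=n)           # mirrors `gen x = 9*rnormal()`
    u = 36 * rng.normal(size=n)          # mirrors `gen u = 36*rnormal()`
    y = 3 + 2 * x + u
    # OLS slope = sample covariance over sample variance
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

betas = np.array([ols_slope() for _ in range(1_000)])
print(betas.mean())   # centers on the true beta_1 = 2
print(betas.std())    # sampling standard deviation of the estimator
```

Repeated sampling recovers the parameter on average, which is exactly the unbiasedness property just derived.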

Law of iterated expectations. As we said earlier in this chapter, the conditional expectation function (CEF) is the mean of some outcome $y$ with some covariate $x$ held fixed. Now we focus more intently on this function. (This section is based heavily on Angrist and Pischke [2009].) Let's get the notation and some of the syntax out of the way. As noted earlier, we write the CEF as $E(y_i \mid x_i)$. Note that the CEF is explicitly a function of $x_i$. And because $x_i$ is random, the CEF is random – although sometimes we work with particular values for $x_i$, like $E(y_i \mid x_i = 8\text{ years schooling})$ or $E(y_i \mid x_i = \text{Female})$. When there are treatment variables, then the CEF takes on two values: $E(y_i \mid d_i = 0)$ and $E(y_i \mid d_i = 1)$. But these are special cases only.

An important complement to the CEF is the law of iterated expectations (LIE). This law says that an unconditional expectation can be written as the unconditional average of the CEF. In other words, $E(y_i) = E\{E(y_i \mid x_i)\}$. This is a fairly simple idea to grasp. What it states is that if you want to know the unconditional expectation of some random variable $y$, you can simply calculate the weighted sum of all conditional expectations with respect to some covariate $x$. Let's look at an example. Let's say that average GPA for females is 3.5, average GPA for males is 3.2, half the population is female, and half is male. Then:

$$E[GPA_i] = E\{E(GPA_i \mid Gender_i)\} = (0.5 \times 3.5) + (0.5 \times 3.2) = 3.35$$

You probably use LIE all the time and didn't even know it. The proof is not complicated. Let $x_i$ and $y_i$ each be continuously distributed. The joint density is defined as $f_{xy}(u, t)$. The conditional distribution of $y$ given $x = u$ is defined as $f_{y \mid x}(t \mid x = u)$. The marginal densities are $g_y(t)$ and $g_x(u)$.
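Before the proof, the GPA arithmetic above can be checked in a couple of lines. This is a trivial sketch; the conditional means and population shares are the ones from the example in the text.

```python
# LIE in the GPA example: the unconditional mean is the probability-weighted
# average of the conditional means, E[y] = sum over x of E[y|x] * P(x).
cond_mean = {"female": 3.5, "male": 3.2}
prob = {"female": 0.5, "male": 0.5}

e_gpa = sum(cond_mean[g] * prob[g] for g in cond_mean)
print(e_gpa)  # 3.35, matching the text
```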
$$\begin{aligned}
E\{E(y \mid x)\} &= \int E(y \mid x = u)\, g_x(u)\, du \\
&= \int \bigg[\int t\, f_{y \mid x}(t \mid x = u)\, dt\bigg] g_x(u)\, du \\
&= \int\!\!\int t\, f_{y \mid x}(t \mid x = u)\, g_x(u)\, du\, dt \\
&= \int t \bigg[\int f_{y \mid x}(t \mid x = u)\, g_x(u)\, du\bigg] dt \\
&= \int t \bigg[\int f_{x,y}(u, t)\, du\bigg] dt \\
&= \int t\, g_y(t)\, dt \\
&= E(y)
\end{aligned}$$

The first line uses the definition of expectation. The second line uses

the definition of conditional expectation. The third line switches the order of integration, and the fourth line rearranges. The fifth line uses the definition of joint density, and the sixth line integrates the joint density over the support of $x$, which yields the marginal density of $y$. So, restating the law of iterated expectations: $E(y_i) = E\{E(y_i \mid x_i)\}$.

CEF Decomposition Property. The first property of the CEF we will discuss is the CEF Decomposition Property. The power of LIE comes from the way it breaks a random variable into two pieces – the CEF and a residual with special properties. The CEF Decomposition Property states that

$$y_i = E(y_i \mid x_i) + \varepsilon_i$$

where (i) $\varepsilon_i$ is mean independent of $x_i$, that is, $E(\varepsilon_i \mid x_i) = 0$, and (ii) $\varepsilon_i$ is uncorrelated with any function of $x_i$.

The theorem says that any random variable $y_i$ can be decomposed into a piece that is "explained by $x_i$" (the CEF) and a piece that is left over and orthogonal to any function of $x_i$. The proof is provided now. I'll prove part (i) first. Recall that $\varepsilon_i = y_i - E(y_i \mid x_i)$, as we will make a substitution in the second line below.

$$\begin{aligned}
E(\varepsilon_i \mid x_i) &= E\big(y_i - E(y_i \mid x_i) \mid x_i\big) \\
&= E(y_i \mid x_i) - E(y_i \mid x_i) \\
&= 0
\end{aligned}$$

The second part of the theorem states that $\varepsilon_i$ is uncorrelated with any function of $x_i$. Let $h(x_i)$ be any function of $x_i$. Then $E(h(x_i)\varepsilon_i) = E\{h(x_i)E(\varepsilon_i \mid x_i)\}$. The second term in the interior product is equal to zero by mean independence. (Let's take a concrete example of this proof. Let $h(x_i) = \alpha + \gamma x_i$. Then take the joint expectation $E(h(x_i)\varepsilon_i) = E([\alpha + \gamma x_i]\varepsilon_i)$. Then take conditional expectations: $\alpha E(\varepsilon_i \mid x_i) + \gamma x_i E(\varepsilon_i \mid x_i) = 0$ after we pass the expectation through.)

CEF Prediction Property. The second property is the CEF Prediction Property. This states that

$$E(y_i \mid x_i) = \arg\min_{m(x_i)} E\big[(y_i - m(x_i))^2\big]$$
where $m(x_i)$ is any function of $x_i$. In words, this states that the CEF is the minimum mean squared error predictor of $y_i$ given $x_i$. By adding $E(y_i \mid x_i) - E(y_i \mid x_i) = 0$ to the right hand side, we get

$$\big[y_i - m(x_i)\big]^2 = \Big[\big(y_i - E(y_i \mid x_i)\big) + \big(E(y_i \mid x_i) - m(x_i)\big)\Big]^2$$

I personally find this easier to follow with simpler notation. So write $a = y_i$, $b = E(y_i \mid x_i)$, and $c = m(x_i)$; the expression then has the form

$$\big[(a - b) + (b - c)\big]^2$$

Distribute the terms, rearrange, and replace the terms with their original values until you get the following:

$$\arg\min_{m(x_i)} E\Big[\big(y_i - E(y_i \mid x_i)\big)^2 + 2\big(E(y_i \mid x_i) - m(x_i)\big)\big(y_i - E(y_i \mid x_i)\big) + \big(E(y_i \mid x_i) - m(x_i)\big)^2\Big]$$

Now minimize the function with respect to $m(x_i)$. When minimizing with respect to $m(x_i)$, note that the first term, $(y_i - E(y_i \mid x_i))^2$, doesn't matter because it does not depend on $m(x_i)$, so it will zero out. The second and third terms, though, do depend on $m(x_i)$. So rewrite $2(E(y_i \mid x_i) - m(x_i))$ as $h(x_i)$. Also substitute $\varepsilon_i = y_i - E(y_i \mid x_i)$ and get

$$\arg\min_{m(x_i)} E\Big[\varepsilon_i^2 + h(x_i)\varepsilon_i + \big(E(y_i \mid x_i) - m(x_i)\big)^2\Big]$$

Now, minimizing this function and setting it equal to zero, we get $h'(x_i)\varepsilon_i$, which equals zero by the Decomposition Property.

ANOVA Theorem. The final property of the CEF that we will discuss is the analysis of variance theorem, or ANOVA. It is simply that the unconditional variance in some random variable is equal to the variance in the conditional expectation plus the expectation of the conditional variance, or

$$V(y_i) = V\big[E(y_i \mid x_i)\big] + E\big[V(y_i \mid x_i)\big]$$

where $V$ is the variance and $V(y_i \mid x_i)$ is the conditional variance.

Linear CEF Theorem. Angrist and Pischke [2009] give several arguments as to why linear regression may be of interest to a practitioner even if the underlying CEF itself is not linear. I will review those theorems now. These are merely arguments to justify the use of linear regression models to approximate the CEF. (Note, Angrist and Pischke [2009] make their arguments for using regression not based on unbiasedness and the four assumptions that we discussed, but rather because regression approximates the CEF. I want to emphasize that this is a subtly different direction. I included the discussion of unbiasedness, though, to be exhaustive. Just note, there is a slight change in pedagogy.)

The Linear CEF Theorem is the most obvious of the three theorems that Angrist and Pischke [2009] discuss. Suppose that the CEF itself is linear. Then the population regression is equal to the CEF. This simply states that you should use the population regression to estimate the CEF when you know that the CEF is linear. The proof is provided. If $E(y_i \mid x_i)$ is linear, then $E(y_i \mid x_i) = x_i'\beta^*$ for some vector $\beta^*$. By the Decomposition Property,

$$E\big(x_i(y_i - E(y_i \mid x_i))\big) = E\big(x_i(y_i - x_i'\beta^*)\big) = 0$$

Solve this and get $\beta = \beta^*$. Hence $E(y_i \mid x_i) = x_i'\beta$.
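The Decomposition Property and the ANOVA theorem above are easy to verify numerically. Below is a sketch (my own simulated example, with a deliberately nonlinear CEF; none of the specifics come from the text), using a discrete $x$ so the CEF can be computed exactly by group means.

```python
import numpy as np

# Simulate y with a nonlinear CEF and heteroskedastic noise, compute the CEF
# by group means, and check (i) the CEF residual has mean ~0 and is
# ~uncorrelated with x, and (ii) V(y) = V[E(y|x)] + E[V(y|x)] (ANOVA).
rng = np.random.default_rng(1)
n = 100_000
x = rng.integers(0, 4, size=n)              # discrete x in {0,1,2,3}
y = x**2 + rng.normal(0, 1 + x, size=n)     # nonlinear CEF, variance grows in x

group_means = np.array([y[x == v].mean() for v in range(4)])
cef = group_means[x]                        # E(y|x) evaluated at each observation
eps = y - cef                               # the CEF residual

within_var = np.array([y[x == v].var() for v in range(4)])
p_x = np.array([(x == v).mean() for v in range(4)])
anova_rhs = cef.var() + (within_var * p_x).sum()  # V[E(y|x)] + E[V(y|x)]

print(eps.mean())               # ~0: mean independence
print(np.cov(eps, x)[0, 1])     # ~0: uncorrelated with x
print(y.var(), anova_rhs)       # the two sides of the ANOVA theorem agree
```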

Best Linear Predictor Theorem. Recall that the CEF is the minimum mean squared error predictor of $y$ given $x$ in the class of all functions, according to the CEF Prediction Property. Given this, the population regression function, $\beta = E(X_iX_i')^{-1}E(X_iY_i)$, is the best that we can do in the class of all linear functions. (Proof: $\beta = E(X_iX_i')^{-1}E(X_iY_i)$ solves the population minimum mean squared error problem. Note that this is the matrix notation expression of the population regression, or what we have discussed as $\frac{C(X,Y)}{V(X)}$.)

Regression CEF Theorem. The function $X\beta$ provides the minimum mean squared error linear approximation to the CEF. That is,

$$\beta = \arg\min_b E\Big\{\big[E(y_i \mid x_i) - x_i'b\big]^2\Big\}$$

Regression anatomy theorem. In addition to our discussion of the CEF and regression theorems, we now dissect the regression itself. Here we discuss the regression anatomy theorem. The regression anatomy theorem is based on earlier work by Frisch and Waugh [1933] and Lovell [1963]. (A helpful proof of the Frisch-Waugh-Lovell theorem can be found in Lovell [2008].) I find it more intuitive when thinking through a specific example and offering up some data visualization. In my opinion, the theorem helps us interpret the individual coefficients of a multiple linear regression model. Say that we are interested in the causal effect of family size on labor supply. We want to regress labor supply onto family size:

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

where $Y_i$ is labor supply and $X_i$ is family size.

If family size is truly random, then the number of kids is uncorrelated with the unobserved error term. This implies that when we regress labor supply onto family size, our estimate $\hat{\beta}_1$ can be interpreted as the causal effect of family size on labor supply.
Visually, we could just plot the regression coefficient in a scatter plot showing all $i$ pairs of data, and the slope coefficient would be the best fit of this data through this data cloud. That slope would tell us the average causal effect of family size on labor supply.

But how do we interpret $\hat{\beta}_1$ if family size is not random? After all, we know from living on planet Earth and having even half a brain that a person's family size is usually chosen, not randomly assigned to them. And oftentimes it's chosen according to something akin to an optimal stopping rule. People pick both the number of kids to have, as well as when to have them, and in some instances even attempt to pick the gender, and this is all based on a variety of observed and unobserved economic factors that are directly correlated with the decision to supply labor. In other words, using the language we've been using up til now, it's unlikely that $E(u \mid X) = E(u) = 0$.

But let's say that we have reason to think that the number of kids is conditionally random. That is, for a given person of a certain race and age, any remaining variation in family size across a population is random. (Almost certainly not a credible assumption, but stick with me.) Then we have the following population model:

$$Y_i = \beta_0 + \beta_1 X_i + \gamma_1 R_i + \gamma_2 A_i + u_i$$

where $Y_i$ is labor supply, $X_i$ is family size, $R_i$ is race, $A_i$ is age, and $u_i$ is the population error term.

If we want to estimate the average causal effect of family size on labor supply, then we need two things. First, we need a sample of data containing all four of these variables. Without all four of the variables, we cannot estimate this regression model. And secondly, we need for the number of kids, $X$, to be randomly assigned for a given set of race/age.

Now how do we interpret $\hat{\beta}_1$? And for those who like pictures, how might we visualize this coefficient given there are six dimensions to the data? The regression anatomy theorem tells us both what this coefficient estimate actually means, and it also lets us visualize the data in only two dimensions.

To explain the intuition of the regression anatomy theorem, let's write down a population model with multiple variables. Assume that your main multiple regression model of interest is

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \cdots + \beta_K x_{Ki} + e_i \qquad (39)$$

Now assume an auxiliary regression in which the variable $x_{1i}$ is regressed on all the remaining independent variables

$$x_{1i} = \gamma_0 + \gamma_{k-1} x_{k-1,i} + \gamma_{k+1} x_{k+1,i} + \cdots + \gamma_K x_{Ki} + f_i \qquad (40)$$

with $\tilde{x}_{1i} = x_{1i} - \hat{x}_{1i}$ being the residual from that auxiliary regression. Then the parameter $\beta_1$ can be rewritten as:

$$\beta_1 = \frac{C(y_i, \tilde{x}_{1i})}{V(\tilde{x}_{1i})} \qquad (41)$$

Notice that again we see the coefficient estimate being a scaled covariance, only here the covariance is with respect to the outcome and the residual from the auxiliary regression, and the scale is the variance of that same residual.

To prove the theorem, note that $E[\tilde{x}_{ki}] = E[x_{ki}] - E[\hat{x}_{ki}] = E[f_i]$, and plug $y_i$ and the residual $\tilde{x}_{ki}$ from the auxiliary regression into the covariance $cov(y_i, \tilde{x}_{ki})$:

$$\begin{aligned}
\beta_k &= \frac{cov(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \cdots + \beta_K x_{Ki} + e_i,\ \tilde{x}_{ki})}{var(\tilde{x}_{ki})} \\
&= \frac{cov(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \cdots + \beta_K x_{Ki} + e_i,\ f_i)}{var(f_i)}
\end{aligned}$$

Since by construction $E[f_i] = 0$, it follows that the term $\beta_0 E[f_i] = 0$. Since $f_i$ is a linear combination of all the independent variables with the exception of $x_{ki}$, it must be that

$$\beta_1 E[f_i x_{1i}] = \cdots = \beta_{k-1} E[f_i x_{k-1,i}] = \beta_{k+1} E[f_i x_{k+1,i}] = \cdots = \beta_K E[f_i x_{Ki}] = 0$$

Consider now the term $E[e_i f_i]$. This can be written as

$$\begin{aligned}
E[e_i f_i] &= E[e_i \tilde{x}_{ki}] \\
&= E\big[e_i (x_{ki} - \hat{x}_{ki})\big] \\
&= E[e_i x_{ki}] - E[e_i \hat{x}_{ki}]
\end{aligned}$$

Since $e_i$ is uncorrelated with any independent variable, it is also uncorrelated with $x_{ki}$. Accordingly, we have $E[e_i x_{ki}] = 0$. With regard to the second term of the subtraction, substituting the predicted value from the $x_{ki}$ auxiliary regression, we get

$$E[e_i \hat{x}_{ki}] = E\big[e_i(\hat{\gamma}_0 + \hat{\gamma}_1 x_{1i} + \cdots + \hat{\gamma}_{k-1} x_{k-1,i} + \hat{\gamma}_{k+1} x_{k+1,i} + \cdots + \hat{\gamma}_K x_{Ki})\big]$$

Once again, since $e_i$ is uncorrelated with any independent variable, the expected value of these terms is equal to zero. Then it follows that $E[e_i f_i] = 0$.

The only remaining term, then, is $E[\beta_k x_{ki} f_i]$, which equals $\beta_k E[x_{ki} \tilde{x}_{ki}]$, since $f_i = \tilde{x}_{ki}$. The term $x_{ki}$ can be substituted using a rewriting of the auxiliary regression model, such that

$$x_{ki} = E[x_{ki} \mid X_{-k,i}] + \tilde{x}_{ki}$$

This gives

$$\begin{aligned}
\beta_k E[x_{ki}\tilde{x}_{ki}] &= \beta_k E\big[\tilde{x}_{ki}\big(E[x_{ki} \mid X_{-k,i}] + \tilde{x}_{ki}\big)\big] \\
&= \beta_k \Big\{E[\tilde{x}_{ki}^2] + E\big[\tilde{x}_{ki} E[x_{ki} \mid X_{-k,i}]\big]\Big\} \\
&= \beta_k\, var(\tilde{x}_{ki})
\end{aligned}$$

which follows directly from the orthogonality between $E[x_{ki} \mid X_{-k,i}]$ and $\tilde{x}_{ki}$. From previous derivations we finally get

$$cov(y_i, \tilde{x}_{ki}) = \beta_k\, var(\tilde{x}_{ki})$$

which completes the proof.
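The theorem is easy to confirm on simulated data. The sketch below (my own example; the coefficients, sample size, and seed are arbitrary choices, not from the text) compares the multivariate OLS coefficient with the covariance-over-variance formula in equation 41.

```python
import numpy as np

# Regression anatomy check: the multivariate OLS coefficient on x1 equals
# C(y, x1_tilde)/V(x1_tilde), where x1_tilde is the residual from regressing
# x1 on the other covariates (plus a constant).
rng = np.random.default_rng(7)
n = 5_000
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * x2 - 0.3 * x3 + rng.normal(size=n)   # x1 correlated with x2 and x3
y = 1 + 2 * x1 - x2 + 0.5 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
beta = np.linalg.lstsq(X, y, rcond=None)[0]      # full multivariate OLS

Z = np.column_stack([np.ones(n), x2, x3])        # auxiliary regression of x1
gamma = np.linalg.lstsq(Z, x1, rcond=None)[0]
x1_tilde = x1 - Z @ gamma                        # auxiliary residual

anatomy = np.cov(y, x1_tilde)[0, 1] / np.var(x1_tilde, ddof=1)
print(beta[1], anatomy)   # the two coefficients coincide
```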

I find it helpful to visualize things. Let's look at an example in Stata.

    . ssc install reganat, replace
    . sysuse auto.dta, replace
    . regress price length
    . regress price length weight headroom mpg
    . reganat price length weight headroom mpg, dis(length) biline

Let's walk through the regression output. The first regression of price on length yields a coefficient of 57.20 on length. But notice the output from the fourth line. The effect of length is now −94.5. The first regression is a bivariate regression and gives a positive slope, but the second regression is a multivariate regression and yields a negative slope. One of the things we can do with regression anatomy (though this isn't its main purpose) is visualize this negative slope from the multivariate regression in two-dimensional space. Now how do we visualize this multivariate slope coefficient, given our data has four dimensions? We run the auxiliary regression, use the residuals, and then calculate the slope coefficient as $\frac{cov(y_i, \tilde{x}_i)}{var(\tilde{x}_i)}$. We can also show scatter plots of these auxiliary residuals paired with their outcome observations and slice the slope through them (Figure 6: Regression anatomy display). Notice that this is a useful way to preview the multidimensional correlation between two variables from a multivariate regression.

And as we discussed before, the solid black line is negative while the slope from the bivariate regression is positive. The regression anatomy theorem shows that these two estimators – one being a multivariate OLS and the other being a bivariate regression of price on a residual – are identical.

Variance of the OLS Estimators. In this chapter we discuss inference under a variety of situations. Under the four assumptions we mentioned earlier, the OLS estimators are unbiased. But these assumptions are not sufficient to tell us anything about the variance in the estimator itself. These assumptions help inform our beliefs that the estimated coefficients, on average, equal the parameter values themselves. But to speak intelligently about the variance of the estimator, we need a measure of dispersion, or spread, in the sampling distribution of the estimators. As we've been saying, this leads us to the variance and ultimately the standard deviation. We could characterize the variance of the OLS estimators under the four assumptions. But for now, it's easiest to introduce an assumption that simplifies the calculations. We'll keep the assumption ordering we've been using and call this the fifth assumption.

The fifth assumption is the homoskedasticity or constant variance assumption. This assumption stipulates that our population error term, $u$, has the same variance given any value of the explanatory variable, $x$. Formally, it's:

$$V(u \mid x) = \sigma^2 > 0 \qquad (42)$$

where $\sigma^2$ is some finite, positive number. Because we assume the zero conditional mean assumption, whenever we assume homoskedasticity, we can also write:

$$E(u^2 \mid x) = \sigma^2 = E(u^2) \qquad (43)$$

Now, under the first, fourth and fifth assumptions, we can write:

$$\begin{aligned}
E(y \mid x) &= \beta_0 + \beta_1 x \\
V(y \mid x) &= \sigma^2 \qquad (44)
\end{aligned}$$

So the average, or expected, value of $y$ is allowed to change with $x$,
but the variance does not change with $x$. The constant variance assumption may not be realistic; it must be determined on a case-by-case basis.

Theorem: Sampling variance of OLS. Under assumptions 1 and 2,

we get:

$$V(\hat{\beta}_1 \mid x) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sigma^2}{SST_x} \qquad (45)$$

$$V(\hat{\beta}_0 \mid x) = \frac{\sigma^2\, n^{-1}\sum_{i=1}^n x_i^2}{SST_x} \qquad (46)$$

To show this, write, as before,

$$\hat{\beta}_1 = \beta_1 + \sum_{i=1}^n w_i u_i \qquad (47)$$

where $w_i = \frac{(x_i - \bar{x})}{SST_x}$. We are treating this as nonrandom in the derivation. Because $\beta_1$ is a constant, it does not affect $V(\hat{\beta}_1)$. Now, we need to use the fact that, for uncorrelated random variables, the variance of the sum is the sum of the variances. The $\{u_i : i = 1, \dots, n\}$ are actually independent across $i$ and are uncorrelated. Remember: if we know $x$, we know $w$. So:

$$\begin{aligned}
V(\hat{\beta}_1 \mid x) &= Var\bigg(\beta_1 + \sum_{i=1}^n w_i u_i \;\Big|\; x\bigg) \qquad (48) \\
&= Var\bigg(\sum_{i=1}^n w_i u_i \;\Big|\; x\bigg) \qquad (49) \\
&= \sum_{i=1}^n Var(w_i u_i \mid x) \qquad (50) \\
&= \sum_{i=1}^n w_i^2\, Var(u_i \mid x) \qquad (51) \\
&= \sum_{i=1}^n w_i^2 \sigma^2 \qquad (52) \\
&= \sigma^2 \sum_{i=1}^n w_i^2 \qquad (53)
\end{aligned}$$

where the penultimate equality used the fifth assumption, so that the variance of $u_i$ does not depend on $x_i$. Now we have:

$$\begin{aligned}
\sum_{i=1}^n w_i^2 &= \sum_{i=1}^n \frac{(x_i - \bar{x})^2}{SST_x^2} \qquad (54) \\
&= \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{SST_x^2} \qquad (55) \\
&= \frac{SST_x}{SST_x^2} \qquad (56) \\
&= \frac{1}{SST_x} \qquad (57)
\end{aligned}$$

We have shown:

$$V(\hat{\beta}_1) = \frac{\sigma^2}{SST_x} \qquad (58)$$

A couple of points. First, this is the "standard" formula for the variance of the OLS slope estimator. It is not valid if the fifth assumption ("homoskedastic errors") doesn't hold. The homoskedasticity assumption is needed, in other words, to derive this standard formula. But the homoskedasticity assumption is not used to show unbiasedness of the OLS estimators. That requires only the first four assumptions we discussed.

Usually, we are interested in $\beta_1$. We can easily study the two factors that affect its variance: the numerator and the denominator.

$$V(\hat{\beta}_1) = \frac{\sigma^2}{SST_x} \qquad (59)$$

As the error variance increases – that is, as $\sigma^2$ increases – so does the variance in our estimator. The more "noise" in the relationship between $y$ and $x$ (i.e., the larger the variability in $u$), the harder it is to learn something about $\beta_1$. By contrast, more variation in $\{x_i\}$ is a good thing: as $SST_x \uparrow$, $V(\hat{\beta}_1) \downarrow$.

Notice that $\frac{SST_x}{n}$ is the sample variance in $x$. We can think of this as getting close to the population variance of $x$, $\sigma_x^2$, as $n$ gets large. This means:

$$SST_x \approx n\sigma_x^2 \qquad (60)$$

which means that as $n$ grows, $V(\hat{\beta}_1)$ shrinks at the rate of $\frac{1}{n}$. This is why more data is a good thing – because it shrinks the sampling variance of our estimators.

The standard deviation of $\hat{\beta}_1$ is the square root of the variance. So:

$$sd(\hat{\beta}_1) = \frac{\sigma}{\sqrt{SST_x}} \qquad (61)$$

This turns out to be the measure of variation that appears in confidence intervals and test statistics.

Next we look at estimating the error variance. In the formula $V(\hat{\beta}_1) = \frac{\sigma^2}{SST_x}$, we can compute $SST_x$ from $\{x_i : i = 1, \dots, n\}$. But we need to estimate $\sigma^2$. Recall that $\sigma^2 = E(u^2)$.
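Equation 58 can also be checked by simulation. The sketch below (my own example, not from the text) draws the $x$'s once and holds them fixed across replications, in the "fixed in repeated samples" spirit of the derivation, redrawing only the errors.

```python
import numpy as np

# Check V(beta1_hat | x) = sigma^2 / SST_x: fixed design, errors redrawn
# across replications.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(0, 2, n)                        # fixed design
sst_x = ((x - x.mean()) ** 2).sum()
sigma2 = 4.0                                   # Var(u)

def slope():
    u = rng.normal(0, np.sqrt(sigma2), n)
    y = 1 + 0.5 * x + u
    return ((x - x.mean()) * y).sum() / sst_x  # OLS slope

betas = np.array([slope() for _ in range(20_000)])
print(betas.var())      # simulated sampling variance of the slope
print(sigma2 / sst_x)   # theoretical value sigma^2 / SST_x
```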
Therefore, if we could observe a sample on the errors, $\{u_i : i = 1, \dots, n\}$, an unbiased estimator of $\sigma^2$ would be the sample average:

$$\frac{1}{n}\sum_{i=1}^n u_i^2 \qquad (62)$$

But this isn't an estimator that we can compute from the data we observe, because the $u_i$ are unobserved. How about replacing each $u_i$

with its "estimate", the OLS residual $\hat{u}_i$?

$$u_i = y_i - \beta_0 - \beta_1 x_i \qquad (63)$$

$$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \qquad (64)$$

Whereas $u_i$ cannot be computed, $\hat{u}_i$ can be computed from the data because it depends on the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$. But, except by fluke, $\hat{u}_i \neq u_i$ for any $i$.

$$\begin{aligned}
\hat{u}_i &= y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \qquad (65) \\
&= (\beta_0 + \beta_1 x_i + u_i) - \hat{\beta}_0 - \hat{\beta}_1 x_i \qquad (66) \\
&= u_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)x_i \qquad (67)
\end{aligned}$$

Note that $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$, but the estimators almost always differ from the population values in a sample. So what about this as an estimator of $\sigma^2$?

$$\frac{1}{n}\sum_{i=1}^n \hat{u}_i^2 = \frac{1}{n} SSR \qquad (68)$$

It is a true estimator and easily computed from the data after OLS. As it turns out, this estimator is slightly biased: its expected value is a little less than $\sigma^2$. The estimator does not account for the two restrictions on the residuals used to obtain $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\sum_{i=1}^n \hat{u}_i = 0 \qquad (69)$$

$$\sum_{i=1}^n x_i \hat{u}_i = 0 \qquad (70)$$

There is no such restriction on the unobserved errors. The unbiased estimator of $\sigma^2$, therefore, uses a degrees-of-freedom adjustment. The residuals have only $n - 2$ degrees of freedom, not $n$. Therefore:

$$\hat{\sigma}^2 = \frac{SSR}{n - 2} \qquad (71)$$

We now propose the following theorem. The unbiased estimator of $\sigma^2$ under the first five assumptions is:

$$E(\hat{\sigma}^2) = \sigma^2 \qquad (72)$$

In regression output, it is the following that is usually reported:

$$\hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad (73)$$

$$= \sqrt{\frac{SSR}{(n - 2)}} \qquad (74)$$

This is an estimator of $sd(u)$, the standard deviation of the population error. One small glitch is that $\hat{\sigma}$ is not unbiased for $\sigma$. (There does exist an unbiased estimator of $\sigma$, but it's tedious and hardly anyone in economics seems to use it. See Holtzman [1950].) This will

not matter for our purposes. $\hat{\sigma}$ is called the standard error of the regression, which means it is an estimate of the standard deviation of the error in the regression. Stata calls it the root mean squared error.

Given $\hat{\sigma}$, we can now estimate $sd(\hat{\beta}_0)$ and $sd(\hat{\beta}_1)$. The estimates of these are called the standard errors of the $\hat{\beta}_j$. We will use these a lot. Almost all regression packages report the standard errors in a column next to the coefficient estimates. We just plug $\hat{\sigma}$ in for $\sigma$:

$$se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{SST_x}} \qquad (75)$$

where both the numerator and denominator are computed from the data. For reasons we will see, it is useful to report the standard errors below the corresponding coefficient, usually in parentheses.

Cluster robust standard errors. Some phenomena do not affect observations individually, but rather affect groups of observations which contain individuals, and then affect those individuals within the group in a common way. Say you wanted to estimate the effect of class size on student achievement, but you know that there exist unobservable things (like the teacher) which affect all the students equally. If we can commit to independence of these unobservables across classes, but individual student unobservables are correlated within a class, then we have a situation in which we need to cluster the standard errors. Here's an example:

$$y_{ig} = x_{ig}'\beta + \varepsilon_{ig} \quad \text{where } g = 1, \dots, G$$

and

$$E[\varepsilon_{ig}\varepsilon_{jg'}] = \begin{cases} 0 & \text{if } g \neq g' \\ \sigma_{(ij)g} & \text{if } g = g' \end{cases}$$

Let's stack the data by cluster first:

$$y_g = x_g'\beta + \varepsilon_g$$

The OLS estimator is still $\hat{\beta} = [X'X]^{-1}X'Y$. We just stacked the data, which doesn't affect the estimator itself. But it does change the variance:

$$V(\hat{\beta}) = E\Big[[X'X]^{-1} X'\Omega X [X'X]^{-1}\Big]$$

With this in mind, we can now write the variance-covariance matrix for clustered data as

$$\hat{V}(\hat{\beta}) = [X'X]^{-1}\bigg[\sum_{g=1}^G x_g' \hat{\varepsilon}_g \hat{\varepsilon}_g' x_g\bigg][X'X]^{-1}$$
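A sketch of this "sandwich" computation in Python (my own simulated example with a cluster-level shock in the error, not code from the text):

```python
import numpy as np

# Cluster-robust variance: (X'X)^{-1} [sum_g X_g' e_g e_g' X_g] (X'X)^{-1}.
# Errors share a common within-cluster shock, so observations are correlated
# within clusters but independent across them.
rng = np.random.default_rng(11)
G, m = 50, 20                                          # 50 clusters of 20
g = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m)
eps = rng.normal(size=G)[g] + rng.normal(size=G * m)   # cluster shock + noise
y = 1 + 0.5 * x + eps

X = np.column_stack([np.ones(G * m), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta                                       # OLS residuals

meat = np.zeros((2, 2))
for c in range(G):                                     # sum_g X_g' e_g e_g' X_g
    sg = X[g == c].T @ e[g == c]
    meat += np.outer(sg, sg)

V_cluster = XtX_inv @ meat @ XtX_inv                   # the sandwich
V_conv = (e @ e / (G * m - 2)) * XtX_inv               # conventional formula, for comparison
print(np.sqrt(V_cluster[1, 1]), np.sqrt(V_conv[1, 1]))
```

With a strong common shock, the clustered variance for the intercept is far larger than the conventional one, which is the point of the adjustment.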

Directed acyclical graphs

"Everyday it rains, so everyday the pain / Went ignored and I'm sure ignorance was to blame / But life is a chain, cause and effected" – Jay-Z

Here we take a bit of a detour, because this material is not commonly featured in the economist's toolbox. It is nonetheless extremely valuable, and worth spending some time learning, because I will try to convince you that these graphical models can help you identify causal effects in observational data.

The history of graphical causal modeling in science goes back to Phillip Wright, an economist and the father of Sewall Wright, the father of modern genetics. Sewall developed path diagrams for genetics, and Phillip, we believe, adapted them for econometric identification [Matsueda, 2012]. (We will discuss Wright again in the chapter on instrumental variables.) The use of graphs in causal modeling has been largely ignored by the economics profession, with only a few exceptions [Heckman and Pinto, 2015]. It was not revitalized for the purposes of causal inference until Judea Pearl began developing his own unique theory of causation using them [Pearl, 2009]. Pearl's influence has been immense outside of economics, including in many of the social sciences, but few economists are familiar with him or use graphical models in their work. Since I think graphical models are immensely helpful for designing a credible identification strategy, I have chosen to include these models for your consideration. We will now have a simple review of graphical models, one of Pearl's contributions to the theory of causal inference. (This section is heavily influenced by Morgan and Winship [2014].)

Introduction to DAG notation

Before we begin, I'd like to discuss some limitations of the directed acyclical graph (DAG) representation of causality. The first thing to note about DAG notation is that causality runs in one direction. There are no cycles in a DAG. To show reverse causality, one would need

to create multiple nodes, most likely with two versions of the same node separated by a time index. Secondly, DAGs may not be able to handle simultaneity, according to Heckman and Pinto [2015]. But with those limitations in mind, we proceed forward, as I have found DAGs to be extremely valuable otherwise.

A DAG is a way of modeling a causal effect using graphs. The DAG represents these causal effects through a set of nodes and arrows, or directed edges. For a complete understanding of the notation, see Pearl [2009]. I will use a modified shorthand that I believe is sufficient for my purposes in this book. A DAG contains nodes, which represent random variables. These random variables are assumed to be created by some data generating process that is often left out of the DAG itself, though not always. I leave them out because they clutter the graph unnecessarily. Arrows represent a causal effect between two random variables, moving in the intuitive direction of the arrow. The direction of the arrow captures cause and effect, in other words. Causal effects can happen in two ways. They can either be direct (e.g., D → Y), or they can be mediated by a third variable (e.g., D → X → Y). When they are mediated by a third variable, technically speaking we are not capturing the effect of D on Y, but rather we are capturing a sequence of events originating with D, which may or may not be important to you depending on the question you're asking.

A DAG is meant to be a complete description of all causal relationships relevant to the effect of D on Y. What makes the DAG distinctive is both the explicit commitment to a causal effect pathway and the complete commitment to the lack of a causal pathway represented by missing arrows. A complete DAG will have all direct causal effects among the variables in the graph, as well as all common causes of any pair of variables in the graph.
At this point, you may be wondering, "Where does the DAG come from?" It's an excellent question. A DAG is a theoretical representation of some phenomenon, and it comes from a variety of sources. Examples would include economic theory, economic models, your own observations and experiences, literature reviews, as well as your own intuition and hypotheses.

I will argue that the DAG, at minimum, is useful for a few reasons. One, it is helpful for students to better understand research designs and estimators for the first time. This is, in my experience, especially true for instrumental variables, which has a very intuitive DAG representation. Two, through concepts such as the backdoor criterion and collider bias, a well-designed DAG can help you develop a credible research design for identifying the causal effects of some intervention.

A Simple DAG. Let's begin with a concrete example. Consider the following DAG. We begin with a basic DAG to illustrate a few ideas, but will expand it to slightly more complex ones later.

    D → Y,  X → D,  X → Y

In this DAG, we have three random variables: X, D, and Y. There is a direct path from D to Y, which represents a causal effect. There are two paths from D to Y – one direct path, and one backdoor path. The direct path, or causal effect, is D → Y.

The idea of the backdoor path is one of the most important things that we learn from the DAG. It is similar to the notion of omitted variable bias in that it represents a determinant of some outcome that is itself correlated with a variable of interest. Just as not controlling for a variable like that in a regression creates omitted variable bias, leaving a backdoor open creates bias. The backdoor path is D ← X → Y. We therefore call X a confounder in the sense that, because it jointly determines D and Y, it confounds our ability to discern the effect of D on Y in naive comparisons.

Think of the backdoor path like this: sometimes when D takes on different values, Y takes on different values because D causes Y. But sometimes D and Y take on different values because X takes on different values, and that bit of the correlation between D and Y is purely spurious. The existence of two causal pathways is contained within the correlation between D and Y. When a backdoor path has a confounder on it and no "collider", we say that backdoor path is open. (More on colliders in a moment.)

Let's look at a second DAG, this one more problematic than the one before. In the previous example, X was observed. We know it was observed because the direct edges from X to D and Y were solid lines. But sometimes there exists a confounder that is unobserved, and when there is, we represent its direct edges with dashed lines.
Consider the following DAG, in which the edges from U are dashed to indicate that U is unobserved:

    D → Y
    D ⇠ U ⇢ Y   (dashed: U unobserved)

Same as before, U is a noncollider along the backdoor path from D to Y, but unlike before, U is unobserved to the researcher. It exists, but it may simply be missing from the dataset. In this situation, there are two pathways from D to Y. There's the direct pathway, D → Y, which is the causal effect, and there's the backdoor pathway

D ← U → Y. And since U is unobserved, that backdoor pathway is open.

Let's now move to another example, one that is slightly more realistic. A traditional question in labor economics is whether college education increases earnings. According to the Becker human capital model [Becker, 1994], education increases one's marginal product, and since workers are paid their marginal product in competitive markets, it also increases their earnings. But college education is not random; it is optimally chosen given subjective preferences and resource constraints. We will represent that with the following DAG. As always, let D be the treatment (e.g., college education) and Y be the outcome of interest (e.g., earnings). Furthermore, let PE be parental education, I be family income, and B be unobserved background factors, such as genetics, family environment, mental ability, etc.

    D → Y;  I → D;  I → Y
    PE → I;  PE → D
    B → PE;  B → D   (B unobserved)

This DAG is telling a story. Can you interpret that story for yourself? Here is my interpretation. Each person has some background. It's not contained in most datasets, as it measures things like intelligence, conscientiousness, mood stability, motivation, family dynamics, and other environmental factors. Those environmental factors are likely correlated between parent and child, and therefore are subsumed in the variable B. Background causes a parent to choose her own level of education, and that choice also causes the child to choose a level of education, through a variety of channels. First, there are the shared background factors, B; those cause the child to choose a level of education, just as they did for the parent. Second, there's a direct effect, perhaps through simple modeling of achievement – a kind of peer effect. And third, there's the effect that parental education has on family earnings, I, which in turn affects how much schooling the child receives.
Family earnings may themselves affect the child's earnings through bequests and other transfers, as well as external investments in the child's productivity. This is a simple story to tell, and the DAG tells it well, but I want to alert your attention to some subtle points contained in this DAG. One, notice that B has no direct effect on the child's earnings except through its effect on schooling. Is this realistic, though? Economists have long maintained that unobserved ability both determines how much schooling a child gets and directly affects the child's earnings, insofar as intelligence and motivation can influence careers. But in

this DAG, there is no relationship between background and earnings, which is itself an assumption.

Now that we have a DAG, what do we do? We want to list out all the direct and indirect (i.e., backdoor) paths between D and Y:

1. D → Y (the causal effect of education on earnings)
2. D ← I → Y (backdoor path #1)
3. D ← PE → I → Y (backdoor path #2)
4. D ← B → PE → I → Y (backdoor path #3)

Thus, we have four paths between D and Y: one direct causal effect and three backdoor paths. And since none of the variables along the backdoor paths are colliders, each of these backdoor paths is open, creating systematic and independent correlations between D and Y.

Colliding

But what is this term "collider"? It's an unusual term, one you may have never seen before, so let's introduce it with another example. We'll use a simple DAG to illustrate what a collider is.

    D → Y
    D → X ← Y

Notice in this graph there are two paths from D to Y as before. There's the direct (causal) path, D → Y. And there's the backdoor path, D → X ← Y. Notice the subtle difference between this backdoor path and the previous one: this time two arrows, one from D and one from Y, point into X. X on this backdoor path is called a "collider" (as opposed to a confounder) because the causal effects of D and Y are colliding at X. Let's list all paths from D to Y:

1. D → Y (causal effect of D on Y)
2. D → X ← Y (backdoor path #1)

Here we have one backdoor path. And because X along that backdoor path is a collider, the path is currently closed. Colliders, when they are left alone, always close a specific backdoor path.

Backdoor criterion

Open backdoor paths create systematic, noncausal correlations between D and Y. Thus, usually our goal is to close every backdoor path. If we can close all backdoor paths, then we can isolate the causal effect of D on Y using one of the

research designs and identification strategies discussed in this book. So how do we close a backdoor path? There are two ways. First, if a confounder has created an open backdoor path, then you can close that path by conditioning on the confounder. Conditioning requires holding the variable fixed using something like subclassification, matching, regression, or some other method. It is equivalent to "controlling for" the variable in a regression. The second way to close a backdoor path is if a collider appears along that backdoor path. Since colliders always close backdoor paths, and conditioning on a collider always opens a backdoor path, you want to leave colliders alone. That is, don't control for colliders in any way, and that backdoor path will stay closed.

When all backdoor paths have been closed, we say that you have met the backdoor criterion through some conditioning strategy. Let's formalize it: a set of variables X satisfies the backdoor criterion in a DAG if and only if conditioning on X blocks every backdoor path between D and Y – that is, every path that contains an arrow pointing into D. Let's review our original DAG involving parental education, background and earnings.

    D → Y;  I → D;  I → Y;  PE → I;  PE → D;  B → PE;  B → D

The minimally sufficient conditioning strategy necessary to achieve the backdoor criterion is to control for I, because I appears as a non-collider along every backdoor path (see the list earlier). But maybe in hearing this story, and studying it for yourself by reviewing the literature and the economic theory surrounding it, you are skeptical of this DAG. Specifically, you are skeptical that B has no relationship to Y except through D or PE. That skepticism leads you to believe that there should be a direct connection from B to Y, not merely one mediated through the child's own education:

    B → Y   (new edge)

Note that including this new edge has created a problem, because our conditioning strategy no longer satisfies the

backdoor criterion. Even controlling for I, there still exist spurious correlations between D and Y, and without more information about the nature of B → D and B → Y, we cannot say much more about the partial correlation between D and Y – only that it's biased.

In our earlier DAG with collider bias, we conditioned on some variable X that was a collider – specifically, it was a descendent of D and Y. But sometimes colliders are more subtle. Let's consider the following scenario. Again, let D and Y be child schooling and child earnings. But this time we introduce three new variables – U1, which is the father's unobserved genetic ability, U2, which is the mother's unobserved genetic ability, and I, which is joint family income. Assume that I is observed, but that U1 and U2 are unobserved.

    D → Y
    U1 → D;  U1 → Y;  U1 → I
    U2 → D;  U2 → Y;  U2 → I

Notice that in this DAG there are several backdoor paths from D to Y. They are:

1. D ← U2 → Y
2. D ← U1 → Y
3. D ← U1 → I ← U2 → Y
4. D ← U2 → I ← U1 → Y

The first two are open backdoor paths, and they cannot be closed because U1 and U2 are not observed. But what if we controlled for I anyway? Controlling for I only makes matters worse, because it opens the third and fourth backdoor paths – I is a collider along both of them. It does not appear that any conditioning strategy could meet the backdoor criterion in this DAG.

So to summarize, satisfying the backdoor criterion requires just a few steps. First, write down all paths – both directed paths and backdoor paths – between D and Y. Second, note whether each backdoor path is open or closed by checking for colliders and confounders along it. Third, check whether you can close all backdoor paths through some conditioning strategy. If you can, then that conditioning strategy satisfies the backdoor criterion, and thus you can identify the causal effect of D on Y.

Examples of collider bias: Gender disparities controlling for occupation

The issue of conditioning on a collider is important, so how do we know if we have that problem or not? No dataset is going to come with a flag saying "collider" or "confounder". Rather, the only way to know whether you have satisfied the backdoor criterion is with a DAG, and a DAG requires a model. It requires in-depth knowledge of the data generating process for the variables in your DAG, but it also requires ruling out pathways. And the only way to rule out pathways is through logic and models. There is no way to avoid it – all empirical work requires theory to guide it. Otherwise, how do you know whether you've conditioned on a collider or a noncollider? Put differently, you cannot identify treatment effects without making assumptions.

Collider bias is a difficult concept to understand at first, so I've included a couple of examples to help you sort through it. Let's first examine a real world example. It is common to hear someone deny the existence of gender disparities in earnings by saying that once occupation or other characteristics of a job are conditioned on, the wage disparity disappears or gets smaller. For instance, the NYT claimed that Google systematically underpaid its female employees. But Google responded that their data showed that when you take "location, tenure, job role, level and performance" into consideration, female pay is basically identical to that of male counterparts. In other words, controlling for characteristics of the job, women received the same pay.

But what if one of the ways in which gender discrimination creates gender disparities in earnings is through occupational sorting? Then naive regressions of wages onto a gender dummy that control for occupation will be biased towards zero, thus understating the degree of discrimination in the marketplace.
Put differently, when there exists occupational sorting based on unobserved ability, we cannot identify the actual discrimination effect by controlling for occupation. Let's first draw a DAG to illustrate the problem.

    F → d → o → y
    F → o;  d → y
    A → o;  A → y   (A unobserved)

Notice that there is in fact no direct effect of being female (F) on earnings (y), because women are assumed to be just as productive as men. Thus if we

could control for discrimination, we'd get a coefficient of zero, since in this example women are just as productive as men. But here we aren't interested in estimating the effect of being female on earnings; we are interested in estimating the effect of discrimination itself. Now you can see several backdoor paths between discrimination (d) and earnings (y). They are:

1. d ← F → o → y
2. d → o → y
3. d ← F → o ← A → y
4. d → o ← A → y

So let's say we regress y onto d (which will always pick up the discrimination effect). This is biased because it picks up the effect of discrimination on occupation and earnings, as well as gender's effect on occupation and earnings. So naturally, we might want to control for occupation. But notice that when we do this, we close down the first two paths but open the paths on which o is a collider (the third and fourth). So when we control for occupation, we open up new paths. This is the reason we cannot merely control for occupation: such a control ironically introduces new patterns of bias. What is needed rather is to control for occupation and ability, but since ability is unobserved, we cannot do that, and therefore we do not possess an identification strategy that satisfies the backdoor criterion.

Let's now look at Stata code created by Erin Hengel at the University of Liverpool, which she has graciously lent to me with permission to reproduce here. (Erin has done very good work on gender discrimination; see her website for more of this work.)

* Create confounding bias for female occupation and gender gap
clear all
set obs 10000

* Half of the population is female.
generate female = runiform()>=0.5

* Innate ability is independent of gender.
generate ability = rnormal()

* All women experience discrimination.
generate discrimination = female

* Continuum of occupations ranked monotonically according to ability, conditional
* on discrimination—i.e., higher ability people are allocated to higher ranked
* occupations, but due to discrimination, women are sorted into lower ranked
* occupations, conditional on ability. Also assumes that in the absence of
* discrimination, women and men would sort into identical occupations (on average).
generate occupation = (1) + (2)*ability + (0)*female + (-2)*discrimination + rnormal()

* The wage is a function of discrimination even in identical jobs, occupational
* choice (which is also affected by discrimination) and ability.
generate wage = (1) + (-1)*discrimination + (1)*occupation + (2)*ability + rnormal()

* Assume that ability is unobserved. Then if we regress wage on female, we get a
* consistent estimate of the unconditional effect of discrimination—i.e.,
* both the direct effect (paying women less in the same job) and the indirect
* effect (occupational choice).
regress wage female

* But occupational choice is correlated with the unobserved factor ability *and*
* it is correlated with female, which renders our estimates on female and
* occupation no longer informative.
regress wage female occupation

* Of course, if we could only control for ability...
regress wage female occupation ability
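For readers without Stata, the same data-generating process can be sketched in plain Python. The small `ols` helper below (normal equations solved by Gaussian elimination) is my own illustration, not part of the chapter; the simulation coefficients match the Stata code above, so the three regressions should reproduce the qualitative pattern in Table 8: a large negative unconditional female coefficient, a sign flip once occupation (a collider) is controlled, and the direct effect of -1 once ability is also controlled.

```python
import random

random.seed(11)
n = 10_000

# Mirror the Stata simulation's coefficients.
female = [1.0 if random.random() >= 0.5 else 0.0 for _ in range(n)]
ability = [random.gauss(0, 1) for _ in range(n)]
discrimination = female[:]  # all women experience discrimination
occupation = [1 + 2*a + 0*f - 2*di + random.gauss(0, 1)
              for f, a, di in zip(female, ability, discrimination)]
wage = [1 - 1*di + 1*o + 2*a + random.gauss(0, 1)
        for di, o, a in zip(discrimination, occupation, ability)]

def ols(y, cols):
    """OLS coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [col[i] for col in cols] for i in range(len(y))]
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):  # Gaussian elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

b_uncond = ols(wage, [female])[1]                        # ~ -3 (total effect)
b_collider = ols(wage, [female, occupation])[1]          # sign flips: collider bias
b_full = ols(wage, [female, occupation, ability])[1]     # ~ -1 (direct effect)
print(round(b_uncond, 2), round(b_collider, 2), round(b_full, 2))
```

With these coefficients the unconditional effect is -3 (the direct -1 plus the -2 working through occupation), the occupation-conditioned coefficient turns positive, and only the infeasible regression that also controls for ability recovers -1.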

Table 8: Regressions illustrating confounding bias with simulated gender disparity

Covariates             Biased unconditional   Biased       Unbiased conditional
Female                 -3.074***              0.601***     -0.994***
                       (0.000)                (0.000)      (0.000)
Occupation                                    1.793***     0.991***
                                              (0.000)      (0.000)
Ability                                                    2.017***
                                                           (0.000)
N                      10,000                 10,000       10,000
Mean of dep. variable  0.45                   0.45         0.45

Examples of collider bias #2: qualitative change in sign

Sometimes the problem with conditioning on a collider can be so severe that the correlation becomes statistically insignificant, or worse, even switches sign. Let's see an example where that is true.

clear all
set seed 541

* Creating collider bias
* Z -> D -> Y
* D -> X <- Y

* 2500 independent draws from standard normal distribution
set obs 2500
gen z = rnormal()
gen k = rnormal(10,4)
gen d = 0
replace d = 1 if k >= 12

* Treatment effect = 50. Notice y is not a function of X.
gen y = d*50 + 100 + rnormal()
gen x = d*50 + y + rnormal(50,1)

* Regressions
reg y d, robust
reg y x, robust
reg y d x, robust

Okay, so let's walk through this exercise. We can see from the above code that the treatment effect is 50, because we coded y as gen y = d*50 + 100 + rnormal(). It is for this reason that when we run the first regression, we get a coefficient of 49.998 (column 1). Next we run a regression of Y on X. Here when we do this, we find a significant

Table 9: Regressions illustrating collider bias

Covariates             1            2           3
d                      50.004***                -0.757***
                       (0.044)                  (0.024)
x                                   0.500***    1.508***
                                    (0.000)     (0.010)
N                      2,500        2,500       2,500
Mean of dep. variable  90.114       90.114      90.114

effect, yet recall that Y is not a function of X. Rather, X is a function of Y. So this is a spurious result driven by reverse causality. That said, surely we can at least control for X in a regression of Y on D, right? Column 3 shows the impossibility of this regression: it is impossible to recover the causal effect of D on Y when we control for X. Why? Because X is a collider, and by conditioning on it, we are introducing new systematic correlations between D and Y that wipe out the causal effect.

Examples of collider bias: Nonrandom sample selection

Maybe this is still not clear. I hope that the following example, therefore, will clarify matters, as it will end in a picture, and a picture speaks a thousand words. A 2009 article stated that Megan Fox, of Transformers, was voted both the worst and the most attractive actress of 2009. While not explicit in the article, the implication was that talent and beauty are negatively correlated. But are they? What if they are in fact independent of each other, and the negative correlation is the result of collider bias? (I wish I had thought of this example, but alas, I didn't. Gabriel Rossman gets full credit.) What would that look like?

To illustrate, we will generate some data based on the following DAG:

    Beauty → Movie Star ← Talent

Run the following program in Stata.

clear all
set seed 3444

* 2500 independent draws from standard normal distribution
set obs 2500
generate beauty=rnormal()
generate talent=rnormal()

* Creating the collider variable (star)

gen score=(beauty+talent)
egen c85=pctile(score), p(85)
gen star=(score>=c85)
label variable star "Movie star"

* Conditioning on the top 15%
twoway (scatter beauty talent, mcolor(black) msize(small) msymbol(smx)), ytitle(Beauty) xtitle(Talent) subtitle(Aspiring actors and actresses) by(star, total)

[Figure 7: Scatter plots of beauty (vertical axis) against talent (horizontal axis). Top left: non-star sample. Top right: star sample. Bottom left: entire sample (stars and non-stars combined).]

The bottom left panel shows the scatterplot between talent and beauty. Notice that the two variables are independent draws from the standard normal distribution, creating an oblong data cloud. But because "movie star" status is drawn from the top 15 percent of the distribution of a linear combination of talent and beauty, the movie star sample is formed by a frontier of the combined variables. This frontier has a negative slope and sits in the upper right portion of the data cloud, creating a negative correlation between the observations in the movie star sample. Likewise, the collider bias has created a negative correlation between talent and beauty in the non-movie-star sample as well. Yet we know that there is in fact no relationship between the two variables. This kind of sample selection creates spurious correlations. (A random sample of the full population would be sufficient to show that there is no relationship between the two variables.)

Conclusion

In conclusion, DAGs are powerful tools. There is far more to them than I have covered here. If you are interested in learning more about them, then I encourage you to carefully read Pearl [2009], which is his magnum opus. It's a major contribution to the theory of causation, and in my opinion, his ideas merit inclusion in your toolkit as you think carefully about identifying causal effects with observational data. DAGs are helpful for clarifying the relationships between variables, but more importantly than that, DAGs make explicit whether you can identify a causal effect in your dataset. The concept of the backdoor criterion is one way by which you can hope to achieve that identification, and DAGs will help guide you to the identification strategy that satisfies that criterion. Finally, I have found that students learn a lot through this language of DAGs, and since Pearl [2009] shows that DAGs subsume the potential outcomes model (more on that in the next chapter), you need not worry that it is creating unnecessary complexity and contradictions in your pedagogy.

Potential outcomes causal model

"And if you had the chance to go back to her pad for a passionate act you won't allow it. But if your plans for a chance to go back ain't even had. Then the passionate act won't happen, 'cause you plan not to have the chance." – The Streets

Practical questions about causation have been a preoccupation of economists for several centuries. Adam Smith wrote about the causes of the wealth of nations [Smith, 2003]. Karl Marx was interested in the transition of society from capitalism to socialism [Needleman and Needleman, 1969]. The 20th century Cowles Commission sought to better understand the identification problem [Heckman and Vytlacil, 2007]. (This brief history will focus on the development of the potential outcomes model; see Morgan [1991] for a more comprehensive history of econometric ideas.)

We can see the development of the modern causality concepts in the writings of several philosophers. Hume [1993] described causation as a sequence of temporal events in which, had the first event not occurred, the subsequent ones would not have occurred either. An example of this is where he said: "[w]e may define a cause to be an object, followed by another, and where all the objects similar to the first are followed by objects similar to the second. Or in other words where, if the first object had not been, the second never had existed."

Mill [2010] devised five methods for inferring causation: (1) the method of agreement, (2) the method of difference, (3) the joint method, (4) the method of concomitant variation, and (5) the method of residues. The second method, the method of difference, is most similar to the idea of causation as a comparison among counterfactuals. For instance, he wrote: "If a person eats of a particular dish, and dies in consequence, that is, would not have died if he had not eaten it, people would be apt to say that eating of that dish was the source of his death."

Statistical inference

A major jump in our understanding of causation occurs coincident with the development of modern statistics. Probability theory and statistics revolutionized science in the 19th century, originally in astronomy. Giuseppe Piazzi, an early 19th century astronomer, discovered the dwarf planet Ceres, located between Jupiter and Mars, in 1801. Piazzi observed it 24 times before it was lost again. Carl Friedrich Gauss proposed a method that could successfully predict Ceres' next location using data on its prior locations. His method minimized the sum of the squared errors – ordinary least squares. He discovered it at age 18 and published it in 1809 [Gauss, 1809]. Other contributors include Laplace and Legendre.

Regression analysis enters the social sciences through the work of the statistician G. Udny Yule. Yule [1899] was interested in the causes of poverty in England. Poor people depended on either poor-houses or the local authorities. Yule wanted to know whether public assistance increased the number of paupers, which is a causal question. Yule used Gauss's least squares method to estimate the partial correlation between public assistance and poverty. His data were drawn from the English Censuses of 1871 and 1881. Download them using scuse:

. scuse yule

Each row is a particular location in England (e.g., Chelsea, Strand), and the second through fourth columns are growth rates. Yule estimated a model similar to the following:

    Pauper = β0 + β1 Outrelief + β2 Old + β3 Pop + u

Using our data, we would estimate this using the regress command:

. regress paup outrelief old pop

His results are reported in Table 10. In words, a 10 percentage point change in the outrelief growth rate is associated with a 7.5 percentage point increase in the pauperism growth rate – an elasticity of 0.75.
Yule used regression to isolate the effects of out-relief, and his principal conclusion was that welfare increased pauper growth rates. What's wrong with his statistical reasoning? Do we think that the unobserved determinants of pauperism growth rates are uncorrelated with out-relief growth rates? After all, he does not control for any economic factors, which surely affect both poverty and the amount of resources allocated to out-relief. Likewise, he may have the causality backwards – perhaps the growth in

pauperism is the cause of the growth in out-relief, not the other way around. But, despite its flaws, it represented the first known instance where statistics (and regression in particular) was used to estimate a policy-relevant causal effect.

Table 10: Yule regressions [Yule, 1899]

Dependent variable: Pauperism growth

Covariates
Outrelief    0.752
             (0.135)
Old          0.056
             (0.223)
Pop         -0.311
             (0.067)

Physical randomization

The notion that physical randomization was the foundation of causal inference was in the air in the 19th and early 20th century, but it was not until Fisher [1935] that it crystallized. The first historically recognized randomized experiment came fifty years earlier, in psychology [Peirce and Jastrow, 1885]. But interestingly, their reason for randomization was not as the basis for causal inference. Rather, they proposed randomization as a way of fooling subjects in their experiments. Peirce and Jastrow [1885] were running an experiment in which subjects received a sequence of treatments, and they used physical randomization so that participants couldn't guess what would happen next. Peirce appears to have anticipated Neyman's concept of unbiased estimation when using random samples, and appears to have even thought of randomization as a physical process to be implemented in practice, but no one can find any suggestion for the physical randomization of treatments to units as a basis for causal inference until Splawa-Neyman [1923] and Fisher [1925]. Splawa-Neyman [1923] develops the very useful potential outcomes notation, and while he proposes randomization, it is not taken to be literally necessary until Fisher [1925], who proposes the explicit use of randomization in experimental design for causal inference. (For more on the transition from Splawa-Neyman [1923] to Fisher [1925], see Rubin [2005].)

Fisher [1935] described a thought experiment in which a lady
[ Rubin see 2005 claims she can discern whether milk or tea was poured first in a cup of tea. While he does not give her name, we now know that the lady in the thought experiment was Muriel Bristol and that the 48 48 Muriel Bristol established thought experiment in fact did happen. Apparently, Bristol correctly guessed all four cups of tea. the Rothamstead Experiment Station in 1919 and was a PhD scientist

back in the days when women weren't PhD scientists. One day during afternoon tea, Muriel claimed that she could tell whether the milk was added to the cup before or after the tea, which, as one might guess, got a good laugh from her male colleagues. Fisher took the bait and devised the following randomized experiment.

Given a cup of tea with milk, a lady claims she can discern whether the milk or the tea was added to the cup first. To test her claim, 8 cups of tea were prepared, 4 in which the milk was added first and 4 in which the tea was added first. How many cups does she have to correctly identify to convince us of her uncanny ability? Fisher [1935] proposed a kind of permutation-based inference – a method we now call the Fisher exact test. She possesses the ability, probabilistically rather than with certainty, if the likelihood of her guessing all four correctly by chance is sufficiently low. There are 8 × 7 × 6 × 5 = 1,680 ways to choose a first cup, a second cup, a third cup, and a fourth cup, in order. There are 4 × 3 × 2 × 1 = 24 ways to order 4 cups. So the number of ways to choose 4 cups out of 8 is 1680/24 = 70. Note that the lady performs the experiment by selecting 4 cups. The probability that she would correctly identify all 4 cups by chance is 1/70. Either she has no ability and has chosen the correct 4 cups by chance alone, or she has the discriminatory ability she claims. Since choosing correctly by chance is highly unlikely (one chance in 70), we decide for the second. To get exactly 3 right, she would have to choose 3 from the 4 correct cups. With order, she can do this in 4 × 3 × 2 = 24 ways; since 3 cups can be ordered in 3 × 2 × 1 = 6 ways, there are 4 ways for her to choose 3 correct cups. Since she can choose the 1 incorrect cup in 4 ways, there are a total of 4 × 4 = 16 ways for her to choose exactly 3 right and 1 wrong. Hence the probability that she chooses exactly 3 correctly is 16/70.
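This counting argument is easy to verify with Python's `math.comb` (a quick check, not part of the original text):

```python
from math import comb

# 8 cups, 4 with milk poured first; the lady picks the 4 "milk-first" cups.
total = comb(8, 4)                    # number of ways to choose 4 cups of 8
p_all_four = 1 / total                # all four correct by pure chance
ways_three = comb(4, 3) * comb(4, 1)  # 3 of the 4 correct, 1 of the 4 incorrect
p_three = ways_three / total

print(total, ways_three)         # 70 16
print(round(p_all_four, 4))      # 0.0143
print(round(p_three, 4))         # 0.2286
```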
Getting only 3 correct and 1 wrong would be evidence for her ability, but not persuasive evidence, since getting 3 correct happens by chance with probability 16/70 ≈ 0.23. Causal inference, in this context, is a probabilistic idea wherein the observed phenomenon is compared against a permutation-based randomization called the null hypothesis. The null hypothesis is a specific description of a possible state of nature. In this example, the null hypothesis is that the lady has no special ability to discern the order in which milk is poured into tea, and thus the observed outcome occurred only by chance. We can never prove the null, but the data may provide evidence to reject it. In most situations, we are trying to reject the null hypothesis.

Medicine and Economics

Physical randomization had largely been the domain of agricultural experiments until the mid-1950s, when it began to be used in medical trials. One of the first major randomized experiments in medicine was the polio vaccination trials. The Salk polio

vaccine field trial was one of the largest randomized experiments ever attempted, as well as one of the earliest. In 1954, the Public Health Service set out to answer whether the Salk vaccine prevented polio. Children in the study were assigned at random to receive the vaccine or a placebo. (In the placebo arm, children were inoculated with a saline solution.) Also, the doctors making the polio diagnoses did not know whether the child had received the vaccine or the placebo. The polio vaccine trial was a double-blind, randomized controlled trial. It was necessary for the field trial to be very large because the rate at which polio occurred in the population was 50 per 100,000. The treatment group, which contained 200,745 individuals, saw 33 polio cases. The control group, which had been inoculated with the placebo, had 201,229 individuals and saw 115 cases. The probability of seeing this large a difference by chance alone is about 1 in a billion. The only plausible explanation, it was argued, was that the polio vaccine caused a reduction in the risk of polio.

A similar large-scale randomized experiment occurred in economics in the 1970s. Between 1971 and 1982, the Rand Corporation conducted a large-scale randomized experiment studying the causal effect of healthcare insurance on healthcare utilization. For the study, Rand recruited 7,700 individuals under age 65. The experiment was somewhat complicated, with multiple treatment arms. Participants were randomly assigned to one of five health insurance plans: free care, three plans with varying levels of cost sharing, and an HMO plan. Participants with cost sharing made fewer physician visits and had fewer hospitalizations than those with free care. Other declines in health care utilization, such as fewer dental visits, were also found among the cost-sharing treatment groups.
Overall, participants in the cost-sharing plans tended to spend less on health care, which came from using fewer services. The reduced use of services occurred mainly because participants in the cost-sharing treatment groups were opting not to initiate care.[50]

[50] More information about this fascinating experiment can be found in Newhouse [1993].

Potential outcomes

While the potential outcomes ideas had been around, they did not become the basis of causal inference in the social sciences until Rubin [1974].[51]

[51] The idea of causation as based on counterfactuals appears in philosophy independent of Rubin [1974], with Lewis [1973]. Some evidence for it may exist in John Stuart Mill's methods for causal inference as well.

In the potential outcomes tradition [Splawa-Neyman, 1923; Rubin, 1974], a causal effect is defined as a comparison between two states of the world. In the first state of the world, a man takes aspirin for his headache and one hour later reports the severity of his headache. In the second state of the world, that same man refused aspirin and one hour later reported the severity of his headache. What was the causal effect of the aspirin? According to Rubin, the causal effect of the aspirin is the difference in the severity of his headache between two states of the world: one where he took the aspirin (the actual state of the world) and one where he never

took the aspirin (the counterfactual state of the world). The difference between these two dimensions, if you will, at the same point in time represents the causal effect of the intervention itself.

To ask questions like this is to engage in a kind of storytelling. Humans have always been interested in stories exploring counterfactuals. Examples include A Christmas Carol, It's a Wonderful Life, and The Man in the High Castle, just to name a few. What if Bruce Wayne's parents had never been murdered? What if that waitress had won the lottery? What if your friend from high school had never taken that first drink? What if Neo had taken the blue pill? These are the sorts of questions that can keep a person up at night.

But it's important to note that these kinds of questions are by definition unanswerable.[52] To wonder how life would be different had one single event been changed is to indulge in counterfactual reasoning, and since counterfactuals by definition don't exist, the question cannot be answered. History is a sequence of observable, factual events, one after another. We don't know what would have happened had one event changed because we are missing data on the counterfactual. Potential outcomes exist ex ante as a set of possibilities, but once a decision is made, all but one of them disappears.

[52] It is also worth noting that counterfactual reasoning appears to be a hallmark of the human mind. We are unusual among creatures in that we are capable of asking and imagining these kinds of what-if questions.

Donald Rubin, and the statisticians Ronald Fisher and Jerzy Neyman before him, take as a starting point that a causal effect is a comparison across two potential outcomes.[53] To make this concrete, we introduce some notation and language. For simplicity, we will assume a dummy variable that takes on a value of one if a particular unit i receives the treatment and a zero if it does not.[54] Each unit will have two potential outcomes, but only one observed outcome. Potential outcomes are defined as Y_i^1 if the unit received the treatment and Y_i^0 if the unit did not. We'll call the state of the world where no treatment occurred the control state. Notice the superscripts and the subscripts: each unit i has exactly two potential outcomes, a potential outcome under the state of the world where the treatment occurred (Y_i^1) and a potential outcome under the state of the world where the treatment did not occur (Y_i^0).

[53] This analysis can be extended to more than two potential outcomes, but for simplicity we will stick with just two.

[54] The treatment here is any particular intervention, or causal variable of interest. In economics, it is usually the comparative statics exercise.

Observable outcomes, Y_i, are distinct from potential outcomes. Whereas potential outcomes are hypothetical random variables that differ across the population, observable outcomes are factual random variables. A unit's observable outcome is determined according to a switching equation:

    Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0    (76)

where D_i equals one if the unit received the treatment and zero if it did not. Notice the logic of the equation. When D_i = 1, then Y_i = Y_i^1 because the second term zeroes out. And when D_i = 0, the first term zeroes out and therefore Y_i = Y_i^0.

Rubin defines a treatment effect, or causal effect, as simply the

difference between two states of the world:

    δ_i = Y_i^1 - Y_i^0

Immediately we are confronted with a problem. If a treatment effect requires knowing two states of the world, Y_i^1 and Y_i^0, but by the switching equation we observe only one of them, then we cannot calculate the treatment effect.

Average treatment effects

From this simple definition of a treatment effect come three different parameters that are often of interest to researchers. They are all population means. The first is called the average treatment effect, and it is equal to

    ATE = E[δ_i] = E[Y_i^1 - Y_i^0] = E[Y_i^1] - E[Y_i^0]

Notice, as with our definition of individual-level treatment effects, the average treatment effect is unknowable as well, because it requires two observations per unit i, one of which is a counterfactual. Thus the average treatment effect, ATE, like the individual treatment effect, is not a quantity that can be calculated with any data set known to man.

The second parameter of interest is the average treatment effect for the treatment group. That's a mouthful, but let me explain. There exist two groups of people: there's a treatment group and there's a control group. The average treatment effect for the treatment group, or ATT for short, is simply the population mean treatment effect for the group of units that were assigned the treatment in the first place. Insofar as δ_i differs across the population, the ATT may be different from the ATE. In observational data, it will in fact almost always be different from the ATE. And, like the ATE, it is unknowable, because like the ATE it requires two observations per treatment unit i:

    ATT = E[δ_i | D_i = 1]
        = E[Y_i^1 - Y_i^0 | D_i = 1]
        = E[Y_i^1 | D_i = 1] - E[Y_i^0 | D_i = 1]

The final parameter of interest is called the average treatment effect for the control group, or untreated group. Its shorthand is ATU, which stands for average treatment effect for the untreated.
And like it’s ATT brother, the ATU is simply the population mean treatment effect for the units in the control group. Given heterogeneous treatment effects, it’s probably the case that the ATT 6 = ATU – especially in an

observational setting. The formula for the ATU is

    ATU = E[δ_i | D_i = 0]
        = E[Y_i^1 - Y_i^0 | D_i = 0]
        = E[Y_i^1 | D_i = 0] - E[Y_i^0 | D_i = 0]

Depending on the research question, one or all three of these parameters are interesting. But the two most common ones of interest are the ATE and the ATT.

Simple difference in means decomposition

This has been somewhat abstract, so let's be concrete. Let's assume there are ten patients i who have cancer and two medical procedures or treatments. There is a surgery intervention, D_i = 1, and there is a chemotherapy intervention, D_i = 0. Each patient has the following two potential outcomes, where a potential outcome is defined as post-treatment lifespan in years:

Table 11: Potential outcomes for ten patients receiving surgery (Y^1) or chemo (Y^0).

    Patient   Y^1   Y^0    δ
    1          7     1     6
    2          5     6    -1
    3          5     1     4
    4          7     8    -1
    5          4     2     2
    6         10     1     9
    7          1    10    -9
    8          5     6    -1
    9          3     7    -4
    10         9     8     1

We can calculate the average treatment effect if we have this matrix of data, because the average treatment effect is simply the mean difference between columns 2 and 3. That is, E[Y^1] = 5.6 and E[Y^0] = 5, which means that ATE = 0.6. In words, the average causal effect of surgery for these ten patients is 0.6 additional years (compared to chemo).[55]

[55] Note that causality always involves comparisons.

Now notice carefully: not everyone benefits from surgery. Patient 7, for instance, lives only 1 additional year post-surgery versus 10 additional years post-chemo. But the ATE is simply the average over these heterogeneous treatment effects.

To maintain this fiction, let's assume that there exists a perfect doctor.[56]

[56] I credit Donald Rubin with this example [Rubin, 2004].

The perfect doctor knows each person's potential outcomes and chooses the treatment that is best for each person. In other

words, he chooses to put them in surgery or chemotherapy depending on whichever treatment has the longer post-treatment lifespan. Once he makes that treatment assignment, he observes their actual post-treatment outcome according to the switching equation we mentioned earlier.

Table 12: Post-treatment observed lifespans in years for surgery (D = 1) versus chemotherapy (D = 0).

    Patient    Y    D
    1          7    1
    2          6    0
    3          5    1
    4          8    0
    5          4    1
    6         10    1
    7         10    0
    8          6    0
    9          7    0
    10         9    1

Table 12 differs from Table 11 because Table 11 shows the potential outcomes, whereas Table 12 shows only the observed outcome for the treatment and control groups. Once treatment has been assigned, we can calculate the average treatment effect for the surgery group (ATT) versus the chemo group (ATU). The ATT equals 4.4 and the ATU equals -3.2. In words, that means that the average post-surgery lifespan for the surgery group is 4.4 additional years, whereas the average post-surgery lifespan for the chemotherapy group is 3.2 fewer years.[57]

[57] The reason that the ATU is negative is that the treatment here is the surgery, which was the worse of the two treatments for that group. But you could just as easily interpret this as 3.2 additional years of life if they had received chemo instead of surgery.

Now the ATE is 0.6, which is just a weighted average between the ATT and the ATU.[58] So we know that the overall effect of surgery is positive, though the effect for some is negative. There exist heterogeneous treatment effects, in other words, but the net effect is positive.

[58] ATE = p × ATT + (1 - p) × ATU = 0.5 × 4.4 + 0.5 × (-3.2) = 0.6.

But what if we were to simply compare the average post-surgery lifespan for the two groups? This is called an estimate of the ATE: it takes observed values and calculates means in an effort to estimate the parameter of interest, the ATE.
We will call this simple difference in mean outcomes the SDO,[59] and it is simply equal to

    E[Y^1 | D = 1] - E[Y^0 | D = 0]

which can be estimated using samples of data:

    SDO = (1/N_T) Σ_{i=1}^{n} (y_i | d_i = 1) - (1/N_C) Σ_{i=1}^{n} (y_i | d_i = 0)

[59] Morgan and Winship [2014] call this estimator the naive average treatment effect, or NATE for short.

which in this situation is equal to 7 - 7.4 = -0.4. Or in words, the treatment group lives 0.4 fewer years post-surgery than the chemo group. Notice how misleading this statistic is, though. We know that the average treatment effect is positive, but the simple difference in mean outcomes is negative. Why is it different?

To understand why, we will decompose the simple difference in mean outcomes using the law of iterated expectations (LIE) and the definition of the ATE. Note there are three parts to the SDO. Think of the left-hand side as the calculated average, and the right-hand side as the truth about that calculated average:

    E[Y^1 | D = 1] - E[Y^0 | D = 0] = ATE
                                      + (E[Y^0 | D = 1] - E[Y^0 | D = 0])
                                      + (1 - p)(ATT - ATU)    (77)

To understand where the parts on the right-hand side originate, we need to start over and decompose the parameter of interest, ATE, into a sum of conditional expectations using the law of iterated expectations:

    ATE = E[Y^1] - E[Y^0]
        = {p E[Y^1 | D = 1] + (1 - p) E[Y^1 | D = 0]}
          - {p E[Y^0 | D = 1] + (1 - p) E[Y^0 | D = 0]}

where p is the share of patients who received surgery and 1 - p is the share of patients who received chemotherapy. Because the conditional expectation notation is a little cumbersome, let's exchange each term on the left-hand side (ATE) and the right-hand side (the part we got from LIE) for some letters. This will make the proof a little less cumbersome to follow:

    E[Y^1 | D = 1] = a
    E[Y^1 | D = 0] = b
    E[Y^0 | D = 1] = c
    E[Y^0 | D = 0] = d
    ATE = e

Now work through the following algebraic manipulation.

    e = {p a + (1 - p) b} - {p c + (1 - p) d}
    e = p a + b - p b - p c - d + p d
    e = p a + b - p b - p c - d + p d + (a - a) + (c - c) + (d - d)
    0 = e - p a - b + p b + p c + d - p d - a + a - c + c - d + d
    a - d = e - p a - b + p b + p c + d - p d + a - c + c - d
    a - d = e + (c - d) + a - p a - b + p b - c + p c + d - p d
    a - d = e + (c - d) + (1 - p) a - (1 - p) b - (1 - p) c + (1 - p) d
    a - d = e + (c - d) + (1 - p)(a - c) - (1 - p)(b - d)

Now substituting our definitions, we get the following:

    E[Y^1 | D = 1] - E[Y^0 | D = 0] = ATE
                                      + (E[Y^0 | D = 1] - E[Y^0 | D = 0])
                                      + (1 - p)(ATT - ATU)

And the proof ends. The left-hand side can be estimated with a sample of data, and the right-hand side is equal to the following:

    (1/N_T) Σ (y_i | d_i = 1) - (1/N_C) Σ (y_i | d_i = 0)   [SDO]
        = E[Y^1] - E[Y^0]                                   [Average treatment effect]
        + E[Y^0 | D = 1] - E[Y^0 | D = 0]                   [Selection bias]
        + (1 - p)(ATT - ATU)                                [Heterogeneous treatment effect bias]

Let's discuss each of these in turn. The left-hand side is the simple difference in mean outcomes, and we know it is equal to -0.4. Thus it must be the case that the right-hand side sums to -0.4. The first term is the average treatment effect, which is the parameter of interest. We know that it is equal to +0.6. Thus the remaining two terms must be the source of the bias that is causing SDO < ATE. The second term is called the selection bias, which merits some unpacking. The selection bias is the inherent difference between the two groups if both had received chemo. Usually, though, it's just a description of the differences between the two groups if there had never been a treatment in the first place. There are, in other words, two groups: a surgery group and a chemo group. How do their potential outcomes under control differ? Notice that the first term, E[Y^0 | D = 1], is a counterfactual, whereas the second, E[Y^0 | D = 0], is an observed outcome according to the switching equation.
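Every term in this decomposition can be verified numerically with the ten-patient data from Table 11. The following Python sketch (a translation for illustration; the book's own examples use Stata) computes each piece and confirms they sum to the SDO:

```python
# Potential outcomes from Table 11; d marks the perfect doctor's assignments
y1 = [7, 5, 5, 7, 4, 10, 1, 5, 3, 9]
y0 = [1, 6, 1, 8, 2, 1, 10, 6, 7, 8]
d  = [1, 0, 1, 0, 1, 1, 0, 0, 0, 1]

def mean(xs):
    return sum(xs) / len(xs)

treated = [i for i in range(10) if d[i] == 1]
control = [i for i in range(10) if d[i] == 0]

ate = mean(y1) - mean(y0)                                   # 0.6
att = mean([y1[i] - y0[i] for i in treated])                # 4.4
atu = mean([y1[i] - y0[i] for i in control])                # -3.2
sdo = mean([y1[i] for i in treated]) - mean([y0[i] for i in control])             # -0.4
selection_bias = mean([y0[i] for i in treated]) - mean([y0[i] for i in control])  # -4.8
p = len(treated) / 10
het_bias = (1 - p) * (att - atu)                            # 3.8

# The decomposition: SDO = ATE + selection bias + heterogeneity bias
assert abs(sdo - (ate + selection_bias + het_bias)) < 1e-9
```

Running this reproduces the chapter's arithmetic exactly: -0.4 = 0.6 - 4.8 + 3.8.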
We can calculate this difference here because we have the complete potential outcomes in Table 11. That difference

is equal to 2.6 - 7.4 = -4.8. The third term is a lesser-known form of bias, but we include it to be comprehensive and because we are focused on the ATE.[60] The heterogeneous treatment effect bias is simply the different returns to surgery for the two groups multiplied by the share of the population that is in the chemotherapy group. This final term is 0.5 × (4.4 - (-3.2)), which is 3.8. In case it's not obvious, the reason that p = 0.5 is that 5 units out of 10 are in the chemotherapy group.

[60] Note that Angrist and Pischke [2009] have a slightly different decomposition, where SDO = ATT + selection bias, but that is because their parameter of interest is the ATT and therefore the third term doesn't appear.

Now that we have all three parameters on the right-hand side, we can see why the SDO is equal to -0.4:

    -0.4 = 0.6 - 4.8 + 3.8

Notice that the SDO actually does contain the parameter of interest. But the problem is that the parameter of interest is confounded by two forms of bias: the selection bias and the heterogeneous treatment effect bias. If there is a constant treatment effect, δ_i = δ for all i, then ATU = ATT and so SDO = ATE + selection bias. A large part of empirical research is simply trying to develop a strategy for eliminating selection bias.

Let's start with the most credible situation for using the SDO to estimate the ATE: when the treatment itself (e.g., surgery) has been assigned to patients independent of their potential outcomes. Notationally speaking, this is

    (Y^1, Y^0) ⊥ D

Now in our example, we already know that this is violated, because the perfect doctor specifically chose surgery or chemo based on the patients' potential outcomes. Specifically, they received surgery if Y^1 > Y^0 and chemo if Y^1 < Y^0. Thus in our case, the perfect doctor ensured that D depended on Y^1 and Y^0.

But what if he hadn't done that? What if he had chosen surgery in such a way that it did not depend on Y^1 or Y^0? What might that look like? For instance, maybe he alphabetized them by last name, and the first five received surgery and the last five received chemotherapy. Or maybe he used the second hand on his watch to assign treatment: if it was between 1 and 30 seconds, he gave them surgery, and if it was between 31 and 60 seconds, he gave them chemotherapy.[61] In other words, let's say that he chose some method for assigning treatment that did not depend on the values of the potential outcomes under either state of the world.

[61] In Craig [2006], a poker-playing banker used his watch as a random number generator to randomly bluff in certain situations.

What would that mean in this context? Well, it would mean that

    E[Y^1 | D = 1] - E[Y^1 | D = 0] = 0
    E[Y^0 | D = 1] - E[Y^0 | D = 0] = 0

Or in words, it would mean that the mean potential outcome for Y^1 or Y^0 is the same (in the population) for either the surgery group or the chemotherapy group. This kind of randomization of the treatment assignment would eliminate both the selection bias and the heterogeneous treatment effect bias. Let's take them in order. The selection bias zeroes out as follows:

    E[Y^0 | D = 1] - E[Y^0 | D = 0] = 0

and thus the SDO no longer suffers from selection bias. How does randomization affect the heterogeneous treatment effect bias from the third line? Rewrite the definitions of ATT and ATU:

    ATT = E[Y^1 | D = 1] - E[Y^0 | D = 1]
    ATU = E[Y^1 | D = 0] - E[Y^0 | D = 0]

Rewrite the third-row bias (setting aside the (1 - p) factor):

    ATT - ATU = E[Y^1 | D = 1] - E[Y^0 | D = 1]
                - E[Y^1 | D = 0] + E[Y^0 | D = 0]
              = 0

because under independence E[Y^1 | D = 1] = E[Y^1 | D = 0] and E[Y^0 | D = 1] = E[Y^0 | D = 0]. If treatment is independent of potential outcomes, then:

    (1/N_T) Σ (y_i | d_i = 1) - (1/N_C) Σ (y_i | d_i = 0) = E[Y^1] - E[Y^0]
    SDO = ATE

What's necessary in this situation is simply (a) data on observable outcomes, (b) data on treatment assignment, and (c) (Y^1, Y^0) ⊥ D. We call (c) the independence assumption. To illustrate that this would lead the SDO to equal the ATE, we will use the following Monte Carlo simulation. Note that the ATE in this example is equal to 0.6.

    clear all
    program define gap, rclass
    version 14.2
    syntax [, obs(integer 1) mu(real 0) sigma(real 1) ]
    clear
    drop _all
    set obs 10
    gen y1 = 7 in 1
    replace y1 = 5 in 2
    replace y1 = 5 in 3
    replace y1 = 7 in 4

    replace y1 = 4 in 5
    replace y1 = 10 in 6
    replace y1 = 1 in 7
    replace y1 = 5 in 8
    replace y1 = 3 in 9
    replace y1 = 9 in 10
    gen y0 = 1 in 1
    replace y0 = 6 in 2
    replace y0 = 1 in 3
    replace y0 = 8 in 4
    replace y0 = 2 in 5
    replace y0 = 1 in 6
    replace y0 = 10 in 7
    replace y0 = 6 in 8
    replace y0 = 7 in 9
    replace y0 = 8 in 10
    drawnorm random
    sort random
    gen d = 1 in 1/5
    replace d = 0 in 6/10
    gen y = d*y1 + (1-d)*y0
    egen sy1 = mean(y) if d==1
    egen sy0 = mean(y) if d==0
    collapse (mean) sy1 sy0
    gen sdo = sy1 - sy0
    keep sdo
    summarize sdo
    gen mean = r(mean)
    end

    simulate mean, reps(10000): gap
    su _sim_1

This Monte Carlo runs 10,000 times, each time calculating the average SDO under independence, which is ensured by the random-number sorting that occurs. In my running of this program, the ATE is 0.6 and the SDO is on average equal to 0.59088.[62]

[62] Because it's not seeded, when you run it, your answer will be close but slightly different due to the randomness of the sample drawn.

Before we move on from the SDO, let's re-emphasize something that is often lost on students first learning the independence concept and notation. Independence does not imply that E[Y^1 | D = 1] - E[Y^0 | D = 0] = 0. Nor does it imply that E[Y^1 | D = 1] - E[Y^0 | D = 1] = 0. Rather, it implies

    E[Y^1 | D = 1] - E[Y^1 | D = 0] = 0

in a large population. That is, independence implies that the two

groups of units, surgery and chemo, have the same potential outcomes on average in the population.

How realistic is independence in observational data? Economics, maybe more than any other science, tells us that independence is unlikely to hold observationally. Economic actors are always attempting to achieve some optimum. For instance, parents are putting kids in what they perceive to be the best school for them, and that perception is based on the potential outcomes. In other words, people are choosing their interventions, and most likely that decision is related to the potential outcomes, which makes simple comparisons improper. Rational choice is always pushing against the independence assumption, and therefore a simple comparison of means will not approximate the true causal effect. We need unit randomization for simple comparisons to tell us anything meaningful.

One last thing. Rubin argues that there is a bundle of assumptions behind this kind of calculation, and he calls these assumptions the stable unit treatment value assumption, or SUTVA for short. That's a mouthful, but here's what it means. It means that we are assuming the unit-level treatment effect (the "treatment value") is fixed over the entire population, which means that the assignment of the treatment to one unit cannot affect the treatment effect or the potential outcomes of another unit. First, this implies that the treatment is received in homogeneous doses by all units. It's easy to imagine violations of this, though; for instance, some doctors are better surgeons than others. In that case, we just need to be careful about what we are and are not defining as the treatment. Second, this implies that there are no externalities, because by definition an externality spills over to other, untreated units.
In other words, if unit 1 receives the treatment and there is some externality, then unit 2 will have a different Y^0 value than she would have had if unit 1 had not received the treatment. We are assuming away this kind of spillover.

Related to that is the issue of general equilibrium. Let's say we are estimating the causal effect of returns to schooling. An increase in college education would, in general equilibrium, cause a change in relative wages that is different from what happens under partial equilibrium. This kind of scaling-up issue is a common concern when one considers extrapolating from an experimental design to the large-scale implementation of an intervention in some population.

STAR Experiment

Now I'd like to discuss a large-scale randomized experiment to help explain some of these abstract concepts. Krueger [1999] analyzed a 1980s randomized experiment in Tennessee called

the Student/Teacher Achievement Ratio (STAR). This was a statewide randomized experiment that measured the average treatment effect of class size on student achievement. There were two arms to the treatment: a small class of 13-17 students, and a regular-sized classroom of 22-25 students with a full-time teacher's aide. The control group was a regular-sized classroom of 22-25 students with no aide. Approximately 11,600 students and their teachers were randomly assigned to one of the three groups. After the assignment, the design called for the students to remain in the same class type for four years (K-3). Randomization occurred within schools at the kindergarten level.

For this section, we will use Krueger's data and attempt to replicate his results as closely as possible. Type in (ignoring the period):

    . clear
    . scuse star_sw

Note that insofar as it was truly a randomized experiment, the average potential outcomes for students in a small class will be the same as the average potential outcomes for each of the other treatment arms. As such, we can simply calculate the mean outcomes for each group and compare them to determine the average treatment effect of a small class size. Nonetheless, it is useful to analyze experimental data with regression analysis because in this instance the randomization was conditional on the school itself.

Assume for the sake of argument that the treatment effects are constant. This implies two things: first of all, it implies that δ_i = Y_i^1 - Y_i^0 = δ for all i. And second, it implies that ATE = ATT = ATU. Thus the simple difference in outcomes, SDO, is equal to ATE plus selection bias, because the heterogeneous treatment effect bias zeroes out.
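That claim, SDO = ATE + selection bias when effects are constant, is easy to demonstrate numerically. Here is a minimal Python sketch (the book's own simulations use Stata, and the data here are simulated for illustration, not the STAR data): baseline scores are heterogeneous, the treatment adds a constant δ = 2, and assignment is deliberately non-random.

```python
import random

random.seed(1)
n = 10_000
y0 = [random.gauss(50, 10) for _ in range(n)]   # heterogeneous baseline scores
delta = 2.0
y1 = [y + delta for y in y0]                    # constant treatment effect

# Deliberately non-random assignment: higher-baseline units get treated
d = [1 if y > 50 else 0 for y in y0]

def mean(xs):
    return sum(xs) / len(xs)

treated_y1 = [y1[i] for i in range(n) if d[i] == 1]
treated_y0 = [y0[i] for i in range(n) if d[i] == 1]
control_y0 = [y0[i] for i in range(n) if d[i] == 0]

sdo = mean(treated_y1) - mean(control_y0)
selection_bias = mean(treated_y0) - mean(control_y0)

# With a constant effect, ATT = ATU = ATE = delta, so SDO = ATE + selection bias exactly
assert abs(sdo - (delta + selection_bias)) < 1e-9
```

The heterogeneity bias is identically zero here, so the entire gap between the SDO and the true effect of 2 is selection bias.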
Let’s write out the regression equation by first writing out the switching equation: 1 0 Y Y = + (1 D Y ) D i i i i i 0 Y we get Distributing the i 0 0 1 Y ) + D ( Y = Y Y i i i i i which is equal to 0 D = Y Y + d i i i given the definition of the treatment effect from earlier. Now add 0 0 Y to the right-hand side and rearrange the terms to get ] E E [ Y [ ]=0 i i 0 0 0 E [ Y Y ]+ d D Y + Y ] [ = E i i i i i and then rewrite as the following regression equation Y u = a + d D + i i i

where u_i is the random part of Y_i^0. This is a regression equation we could use to estimate the average treatment effect of D on Y.

We will be evaluating experimental data, and so we could just compare the treatment group to the control group. But we may want to add additional controls in a multivariate regression model for a couple of reasons. The multivariate regression model would be

    Y_i = α + δ D_i + γ X_i + u_i

where X_i is a matrix of unit-specific predetermined covariates unaffected by D. There are two main reasons for including additional controls in the experimental regression model.

1. Conditional random assignment. Sometimes randomization is done conditional on some observable. In this example, that's the school, as they randomized within a school. We will discuss the "conditional independence assumption" later when we cover matching.

2. Additional controls increase precision. Although control variables X_i are uncorrelated with D_i, they may have substantial explanatory power for Y_i. Therefore including them reduces variance in the residuals, which lowers the standard errors of the regression estimates.

Krueger estimates the following econometric model:

    Y_ics = β_0 + β_1 SMALL_cs + β_2 REG/A_cs + α_s + ε_ics    (78)

where i indexes a student, c a class, and s a school; Y is a student's percentile score, SMALL is a dummy equaling 1 if the student was in a small class, REG/A is a dummy equaling 1 if the student was assigned to a regular class with an aide, and α_s is a school fixed effect. A school fixed effect is simply a dummy variable for each school, and controlling for it means that the variance in SMALL and REG/A being used is within each school. He did this because the STAR program was randomized within a school; treatment was conditionally independent of the potential outcomes.
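Mechanically, a school fixed effect is just a full set of school dummies in the design matrix. The following Python sketch shows the idea on simulated data (not the STAR data; the sample sizes, school-effect variance, and a hypothetical +5-point small-class effect are all made up for illustration). Treatment is randomized within each school, and the regression includes one dummy per school:

```python
import numpy as np

rng = np.random.default_rng(42)
n_schools, per_school = 20, 60
school = np.repeat(np.arange(n_schools), per_school)
n = n_schools * per_school

# Within each school, 20 students go to small classes and 20 to aide classes
small = np.zeros(n)
aide = np.zeros(n)
for s in range(n_schools):
    idx = rng.permutation(np.where(school == s)[0])
    small[idx[:20]] = 1
    aide[idx[20:40]] = 1

# Hypothetical truth: +5 percentile points for small classes, 0 for the aide
school_effect = rng.normal(0, 10, n_schools)[school]
y = 50 + 5 * small + 0 * aide + school_effect + rng.normal(0, 15, n)

# OLS with school dummies (the fixed effects alpha_s)
X = np.column_stack([small, aide,
                     (school[:, None] == np.arange(n_schools)).astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b_small, b_aide = beta[0], beta[1]
```

Because assignment is random within schools, b_small should land near the assumed 5 and b_aide near 0, up to sampling error.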
First, I will produce Krueger's actual estimates; then I will produce similar regression output using the star_sw.dta file we downloaded. Figure 8 shows estimates from this equation using least squares. The first column is a simple regression of the percentile score onto the two treatment dummies with no controls. The effect of a small class is 4.82, which means that the kids in a small class moved 4.82 points up the distribution at the end of the year. You can see, by dividing the coefficient by the standard error, that this is significant at the 5% level. Notice there is no effect of a regular-sized

classroom with an aide on performance, though. The coefficient is both close to zero and has large standard errors. The R² is very small as well: only 1% of the variation in percentile score is explained by this treatment variable.

Columns 2-4 add in additional controls. First, Krueger controls for school fixed effects, which soak up a lot of the variation in Y, evidenced by the larger R². It's interesting that the R² has increased but the coefficient estimates have not really changed. This is evidence of randomization. The coefficient on small class size is considerably more precise, but not materially different from what was shown in column 1. This coefficient is also relatively stable when additional student demographics (column 3) and teacher characteristics (column 4) are controlled for. This is a reassuring table in many respects, because it is showing that E[δ | X] = E[δ], which is an extension of the independence assumption. That suggests that D is assigned independent of the potential outcomes, because δ is a function of Y^1 and Y^0.

[Figure 8: Regression of kindergarten percentile scores onto treatments [Krueger, 1999]. The figure reproduces Krueger's Table V: OLS and reduced-form estimates of the effect of class-size assignment on average Stanford Achievement Test percentile scores, with panels for kindergarten and first grade.]

Next, Krueger [1999] estimated the effect of the treatments on the first grade outcomes. Note that this is the same group of students from the kindergarten sample, just aged one year. Figure 9 shows the results from this regression. Here we find coefficients that are about 40% larger than the ones we found for kindergarteners. Also, the coefficient on the regular-sized classroom with an aide, while smaller, is no longer equal to zero or imprecise. Again, this coefficient is statistically stable across all specifications.

[Figure 9: Regression of first grade percentile scores onto treatments [Krueger, 1999].]

A common problem in randomized experiments involving human beings, which does not plague randomized experiments involving non-humans, is attrition. That is, what if people leave the experiment? If attrition is random, then attrition affects the treatment and control groups in the same way (on average), and our estimates of the average treatment effect remain unbiased. But in this application, involving schooling, attrition may be non-random. For instance, especially good students placed in large

classes may leave the public school for a private school under the belief that large class sizes will harm their child's performance. Thus the remaining students will be people for whom Y^0 is lower, giving the impression that the intervention was more effective than it actually was. Krueger [1999] addresses this concern by imputing test scores from earlier test scores for all children who leave the sample, and then re-estimating the model including students with imputed test scores.

[Figure 10: Regression of first grade percentile scores onto treatments for K-3 with imputed test scores for all post-kindergarten ages [Krueger, 1999]. The figure reproduces Krueger's Table VI, "Exploration of effect of attrition" (dependent variable: average SAT percentile score), showing the coefficient on the small class dummy by grade:]

                 Actual test data          Actual and imputed test data
   Grade       Coef. (s.e.)      n           Coef. (s.e.)       n
   K            5.32 (.76)    5,900           5.32 (.76)     5,900
   1            6.95 (.74)    6,632           6.30 (.68)     8,328
   2            5.59 (.76)    6,282           5.64 (.65)     9,773
   3            5.58 (.79)    6,339           5.49 (.63)    10,919

Each regression is a reduced-form model including a dummy for initial assignment to a small class, a dummy for initial assignment to a regular/aide class, unrestricted school effects, and dummies for student gender and race. The reported small class coefficient is relative to regular classes; standard errors are in parentheses.

As you can see in Figure 10, there is nothing in the analysis that suggests bias has crept in because of attrition.
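Krueger's imputation strategy is essentially the "last-observation-carry-forward" method used in clinical studies: when a student exits the sample, assign the most recent observed test percentile to the missing years. Here is a minimal Python sketch with hypothetical scores (Krueger's actual procedure is somewhat more careful, e.g., averaging the surrounding years for a student who left and later returned; the book's own code is in Stata):

```python
# Hypothetical K-3 test percentiles; None marks years a student was absent
# from the sample (attrition or a missed test).
scores = {
    "student_a": [45.0, 50.0, None, None],   # left after first grade
    "student_b": [60.0, None, 58.0, None],   # missed grade 1, left after 2
    "student_c": [70.0, 72.0, 75.0, 74.0],   # observed in all four grades
}

def carry_forward(row):
    """Fill each missing year with the most recent observed percentile."""
    filled, last = [], None
    for score in row:
        if score is not None:
            last = score
        filled.append(last)
    return filled

imputed = {name: carry_forward(row) for name, row in scores.items()}
print(imputed["student_a"])   # [45.0, 50.0, 50.0, 50.0]
```

Re-estimating the model on the filled-in panel then uses all students, not just the non-attriters.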
Whereas attrition is a problem of units leaving the experiment altogether, there is also the problem of students switching between treatment arms. A contemporary example of this was the antiretroviral treatment experiments for HIV in the 1980s. These experiments were often contaminated by the fact that secondary markets formed for the experimental treatments, in which members of the control group purchased the treatment. Given the high stakes of life and death associated with HIV at the time, this is understandable, but scientifically, this switching of treatment status contaminated the experiment, making estimates of the ATE biased. Krueger writes that in this educational context, "It is virtually impossible to prevent some students from switching between class types over time" [Krueger, 1999, p. 506].

To illustrate this problem of switching, Krueger created a helpful

transition matrix, which is shown in Figure 11. If students in second grade had remained in their first grade classes, then the off-diagonal elements of this transition matrix would be zero. But because they are not zero, there was some switching. Of the 1,482 first graders assigned to small classrooms, 1,435 remained in small classes, and 23 and 24 switched into the other two kinds. If students with stronger expected academic potential were more likely to move into small classes, then these transitions would bias the simple comparison of outcomes upwards, making small class sizes appear more effective than they really are.

[Figure 11: Switching of students into and out of the treatment arms between first and second grade [Krueger, 1999]. The figure reproduces Krueger's Table IV, transition matrices of class-size type (small, regular, regular/aide) between adjacent grades: kindergarten to first, first to second, and second to third.]

What can be done to address switching under random assignment? Well, one thing that could have been done is to make switching very difficult. One of the things that characterizes the modern randomized experiment is designing the experiment in such a way that makes switching very hard.
But this may be practically infeasible, so a second best solution is to regress the student's outcome against the original randomized kindergarten class-size assignment, as opposed to the actual class size – a kind of reduced form instrumental variables approach. (We will discuss this in more detail when we cover instrumental variables (IV), but it is necessary to cover bits of IV now since this is a common second best solution in a randomized experiment when switching occurs.) If a student had been randomly assigned to a small class, but switched to a regular class in the first grade, we would regress scores on the original assignment, since the original assignment satisfies the independence assumption. So long as this original assignment is highly correlated with the first through third grade class size (even if not perfectly correlated), then this regression is informative of the effect of class size on test scores. This is what makes the original assignment a good instrumental variable – it is highly correlated with subsequent class size, even with switching.
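To see why regressing on the original assignment helps, here is a small simulation sketch in Python (not from the book, which works in Stata; all names and numbers are made up for illustration). Students are randomly assigned, but some high-ability students in regular classes switch into small classes. The naive comparison of actual class attended picks up that selection; the reduced form comparison by initial assignment does not, at the cost of some attenuation:

```python
import random

random.seed(42)
n = 20_000
effect = 5.0                               # true effect of a small class

rows = []
for _ in range(n):
    z = random.randint(0, 1)               # initial random assignment (1 = small)
    ability = random.gauss(0, 10)          # unobserved student ability
    # Non-random switching: high-ability students assigned to regular
    # classes move into small classes.
    d = 1 if (z == 1 or ability > 10) else 0
    y = 50 + effect * d + ability + random.gauss(0, 5)
    rows.append((z, d, y))

def diff_in_means(rows, key):
    """OLS slope on a binary regressor equals the difference in means."""
    t = [y for z, d, y in rows if key(z, d) == 1]
    c = [y for z, d, y in rows if key(z, d) == 0]
    return sum(t) / len(t) - sum(c) / len(c)

naive = diff_in_means(rows, lambda z, d: d)         # actual class attended
reduced_form = diff_in_means(rows, lambda z, d: z)  # original assignment
```

The naive estimate lands well above the true effect of 5 because the switchers are positively selected; the reduced form estimate stays close to 5, slightly attenuated because some of the "control" group actually received the treatment.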
In this approach, kindergarten is the same for both the OLS and the reduced form IV approach, because the randomized assignment and the actual classroom enrollment are the same in kindergarten. But from grade 1 onwards, OLS and reduced form IV differ because of the switching. Figure 12 shows eight regressions – four per approach, where each four is like the ones shown in previous figures. Briefly, just notice that while the two regressions yield different coefficients, their

magnitudes and precision are fairly similar. The reduced form IV approach yields coefficients that are about 1 percentage point lower on average than what he got with OLS.

[Figure 12: IV reduced form approach compared to the OLS approach [Krueger, 1999].]

Some other problems are worth mentioning when it comes to randomized experiments. First, there could be heterogeneous treatment effects. In other words, perhaps the treatment effect δ_i differs across students i. If this is the case, then ATT ≠ ATU, though in large enough samples, and under the independence assumption, this difference should be negligible.

Now we do our own analysis. Go back into Stata and type:

. reg tscorek sck rak

Krueger standardized the test scores into percentiles, but I will keep the data in its raw form for simplicity. This means the results will be dissimilar to what is shown in his paper:

Table 13: Krueger regressions [Krueger, 1999].
Dependent variable: Total kindergarten score (unscaled)

   Small class           13.90
                         (2.41)
   Regular/aide class     0.314
                         (2.310)

In conclusion, we have done a few things in this chapter. We've introduced the Rubin causal model by introducing its powerful potential outcomes notation. We showed that the simple difference in mean outcomes was equal to the sum of the average treatment effect, a term called the selection bias, and a term called the weighted heterogeneous treatment effect bias. Thus the simple difference in mean outcomes estimator is biased unless those second and third terms zero out. One situation in which they zero out is under independence of the treatment, which is when the treatment has been assigned independently of the potential outcomes. When does independence occur? The most commonly confronted situation where it would occur is under physical randomization of the treatment to the units. Because physical randomization assigns the treatment for reasons that are independent of the potential outcomes, the selection bias zeroes out, as does the heterogeneous treatment effect bias. We now move to discuss a second situation where the two terms zero out: conditional independence.


Matching and subclassification

"I'm Slim Shady, yes I'm the real Shady
All you other Slim Shadys are just imitating
So won't the real Slim Shady, please stand up,
Please stand up, Please stand up"
– Eminem

One of the main things I wanted us to learn from the chapter on directed acyclical graphical models is the idea of the backdoor criterion. Specifically, if there exists a conditioning strategy that will satisfy the backdoor criterion, then you can use that strategy to identify your causal effect of interest. We now discuss three different kinds of conditioning strategies: subclassification, exact matching, and approximate matching. I will discuss each of them now.

Subclassification

Subclassification is a method of satisfying the backdoor criterion by weighting differences in means by strata-specific weights. These strata-specific weights will, in turn, adjust the differences in means so that their distribution by strata is the same as that of the counterfactual's strata. This method implicitly achieves distributional balance between the treatment and control in terms of that known, observable confounder. This method was created by statisticians like Cochran [1968] when trying to analyze the causal effect of smoking on lung cancer, and while the methods today have moved beyond it, we include it because some of the techniques implicit in subclassification are present throughout the rest of the book. (To my knowledge, Cochran [1968] is the seminal paper on subclassification.)

One of the concepts that will thread through this chapter is the concept of the conditional independence assumption, or CIA. In the previous example with the STAR test, we said that the experimenters had assigned the small classes to students conditionally randomly. That is, conditional on a given school, a_s, the experimenters randomly assigned the treatment across teachers and students. This

technically meant that treatment was independent of potential outcomes for any given school. This is a kind of independence assumption, but one that must incorporate the conditional element to the independent assignment of the treatment. The assumption is written as

(Y^1, Y^0) ⊥ D | X

where ⊥ is the notation for statistical independence and X is the variable we are conditioning on. What this means is that the expected values of Y^1 and Y^0 are equal for the treatment and control group for each value of X. Written out, this means:

E[Y^1 | D = 1, X] = E[Y^1 | D = 0, X]
E[Y^0 | D = 1, X] = E[Y^0 | D = 0, X]

Put into words, the expected value of each potential outcome is equal for the treatment group and the control group, once we condition on some X variable. If CIA can be credibly assumed, then it necessarily means you have selected a conditioning strategy that satisfies the backdoor criterion. They are equivalent concepts as far as we are concerned. An example of this would mean that for the ninth school in our sample, a_s = 9, the expected potential outcomes are the same for small and large classes, and so on.

When treatment is conditional on observable variables, such that the CIA is satisfied, we say that the situation is one of selection on observables. If one does not directly address the problem of selection on observables, estimates of the treatment effect will be biased. But this is remedied if the observable variable is conditioned on. The X variable can be thought of as an n × k matrix of covariates which satisfy the CIA as a whole.

I always find it helpful to understand as much history of thought behind econometric estimators as possible, so let's do that here. One of the public health problems of the mid to late 20th century was the problem of rising lung cancer. From 1860 to 1950, the incidence of lung cancer in cadavers grew from 0% of all autopsies to as high as 7% (Figure 13).
The incidence appeared to be growing at an increasing rate. The mortality rate per 100,000 from cancer of the lungs in males reached 80-100 per 100,000 by 1980 in the UK, Canada, England and Wales. Several studies showed that the odds of lung cancer were directly related to the amount of cigarettes the person smoked

[Figure 13: Lung cancer at autopsy trends. Reproduced from Hill, Millar, and Connelly: their Figure 1 plots the combined results from 18 lung cancer studies at autopsy (percent of autopsies, 1860-1950), and their Figures 2(a) and 2(b) plot age-standardized mortality from cancer of the lung per 100,000 in males and females for Canada and England & Wales over 1910-1990.]

a day. Figure 14 shows that the relationship between daily smoking and lung cancer in males was monotonic in the number of cigarettes the male smoked per day.

Smoking, it came to be believed, was causing lung cancer. But some statisticians believed that scientists couldn't draw that conclusion, because it was possible that smoking was not independent of health outcomes. Specifically, perhaps the people who smoked cigarettes differed from one another in ways that were directly related to the incidence of lung cancer.

This was a classic "correlation does not necessarily mean causation" kind of problem. Smoking was clearly correlated with lung cancer, but does that necessarily mean that smoking caused lung cancer? Thinking about the simple difference in means notation, we know that a comparison of smokers and non-smokers will be biased in observational data if the independence assumption does not hold. And because smoking is endogenous, meaning people choose to smoke, it's entirely possible that smokers differed from the non-smokers in ways that were directly related to the incidence of lung cancer.

Criticisms at the time came from such prominent statisticians as Joseph Berkson, Jerzy Neyman and Ronald Fisher. Their reasons were as follows. First, it was believed that the correlation was spurious because of a biased selection of subjects. The samples were non-random, in other words. Functional form complaints were also common. This had to do with people's use of risk ratios and odds ratios. The association, they argued, was sensitive to those kinds of functional form choices. Probably most damning, though, was the hypothesis that there existed an unobservable genetic element that both caused people to smoke and which, independently, caused people to develop lung cancer [Pearl, 2009]. This confounder meant that smokers and non-smokers differed from one another in ways that were directly related to their potential outcomes, and thus independence did not hold.

Other studies showed that cigarette smokers and non-smokers differed on observables. For instance, smokers were more extroverted than non-smokers, and differed in age, income, education, and so on. Other criticisms included that the magnitudes relating smoking and lung cancer were considered implausibly large. And again, there was the ever present criticism of observational studies: there did not exist any experimental evidence that could incriminate smoking as a cause of lung cancer.

But think about the hurdle that that last criticism actually creates. Imagine the hypothetical experiment: a large sample of people, with diverse potential outcomes, are assigned to a treatment group (smoker) and control (non-smoker). These people must be dosed

[Figure 14: Smoking and Lung Cancer. Reproduced from Hill, Millar, and Connelly: their Figure 4 summarizes odds ratios from 24 case-control studies of smoking and lung cancer in males and females, by amount smoked, and their Figure 5 shows combined mortality ratios for cancer of the lung in males from four cohort studies, by level of cigarette smoking. In both, the risk rises with the amount smoked.]

with their corresponding treatments long enough for us to observe lung cancer develop – so presumably years of heavy smoking. How could anyone ever run an experiment like that? To describe it is to confront the impossibility of running such a randomized experiment. But how then do we answer the causal question without independence (i.e., randomization)?

It is too easy for us to criticize Fisher and company for their stance on smoking as a causal link to lung cancer, because that causal link is now universally accepted as scientific fact. But remember, it was not always. And the correlation/causation point is a damning one. Fisher's arguments, it turns out, were based on sound science. But we now know that in fact the epidemiologists were right. Hooke [1983] wrote: "the [epidemiologists] turned out to be right, but only because bad logic does not necessarily lead to wrong conclusions." (Yet, it is probably not a coincidence that Ronald Fisher, the harshest critic of the epidemiological theory that smoking caused lung cancer, was himself a chain smoker. When he died of lung cancer, he was the highest paid expert witness for the tobacco industry in history.)

To motivate what we're doing in subclassification, let's work with Cochran [1968], a study trying to address strange patterns in smoking data by adjusting for a confounder. (I first learned of this paper from Alberto Abadie in a lecture he gave at the Northwestern Causal Inference workshop.) Cochran lays out mortality rates by country and smoking type (Table 14).

Table 14: Death rates per 1,000 person-years [Cochran, 1968].

   Smoking group     Canada   British   US
   Non-smokers         20.2     11.3   13.5
   Cigarettes          20.5     14.1   13.5
   Cigars/pipes        35.5     20.7   17.4

As you can see, the highest death rate among Canadians is for cigar and pipe smokers, which is considerably higher than that of non-smokers or cigarette smokers.
Similar patterns show up in the other two countries, though smaller in magnitude than what we see in Canada.

This table suggests that pipe and cigar smoking are more dangerous than cigarette smoking, which, to a modern reader, sounds ridiculous. The reason it sounds ridiculous is that cigar and pipe smokers often do not inhale, and therefore less tar accumulates in the lungs than with cigarettes. And insofar as it's the tar that causes lung cancer, it stands to reason that we should see higher mortality rates among cigarette smokers. But recall the independence assumption. Do we really believe that:

E[Y^1 | Cigarette] = E[Y^1 | Pipe] = E[Y^1 | Cigar]
E[Y^0 | Cigarette] = E[Y^0 | Pipe] = E[Y^0 | Cigar]

Is it the case that factors related to these three states of the world are truly independent of the factors that determine death rates? One

way to check this is to see if the three groups are balanced on pretreatment covariates. If the means of the covariates are the same for each group, then we say those covariates are balanced and the two groups are exchangeable with respect to those covariates.

One variable that appears to matter is the age of the person. Older people were more likely at this time to smoke cigars and pipes, and, without stating the obvious, older people were more likely to die. In Table 15, we can see the mean ages of the different groups.

Table 15: Mean ages, years [Cochran, 1968].

   Smoking group     Canada   British   US
   Non-smokers         54.9     49.1   57.0
   Cigarettes          50.5     49.8   53.2
   Cigars/pipes        65.9     55.7   59.7

The high means for cigar and pipe smokers are probably not terribly surprising to most of you. Cigar and pipe smokers are typically older than cigarette smokers, or at least were in 1968 when this was written. And since older people die at a higher rate (for reasons other than just smoking cigars), maybe the higher death rate for cigar smokers is because they're older on average. Furthermore, maybe by the same logic the reason that cigarette smoking has such a low mortality rate is that cigarette smokers are younger on average. Note, using DAG notation, this simply means that we have the following DAG:

D → Y
D ← A → Y

where D is smoking, Y is mortality, and A is the age of the smoker. Insofar as CIA is violated, we have a backdoor path that is open, which also means in the traditional pedagogy that we have omitted variable bias. But however we want to describe it, the common thing it will mean is that the distribution of age for each group will be different – which is what I mean by covariate imbalance. (This issue of covariate balance runs throughout nearly every identification strategy that we will discuss, in some way or another.)

My first strategy for addressing this problem of covariate imbalance is to condition on the key variable, which in turn will balance the treatment and control groups with respect to this variable. So how do we exactly close this backdoor path using subclassification? We calculate the mortality rate for some treatment group (cigarette smokers) by some strata (here, that is age). Next, we weight the mortality rate for the treatment group by a strata (age)-specific weight that corresponds to the control group. This gives us the age-adjusted mortality rate for the treatment group. Let's explain

with an example by looking at Table 16. Assume that age is the only relevant confounder between cigarette smoking and mortality. Our first step is to divide age into strata: say 20-40, 41-70, and 71 and older.

Table 16: Subclassification example.

                Death rates   Number of            Number of
                              cigarette-smokers    pipe/cigar-smokers
   Age 20-40         20             65                   10
   Age 41-70         40             25                   25
   Age 71+           60             10                   65
   Total                           100                  100

What is the average death rate for cigarette smokers without subclassification? It is the weighted average of the mortality rate column, where each weight is equal to N_t/N, with N_t and N the number of people in each age group and the total number of people, respectively. Here that would be

20 × (65/100) + 40 × (25/100) + 60 × (10/100) = 29

That is, the mortality rate of cigarette smokers in the population is 29 per 100,000.

But notice that the age distribution of cigarette smokers is the exact opposite (by construction) of that of pipe and cigar smokers. Thus the age distribution is imbalanced. Subclassification simply adjusts the mortality rate for cigarette smokers so that it has the same age distribution as the comparison group. In other words, we would multiply each age-specific mortality rate by the proportion of individuals in that age stratum for the comparison group. That would be

20 × (10/100) + 40 × (25/100) + 60 × (65/100) = 51

That is, when we adjust for the age distribution, the age-adjusted mortality rate for cigarette smokers (were they to have the same age distribution as pipe and cigar smokers) would be 51 per 100,000 – almost twice as large as the simple naive calculation unadjusted for the age confounder.

Cochran uses a version of this subclassification method in his paper and recalculates the mortality rates for the three countries and the three smoking groups. See Table 17. As can be seen, once we adjust for the age distribution, cigarette smokers have the highest death rates among any group.
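The arithmetic above is easy to verify by hand, but as a quick sanity check, here is a minimal Python sketch of the weighting step using the hypothetical numbers from Table 16 (the book's own examples use Stata):

```python
# Hypothetical strata from Table 16: age-specific death rates for cigarette
# smokers, and the age distribution of each smoking group.
death_rates = [20, 40, 60]            # ages 20-40, 41-70, 71+
cig_counts  = [65, 25, 10]            # cigarette smokers per age stratum
pipe_counts = [10, 25, 65]            # pipe/cigar smokers per age stratum

def weighted_rate(rates, counts):
    """Weight stratum-specific rates by a group's age distribution."""
    n = sum(counts)
    return sum(r * c / n for r, c in zip(rates, counts))

unadjusted = weighted_rate(death_rates, cig_counts)     # own age distribution
age_adjusted = weighted_rate(death_rates, pipe_counts)  # comparison group's

print(unadjusted)    # 29.0
print(age_adjusted)  # 51.0
```

Swapping in the comparison group's age distribution as the weights is the entire trick: the stratum-specific rates never change, only how they are averaged.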
Which variables should be used for adjustments? This kind of adjustment raises a question: which variable should we use for adjustment? First, recall what we've emphasized repeatedly. Both the backdoor criterion and CIA tell us precisely what we need to do. We

need to choose a variable that, once we condition on it, closes all backdoor paths, so that the CIA is met. We call such a variable the covariate. A covariate is usually a random variable assigned to the individual units prior to treatment. It is predetermined and therefore exogenous. It is not a collider, nor is it endogenous to the outcome itself (i.e., no conditioning on the dependent variable). A variable X is exogenous with respect to D if the value of X does not depend on the value of D. Oftentimes, though not always and not necessarily, this variable will be time invariant, such as race.

Table 17: Adjusted mortality rates using age groups [Cochran, 1968].

   Smoking group     Canada    UK     US
   Non-smokers         20.2   11.3   13.5
   Cigarettes          29.5   14.8   21.2
   Cigars/pipes        19.8   11.0   13.7

Why shouldn't we include in our adjustment some descendent of the outcome variable itself? We saw this problem in our first collider example from the DAG chapter. Conditioning on a variable that is a descendent of the outcome variable can introduce collider bias, and it is not easy to know ex ante just what kind of bias this will introduce. Thus, when trying to adjust for a confounder using subclassification, let the DAG help you choose which variables to include in the conditioning strategy. Your goal ultimately is to satisfy the backdoor criterion, and if you do, then the CIA will hold in your data.

Identifying assumptions Let me now formalize what we've learned. In order to estimate a causal effect when there is a confounder, we need (1) CIA and (2) the probability of treatment to be between 0 and 1 for each stratum. More formally,

1. (Y^1, Y^0) ⊥ D | X (conditional independence)
0 < Pr ( D =1 | X ) < 1 with probability one (common support) These two assumptions yield the following identity 1 0 1 0 Y Y | X ]= E [ Y E Y | X , D [ = 1] 1 0 Y = | X , D = 1] E E [ Y [ | X , D = 0] = E [ Y | X , D = 1] E [ Y | X , D = 0] where each value of is determined by the switching equation. Y Given common support, we get the following estimator: Z [ X = ) ( E [ Y | X , d = 1] E [ Y | X , D = 0] dPr D ) ( ATE

These assumptions are necessary to identify the ATE, but fewer assumptions are needed to identify the ATT. They are that $D$ is conditionally independent of $Y^0$, and that there exist some units in the control group for each treatment stratum. Note, the reason for the common support assumption is that we are weighting the data; without common support, we cannot calculate the relevant weights.

Subclassification exercise: Titanic dataset. For what we are going to do next, I find it useful to move into actual data. We will use a dataset to conduct subclassification which I hope you find interesting. As everyone knows, the Titanic ocean cruiser hit an iceberg and sank on its maiden voyage. A little over 700 passengers and crew survived out of the 2200 total. It was a horrible disaster. Say that we are curious as to whether being seated in first class made you more likely to survive. To answer this, as always, we need two things: data and assumptions.

. scuse titanic, clear

Our question as to whether first-class seating increased the probability of survival is confounded by the oceanic norms during disasters. Women and children should be escorted to the lifeboats before the men in the event of a disaster requiring exiting the ship. If more women and children were in first class, then maybe first class is simply picking up the effect of that social norm, rather than the effect of class and wealth on survival. Perhaps a DAG might help us here, as a DAG can help us outline the sufficient conditions for identifying the causal effect of first class on survival.

[DAG: first-class seating $D$ and survival $Y$, with gender $W$ and child status $C$: $D \rightarrow Y$; $D \leftarrow W \rightarrow Y$; $D \leftarrow C \rightarrow Y$.]

Now before we commence, let's review what this DAG means. It says that being a female made you more likely to be in first class, but also made you more likely to survive because lifeboats were more likely to be allocated to women. Furthermore, being a child made you more likely to be in first class and made you more likely to survive.
Finally, there are no other confounders, observed or unobserved. [Footnote 68: I'm sure you can think of others, though, in which case this DAG is misleading.] Here we have one direct path (the causal effect) between first class ($D$) and survival ($Y$), and that's $D \rightarrow Y$. But we have two backdoor paths. For instance, we have the $D \leftarrow C \rightarrow Y$ backdoor path, and

we have the $D \leftarrow W \rightarrow Y$ backdoor path. But fortunately, we have data that includes both age and gender, so it is possible to close each backdoor path and therefore satisfy the backdoor criterion. We will use subclassification to do that.

But before we use subclassification to achieve the backdoor criterion, let's calculate a naive simple difference in outcomes (SDO), which is just

\[ E[Y \mid D = 1] - E[Y \mid D = 0] \]

for the sample.

. gen female=(sex==0)
. label variable female "Female"
. gen male=(sex==1)
. label variable male "Male"
. gen s=1 if (female==1 & age==1)
. replace s=2 if (female==1 & age==0)
. replace s=3 if (female==0 & age==1)
. replace s=4 if (female==0 & age==0)
. gen d=1 if class==1
. replace d=0 if class!=1
. summarize survived if d==1
. gen ey1=r(mean)
. summarize survived if d==0
. gen ey0=r(mean)
. gen sdo=ey1-ey0
. su sdo

When you run this code, you'll find that the people in first class were 35.4% more likely to survive than people in any other group of passengers, including the crew. But note, this does not take into account the confounders of age and gender. So next we use subclassification weighting to control for these confounders. Here are the steps that entails:

1. Stratify the data into four groups: young males, young females, old males, old females.

2. Calculate the difference in survival probabilities for each group.

3. Calculate the number of people in each non-first-class group and divide by the total number of non-first-class passengers. These are our strata-specific weights.

4. Calculate the weighted average survival rate using the strata weights.

Let's do this in Stata, which hopefully will make these steps more concrete.

. cap n drop ey1 ey0
. su survived if s==1 & d==1
. gen ey11=r(mean)
. label variable ey11 "Average survival for male child in treatment"
. su survived if s==1 & d==0
. gen ey10=r(mean)
. label variable ey10 "Average survival for male child in control"
. gen diff1=ey11-ey10
. label variable diff1 "Difference in survival for male children"
. su survived if s==2 & d==1
. gen ey21=r(mean)
. su survived if s==2 & d==0
. gen ey20=r(mean)
. gen diff2=ey21-ey20
. su survived if s==3 & d==1
. gen ey31=r(mean)
. su survived if s==3 & d==0
. gen ey30=r(mean)
. gen diff3=ey31-ey30
. su survived if s==4 & d==1
. gen ey41=r(mean)
. su survived if s==4 & d==0
. gen ey40=r(mean)
. gen diff4=ey41-ey40
. count if s==1 & d==0
. count if s==2 & d==0
. count if s==3 & d==0
. count if s==4 & d==0
. count
. gen wt1=425/2201
. gen wt2=45/2201
. gen wt3=1667/2201
. gen wt4=64/2201
. gen wate=diff1*wt1 + diff2*wt2 + diff3*wt3 + diff4*wt4
. sum wate sdo
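The weighting arithmetic in the Stata code above can also be sketched in Python. This is a minimal sketch, not the text's code: the strata counts (425, 45, 1667, 64 out of 2201) come from the `count` commands above, but the per-stratum survival differences below are hypothetical placeholders, so run the Stata code on the titanic data for the real values.

```python
# Sketch of subclassification weighting (steps 1-4 above) in Python.
# Strata counts come from the Stata `count` output above; the per-stratum
# survival differences (diff) are HYPOTHETICAL placeholders.

strata = {
    "s1": {"diff": 0.30, "n_control": 425},
    "s2": {"diff": 0.10, "n_control": 45},
    "s3": {"diff": 0.20, "n_control": 1667},
    "s4": {"diff": 0.15, "n_control": 64},
}
total = 2201  # total passengers and crew

def weighted_ate(strata, total):
    # Step 3: each stratum's weight is its non-first-class count over the
    # total; step 4: the weighted average of the per-stratum differences.
    return sum(s["diff"] * s["n_control"] / total for s in strata.values())

wate = weighted_ate(strata, total)
```

Note that the four counts sum to exactly 2201, so the weights sum to one; with the real titanic differences this weighted average is the roughly 16% figure the text reports, versus the 35.4% SDO.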

Here we find that once we condition on the confounders, gender and age, we find a much lower probability of survival associated with first class (though frankly, still large). The weighted ATE is 16.1%, versus the SDO, which is 35.4%.

Curse of dimensionality. Here we've been assuming two covariates, each of which has two possible values. But this was for convenience. Our Titanic dataset, for instance, only came to us with two possible values for age: child and adult. But what if it had come to us with multiple values for age, like specific age? Then once we condition on individual age and gender, it's entirely likely that we will not have the information necessary to calculate differences within strata, and therefore be unable to calculate the strata-specific weights that we need for subclassification.

For this next part, let's assume that we have precise data on Titanic survivor ages. But because this will get incredibly laborious, let's just focus on a few of them.

Table 18: Subclassification example of Titanic survival for large K

                    Survival Prob                  Number of
Age and Gender      1st Class   Controls   Diff.   1st Class   Controls
Male 11-yo          1.0         0          1.0     1           2
Male 12-yo          –           0          –       0           1
Male 13-yo          1.0         .25        .75     1           4
Male 14-yo          –           0          –       0           …
...

Here we see an example of the common support assumption being violated. The common support assumption requires that for each stratum there exist observations in both the treatment and control group, but as you can see, there are not any 12-year-old male passengers in first class, nor are there any 14-year-old male passengers in first class. And if we were to do this for every age × gender combination, we would find that this problem was quite common. Thus we cannot calculate the ATE. But let's say that the problem was always on the treatment group, not the control group.
That is, let's assume that there is always someone in the control group for a given gender × age combination, but there isn't always someone in the treatment group. Then we can calculate the ATT. As you see in this table, for two strata – the 11- and 13-year-olds – there are both treatment and control group values for the calculation. So long as there exist controls for a given treatment stratum, we can calculate the ATT. The equation to do so can be compactly

written as:

\[ \widehat{\delta}_{ATT} = \sum_{k=1}^{K} \bigg( \frac{N_k^T}{N_T} \bigg) \times \big( \overline{Y}^{1,k} - \overline{Y}^{0,k} \big) \]

Plugging in values for those summations gives us the ATT for this subsample. We've now seen a problem that arises with subclassification: in a finite sample, subclassification becomes less feasible as the number of covariates grows, because as $K$ grows, the data becomes sparse. We will at some point be missing values, in other words, for those $K$ categories. Imagine if we tried to add a third stratifying variable, say race (black and white). Then we'd have two age categories, two gender categories and two race categories, giving us eight possibilities. In this small sample, we will probably end up with many cells containing missing information. This is called the curse of dimensionality. If sparseness occurs, it means many cells may contain either only treatment units or only control units, but not both. If that happens, we can't use subclassification, because we do not have common support. And therefore we are left searching for an alternative method to satisfy the backdoor criterion.

Exact matching. Subclassification uses the difference between treatment and control group units, and achieves covariate balance by using the $K$ probability weights to weight the averages. It's a simple method, but it has the aforementioned problem of the "curse of dimensionality". And that's probably going to be an issue in practically any research you undertake, because it may not be merely one variable you're worried about, but several, in which case you'll already be running into the curse. But the thing that we emphasize here is that the subclassification method uses the raw data, but weights it so as to achieve balance. We are weighting the differences, and then summing over those weighted differences.

But there are alternative approaches. For instance, what if we estimated $\widehat{\delta}_{ATT}$ by imputing the missing potential outcomes by conditioning on the confounding, observed covariate?
Specifically, what if we filled in the missing potential outcome for each treatment unit using a control group unit that was "closest" to the treatment group unit for some confounder $X$? This would give us estimates of all the counterfactuals, from which we could simply take the average over the differences. As we will show, this will also achieve covariate balance. This method is called matching. There are two broad types of matching that we will consider: exact

matching and approximate matching. We will first start by describing exact matching. Much of what I am going to be discussing is based on Abadie and Imbens [2006]. [Footnote 69: I first learned about this form of matching from lectures by Alberto Abadie at the Northwestern Causal Inference workshop – a workshop that I highly recommend.]

A simple matching estimator is the following:

\[ \widehat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \big( Y_i - Y_{j(i)} \big) \]

where $Y_{j(i)}$ is the $j$-th unit matched to the $i$-th unit based on the $j$-th unit being "closest to" the $i$-th unit for some covariate $X$. For instance, let's say that a unit in the treatment group has a covariate with a value of 2, and we find another unit in the control group (exactly one unit) with a covariate value of 2. Then we will impute the treatment unit's missing counterfactual with the matched unit's outcome, and take a difference.

But what if there's more than one unit "closest to" the $i$-th unit? For instance, say that the same unit $i$ has a covariate value of 2 and we find two units $j$ with a value of 2. What can we then do? Well, one option is to simply take the average of those two units' outcome value $Y$. What if we find 3? What if we find 4, and so on? However many matches $M$ that we find, we would assign the average outcome as the counterfactual for the treatment group unit. Notationally, we can describe this estimator as

\[ \widehat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \bigg( Y_i - \bigg[ \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \bigg] \bigg) \tag{1} \]

This really isn't too different an estimator from the one before it; the difference is the $\frac{1}{M}$, which is the averaging over closest matches that we were talking about. This approach works well when we can find a number of good matches for each treatment group unit. We usually define $M$ to be small, like $M = 2$. If there are more than 2, then we may simply randomly select two units to average outcomes over. [Footnote 70: Note that all of these approaches require some programming, as they're algorithms.]

Those were all estimators of the average treatment effect on the treatment group, $\widehat{\delta}_{ATT}$. You can tell that these are ATT estimators because of the $D_i = 1$ in the subscript of the summation operator. [Footnote 71: Notice the summing over the treatment group.] But we can also estimate the ATE. Note, though, that when estimating the ATE we are filling in both (a) missing control group units, like before, and (b) missing treatment group units. If observation $i$ is treated, in other words, then we need to fill in the missing $Y_i^0$ using the control matches, and if observation $i$ is a control group unit, then we need to fill in the missing $Y_i^1$ using the treatment group matches. The estimator is below. It looks scarier than it really is. It's actually a very compact, nicely written out estimator equation.

\[ \widehat{\delta}_{ATE} = \frac{1}{N} \sum_{i=1}^{N} (2D_i - 1) \bigg( Y_i - \bigg[ \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \bigg] \bigg) \tag{2} \]

The $2D_i - 1$ is the nice little trick. When $D_i = 1$, the leading term becomes a 1 ($2 \times 1 - 1 = 1$). And when $D_i = 0$, the leading term becomes a negative 1 ($2 \times 0 - 1 = -1$), and the outcomes reverse order so that the treatment observation can be imputed. Nice little mathematical form!

Let's see this work in action by working with an example. Table 19 shows two samples: a list of participants in a job trainings program and a list of non-participants, or non-trainees. The left-hand-side group is the treatment group and the right-hand-side group is the control group. The matching algorithm that we defined earlier will create a third group, called the matched sample, consisting of each treatment group unit's matched counterfactual. Here we will match on the age of the participant.

Table 19: Training example with exact matching

        Trainees                 Non-Trainees
Unit    Age    Earnings    Unit    Age    Earnings
1       18     9500        1       20     8500
2       29     12250       2       27     10075
3       24     11000       3       21     8725
4       27     11750       4       39     12775
5       33     13250       5       38     12550
6       22     10500       6       29     10525
7       19     9750        7       39     12775
8       20     10000       8       33     11425
9       21     10250       9       24     9400
10      30     12500       10      30     10750
                           11      33     11425
                           12      36     12100
                           13      22     8950
                           14      18     8050
                           15      43     13675
                           16      39     12775
                           17      19     8275
                           18      30     9000
                           19      51     15475
                           20      48     14800
Mean    24.3   $11,075     Mean    31.95  $11,101.25

Before we do this, though, I want to show you how the ages of the trainees differ on average from the ages of the non-trainees.

We can see in Table 19 that the average age of the participants is 24.3 and the average age of the non-participants is 31.95. Thus the people in the control group are older, and since wages typically rise with age, we may suspect that part of the reason their average earnings are higher ($11,101 vs. $11,075) is because the control group is older. We say that the two groups are not exchangeable because the covariate is not balanced. Let's look at the age distribution ourselves to see. To illustrate this, let's download the data first. We will create two histograms – the distribution of age for the treatment and non-trainee groups – as well as summarize earnings for each group. That information is also displayed in Figure 15.

. scuse training_example, clear
. histogram age_treat, bin(10) frequency
. histogram age_controls, bin(10) frequency
. su age_treat age_controls
. su earnings_treat earnings_control

As you can see from Figure 15, these two populations not only have different means (Table 19), but the entire distribution of age across the samples is different. So let's use our matching algorithm and create the missing counterfactuals for each treatment group unit. This method, since it only imputes the missing units for each treatment unit, will yield an estimate of $\widehat{\delta}_{ATT}$.

Now let's move to creating the matched sample. As this is exact matching, the distance traveled to the nearest neighbor will be zero. This won't always be the case, but note that as the control group sample size grows, the likelihood that we find a unit with the same covariate value as one in the treatment group grows. I've created a dataset like this. The first treatment unit has an age of 18. Searching down through the non-trainees, we find exactly one person with an age of 18, and that's unit 14. So we move the age and earnings information to the new matched sample columns.
We continue doing that for all units, always moving the control group unit with the closest value on $X$ to fill in the missing counterfactual for each treatment unit. If we run into a situation where there's more than one control group unit "close", then we simply average over them. For instance, there are two units in the non-trainees group with an age of 30 – units 10 and 18 – so we averaged their earnings and matched that average to treatment unit 10. This is filled out in Table 20. Now we see that the mean age is the same for both groups. We can also check the overall age distribution (Figure 16). As you can

[Figure 15: Covariate distribution by job trainings and control. Two histograms of age (frequency): trainees and non-trainees.]

Table 20: Training example with exact matching (including matched sample)

        Trainees                 Non-Trainees               Matched Sample
Unit    Age    Earnings    Unit    Age    Earnings    Unit     Age    Earnings
1       18     9500        1       20     8500        14       18     8050
2       29     12250       2       27     10075       6        29     10525
3       24     11000       3       21     8725        9        24     9400
4       27     11750       4       39     12775       2        27     10075
5       33     13250       5       38     12550       8, 11    33     11425
6       22     10500       6       29     10525       13       22     8950
7       19     9750        7       39     12775       17       19     8275
8       20     10000       8       33     11425       1        20     8500
9       21     10250       9       24     9400        3        21     8725
10      30     12500       10      30     10750       10, 18   30     9875
                           11      33     11425
                           12      36     12100
                           13      22     8950
                           14      18     8050
                           15      43     13675
                           16      39     12775
                           17      19     8275
                           18      30     9000
                           19      51     15475
                           20      48     14800
Mean    24.3   $11,075     Mean    31.95  $11,101.25  Mean     24.3   $9,380
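The exact-matching calculation in Tables 19 and 20 can be reproduced outside of Stata. Here is a sketch in Python using the (age, earnings) pairs from Table 19; control units tied at the closest age are averaged, as with units 10 and 18 in the text.

```python
# Reproducing the exact matching from Tables 19-20 in Python (the text
# uses Stata). Each tuple is (age, earnings) from Table 19.

trainees = [(18, 9500), (29, 12250), (24, 11000), (27, 11750), (33, 13250),
            (22, 10500), (19, 9750), (20, 10000), (21, 10250), (30, 12500)]
controls = [(20, 8500), (27, 10075), (21, 8725), (39, 12775), (38, 12550),
            (29, 10525), (39, 12775), (33, 11425), (24, 9400), (30, 10750),
            (33, 11425), (36, 12100), (22, 8950), (18, 8050), (43, 13675),
            (39, 12775), (19, 8275), (30, 9000), (51, 15475), (48, 14800)]

def matched_earnings(age, pool):
    """Average earnings over all control units closest in age."""
    best = min(abs(c_age - age) for c_age, _ in pool)
    ties = [earn for c_age, earn in pool if abs(c_age - age) == best]
    return sum(ties) / len(ties)

matched = [matched_earnings(age, controls) for age, _ in trainees]
att = sum(earn - m for (_, earn), m in zip(trainees, matched)) / len(trainees)
# att is 1695.0, the $1,695 effect reported in the text; the matched
# sample's mean earnings is 9380.0, as in Table 20.
```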

see, the two groups are exactly balanced on age. Therefore we describe the two groups as exchangeable. And the difference in earnings between those in the treatment group and those in the matched sample is $1,695. That is, we estimate that the causal effect of the program was $1,695 in higher earnings. [Footnote 73: I included code for reproducing this information as well.]

. scuse training_example, clear
. histogram age_treat, bin(10) frequency
. histogram age_matched, bin(10) frequency
. su age_treat age_controls
. su earnings_matched

Let's summarize what we've learned. We've been using a lot of different terms, drawn from different authors and different statistical traditions, so I'd like to map them onto one another. The two groups were different in ways that were directly related to both the treatment and the outcome itself. This means that the independence assumption was violated. Matching on $X$ meant creating an exchangeable set of observations – the matched sample – and what characterized this matched sample was balance.

Approximate matching. The previous example of matching was relatively simple: find a unit or collection of units that have the same value of some covariate $X$ and substitute their outcomes as some unit's counterfactuals. Once you've done that, average the differences, and you have an estimate of the ATE. But what if, when you tried to find a match, you couldn't find another unit with that exact same value? Then you're in the world of a set of procedures that I'm calling approximate matching.

Nearest neighbor covariate matching. One of the instances where exact matching can break down is when the number of covariates, $K$, grows large. And when we have to match on more than one variable, but are not using the subclassification approach, one of the first things we confront is the concept of distance.
What does it mean for one unit's covariate to be "close" to someone else's? Furthermore, what does it mean when there are multiple covariates, and therefore measurements in multiple dimensions? Matching on a single covariate is straightforward because distance is measured in terms of the covariate's own values. For instance, a distance in age is simply how close in years or months or days the

[Figure 16: Covariate distribution by job trainings and matched sample. Two histograms of age (frequency): trainees and matched sample.]

person is to another person. But what if we have several covariates needed for matching? Say it's age and log income. A one-point change in age is very different from a one-point change in log income, not to mention that we are now measuring distance in two dimensions, not one. When the number of matching covariates is more than one, we need a new definition of distance to measure closeness. We begin with the simplest measure of distance, the Euclidean distance:

\[ \| X_i - X_j \| = \sqrt{(X_i - X_j)'(X_i - X_j)} = \sqrt{\sum_{n=1}^{k} (X_{ni} - X_{nj})^2} \]

The problem with this measure of distance is that the distance measure itself depends on the scale of the variables themselves. For this reason, researchers typically use some modification of the Euclidean distance, such as the normalized Euclidean distance, or they'll use an alternative distance measure altogether. The normalized Euclidean distance is a commonly used distance, and what makes it different is that the distance of each variable is scaled by the variable's variance. The distance is measured as:

\[ \| X_i - X_j \| = \sqrt{(X_i - X_j)' \widehat{V}^{-1} (X_i - X_j)} \]

where

\[ \widehat{V} = \begin{pmatrix} \widehat{\sigma}_1^2 & 0 & \cdots & 0 \\ 0 & \widehat{\sigma}_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \widehat{\sigma}_k^2 \end{pmatrix} \]

Notice that the normalized Euclidean distance is equal to:

\[ \| X_i - X_j \| = \sqrt{\sum_{n=1}^{k} \frac{(X_{ni} - X_{nj})^2}{\widehat{\sigma}_n^2}} \]

Thus if there are changes in the scale of $X$, those changes also affect its variance, and so the normalized Euclidean distance does not change.

Finally, there is the Mahalanobis distance which, like the normalized Euclidean distance, is a scale-invariant distance metric. It is:

\[ \| X_i - X_j \| = \sqrt{(X_i - X_j)' \widehat{\Sigma}_X^{-1} (X_i - X_j)} \]

where $\widehat{\Sigma}_X$ is the sample variance-covariance matrix of $X$.
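To make the scale-invariance point concrete, here is a small Python sketch (not from the text; the sample values are made up) comparing the plain and normalized Euclidean distances on two covariates, age and log income.

```python
# Plain vs. normalized Euclidean distance for k = 2 covariates.
# Dividing each squared difference by that covariate's sample variance
# makes the distance invariant to rescaling a covariate.
import math

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def euclidean(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def normalized_euclidean(xi, xj, sample):
    variances = [variance(col) for col in zip(*sample)]
    return math.sqrt(sum((a - b) ** 2 / v
                         for a, b, v in zip(xi, xj, variances)))

sample = [(25, 10.1), (30, 10.9), (35, 11.5), (40, 12.2)]  # (age, log income)
d1 = normalized_euclidean(sample[0], sample[1], sample)

# Rescale the second covariate by 100: the normalized distance is
# unchanged, while the plain Euclidean distance is not.
scaled = [(a, 100 * b) for a, b in sample]
d2 = normalized_euclidean(scaled[0], scaled[1], scaled)
```

The Mahalanobis distance behaves the same way under rescaling, but additionally accounts for the covariance between covariates.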

Basically, more than one covariate creates a lot of headaches. Not only does it create the curse-of-dimensionality problem, but it also makes measuring distance harder. All of this creates some challenges for finding a good match in the data. As you can see in each of these distance formulas, there are sometimes going to be matching discrepancies: sometimes $X_i \neq X_{j(i)}$. What does this mean? It means that some unit $i$ has been matched with some unit $j$ on the basis of a similar covariate value of $X = x$. Maybe unit $i$ has an age of 25, but unit $j$ has an age of 26. Their difference is 1. Sometimes the discrepancies are small, sometimes zero, sometimes large. But as they move away from zero, they become more problematic for our estimation and introduce bias.

How severe is this bias? First, the good news. What we know is that the matching discrepancies tend to converge to zero as the sample size increases – which is one of the main reasons that approximate matching is so data greedy. It demands a large sample size in order for the matching discrepancies to be trivially small. But what if there are many covariates? The more covariates, the longer it takes for that convergence to zero to occur. Basically, if it's hard to find good matches with an $X$ that has a large dimension, then you will need a lot of observations as a result. The larger the dimension, the greater the likelihood of matching discrepancies, and the more data you need.

Bias correction. This material is drawn from Abadie and Imbens [2011], which introduces bias-correction techniques for matching estimators when there are matching discrepancies in finite samples. So let's begin.

Everything we're getting at is suggesting that matching is biased due to these poor matching discrepancies. So let's derive this bias. First, we write out the sample estimate, and then we subtract out the true ATT:

\[ \widehat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \big( Y_i - Y_{j(i)} \big) \]

where each $i$ and $j(i)$ unit are matched, $X_i \approx X_{j(i)}$, and $D_{j(i)} = 0$. Next we define the conditional expectation outcomes:

\begin{align*}
\mu^0(x) &= E[Y \mid X = x, D = 0] = E[Y^0 \mid X = x] \\
\mu^1(x) &= E[Y \mid X = x, D = 1] = E[Y^1 \mid X = x]
\end{align*}

Notice, these are just the expected conditional outcome functions based on the switching equation for both control and treatment groups.

As always, we write out the observed value as a function of expected conditional outcomes and some stochastic element:

\[ Y_i = \mu^{D_i}(X_i) + \varepsilon_i \]

Now rewrite the ATT estimator using the above $\mu$ terms:

\begin{align*}
\widehat{\delta}_{ATT} &= \frac{1}{N_T} \sum_{D_i=1} \big[ (\mu^1(X_i) + \varepsilon_i) - (\mu^0(X_{j(i)}) + \varepsilon_{j(i)}) \big] \\
&= \frac{1}{N_T} \sum_{D_i=1} \big( \mu^1(X_i) - \mu^0(X_{j(i)}) \big) + \frac{1}{N_T} \sum_{D_i=1} \big( \varepsilon_i - \varepsilon_{j(i)} \big)
\end{align*}

Notice, the first line is just the ATT estimator with the stochastic element included from the previous line. And the second line rearranges it so that we get two terms: the estimated ATT plus the average difference in the stochastic terms for the matched sample.

Now we compare this estimator with the true value of the ATT:

\[ \widehat{\delta}_{ATT} - \delta_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \big( \mu^1(X_i) - \mu^0(X_{j(i)}) \big) - \delta_{ATT} + \frac{1}{N_T} \sum_{D_i=1} \big( \varepsilon_i - \varepsilon_{j(i)} \big) \]

which with some simple algebraic manipulation is:

\begin{align*}
\widehat{\delta}_{ATT} - \delta_{ATT} &= \frac{1}{N_T} \sum_{D_i=1} \big( \mu^1(X_i) - \mu^0(X_i) - \delta_{ATT} \big) \\
&\quad + \frac{1}{N_T} \sum_{D_i=1} \big( \varepsilon_i - \varepsilon_{j(i)} \big) \\
&\quad + \frac{1}{N_T} \sum_{D_i=1} \big( \mu^0(X_i) - \mu^0(X_{j(i)}) \big).
\end{align*}

Applying the central limit theorem, the difference $\sqrt{N_T}\big(\widehat{\delta}_{ATT} - \delta_{ATT}\big)$ converges to a normal distribution with zero mean. However,

\[ E\big[\sqrt{N_T}\big(\widehat{\delta}_{ATT} - \delta_{ATT}\big)\big] = E\big[\sqrt{N_T}\big( \mu^0(X_i) - \mu^0(X_{j(i)}) \big) \mid D = 1\big]. \]

Now consider the implications if the number of covariates is large. First, the difference between $X_i$ and $X_{j(i)}$ converges to zero slowly. Second, this makes the difference $\mu^0(X_i) - \mu^0(X_{j(i)})$ converge to zero very slowly. Third, $E\big[\sqrt{N_T}\big( \mu^0(X_i) - \mu^0(X_{j(i)}) \big) \mid D = 1\big]$ may not converge to zero. And fourth, $E\big[\sqrt{N_T}\big(\widehat{\delta}_{ATT} - \delta_{ATT}\big)\big]$ may not converge to zero.

As you can see, the bias of the matching estimator can be severe depending on the magnitude of these matching discrepancies. However, one good piece of news is that these discrepancies are observed. We can see the degree to which each unit's matched sample has severe mismatch on the covariates themselves. Secondly, we can always

make the matching discrepancy small by using a large donor pool of untreated units to select our matches because, recall, the likelihood of finding a good match grows as a function of the sample size. So if we are content with estimating the ATT, then increasing the size of the donor pool can buy us out of this mess. But let's say we can't do that and the matching discrepancies are large. Then we can apply bias-correction methods to minimize the size of the bias. So let's see what the bias-correction method looks like. This is based on Abadie and Imbens [2011].

Note that the total bias is made up of the bias associated with each individual unit $i$. Thus, each treated observation contributes $\mu^0(X_i) - \mu^0(X_{j(i)})$ to the overall bias. The bias-corrected matching estimator is the following:

\[ \widehat{\delta}_{ATT}^{BC} = \frac{1}{N_T} \sum_{D_i=1} \Big[ (Y_i - Y_{j(i)}) - \big( \widehat{\mu}^0(X_i) - \widehat{\mu}^0(X_{j(i)}) \big) \Big] \]

where $\widehat{\mu}^0(X)$ is an estimate of $E[Y \mid X = x, D = 0]$ using, for example, OLS.

Again, I find it always helpful if we take a crack at these estimators with concrete data. Table 21 contains more make-believe data for 8 units, 4 of whom are treated and the rest of whom are functioning as controls. According to the switching equation, we only observe the actual outcomes associated with the potential outcomes under treatment or control, which therefore means we're missing the control values for our treatment group.

Table 21: Another matching example (this time to illustrate bias correction)

Unit   Y1   Y0   D   X
1      5         1   11
2      2         1   7
3      10        1   5
4      6         1   3
5           4    0   10
6           0    0   8
7           5    0   4
8           1    0   1

Notice in this example, we cannot implement exact matching because none of the treatment group units have exact matches in the control group. It's worth emphasizing that this is a consequence of finite samples; the likelihood of finding an exact match grows when the sample size of the control group grows faster than that of the treatment group.
Instead, we use nearest-neighbor matching, which simply matches, to each treatment unit, the control group unit whose covariate value is nearest to that of the

treatment group unit itself. But when we do this kind of matching, we necessarily create matching discrepancies, which is simply another way of saying that the covariates are not perfectly matched for every unit. Nonetheless, the nearest-neighbor "algorithm" creates Table 22.

Table 22: Nearest neighbor matched sample

Unit   Y1   Y0   D   X
1      5    4    1   11
2      2    0    1   7
3      10   5    1   5
4      6    1    1   3
5           4    0   10
6           0    0   8
7           5    0   4
8           1    0   1

Recall that $\widehat{\delta}_{ATT} = \frac{1}{4} + \frac{2}{4} + \frac{5}{4} + \frac{5}{4} = 3.25$. With the bias correction, we need to estimate $\widehat{\mu}^0(X)$. We'll use OLS. [Footnote 74: Hopefully, now it will be obvious what exactly $\widehat{\mu}^0(X)$ is. All that it is is the fitted values from a regression of $Y$ on $X$.] Let's illustrate this using another Stata dataset based on Table 22.

. scuse training_bias_reduction, clear
. reg Y X
. predict muhat
. list

When we regress $Y$ on $X$, we get the following estimated coefficients:

\begin{align*}
\widehat{\mu}^0(X) &= \widehat{\beta}_0 + \widehat{\beta}_1 X \\
&= 4.42 - 0.049X
\end{align*}

This gives us the following table of outcomes, treatment status and predicted values. And then this would be done for the other three simple differences, each of which is added to a bias-correction term based on the fitted values from the covariate values.

Now care must be given when using the fitted values for bias correction, so let me walk you through it. You are still going to be taking the simple differences (e.g., 5 − 4 for row 1), but now you will also subtract out the fitted values associated with each observation's unique covariate. So, for instance, in row 1, the outcome 5 has a covariate of 11, which gives it a fitted value of 3.89, but its matched counterfactual has a covariate of 10, which gives it a predicted value of 3.94. So therefore we

would use the following bias correction:

\[ \widehat{\delta}_{ATT}^{BC} = \frac{(5 - 4) - (3.89 - 3.94)}{4} + \dots \]

Now that we see how a specific fitted value is calculated and how it contributes to the calculation of the ATT, let's look at the entire calculation.

Table 23: Nearest neighbor matched sample with fitted values for bias correction

Unit   Y    D   X    $\widehat{\mu}^0(X)$
1      5    1   11   3.89
2      2    1   7    4.08
3      10   1   5    4.18
4      6    1   3    4.28
5      4    0   10   3.94
6      0    0   8    4.03
7      5    0   4    4.23
8      1    0   1    4.37

\begin{align*}
\widehat{\delta}_{ATT}^{BC} &= \frac{(5-4) - \big(\widehat{\mu}^0(11) - \widehat{\mu}^0(10)\big)}{4} + \frac{(2-0) - \big(\widehat{\mu}^0(7) - \widehat{\mu}^0(8)\big)}{4} \\
&\quad + \frac{(10-5) - \big(\widehat{\mu}^0(5) - \widehat{\mu}^0(4)\big)}{4} + \frac{(6-1) - \big(\widehat{\mu}^0(3) - \widehat{\mu}^0(1)\big)}{4} \\
&= 3.28
\end{align*}

which is slightly higher than the unadjusted estimate of 3.25. Note that this bias-correction adjustment becomes more significant as the matching discrepancies themselves become more common. But if the matching discrepancies are not very common in the first place, then practically by definition, bias adjustment doesn't change the estimated parameter very much.

Bias arises because of the effect of large matching discrepancies. To minimize these discrepancies, we need a small number of matches $M$ (e.g., $M = 1$). Larger values of $M$ produce larger matching discrepancies. Second, we need matching with replacement. Because matching with replacement can use untreated units as a match more than once, matching with replacement produces smaller discrepancies. And finally, try to match covariates with a large effect on $\mu^0(\cdot)$ well.

The matching estimators have a normal distribution in large samples provided the bias is small. For matching without replacement, the usual variance estimator is valid. That is:

\[ \widehat{\sigma}^2_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \bigg( Y_i - \frac{1}{M}\sum_{m=1}^{M} Y_{j_m(i)} - \widehat{\delta}_{ATT} \bigg)^2 \]

For matching with replacement:

σ̂²_ATT = (1/N_T) Σ_{i: D_i = 1} ( Y_i − (1/M) Σ_{m=1}^{M} Y_{j_m(i)} − δ̂_ATT )²
        + (1/N_T) Σ_{i: D_i = 0} [ K_i (K_i − 1) / M² ] · var̂(ε_i | X_i, D_i = 0)

where K_i is the number of times that observation i is used as a match. var(Y_i | X_i, D_i = 0) can also be estimated by matching. For example, take two observations with D_i = D_j = 0 and X_i ≈ X_j; then

var̂(Y_i | X_i, D_i = 0) = (Y_i − Y_j)² / 2

is an unbiased estimator of var(ε_i | X_i, D_i = 0). The bootstrap, though, doesn't work.

Propensity score methods

There are several ways of achieving the conditioning strategy implied by the backdoor criterion. One additional one was developed by Donald Rubin in the mid-1970s to early 1980s, called the propensity score method [Rubin, 1977; Rosenbaum and Rubin, 1983]. The propensity score is very similar in spirit both to nearest neighbor covariate matching by Abadie and Imbens [2006] and to subclassification. It's a very popular method, particularly in the medical sciences, for addressing selection on observables, and it has gained some use among economists as well [Dehejia and Wahba, 2002].

Before we dig into it, though, a couple of words to help manage your expectations. Propensity score matching has not been as widely used by economists as other methods for causal inference because economists are oftentimes skeptical that CIA can be achieved in any dataset. This is because, for many applications, economists as a group are more concerned about selection on unobservables than about selection on observables, and as such have not found matching methods as useful. I am agnostic as to whether CIA holds or doesn't in your particular application, though. Only a DAG will tell you what the appropriate identification strategy is, and insofar as the backdoor criterion can be met, matching methods may be appropriate.
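Before turning to propensity scores, the without-replacement variance formula can be made concrete on the toy matched sample from the bias-correction example. A sketch (with M = 1, the inner average is just the single matched outcome):

```python
import numpy as np

# Toy matched pairs from the earlier nearest-neighbor example (M = 1).
Y_t = np.array([5, 2, 10, 6])    # treated outcomes
Y_m = np.array([4, 0, 5, 1])     # their matched control outcomes
M = 1
N_T = len(Y_t)

att = np.mean(Y_t - Y_m / M)                        # 3.25
sigma2_att = np.mean((Y_t - Y_m / M - att) ** 2)    # the displayed formula
```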
Propensity score matching is used when treatment is nonrandom but is believed to be based on a variety of observable covariates. It requires that the CIA hold in the data. Propensity score matching takes those covariates needed to satisfy CIA, estimates a maximum likelihood model of the conditional probability of treatment, and uses the predicted values from that estimation to collapse those covariates

134 causal inference: the mixtape

into a single scalar. All comparisons between the treatment and control group are then based on that value.⁷⁵ But I cannot emphasize this enough: this method, like regression more generally, only has value for your project if you can satisfy the backdoor criterion by conditioning on X. If you cannot satisfy the backdoor criterion in your data, then the propensity score gains you nothing. It is absolutely critical that your DAG be, in other words, defensible and accurate, as you depend on those theoretical relationships to design the appropriate identification strategy.⁷⁶

⁷⁵ There are multiple methods that use the propensity score, as we will see, but they all involve using the propensity score to make valid comparisons between the treatment group and control group.
⁷⁶ We will discuss in the instrumental variables chapter a common method for addressing a situation where the backdoor criterion cannot be met in your data.

The idea with propensity score methods is to compare units who, based on observables, had very similar probabilities of being placed into the treatment group even though those units differed with regards to actual treatment assignment. If, conditional on X, two units have the same probability of being treated, then we say they have similar propensity scores. If two units have the same propensity score, but one is in the treatment group and the other is not, and the conditional independence assumption (CIA) credibly holds in the data, then differences between their observed outcomes are attributable to the treatment. CIA in this context means that the assignment of treatment, conditional on the propensity score, is independent of potential outcomes, or "as good as random".⁷⁷

⁷⁷ This is what is meant by the phrase selection on observables.

One of the goals when using propensity score methods is to create covariate balance between the treatment group and control group
such that the two groups become observationally exchangeable.⁷⁸

⁷⁸ Exchangeable simply means that the two groups appear similar to one another on observables.

There are three steps to using propensity score matching. The first step is to estimate the propensity score; the second step is to select an algorithmic method incorporating the propensity score to calculate average treatment effects; the final step is to calculate standard errors. The first step is always the same regardless of which algorithmic method we use in the second stage: we use maximum likelihood models, usually probit or logit, to estimate the conditional probability of treatment. Before walking through an example using real data, let's review some papers that use it.

Example: the NSW Job Trainings Program

The National Supported Work Demonstration (NSW) job trainings program was operated by the Manpower Demonstration Research Corp (MDRC) in the mid-1970s. The NSW was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. It was also unique in that it randomly assigned qualified applicants to training positions. The treatment group received all the benefits of the NSW program. The controls were basically left to fend for themselves. The program admitted

AFDC females, ex-drug addicts, ex-criminal offenders, and high school dropouts of both sexes.

Treatment group members were guaranteed a job for 9-18 months depending on the target group and site. They were then divided into crews of 3-5 participants who worked together and met frequently with an NSW counselor to discuss grievances and performance. Finally, they were paid for their work. NSW offered the trainees lower wage rates than they would've received on a regular job, but allowed their earnings to increase for satisfactory performance and attendance. After their term expired, they were forced to find regular employment. The kinds of jobs varied within sites: some were gas station attendants, some worked at a printer shop, and males and females were frequently performing different kinds of work.

The MDRC collected earnings and demographic information from both the treatment and the control group at baseline as well as every 9 months thereafter. MDRC also conducted up to 4 post-baseline interviews. There were different sample sizes from study to study, which can be confusing, but it has simple explanations.

NSW was a randomized job trainings program; therefore the independence assumption was satisfied. So calculating average treatment effects was straightforward: it's the simple difference in means estimator that we discussed in the Rubin causal chapter.⁷⁹

(1/N_T) Σ_{D_i = 1} Y_i − (1/N_C) Σ_{D_i = 0} Y_i ≈ E[Y¹ − Y⁰]

⁷⁹ Remember, randomization means that the treatment was independent of the potential outcomes, so the simple difference in means identifies the average treatment effect.

The good news for MDRC, and the treatment group, was that the treatment worked.⁸⁰ Treatment group participants' real earnings post-treatment in 1978 were larger than those of the control group by

⁸⁰ Lalonde [1986] lists several studies that discuss the findings from the program in footnote 3.
approximately $900 [Lalonde, 1986] to $1,800 [Dehejia and Wahba, 2002], depending on the sample the researcher used.

Lalonde [1986] is an interesting study both because he is evaluating the NSW program and because he is evaluating commonly used econometric methods from that time. He evaluated the econometric estimators' performance by trading out the experimental control group data with non-experimental control group data drawn from the population of US citizens. He used three samples of the Current Population Survey (CPS) and three samples of the Panel Survey of Income Dynamics (PSID) for this non-experimental control group data. Non-experimental data is, after all, the typical situation an economist finds herself in. But the difference with the NSW is that it was a randomized experiment, and therefore we know the average treatment effect. Since we know the average treatment effect, we can see how well a variety of econometric models perform. If the NSW program increased earnings by approximately $900, then we should find that if

136 causal inference: the mixtape

the other econometric estimators perform well, they recover an effect of about $900, right?

Lalonde [1986] reviewed a number of popular econometric methods from this time using both the PSID and the CPS samples as non-experimental comparison groups, and his results were consistently bad. Not only were his estimates usually very different in magnitude, but his results were almost always the wrong sign! This paper, and its pessimistic conclusion, was influential in policy circles and led to a greater push for more experimental evaluations.⁸¹ We can see these results in the following tables from Lalonde [1986]. Figure 17 shows the effect of the treatment when comparing the treatment group to the experimental control group. The baseline difference in real earnings between the two groups was negligible,⁸² but the post-treatment difference in average earnings was between $798 and $886.

⁸¹ It's since been cited a little more than 1,700 times.
⁸² The treatment group made $39 more than the control group in the simple difference and $21 less in the multivariate regression model, but neither is statistically significant.

Figure 17: Lalonde [1986] Table 5(a)

Figure 18 shows the results he got when he used the non-experimental data as the comparison group. He used three samples of the PSID and three samples of the CPS. In nearly every point estimate, the effect is negative. So why is there such a stark difference when we move from the NSW control group to either the PSID or CPS? The reason is selection bias. That is,

E[Y⁰ | D = 1] ≠ E[Y⁰ | D = 0]

In other words, it's highly likely that the real earnings of NSW participants would have been much lower than the non-experimental control group's earnings.

Figure 18: Lalonde [1986] Table 5(b)

As you recall from our decomposition of the simple difference in means estimator, the second form of bias is selection bias, and if E[Y⁰ | D = 1] < E[Y⁰ | D = 0], this will bias the estimate of the ATE downward (e.g., estimates that show a negative effect).

But a violation of independence also implies that the balancing property doesn't hold. Table 24 shows the mean values of each covariate for the treatment and control groups, where the control is the 15,992 observations from the CPS. As you can see, the treatment group appears to be very different on average from the control group CPS sample along nearly every covariate listed. The NSW participants are more black, more hispanic, younger, less likely to be married, more likely to have no degree, have less schooling, are more likely to be unemployed in 1975, and have considerably lower earnings in 1975. In short, the two groups are not exchangeable on observables (and likely not exchangeable on unobservables either).

The first paper to re-evaluate Lalonde [1986] using propensity score methods was Dehejia and Wahba [1999].⁸³ Their interest was two-fold. One, to examine whether propensity score matching could be an improvement in estimating treatment effects using non-experimental data. And two, to show the diagnostic value of propensity score matching. The authors used the same non-experimental control group datasets from the CPS and PSID as Lalonde [1986].

⁸³ Lalonde [1986] did not review propensity score matching in this study. One possibility is that he wasn't too familiar with the method. Rosenbaum and Rubin [1983] was relatively new, after all, when LaLonde had begun his project, and it had not yet been incorporated into most economists' toolkit.




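The propensity-score step that Dehejia and Wahba [1999] apply, and that the Stata walkthrough below performs with logit and predict, can be sketched in Python. The data here are hypothetical stand-ins for the NSW covariates, not the NSW sample itself, and the Newton-Raphson loop is one standard way such a logit MLE is computed.

```python
import numpy as np

# Hypothetical data standing in for the NSW covariates.
rng = np.random.default_rng(2)
n = 20_000
X = np.column_stack([np.ones(n),                 # intercept
                     rng.normal(size=n),         # a continuous covariate
                     rng.binomial(1, 0.3, n)])   # a dummy covariate
beta_true = np.array([-0.5, 1.0, 0.8])
D = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

# Logit maximum likelihood by Newton-Raphson.
b = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ b))                 # current fitted probabilities
    grad = X.T @ (D - p)                         # score
    hess = (X * (p * (1 - p))[:, None]).T @ X    # information matrix
    b = b + np.linalg.solve(hess, grad)

# The propensity score is just the fitted probability for every unit,
# regardless of actual treatment status.
pscore = 1 / (1 + np.exp(-X @ b))
```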
. gen interaction1 = school*re74
. gen re74sq = re74^2
. gen re75sq = re75^2
. gen interaction2 = u74*hispanic

Now we are ready to estimate the propensity score. We will use a logit model to be consistent with Dehejia and Wahba [2002].

. logit treat age agesq agecube school schoolsq married nodegree black hispanic re74 re75 u74 u75 interaction1
. predict pscore

The predict command uses the estimated coefficients from our logit model and then estimates the conditional probability of treatment using:

Pr(D = 1 | X) = F(β0 + γ Treat + α X)

where F(·) = e^(·) / (1 + e^(·)) is the cumulative logistic distribution.

The propensity score uses the fitted values from the maximum likelihood regression to calculate each unit's conditional probability of treatment regardless of their actual treatment status. The propensity score is just the predicted conditional probability of treatment, or the fitted value, for each unit.⁸⁵ The definition of the propensity score is the selection probability conditional on the confounding variables: p(X) = Pr(D = 1 | X).

⁸⁵ It is advisable to use maximum likelihood when estimating the propensity score so that the fitted values are in the range [0, 1]. We could use a linear probability model, but linear probability models routinely create fitted values below 0 and above 1, which are not true probabilities since 0 ≤ p ≤ 1.

There are two identifying assumptions for propensity score methods. The first is CIA. That is, (Y⁰, Y¹) ⊥ D | X. The second is called the common support assumption. That is, 0 < Pr(D = 1 | X) < 1.⁸⁶ The conditional independence assumption simply means that the backdoor criterion is met in the data by conditioning on a vector X. Or, put another way, conditional on X, the assignment of units to the treatment is

⁸⁶ This simply means that for any probability, there must be units in both the treatment group and the control group.
as good as random.⁸⁷

⁸⁷ CIA is expressed in different ways according to the econometric/statistical tradition. Rosenbaum and Rubin [1983] called it ignorable treatment assignment, or unconfoundedness. Pearl calls it the backdoor criterion. Barnow et al. [1981] and Dale and Krueger [2002] call it selection on observables. In the traditional econometric pedagogy, as we discussed earlier, it's called the zero conditional mean assumption, as we see below:

Y⁰_i = α + β X_i + ε_i
Y¹_i = Y⁰_i + δ
Y_i = α + δ D_i + β X_i + ε_i

Conditional independence is the same as assuming ε_i ⊥ D_i | X_i. One last thing before we move on: CIA is not testable, because it requires potential outcomes, which we do not have. We only have observed outcomes according to the switching equation. CIA is an assumption, and it may or may not be a credible assumption depending on your application.

The second identifying assumption is called the common support assumption. It is required to calculate any particular kind of defined average treatment effect, and without it, you will just get some kind of weird weighted average treatment effect for only those regions that

do have common support. Common support requires that for each value of X, there is a positive probability of being both treated and untreated, or 0 < Pr(D_i = 1 | X_i) < 1. This implies that the probability of receiving treatment for every value of the vector X is strictly within the unit interval. Common support ensures there is sufficient overlap in the characteristics of treated and untreated units to find adequate matches. Unlike CIA, the common support requirement is testable, by simply plotting histograms or summarizing the data. Here we do that in two ways: by looking at the summary statistics and by looking at a histogram.

. su pscore if treat==1, detail

Table 25: Distribution of propensity score for treatment group.

Percentile   Value
 1%          .0022114
 5%          .0072913
10%          .0202463
25%          .0974065
50%          .1938186
75%          .3106517
90%          .4760989
95%          .5134488
99%          .5705917

. su pscore if treat==0, detail

Table 26: Distribution of propensity score for CPS control group.

Percentile   Value
 1%          6.79e-07
 5%          .0000116
10%          .0000205
25%          .0000681
50%          .0003544
75%          .0021622
90%          .0085296
95%          .0263618
99%          .2400503

The mean value of the propensity score for the treatment group is 0.22 and the mean for the CPS control group is 0.009. The 50th

percentile for the treatment group is 0.194, but the control group doesn't reach that high a number until almost the 99th percentile. Let's look at the distribution of the propensity score for the two groups using a histogram now.

. histogram pscore, by(treat) binrescale

Figure 23: Histogram of propensity score by treatment status (experimental treatment vs. CPS control; density against propensity score).

These two simple diagnostic tests show what is going to be a problem later when we use inverse probability weighting. The probability of treatment is spread out across the units in the treatment group, but there is a very large mass of nearly zero propensity scores in the CPS. How do we interpret this? What this means is that the characteristics of individuals in the treatment group are rare in the CPS sample. This is not surprising given the strong negative selection into treatment. These individuals are younger, less likely to be married, and more likely to be uneducated and a minority. The lesson is that if the two groups are significantly different on background characteristics, then the propensity scores will have grossly different distributions by treatment status. We will discuss this in greater detail later.

For now, let's look at the treatment parameter under both assumptions.

δ(x) = E[Y¹_i − Y⁰_i | X_i = x]
     = E[Y¹_i | X_i = x] − E[Y⁰_i | X_i = x]

The conditional independence assumption allows us to make the

following substitution:

E[Y¹_i | D_i = 1, X_i = x] = E[Y_i | D_i = 1, X_i = x]

and same for the other term. Common support means we can estimate both terms. Therefore, under both assumptions, the treatment effect is the average of δ(X_i) over the population, E[δ(X_i)].

From these assumptions we get the propensity score theorem, which states that if (Y¹, Y⁰) ⊥ D | X (CIA), then (Y¹, Y⁰) ⊥ D | p(X), where p(X) = Pr(D = 1 | X), the propensity score. This means that conditioning on the propensity score is sufficient to have independence between the treatment and the potential outcomes.

This is an extremely valuable theorem because stratifying on X tends to run into sparseness-related problems (i.e., empty cells) in finite samples for even a moderate number of covariates. But the propensity score is just a scalar. So stratifying across a probability is going to reduce that dimensionality problem.

The proof of the propensity score theorem is fairly straightforward, as it's just an application of the law of iterated expectations with nested conditioning.⁸⁸ If we can show that the probability an individual receives treatment conditional on potential outcomes and the propensity score is not a function of potential outcomes, then we will have proven that there is independence between the potential outcomes and the treatment conditional on X. Before diving into the proof, first recognize that

Pr(D = 1 | Y¹, Y⁰, p(X)) = E[D | Y¹, Y⁰, p(X)]

because

E[D | Y¹, Y⁰, p(X)] = 1 · Pr(D = 1 | Y¹, Y⁰, p(X)) + 0 · Pr(D = 0 | Y¹, Y⁰, p(X))

and the second term cancels out because it's multiplied by zero. The

⁸⁸ See Angrist and Pischke [2009], pp. 80-81.

formal proof is as follows:

Pr(D = 1 | Y¹, Y⁰, p(X)) = E[D | Y¹, Y⁰, p(X)]            (see previous description)
  = E[ E[D | Y¹, Y⁰, p(X), X] | Y¹, Y⁰, p(X) ]            (by LIE)
  = E[ E[D | Y¹, Y⁰, X] | Y¹, Y⁰, p(X) ]                  (given X, we know p(X))
  = E[ E[D | X] | Y¹, Y⁰, p(X) ]                          (by conditional independence)
  = E[ p(X) | Y¹, Y⁰, p(X) ]                              (propensity score definition)
  = p(X)

Using a similar argument, we obtain:

Pr(D = 1 | p(X)) = E[D | p(X)]                            (previous argument)
  = E[ E[D | X] | p(X) ]                                  (LIE)
  = E[ p(X) | p(X) ]                                      (definition)
  = p(X)

and Pr(D = 1 | Y¹, Y⁰, p(X)) = Pr(D = 1 | p(X)) by CIA.

Like the omitted variable bias formula for regression, the propensity score theorem says that you need only control for covariates that determine the likelihood a unit receives the treatment. But it also says something more than that. It technically says that the only covariate you need to condition on is the propensity score. All of the information from the X matrix has been collapsed into a single number: the propensity score.

A corollary of the propensity score theorem, therefore, states that given CIA, we can estimate average treatment effects by weighting appropriately the simple difference in means.⁸⁹ Because the propensity score is a function of X, we know

Pr(D = 1 | X, p(X)) = Pr(D = 1 | X)
                    = p(X)

Therefore, conditional on the propensity score, the probability that D = 1 does not depend on X any longer. That is, D and X are independent of one another conditional on the propensity score, or

D ⊥ X | p(X)

⁸⁹ This all works if we match on the propensity score and then calculate differences in means. Direct propensity score matching works in the same way as the covariate matching we discussed earlier (e.g., nearest neighbor matching), except that we match on the score instead of the covariates directly.
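A quick simulation illustrates the theorem (hypothetical data in which treatment depends on X only through p(X)). Unconditionally, treated units have much larger X on average; within a thin band of the propensity score, the difference essentially vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))        # true propensity score p(X)
d = rng.binomial(1, p)

# Unconditionally, X and D are strongly related ...
gap_raw = x[d == 1].mean() - x[d == 0].mean()

# ... but conditional on p(X) (here, a thin band around 0.61), the
# average X is nearly identical across treatment and control.
band = (p > 0.60) & (p < 0.62)
gap_band = x[(d == 1) & band].mean() - x[(d == 0) & band].mean()
```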

So from this we also obtain the balancing property of the propensity score:

Pr(X | D = 1, p(X)) = Pr(X | D = 0, p(X))

which states that conditional on the propensity score, the distribution of the covariates is the same for treatment as it is for control group units. See this in the following DAG:

X → p(X) → D,  X → Y,  D → Y

Notice that there exist two paths between X and D. There's the directed path X → p(X) → D, and there's the backdoor path X → Y ← D. The backdoor path is blocked by a collider, so there is no systematic correlation between X and D through it. But there is systematic correlation between X and D through the first, directed path. When we condition on p(X), the propensity score, notice that D and X become statistically independent. This implies that D ⊥ X | p(X), which implies

Pr(X | D = 1, p̂(X)) = Pr(X | D = 0, p̂(X))

This is something we can directly test, but note the implication: conditional on the propensity score, treatment and control should on average be the same with respect to X. In other words, the propensity score theorem implies balanced observable covariates.⁹⁰

⁹⁰ I will have now officially beaten the dead horse. But please understand: just because something is exchangeable on observables does not make it exchangeable on unobservables. The propensity score theorem does not imply balanced unobserved covariates. See Brooks and Ohsfeldt [2013].

Estimation using propensity score matching

Inverse probability weighting has become a common approach within the context of propensity score estimation. We have the following proposition related to weighting. If CIA holds, then

δ_ATE = E[Y¹ − Y⁰] = E[ Y · (D − p(X)) / (p(X)(1 − p(X))) ]

δ_ATT = E[Y¹ − Y⁰ | D = 1] = (1 / Pr(D = 1)) · E[ Y · (D − p(X)) / (1 − p(X)) ]
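The δ_ATE weighting formula can be checked on simulated data with a known treatment effect. A sketch with a hypothetical data-generating process (in practice, p(X) is estimated first, e.g. by logit):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                 # propensity score p(X)
d = rng.binomial(1, p)
y = 2.0 * d + x + rng.normal(size=n)     # true ATE = 2, confounded by x

naive = y[d == 1].mean() - y[d == 0].mean()       # biased by selection
ate_ipw = np.mean(y * (d - p) / (p * (1 - p)))    # the delta_ATE formula
```

The naive difference in means is biased upward because treated units have larger x, while the weighted average recovers the true effect of 2.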

The proof for this is:

E[ Y (D − p(X)) / (p(X)(1 − p(X))) | X ]
  = p(X) · E[ Y (1 − p(X)) / (p(X)(1 − p(X))) | X, D = 1 ]
    + (1 − p(X)) · E[ −Y p(X) / (p(X)(1 − p(X))) | X, D = 0 ]
  = E[Y | X, D = 1] − E[Y | X, D = 0]

and the results follow from integrating over P(X) and P(X | D = 1).

The sample versions of both ATE and ATT are suggested by a two-step estimator. Again, first estimate the propensity score. Second, use the estimated score to produce sample estimators:

δ̂_ATE = (1/N) Σ_{i=1}^{N} Y_i · (D_i − p̂(X_i)) / (p̂(X_i)(1 − p̂(X_i)))

δ̂_ATT = (1/N_T) Σ_{i=1}^{N} Y_i · (D_i − p̂(X_i)) / (1 − p̂(X_i))

Using our earlier discussion of steps, this is technically the second step. Let's see how to do this in Stata. I will move in steps because I want to illustrate the importance of trimming the data. First, we need to rescale the outcome variable, as the teffects command chokes on large values. So:

. gen re78_scaled = re78/10000
. cap n teffects ipw (re78_scaled) (treat age agesq agecube school schoolsq married nodegree black hispanic re74 re75 u74 u75 interaction1, logit), osample(overlap)
. keep if overlap==0
. drop overlap
. cap n teffects ipw (re78_scaled) (treat age agesq agecube school schoolsq married nodegree black hispanic re74 re75 u74 u75 interaction1, logit), osample(overlap)
. cap drop overlap

Notice the estimated ATE: −0.70. We have to multiply this by 10,000 since we originally scaled by 10,000, which gives −0.70 × 10,000 = −7,000. In words, inverse probability weighting methods found an ATE that was not only negative, but very negative. Why? What happened?

Recall what inverse probability weighting is doing. It is weighting treatment and control units according to p̂(X), which is causing the estimate to blow up for very small values of the propensity score. Thus, we will need to trim the data. Here we will do a very small trim to eliminate the mass of values at the far left tail. Crump et al.
[ 2009 ] develop a principled method for addressing a lack of overlap. A good rule of thumb, they note, is to keep only observations on the

interval [0.1, 0.9], but here I will drop the ones with propensity scores less than 0.05, and leave it to you to explore this in greater detail.

. drop if pscore <= 0.05

Now let us repeat the analysis and compare our answer both to what we found when we didn't trim and to the experimental ATE.

. cap n teffects ipw (re78_scaled) (treat age agesq agecube school schoolsq married nodegree black hispanic re74 re75 u74 u75 interaction1, logit), osample(overlap)

Here we find an ATE of $918, which is significant at p < 0.12. Better, but still not exactly correct and not very precise.

An alternative approach to inverse probability weighting is nearest neighbor matching, which can be done on both the propensity score and the covariates themselves. The standard matching strategy is nearest neighbor matching, where you pair each treatment unit i with one or more comparable control group units j, where comparability is measured in terms of distance to the nearest propensity score. This control outcome is then plugged into a matched sample, and then we simply calculate

δ̂_ATT = (1/N_T) Σ (Y_i − Y_{j(i)})

where Y_{j(i)} is the matched control group unit to i. We will focus on the ATT because of the problems with overlap that we discussed earlier. For this next part, rerun your do file up to the point where you estimated your inverse probability weighting models. We want to go back to our original data before we dropped the low propensity score units, as I want to illustrate how nearest neighbor matching works. Now type in the following command:

. teffects psmatch (re78) (treat age agesq agecube school schoolsq married nodegree black hispanic re74 re75 u74 u75 interaction1, logit), atet gen(pstub_cps) nn(3)

A few things to note. First, we are re-estimating the propensity score. Notice the command in the second set of parentheses: we are estimating that equation with logit. Second, this is the ATT, not the ATE.
The reason is that we have too many near-zero propensity scores in the data to find good matches for the treatment group. Finally, we are matching with three nearest neighbors. Nearest neighbor matching, in other words, will find the three nearest units in the control group, where "nearest" is measured as closest on the propensity score itself. We then average their actual outcomes and match that average outcome to each treatment unit. Once we have that, we subtract each unit's matched control from its treatment value, and then divide by N_T, the number of treatment units. When we do that in Stata, we get an ATT of $1,407.75 with p < 0.05. Thus it is both relatively precise and

closer in magnitude to what we find with the experiment itself.

Coarsened Exact Matching

There are two kinds of matching we've reviewed so far. There's exact matching, which matches a treated unit to all of the control units with the same covariate value. But sometimes this is impossible, and therefore there are matching discrepancies. For instance, say that we are matching on continuous age and continuous income. The probability we find another person with the exact same values of both is very small, if not zero. This leads to mismatching on the covariates, which introduces bias. The second kind of matching we've discussed is approximate matching methods, which specify a metric to find control units that are "close" to the treated unit. This requires a distance metric, such as Euclidean, Mahalanobis, or the propensity score. All of these can be implemented in Stata's teffects.

Iacus et al. [2012] introduced a kind of exact matching called coarsened exact matching. The idea is very simple. It's based on the notion that sometimes it's possible to do exact matching if we coarsen the data enough. Thus, if we coarsen the data, meaning we create categorical variables (e.g., 0-10 year olds, 11-20 year olds, etc.), then oftentimes we can find exact matches. Once we find those matches, we calculate weights based on where a person fits in some stratum, and those weights are used in a simple weighted regression.

First, we begin with covariates X and make a copy called X*. Next we coarsen X* according to user-defined cutpoints or CEM's automatic binning algorithm. For instance, schooling becomes less than high school, high school only, some college, college graduate, post-college. Then we create one stratum per unique observation of X* and place each observation in a stratum.
Assign these strata to the original, uncoarsened data X, and drop any observation whose stratum doesn't contain at least one treated and one control unit. You then add weights for stratum size and analyze without matching.

But there are tradeoffs. Larger bins mean more coarsening of the data, which results in fewer strata. Fewer strata result in more diverse observations within the same stratum and thus higher covariate imbalance. CEM prunes both treatment and control group units, which changes the parameter of interest, but so long as you're transparent about this and up front about it, readers are willing to give you the benefit of the doubt. Just know, though, that you are not estimating the ATE or the ATT when you start pruning (just as you aren't doing so when you trim propensity scores).

The key benefit of CEM is that it is part of a class of matching methods called monotonic imbalance bounding (MIB). MIB methods bound the maximum imbalance in some feature of the empirical

distributions by an ex ante decision by the user. In CEM, this ex ante choice is the coarsening decision. By choosing the coarsening beforehand, users can control the amount of imbalance in the matching solution. It's also much faster.

There are several ways of measuring imbalance, but here we focus on the L1(f, g) measure, which is

L1(f, g) = (1/2) Σ_{l1...lk} | f_{l1...lk} − g_{l1...lk} |

where f and g record the relative frequencies for the treatment and control group units over the strata. Perfect global balance is indicated by L1 = 0. Larger values indicate larger imbalance between the groups, with a maximum of L1 = 1. Hence the "imbalance bounding" between 0 and 1.

Now let's get to the fun part: estimation. Here's the command in Stata:

. ssc install cem
. cem age (10 20 30 40 60) agesq agecube school schoolsq married nodegree black hispanic re74 re75 u74 u75 interaction1, treatment(treat)
. reg re78 treat [iweight=cem_weights], robust

The estimated ATE is $2,771.06, which is much larger than our estimated experimental effect. But this ensured a high degree of balance on the covariates, as can be seen from the output of the cem command itself. As can be seen from Table 27, the values of L1 are close to zero in most cases. The largest it gets is 0.12, for age squared.

Conclusions

Matching methods are an important member of the causal inference arsenal. Propensity scores are an excellent tool to check the balance and overlap of covariates. It's an underappreciated diagnostic, and one that you might miss if you only ran regressions. There are extensions for more than two treatments, like multinomial models, but we don't cover those here.

The propensity score can make groups comparable, but only on the variables used to estimate the propensity score in the first place. There is no guarantee you are balancing on unobserved covariates. If you know that there are important, unobservable variables, you will need another tool.
Randomization, for instance, ensures that observable and unobservable variables are balanced.
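To make the $L_1$ measure defined earlier in this section concrete, here is a minimal Python sketch (the chapter's own examples use Stata's cem package; this is my own illustration on made-up data, not the NSW sample). It coarsens each covariate into bins, forms multivariate cells, and takes half the sum of absolute differences in the treated and control relative frequencies:

```python
import numpy as np
from collections import Counter

def l1_imbalance(X_treat, X_ctrl, bin_edges):
    """L1(f, g) = 0.5 * sum over cells |f - g|, where f and g are the
    relative frequencies of treated and control units across the
    coarsened (binned) multivariate cells."""
    def cells(X):
        # coarsen each covariate column, then form multivariate cells
        coarsened = np.column_stack(
            [np.digitize(X[:, j], bin_edges[j]) for j in range(X.shape[1])]
        )
        return Counter(map(tuple, coarsened))

    f, g = cells(X_treat), cells(X_ctrl)
    nf, ng = sum(f.values()), sum(g.values())
    return 0.5 * sum(abs(f[c] / nf - g[c] / ng) for c in set(f) | set(g))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
edges = [np.linspace(-3, 3, 7), np.linspace(-3, 3, 7)]
print(l1_imbalance(X, X, edges))            # identical groups: L1 = 0
print(l1_imbalance(X - 10, X + 10, edges))  # disjoint supports: L1 = 1
```

The two extreme cases reproduce the "imbalance bounding": identical samples give $L_1 = 0$, and samples with no overlapping cells give $L_1 = 1$.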

Table 27: Balance in covariates after coarsened exact matching. [Columns report, for each matched covariate (age, agesq, agecube, school, schoolsq, married, nodegree, black, hispanic, re74, re75, u74, u75, interaction1), the $L_1$ statistic and the mean, min, 25%, 50%, 75% and max differences between treated and control units. The $L_1$ values are near zero throughout; the largest, for age squared, is 0.12.]


Regression discontinuity

"Jump around! Jump around! Jump up, jump up and get down! Jump!" – House of Pain

Over the last twenty years, there has been significant interest in the regression-discontinuity design (RDD). Cook [2008] provides a fascinating history of the procedure, dating back to Thistlehwaite and Campbell [1960] and the multiple implementations of it by its originator, Donald Campbell, an educational psychologist. Cook [2008] documents the early years of the procedure involving Campbell and his students, but notes that by the mid-1970s, Campbell was virtually alone in his use of and interest in this design, despite several attempts to promote it. Eventually he moved on to other things. Campbell and his students made several attempts to bring the procedure into broader use, but despite the publicity, it was not widely adopted in either psychology or education.

The earliest appearance of RDD in economics is an unpublished paper [Goldberger, 1972]. But neither this paper, nor Campbell's work, got into the microeconomist's toolkit until the mid-to-late 1990s, when papers using RDD started to appear. Two of the first papers in economics to use a form of it were Angrist and Lavy [1999] and Black [1999]. Angrist and Lavy [1999], which we discuss in detail later, studied the effect of class size on pupil achievement using an unusual feature in Israeli public schools that mechanically created smaller classes when the number of students went over a particular threshold. Black [1999] used a kind of RDD approach when she creatively exploited discontinuities at the geographical level created by school district zoning to estimate people's willingness to pay for better schools. Both papers appear to be the first time since Goldberger [1972] that RDD showed back up in the economics literature.

But 1972 to 1999 is a long time without so much as a peep for what is now considered to be one of the most credible research

designs in all of causal inference, so what gives?91 Cook [2008] says that RDD was "waiting for life" during this time. The conditions in empirical microeconomics had to change before microeconomists realized its potential. Most likely, this was both due to the growing influence of the Rubin causal model among labor economists, as well as the increased availability of large administrative datasets, including their unusual quirks and features.

91. I should say, for the class of observational data designs. Many, though not all, applied economists and econometricians consider the randomized experiment the gold standard for causal inference.

In Thistlehwaite and Campbell [1960], the first publication using RDD, the authors studied the effect of merit awards on future academic outcomes. Merit awards were given out to students based on a score, and anyone with a score above some cutoff received the merit award, whereas everyone below that cutoff did not. In their application, the authors knew the mechanism by which the treatment was being assigned to each individual unit – treatment was assigned based on a cutoff in some continuous running variable. Knowing the treatment assignment allowed them to carefully estimate the causal effect of merit awards on future academic performance.

The reason that RDD was so appealing was because of underlying selection bias. They didn't believe they could simply compare the treatment group (merit award recipients) to the control group (merit award non-recipients), because the two groups were likely very different from one another – on observables, but even more importantly, on unobservables. To use the notation we've been using repeatedly, they did not believe

$$E[Y^0 \mid D = 1] - E[Y^0 \mid D = 0] = 0$$

It was very likely that the recipients were on average of higher overall ability, which directly affects future academic performance.
So their solution was to compare only certain students from the treatment and control groups who they thought were credibly equivalent – those students who had just high enough scores to get the award and those students with just low enough scores not to get the award. It's a simple idea, really. Consider the following DAG that illustrates what I mean.

[DAG: $X \rightarrow D \rightarrow Y$, where treatment $D$ is triggered when the running variable $X$ reaches the cutoff $c_0$.]

If there is some variable, $X$, that determines treatment, $D$, by triggering treatment at $c_0$, then isn't this just another form of selection on observables? If a unit receives treatment because some variable exceeds some threshold, then don't we fully know the treatment assignment? Under what conditions would a comparison of treatment and control group units, incorporating information from the cutoff, yield a credible estimate of the causal effect of treatment?

RDD is appropriate in any situation where a person's entry into the treatment group jumps in probability when some running variable, $X$, exceeds a particular threshold, $c_0$. Think about this for a moment: aren't jumps of any kind sort of unnatural? The tendency is for things to change gradually. Charles Darwin once wrote natura non facit saltum, or "nature does not make jumps." Jumps are so unusual that when we see them happen, they beg for some explanation. And in the case of RDD, that "something" is that treatment assignment is occurring based on some running variable, and when that running variable exceeds a particular cutoff value, $c_0$, that unit $i$ either gets placed in the treatment group, or that person is more likely to be placed in the treatment group. But either way, the probability of treatment jumps discontinuously at $c_0$.

That's the heart and soul of RDD. We use our knowledge about selection into treatment in order to estimate average treatment effects. More specifically, since we know the probability of treatment assignment changes discontinuously at $c_0$, we will compare people above and below $c_0$ to estimate a particular kind of average treatment effect called the local average treatment effect, or LATE for short [Imbens and Angrist, 1994]. To help make this method concrete, we'll first start out by looking carefully at one of the first papers in economics to use this method [Angrist and Lavy, 1999].
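The logic of comparing units just on either side of the cutoff can be illustrated with a small simulation (my own sketch in Python, not from the text): an unobserved ability term drives both the score and the outcome, so the naive treated-versus-control comparison is biased, while comparing units in a narrow window around $c_0$ recovers something close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
ability = rng.normal(size=n)             # unobserved confounder
score = ability + rng.normal(size=n)     # running variable (e.g., test score)
c0 = 0.0
D = (score >= c0).astype(float)          # sharp assignment at the cutoff
tau = 2.0                                # true treatment effect
y = tau * D + ability + rng.normal(size=n)

# Naive comparison: contaminated by selection bias, since
# higher-ability units are more likely to clear the cutoff
naive = y[D == 1].mean() - y[D == 0].mean()

# Local comparison in a narrow window around the cutoff
h = 0.1
window = np.abs(score - c0) < h
local = y[window & (D == 1)].mean() - y[window & (D == 0)].mean()

print(f"true effect: {tau}, naive: {naive:.2f}, local: {local:.2f}")
```

The naive difference in means overstates the effect considerably, while the local comparison lands near the true value of 2: exactly the Thistlehwaite and Campbell intuition.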
Maimonides' Rule and Class Size   Krueger [1999] was interested in estimating the causal effect of class size on student test scores using

the Tennessee randomized experiment STAR. The same year, another publication came out interested in the same question which used a natural experiment [Angrist and Lavy, 1999]. Both studies were interested in estimating the causal effect of class size on pupil achievement, but went about the question in very different ways.

One of the earliest references to class size occurs in the Babylonian Talmud, a holy Jewish text completed around the 6th century. One section of the Talmud discusses rules for the determination of class size and pupil-teacher ratios in bible studies. Maimonides was a 12th century Rabbinic scholar who interpreted the Talmud's discussion of class size in what is now known as Maimonides' Rule: "Twenty-five children may be put in charge of one teacher. If the number in the class exceeds twenty-five, but is not more than forty, two teachers must be appointed."

So what? What does a 12th century Rabbi's interpretation of a 6th century text have to do with causal inference? Because "since 1969, [Maimonides' Rule] has been used to determine the division of enrollment cohorts into classes in Israeli public schools" [Angrist and Lavy, 1999].

The problem with studying the effect of class size on pupil achievement is that class size is likely correlated with the unobserved determinants of pupil achievement too. As a result, any correlation we find is likely biased, and that bias may be large. It may even dominate most of the correlation we find in the first place, making the correlation practically worthless for policy purposes. Those unobservables might include poverty, affluence, enthusiasm/skepticism about the value of education, special needs of students for remedial or advanced instruction, obscure and barely intelligible obsessions of bureaucracies, and so on.
Each of these things both determines class size and clouds the effect of class size on pupil achievement because each is independently correlated with pupil achievement.92 However, if adherence to Maimonides' Rule is perfectly rigid, then what would separate a school with a single class of size 40 from the same school with two classes whose average size is 20.5?93 The only difference between them would be the enrollment of a single student. In other words, that one additional student is causing the splitting off of the classes into smaller class sizes. But the two classes should be basically equivalent otherwise. Maimonides' Rule, they argue, appears to be creating exogenous variation in class size.

It turns out Maimonides' Rule has the largest impact on a school with about 40 students in a grade cohort. With cohorts of size 40, 80, and 120 students, the steps down in average class size required by Maimonides' Rule when an additional student enrolls are from 40 to 20.5 (41/2), 40 to 27 (81/3) and 40 to 30.25 (121/4).94

92. Put another way, $(Y^1, Y^0) \perp D$ likely does not hold, because $D$ is correlated with the underlying potential outcomes.
93. $41/2 = 20.5$.
94. Schools also use the percent disadvantaged in a school to allocate supplementary hours of instruction and other school resources, which is why Angrist and Lavy [1999] control for it in their regressions.

Their pupil achievement data are test scores from a short-lived national testing program in Israeli elementary schools. Achievement tests were given in June 1991 and 1992, near the end of the school year, to measure math and reading skills. Average math and reading test scores were rescaled to be on a 100-point scale. The authors then linked this data on test scores with other administrative data on class size and school characteristics.95 The unit of observation in the linked data sets is the class and includes data on average test scores in each class, spring class size, beginning-of-year enrollment for each school and grade, a town identifier, a school-level index of student SES called "percent disadvantaged" and variables identifying the ethnic and religious composition of the school. Their study was limited to Jewish public schools, which account for the vast majority of school children in Israel.

95. This has become increasingly common as administrative data has become digitized and personal computers have become more powerful.

Figure 24: Angrist and Lavy [1999] descriptive statistics. [Reproduction of Table I: unweighted means, standard deviations and quantiles of class size, enrollment, percent disadvantaged, reading and math size, and average verbal and math scores for the full samples of 5th, 4th and 3rd graders.]

Figure 24 shows the mean, standard deviation and quantile values for seven variables for the 5th grade across 1,002 schools (from 1991). As can be seen, the mean class size is almost 30 students.

Figure 25: Angrist and Lavy [1999] descriptive statistics for the discontinuity sample. [Panel B of Table I: the same statistics for classes in schools with enrollments near the cutoffs.]

Angrist and Lavy [1999] present descriptive statistics for what

they call a discontinuity sample, which is a sample containing only schools with enrollments within 5 students of the cutoffs: 36–45; 76–85; and 116–125. Average class size is a bit larger in this discontinuity sample than in the overall sample but otherwise very similar to the full sample (see Figure 25).

Papers like these have to figure out how to model the underlying running variable that determines treatment, and in some cases that can be complicated. This is one of those cases. The authors attempt to capture the fact that Maimonides' Rule allows enrollment cohorts of 1–40 to be grouped in a single class, but enrollment cohorts of 41–80 are split into two classes of average size 20.5–40, enrollment cohorts of 81–120 are split into three classes of average size 27–40, and so on. Their class size equation is

$$f_{sc} = \frac{e_s}{\text{int}\left(\frac{e_s - 1}{40}\right) + 1}$$

where $e_s$ is the beginning-of-year enrollment in school $s$ in a given grade (e.g., 5th grade); $f_{sc}$ is class size assigned to class $c$ in school $s$ for that grade; and int($n$) is the largest integer less than or equal to $n$. They call this the class size function. Although the class size function is fixed within schools, in practice enrollment cohorts are not necessarily divided into classes of equal size. But, even though the actual relationship between class size and enrollment size involves many factors, in Israel it clearly has a lot to do with $f_{sc}$. The authors show this by laying $f_{sc}$ from Maimonides' Rule and actual class sizes on top of one another (Figure 26). Notice the very strong correlation between the two.

Figure 26: Maimonides' Rule vs. actual class size [Angrist and Lavy, 1999].

Before moving on, look at how great this graph is. The identification strategy told in one picture.
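The class size function itself is simple to compute. Here is a short Python sketch (my own illustration; the paper provides no code) that reproduces the sawtooth pattern in the figure:

```python
import numpy as np

def maimonides_class_size(enrollment):
    """Predicted average class size under Maimonides' Rule:
    f_sc = e_s / (int((e_s - 1) / 40) + 1)."""
    e = np.asarray(enrollment)
    return e / ((e - 1) // 40 + 1)

# The rule caps class size at 40, so predicted size climbs to 40 and
# then drops sharply when one more student forces an extra class:
print(maimonides_class_size([40, 41, 80, 81, 120, 121]))
# 40 students -> one class of 40; 41 -> two classes averaging 20.5;
# 81 -> three classes averaging 27; 121 -> four classes averaging 30.25
```

Each cutoff (41, 81, 121, ...) is a discontinuity in predicted class size, which is exactly the variation the RDD exploits.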
Angrist made some really great

graphs. Good graphs tell the story. It's worth your time trying to figure out a figure that really conveys your main results or your identification strategy. Readers would prefer it. Okay, back to business.

The class size function, $f_{sc}$, is a mechanical representation of Maimonides' Rule and is highly correlated with actual class size. But it's also highly correlated with average test scores of the fourth and fifth graders. The following picture plots average reading test scores by enrollment size in enrollment intervals of ten and average values of $f_{sc}$ for fifth graders (Figure 27). The figure shows that test scores are generally higher in schools with larger enrollments and larger predicted class sizes, but it also shows an up-and-down pattern in which average scores by enrollment size mirror the class-size function.

Figure 27: Average reading scores vs. enrollment size [Angrist and Lavy, 1999].

The overall positive correlation between test scores and enrollment is partly attributable to larger schools in Israel being geographically concentrated in larger, more affluent cities. Smaller schools are in poorer "developmental towns" outside the major urban centers. Angrist and Lavy [1999] note that the enrollment size and the percent disadvantaged index measuring the proportion of students from disadvantaged backgrounds are negatively correlated. They control for the "trend" association between test scores and enrollment size and plot the residuals from regressions of average scores and the average of $f_{sc}$ on average enrollment and the percent disadvantaged index for each interval. The estimates for fifth graders imply that a reduction in predicted class size of ten students is associated with a 2.2 point increase in average reading scores – a little more than one-quarter of a standard deviation in the distribution of class averages. See the following figures showing the correlation between score residuals and the class size function by enrollment.

Figure 28: Reading score residual and class size function by enrollment count [Angrist and Lavy, 1999].

Figure 29: Math score residual and class size function by enrollment count [Angrist and Lavy, 1999].

The visual evidence is strong that class size causes test scores to decrease.96 Next, Angrist and Lavy [1999] estimate regression models of the following form:

$$y_{isc} = X_s\beta + n_{sc}\delta + \mu_c + \eta_s + \varepsilon_{isc}$$

where $y_{isc}$ is pupil $i$'s score, $X_s$ is a vector of school characteristics, sometimes including functions of enrollment, and $n_{sc}$ is the size of class $c$ in school $s$. The $\mu_c$ is an identically and independently distributed class component, and the term $\eta_s$ is an identically and independently distributed school component. The class-size coefficient, $\delta$, is the primary parameter of interest.

96. As we will see, graphical evidence is very common in RDD.

This equation describes the average potential outcomes of students under alternative assignments of $n_{sc}$, controlling for any effects of $X_s$. If $n_{sc}$ was randomly assigned conditional on $X_s$, then $\delta$ would be the weighted average response to random variation in class size along the length of the individual causal response functions connecting class size and pupil scores. But $n_{sc}$ is not randomly assigned. Therefore in practice, it is likely correlated with potential outcomes – in this case, the error components in the equation. Estimates of this OLS model are contained in Figure 30. Though OLS may not have a causal interpretation, using RDD might. The authors go about estimating an RDD model in multiple

steps.

Figure 30: OLS regressions [Angrist and Lavy, 1999]. [Reproduction of Table II: OLS estimates for 1991 reading comprehension and math scores, 5th and 4th grades, with class size, percent disadvantaged and enrollment as regressors.]

In the first stage, they estimate the following model:

$$n_{sc} = X_s\pi_0 + f_{sc}\pi_1 + \xi_{sc}$$

where the $\pi_j$ are parameters and the error term, $\xi_{sc}$, is defined as the residual from the population regression of $n_{sc}$ onto $X_s$ and $f_{sc}$, and captures other things that are associated with enrollment. Results from these first stage regressions are presented in Figure 31.

In the second step, the authors calculate the fitted values from the first regression, $\widehat{n}_{sc}$, and then estimate the following regression model:

$$y_{sc} = X_s\beta + \widehat{n}_{sc}\delta + [\eta_s + \mu_c + \varepsilon_{sc}]$$

Results from this second stage regression are presented in Figure 32. Compare these second stage regressions to the OLS regressions from earlier (Figure 30). The second stage regressions are all negative and larger in magnitude.
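The two-step logic can be illustrated with a stylized Python simulation (my own sketch on made-up data; the authors' empirical work is not reproduced here). An omitted school-quality term biases OLS toward zero, while using the rule-predicted class size $f_{sc}$ in a first stage and regressing scores on the fitted values recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools = 5_000
enroll = rng.integers(20, 160, size=n_schools).astype(float)

# Predicted class size from Maimonides' Rule (the instrument f_sc)
f = enroll / ((enroll - 1) // 40 + 1)

# Actual class size follows the rule with noise; an unobserved school
# quality term raises both class size and scores, biasing OLS upward
quality = rng.normal(size=n_schools)
class_size = f + rng.normal(scale=2, size=n_schools) + 2 * quality
true_delta = -0.25
score = 70 + true_delta * class_size + 4 * quality + rng.normal(size=n_schools)

def ols(X, y):
    # least squares coefficients of y on X, intercept included
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# OLS of score on class size: biased toward zero by omitted quality
delta_ols = ols(class_size, score)[1]

# Step 1: fit class size on the instrument; Step 2: regress score
# on the fitted values from the first stage
first = ols(f, class_size)
fitted = first[0] + first[1] * f
delta_2sls = ols(fitted, score)[1]

print(f"true: {true_delta}, OLS: {delta_ols:.3f}, 2SLS: {delta_2sls:.3f}")
```

As in the paper, the two-step estimate is negative and larger in magnitude than OLS, because the instrument is unrelated to the omitted quality term. (Standard errors from this manual second stage would need the usual 2SLS correction; the sketch only illustrates the point estimate.)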
Pulling back for a moment, we can take these results and compare them to what Krueger [1999] found in the Tennessee STAR experiment. Krueger [1999] found effect sizes of around 0.13–0.2 standard deviations among pupils and about 0.32–0.66 standard deviations in the distribution of class means. Angrist and Lavy [1999] compare their results by calculating the effect size associated with reducing class size by eight pupils (same as STAR). They then multiply this number times their second step estimate for reading scores for fifth graders (−0.275), which gives them an effect size of around 2.2 points, or 0.29 standard deviations. Their estimates of effect size for fifth

Figure 31: First stage regressions [Angrist and Lavy, 1999]. [Reproduction of Table III: reduced-form estimates for 1991, full and discontinuity samples, with $f_{sc}$, percent disadvantaged and enrollment as regressors; $f_{sc}$ is equal to enrollment/[int((enrollment − 1)/40) + 1].]

Figure 32: Second stage regressions [Angrist and Lavy, 1999]. [Reproduction of Table IV: 2SLS estimates for 1991 fifth graders, reading comprehension and math, full and discontinuity samples, using $f_{sc}$ as an instrument for class size.]

graders are at the low end of the range that Krueger [1999] found in the Tennessee experiment.

Observational studies are often confounded by a failure to isolate a credible source of exogenous variation in school inputs, which leads some researchers to conclude that school inputs don't matter in pupil achievement. But RDD overcomes problems of confounding by exploiting exogenous variation created by administrative rules and, as with the STAR experiment, shows that smaller classes appear beneficial to student academic achievement.

Data requirements for RDD   RDD is all about finding "jumps" in the probability of treatment as we move along some running variable $X$. So where do we find these jumps? Where do we find these discontinuities? The answer is that humans often embed jumps into rules. Sometimes these embedded rules give us a design for a careful observational study.

The validity of an RDD doesn't require that the assignment rule be arbitrary. It only requires that it be known, precise and free of manipulation. The most effective RDD studies involve programs where $X$ has a "hair trigger" that is not tightly related to the outcome being studied. Examples include the probability of being arrested for DWI jumping when blood alcohol content exceeds 0.08 [Hansen, 2015]; the probability of receiving healthcare insurance jumping at age 65 [Card et al., 2008]; the probability of receiving medical attention jumping when birthweight falls below 1,500 grams [Almond et al., 2010]; and the probability of attending summer school when grades fall below some minimum level [Jacob and Lefgren, 2004].

In all these kinds of studies, we need data. But specifically, we need a lot of data around the discontinuities, which itself implies that the datasets useful for RDD are likely very large. In fact, large sample sizes are characteristic features of the RDD. This is also because in the face of strong trends, one typically needs a lot of data.
Researchers are typically using administrative data, or settings such as birth records, where there are many observations.

Definition There are two generally accepted kinds of RDD studies. There are designs where the probability of treatment goes from 0 to 1 at the cutoff – what is called a "sharp" design – and there are designs where the probability of treatment discontinuously increases at the cutoff. These are often called "fuzzy" designs. In both, though, there is some running variable X such that, upon reaching a cutoff c_0, the likelihood of being in the treatment group switches. van der Klaauw [2002] presents the following diagram showing the difference between the two designs:

Figure 33: Sharp vs. Fuzzy RDD [van der Klaauw, 2002]. (The figure plots the probability of treatment assignment against the selection variable S; the sharp design is the solid step function and the fuzzy design is the dashed S-shaped curve.)

Sharp RDD is where treatment is a deterministic function of the running variable X.97 An example might be Medicare enrollment, which happens sharply at age 65, apart from certain disability situations. Compared to the sharp design, a fuzzy RDD represents a discontinuous "jump" in the probability of treatment when X > c_0. In these fuzzy designs, the cutoff is used as an instrumental variable for treatment, as in Angrist and Lavy [1999], who instrument for class size with the class size function.

97 Figure 33 calls the running variable the "selection variable". This is because van der Klaauw [2002] is an early paper in the new literature, and the terminology hadn't yet been hammered out. But they are the same thing.
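The difference between the two designs can be seen in a few lines of code. This Python sketch (illustrative, with made-up numbers: a cutoff of 50 and a true effect of 10) assigns treatment sharply in one function and fuzzily in another, then forms the Wald-style ratio that fuzzy designs rely on: the jump in the outcome at the cutoff divided by the jump in the probability of treatment:

```python
import random

random.seed(1)
c0 = 50.0      # known cutoff (assumed for the simulation)
delta = 10.0   # true treatment effect (assumed for the simulation)

def sharp_d(x):
    # Sharp design: crossing the cutoff determines treatment exactly.
    return 1 if x >= c0 else 0

def fuzzy_d(x):
    # Fuzzy design: crossing the cutoff raises the probability of
    # treatment from 0.2 to 0.8 rather than from 0 to 1.
    p = 0.8 if x >= c0 else 0.2
    return 1 if random.random() < p else 0

# Draw observations in a narrow window around the cutoff. The
# conditional mean is kept flat inside the window for simplicity;
# in real applications you must also model the trend in X.
xs = [random.uniform(c0 - 2, c0 + 2) for _ in range(20000)]
ds = [fuzzy_d(x) for x in xs]
ys = [25.0 + delta * d + random.gauss(0, 1) for d in ds]

def mean(v):
    return sum(v) / len(v)

above = [i for i, x in enumerate(xs) if x >= c0]
below = [i for i, x in enumerate(xs) if x < c0]

# Wald-style ratio: jump in E[Y|X] divided by jump in E[D|X] at c0.
jump_y = mean([ys[i] for i in above]) - mean([ys[i] for i in below])
jump_d = mean([ds[i] for i in above]) - mean([ds[i] for i in below])
wald = jump_y / jump_d
print(round(wald, 1))  # close to the true effect of 10
```

In the sharp case the denominator is exactly 1, so the ratio collapses to the simple jump in the outcome; in the fuzzy case the denominator is the 0.6 jump in treatment probability, which is what instrumenting with the cutoff recovers.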
More formally, in a sharp RDD, treatment status is a deterministic and discontinuous function of a running variable X_i, where

D_i = 1 if X_i ≥ c_0
D_i = 0 if X_i < c_0

and c_0 is a known threshold or cutoff. In other words, if you know the value of X_i for unit i, then you know treatment assignment for unit i with certainty. For this reason, people ordinarily think of RDD as a selection on observables observational study. If we assume constant treatment effects, then in potential outcomes terms, we get

Y_i^0 = α + βX_i
Y_i^1 = Y_i^0 + δ

Using the switching equation, we get

Y_i = Y_i^0 + (Y_i^1 − Y_i^0)D_i
Y_i = α + βX_i + δD_i + ε_i

where the treatment effect parameter, δ, is the discontinuity in the conditional expectation function:

δ = lim_{X_i → X_0 from above} E[Y_i^1 | X_i = X_0] − lim_{X_i → X_0 from below} E[Y_i^0 | X_i = X_0]
  = lim_{X_i → X_0 from above} E[Y_i | X_i = X_0] − lim_{X_i → X_0 from below} E[Y_i | X_i = X_0]

The sharp RDD estimate is interpreted as an average causal effect of the treatment at the discontinuity, which is a kind of local average treatment effect (LATE):

δ_SRD = E[Y_i^1 − Y_i^0 | X_i = X_0]

Notice the role that extrapolation plays in estimating treatment effects with sharp RDD. If unit i is just below c_0, then D_i = 0. But if unit i is just above c_0, then D_i = 1. See Figure 34.

Figure 34: Dashed lines are extrapolations (outcome Y plotted against test score X).

The key identifying assumption in an RDD is called the continuity assumption. It states that

E[Y_i^0 | X = c_0] and E[Y_i^1 | X = c_0]

are continuous (smooth) in X at c_0. In words, this means that population average potential outcomes, Y^0 and Y^1, are continuous functions

of X at the cutoff, c_0. That is, the continuity assumption requires that the expected potential outcomes remain continuous through c_0. Absent the treatment, in other words, the expected potential outcomes wouldn't have jumped; they would've remained smooth functions of X. This implies that all other unobserved determinants of Y are continuously related to the running variable X. Such an assumption should remind you of omitted variable bias. Does there exist some omitted variable wherein the outcome would jump at c_0 even if we disregarded the treatment altogether? If so, then the continuity assumption is violated and our methods do not identify the LATE. Sometimes these abstract ideas become much easier to understand with data, so here is an example of what we mean using a simulation.

/// --- Examples using simulated data

. clear
. capture log close
. set obs 1000
. set seed 1234567

. * Generate running variable
. gen x = rnormal(50, 25)
. replace x = 0 if x < 0
. drop if x > 100
. sum x, det

. * Set the cutoff at X = 50. Treated if X > 50
. gen D = 0
. replace D = 1 if x > 50
. gen y1 = 25 + 0*D + 1.5*x + rnormal(0, 20)
. twoway (scatter y1 x if D==0, msize(vsmall) msymbol(circle_hollow)) ///
    (scatter y1 x if D==1, sort mcolor(blue) msize(vsmall) msymbol(circle_hollow)) ///
    (lfit y1 x if D==0, lcolor(red) msize(small) lwidth(medthin) lpattern(solid)) ///
    (lfit y1 x, lcolor(dknavy) msize(small) lwidth(medthin) lpattern(solid)), ///
    xtitle(Test score (X)) xline(50) legend(off)

Figure 35 shows the results from this simulation. Notice that the value of Y is changing continuously over X and through c_0. This is an example of the continuity assumption. It means absent the treatment

itself, the potential outcomes would've remained a smooth function of X. It is therefore only the treatment, triggered at c_0, that causes the jump. It is worth noting here, as we have in the past, that technically speaking the continuity assumption is not testable, because it is based on counterfactuals, as so many other identifying assumptions we've reviewed are.

Figure 35: Display of observations from simulation (potential outcome Y1 plotted against test score X).

Next we look at an example of discontinuities using simulated data (Figure 36).

. gen y = 25 + 40*D + 1.5*x + rnormal(0, 20)
. scatter y x if D==0, msize(vsmall) || scatter y x if D==1, msize(vsmall) legend(off) ///
    xline(50, lstyle(foreground)) || lfit y x if D==0, color(red) || ///
    lfit y x if D==1, color(red) ytitle("Outcome (Y)") xtitle("Test Score (X)")

Notice the jump at the discontinuity in the outcome.

Implementation It is common for authors to transform the running variable X by re-centering it at c_0:

Y_i = α + β(X_i − c_0) + δD_i + ε_i

This doesn't change the interpretation of the treatment effect – only the interpretation of the intercept. Let's use Card et al. [2008] as an example. Medicare is triggered when a person turns 65. So re-center the running variable (age) by subtracting 65:

Y = β_0 + β_1 (Age − 65) + β_2 Edu
  = β_0 + β_1 Age − β_1 65 + β_2 Edu
  = (β_0 − β_1 65) + β_1 Age + β_2 Edu
  = α + β_1 Age + β_2 Edu

where α = β_0 − β_1 65. All other coefficients, notice, have the same interpretation, except for the intercept.

Figure 36: Display of observations from the discontinuity simulation. The vertical distance between the fitted lines at the cutoff is the LATE.

Another practical question is nonlinearity. Because sometimes we are fitting local linear regressions around the cutoff, we will pick up a spurious effect due to the imposed linearity if the underlying data generating process is nonlinear. Here's an example, shown in Figure 37:

capture drop y
gen x2 = x^2
gen x3 = x^3
gen y = 25 + 0*D + 2*x + x2 + rnormal(0, 20)
scatter y x if D==0, msize(vsmall) || scatter y x if D==1, msize(vsmall) legend(off) ///
    xline(50, lstyle(foreground)) ytitle("Outcome (Y)") xtitle("Test Score (X)")

In this situation, we would need some way to model the nonlinearity below and above the cutoff to check whether, even given the

nonlinearity, there had been a jump in the outcome at the discontinuity.

Figure 37: Simulated nonlinear data from Stata (outcome Y plotted against test score X).

Suppose that the nonlinear relationship is

E[Y_i^0 | X_i] = f(X_i)

for some reasonably smooth function f(X_i). In that case, we'd fit the regression model:

Y_i = f(X_i) + δD_i + η_i

Since f(X_i) is counterfactual for values of X_i > c_0, how will we model the nonlinearity? There are two ways of approximating f(X_i). First, let f(X_i) equal a pth-order polynomial:

Y_i = α + β_1 x_i + β_2 x_i^2 + ··· + β_p x_i^p + δD_i + η_i

This approach, though, has recently been found to introduce bias [Gelman and Imbens, 2016]. Those authors recommend using local linear regressions with linear and quadratic forms only. Another way of approximating f(X_i) is to use a nonparametric kernel, which I will discuss later.

But let's stick with this example where we are using pth-order polynomials, just so you know the history of this method and understand better what is being done. We can generate this function, f(X_i), by allowing the x_i terms to differ on both sides of the cutoff by including them both individually and interacting them with D_i. In

that case, we have:

E[Y_i^0 | X_i] = α + β_01 X̃_i + ··· + β_0p X̃_i^p
E[Y_i^1 | X_i] = α + δ + β_11 X̃_i + ··· + β_1p X̃_i^p

where X̃_i is the re-centered running variable (i.e., X_i − c_0). Centering at c_0 ensures that the treatment effect at X_i = c_0 is the coefficient on D_i in a regression model with interaction terms. As Lee and Lemieux [2010] note, allowing different functions on both sides of the discontinuity should be the main results in an RDD paper.

To derive a regression model, first note that the observed values must be used in place of the potential outcomes:

E[Y | X] = E[Y^0 | X] + (E[Y^1 | X] − E[Y^0 | X]) D

Your regression model then is

Y_i = α + β_01 x̃_i + ··· + β_0p x̃_i^p + δD_i + β*_1 D_i x̃_i + ··· + β*_p D_i x̃_i^p + ε_i

where β*_1 = β_11 − β_01 and β*_p = β_1p − β_0p. The equation we looked at earlier was just a special case of the above equation with β*_1 = β*_p = 0. The treatment effect at c_0 is δ. And the treatment effect at X_i − c_0 = c > 0 is δ + β*_1 c + ··· + β*_p c^p.

. capture drop y x2 x3
. gen x2 = x*x
. gen x3 = x*x*x
. gen y = 10000 + 0*D - 100*x + x2 + rnormal(0, 1000)
. reg y D x x2 x3
. predict yhat
. scatter y x if D==0, msize(vsmall) || scatter y x if D==1, msize(vsmall) legend(off) ///
    xline(50, lstyle(foreground)) ylabel(none) || line yhat x if D==0, color(red) sort || ///
    line yhat x if D==1, sort color(red) xtitle("Test Score (X)") ytitle("Outcome (Y)")

But, as we mentioned earlier, Gelman and Imbens [2016] have recently discouraged the use of higher-order polynomials when estimating local linear regressions. The alternative is to use kernel regression. The nonparametric kernel method has problems because you are trying to estimate regressions at the cutoff point, which can
result in a boundary problem (see Figure 38 from Hahn et al. [2001]). While the true effect in this diagram is AB, with a certain bandwidth a rectangular kernel would estimate the effect as A′B′, which

is, as you can see, a biased estimator. There is systematic bias with the kernel method if the underlying nonlinear function, f(X), is upwards or downwards sloping.

Figure 38: Illustration of a boundary problem.

The standard solution to this problem is to run local linear nonparametric regression [Hahn et al., 2001]. In the case described above, this would substantially reduce the bias. So what is that? Think of kernel regression as a weighted regression restricted to a window (hence "local"). The kernel provides the weights to that regression. Stata's lpoly command estimates kernel-weighted local polynomial regression. A rectangular kernel would give the same result as taking E[Y] at a given bin on X. The triangular kernel gives more importance to the observations closest to the center. The model is some version of:

(â(c_0), b̂(c_0)) = argmin_{a,b} Σ_{i=1}^{n} (y_i − a − b(x_i − c_0))² K((x_i − c_0)/h) 1(x_i > c_0)

While estimating this in a given window of width h around the cutoff is straightforward, what's not straightforward is knowing how large or small to make the window.98 So this method is sensitive to the choice of bandwidth. Optimal bandwidth selection methods have become available [Imbens and Kalyanaraman, 2011].

98 You'll also see the window referred to as the bandwidth. They mean the same thing.

Card et al. [2008] Card et al. [2008] is an example of a sharp RDD, because it focuses on the provision of universal healthcare insurance for the elderly – Medicare at age 65.
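Before turning to that application, the triangular-kernel local linear regression just described can be sketched in a few lines of Python (an illustrative translation, not the book's Stata; the bandwidth h = 10 is made up, and the data generating process reuses the earlier simulated jump of 40 at a cutoff of 50). Weighted least squares is run separately on each side of the cutoff, and the two fitted intercepts at c_0 are compared:

```python
import random

random.seed(2)
c0, h = 50.0, 10.0   # cutoff and bandwidth (chosen for illustration)

def tri_kernel(u):
    # Triangular kernel: weight declines linearly to zero at the bandwidth.
    return max(0.0, 1.0 - abs(u))

def local_linear_at_cutoff(xs, ys, side):
    # Kernel-weighted least squares of y on (x - c0), one side of the
    # cutoff only (here the "above" side uses x >= c0).
    pts = [(x - c0, y, tri_kernel((x - c0) / h))
           for x, y in zip(xs, ys)
           if (x >= c0) == (side == "above")]
    pts = [p for p in pts if p[2] > 0]          # drop zero-weight points
    sw = sum(w for _, _, w in pts)
    xbar = sum(w * x for x, _, w in pts) / sw
    ybar = sum(w * y for _, y, w in pts) / sw
    sxx = sum(w * (x - xbar) ** 2 for x, _, w in pts)
    sxy = sum(w * (x - xbar) * (y - ybar) for x, y, w in pts)
    b = sxy / sxx
    return ybar - b * xbar                       # fitted value at x = c0

xs = [random.uniform(0, 100) for _ in range(20000)]
ys = [25 + 40 * (x >= c0) + 1.5 * x + random.gauss(0, 5) for x in xs]

effect = (local_linear_at_cutoff(xs, ys, "above")
          - local_linear_at_cutoff(xs, ys, "below"))
print(round(effect, 1))  # recovers a jump close to the simulated 40
```

Because the regression is local and linear on each side, the boundary bias of the simple rectangular-kernel average largely disappears, which is exactly the point made by Hahn et al. [2001].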
What makes this a policy-relevant question is that universal insurance has become highly salient because of the debates surrounding the Affordable Care Act, but also because of the sheer size of Medicare. In 2014, Medicare was 14% of the federal budget at $505 billion.

Approximately 20% of non-elderly adults in the US lacked insurance in 2005. Most were from lower-income families, and nearly half were African American or Hispanic. Many analysts have argued that unequal insurance coverage contributes to disparities in health care utilization and health outcomes across socioeconomic status. But even among the insured, there is heterogeneity in the form of different copays, deductibles and other features that affect use. Evidence that better insurance causes better health outcomes is limited, because health insurance suffers from deep selection bias. Both supply and demand for insurance depend on health status, confounding observational comparisons between people with different insurance characteristics.

The situation for the elderly looks very different, though. Less than 1% of the elderly population are uninsured. Most have fee-for-service Medicare coverage. And the transition to Medicare occurs sharply at age 65 – the threshold for Medicare eligibility.

The authors estimate a reduced form model measuring the causal effect of health insurance status on health care usage:

y_ija = X_ija α + f_j(a; β) + Σ_k C^k_ija δ_k + u_ija

where i indexes individuals, j indexes a socioeconomic group, a indexes age, u_ija is the unobserved error, y_ija is health care usage, X_ija is a set of covariates (e.g., gender and region), f_j(a; β) is a smooth function representing the age profile of outcome y for group j, and the C^k_ija (k = 1, 2, ..., K) are characteristics of the insurance coverage held by the individual, such as copayment rates. The problem with estimating this model, though, is that insurance coverage is endogenous: cov(u, C) ≠ 0. So for identification the authors use the age threshold for Medicare eligibility at 65, which they argue is credibly exogenous variation in insurance status. See Figure 39 for the correlation between age and insurance status.
Suppose health insurance coverage can be summarized by two dummy variables: C^1_ija (any coverage) and C^2_ija (generous insurance). Card et al. [2008] estimate the following linear probability models:

C^1_ija = X_ija β^1_j + g^1_j(a) + D_a π^1_j + v^1_ija
C^2_ija = X_ija β^2_j + g^2_j(a) + D_a π^2_j + v^2_ija

where β^1_j and β^2_j are group-specific coefficients, g^1_j(a) and g^2_j(a) are smooth age profiles for group j, and D_a is a dummy equal to one if the respondent is age 65 or over. Recall the reduced form model:

y_ija = X_ija α + f_j(a; β) + Σ_k C^k_ija δ_k + u_ija

Figure 39: Insurance status and age [Card et al., 2008]. (The figure plots coverage by age in quarters – any coverage and two or more policies – for all groups, high-education whites, and low-education minorities.)

Combining the equations and rewriting the reduced form model, we get:

y_ija = X_ija (α + β^1_j δ^1 + β^2_j δ^2) + h_j(a) + D_a π^y_j + v_ija

where h_j(a) = f_j(a) + δ^1 g^1_j(a) + δ^2 g^2_j(a) is the reduced form age profile for group j, π^y_j = δ^1 π^1_j + δ^2 π^2_j, and v_ija = u_ija + δ^1 v^1_ija + δ^2 v^2_ija is the error term.
Assuming that the profiles f_j(a), g^1_j(a), and g^2_j(a) are continuous at age 65 (i.e., the continuity assumption necessary for identification), then any discontinuity in y is due to insurance. The magnitudes will depend on the size of the insurance changes at age 65 (π^1_j and π^2_j) and on the associated causal effects (δ^1 and δ^2).

For some basic health care services, such as routine doctor visits, it may be that the only thing that matters is insurance. In those situations, the implied discontinuity in y at age 65 for group j will be proportional to the change in insurance status experienced by that group. For more expensive or elective services, the generosity of the coverage may also matter – for instance, if patients are unwilling to cover the required copay or if the managed care program won't cover the service.
This creates a potential identification problem in interpreting the discontinuity in y for any one group. Since π^y_j is a linear combination of the discontinuities in coverage and generosity, δ^1 and δ^2 can be estimated by a regression across groups:

π^y_j = δ^0 + δ^1 π^1_j + δ^2 π^2_j + e_j

where e_j is an error term reflecting a combination of the sampling errors in π^y_j, π^1_j and π^2_j.

Card et al. [2008] use a couple of different datasets – one a standard survey and the other administrative records from hospitals in three states. First, they use the 1992–2003 National Health Interview Survey (NHIS). The NHIS reports respondents' birth year, birth month, and calendar quarter of the interview. The authors used this to construct an estimate of age in quarters at the date of interview. A person who reaches 65 in the interview quarter is coded as age 65 and 0 quarters. Assuming a uniform distribution of interview dates, one-half of these people will be 0–6 weeks younger than 65 and one-half will be 0–6 weeks older. Analysis is limited to people between 55 and 75. The final sample has 160,821 observations.

The second dataset is hospital discharge records for California, Florida and New York. These records represent a complete census of discharges from all hospitals in the three states except for federally regulated institutions. The data files include information on age in months at the time of admission. Their sample selection criteria drop records for people admitted as transfers from other institutions and limit the sample to people between 60 and 70 years of age at admission. Sample sizes are 4,017,325 (California), 2,793,547 (Florida) and 3,121,721 (New York).

Some institutional details about the Medicare program may be helpful. Medicare is available to people who are at least 65 and have worked 40 quarters or more in covered employment, or who have a spouse who did. Coverage is available to younger people with severe kidney disease and to recipients of Social Security Disability Insurance. Eligible individuals can obtain Medicare hospital insurance (Part A) free of charge, and medical insurance (Part B) for a modest monthly premium.
Individuals receive notice of their impending eligibility for Medicare shortly before their 65th birthday and are informed that they have to enroll in it and choose whether to accept Part B coverage. Coverage begins on the first day of the month in which they turn 65.

There are five insurance-related variables: probability of Medicare coverage, any health insurance coverage, private coverage, two or more forms of coverage, and whether the individual's primary health insurance is managed care. Data are drawn from the 1999–2003 NHIS, and for each characteristic the authors show the incidence rate at ages 63–64 and the change at age 65, based on a version of the C^k equations that includes a quadratic in age, fully interacted with a post-65 dummy, as well as controls for gender, education, race/ethnicity, region and sample year. Alternative specifications were also used, such as a parametric model fit to a narrower age window (ages 63–67) and a local linear regression specification using a chosen bandwidth. Both show similar estimates of the change at age 65.

The authors present their findings in Table 1, which is reproduced

here as Figure 40. The way that you read this table is that the odd-numbered columns show the mean values for the comparison group (63–64 year olds) and the even-numbered columns show the average treatment effect for the population that complies with the treatment. We can see, not surprisingly, that the effect of receiving Medicare is to cause a very large increase in being on Medicare, as well as to reduce private and managed care coverage.

Figure 40: Card et al. [2008], Table 1 – insurance characteristics just before age 65 and estimated discontinuities at age 65, overall and by ethnicity and education (1999–2003 NHIS).
Formal identification in an RDD relating some outcome (insurance coverage) to a treatment (Medicare age-eligibility) that itself depends on some running variable, age, relies on the continuity assumptions that we discussed earlier. That is, we must assume that the conditional expectation functions for both potential outcomes are continuous at age 65. This means that both
This means that both E [ Thus, the onset of Medicare eligibility dramatically reduces disparities in insurance coverage. 65 . If that assumption is plausible, then continuous through age of Columns 5 and 6 present information on the prevalence of private insurance coverage (i.e., employer-provided or purchased coverage). Prior to age 65 private coverage rates range from 33 the average treatment effect at age 65 is identified as: percent for less educated minorities to 86 percent for better educated whites. The RD estimates in column 6 show that these differences are hardly affected by the onset of Medicare eligibility. 0 1 ] | ] a y [ E lim lim a | E [ y 65 a ! a 65 ects the fact that most people who hold private coverage before 65 transition fl This stability re to a combination of Medicare and supplemental coverage, either through an employer-provided 3 The continuity assumption requires that all other factors, observed plan or an individually purchased Medigap policy. Columns 7 and 8 of Table 1 analyze the age patterns of multiple coverage (i.e., reporting two or more policies). Prior to age 65, the rate and unobserved, that affect insurance coverage are trending smoothly at the cutoff, in other words. But what else changes at age 65 other 3 Across the six groups in rows 2–7 of Table 1, for example, the correlation between the private coverage rate at ages 63–64 shown in column 5 and the fraction of 65- 66-year-olds with private supplemental Medicare coverage is 0.97. than Medicare eligibility? Employment changes. Typically, 65 is the traditional age when people retire from the labor force. Any abrupt change in employment could lead to differences in health care utilization if non workers have more time to visit doctors.

The authors need, therefore, to investigate this possible confounder. They do this by testing for potential discontinuities at age 65 in confounding variables using a third dataset – the 1996–2004 March CPS. They ultimately find no evidence for discontinuities in employment at age 65 (Figure 41).

Figure 41: Investigating the CPS for discontinuities at age 65 [Card et al., 2008]. (The figure plots actual and predicted employment rates by age and demographic group.)

Next the authors investigate the impact that Medicare had on access to care and utilization using the NHIS data. Since 1997, the NHIS has asked four questions. They are:
"During the past 12 months has medical care been delayed for this person because of worry about the cost?"

"During the past 12 months was there any time when this person needed medical care but did not get it because (this person) could not afford it?"

"Did the individual have at least one doctor visit in the past year?"

"Did the individual have one or more overnight hospital stays in the past year?"

Estimates from this analysis are in Figure 42. Again, the odd-numbered columns are the baseline, and the even-numbered columns are the average treatment effect. Standard errors are in parentheses below the coefficient estimates in the even-numbered columns. There are a few encouraging findings from this table.
First, the share of the relevant population who delayed care the previous year fell 1.8 points, and similarly for the share who did not get care at all in the previous year. The share who saw a doctor went up slightly, as did the share who stayed at a hospital. These are not very large effects in magnitude, it is important to note, but they are relatively precisely estimated. Note that these effects differed considerably by race and ethnicity, as well as by education.

Figure 42: Investigating the NHIS for the impact of Medicare on care and utilization [Card et al., 2008]. (Reproduces the paper's Table 3, "Measures of Access to Care Just Before 65 and Estimated Discontinuities at 65": means at ages 63–64 and RD estimates at age 65 for delayed care and did-not-get-care (1997–2003 NHIS) and for doctor visits and hospital stays (1992–2003 NHIS), overall and by ethnicity and education. Standard errors, clustered by quarter of age, are in parentheses.)

Having shown modest effects on care and utilization, the authors turn to examining the kinds of care people received, focusing on specific changes in hospitalizations. Figure 43 shows the effect of Medicare on hip and knee replacements by race. The effects are largest for whites.

In conclusion, the authors find that universal healthcare coverage for the elderly increases care and utilization, as well as coverage. In a subsequent study [Card et al., 2009], the authors examined
the impact of Medicare on mortality and find slight decreases in mortality rates (see Figure 44). We will return to the question of healthcare coverage when we cover the Medicaid Oregon experiment in the instrumental variables chapter, but for now we stop.

Fuzzy RDD   In the sharp RDD, treatment was determined by X_i ≥ c_0. But that kind of deterministic assignment does not always happen. Sometimes there is a discontinuity, but it is not entirely deterministic, though it is nonetheless associated with a discontinuity in treatment assignment. When there is a discontinuous jump in the probability of treatment assignment, we have a fuzzy RDD. The formal definition

Figure 43: Changes in hospitalizations [Card et al., 2008]. (Reproduces the paper's Figure 3, "Hospital Admission Rates by Race/Ethnicity": total admissions by age, with separate actual and fitted series for hip and knee replacements among whites, blacks, and Hispanics.)

Figure 44: Mortality and Medicare [Card et al., 2009]. (Reproduces the paper's Table V, "Regression Discontinuity Estimates of Changes in Mortality Rates": estimated discontinuities at age 65 in death rates within 7 to 365 days of hospital admission, under quadratic, cubic, and local linear specifications, California hospital data 1992–2002.)

of a probabilistic treatment assignment is

$$\lim_{X \downarrow c_0} \Pr(D_i = 1 \mid X_i = X) \;\neq\; \lim_{X \uparrow c_0} \Pr(D_i = 1 \mid X_i = X)$$

In other words, the conditional probability of treatment becomes discontinuous as X approaches c_0 in the limit. A visualization of this, from Imbens and Lemieux [2008], is presented in Figure 45.

Figure 45: Imbens and Lemieux [2008], Figure 3. Horizontal axis is the running variable. Vertical axis is the conditional probability of treatment at each value of the running variable.

As you can see in this picture, the probability of treatment is increasing even before c_0, but units are not fully assigned to treatment above c_0. Rather, the fraction of units in the treatment jumps at c_0. This is what a fuzzy discontinuity looks like.

The identifying assumptions are the same under fuzzy designs as they are under sharp designs: they are the continuity assumptions. For identification, we must assume that the conditional expectations of the potential outcomes (e.g., E[Y^0 | X < c_0]) change smoothly through c_0. What changes at c_0 is the treatment assignment probability. An illustration of this identifying assumption is in Figure 46.

Figure 46: Potential and observed outcome regressions [Imbens and Lemieux, 2008]

Calculating the average treatment effect under a fuzzy RDD is very similar to how we calculate an average treatment effect with instrumental variables. Specifically, it's the ratio of a reduced form

difference in mean outcomes around the cutoff and a reduced form difference in mean treatment assignment around the cutoff:

$$\delta_{\text{Fuzzy RDD}} = \frac{\lim_{X \downarrow c_0} E[Y \mid X] - \lim_{X \uparrow c_0} E[Y \mid X]}{\lim_{X \downarrow c_0} E[D \mid X] - \lim_{X \uparrow c_0} E[D \mid X]}$$

This can be calculated with software in Stata, such as ivregress 2sls. The assumptions for identification are the same as those with instrumental variables: there are caveats about the complier vs. the defier populations, statistical tests (e.g., F tests on the first stage for weak instruments), etc.

One can use both T_i as well as the interaction terms as instruments for the treatment D_i. If one uses only T_i as an instrumental variable, then it is a "just identified" model, which usually has good finite sample properties. In the just identified case, the first stage would be:

$$D_i = \gamma_0 + \gamma_1 X_i + \gamma_2 X_i^2 + \dots + \gamma_p X_i^p + \pi T_i + \zeta_{1i}$$

where $\pi$ is the causal effect of T_i on the conditional probability of treatment. The fuzzy RDD reduced form is:

$$Y_i = \mu + \kappa_1 X_i + \kappa_2 X_i^2 + \dots + \kappa_p X_i^p + \rho \pi T_i + \zeta_{2i}$$

As in the sharp RDD case, one can allow the smooth function to be different on both sides of the discontinuity. The second stage model with interaction terms would be the same as before:

$$Y_i = \alpha + \beta_{01} \tilde{x}_i + \beta_{02} \tilde{x}_i^2 + \dots + \beta_{0p} \tilde{x}_i^p + \rho D_i + \beta_1^{*} D_i \tilde{x}_i + \beta_2^{*} D_i \tilde{x}_i^2 + \dots + \beta_p^{*} D_i \tilde{x}_i^p + \eta_i$$

where the $\tilde{x}_i$ are now not only normalized with respect to c_0 but are also fitted values obtained from the first stage regressions. Again, one can use both T_i as well as the interaction terms as instruments for D_i. If we only used T_i, the estimated first stage would be:

$$D_i = \gamma_{00} + \gamma_{01} \tilde{x}_i + \gamma_{02} \tilde{x}_i^2 + \dots + \gamma_{0p} \tilde{x}_i^p + \pi T_i + \gamma_1^{*} T_i \tilde{x}_i + \gamma_2^{*} T_i \tilde{x}_i^2 + \dots + \gamma_p^{*} T_i \tilde{x}_i^p + \zeta_{1i}$$

We would also construct analogous first stages for $D_i \tilde{x}_i, \dots, D_i \tilde{x}_i^p$. As Hahn et al. [2001] point out, one needs the same assumptions for identification as one needs with IV.
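The chapter's replications use Stata, but the ratio estimator above is easy to sketch by hand. The following Python simulation is purely illustrative – the data-generating process, cutoff, bandwidth, and variable names are all invented for the example. It estimates the fuzzy RDD effect as the reduced-form jump in the outcome divided by the first-stage jump in treatment, using local linear fits on either side of an assumed cutoff c_0 = 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1, 1, n)            # running variable; assumed cutoff c0 = 0

# Fuzzy assignment: Pr(D = 1 | X) jumps from 0.2 to 0.8 at the cutoff
d = rng.binomial(1, np.where(x >= 0, 0.8, 0.2))

# Outcome: smooth in x, with an assumed true treatment effect of 2.0
y = 1.0 + 0.5 * x + 2.0 * d + rng.normal(0, 1, n)

def jump(v, x, h=0.5):
    """Difference in local linear intercepts just right vs. just left of c0 = 0."""
    def fit_intercept(mask):
        slope, intercept = np.polyfit(x[mask], v[mask], 1)
        return intercept
    window = np.abs(x) < h
    return fit_intercept(window & (x >= 0)) - fit_intercept(window & (x < 0))

# Fuzzy RDD estimate: reduced-form jump divided by first-stage jump
delta = jump(y, x) / jump(d, x)
```

Essentially the same number would come out of a 2SLS regression of y on d instrumented by the above-cutoff indicator, controlling for the running variable on each side, which is what the Stata ivregress 2sls implementation does.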
As with other binary instrumental variables, the fuzzy RDD is estimating the local average treatment effect (LATE) [Imbens and Angrist, 1994], which is the average treatment effect for the compliers. In RDD, the compliers are those whose treatment status changed as we moved the value of x_i from just to the left of c_0 to just to the right of c_0.

Challenges to identification   The identifying assumption for RDD to estimate a causal effect is the continuity assumption. That is, the expected potential outcomes change smoothly as a function of the running variable through the cutoff. In words, this means that nothing that determines the potential outcomes changes abruptly at c_0 except for the treatment assignment. But this can be violated in practice if:

1. the assignment rule is known in advance,
2. agents are interested in adjusting, and
3. agents have time to adjust.

Examples include re-taking an exam, self-reported income, etc. It is also possible that some other unobservable characteristic changes at the threshold and has a direct effect on the outcome. In other words, the cutoff is endogenous. An example would be age thresholds used for policy, such as when a person turns 18 and faces more severe penalties for crime. This age threshold is correlated with the treatment (i.e., higher penalties for crime), but it is also correlated with variables that affect the outcomes, such as graduating from high school, voting rights, etc.

Because of these challenges to identification, a lot of work by econometricians and applied microeconomists has gone into trying to find solutions to these problems. The most influential is a density test by Justin McCrary, now called the McCrary density test [McCrary, 2008]. The McCrary density test is used to check whether units are sorting on the running variable. Imagine that there were two rooms – room A will receive some treatment, and room B will receive nothing. There are natural incentives for the people in room B to get into room A (in those situations, anyway, where the treatment is desirable to the units). But, importantly, if they were successful, then the two rooms would look different: room A would have more observations than room B – evidence of manipulation. Manipulation on the sorting variable always has that flavor.
Assuming a continuous distribution of units, manipulation would mean that more units are showing up just on the desirable side of the cutoff. Formally, suppose there is a desirable treatment D and an assignment rule X ≥ c_0. If individuals sort into D by choosing X ≥ c_0, then we say individuals are sorting on the running variable.

The kind of test needed to investigate whether manipulation is occurring is a test that checks whether there is bunching of units at the cutoff. In other words, we need a density test. McCrary [2008] suggests a formal test: under the null, the density should be continuous at the cutoff point; under the alternative hypothesis, the density should increase at the kink. Mechanically, partition

the assignment variable into bins and calculate frequencies (i.e., the number of observations) in each bin. Then treat the frequency counts as the dependent variable in a local linear regression. If you can estimate the conditional expectations, you have the data on the running variable, so in principle you can always do a density test. You can download the (no longer supported) Stata ado package DCdensity, or you can install the rddensity package for R as well.

For RDD to be useful, you already need to know something about the mechanism generating the assignment variable and how susceptible it could be to manipulation. Note the rationality of economic actors that this test is built on. A discontinuity in the density is considered suspicious and suggestive of manipulation around the cutoff. This is a high-powered test: you need a lot of observations at c_0 to distinguish a discontinuity in the density from noise. McCrary [2008] presents a helpful picture of a situation with and without manipulation in Figure 47.

Figure 47: Panel C is the density of income when there is no pre-announcement and no manipulation. Panel D is the density of income when there is pre-announcement and manipulation. From McCrary [2008].
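Those mechanics – bin the running variable, then treat the bin counts as the dependent variable in regressions fit separately on each side – can be sketched as follows. This is an illustrative simplification in Python (ordinary least squares on the bin counts rather than McCrary's local linear estimator with its bandwidth selection), with simulated data in which some units just below the cutoff have sorted to just above it; all numbers and names are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n, c0, bw = 100_000, 0.0, 0.02
x = rng.uniform(-1, 1, n)                 # running variable

# Manipulation: a fifth of the units just below the cutoff relocate just above it
movers = (x > c0 - 0.05) & (x < c0) & (rng.random(n) < 0.2)
x[movers] = rng.uniform(c0, c0 + 0.05, movers.sum())

# Frequency counts in bins aligned so that no bin straddles the cutoff
edges = np.arange(-1, 1 + bw, bw)
counts, _ = np.histogram(x, bins=edges)
mids = (edges[:-1] + edges[1:]) / 2

def density_at_cutoff(side):
    """Fit counts on bin midpoints within a window; extrapolate to the cutoff."""
    m = side & (np.abs(mids - c0) < 0.25)
    slope, intercept = np.polyfit(mids[m], counts[m], 1)
    return intercept + slope * c0

# Estimated discontinuity in the density; positive means bunching above c0
density_jump = density_at_cutoff(mids >= c0) - density_at_cutoff(mids < c0)
```

Without the relocated mass, density_jump would be statistically indistinguishable from zero; here it is large and positive. In practice one would simply run rddensity (R) or DCdensity (Stata) rather than hand-rolling this.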
There are also helpful visualizations of manipulation from other contexts, such as marathon running. Allen et al. [2013] show a picture of the kinds of density jumps that occur in finishing times. The reason for these finishing-time jumps is that many marathon runners have target times that they're shooting for. These are usually 30-minute intervals, but they also include unique race qualification times (e.g., Boston qualifying times). The panel on the top shows a histogram of finishing times, with black lines marking jumps in the number of observations. Density tests are provided on the bottom.
Testing for validity   It has become common in this literature to provide evidence for the credibility of the underlying identifying assumptions. While the assumptions cannot be directly tested, indirect evidence may be persuasive. We have already mentioned one such test – the McCrary density test. A second test is a covariate balance test. For RDD to be valid in your study, there must not be an observable discontinuous change in the average values of the covariates around

183 regression discontinuity 183 , 378 , 546) =9 n Figure 2: Distribution of marathon finishing times ( NOTE: The dark bars highlight the density in the minute bin just prior to each 30 minute threshold. z -statistic Figure 3: Running McCrary 150 100 12 50 McCrary z 0 -50 2:30 3:00 3:30 4:00 4:30 5:00 5:30 6:00 6:30 7:00 threshold Non-categorical 10 minute threshold 30 minute threshold NOTE: The McCrary test is run at each minute threshold from 2:40 to 7:00 to test whether there is a significant discontinuity in the density function at that threshold. 14

184 184 causal inference the mixtape : the cutoff. As these are pretreatment characteristics, they should be invariant to change in treatment assignment. An example of this is 102 102 from where they evaluated the impact of Democratic voteshare, David S. Lee. Randomized experi- ments from non-random selection in u.s. ). 48 %, on various demographic factors (Figure 50 just at Journal of Econometrics , house elections. 835 DO VOTERS AFFECT OR ELECT POLICIES? 2008 : – 675 , 142 697 : Panels refer to (top left to 48 Figure bottom right) district characteristics: real income, percent high school degree, percent black, and percent eligible to vote. Circles represent the average characteristic within intervals of 0 . 01 in Democratic vote share. The continuous line represents the predicted values from a fourth-order polynomial in vote share fitted separately for points above Downloaded from and below the 50 percent threshold. The dotted line represents the 95 percent confidence interval. F III IGURE Similarity of Constituents Characteristics in Bare Democrat and Republican ’ – Part 1 Districts Panels refer to (from top left to bottom right) the following district character- istics: real income, percentage with high-school degree, percentage black, percent- at Baylor University on April 22, 2014 test. That placebo This test is basically what is sometimes called a age eligible to vote. Circles represent the average characteristic within intervals of 0.01 in Democrat vote share. The continuous line represents the predicted is, you are looking for there to be no effects where there shouldn’t be tted separately for points values from a fourth-order polynomial in vote share fi above and below the 50 percent threshold. The dotted line represents the 95 any. So a third kind of test is an extension of that – just as there percent con fi dence interval. 
shouldn’t be effects at the cutoff on pretreatment values, there shouldn’t be effects on the outcome of interest at arbitrarily cho- 2008 ] suggest to look at one side Imbens and Lemieux [ sen cutoffs. cient reported in column (6) is the predicted fi share. The coef rms that, for many ob- fi difference at 50 percent. The table con of the discontinuity, take the median value of the running variable in 0 servable characteristics, there is no signi cant difference in a fi that section, and pretend it was a discontinuity, . Then test whether c 0 close neighborhood of 50 percent. One important exception is the 0 want to find there is a discontinuity in the outcome at not . You do c 0 percentage black, for which the magnitude of the discontinuity is 23 anything. statistically signi fi cant. fi cients in Table I from As a consequence, estimates of the coef regressions that include these covariates would be expected to Data visualization RDD papers are intensive data visualization — as in a randomized experiment produce similar results — since studies. You typically see a lot of pictures. The following are modal. First, a graph showing the outcome variable by running variable is 23. This is due to few outliers in the outer part of the vote share range. When the polynomial is estimated including only districts with vote share between 25 standard. You should construct bins and average the outcome within cant. The gap for percent fi cients becomes insigni fi percent and 75 percent, the coef urban and open seats, while not statistically signi fi cant at the 5 percent level, is bins on both sides of the cutoff. You should also look at different signi fi cant at the 10 percent level. bin sizes when constructing these graphs [ , 2010 ]. Lee and Lemieux Plot the running variables X on the horizontal axis and the average i for Y for each bin on the vertical axis. Inspect whether there is a i c . 
discontinuity at c_0. Also inspect whether there are other unexpected discontinuities at other points on X_i. An example of what you want to see is in Figure 49.
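Both checks – the binned outcome means you would plot, and a placebo cutoff placed at the median of one side – can be sketched on simulated data. Everything here (the data-generating process, the jump of 1.5, the bin width) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.uniform(-1, 1, n)                            # running variable
y = 0.5 * x + 1.5 * (x >= 0) + rng.normal(0, 1, n)   # true jump of 1.5 at c0 = 0

def jump_in_bin_means(x, y, c0, bw=0.05):
    """Mean of y in the bin just right of c0 minus the bin just left of it."""
    right = (x >= c0) & (x < c0 + bw)
    left = (x >= c0 - bw) & (x < c0)
    return y[right].mean() - y[left].mean()

real_jump = jump_in_bin_means(x, y, c0=0.0)            # should be near 1.5
placebo_c0 = np.median(x[x >= 0])                      # fake cutoff on one side
placebo_jump = jump_in_bin_means(x, y, c0=placebo_c0)  # should be near zero
```

For the actual figure, you would compute the mean of y within every bin across the whole support, plot bin midpoints against bin means, and repeat for a few bin widths, per Lee and Lemieux [2010].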

Figure 49: Example of outcome plotted against the running variable. From Lee and Lemieux (2010), based on Lee (2008).

If it's a fuzzy design, then you want to see a graph showing that the probability of treatment jumps at c_0. This tells you whether you have a first stage. You also want to see evidence from your placebo tests. As I said earlier, there should be no jump in any covariate at c_0, so readers should be shown this lack of an effect visually, as well as in a regression.

Figure 50: Example of covariate plotted against the running variable. From Lee and Lemieux (2010), based on Lee (2008).

Another graph that is absolutely mandatory is the McCrary density test. The reader must be shown that there is no sign of manipulation. One can either use a canned routine to do this, such as rddensity or DCdensity, or do it oneself. If one does it oneself, the method is to plot the number of observations in bins. This plot allows us to investigate whether there is a discontinuity in the

distribution of the running variable at the threshold. If there is, this suggests that people are manipulating the running variable around the threshold. This is an indirect test of the identifying assumption that each individual has imprecise control over the assignment variable.

An example of a dataset where manipulation seems likely is the National Health Interview Survey, where respondents were asked about participation in the Supplemental Nutrition Assistance Program (SNAP). I merged imputed income data into the main survey data. As SNAP eligibility is based in part on gross monthly income and family size, I created a running variable based on these two variables. Individuals whose income fell below the given monthly income level appropriate for their family size were then eligible for SNAP. If there was manipulation, meaning some people misreported their income in order to become eligible for SNAP, we would expect the number of people with income just below that threshold to jump. I estimated a McCrary density test to evaluate whether there was evidence for that. I present that evidence in Figure 51.

Figure 51: McCrary density test, NHIS data, SNAP eligibility against a running variable based on income and family size.

That, in fact, is exactly what I find. And statistical tests on this difference are significant at the 1% level, suggesting there is evidence for manipulation.

Example: Elect or Affect [Lee et al., 2004]   To illustrate how to implement RDD in practice, we will replicate the Lee et al. [2004] paper. First install the data. It's large, so it will take a moment to get fully

downloaded.

. scuse lmb-data

The big question motivating this paper has to do with whether and in what way voters affect policy. There are two fundamentally different views of the role of elections in a representative democracy. They are:

1. Convergence: Heterogeneous voter ideology forces each candidate to moderate his or her position (e.g., similar to the median voter theorem). "Competition for votes can force even the most partisan Republicans and Democrats to moderate their policy choices. In the extreme case, competition may be so strong that it leads to 'full policy convergence': opposing parties are forced to adopt identical policies." [Lee et al., 2004]

2. Divergence: When partisan politicians cannot credibly commit to certain policies, then convergence is undermined. The result can be full policy divergence. Divergence is when the winning candidate, after taking office, simply pursues his most-preferred policy. In this case, voters fail to compel candidates to reach any kind of policy compromise.

The authors present a model, which I've simplified. Let R and D be candidates in a Congressional race. The policy space is a single dimension where D's and R's policy preferences in a period are quadratic loss functions, u(l) and v(l), and l is the policy variable. Each player has some bliss point, which is their most preferred location along the unidimensional policy range. For Democrats, it's l* = c (c > 0), and for Republicans it's l* = 0.

Here's what this means. Ex ante, voters expect the candidate to choose some policy, and they expect the Democrat to win with probability P(x^e, y^e), where x^e and y^e are the policies voters expect the Democrat and Republican, respectively, to choose. When x^e, y^e > 0, then ∂P/∂x^e < 0 and ∂P/∂y^e < 0: moving either expected platform toward the Democrat's bliss point lowers the Democrat's chances of winning.
The solution to this game has multiple Nash equilibria, which I discuss now. 1 . Partial/Complete Convergence: Voters affect policies. ⇤ ∂ x • The key result under this equilibrium is > 0. ⇤ ∂ P

   • Interpretation: if we dropped more Democrats into the district from a helicopter, it would exogenously increase P*, and this would result in candidates changing their policy positions, i.e., ∂x*/∂P* > 0.

2. Complete divergence: Voters elect politicians with fixed policies who do whatever they want to do. (The "honey badger" don't care. It takes what it wants.)

   • The key result is that more popularity has no effect on policies. That is, ∂x*/∂P* = 0.

   • An exogenous shock to P* (i.e., dropping Democrats into the district) does nothing to equilibrium policies. Voters elect politicians who then do whatever they want because of their fixed policy preferences.

Potential roll-call voting record outcomes of the representative following some election are

$$RC_t = D_t x_t + (1 - D_t) y_t$$

where D_t indicates whether a Democrat won the election. That is, only the winning candidate's policy is observed. This expression can be transformed into regression equations:

$$RC_t = \alpha_0 + \pi_0 P_t^{*} + \pi_1 D_t + \varepsilon_t$$
$$RC_{t+1} = \beta_0 + \pi_0 P_{t+1}^{*} + \pi_1 D_{t+1} + \varepsilon_{t+1}$$

where $\alpha_0$ and $\beta_0$ are constants. This equation can't be directly estimated because we never observe P*. But suppose we could randomize D_t. Then D_t would be independent of P*_t and $\varepsilon_t$. Taking conditional expectations with respect to D_t, we get:

$$E[RC_{t+1} \mid D_t = 1] - E[RC_{t+1} \mid D_t = 0] = \pi_0 \big[P^{*D}_{t+1} - P^{*R}_{t+1}\big] + \pi_1 \big[P^{D}_{t+1} - P^{R}_{t+1}\big] = \gamma \qquad (80)$$

$$E[RC_t \mid D_t = 1] - E[RC_t \mid D_t = 0] = \pi_1 \qquad (81)$$

$$E[D_{t+1} \mid D_t = 1] - E[D_{t+1} \mid D_t = 0] = P^{D}_{t+1} - P^{R}_{t+1} \qquad (82)$$

The left-hand side of each equation is observable, and $\gamma$ in (80) is the total effect of the initial win on future roll call votes. The "elect" component is $\pi_1 [P^{D}_{t+1} - P^{R}_{t+1}]$, and $\pi_1$ is estimated as the difference in mean voting records between the parties at time t. The fraction of districts won by Democrats in t+1 is an estimate of

$P^D_{t+1} - P^R_{t+1}$ (equation 82). Because we can estimate the total effect, $\gamma$, of a Democrat victory in $t$ on $RC_{t+1}$, we can net out the elect component to implicitly get the "affect" component.

But random assignment of $D_t$ is crucial, for without it, $\pi_1$ and this equation would reflect selection (i.e., Democratic districts having more liberal bliss points). So the authors aim to randomize $D_t$ using an RDD, which I'll now discuss in detail.

There are two main datasets in this project. The first is a measure of how liberal an official voted, collected from the Americans for Democratic Action (ADA) and linked with House of Representatives election results for 1946-1995. The authors use the ADA score for all US House Representatives from 1946 to 1995 as their voting record index. For each Congress, the ADA chose about 25 high-profile roll-call votes and created an index varying from 0 to 100 for each Representative. Higher scores correspond to a more "liberal" voting record.

The running variable in this study is the vote share: the share of all votes that went to a Democrat. ADA scores are then linked to election returns data during that period.

Recall that we need randomization of $D_t$. The authors have a clever solution. They will use arguably exogenous variation in Democratic wins to check whether convergence or divergence is correct. Their exogenous shock comes from the discontinuity in the running variable. At a vote share of just above 0.5, the Democratic candidate wins. They argue that just around that cutoff, random chance determined the Democratic win, hence the random assignment of $D_t$.

Figure 52: Lee, Moretti and Butler [2004]'s Table I. Main results, based on ADA scores in the close-elections sample.

Column  Variable    Interpretation                                     Estimated gap
(1)     ADA_{t+1}   Total effect, gamma                                 21.2   (1.9)
(2)     ADA_t       pi_1                                                47.6   (1.3)
(3)     DEM_{t+1}   P^D_{t+1} - P^R_{t+1}                               0.48   (0.02)
(4)     (2)x(3)     Elect component, pi_1[P^D_{t+1} - P^R_{t+1}]        22.84  (2.2)
(5)     (1)-(4)     Affect component, pi_0[P*D_{t+1} - P*R_{t+1}]      -1.64   (2.0)

Standard errors are in parentheses. The unit of observation is a district-congressional session. The sample includes only observations where the Democrat vote share at time $t$ is strictly between 48 and 52 percent. The estimated gap is the difference in the average of the relevant variable between observations where the Democrat vote share at time $t$ is strictly between 50 and 52 percent and observations where it is strictly between 48 and 50 percent. Time $t$ and $t+1$ refer to congressional sessions. ADA$_t$ is the adjusted ADA voting score; higher ADA scores correspond to more liberal roll-call voting records. Sample size is 915.

You should have the data in memory, but if not, recall that the command is:

. scuse lmb-data

First we will replicate the first column of Figure 52 by typing (with output below each command):

. reg score lagdemocrat if lagdemvoteshare>.48 & lagdemvoteshare<.52, cluster(id)

Number of obs = 915
------------------------------------------------------------------------------
       score |      Coef.   Std. Err.
-------------+----------------------------------------------------------------
 lagdemocrat |   21.28387    1.951234
------------------------------------------------------------------------------

. reg score democrat if lagdemvoteshare>.48 & lagdemvoteshare<.52, cluster(id)

Number of obs = 915
------------------------------------------------------------------------------
       score |      Coef.   Std. Err.
-------------+----------------------------------------------------------------
    democrat |    47.7056    1.356011
------------------------------------------------------------------------------

. reg democrat lagdemocrat if lagdemvoteshare>.48 & lagdemvoteshare<.52, cluster(id)

Number of obs = 915
------------------------------------------------------------------------------
    democrat |      Coef.   Std. Err.
-------------+----------------------------------------------------------------
 lagdemocrat |   .4843287    .0289322
------------------------------------------------------------------------------

Okay, a few things. First, notice the similarity between each regression output and the regression output in Figure 52. So as you can see, when we say we are estimating global regressions, it means we are simply regressing some outcome onto a treatment variable. Here, though, what we did was run "local" linear regressions. Notice the bandwidth: we are only using observations between 0.48 and 0.52 vote share. So this regression is estimating the coefficient on $D_t$ right around the cutoff. What happens if we use all the data?
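That question is worth previewing with a quick simulation. The sketch below is Python rather than Stata and uses invented data (the trend, jump size, and noise are all made up, and this is not the lmb-data file), but it shows why the answer matters: with a strong trend in the running variable, a global regression of the outcome on the treatment dummy alone overstates the gap, while the same regression restricted to a narrow window around the cutoff approximately recovers it.

```python
import numpy as np

# Simulated RD data (illustrative only, NOT the LMB dataset).
# The true discontinuity at a vote share of 0.5 is 20 points.
rng = np.random.default_rng(0)
n = 10_000
voteshare = rng.uniform(0, 1, n)
d = (voteshare >= 0.5).astype(float)
score = 30 + 60 * voteshare + 20 * d + rng.normal(0, 5, n)

def ols_gap(mask):
    """OLS of score on a constant and the treatment dummy within mask."""
    X = np.column_stack([np.ones(mask.sum()), d[mask]])
    return np.linalg.lstsq(X, score[mask], rcond=None)[0][1]

gap_global = ols_gap(np.ones(n, dtype=bool))                  # all the data
gap_local = ols_gap((voteshare > 0.48) & (voteshare < 0.52))  # narrow window

print(gap_global)  # far above the true jump of 20: the trend contaminates it
print(gap_local)   # close to the true jump of 20
```

The global comparison mixes the jump with the trend in the running variable; the local comparison nets most of the trend out, which is exactly the logic of the Stata windowed regressions above.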

. reg score democrat, cluster(id2)

Number of obs = 13588
------------------------------------------------------------------------------
       score |      Coef.   Std. Err.
-------------+----------------------------------------------------------------
    democrat |   40.76266    1.495659
------------------------------------------------------------------------------

Notice that when we use all the data, the coefficient on the democrat variable becomes smaller. It remains significant, but its confidence interval no longer includes the coefficient we found earlier.

Recall we said that it is common to center the running variable. Centering simply means subtracting the value of the cutoff from the running variable, so that values of 0 are where the vote share equals 0.5, negative values are Democratic vote shares less than 0.5, and positive values are Democratic vote shares above 0.5. To do this, type in the following lines:

. gen demvoteshare_c = demvoteshare - 0.5
. reg score democrat demvoteshare_c, cluster(id2)

Number of obs = 13577
--------------------------------------------------------------------------------
         score |      Coef.   Std. Err.
---------------+----------------------------------------------------------------
      democrat |   58.50236    1.555847
demvoteshare_c |  -48.93761    4.441693
--------------------------------------------------------------------------------

Notice that controlling for the running variable causes the coefficient on democrat, using all the data, to get much larger than when we didn't control for the running variable.

It is common, though, to allow the running variable to vary on either side of the discontinuity. How exactly do we implement that? Think about it: we need a regression line on either side, which means necessarily that we have two lines left and right of the discontinuity. To do this, we need an interaction, specifically an interaction of the running variable with the treatment variable. So to do that in Stata, we simply type:

. xi: reg score i.democrat*demvoteshare_c, cluster(id2)

Number of obs = 13577
--------------------------------------------------------------------------------
         score |      Coef.   Std. Err.
---------------+----------------------------------------------------------------
  _Idemocrat_1 |   55.43136    1.448568
demvoteshare_c |   5.682785    5.939863
 _IdemXdemvo_1 |  -55.15188    8.236231
--------------------------------------------------------------------------------

But notice, we are still estimating global regressions. And it is for that reason, as I'll show now, that the coefficient is larger. This suggests that there exist strong outliers in the data which are causing the distance at c = 0 to spread more widely. So a natural solution, therefore, is to again limit our analysis to a smaller window. What this does is drop the observations far away from c, and therefore omit the influence of outliers from our estimation at the cutoff. Since we used +/-0.02 last time, we'll use +/-0.05 this time just to mix things up.

. xi: reg score i.democrat*demvoteshare_c if demvoteshare>.45 & demvoteshare<.55, cluster(id2)

Number of obs = 2387
--------------------------------------------------------------------------------
         score |      Coef.   Std. Err.
---------------+----------------------------------------------------------------
  _Idemocrat_1 |   46.77845    2.491464
demvoteshare_c |   54.82604    50.12314
 _IdemXdemvo_1 |   -91.1152    81.05893
--------------------------------------------------------------------------------

As can be seen, when we limit our analysis to +/-0.05 around the cutoff, we are dropping observations from the analysis. That's why we only have 2,387 observations for analysis as opposed to the 13,000 we had before. This brings us to an important point. The ability to do this kind of local regression analysis necessarily requires a lot of data around the cutoff. If we don't have a lot of data around the cutoff, then we simply cannot estimate local regression models, as the data simply becomes too noisy. This is why I said RDD is "greedy": it needs a lot of data because it uses only a portion of it for analysis.

But putting that aside, think about what this did. This fit a model that controlled for a straight line below the cutoff (demvoteshare_c) and above the cutoff (_IdemXdemvo_1).
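The two-line logic is easy to see outside of Stata. Here is a minimal Python sketch on simulated data (all coefficients are invented for illustration): interacting the treatment dummy with the centered running variable fits separate slopes on each side, and the coefficient on the dummy is the gap at the cutoff.

```python
import numpy as np

# Two regression lines, one on either side of the cutoff, in one model:
#   score = b0 + b1*D + b2*xc + b3*(D*xc) + e
# where xc is the centered running variable and D the treatment dummy.
# b1 is the gap at the cutoff. Simulated data; all numbers invented.
rng = np.random.default_rng(1)
n = 5_000
xc = rng.uniform(-0.5, 0.5, n)      # centered running variable
d = (xc >= 0).astype(float)         # "democrat" analogue
score = 50 + 15 * d - 20 * xc + 30 * d * xc + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), d, xc, d * xc])
b0, b1, b2, b3 = np.linalg.lstsq(X, score, rcond=None)[0]

print(b1)  # near 15, the true jump at xc = 0
print(b2)  # slope left of the cutoff, near -20
print(b3)  # change in slope right of the cutoff, near 30
```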
Controlling for those two things, the remainder is a potential gap at voteshare = 0.5, which is captured by the democrat dummy. It does this through extrapolation. I encourage you to play around with the windows. Try +/-0.1 and +/-0.01. Notice how the standard errors get larger the narrower you make the band. Why do you think that is? Think about it: narrowing the band decreases bias but, strangely, increases variance. Do you know why?

Recall what we said about nonlinearities and strong trends in the evolution of the potential outcomes. Without controlling for nonlinearities, we may be misattributing causal effects using only linear functions of the running variable. Therefore, next we will show how to model polynomials in the running variable. Gelman and Imbens [2017] recommend polynomials up to a quadratic to avoid the problem of overfitting, so we will follow their advice now. First, we need to generate the polynomials. Then we need to interact them with the treatment variable, which, as we alluded to earlier, will allow us to model polynomials to the left and right of the cutoff.

. gen x_c = demvoteshare - 0.5
. gen x_c2 = x_c^2
. reg score democrat##(c.x_c c.x_c2)

Number of obs = 13,577
---------------------------------------------------------------------------------
           score |      Coef.   Std. Err.
-----------------+----------------------------------------------------------------
      1.democrat |   44.40229    1.008569
             x_c |   -23.8496    8.209109
            x_c2 |  -41.72917    17.50259
  democrat#c.x_c |
               1 |   111.8963    10.57201
 democrat#c.x_c2 |
               1 |  -229.9544    21.10866
---------------------------------------------------------------------------------

Notice that using all the data now gets us closer to the earlier local estimate. And finally, we can use a narrower bandwidth:

. reg score democrat##(c.x_c c.x_c2) if demvoteshare>0.4 & demvoteshare<0.6

Number of obs = 4,632
---------------------------------------------------------------------------------
           score |      Coef.   Std. Err.
-----------------+----------------------------------------------------------------
      1.democrat |    45.9283    1.892566
             x_c |   38.63988    60.77525
            x_c2 |   295.1723    594.3159
  democrat#c.x_c |
               1 |   6.507415    88.51418
 democrat#c.x_c2 |
               1 |  -744.0247    862.1043
---------------------------------------------------------------------------------

Once we controlled for the quadratic polynomial, the gain from limiting the bandwidth was smaller than when we either did not control for the running variable at all or controlled only for a linear trend.

Hahn et al. [2001] clarified the assumptions behind RDD, specifically continuity of the conditional expectation functions of the potential outcomes. They also framed estimation as a nonparametric problem and emphasized using local polynomial regressions. Nonparametric methods mean a lot of different things to different people in statistics, but in RDD contexts the idea is to estimate a model that doesn't assume a functional form for the relationship between the outcome variable (Y) and the running variable (X). The model would be something like

$$Y = f(X) + \varepsilon$$

A very basic method would be to calculate E[Y] for each bin on X, like a histogram. Stata has a command to do this called cmogram, created by Christopher Robert. The program has a lot of useful options, and we can recreate Figures I, IIA and IIB from Lee et al. [2004]. Here is Figure I, which shows the relationship between a Democratic win and the candidate's second-period ADA score, as a function of the running variable, Democratic vote share (Figure 53).

First you will need to install cmogram from ssc, the Statistical Software Components archive:

. ssc install cmogram

Next we calculate the conditional mean values for the observations according to an automated binning algorithm generated by cmogram:

. cmogram score lagdemvoteshare, cut(0.5) scatter line(0.5) qfitci

Figure 54 shows the output from this program. Notice the similarities between what we produced here and what Lee et al. [2004] produced in their Figure I. The only difference is subtle differences in the binning used for the two figures.
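What cmogram automates is simple enough to sketch by hand. The Python snippet below uses simulated data (an invented trend and jump, not the LMB dataset) and computes E[Y] within 0.01-wide bins of the running variable, which is what the circles in these binned scatter plots represent.

```python
import numpy as np

# Binned means of the outcome, like the circles cmogram plots.
# Simulated data: invented trend plus a 15-point jump at x = 0.5.
rng = np.random.default_rng(2)
n = 5_000
x = rng.uniform(0, 1, n)                                # running variable
y = 20 + 40 * x + 15 * (x >= 0.5) + rng.normal(0, 5, n)

edges = np.linspace(0, 1, 101)                          # 100 bins, width 0.01
bin_id = np.digitize(x, edges) - 1
bin_means = np.array([y[bin_id == b].mean() for b in range(100)])
bin_mids = (edges[:-1] + edges[1:]) / 2

# Average the binned means in 0.05-wide strips on each side of the cutoff:
left = bin_means[(bin_mids > 0.45) & (bin_mids < 0.5)].mean()
right = bin_means[(bin_mids > 0.5) & (bin_mids < 0.55)].mean()
print(right - left)  # roughly the 15-point jump plus a small trend term
```

Plotting `bin_means` against `bin_mids` gives the binned scatter; fitting a polynomial separately on each side of the cutoff reproduces the fitted lines cmogram draws.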
The key arguments used in this command are the listing of the outcome (score) and the running variable (lagdemvoteshare), the designation of where along the running variable the cutoff is (cut(0.5)), whether to produce the visualization of the scatter plots (scatter), whether to show a dashed vertical line at the cutoff (line(0.5)), and what kind of polynomial to fit left and right of the cutoff (qfitci).

Figure 53: Lee et al. [2004], Figure I. "Total Effect of Initial Win on Future ADA Scores." The original figure plots ADA scores after the election at time t+1 against the Democrat vote share at time t. Each circle is the average ADA score within 0.01-wide intervals of the Democrat vote share. Solid lines are fitted values from fourth-order polynomial regressions on either side of the discontinuity; dotted lines are pointwise 95 percent confidence intervals.

We have options other than a quadratic fit, though, and I think it's useful to compare this graph with one where we only fit a linear model. Because there are strong trends in the running variable, we probably just want to use the quadratic, but let's see what we get when we use simple lines.

. cmogram score lagdemvoteshare, cut(0.5) scatter line(0.5) lfit

Figure 55 shows what we get when we only use a linear fit of the data left and right of the cutoff. Notice the influence that outliers far from the actual cutoff have on the estimate of the causal effect at the cutoff. Some of this would go away if we restricted the bandwidth to shorter distances from the cutoff, but I leave it to you to do that yourself.

Finally, we can use a lowess fit. A lowess fit more or less crawls through the data running small regressions on small cuts of data.
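That crawling procedure is easy to sketch. Below is a bare-bones lowess-style smoother in Python on simulated data (real lowess adds tricube kernel weights and robustness iterations that this sketch omits, and the curve here is invented for illustration).

```python
import numpy as np

# A bare-bones lowess-style smoother: at each evaluation point, fit a
# small linear regression using only the observations inside a window.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 500))
y = np.sin(3 * x) + rng.normal(0, 0.2, 500)   # invented curve plus noise

def local_fit(x0, h=0.1):
    """Fit y = a + b*x on observations with |x - x0| < h; predict at x0."""
    m = np.abs(x - x0) < h
    X = np.column_stack([np.ones(m.sum()), x[m]])
    a, b = np.linalg.lstsq(X, y[m], rcond=None)[0]
    return a + b * x0

grid = np.linspace(0.1, 0.9, 9)
smooth = np.array([local_fit(g) for g in grid])
print(np.max(np.abs(smooth - np.sin(3 * grid))))  # small: the fit tracks the curve
```

Because each prediction only uses a small slice of data, the fitted curve wiggles with local noise, which is exactly the zig-zag appearance discussed next.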

Figure 54: Reproduction of Lee et al. [2004] Figure I using cmogram with quadratic fit and confidence intervals.

This can give the picture a zig-zag appearance. We nonetheless show it here:

. cmogram score lagdemvoteshare, cut(0.5) scatter line(0.5) lowess

It is probably a good idea to at least run all of these, but your final selection of what to report as your main results should be the polynomial that best fits the data. Some papers only report a linear fit because there weren't very strong trends to begin with. For instance, consider Carrell et al. [2011]. The authors are interested in the causal effect of drinking on academic test outcomes. Their running variable is the precise age of the student, which they have because they know the student's date of birth and the date of every exam taken at the Air Force Academy. Because the Air Force Academy restricts the social lives of its students, there is a more stark increase in drinking at age 21 on its campus than there might be on a typical university campus. They examined the causal effect of the drinking age on normalized grades using RDD, but because there weren't strong trends in the data, they only fit a linear model (Figure 57). It would no doubt have been useful for this graph to include confidence intervals, but the authors did not. Instead, they estimated the discontinuities in regression models (Figure 58).

Figure 55: Reproduction of Lee et al. [2004] Figure I using cmogram with linear fit.

As can be seen from both the graphical data and the regression analysis, there appears to be a break in the outcome (normalized grade) at age 21, suggesting that alcohol has a negative causal effect on academic performance. (Many, many papers have used RDD to look at alcohol, using either age as the running variable or blood alcohol content as the running variable. Examples include Carpenter and Dobkin [2009] and Hansen [2015], just to name a couple.)

Hahn et al. [2001] have shown that one-sided kernel estimation, such as lowess, may suffer from poor properties because the point of interest is at the boundary (i.e., the discontinuity). This is called the "boundary problem". They propose using "local linear nonparametric regressions" instead. In these regressions, more weight is given to the observations at the center.

You can implement this using Stata's lpoly command, which estimates kernel-weighted local polynomial regressions. Think of it as a weighted regression restricted to a window like we've been using (hence the word "local"), where the chosen kernel provides the weights. A rectangular kernel would give the same results as E[Y] at a given bin on X, but a triangular kernel gives more importance to the observations closest to the center. This method will be sensitive to how large a bandwidth, or window, you choose. But in that sense, it's similar to what we've been doing.

. * Note: kernel-weighted local polynomial regression is a smoothing method.
. lpoly score demvoteshare if democrat == 0, nograph kernel(triangle) gen(x0 sdem0) bwidth(0.1)
. lpoly score demvoteshare if democrat == 1, nograph

kernel(triangle) gen(x1 sdem1) bwidth(0.1)

Figure 56: Reproduction of Lee et al. [2004] Figure I using cmogram with lowess fit.

. scatter sdem1 x1, color(red) msize(small) || scatter sdem0 x0, msize(small) color(red) xline(0.5, lstyle(dot)) legend(off) xtitle("Democratic vote share") ytitle("ADA score")

Figure 59 shows this visually.

A couple of final things. First, I'm not showing this, but recall the continuity assumption. Because the continuity assumption specifically involves continuous conditional expectation functions of the potential outcomes throughout the cutoff, it is untestable. That's right: it's an untestable assumption. But what we can do is check for changes in the conditional expectation functions of other exogenous covariates that cannot or should not be changing as a result of the cutoff. So it's very common to look at things like race or gender around the cutoff. You can use these same methods to do that, but I do not do so here. Any RDD paper will always involve such placebos; even though they are not direct tests of the continuity assumption, they are indirect tests. Remember, your reader isn't as familiar with this thing you're studying, so your task is to teach them. Anticipate their objections and the sources of their skepticism. Think like them. Try to put yourself in a stranger's shoes. And then test those skepticisms to the best of your ability.

Second, we saw the importance of bandwidth selection, or window, for estimating the causal effect using this method, as well as

the importance of selection of polynomial length. There's always a tradeoff when choosing the bandwidth between bias and variance: the shorter the window, the lower the bias, but because you have less data, the variance in your estimate increases.

Figure 57: Carrell et al. [2011], Figure 3. "Regression discontinuity estimates of the effect of drinking on achievement."

Figure 58: Carrell et al. [2011], Table 3. "Regression discontinuity estimates of the effect of drinking on academic performance."

Recent work has focused on optimal bandwidth selection, such as Imbens and Kalyanaraman [2011] and Calonico et al. [2014]. The latter can be implemented with the user-created rdrobust command. These methods ultimately choose optimal bandwidths, which may differ left and right of the cutoff, based on some bias-variance tradeoff. Here's an example:

. ssc install rdrobust
. rdrobust score demvoteshare, c(0.5)

Sharp RD estimates using local polynomial regression.

Cutoff c = .5        | Left of c   Right of c          Number of obs = 13577

Figure 59: Local linear nonparametric regressions.

---------------------+-------------------------
                     |                                  BW type    = mserd
  Number of obs      |      5480         8097           Kernel     = Triangular
  Eff. number of obs |      2096         1882           VCE method = NN
  Order est. (p)     |         1            1
  Order bias (q)     |         2            2
  BW est. (h)        |     0.085        0.085
  BW bias (b)        |     0.140        0.140
  rho (h/b)          |     0.607        0.607

Outcome: score. Running variable: demvoteshare.
--------------------------------------------------------------------------------
              Method |      Coef.   Std. Err.
---------------------+----------------------------------------------------------
        Conventional |     46.483      1.2445
              Robust |          -           -
--------------------------------------------------------------------------------

This method, as we've repeatedly said, is data greedy because it gobbles up data at the discontinuity. So ideally these kinds of methods will be used when you have a large number of observations in the sample, so that you have a sizable number of observations at the discontinuity. When that is the case, there should be some harmony in your findings across results. If there isn't, then it calls into question whether you have sufficient power to pick up this effect.

Finally, we look at the implementation of the McCrary density test. Justin McCrary has graciously made this available to us, though

rdrobust also has a density test built into it. But for now, we will use McCrary's ado package. This cannot be downloaded from ssc, so you must download it directly from McCrary's website and move it into your Stata subdirectory that we listed earlier. Note this will automatically download the file. Once the file is installed, you use the following command to check for whether there is any evidence of manipulation in the running variable at the cutoff.

    . DCdensity demvoteshare_c if (demvoteshare_c > -0.5 & demvoteshare_c < 0.5), breakpoint(0) generate(Xj Yj r0 fhat se_fhat)

    Using default bin size calculation, bin size = .003047982
    Using default bandwidth calculation, bandwidth = .104944836
    Discontinuity estimate (log difference in height): .011195629
                                                      (.061618519)
    Performing LLR smoothing. 296 iterations will be performed

[Figure 60: Local linear nonparametric regressions]

And visually inspecting the graph, we see no signs that there was manipulation in the running variable at the cutoff.
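The core idea of the density test can be sketched in a few lines. The toy Python simulation below is not McCrary's implementation (which smooths a finely binned histogram with local linear regression); it simply compares the estimated density height just right versus just left of the cutoff via a log difference, with an invented function name and simulated data.

```python
import math
import random

def density_log_diff(x, cutoff=0.0, bandwidth=0.1):
    # Toy manipulation check: log difference in density height just
    # right vs. just left of the cutoff. (McCrary's actual test smooths
    # a binned histogram with local linear regression; this keeps only
    # the core comparison.)
    n_left = sum(1 for v in x if cutoff - bandwidth <= v < cutoff)
    n_right = sum(1 for v in x if cutoff <= v <= cutoff + bandwidth)
    h_left = n_left / (len(x) * bandwidth)
    h_right = n_right / (len(x) * bandwidth)
    return math.log(h_right) - math.log(h_left)

random.seed(0)
smooth = [random.uniform(-1, 1) for _ in range(100_000)]
print(abs(density_log_diff(smooth)) < 0.1)   # True: no discontinuity

# heap extra mass just above the cutoff, as manipulation would
heaped = smooth + [random.uniform(0, 0.05) for _ in range(20_000)]
print(density_log_diff(heaped) > 0.5)        # True: density jumps at cutoff
```

A log difference near zero, as in the real output above, is what "no manipulation" looks like.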

Regression Kink Design

A couple of papers came out by David Card and coauthors. The most notable is Card et al. [2015]. This paper introduced us to a new method called regression kink design, or RKD. The intuition is rather simple. Rather than the discontinuity creating a discontinuous jump in the treatment variable at the cutoff, it creates a change in the first derivative. They use a "kink" in some policy rule to identify the causal effect of the policy using a jump in the first derivative. Their paper applies the design to answer the question whether the level of unemployment benefits affects the length of time spent unemployed in Austria. Here's a brief description of the policy. Unemployment benefits are based on income in a base period. The benefit formula for unemployment exhibits two kinks. There is a minimum benefit level that isn't binding for people with low earnings. Then benefits are 55% of the earnings in the base period. Then there is a maximum benefit level that is adjusted every year. People with dependents get small supplements, which is the reason there are five "solid" lines in the following graph. Not everyone receives benefits that correspond one to one to the formula, because mistakes are made in the administrative data (Figure 61).

[Figure 61: RKD kinks from Card et al. [2015]. The graph shows unemployment benefits (vertical axis) as a function of pre-unemployment earnings (horizontal axis).]

Next we look at the relationship between average daily unemployment insurance benefits and base year earnings, where the running variable has been re-centered. The bin size is 100 euros. For single individuals, unemployment insurance benefits are flat below the cutoff. The relationship is still upward sloping, though, because of family benefits (Figure 62).

[Figure 62: Base year earnings and benefits for single individuals from Card et al. [2015]. Bin size: 100 euros.]

Next we look at the main outcome of interest: time unemployed, which is the time the individual spent until they got another job. As can be seen in Figure 63, people with higher base earnings have less trouble finding a job (which gives the relationship its negative slope). But there is a kink: the relationship becomes shallower once benefits increase more. This suggests that as unemployment benefits increased, the time spent unemployed got longer; the relationship continued downward, but the slope shifted and got flatter. A very interesting and policy-relevant result.
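The mechanics of a basic RKD estimate can be sketched with simulated data. This is not Card et al.'s estimator or data: the sketch below invents a benefit rule that pays 55% of re-centered earnings below the kink and is capped above it, fits a simple line on each side of the kink within a bandwidth, and reports the change in slope.

```python
import random

def slope(xs, ys):
    # simple bivariate OLS slope: Cov(x, y) / Var(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def rkd_kink(x, y, kink=0.0, bandwidth=1.0):
    # RKD sketch: fit a line on each side of the kink within the
    # bandwidth; the change in slope is the estimated kink effect
    left = [(a, b) for a, b in zip(x, y) if kink - bandwidth <= a < kink]
    right = [(a, b) for a, b in zip(x, y) if kink <= a <= kink + bandwidth]
    return slope(*zip(*right)) - slope(*zip(*left))

# hypothetical policy: 55% of earnings below the kink, capped above it
# (earnings re-centered at the kink point, as in the text)
random.seed(1)
earnings = [random.uniform(-1, 1) for _ in range(50_000)]
benefit = [0.55 * e if e < 0 else 0.0 for e in earnings]
outcome = [b + random.gauss(0, 0.05) for b in benefit]

print(round(rkd_kink(earnings, outcome), 2))   # -0.55: the slope change at the kink
```

The fitted slope change recovers the kink in the policy rule; the second step of an RKD divides the kink in the outcome by the kink in the treatment, just as RD divides the jump in the outcome by the jump in treatment.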

[Figure 63: Log(duration unemployed) and benefits for single individuals (time to next job) from Card et al. [2015]. People with higher base earnings have less trouble finding a job (negative slope). There is a kink: the relationship becomes shallower once benefits increase more.]

Instrumental variables

"I made Sunday Candy, I'm never going to hell
I met Kanye West, I'm never going to fail."
- Chance the Rapper

Instrumental variables is maybe one of the most important econometric strategies ever devised. Just as Archimedes said "Give me a fulcrum, and I shall move the world", so it could be said that with a good enough instrument, we can identify any causal effect. But, while that is hyperbole for reasons we will soon see, it is nonetheless the case that instrumental variables is an important contribution to causal inference, and an important tool to have in your toolkit. It is also, interestingly, unique because it is one of those instances where the econometric estimator was not simply ripped off from statistics (e.g., Eicker-Huber-White standard errors) or some other field (e.g., regression discontinuity). Its history is, in my opinion, quite fascinating, and before we dive into the technical material, I'd like to tell you a story about its discovery.

History of Instrumental Variables: Father and Son

Philip Wright was born in 1861 and died in 1934. He received his bachelor's degree from Tufts in 1884 and a masters degree from Harvard in 1887.[105] His son, Sewall Wright, was born in 1889 when Philip was 28. The family moved from Massachusetts to Illinois, where Philip took a position as professor of mathematics and economics at Lombard College. Philip was so unbelievably busy with teaching and service that it is astonishing he had any time for research, but he did. He published numerous articles and books over his career, including poetry. You can see his vita here at https://[...]_cv.pdf.[106] Sewall attended Lombard College and took his college mathematics courses from his father.

[105] This biographical information is drawn from Stock and Trebbi [2003].
[106] Interesting side note: Philip had a passion for poetry, and even published some in his life, and he used his school's printing press to publish the first book of poems by the great American poet, Carl Sandburg.

In 1913, Philip took a position at Harvard, and Sewall entered as a graduate student. Eventually Philip would leave for the Brookings

Institute, and Sewall would take his first job in the Department of Zoology at the University of Chicago, where he would eventually be promoted to professor in 1930. Philip was prolific, which given his teaching and service requirements is amazing. He published in top journals such as the Quarterly Journal of Economics, Journal of the American Statistical Association, Journal of Political Economy and American Economic Review. A common theme across many publications was the identification problem. He was acutely aware of it and was intent on solving it. In 1928, Philip was writing a book about animal and vegetable oils of all things. The reason? He believed that recent tariff increases were harming international relations. Thus he wrote passionately about the damage from the tariffs, which affected animal and vegetable oils. We will return to this book again, as it's an important contribution to our understanding of instrumental variables. While Philip was publishing like a fiend in economics, Sewall Wright was revolutionizing the field of genetics. He invented path analysis, a precursor to Pearl's directed acyclical graphical models, and made important contributions to the theory of evolution and genetics. He was a genius. The decision not to follow in the family business (economics) created a bit of tension between the two men, but all evidence suggests that they found one another intellectually stimulating. In his book on vegetable and oil tariffs, there is an Appendix (entitled Appendix B) in which the calculus of the instrumental variables estimator was worked out. Elsewhere, Philip thanked his son for his valuable contributions to what he had written, referring primarily to the path analysis which Sewall had taught him. This path analysis, it turned out, played a key role in Appendix B. The Appendix shows a solution to the identification problem.
So long as the economist is willing to impose some restrictions on the problem, then the system of equations can be identified. Specifically, if there is one instrument for supply, and the supply and demand errors are uncorrelated, then the elasticity of demand can be identified. But who wrote this Appendix B? Either man could've done so. It is an economics article, which points to Philip. But it used the path analysis, which points to Sewall. Historians have debated this, even going so far as to accuse Philip of stealing the idea from his son. If Philip stole the idea (by which I mean that, when he published Appendix B, he failed to give proper attribution to his son), then it would at the very least have been a strange oversight, possibly out of character for a man who by all evidence loved his son very much. In comes Stock and Trebbi [2003].

Stock and Trebbi [2003] tried to determine the authorship of Appendix B using "stylometric analysis". Stylometric analysis had been used in other applications, such as to identify the author of the political novel Primary Colors (Joe Klein) and the unsigned Federalist Papers. But Stock and Trebbi [2003] is, to my knowledge, the first application of it in economics.[107] The method is akin to contemporary machine learning methods. The authors collected raw data containing the known original academic writings of each man, plus the first chapter and Appendix B of the book in question. The writings were edited to exclude footnotes, graphs and figures. Blocks of 1,000 words were selected from the files. A total of 83 blocks were selected: 20 written by Sewall with certainty, 54 by Philip, six from Appendix B, and three from chapter 1. Chapter 1 has always been attributed to Philip, but Stock and Trebbi [2003] treat the three blocks as unknown in order to "train" the data. That is, they use it to check if their model is correctly predicting authorship.

[107] Maybe the only one?

The stylometric indicators that they used included the frequency of occurrence in each block of 70 function words. The list was taken from a separate study. These 70 function words produced 70 numerical variables, each of which is a count, per 1,000 words, of an individual function word in the block. Some words were dropped because they occurred only once ("things"), leaving 69 function word counts. The second set of stylometric indicators, taken from another study, concerned grammatical constructions. Stock and Trebbi [2003] used 18 grammatical constructions, which were frequency counts. They included things like noun followed by an adverb, total occurrences of prepositions, coordinating conjunction followed by noun, and so on. There was one dependent variable in their analysis, and that was authorship.
The independent variables were 87 covariates (69 function word counts and 18 grammatical statistics). The results of this analysis are absolutely fascinating. For instance, many covariates have very large t-statistics, which would be unlikely if there really were no stylistic differences between the authors and the indicators were independently distributed. So what do they find? The result that I find the most interesting is their regression analysis. They write: "We regressed authorship against an intercept, the first two principal components of the grammatical statistics and the first two principal components of the function word counts, and we attribute authorship depending on whether the predicted value is greater or less than 0.5." Note, they used principal component analysis because they had more covariates than observations, and needed the dimension reduction. A more contemporary method might be LASSO or ridge regression.
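To make the attribution rule concrete, here is a minimal sketch in Python. It skips the principal components step and uses a single invented stylometric count, but it follows the same logic as the quoted passage: regress 0/1 authorship on the indicator, then attribute a block according to whether its fitted value exceeds 0.5. All numbers here are hypothetical.

```python
import random

def fit_lpm(x, y):
    # linear probability model: regress 0/1 authorship on one
    # stylometric count; returns intercept and slope
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

# hypothetical training blocks: rate per 1,000 words of one function
# word, for blocks known to be by author 0 (54 blocks) vs. author 1
# (20 blocks) -- invented numbers standing in for the 87 covariates
random.seed(2)
rates = [random.gauss(12, 2) for _ in range(54)] + \
        [random.gauss(20, 2) for _ in range(20)]
author = [0] * 54 + [1] * 20
a0, b = fit_lpm(rates, author)

def attribute(rate):
    # assign the block to author 1 if the fitted value exceeds 0.5
    return 1 if a0 + b * rate > 0.5 else 0

print(attribute(11), attribute(21))   # 0 1: two "unknown" blocks assigned
```

An unknown block whose function-word rate looks like author 0's known writing gets assigned to author 0, which is exactly how the Appendix B and chapter 1 blocks were classified.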

But, given this analysis, what did they find? They found that all of the Appendix B and chapter 1 blocks were assigned to Philip, not Sewall. They did other robustness checks, and all of them point to Philip as the author. I love this story for many reasons. First, I love the idea that an econometric estimator as important as instrumental variables was in fact created by an economist. I'm so accustomed to stories in which the actual econometric estimator was lifted from statistics (Huber-White standard errors) or educational psychology (regression discontinuity). It is nice to know economists have added their own to the seminal canon of econometrics. But the other part of the story that I love is the father/son component. I find it encouraging to know that a father and son can overcome differences through intellectual collaborations such as this. Such relationships are important, and tensions, when they arise, should be vigorously pursued until they dissipate, if possible. Philip and Sewall give us a story of that, which I appreciate.

Natural Experiments and the King of the North

While natural experiments are not technically the instrumental variables estimator, they can be construed as such if we grant that they are the reduced form component of the IV strategy. I will begin by describing one of the most famous, and my favorite, examples of a natural experiment: John Snow's discovery that cholera was a waterborne disease transmitted through the London water supply. Natural experiments are technically, though, not an estimator or even an experiment. Rather, they are usually nothing more than an event that occurs naturally which causes exogenous variation in some treatment variable of interest.[108] When thinking about these, effort is spent finding some rare circumstance such that a consequential treatment was handed to some people or units but denied to others "haphazardly". Note I did not say randomly, though ideally it was random or conditionally random.

[108] Instruments don't have to be simply naturally occurring random variables. Sometimes they are lotteries, such as in the Oregon Medicaid Experiment. Other times, they are randomized peer designs to induce participation in an experiment. Rosenbaum [2010] wrote: "The word 'natural' has various connotations, but a 'natural experiment' is a 'wild experiment' not a 'wholesome experiment,' natural in the way that a tiger is natural, not in the way that oatmeal is natural."

Before John Snow was the King of the North, he was a 19th century physician in London during several cholera epidemics. He watched helplessly as patient after patient died from this mysterious illness. Cholera came in waves. Tens of thousands of people died horrible deaths from this disease, and doctors were helpless at stopping

it, let alone understanding why it was happening. Snow tried his best to save his patients, but despite that best, they still died. Best I can tell, Snow was fueled by compassion, frustration and curiosity. He observed the progression of the disease and began forming conjectures. The popular theory of the time was miasma. Miasma was the majority view about disease transmission, and proponents of the theory claimed that minute, inanimate particles in the air were what caused cholera to spread from person to person. Snow tried everything he could to block the poisons from reaching the person's body, a test of miasma, but nothing seemed to save his patients. So he did what any good scientist does: he began forming a new hypothesis. It's important to note something: cholera came in three waves in London, and Snow was there for all of them. He was on the front line, both as a doctor and an epidemiologist. And while his patients were dying, he was paying attention, making guesses, testing them, and updating his beliefs along the way. Snow observed the clinical course of the disease and made the following conjecture. He posited that the active agent was a living organism that entered the body, got into the alimentary canal with food or drink, multiplied in the body, and generated a poison that caused the body to expel water. The organism passed out of the body with these evacuations, entered the water supply, and re-infected new victims who unknowingly drank from the water supply. This process repeated, causing a cholera epidemic. Snow had evidence for this based on years of observing the progression of the disease. For instance, cholera transmission tended to follow human commerce. Or the fact that a sailor on a ship from a cholera-free country who arrived at a cholera-stricken port would only get sick after landing or taking on supplies.
Finally, cholera hit the poorest communities the hardest, those who also lived in the most crowded housing with the worst hygiene. He even identified Patient Zero: a sailor named John Harnold, who arrived in London on the ship Elbe from Hamburg, where the disease was prevailing. It seems like you can see Snow, over time, moving towards cleaner and cleaner pieces of evidence in support of a waterborne hypothesis. For instance, we know that he thought it important to compare two apartment buildings: one which was heavily hit with cholera, and a second one that wasn't. The first building was contaminated by runoff from privies, but the water supply in the second was cleaner. The first building also seemed to be hit much harder by cholera than the second. These facts, while not entirely consistent with the miasma theory, were still only suggestive. How could Snow test a hypothesis that cholera was transmitted

via poisoned water supplies? Simple! Randomly assign half of London to drink from water contaminated by the runoff from cholera victims, and the other half from clean water. But it wasn't merely that Snow predated the experimental statisticians, Jerzy Neyman and Ronald Fisher, that kept him from running an experiment like that. An even bigger constraint was that even if he had known about randomization, there's no way he could've run an experiment like that. Oftentimes, particularly in social sciences like epidemiology and economics, we are dealing with macro-level phenomena, and randomized experiments are simply not realistic options. I present that kind of thought experiment, though, not to advocate for the randomized controlled trial, but rather to help us understand the constraints we face, as well as to help hone in on what sort of experiment we need in order to test a particular hypothesis. For one, Snow would need a way to trick the data such that the allocation of clean and dirty water to people was not associated with the other determinants of cholera mortality, such as hygiene and poverty. He just would need for someone or something to be making this treatment assignment for him. Fortunately for Snow, and the rest of London, that someone or something existed. In the London of the 1800s, there were many different water companies serving different areas of the city. Some areas were served by more than one company. Several took their water from the Thames, which was heavily polluted by sewage. The service areas of such companies had much higher rates of cholera. The Chelsea water company was an exception, but it had an exceptionally good filtration system. That's when Snow had a major insight. In 1849, the Lambeth water company moved the intake point upstream along the Thames, above the main sewage discharge point, giving its customers purer water.
The Southwark and Vauxhall water company, on the other hand, left their intake point downstream from where the sewage discharged. Insofar as the kinds of people that each company serviced were approximately the same, then comparing the cholera rates between the two sets of houses could be the experiment that Snow so desperately needed to test his hypothesis.

Snow's Table IX

    Company name             Number of houses   Cholera deaths   Deaths per 10,000 houses
    Southwark and Vauxhall             40,046            1,263                        315
    Lambeth                            26,107               98                         37

Snow wrote up his results in a document with many tables and a map showing the distribution of cholera cases around the city, one of the first statistical maps and one of the most famous. Table IX, above, shows the main results. Southwark and Vauxhall, what I call the treatment case, had 1,263 cholera deaths, which is 315

per 10,000 houses. Lambeth, the control, had only 98, which is 37 per 10,000 houses. Snow spent the majority of his time in the write-up tediously documenting the similarities between the groups of domiciles serviced by the two companies in order to rule out the possibility that some other variable could be both correlated with Southwark and Vauxhall and associated with miasma explanations. He was convinced: cholera was spread through the water supply, not the air. Of this table, the statistician Freedman [1991] wrote:

"As a piece of statistical technology, [Snow's Table IX] is by no means remarkable. But the story it tells is very persuasive. The force of the argument results from the clarity of the prior reasoning, the bringing together of many different lines of evidence, and the amount of shoe leather Snow was willing to use to get the data. Snow did some brilliant detective work on nonexperimental data. What is impressive is not the statistical technique but the handling of the scientific issues. He made steady progress from shrewd observation through case studies to analysis of ecological data. In the end, he found and analyzed a natural experiment."

The idea that the best instruments come from shoe leather is echoed in Angrist and Krueger [2001], when the authors note that the best instruments come from in-depth knowledge of the institutional details of some program or intervention.

Instrumental variables DAG

To understand the instrumental variables estimator, it is helpful to start with a DAG. This DAG shows a chain of causal effects that contains all the information needed to understand the instrumental variables strategy. First, notice the backdoor path between D and Y: D ← U → Y. Furthermore, note that U is unobserved by the econometrician, which causes the backdoor path to remain open.
If we have this kind of selection on unobservables, then there does not exist a conditioning strategy that will satisfy the backdoor criterion (in our data). But, before we throw up our arms, let's look at how Z operates through these pathways.

    Z → D → Y;   D ← U → Y

First, there is a mediated pathway from Z to Y via D. When Z varies, D varies, which causes Y to change. But, even though Y is varying when Z varies, notice that Y is only varying because D has varied. You sometimes hear people describe this as the "only through" assumption. That is, Z affects Y "only through" D.

Imagine this for a moment though. Imagine D consists of people making choices. Sometimes these choices affect Y, and sometimes these choices merely reflect changes in Y via changes in U. But along comes some shock, Z, which induces some but not all of the people in D to make different decisions. What will happen? Well, for one, when those people's decisions change, D will change too, causing Y to change because of the causal effect. But, notice, all of the correlation between D and Y in that situation will reflect the causal effect. The reason being, D is a collider along the backdoor path between Z and Y. But I'm not done with this metaphor. Let's assume that in this D variable, with all these people, only some of the people change their behavior because of Z. What then? Well, in that situation, Z is causing a change in Y for just a subset of the population. If the instrument only changes the behavior of women, for instance, then the causal effect of D on Y will only reflect the causal effect of female choices, not males. There are two ideas inherent in the previous paragraph that I want to emphasize. First, if there are heterogeneous treatment effects (e.g., males affect Y differently than females), then our Z shock only identified some of the causal effect of D on Y. And that piece of the causal effect may only be valid for the female population whose behavior changed in response to Z; it may not be reflective of how male behavior would affect Y. And secondly, if Z is only inducing some of the change in Y via only a fraction of the change in D, then it's almost as though we have less data to identify that causal effect than we really have. Here we see two of the difficulties in both interpreting instrumental variables, as well as identifying a parameter with it. Instrumental variables only identifies a causal effect for any group of units whose behaviors are changed as a result of the instrument.
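This complier logic is easy to see in a simulation. The sketch below uses invented numbers, not data from any paper: only women respond to the instrument, and the treatment effect is 2 for women but 5 for men. The Wald IV ratio (the simplest IV estimator with a binary instrument) recovers the compliers' effect of 2, not some average of 2 and 5.

```python
import random

def mean(v):
    return sum(v) / len(v)

def wald(z, d, y):
    # with a binary instrument, IV reduces to the Wald ratio:
    # (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])
    y1 = mean([yi for zi, yi in zip(z, y) if zi])
    y0 = mean([yi for zi, yi in zip(z, y) if not zi])
    d1 = mean([di for zi, di in zip(z, d) if zi])
    d0 = mean([di for zi, di in zip(z, d) if not zi])
    return (y1 - y0) / (d1 - d0)

# hypothetical population: only women comply with the instrument,
# and treatment effects are heterogeneous (2 for women, 5 for men)
random.seed(3)
n = 100_000
female = [random.random() < 0.5 for _ in range(n)]
z = [random.random() < 0.5 for _ in range(n)]
d = [1 if (f and zi) else 0 for f, zi in zip(female, z)]   # women comply
effect = [2 if f else 5 for f in female]
y = [e * di + random.gauss(0, 1) for e, di in zip(effect, d)]

print(round(wald(z, d, y)))   # 2: IV recovers the compliers' (women's) effect
```

Nothing in the output hints that men's effect is 5; the instrument simply never moves their behavior, so IV is silent about them.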
We call this the causal effect of the complier population; in our example, only females "complied" with the instrument, so we only know its effect for them. And secondly, instrumental variables are typically going to have larger standard errors, and as such, will fail to reject in many instances, if for no other reason than because they are under-powered. Moving along, let's return to the DAG. Notice that we drew the DAG such that Z has no connection to U. Z is independent of U. That is called the "exclusion restriction", which we will discuss in more detail later. But briefly, the IV estimator assumes that Z is independent of the variables that determine Y except for D. Secondly, Z is correlated with D, and because of its correlation with D (and D's effect on Y), Z is correlated with Y, but only through its effect on D. This relationship between Z and D is called the "first

stage", named that because of the two stage least squares estimator, which is a kind of IV estimator. The reason Z is only correlated with Y via D is because D is a collider along the path Z → D ← U → Y. How do you know when you have a good instrument? One, it will require a DAG, either an explicit one or an informal one. You can only identify a causal effect using IV if you can theoretically and logically defend the exclusion restriction, since the exclusion restriction is technically an untestable assumption. That defense requires theory, and since some people aren't comfortable with theoretical arguments like that, they tend to eschew the use of IV. More and more, applied microeconomists are skeptical of IV for this reason. But, let's say you think you do have a good instrument. How might you defend it as such to someone else? A necessary but not sufficient condition for having an instrument that can satisfy the exclusion restriction is if people are confused when you tell them about the instrument's relationship to the outcome. Let me explain. No one is going to be confused when you tell them that you think family size will reduce female labor supply. They don't need a Becker model to convince them that women who have more children probably work less than those with fewer children. It's common sense. But, what would they think if you told them that mothers whose first two children were the same gender worked less than those whose children had a balanced sex ratio? They would probably give you a confused look. What does the gender composition of your children have to do with whether a woman works? It doesn't; it only matters, in fact, if people whose first two children are the same gender decide to have a third child. Which brings us back to the original point: people buy that family size can cause women to work less, but they're confused when you say that women work less when their first two kids are the same gender.
But if, when you point out to them that the two children's gender composition induces people to have larger families than they would have otherwise, the person "gets it", then you might have an excellent instrument. Instruments are, in other words, jarring. They're jarring precisely because of the exclusion restriction: these two things (gender composition and work) don't seem to go together. If they did go together, it would likely mean that the exclusion restriction was violated. But if they don't, then the person is confused, and that is at minimum a possible candidate for a good instrument. This is the common sense explanation of the "only through" assumption. The following two sections differ from one another in the following sense: the next section makes the traditional assumption that all treatment effects are constant for all units. When this is assumed,

then the parameter estimated through an IV methodology equals the ATE, which equals the ATT, which equals the ATU. The variance will still be larger, because IV still only uses part of the variation in D, but the compliers are identical to the non-compliers, so the causal effect for the compliers is the same as the causal effect for all units. The section after the next one is explicitly based on the potential outcomes model. It assumes the more general case where each unit has a unique treatment effect. If each unit can have a different effect on Y, then the causal effect itself is a random variable. We've called this heterogeneous treatment effects. It is in this situation that the complier qualification we mentioned earlier matters, because if we are only identifying a causal effect for just a subset of the column of causal effects, then we are only estimating the treatment effects associated with the compliers themselves. This estimand is called the local average treatment effect (LATE), and it adds another wrinkle to your analysis. If the compliers' own average treatment effects are radically different from the rest of the population, then the LATE estimand may not be very informative. Heck, under heterogeneous treatment effects, there's nothing stopping the sign of the LATE from being different than the sign of the ATE! For this reason, we want to think long and hard about what our IV estimate means under heterogeneous treatment effects, because policy-makers will be implementing a policy based not on assigning Z, but rather on assigning D. And as such, both compliers and non-compliers will matter for the policy-makers, yet IV only identifies the effect for one of these. Hopefully this will become clearer as we progress.

Homogenous treatment effects and 2SLS

Instrumental variables methods are typically used to address omitted variable bias, measurement error, and simultaneity.
For instance, quantity and price are determined by the intersection of supply and demand, so any observational correlation between price and quantity is uninformative about the unique elasticities associated with supply or demand curves. Wright understood this, which was why he investigated the problem so intensely. We begin by assuming homogenous treatment effects. Homogenous treatment effects assumes that the treatment effect is the same for every unit. This is the traditional econometric pedagogy and is not based explicitly on the potential outcomes notation. Let's start by illustrating the omitted variable bias problem again. Assume the classical labor problem where we're interested in the causal effect of schooling on earnings, but schooling is endogenous

215 instrumental variables 215 because of unobserved ability. Let the true model of earnings be: g = d S + + A + # Y a i i i i is the log of earnings, Y A is S where is schooling measured in years, is an error term uncorrelated with school- individual “ability”, and # ing or ability. The reason is unobserved is simply because the A surveyor either forgot to collect it or couldn’t collect it and therefore 109 109 Unobserved ability doesn’t mean it’s it’s missing from your dataset. For instance, the CPS tells us noth- literally unobserved, in other words. It ing about respondents’ family background, intelligent, motivation or could be just missing from your dataset, non-cognitive ability. Therefore, since ability is unobserved, we have . and therefore is unobserved to you the following equation instead: = a Y d S + + h i i i g is a composite error term equalling h A where + # . We assume that i i i schooling is correlated with ability, so therefore it is correlated with , making it endogenous in the second, shorter regression. Only # is h i i uncorrelated with the regressors, and that is by definition. We know from the derivation of the least squares operator that the b d estimated value of is: E E ( Y , S ) ] S [ C [ YS ] E [ Y ] b d = = ) S V ( S ) ( V (from the longer model), we get the Y Plugging in the true value of following: 2 + E [ a S + S ] d + g SA + # S ] E ( S ) E [ a + d S + g A # b = d V ( S ) 2 2 E ( S S ) d E ( S ) ) + g E ( AS ) g E ( S ) E ( A )+ E ( # d ) E ( S ) E ( # = V S ) ( C ( AS ) d + = g S ( ) V b 0 and C ( A If S ) > 0 , then g d , the coefficient on schooling, is > , upward biased. And that is probably the case given that it’s likely that ability and schooling are positively correlated. Now, consistent with the IV DAG we discussed earlier, suppose Z there exists a variable, , that is correlated with schooling. We can i use this variable, as I’ll now show, to estimate . 
First, calculate the covariance of $Y$ and $Z$:

$$\begin{aligned}
C(Y,Z) &= C(\alpha + \delta S + \gamma A + \varepsilon, Z) \\
&= E[(\alpha + \delta S + \gamma A + \varepsilon)Z] - E(\alpha + \delta S + \gamma A + \varepsilon)E(Z) \\
&= \{\alpha E(Z) - \alpha E(Z)\} + \delta\{E(SZ) - E(S)E(Z)\} + \gamma\{E(AZ) - E(A)E(Z)\} + \{E(\varepsilon Z) - E(\varepsilon)E(Z)\} \\
&= \delta C(S,Z) + \gamma C(A,Z) + C(\varepsilon, Z)
\end{aligned}$$

Notice that the parameter of interest, $\delta$, is on the right-hand side. So how do we isolate it? We can estimate it with the following:

$$\widehat{\delta} = \frac{C(Y,Z)}{C(S,Z)}$$

so long as $C(\varepsilon, Z) = 0$ and $C(A,Z) = 0$.

These zero covariances are the statistical truth contained in the IV DAG from earlier. If ability is independent of $Z$, then the second covariance is zero. And if $Z$ is independent of the structural error term $\varepsilon$, then it too is zero. This, you see, is what is meant by the term "exclusion restriction": the instrument must be independent of both parts of the composite error term.

But the exclusion restriction is only a necessary condition for IV to work; it is not a sufficient condition. After all, if all we needed was exclusion, then we could use a random number generator for an instrument. Exclusion is not enough. We also need the instrument to be highly correlated with the endogenous variable, and the higher the better. We see that here because we are dividing by $C(S,Z)$, so it necessarily requires that this covariance be non-zero.

The numerator in this simple ratio is sometimes called the "reduced form", while the denominator is called the "first stage". These terms are somewhat confusing, particularly the former, as "reduced form" means different things to different people. But in the IV terminology, it is the relationship between the instrument and the outcome itself. The first stage is less confusing, as it gets its name from the two stage least squares estimator, which we'll discuss next.

When you take the probability limit of this expression, then assuming $C(A,Z)=0$ and $C(\varepsilon,Z)=0$ due to the exclusion restriction, you get

$$\text{plim } \widehat{\delta} = \delta$$

But if $Z$ is not independent of $\eta$ (either because it's correlated with $A$ or with $\varepsilon$), and if the correlation between $S$ and $Z$ is weak, then $\widehat{\delta}$ becomes severely biased.
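A minimal sketch of this covariance-ratio estimator, continuing the simulated schooling example (the instrument and all parameter values below are hypothetical, not estimates from any real dataset):

```python
import numpy as np

# Hypothetical instrument Z: correlated with schooling S, but independent
# of unobserved ability A and of the structural error.
rng = np.random.default_rng(1)
n = 1_000_000
a, d, g = 1.0, 0.5, 0.3                        # made-up true parameters

Z = rng.normal(size=n)                          # instrument
A = rng.normal(size=n)                          # unobserved ability
S = 12 + 1.5 * Z + 2 * A + rng.normal(size=n)   # first stage holds: C(S,Z) != 0
Y = a + d * S + g * A + rng.normal(size=n)

d_ols = np.cov(Y, S)[0, 1] / np.var(S)          # biased upward by ability
# IV: "reduced form" covariance over "first stage" covariance
d_iv = np.cov(Y, Z)[0, 1] / np.cov(S, Z)[0, 1]

print(round(d_ols, 2), round(d_iv, 2))          # roughly 0.58 and 0.50
```

The OLS slope is contaminated by the ability term, while the ratio of covariances recovers the true $\delta = 0.5$ because $Z$ was built to satisfy both exclusion conditions.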
Two stage least squares One of the more intuitive instrumental variables estimators is two-stage least squares (2SLS). Let's review an example to illustrate why I consider it helpful for explaining some of the IV intuition. Suppose you have a sample of data on $Y$, $S$ and $Z$. For each observation $i$, we assume the data are generated according to:

$$\begin{aligned}
Y_i &= \alpha + \delta S_i + \varepsilon_i \\
S_i &= \gamma + \beta Z_i + e_i
\end{aligned}$$

where $C(Z,\varepsilon)=0$ and $\beta \neq 0$. Now using our IV expression, and using the result that $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, we can write out the IV estimator as:

$$\begin{aligned}
\widehat{\delta} &= \frac{C(Y,Z)}{C(S,Z)} \\
&= \frac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(Y_i - \bar{Y})}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(S_i - \bar{S})} \\
&= \frac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})Y_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})S_i}
\end{aligned}$$

When we substitute the true model for $Y$, we get the following:

$$\begin{aligned}
\widehat{\delta} &= \frac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})\{\alpha + \delta S_i + \varepsilon_i\}}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})S_i} \\
&= \delta + \frac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})\varepsilon_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})S_i} \\
&= \delta + \text{``small if } n \text{ is large''}
\end{aligned}$$

So, let's return to our first description of $\widehat{\delta}$ as the ratio of two covariances. With some simple algebraic manipulation, we get the following:

$$\begin{aligned}
\widehat{\delta} &= \frac{C(Y,Z)}{C(S,Z)} \\
&= \frac{C(Z,Y)/V(Z)}{C(Z,S)/V(Z)}
\end{aligned}$$

where the denominator is equal to $\widehat{\beta}$.[110] That is,

$$\widehat{\beta} = \frac{C(Z,S)}{V(Z)}$$
$$\widehat{\beta}\,V(Z) = C(Z,S)$$

[110] We can rewrite $\widehat{\beta}$ using the first stage: $S_i = \gamma + \beta Z_i + e_i$.

Then we rewrite the IV estimator and make a substitution:

$$\begin{aligned}
\widehat{\delta}_{IV} &= \frac{C(Z,Y)}{C(Z,S)} \\
&= \frac{\widehat{\beta}\,C(Z,Y)}{\widehat{\beta}\,C(Z,S)} \\
&= \frac{\widehat{\beta}\,C(Z,Y)}{\widehat{\beta}^2\,V(Z)} \\
&= \frac{C(\widehat{\beta}Z, Y)}{V(\widehat{\beta}Z)}
\end{aligned}$$

Recall that $\widehat{S} = \widehat{\gamma} + \widehat{\beta}Z$ and let $S = \widehat{S} + \widehat{e}$. Then the 2SLS estimator is:

$$\widehat{\delta}_{IV} = \frac{C(\widehat{\beta}Z, Y)}{V(\widehat{\beta}Z)} = \frac{C(\widehat{S}, Y)}{V(\widehat{S})}$$

I will now show that $C(\widehat{S}, Y) = \widehat{\beta}\,C(Y,Z)$, and leave it to you to show that $V(\widehat{S}) = V(\widehat{\beta}Z)$.

$$\begin{aligned}
C(\widehat{S}, Y) &= E[\widehat{S}Y] - E[\widehat{S}]E[Y] \\
&= E[(\widehat{\gamma} + \widehat{\beta}Z)Y] - E(\widehat{\gamma} + \widehat{\beta}Z)E(Y) \\
&= \widehat{\gamma}E(Y) + \widehat{\beta}E(YZ) - \widehat{\gamma}E(Y) - \widehat{\beta}E(Y)E(Z) \\
&= \widehat{\beta}\,[E(YZ) - E(Y)E(Z)] \\
C(\widehat{S}, Y) &= \widehat{\beta}\,C(Y,Z)
\end{aligned}$$

Now let's return to something I said earlier: learning 2SLS can help you better understand the intuition of instrumental variables more generally. What does this mean exactly? It means several things. First, the 2SLS estimator used only the fitted values of the endogenous regressors for estimation. These fitted values were based on all variables used in the model, including the excludable instrument. And as all of these instruments are exogenous in the structural model, what this means is that the fitted values themselves have become exogenous too. Put differently, we are using only the variation in schooling that is exogenous. So that's kind of interesting, as now we're back in a world where we are identifying causal effects.

But now the less exciting news. This exogenous variation in $S$ driven by the instruments is only a subset of the total variation in the variable itself. Or put differently, IV reduces the variation in the data, so there is less information available for identification, and what little variation we have left comes from the complier population only. Hence the reason that, in large samples, we are estimating the LATE; that is, the causal effect for the complier population, where a complier is someone whose behavior was altered by the instrument.

Example 1: Meth and Foster Care As before, I feel that an example will help make this strategy more concrete.
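The equivalence between the covariance-ratio estimator and the two-step procedure can be verified numerically. This sketch uses the same simulated schooling setup as before (all parameter values made up): run the first stage, form fitted values, regress $Y$ on them, and compare to the ratio of covariances.

```python
import numpy as np

# Two-stage least squares by hand on simulated (hypothetical) data.
rng = np.random.default_rng(2)
n = 200_000
Z = rng.normal(size=n)
A = rng.normal(size=n)
S = 12 + 1.5 * Z + 2 * A + rng.normal(size=n)
Y = 1.0 + 0.5 * S + 0.3 * A + rng.normal(size=n)

# First stage: regress S on Z and form fitted values S_hat
b = np.cov(S, Z)[0, 1] / np.var(Z)
S_hat = S.mean() + b * (Z - Z.mean())

# Second stage: regress Y on the fitted values S_hat
d_2sls = np.cov(Y, S_hat)[0, 1] / np.var(S_hat)
# Covariance-ratio IV estimator from earlier
d_ratio = np.cov(Y, Z)[0, 1] / np.cov(S, Z)[0, 1]

assert abs(d_2sls - d_ratio) < 1e-8   # numerically identical
print(round(d_2sls, 2))               # close to the true 0.5
```

The identity holds because $C(\widehat{S}, Y) = \widehat{\beta}\,C(Y,Z)$ and $V(\widehat{S}) = \widehat{\beta}^2 V(Z)$, so $\widehat{\beta}$ cancels out of the second-stage slope, leaving $C(Y,Z)/C(S,Z)$.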
To illustrate, I'm going to review one of my papers with Keith Finlay examining the effect of methamphetamine abuse on child abuse and foster care admissions [Cunningham and Finlay,

2012]. It has been claimed that substance abuse, notably drug use, has a negative impact on parenting, such as neglect, but as these all occur in equilibrium, it's possible that the correlation is simply reflective of selection bias. In other words, perhaps households with parents who abuse drugs would've had the same negative outcomes had the parents not used drugs. After all, it's not like people are flipping coins when deciding to use meth. So let me briefly give you some background to the study so that you better understand the data generating process.

First, d-methamphetamine is like poison to the mind and body when abused. The effects of meth abuse include increased energy and alertness, decreased appetite, intense euphoria, impaired judgment, and psychosis. Second, the meth epidemic, as it came to be called, was geographically concentrated initially on the west coast before gradually making its way eastward over the 1990s.

What made this study possible, though, was meth's production process. Meth is synthesized from a reduction of ephedrine or pseudoephedrine, which is also the active ingredient in many cold medications, such as the behind-the-counter Sudafed. It is also worth noting that this key input (precursor) experienced a bottleneck in production. In 2004, nine factories manufactured the bulk of the world supply of ephedrine and pseudoephedrine. The DEA correctly noted that if they could regulate access to ephedrine and pseudoephedrine, then they could effectively interrupt the production of d-methamphetamine, and in turn reduce meth abuse and its associated social harms.

To understand this, it may be useful to see the two chemical molecules side by side. While the actual process of production is more complicated than this, the chemical reduction is nonetheless straightforward: start with ephedrine or pseudoephedrine, remove the hydroxyl group, add back the hydrogen. This gives you d-methamphetamine (see Figure 64).
So, with input from the DEA, Congress passed the Domestic Chemical Diversion Control Act in August 1995, which provided safeguards by regulating the distribution of products that contained ephedrine as the only medicinal ingredient. But the new legislation's regulations applied to ephedrine, not pseudoephedrine, and since the two precursors were nearly identical, traffickers quickly substituted from ephedrine to pseudoephedrine. By 1996, pseudoephedrine was found to be the primary precursor in almost half of meth lab seizures. Therefore, the DEA went back to Congress, seeking greater control over pseudoephedrine products. And the Comprehensive Methamphetamine Control Act of 1996 went into effect between October and December 1997. This Act required distributors of all forms of pseudoephedrine to be subject to chemical registration.

[Figure 64: Pseudoephedrine (top) vs d-methamphetamine (bottom)]

Dobkin and Nicosia [2009] argued that these precursor shocks may very well have been the largest supply shocks in the history of drug enforcement. The effect of the two interventions was dramatic. The first supply intervention caused retail (street) prices (adjusted for purity, weight and inflation) to more than quadruple; the second raised prices to around 2-3 times their long-run trend. See Figure 65.

We are interested in the causal effect of meth abuse on child abuse, and so our first stage is necessarily a proxy for meth abuse: the number of people entering treatment who listed meth as one of the substances they used in their last substance abuse episode. As I said

[Figure 65: Figure 3 from Cunningham and Finlay [2012] showing changing street prices following both supply shocks.]

before, since pictures speak a thousand words, I'm going to show you pictures of both the first stage and the reduced form. Why do I do this instead of going directly to the tables of coefficients? Because, quite frankly, you are more likely to find those estimates believable if you can see evidence for the first stage and the reduced form in the raw data itself.[111]

[111] While presenting figures of the first stage and reduced form isn't mandatory in the way that it is for regression discontinuity, it is nonetheless very commonly done. Ultimately, it is done because seeing is believing.

In Figure 66, we show the first stage, and you can see several things. All of these data come from the Treatment Episode Data Set (TEDS), which covers all people going into treatment for substance abuse at federally funded clinics. Patients list the last three substances used in the most recent "episode". We mark anyone who listed meth, cocaine or heroin as counts by month and state. Here we aggregate to the national level in Figure 66. You can see evidence for the effect the two interventions had on meth flows, particularly the ephedrine intervention. Self-admitted meth admissions dropped significantly, as did total meth admissions, but there's no effect on cocaine or heroin. The effect of the pseudoephedrine intervention is not as dramatic, but it appears to cause a break in trend, as the growth in meth admissions slows during this period of time.

In Figure 67, we graphically show the reduced form; that is, the effect of the price shocks on foster care admissions. Consistent with what we found in our first stage graphic, the ephedrine intervention in particular had a profoundly negative effect on foster care admissions. They fell from around 8,000 children removed per month to around 6,000, then began rising again. The second intervention also had an effect, though it appears to be milder. The reason we believe that the second intervention had a more modest effect than the first is because (1) the effect on price, as we saw earlier, was about half the

[Figure 66: Figure 5 from Cunningham and Finlay [2012] showing the first stage.]

size of the first intervention, and (2) domestic meth production was being replaced by Mexican imports of d-meth over the late 1990s. Thus, by the end of the 1990s, domestic meth production played a smaller role in total output, hence why the effect on price and admissions was probably smaller.

[Figure 67: Figure 4 from Cunningham and Finlay [2012] showing the reduced form effect of the interventions on children removed from families and placed into foster care.]

In Figure 68, we reproduce Table 3 from my article with Keith. There are a few pieces of key information that all IV tables should have. First, there is the OLS regression. As the OLS regression suffers from endogeneity, we want the reader to see it so that they have something

to compare the IV model with. Let's focus on column 1, where the dependent variable is total entry into foster care. We find no effect, interestingly, of meth on foster care.

[Figure 68: Table 3 from Cunningham and Finlay [2012] showing OLS and 2SLS estimates of meth on foster care admissions.]

The second piece of information that one should report in a table is the first stage itself. We report the first stage at the bottom of each even numbered column. As you can see, for each one unit deviation in price from its long-run trend, meth admissions into treatment (our proxy) fell by 0.0005 log points. This is highly significant at the 1% level, but we also check for the strength of the instrument using the F statistic [Staiger and Stock, 1997].[112] We have an F statistic of 17.6, which suggests that our instrument is strong enough for identification.

[112] In a sense, I am probably getting ahead of myself, as we technically haven't introduced weak instrument tests. But I wanted to walk you through an IV paper before getting too far into the weeds. We will circle back around and discuss weak instruments later, but for now know that Staiger and Stock [1997] suggested that weak instruments were a problem when an F test on the excludability of the instrument from the first stage was less than 10. That paper was not the last word. See Stock and Yogo [2005] if you're interested in precise, quantitative definitions of weak instruments.

Finally, there is the 2SLS estimate of the treatment effect itself. Notice that, using only the exogenous variation in log meth admissions, and

assuming the exclusion restriction holds in our model, we are able to isolate a causal effect of log meth admissions on log aggregate foster care admissions. As this is a log-log regression, we can interpret the coefficient as an elasticity. We find that a 10% increase in meth admissions for treatment appears to cause around a 15% increase in children removed from their homes and placed into foster care. This effect is both large and precise. And notice, it was not detectable otherwise (the OLS coefficient was zero).

Why are they being removed? Our data (AFCARS) lists several channels: parental incarceration, child neglect, parental drug use, and physical abuse. Interestingly, we do not find any effect of parental drug use or parental incarceration, which is perhaps somewhat counterintuitive. Their signs are negative and their standard errors are large. Rather, we find effects of meth admissions on removals for physical abuse and neglect. Both are elastic (i.e., elasticities greater than 1).

What did we learn from this paper? Well, we learned two kinds of things. First, we learned how a contemporary piece of applied microeconomics goes about using instrumental variables to identify causal effects. We saw the kinds of graphical evidence mustered, the way in which knowledge about the natural experiment and the policies involved helped the authors argue for the exclusion restriction (since it cannot be tested), and the kind of evidence presented from 2SLS, including the first stage tests for weak instruments. Hopefully seeing a paper at this point was helpful. But the second thing we learned concerned the actual study itself.
We learned that for the group of meth users whose behavior was changed as a result of rising real prices of a pure gram of d-methamphetamine (i.e., the complier subpopulation), their meth use was causing child abuse and neglect so severe that it merited removing their children and placing those children into foster care. If you were only familiar with Dobkin and Nicosia [2009], who found no effect of meth on crime using county-level data from California and only the 1995 ephedrine shock, you might incorrectly conclude that there are no social costs associated with meth abuse. But while meth does not appear to cause crime, it does appear to harm the children of meth users and place strains on the foster care system.

Example 2: Compulsory Schooling and Weak Instruments I am not trying to smother you with papers. But before we move back into the technical material itself, I'd like to discuss one more paper. This paper is interesting and important in and of itself, but even putting that aside, it will also help you better understand the weak instrument literature which followed.

As we've said since the beginning, with example after example, there is a very long tradition in labor economics of building models that can credibly identify the returns to schooling. This goes back to Becker [1994] and the Labor workshop at Columbia that Becker ran for years with Jacob Mincer. This has been an important task given education's growing importance in the distribution of income and wealth in the latter 20th century due to the increasing returns to skill in the marketplace [Juhn et al., 1993].

One of the more seminal papers in instrumental variables for the modern period is Angrist and Krueger [1991]. The idea is simple and clever; a quirk in the United States educational system is that a child is chosen for a grade based on when their birthday is. For a long time, that cutoff was late December. If a child was born on or before December 31st, then they were assigned to the first grade. But if their birthday was on or after January 1st, they were assigned to kindergarten. Thus these two people, one born on December 31st and one born on January 1st, were exogenously assigned different grades.

Now there's nothing necessarily relevant here, because if they always stay in school for the duration of time necessary to get a high school degree, then that arbitrary assignment of start date won't affect high school completion. It'll only affect the age they are when they get that high school degree. But this is where the quirk gets interesting. For most of the 20th century, the US had compulsory schooling laws which forced a person to remain in high school until they reached age 16. After they hit age 16, they could legally drop out. Figure 69 explains their instrumental variable visually.

[Figure 69: Angrist and Krueger [1991] explanation of their instrumental variable.]

Angrist and Krueger had the insight that that small quirk was exogenously assigning more schooling to people born later in the year. The person born in December would reach age 16 with more education than the person born in January, in other words. Thus, the authors had exogenous variation in schooling.[113]

[113] Notice how similar their idea was to regression discontinuity. That's because IV and RDD are conceptually very similar strategies.

In Figure 70, Angrist and Krueger [1991] visually show the reader the first stage, and it is really interesting.

[Figure 70: Angrist and Krueger [1991] first stage relationship between quarter of birth and schooling.]

Men born in the 3rd and 4th quarters have more schooling than men born in the 1st and 2nd quarters on average. This indicates that there is a first stage. That relationship gets weaker as we move into later cohorts, but that is probably because, for later cohorts, the price on higher levels of schooling was rising so much that fewer and fewer people were dropping out before finishing their high school degree.

Figure 71 shows the reduced form visually. That is, here we see a simple graph showing the relationship between quarter of birth and log weekly earnings.[114]

[Figure 71: Angrist and Krueger [1991] reduced form visualization of the relationship between quarter of birth and log weekly earnings.]

[114] I know, I know. No one has ever accused me of being subtle. But it's an important point: a picture speaks a thousand words. If you can communicate your first stage and reduced form in pictures, you always should, as it will really captivate the reader's attention and be far more compelling than a simple table of coefficients ever could.

You have to squint your eyes a little bit, but you can see the pattern: all along the top of the jagged path are

and 4s, and all along the bottom of the jagged path are 1s and 2s. Not always, but it's correlated.

Let's take a sidebar. Remember what I said about how instruments have a certain ridiculousness to them? That is, you know you have a good instrument if the instrument itself doesn't seem relevant for explaining the outcome of interest, because that's what the exclusion restriction implies. Why would quarter of birth affect earnings? It doesn't make any obvious, logical sense why it should. But if I told you that people born later in the year got more schooling than those born earlier because of compulsory schooling, then the relationship between the instrument and the outcome snaps into place. The only reason we can think of as to why the instrument would affect earnings is if the instrument was operating through schooling. Instruments only explain the outcome, in other words, when you understand their effect on the endogenous variable.115

[115] This is why I chose those particular Chance the Rapper lyrics as this chapter's epigraph. There's no reason why making "Sunday Candy" would keep Chance from going to hell. Without knowing the first stage, it makes no obvious sense!

Angrist and Krueger use three dummies as their instruments: a dummy for first quarter, a dummy for second quarter and a dummy for third quarter. Thus the omitted category is the fourth quarter, which is the group that gets the most schooling. Now ask yourself this: if we regressed years of schooling onto those three dummies, what should the signs and magnitudes be? That is, what would we expect the relationship between the first quarter (compared to the fourth quarter) and schooling to be? Let's look at their first stage results and see if it matched your intuition (Figure 72).

Figure 72: Angrist and Krueger [1991] first stage for different outcomes.

Figure 72 shows the first stage from a regression of the following form:

S = Xπ_10 + Z_1 π_11 + Z_2 π_12 + Z_3 π_13 + η_1

where Z_i is the dummy for each of the first three quarters of birth and π_1i is the coefficient on each dummy. Now we look at what they produced in Figure 72. Consistent with our intuition, the coefficients are all negative and significant for the total years of education and the high school graduate dependent variables. Notice, too, that the relationship gets much weaker once we move beyond the groups bound by compulsory schooling: the number of years of schooling for high school students (no effect), and the probability of being a college graduate (no effect).

Regarding those college non-results, ask yourself this question: why should we expect quarter of birth to affect the probability of being a high school graduate, but not of being a college grad? What if we had found quarter of birth predicted high school completion, college completion, post-graduate completion, and total years of schooling beyond high school? Wouldn't it start to seem like this compulsory schooling instrument was not what we thought it was? After all, this quarter of birth instrument really should only impact high school completion; since it doesn't bind anyone beyond high school, it shouldn't affect the number of years beyond high school or college completion probabilities. If it did, we might be skeptical of the whole design. But here it didn't, which to me makes it even more convincing that they're identifying a compulsory high school schooling effect.116

[116] These kinds of falsifications are extremely common in contemporary applied work. This is because many of the identifying assumptions in any research design are simply untestable. And so the burden of proof is on researchers to convince the reader, oftentimes with intuitive and transparent falsification tests.

Now we look at the second stage for both OLS and 2SLS (which they label TSLS, but it means the same thing). Figure 73 shows these results.
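To build intuition for what this first stage looks like, here is a minimal simulation sketch in Python (the quarter-of-birth effect size and noise are made-up magnitudes, not Angrist and Krueger's data): regress simulated schooling on the three quarter dummies and confirm the coefficients come out negative, with the first quarter most negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Quarter of birth, 1-4 with equal probability.
q = rng.integers(1, 5, size=n)
# Stylized dropout mechanism: each later quarter adds ~0.1 years of
# completed schooling by age 16 (made-up magnitude), plus noise.
schooling = 12.0 + 0.1 * q + rng.normal(0, 2, size=n)

# First stage: schooling on Q1-Q3 dummies, Q4 omitted.
X = np.column_stack([
    np.ones(n),              # intercept picks up the Q4 mean
    (q == 1).astype(float),
    (q == 2).astype(float),
    (q == 3).astype(float),
])
coef, *_ = np.linalg.lstsq(X, schooling, rcond=None)
print(coef[1:])  # roughly [-0.3, -0.2, -0.1]: negative, Q1 most negative
```

The dummy coefficients are the gaps between each quarter's mean schooling and the omitted fourth quarter's, which is exactly what the signs in Figure 72 capture.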
The authors didn't report the first stage in this table because they reported it in the earlier table we just reviewed.117 For small values, the log approximates a percentage change, so they are finding a 7.1% return for every additional year of schooling, but with 2SLS it's higher (8.9%). That's interesting, because if it was merely ability bias, then we'd expect the OLS estimate to be too large, not too small. So something other than mere ability bias must be going on here.

[117] My personal preference is to report everything in the same table, mainly for design reasons. I like fewer tables, with each table having more information. In other words, I want someone to look at an instrumental variables table and immediately see the OLS result, the 2SLS result, the first stage relationship, and the F statistic on that first stage. See Figure 68 for an example.

For whatever it's worth, I am personally convinced at this point that quarter of birth is a valid instrument, and that they've identified a causal effect of schooling on earnings, but Angrist and Krueger [1991] want to go further, probably because they want more precision in their estimate. And to get more precision, they load up the first stage with even more instruments. Specifically, they use specifications with 30 dummies (quarter of birth × year) and 150 dummies (quarter

of birth × state) as instruments. The idea is that the quarter of birth effect may differ by state and cohort. Because they have more variation in the instrument, the predicted values of schooling also have more variation, which brings down the standard errors.

Figure 73: Angrist and Krueger [1991] OLS and 2SLS results for the effect of education on log weekly earnings (IV estimates, birth cohorts 20-29, 1980 Census).

But at what cost? Many of these instruments are only weakly correlated with schooling - in some locations they have almost no correlation, and for some cohorts as well. We got a flavor of that, in fact, in Figure 70, where the later cohorts show less variation in schooling by quarter of birth than the earlier cohorts. What is the effect, then, of reducing the variance in the estimator by loading up the first stage with a bunch of noise?

Work on this starts with Bound et al. [1995] and is often called the "weak instrument" literature. It's in this paper that we learn some basic practices for determining if we have a weak instrument problem, as well as an understanding of the nature of the bias of IV in finite samples and under different violations of the IV assumptions. Bound et al. [1995] sought to understand what IV was identifying when the first stage was weak, as it was when Angrist and Krueger [1991] loaded up their first stage with 180 instruments, many of which were very weak.

Let's review Bound et al. [1995] now and consider their model with a single endogenous regressor and a simple constant treatment

effect. The causal model of interest here is as before:

y = βs + ε

where y is some outcome and s is some endogenous regressor, such as schooling. The matrix of IVs is Z, with the first stage equation

s = Z′π + η

If ε and η are correlated, then estimating the first equation by OLS would lead to biased results, wherein the OLS bias is:

E[β̂_OLS] = β + C(s, ε)/V(s)

We will rename this ratio as σ_εη/σ_η^2. It can be shown that the bias of 2SLS is approximately:

E[β̂_2SLS] − β ≈ (σ_εη/σ_η^2) × 1/(F + 1)

where F is the population analogy of the F statistic for the joint significance of the instruments in the first stage regression. If the first stage is weak, so that F → 0, then the bias of 2SLS approaches σ_εη/σ_η^2. But if the first stage is very strong, F → ∞, then the 2SLS bias goes to 0.

Returning to our rhetorical question from earlier, what was the cost of adding instruments without predictive power? Adding more weak instruments causes the first stage F statistic to approach zero and increases the bias of 2SLS.

What if the model is "just identified", meaning there's the same number of instruments as there are endogenous variables? Bound et al. [1995] studied this empirically, replicating Angrist and Krueger [1991], and using simulations. Figure 74 shows what happens once they start adding in controls. Notice that as they do, the F statistic on the excludability of the instruments falls from 13.5 to 4.7 to 1.6. So by the F statistic, they are already running into a weak instrument problem once they include the 30 quarter of birth × year dummies, and I think that's because, as we saw, the relationship between quarter of birth and schooling got smaller for the later cohorts.

Next, they added in the weak instruments – all 180 of them – which is shown in Figure 75. And here we see that the problem persists. The instruments are weak, and therefore the bias of the 2SLS coefficient is close to that of the OLS bias.
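The bias formula can be checked with a small Monte Carlo. The sketch below (my own simulation, not Bound et al.'s) uses 20 instruments and a true effect of zero, so any nonzero average 2SLS estimate is pure bias; with a weak first stage the estimate sits near the OLS-style bias σ_εη/σ_η^2 = 0.8, and with a strong first stage it collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, reps = 1000, 20, 500
beta = 0.0                               # true causal effect is zero

def tsls_draw(pi_scale):
    """One 2SLS estimate with k instruments of strength pi_scale."""
    Z = rng.normal(size=(n, k))
    eta = rng.normal(size=n)
    eps = 0.8 * eta + rng.normal(size=n)   # eps and eta correlated
    s = Z @ np.full(k, pi_scale) + eta     # first stage
    y = beta * s + eps                     # structural equation
    s_hat = Z @ np.linalg.lstsq(Z, s, rcond=None)[0]
    return (s_hat @ y) / (s_hat @ s)

weak = np.mean([tsls_draw(0.01) for _ in range(reps)])    # F near zero
strong = np.mean([tsls_draw(0.50) for _ in range(reps)])  # F very large
print(weak, strong)  # weak is badly biased; strong is near the truth, 0
```

In the weak case the average estimate is a large fraction of 0.8, consistent with the 1/(F + 1) approximation; in the strong case it is essentially unbiased.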
But the really damning part of the Bound et al. [1995] paper was their simulation. The authors write:

“To illustrate that second-stage results do not give us any indication of the existence of quantitatively important finite-sample biases, we reestimated Table 1, columns (4) and (6), and Table 2, columns (2)

and (4), using randomly generated information in place of the actual quarter of birth, following a suggestion by Alan Krueger. The means of the estimated standard errors reported in the last row are quite close to the actual standard deviations of the 500 estimates for each model. . . . It is striking that the second-stage results reported in Table 3 look quite reasonable even with no information about educational attainment in the simulated instruments. They give no indication that the instruments were randomly generated. . . . On the other hand, the F statistics on the excluded instruments in the first-stage regressions are always near their expected value of essentially 1 and do give a clear indication that the estimates of the second-stage coefficients suffer from finite-sample biases.”

Figure 74: Bound et al. [1995] OLS and 2SLS results for the effect of education on log weekly earnings. Key rows of their Table 1 (standard errors in parentheses):

                                   (1)     (2)     (3)     (4)     (5)     (6)
                                   OLS     IV      OLS     IV      OLS     IV
  Coefficient                      .063    .142    .063    .081    .063    .060
                                  (.000)  (.033)  (.000)  (.016)  (.000)  (.029)
  F (excluded instruments)                13.486          4.747           1.613
  Number of excluded instruments           3               30              28

So, what can you do if you have weak instruments? First, you can use a just identified model with your strongest IV. Second, you can use a limited information maximum likelihood estimator (LIML). This is approximately median unbiased for overidentified constant effects models. It provides the same asymptotic distribution as 2SLS under homogenous treatment effects, but provides a finite-sample bias reduction.

But, let's be real for a second. If you have a weak instrument problem, then you only get so far by using LIML or estimating a just identified model. The real solution for a weak instrument problem is get better instruments. Under homogenous treatment effects, you're always identifying the same effect, so there's no worry about a complier-only parameter. So you should just continue searching for stronger instruments that simultaneously satisfy the exclusion restriction.118

[118] Good luck with that. Seriously, good luck.

In conclusion, circling back to where we started, I think we've learned a lot about instrumental variables and why it is so powerful. The estimators based on this design are capable of identifying causal effects when your data suffer from selection on unobservables. Since selection on unobservables is believed to be very common, this is a

very useful methodology for addressing it. But, that said, we have also learned some of its weaknesses, and hence why some people eschew it.

Figure 75: Bound et al. [1995] OLS and 2SLS results for the effect of education on log weekly earnings with the 100+ weak instruments. Key rows of their Table 2, which adds state of birth interactions (standard errors in parentheses):

                                   (1)     (2)     (3)     (4)
                                   OLS     IV      OLS     IV
  Coefficient                      .063    .083    .063    .081
                                  (.000)  (.009)  (.000)  (.011)
  F (excluded instruments)                2.428           1.869
  Number of excluded instruments          180             178
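Bound et al.'s randomly-generated-instrument exercise is easy to approximate. In the sketch below (an illustrative re-creation, not their actual specification or data), the true return is zero, thirty pure-noise instruments are used, and on average the first-stage F sits at its null value of about 1 while the 2SLS estimate sits right on top of the biased OLS value of about 0.5 — looking perfectly "reasonable".

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, reps = 5000, 30, 200

def random_instrument_draw():
    # True return is ZERO; an unobserved factor drives both s and y,
    # so OLS (and here, 2SLS) converges to about 0.5 instead.
    eta = rng.normal(size=n)
    s = eta
    y = 0.5 * eta + rng.normal(size=n)
    Z = rng.normal(size=(n, k))                      # pure-noise "IVs"
    s_hat = Z @ np.linalg.lstsq(Z, s, rcond=None)[0]
    rss_full = (s - s_hat) @ (s - s_hat)
    F = ((s @ s - rss_full) / k) / (rss_full / (n - k))
    return F, (s_hat @ y) / (s_hat @ s)

F_bar, b_bar = np.mean([random_instrument_draw() for _ in range(reps)], axis=0)
print(F_bar, b_bar)  # F near 1; 2SLS near 0.5, with no hint it's garbage
```

The second stage alone gives no warning at all; only the first-stage F statistic reveals that the instruments contain no information.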
Let's now move to heterogenous treatment effects so that we can understand some of its limitations a bit better.

Heterogenous treatment effects

Now we turn to the more contemporary pedagogy, where we relax the assumption that treatment effects are the same for every unit. Now we will allow for each unit to have a unique response to the treatment, or

δ_i = Y_i^1 − Y_i^0

Note that the treatment effect parameter now differs by individual i. The main questions we have now are: (1) what is IV estimating when we have heterogenous treatment effects, and (2) under what assumptions will IV identify a causal effect with heterogenous treatment effects? The reason why this matters is that once we introduce

heterogenous treatment effects, we introduce a distinction between the internal validity of a study and its external validity. Internal validity means our strategy identified a causal effect for the population we studied. But external validity means the study's finding applies to different populations (not in the study). As we'll see, under homogenous treatment effects there is no such tension between external and internal validity, because everyone has the same treatment effect. But under heterogenous treatment effects, there is a huge tension; the tension is so great, in fact, that it may even undermine an otherwise valid IV design.

Heterogenous treatment effects are built on top of the potential outcomes notation, with a few modifications. Since we now have two arguments - D and Z - we have to modify the notation slightly. We say that Y is a function of D and Z as Y_i(D_i = 0, Z_i = 1), which is represented as Y_i(0, 1).

Potential outcomes as we have been using the term refers to the Y variable, but now we have a new potential variable - potential treatment status (as opposed to observed treatment status). Here are the characteristics:

• D_i^1 = i's treatment status when Z_i = 1

• D_i^0 = i's treatment status when Z_i = 0

• And observed treatment status is based on a treatment status switching equation:

D_i = D_i^0 + (D_i^1 − D_i^0)Z_i = π_0 + π_1i Z_i + φ_i

where π_0 = E[D_i^0], π_1i = (D_i^1 − D_i^0) is the heterogenous causal effect of the IV on D_i, and E[π_1i] = the average causal effect of Z_i on D_i.

There are considerably more assumptions necessary for identification once we introduce heterogenous treatment effects - specifically five assumptions. We now review each of them. And to be concrete, I will repeatedly use as an example the effect of military service on earnings, using a draft lottery as the instrumental variable [Angrist,
1990]. First, as before, there is a stable unit treatment value assumption (SUTVA), which states that the potential outcomes for each person i are unrelated to the treatment status of other individuals. The assumption states that if Z_i = Z_i′, then D_i(Z) = D_i(Z′). And if Z_i = Z_i′ and D_i = D_i′, then Y_i(D, Z) = Y_i(D′, Z′). A violation of SUTVA would be if the status of a person at risk of being drafted was affected by the draft status of others at risk of being drafted. Such spillovers violate SUTVA.119

[119] Probably no other identifying assumption is given shorter shrift than SUTVA. Rarely is it mentioned in applied studies, let alone taken seriously.

Second, there is the independence assumption. The independence assumption is also sometimes called the "as good as random assignment" assumption. It states that the IV is independent of the potential outcomes and potential treatment assignments. Notationally, it is

{Y_i(D_i^1, 1), Y_i(D_i^0, 0), D_i^1, D_i^0} ⊥ Z_i

The independence assumption is sufficient for a causal interpretation of the reduced form:

E[Y_i | Z_i = 1] − E[Y_i | Z_i = 0] = E[Y_i(D_i^1, 1) | Z_i = 1] − E[Y_i(D_i^0, 0) | Z_i = 0]
                                   = E[Y_i(D_i^1, 1)] − E[Y_i(D_i^0, 0)]

Independence means that the first stage measures the causal effect of Z_i on D_i:

E[D_i | Z_i = 1] − E[D_i | Z_i = 0] = E[D_i^1 | Z_i = 1] − E[D_i^0 | Z_i = 0]
                                    = E[D_i^1 − D_i^0]

An example of this is if Vietnam conscription for military service was based on randomly generated draft lottery numbers. The assignment of draft lottery number was independent of potential earnings or potential military service because it was "as good as random".

Third, there is the exclusion restriction. The exclusion restriction states that any effect of Z on Y must be via the effect of Z on D. In other words, Y_i(D_i, Z_i) is a function of D_i only. Or formally:

Y_i(D_i, 0) = Y_i(D_i, 1)  for D_i = 0, 1

Again, our Vietnam example. In the Vietnam draft lottery, an individual's earnings potential as a veteran or a non-veteran is assumed to be the same regardless of draft eligibility status. The exclusion restriction would be violated if low lottery numbers affected schooling by people avoiding the draft. If this was the case, then the lottery number would be correlated with earnings for at least two reasons: one, through the instrument's effect on military service, and two, through the instrument's effect on schooling.
The implication is that a random lottery number (independence) does not therefore imply that the exclusion restriction is satisfied. These are different assumptions.

Fourth is the first stage. IV under heterogenous treatment effects requires that Z be correlated with the endogenous variable such that

E[D_i^1 − D_i^0] ≠ 0

Z has to have some statistically significant effect on the average probability of treatment. An example would be having a low lottery

number. Does it increase the average probability of military service? If so, then it satisfies the first stage requirement. Note that unlike independence and exclusion, the first stage is testable, as it is based solely on D and Z, both of which you have data on.

And finally, the monotonicity assumption. This is only strange at first glance, but is actually quite intuitive. Monotonicity requires that the instrumental variable (weakly) operate in the same direction on all individual units. In other words, while the instrument may have no effect on some people, all those who are affected are affected in the same direction (i.e., positively or negatively, but not both). We write it out like this:

Either π_1i ≥ 0 for all i, or π_1i ≤ 0 for all i, i = 1, . . . , N

What this means, as an example using our military draft example, is that draft eligibility may have no effect on the probability of military service for some people, like patriots, but when it does have an effect, it shifts them all into service, or out of service, but not both. The reason that we have to make this assumption is that without monotonicity, IV estimators are not guaranteed to estimate a weighted average of the underlying causal effects of the affected group.

If all five assumptions are satisfied, then we have a valid IV strategy. But that being said, while valid, it is not doing what it was doing when we had homogenous treatment effects. What, then, is the IV strategy estimating under heterogenous treatment effects? Answer: the local average treatment effect (LATE) of D on Y:

δ_IV,LATE = (Effect of Z on Y) / (Effect of Z on D)
          = (E[Y_i(D_i^1, 1)] − E[Y_i(D_i^0, 0)]) / E[D_i^1 − D_i^0]
          = E[(Y_i^1 − Y_i^0) | D_i^1 − D_i^0 = 1]

The LATE parameter is the average causal effect of D on Y for those whose treatment status was changed by the instrument, Z_i.
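The LATE expression equates the Wald ratio with the complier average effect. A sketch of the standard argument behind that equality, using the assumptions above:

```latex
\begin{align*}
E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]
  &= E[(Y_i^1 - Y_i^0)(D_i^1 - D_i^0)]
     && \text{(independence, exclusion)} \\
  &= E[Y_i^1 - Y_i^0 \mid D_i^1 - D_i^0 = 1]\,P(D_i^1 - D_i^0 = 1)
     && \text{(monotonicity rules out } D_i^1 - D_i^0 = -1\text{)}
\end{align*}
```

Since the first stage, E[D_i^1 − D_i^0], equals P(D_i^1 − D_i^0 = 1) under monotonicity, dividing the reduced form by the first stage leaves exactly the average effect among those with D_i^1 − D_i^0 = 1.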
For instance, IV estimates the average effect of military service on earnings for the subpopulation who enrolled in military service because of the draft but who would not have served otherwise. It doesn't identify the causal effect on patriots who always serve, for instance, because those individuals did not have their military service pushed or pulled by the draft number. It also won't tell us the effect of military service on those who were exempted from military service for medical reasons.120

[120] We have reviewed the properties of IV with heterogenous treatment effects using a very simple example: a dummy endogenous variable, a dummy IV, and no additional controls. The intuition of LATE generalizes to most cases where we have continuous endogenous variables and instruments, and additional control variables, as well.

The LATE framework has even more jargon, so let's review it now. The LATE framework partitions the population of units with

an instrument into potentially four mutually exclusive groups. Those groups are:

1. Compliers: this is the subpopulation whose treatment status is affected by the instrument in the correct direction. That is, D_i^1 = 1 and D_i^0 = 0.

2. Defiers: this is the subpopulation whose treatment status is affected by the instrument in the wrong direction. That is, D_i^1 = 0 and D_i^0 = 1.121

3. Never takers: this is the subpopulation of units that never take the treatment regardless of the value of the instrument. So D_i^1 = D_i^0 = 0. They simply never take the treatment.122

4. Always takers: this is the subpopulation of units that always take the treatment regardless of the value of the instrument. So D_i^1 = D_i^0 = 1. They simply always take the treatment.123

[121] So for instance, say that we have some instrument for attending a private school. Compliers go to the school if they win the lottery, and don't go to the school if they don't. Defiers attend the school if they don't win, but don't attend if they do win. Defiers sound like jerks.

[122] Sticking with our private school lottery example, this is a group of people who believe in public education, and so even if they win the lottery, they won't go. They're never-takers; they never go to private school no matter what.

[123] This is a group of people who always send their kids to private school, regardless of the number on their voucher lottery.

As outlined above, with all five assumptions satisfied, IV estimates the average treatment effect for compliers. Contrast this with the traditional IV pedagogy with homogenous treatment effects. In that situation, compliers have the same treatment effects as non-compliers, so the distinction is irrelevant. Without further assumptions, LATE is not informative about effects on never-takers or always-takers, because the instrument does not affect their treatment status.

Does this matter? Yes, absolutely.
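A quick simulation makes the complier-only nature of LATE concrete. In this sketch (hypothetical group shares and effect sizes), the population average treatment effect is 1.1, but the Wald/IV estimand recovers 2.0 — the compliers' effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Latent groups: no defiers, so monotonicity holds by construction.
group = rng.choice(["complier", "never", "always"], size=n, p=[0.3, 0.5, 0.2])
Z = rng.integers(0, 2, size=n)                 # randomized binary instrument
D1 = np.where(group == "never", 0, 1)          # treatment status if Z = 1
D0 = np.where(group == "always", 1, 0)         # treatment status if Z = 0
D = np.where(Z == 1, D1, D0)

# Heterogeneous effects: compliers 2.0, always-takers 5.0, never-takers -1.0,
# so the population ATE is 0.3*2.0 + 0.2*5.0 + 0.5*(-1.0) = 1.1.
delta = np.select([group == "complier", group == "always"], [2.0, 5.0], -1.0)
Y = delta * D + rng.normal(size=n)

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
print(wald)  # close to 2.0 (the complier effect), not 1.1 (the ATE)
```

The instrument never moves the always-takers or never-takers, so their (quite different) effects simply cannot show up in the Wald ratio.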
It matters because in most applications we would be most interested in estimating the average treatment effect for the whole population, but that's not usually possible with IV.[124]

[124] The identification of the LATE under heterogeneous treatment effects was worked out in Angrist et al. [1996]. See it for more details.

Now that we have reviewed the basic idea and mechanics of instrumental variables, including some of the more important tests associated with it, let's get our hands dirty with some data. We'll work with a couple of datasets now to help you better understand how to implement 2SLS in real data.

Stata exercise #1: College in the county

We will once again look at the returns to schooling, since it is such a historically popular topic for causal questions in labor. In this application, we will simply show how to use the Stata command ivregress 2sls, calculate the first-stage F statistic, and compare the 2SLS results with the OLS results. I will be keeping it simple, because my goal is just to help the reader become familiar with the procedure.

The data come from the NLS Young Men Cohort of the National Longitudinal Survey. This data began in 1966 with 5,525 men aged

14-24 and continued to follow up with them through 1981. These data come from 1966, the baseline survey, and there are a number of questions related to local labor markets. One of them is whether the respondent lives in the same county as a 4-year (and a 2-year) college.

Card [1995] is interested in estimating the following regression equation:

$$Y_i = \alpha + \delta S_i + \gamma X_i + \varepsilon_i$$

where Y is log earnings, S is years of schooling, X is a matrix of exogenous covariates, and ε is an error term that contains, among other things, unobserved ability. Under the assumption that ε contains ability, and ability is correlated with schooling, then C(S, ε) ≠ 0 and therefore the coefficient on schooling is biased. Card [1995] therefore proposes an instrumental variables strategy whereby he will instrument for schooling with the college-in-the-county dummy variable.

It is worth asking ourselves why the presence of a four-year college in one's county would increase schooling. The main reason that I can think of is that the presence of the 4-year college increases the likelihood of going to college by lowering the costs, since the student can live at home. This means, though, that we are selecting on a group of compliers whose behavior is affected by the variable. Some kids, in other words, will always go to college regardless of whether a college is in their county, and some will never go despite the presence of the nearby college. But there may exist a group of compliers who go to college only because their county has a college, and if I'm right that this is primarily picking up people going because they can attend while living at home, then it's necessarily people at some margin who attend only because college became slightly cheaper. This is, in other words, a group of people who are liquidity constrained. And if we believe the returns to schooling for this group differ from those of the always-takers, then our estimates may not represent the ATE.
Rather, they would represent the LATE. But in this case, that might actually be an interesting parameter, since it gets at the issue of lowering the costs of attendance for poorer families. Here we will do some simple analysis based on Card [1995].

. scuse card
. reg lwage educ exper black south married smsa
. ivregress 2sls lwage (educ=nearc4) exper black south married smsa, first
. reg educ nearc4 exper black south married smsa
. test nearc4

And our results from this analysis have been arranged into Table 28. First, we report our OLS results. For every one additional year

of schooling, respondents' earnings increase by approximately 7.1%. Next we estimated 2SLS using the ivregress 2sls command in Stata. Here we find a much larger return to schooling than we had found using OLS, around 75% larger in fact. But let's look at the first stage first. We find that the college in the county is associated with 0.327 more years of schooling. This is highly significant (p < 0.001). The F-statistic exceeds 15, suggesting we don't have a weak instrument problem. The return to schooling associated with this 2SLS estimate is 0.124; that is, for every additional year of schooling, earnings increase by 12.4%. Other covariates are listed if you're interested in studying them as well.

Table 28: OLS and 2SLS regressions of Log Earnings on Schooling

                               Dependent variable: Log earnings
                                       OLS          2SLS
  educ                              0.071***      0.124**
                                   (0.003)       (0.050)
  exper                             0.034***      0.056***
                                   (0.002)       (0.020)
  black                            -0.166***     -0.116**
                                   (0.018)       (0.051)
  south                            -0.132***     -0.113***
                                   (0.015)       (0.023)
  married                          -0.036***     -0.032***
                                   (0.003)       (0.005)
  smsa                              0.176***      0.148***
                                   (0.015)       (0.031)
  First Stage Instrument
  College in the county                           0.327***
  Robust standard error                          (0.082)
  F statistic for IV in first stage              15.767
  N                                 3,003         3,003
  Mean Dependent Variable           6.262         6.262
  Std. Dev. Dependent Variable      0.444         0.444

  Standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01

Why would the return to schooling be so much larger for the compliers than for the general population? After all, we showed earlier that if this was simply ability bias, then we'd expect the 2SLS coefficient to be smaller than the OLS coefficient, because ability bias implies that the coefficient on schooling is too large. Yet we're finding the opposite. So it could be a couple of things. First, it could be that schooling has measurement error.
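Classical measurement error in a regressor attenuates the OLS coefficient toward zero, while a valid instrument is immune to it. A quick sketch in Python (synthetic data; the instrument, error variances, and the 0.10 "return" are all invented for illustration) shows the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

z = rng.normal(size=n)                        # instrument (think: college proximity)
s_true = 12 + z + rng.normal(size=n)          # true years of schooling
s_obs = s_true + 1.5 * rng.normal(size=n)     # schooling reported with classical error
y = 1.0 + 0.10 * s_true + 0.5 * rng.normal(size=n)   # log wage; true return is 0.10

def ols_slope(x, y):
    # Bivariate OLS slope: Cov(x, y) / Var(x).
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_ols = ols_slope(s_obs, y)                         # attenuated toward zero
b_iv = np.cov(z, y)[0, 1] / np.cov(z, s_obs)[0, 1]  # IV: Cov(z, y) / Cov(z, d)

print(f"OLS on mismeasured schooling: {b_ols:.3f}")  # attenuated well below 0.10
print(f"IV estimate:                  {b_iv:.3f}")   # close to the true 0.10
```

Because the reporting error is uncorrelated with the instrument, it cancels out of both covariances in the IV ratio.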
Measurement error would bias the coefficient towards zero, and 2SLS would recover its true value. But

I find this explanation to be unlikely, because I don't foresee people really not knowing with accuracy how many years of schooling they currently have. Which leads us to the other explanation: that compliers have larger returns to schooling. But why would this be the case? Assuming that the exclusion restriction holds, why would compliers' returns be so much larger? We've already established that these people are likely being shifted into more schooling because they live with their parents, which suggests that the college is lowering the marginal cost of going to college. All we are left saying is that for some reason the higher marginal cost of attending college is causing these people to underinvest in schooling; that in fact their returns are much higher. I welcome your thoughts, though, on why this number might be so different.

Stata exercise #2: Fulton fish markets

The second exercise that we'll be doing is based on Graddy [2006]. My understanding is that Graddy hand collected these data herself by recording prices of fish at the actual Fulton fish market. I'm not sure if that is true, but I like to believe it's true, because I like to believe in shoe-leather research of that kind. Anyhow, the Fulton Fish Market operated in NYC on Fulton Street for 150 years. In November 2005, it moved from lower Manhattan to a large facility building for the market in the South Bronx. At the time of the article's writing, it was called the New Fulton Fish Market. It's one of the world's largest fish markets, second only to the Tsukiji in Tokyo.

This is an interesting market because fish are heterogeneous, highly differentiated products. There are anywhere between 100 and 300 different varieties of fish sold at the market. There are over 15 different varieties of shrimp alone. Within each variety, there's small fish, large fish, medium fish, fish just caught, fish that have been around a while.
There’s so much heterogeneity in fact that customers often want to examine fish personally. You get the picture. This fish market functions just like a two-sided platform matching buyers to sellers, which is made more efficient by the thickness the market produces. It’s not surprising, therefore, that Graddy found the market such an interesting thing to study. Let’s move to the data. I want us to estimate the price elasticity of demand for fish, which makes this problem much like the problem that Philip Wright faced in that price and quantity are determined simultaneously. The elasticity of demand is a sequence of quantity and price pairs, but with only one pair observed at a given point in time. In that sense, the demand curve is itself a sequence of potential outcomes (quantity) associated with different potential treatments

(price). This means the demand curve is itself a real object, but mostly unobserved. Therefore, to trace out the elasticity, we need an instrument that is correlated with supply only. Graddy proposes a few of them, all of which have to do with the weather at sea in the days before the fish arrived at market. The first instrument is the average maximum wave height over the last 2 days.

The model we are interested in estimating is:

$$Q = \alpha + \delta P + \gamma X + \varepsilon$$

where Q is log quantity of whiting sold in pounds, P is log average daily price per pound, X are day-of-the-week dummies and a time trend, and ε is the structural error term. Table 29 presents the results from estimating this equation with OLS (first column) and 2SLS (second column). The OLS estimate of the elasticity of demand is -0.549. It could've been anything, given that price is determined by how many sellers and how many buyers there are at the market on any given day. But when we use the average wave height as the instrument for price, we get a -0.960 price elasticity of demand: a 10% increase in the price causes quantity to decrease by 9.6%. The instrument is strong (F > 22). For every one-unit increase in wave height, price rose 10%.

I suppose the question we have to ask ourselves, though, is what exactly this instrument is doing to supply. What are higher waves doing, exactly? They make it more difficult to fish, but do they also change the composition of the fish caught? If so, then it would seem that the exclusion restriction is violated, because wave height would be directly causing fish composition to change, which would directly determine the quantities bought and sold.

Now let's look at a different instrument: windspeed. Specifically, it's the 3-day lagged maximum windspeed. We present these results in Table 30. Here we see something we did not see before, which is that this is a weak instrument. The F statistic is less than 10 (approximately 6.5). And correspondingly, the estimated elasticity is twice as large as what we found with wave height. Thus we know from our earlier discussion of weak instruments that this estimate is likely biased, and therefore less reliable than the previous one, even though the previous one itself (1) may not convincingly satisfy the exclusion restriction and (2) is at best a LATE relevant to compliers only. But as we've said, if we think that the compliers' causal effects are similar to those of the broader population, then the LATE may itself be informative and useful.

We've reviewed the use of IV in identifying causal effects when some regressor is endogenous in observational data. But increasingly, you're seeing it used with randomized trials. In many randomized

Table 29: OLS and 2SLS regressions of Log Quantity on Log Price with wave height instrument

                               Dependent variable: Log quantity
                                       OLS          2SLS
  Log(Price)                       -0.549***     -0.960**
                                   (0.184)       (0.406)
  Monday                           -0.318        -0.322
                                   (0.227)       (0.225)
  Tuesday                          -0.684***     -0.687***
                                   (0.224)       (0.221)
  Wednesday                        -0.535**      -0.520**
                                   (0.221)       (0.219)
  Thursday                          0.068         0.106
                                   (0.221)       (0.222)
  Time trend                       -0.001        -0.003
                                   (0.003)       (0.003)
  First Stage Instrument
  Average wave height                             0.103***
  Robust standard error                          (0.022)
  F statistic for IV in first stage              22.638
  N                                   97            97
  Mean Dependent Variable           8.086         8.086
  Std. Dev. Dependent Variable      0.765         0.765

  Standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01

Table 30: OLS and 2SLS regressions of Log Quantity on Log Price with windspeed instrument

                               Dependent variable: Log quantity
                                       OLS          2SLS
  Log(Price)                       -0.549***     -1.960**
                                   (0.184)       (0.873)
  Monday                           -0.318        -0.332
                                   (0.227)       (0.281)
  Tuesday                          -0.684***     -0.696**
                                   (0.224)       (0.277)
  Wednesday                        -0.535**      -0.482*
                                   (0.221)       (0.275)
  Thursday                          0.068         0.196
                                   (0.221)       (0.285)
  Time trend                       -0.001        -0.007
                                   (0.003)       (0.005)
  First Stage Instrument
  Wind speed                                      0.017**
  Robust standard error                          (0.007)
  F statistic for IV in first stage               6.581
  N                                   97            97
  Mean Dependent Variable           8.086         8.086
  Std. Dev. Dependent Variable      0.765         0.765

  Standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01

trials, participation is voluntary among those randomly chosen to be in the treatment group. On the other hand, persons in the control group usually don't have access to the treatment. Only those who are particularly likely to benefit from treatment will probably take up treatment, which almost always leads to positive selection bias. If you just compare means between treated and untreated individuals using OLS, you will obtain biased treatment effects even in a randomized trial, due to non-compliance. So a solution is to instrument for treatment with whether you were offered treatment and estimate the LATE. Thus, even when the offer of treatment is randomly assigned, it is common for people to use the randomized lottery as an instrument for participation. For a modern example of this, see Baicker et al. [2013], who used the randomized lottery for Oregon's Medicaid as an instrument for being on Medicaid.

In conclusion, instrumental variables is a powerful design for identifying causal effects when your data suffer from selection on unobservables. But even with that in mind, it has many limitations that have, in the contemporary period, caused many applied researchers to eschew it. First, it only identifies the LATE under heterogeneous treatment effects, and that may or may not be a policy-relevant parameter. Its value ultimately depends on how closely the compliers' average treatment effect resembles that of the other subpopulations. Second, unlike RDD, which has only one main identifying assumption (the continuity assumption), IV has up to five assumptions! Thus you can immediately see why people find IV estimation less credible: not because it fails to identify a causal effect, but rather because it's harder and harder to imagine a pure instrument that satisfies all five conditions.
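A small simulation (synthetic data; every coefficient is invented for illustration) of this lottery-as-instrument design shows why the naive treated-versus-untreated comparison is contaminated by selection, while instrumenting take-up with the random offer recovers the compliers' average effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

ability = rng.normal(size=n)             # unobserved; drives both take-up and outcome
offered = rng.integers(0, 2, size=n)     # randomized offer (the instrument)
# Voluntary take-up: only offered units can participate, and the
# high-ability (high-gain) ones do, i.e., positive selection.
d = ((offered == 1) & (ability > 0)).astype(float)
gain = 2.0 + ability                     # heterogeneous treatment effect
y = 1.0 + ability + gain * d + rng.normal(size=n)

# Naive comparison of treated vs. untreated means (biased by selection).
naive = y[d == 1].mean() - y[d == 0].mean()

# IV: instrument take-up with the random offer (Wald estimator).
late = (y[offered == 1].mean() - y[offered == 0].mean()) / \
       (d[offered == 1].mean() - d[offered == 0].mean())

print(f"naive: {naive:.2f}")   # overstates the effect: selection on gains
print(f"IV:    {late:.2f}")    # close to the complier average effect (~2.8 here)
```

Note the IV estimate is still larger than the population ATE of 2.0: compliers here are the high-gain units, which is exactly the LATE caveat from the text.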
But all this is to say: IV is an important strategy, and sometimes the opportunity to use it will come along, and you should be prepared for when that happens by understanding it and knowing how to implement it in practice.


Panel data

"That's just the way it is
Things will never be the same
That's just the way it is
Some things will never change."
– 2Pac

Introduction

One of the most important tools in the causal inference toolkit is the set of panel data estimators. These are estimators designed explicitly for longitudinal data: the repeated observing of a unit over time. Under certain situations, repeatedly observing the same unit over time can overcome a particular kind of omitted variable bias, though not all kinds. While it is possible that observing the same unit over time will not resolve the bias, there are still many applications where it can, and that's why this method is so important. We first review the DAG describing just such a situation, followed by discussion of a paper, and then present a dataset exercise in Stata.

DAG Example

Before I dig into the technical assumptions and estimation methodology for panel data techniques, I wanted to review a simple DAG illustrating those assumptions. This DAG comes from Imai and Kim [2017]. Let's say that we have data on a column of outcomes, Y_i1, Y_i2, and Y_i3, which appear in three time periods, where i indexes a particular unit and t = 1, 2, 3 indexes the time period in which each unit is observed. Likewise, we have a matrix of covariates, D_i1, D_i2, and D_i3, which also vary over time. And finally there exists a single unit-specific unobserved variable, u_i, which varies across units but does not vary over time for that unit. Hence there is no t = 1, 2, 3 subscript on our u_i variable. Key to this variable is that (a) it is unobserved in the dataset, (b) it is unit-specific, and (c) it does not change over time for a given

unit i. Finally, there exists some unit-specific time-invariant variable, X_i. Notice that it doesn't change over time, but unlike u_i it is observed.

[DAG with nodes Y_i1, Y_i2, Y_i3; D_i1, D_i2, D_i3; X_i; and u_i. The edges are described below.]

As this is the busiest DAG we've seen so far, it merits some discussion. First, note that D_i1 causes both its own outcome, Y_i1, and is also correlated with the next period's D_i2. Second, u_i is correlated with all the D_it variables, which technically makes all the D_it endogenous, since u_i is unobserved and therefore gets absorbed into a composite error term. Third, there is no time-varying unobserved confounder correlated with D_it; the only confounder is u_i, which we call the unobserved heterogeneity. Fourth, past outcomes do not directly affect current outcomes (i.e., no direct edge between the Y_it variables). Fifth, past outcomes do not directly affect current treatments (i.e., no direct edge from Y_i,t-1 to D_it). And finally, past treatments, D_i,t-1, do not directly affect current outcomes, Y_it (i.e., no direct edge from D_i,t-1 to Y_it). It is under these assumptions that we can use a particular panel method called fixed effects to isolate the causal effect of D on Y.

What might an example of this be? Let's return to our story about the returns to education. Let's say that we are interested in the effect of schooling on earnings, and schooling is partly determined by unchanging genetic factors which themselves determine unobserved ability, like intelligence, conscientiousness, and motivation [Conley and Fletcher, 2017]. If we observe the same people's time-varying earnings and schooling over time, then if the situation described by the above DAG describes both the directed edges and the missing edges, we can use panel fixed effects models to identify the causal effect of schooling on earnings.

Estimation

When we use the term "panel data", what do we mean?
We mean a dataset in which we observe the same units (individuals, firms, countries, schools, etc.) over more than one time period. Often our outcome variable depends on several factors, some of which are observed and some of which are unobserved in our data, and insofar as the unobserved variables are correlated with the treatment variable, the treatment variable is endogenous and correlations are not estimates of a causal effect. This chapter focuses on the conditions under which a correlation between D and Y reflects a causal effect even with unobserved variables that are correlated with the treatment variable. Specifically, if these omitted variables are constant over time, then even if they are heterogeneous across units, we can use panel data estimators to consistently estimate the effect of our treatment variable on outcomes.

There are several different kinds of estimators for panel data, but in this chapter we will cover only two: pooled ordinary least squares (POLS) and fixed effects (FE).[125]

[125] A common third type of panel estimator is the random effects estimator, but in my experience I have used it less often than fixed effects, so I decided to omit it. Again, this is not because it is unimportant. It is important. I have just chosen to do fewer things in more detail, based on whether I think they qualify as the most common methods used in the present period by applied empiricists. See Wooldridge [2010] for a more comprehensive treatment of all panel methods, including random effects.

First we need to set up our notation. With some exceptions, panel methods are usually based on the traditional notation, not the potential outcomes notation. One exception is the matrix completion methods by Athey et al. [2017], but at the moment that material is not included in this version. So we will use, instead, the traditional notation for our motivation.

Let Y and D ≡ (D_1, D_2, ..., D_k) be observable random variables
We are interested in the of all panel methods including random partial effects of variable D in the population regression function: j effects. E Y | D ] , D [ ,..., D , u 2 1 k i cross-sectional units for N We observe a sample of =1,2,..., t T time periods (a balanced panel). For each unit i ,we =1,2,..., denote the observable variables for all time periods as { ( Y = , D t ): it it 126 126 T D Let D For simplicity, I’m ignoring the time- ⌘ ( D . vector. We , D K ⇥ ,..., 1, 2, . . . , } ) is a 1 2 1 it it it itk X invariant observations, from our i typically assume that the actual cross-sectional units (e.g., individuals DAG for reasons that will hopefully in a panel) are identical and independent draws from the population soon be made clear. N { Y , D , or cross-sectional independence. , u . } in which case d . ⇠ i . i i i i i =1 0 Y and ⌘ ( Y Y ) , We describe the main observables, then, as Y ,..., 2 iT 1 i i i D ). ⌘ ( D D ,..., , D i 1 i iT i 2 It’s helpful now to illustrate the actual stacking of individual units across their time periods. A single unit i will have multiple time periods t

$$
Y_i = \begin{pmatrix} Y_{i1} \\ \vdots \\ Y_{it} \\ \vdots \\ Y_{iT} \end{pmatrix}_{T \times 1}
\qquad
D_i = \begin{pmatrix} D_{i,1,1} & D_{i,1,2} & \cdots & D_{i,1,K} \\ \vdots & \vdots & & \vdots \\ D_{i,t,1} & D_{i,t,2} & \cdots & D_{i,t,K} \\ \vdots & \vdots & & \vdots \\ D_{i,T,1} & D_{i,T,2} & \cdots & D_{i,T,K} \end{pmatrix}_{T \times K}
$$

And the entire panel itself, with all units included, will look like this:

$$
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_i \\ \vdots \\ Y_N \end{pmatrix}_{NT \times 1}
\qquad
D = \begin{pmatrix} D_1 \\ \vdots \\ D_i \\ \vdots \\ D_N \end{pmatrix}_{NT \times K}
$$

For a randomly drawn cross-sectional unit i, the model is given by

$$Y_{it} = \delta D_{it} + u_i + \varepsilon_{it}, \quad t = 1, 2, \ldots, T$$

As always, we use our schooling-earnings example for motivation. Let Y_it be log earnings for person i in year t. Let D_it be schooling for person i in year t. Let δ be the returns to schooling. Let u_i be the sum of all time-invariant person-specific characteristics, such as unobserved ability; this is often called the unobserved heterogeneity. And let ε_it be the time-varying unobserved factors that determine a person's wage in a given period; this is often called the idiosyncratic error. We want to know what happens when we regress Y_it on D_it.

Pooled OLS   The first estimator we will discuss is the pooled ordinary least squares, or POLS, estimator. When we ignore the panel structure and regress Y_it on D_it we get

$$Y_{it} = \delta D_{it} + \eta_{it}, \quad t = 1, 2, \ldots, T$$

with composite error η_it ≡ u_i + ε_it. The main assumption necessary to obtain consistent estimates of δ is:

$$E[\eta_{it} \mid D_{i1}, D_{i2}, \ldots, D_{iT}] = E[\eta_{it}] = 0, \quad t = 1, 2, \ldots, T$$

While our DAG did not include ε_it, this would be equivalent to assuming that the unobserved heterogeneity, u_i, is uncorrelated with D_it for all time periods. But this is not an appropriate assumption in our case, because our DAG explicitly links the unobserved heterogeneity to both the

outcome and the treatment in each period. Or, using our schooling-earnings example: schooling is likely based on unobserved background factors, u_i, and therefore without controlling for them we have omitted variable bias, and the estimated δ is biased. No correlation between η_it and D_it necessarily means no correlation between the unobserved u_i and D_it for all t, and that is just probably not a credible assumption. An additional problem is that η_it is serially correlated for unit i, since u_i is present in each period t, and thus pooled OLS standard errors are also invalid.

Fixed Effects (Within Estimator)   Let's rewrite our unobserved effects model so that it is still firmly in our minds:

$$Y_{it} = \delta D_{it} + u_i + \varepsilon_{it}, \quad t = 1, 2, \ldots, T$$

If we have data on multiple time periods, we can think of the u_i as fixed effects to be estimated. OLS estimation with fixed effects yields

$$(\hat{\delta}, \hat{u}_1, \ldots, \hat{u}_N) = \underset{b, m_1, \ldots, m_N}{\operatorname{argmin}} \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - D_{it} b - m_i)^2$$

This amounts to including N individual dummies in a regression of Y_it on D_it. The first-order conditions (FOCs) for this minimization problem are

$$\sum_{i=1}^{N} \sum_{t=1}^{T} D_{it}'(Y_{it} - D_{it}\hat{\delta} - \hat{u}_i) = 0$$

and

$$\sum_{t=1}^{T} (Y_{it} - D_{it}\hat{\delta} - \hat{u}_i) = 0$$

for i = 1, ..., N. Therefore, for i = 1, ..., N,

$$\hat{u}_i = \frac{1}{T} \sum_{t=1}^{T} (Y_{it} - D_{it}\hat{\delta}) = \bar{Y}_i - \bar{D}_i \hat{\delta}$$

where

$$\bar{Y}_i \equiv \frac{1}{T} \sum_{t=1}^{T} Y_{it}; \qquad \bar{D}_i \equiv \frac{1}{T} \sum_{t=1}^{T} D_{it}$$

Plug this result into the first FOC to obtain:

$$\hat{\delta} = \bigg( \sum_{i=1}^{N} \sum_{t=1}^{T} (D_{it} - \bar{D}_i)'(D_{it} - \bar{D}_i) \bigg)^{-1} \bigg( \sum_{i=1}^{N} \sum_{t=1}^{T} (D_{it} - \bar{D}_i)'(Y_{it} - \bar{Y}_i) \bigg) = \bigg( \sum_{i=1}^{N} \sum_{t=1}^{T} \ddot{D}_{it}' \ddot{D}_{it} \bigg)^{-1} \bigg( \sum_{i=1}^{N} \sum_{t=1}^{T} \ddot{D}_{it}' \ddot{Y}_{it} \bigg)$$

with time-demeaned variables Ÿ_it ≡ Y_it − Ȳ_i and D̈_it ≡ D_it − D̄_i.

In case it isn't clear: running a regression with the time-demeaned variables Ÿ_it and D̈_it is numerically equivalent to a regression of Y_it on D_it and unit-specific dummy variables. Hence this is sometimes called the "within" estimator and sometimes called the "fixed effects" estimator. They are the same thing.[127]

[127] One of the things you'll find over time is that things have different names, depending on the author and tradition, and those names are often completely uninformative.

Even better, the regression with the time-demeaned variables is consistent for δ even when C[D_it, u_i] ≠ 0, because time demeaning eliminates the unobserved effects. Let's see this now:

$$Y_{it} = \delta D_{it} + u_i + \varepsilon_{it}$$
$$\bar{Y}_i = \delta \bar{D}_i + u_i + \bar{\varepsilon}_i$$
$$(Y_{it} - \bar{Y}_i) = \delta (D_{it} - \bar{D}_i) + (u_i - u_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$$
$$\ddot{Y}_{it} = \delta \ddot{D}_{it} + \ddot{\varepsilon}_{it}$$

Where'd the unobserved heterogeneity go?! It was deleted when we time-demeaned the data. And as we said, including individual fixed effects does this time demeaning automatically, so that you don't have to go to the actual trouble of doing it yourself manually.[128]

[128] Though feel free to do it if you want to convince yourself that they are numerically equivalent, probably just starting with a bivariate regression for simplicity.

So how precisely do we do this form of estimation? There are three ways to implement the fixed effects (within) estimator. They are:

1. Demean and regress Ÿ_it on D̈_it (need to correct the degrees of freedom)

2. Regress Y_it on D_it and unit dummies (dummy variable regression)

3. Regress Y_it on D_it with the canned fixed effects routine in Stata:

. xtreg y d, fe i(PanelID)

More on the Stata implementation at the end of this chapter. We'll review an example from my research, and you'll estimate a POLS, a FE, and a demeaned OLS model on real data so that you can see how to do this.
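The numerical equivalence of the demeaned regression and the dummy-variable regression can be checked directly. Here is a sketch in Python with numpy rather than Stata (synthetic data; δ = 1.5 and all variances are arbitrary), with pooled OLS included for contrast:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 5
delta = 1.5

u = rng.normal(size=N)                      # unobserved heterogeneity
d = rng.normal(size=(N, T)) + u[:, None]    # treatment correlated with u
y = delta * d + u[:, None] + rng.normal(size=(N, T))

# 1. Within transformation: demean y and d by unit, then bivariate OLS.
y_dd = (y - y.mean(axis=1, keepdims=True)).ravel()
d_dd = (d - d.mean(axis=1, keepdims=True)).ravel()
b_within = (d_dd @ y_dd) / (d_dd @ d_dd)

# 2. Dummy-variable regression: stack d with N unit dummies.
dummies = np.kron(np.eye(N), np.ones((T, 1)))     # (N*T, N) block of unit indicators
X = np.column_stack([d.ravel(), dummies])
b_dummy = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][0]

# 3. Pooled OLS, which ignores u and is biased because Cov(d, u) != 0.
b_pols = np.cov(d.ravel(), y.ravel())[0, 1] / np.var(d.ravel(), ddof=1)

print(b_within, b_dummy)   # numerically identical, near the true 1.5
print(b_pols)              # biased upward, near 2.0 in this design
```

The within and dummy-variable estimates agree to machine precision, while pooled OLS absorbs Cov(d, u) into the slope.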
Identifying Assumptions   We kind of reviewed the assumptions necessary to identify δ with our fixed effects (within) estimator when we walked through that original DAG, but let's supplement that DAG intuition with some formality. The main identification assumptions are:

1. E[ε_it | D_i1, D_i2, ..., D_iT, u_i] = 0 for t = 1, 2, ..., T.

• This means that the regressors are strictly exogenous conditional on the unobserved effect. This allows D_it to be arbitrarily related to u_i, though. It only concerns the relationship between D_it and ε_it, not D_it's relationship to u_i.

2. rank E[ Σ_{t=1}^{T} D̈_it' D̈_it ] = K.

• It shouldn't be a surprise to you by this point that we have a rank condition, because even when we were working with the simpler linear models, the estimated coefficient was always a scaled covariance, where the scaling was by a variance term. Thus regressors must vary over time for at least some i and not be collinear in order that the estimate approximate δ.

The properties of the estimator under assumptions 1-2 are that δ̂_FE is consistent (plim_{N→∞} δ̂_FE = δ) and δ̂_FE is unbiased conditional on D.

I will only briefly mention inference. The standard errors in this framework must be "clustered" by panel unit (e.g., individual) to allow for correlation in the ε_it's for the same person over time. In Stata, this is implemented as follows:

. xtreg y d, fe i(PanelID) cluster(PanelID)

This yields valid inference so long as the number of clusters is "large".[129]

[129] In my experience, when an econometrician is asked how large is large, they say "the size of your data". That said, there is a small-clusters literature, and usually it's thought that fewer than 30 clusters is too small (as a rule of thumb). So it may be that having around 30-40 clusters is sufficient for approximating the approach to infinity. This will usually hold in most panel applications, such as US states or individuals in the NLSY, etc.

Caveat #1: Fixed Effects Cannot Address Reverse Causality   But there are still things that fixed effects (within) estimators cannot solve. For instance, let's say we regressed crime rates onto police spending per capita. Becker [1968] argues that increases in the probability of arrest, usually proxied by police per capita or police spending per capita, will reduce crime.
But at the same time, police spending per capita is itself a function of crime rates. This kind of reverse causality problem shows up in most panel models regressing crime rates onto police. For instance, see Cornwell and Trumbull [1994], Table 3, column 2 (Figure 76). Focus on the coefficient on "POLICE". The dependent variable is crime rates by county in North Carolina for a panel, and they find a positive correlation between police and crime rates. Does this mean that more police in an area causes higher crime rates? Or does it more likely reflect the reverse causality problem?

Traditionally, economists have solved this kind of reverse causality problem by using instrumental variables. Examples include Evans and Owens [2007] and Draca et al. [2011]. I produce one example from Draca et al. [2011]. In this study, the authors used as an instrument the way in which police were deployed in response to terrorist attacks

Figure 76: Table 3 from Cornwell and Trumbull [1994], "Results from Estimation" (standard errors in parentheses). The table reports Between, Within (fixed effects), and 2SLS estimates of county crime-rate regressions for North Carolina; the coefficient on POLICE is positive in every column (e.g., 0.413 in the within regression).
in London in a program called Operation Theseus. The authors reject the null hypothesis that PA and POLICE are The other variable revealed to influence the crime rate present both a pooled OLS estimate (which is positive on the effect uncorrelated with E.7 Therefore, on efficiency grounds statistically significantly is PERCENT YOUNG MALE, whose estimated coefficient is 0.888. The large, positive we prefer the within estimates, and conclude that both 2 that police have on crime) and the SLS estimate (which is negative, labor market and law enforcement incentives matter effect of PERCENT YOUNG MALE is consistent with (consistent with Grogger (1991)). the fact that young males commit most of the crime. 77 consistent with Becker’s hypothesis). See Figure . Although estimators that ignore unobserved hetero- Interestingly, the effects of WMFG and PERCENT So, one situation in which you wouldn’t want to use panel fixed ef- YOUNG MALE are not statistically significant in re- geneity are inconsistent, it is instructive to contrast our fixed effects 2SLS estimates with those obtained from gressions that do not account for unobserved hetero- fects is if you have reverse causality or simultaneity bias. And specifi- geneity. cally when that reverse causality is very strong in observational data. One interpretation of our fixed effects 2SLS esti- 7 The value of the test-statistic, which is asymptotically dis- mates is that the efficacy of labor market solutions to tributed as X2, is 0.031. This would technically violate the DAG, though, that we presented at the start of the chapter. Notice that if we had reverse causality, D then Y ! , which is explicitly ruled out by this theoretical model This content downloaded from on Mon, 12 Mar 2018 17:32:25 UTC All use subject to contained in the DAG. But obviously, in the police - crime example, that DAG would be inappropriate, and any amount of reflection on the problem should tell you that that DAG is inappropriate. 
Thus it requires, as I've said repeatedly, some careful reflection; writing out exactly what the relationship is between the treatment variables and the outcome variables in a DAG can help you develop a credible identification strategy.

Figure 77: Table 2 from Draca et al. [2011], difference-in-differences regression estimates of police deployment and total crimes, 2004-2005.

Caveat #2: Fixed Effects Cannot Address Time-variant Unobserved Heterogeneity

The second situation in which panel fixed effects don't buy you anything is if the unobserved heterogeneity is time-varying. In this situation, the demeaning has simply demeaned an unobserved time-variant variable, which then moves into the composite error term; and since the time-demeaned ü_it remains correlated with D̈_it, the treatment remains endogenous. Again, look carefully at the DAG: panel fixed effects are only appropriate if u_i is unchanging. Otherwise it's just another form of omitted variable bias. So, that said, don't just blindly use fixed effects and think that it solves your omitted variable bias problem, in the same way that you shouldn't use matching just because it's convenient to do. You need a DAG, based on an actual economic model, which will allow you to build the appropriate research design.
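To see this caveat concretely, here is a small simulated sketch, in Python rather than Stata, with made-up parameters (it is an illustration of the mechanics, not anything from the chapter's data). When the heterogeneity drifts over time, demeaning removes only the time-invariant component, and the within estimate stays biased:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 500, 4
i = np.repeat(np.arange(n), t)               # unit index
time = np.tile(np.arange(t), n).astype(float)

u0    = rng.normal(size=n)                   # time-invariant component of heterogeneity
drift = rng.normal(size=n)                   # unit-specific trend -> time-VARIANT heterogeneity
u_it  = u0[i] + drift[i] * time

d = 0.9 * u_it + rng.normal(size=n * t)      # treatment is correlated with u_it
y = 1.0 * d + u_it + rng.normal(scale=0.1, size=n * t)   # true effect is 1.0

def demean(v):                               # within transformation: v_is minus its unit mean
    return v - (np.bincount(i, weights=v) / t)[i]

b_fe = np.linalg.lstsq(demean(d)[:, None], demean(y), rcond=None)[0][0]
# Demeaning removed u0 but not drift*time, which stays in the error term and is
# still correlated with the demeaned treatment, so the estimate is biased above 1.
print(round(b_fe, 2))
```

The printed coefficient lands well above the true value of 1.0, which is exactly the omitted variable bias the text describes: the within transformation cannot sweep out heterogeneity that changes over time.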
Nothing substitutes for careful reasoning and economic theory, as they are the necessary conditions for good research design.

Example: Returns to Marriage and Unobserved Heterogeneity

When might this be true? Let's use an example from Cornwell and Rupert [1997], in which the authors attempt to estimate the causal effect of marriage on earnings. It's a well known stylized fact that married men earn more than unmarried men, even controlling for observables. But the question is whether that correlation is causal, or whether it reflects unobserved heterogeneity, or selection bias.

So let's say that we had panel data on individuals. These individuals i are observed for four periods t. We are interested in the following equation:130

Y_it = α + δM_it + βX_it + γA_i + g_i + ε_it

130 We use the same notation as used in their paper, as opposed to the notation presented earlier.

Let the outcome Y_it be their wage observed in each period, which changes each period. Let wages be a function of marriage M_it, which changes over time; other covariates X_it, which change over time; race and gender A_i, which do not change over the panel period; and an unobserved variable we call unobserved ability, g_i. This could be intelligence, non-cognitive ability, motivation, or some other unobserved confounder. The key here is that it is unit-specific, unobserved, and time-invariant. The ε_it is the unobserved determinants of wages, which are assumed to be uncorrelated with marriage and the other covariates.

Cornwell and Rupert [1997] estimate both a feasible generalized least squares (FGLS) model and three fixed effects models (each of which includes different time-varying controls). The authors call the fixed effects regression a "within" estimator, because it uses the within-unit variation to eliminate the confounding. Their estimates are presented in Figure 78.

Notice that the FGLS model (column 1) finds a strong marriage premium of around 8.3%. But once we begin estimating fixed effects models, the effect gets smaller and less precise. The inclusion of marriage characteristics, such as years married and job tenure, causes the coefficient on marriage to fall by around 60% from the FGLS estimate, and it is no longer statistically significant at the 5% level.

One of the interesting features of this analysis is the effect of dependents on wages. Even under the fixed effects estimation, the relationship between dependents and wages is positive, robust and statistically significant. The authors explore this in more detail by including interactions of marriage variables with dependents (Figure 79). Here we see that the coefficient on marriage falls and is no longer statistically significant, but there still exists a positive effect of dependents on earnings.
Stata example: Survey of Adult Service Providers

Next I'd like to introduce a Stata exercise based on data collection for my own research: a survey of sex workers. You may or may not know this, but the Internet has had a profound effect on sex markets. It has moved women indoors, off the streets, while simultaneously breaking the link with pimps. It has increased safety and anonymity, too, which has had the effect of inducing new entrants. The marginal sex worker has more education and better outside options than traditional US sex workers [Cunningham and Kendall, 2011, Cornwell and Cunningham, 2016]. The Internet, in sum, caused the marginal sex worker to shift towards women more sensitive to detection, harm and arrest.

Figure 78: Table 2 from Cornwell and Rupert [1997], estimated wage regressions (standard errors in parentheses).

In 2008 and 2009, I surveyed (with Todd Kendall) approximately 700 US Internet-mediated sex workers. The survey was a basic labor market survey; I asked them about their illicit and legal labor market experiences, and their demographics.
The survey had two parts: a "static" provider-specific section and a "panel" section. The panel section asked respondents to share information about each of the last 4 sessions with clients.131 I have created a shortened version of the dataset and uploaded it to my website. It includes a few time-invariant provider characteristics, such as race, age, marital status, years of schooling and body mass index, as well as several time-variant session-specific characteristics, including the log of the hourly price, the log of the session length (in hours), characteristics of the client himself, whether a condom was used in any capacity during the session, whether the client

131 Technically, I asked them to share about the last five sessions, but for this exercise, I have dropped the fifth due to low response rates on the fifth session.

was a "regular", etc.

Figure 79: Table 3 from Cornwell and Rupert [1997], dependents and the returns to marriage (standard errors in parentheses).

In this exercise, you will estimate three types of models: a pooled OLS model, a fixed effects (FE) model, and a demeaned OLS model. The models will be of the following form:

Y_is = βX_i + γZ_is + u_i + ε_is

Ÿ_is = γZ̈_is + ε̈_is

where u_i
is both unobserved and correlated with Z_is.

The first model will be estimated with pooled OLS, and the second model will be estimated using both fixed effects and OLS. In other words, I'm going to have you estimate the model using the xtreg function with individual fixed effects, as well as demean the

data manually and estimate the demeaned regression using reg. Notice that the second regression has different notation on the dependent and independent variables; it represents the fact that the variables are demeaned. Thus Ÿ_is = Y_is - Ȳ_i. Secondly, notice that the time-invariant X_i variables are missing from the second equation. Do you understand why that is the case? These variables have also been demeaned, but since the demeaning is across time, and since these time-invariant variables do not change over time, the demeaning deletes them from the expression. Notice, also, that the unobserved individual-specific heterogeneity, u_i, has disappeared. It has disappeared for the same reason that the X_i terms are gone: because the mean of u_i over time is itself, and thus the demeaning deletes it. To estimate these models, type the following lines into Stata:

. scuse sasp_panel, clear
. tsset id session
. foreach x of varlist lnw age asq bmi hispanic black other asian schooling cohab married divorced ///
    separated age_cl asq_cl appearance_cl provider_second asian_cl black_cl hispanic_cl othrace_cl ///
    reg llength unsafe hot massage_cl {
      drop if `x'==.
  }
. bysort id: gen s=_N
. keep if s==4
. foreach x of varlist lnw age asq bmi hispanic black other asian schooling cohab married divorced ///
    separated age_cl asq_cl appearance_cl provider_second asian_cl black_cl hispanic_cl othrace_cl ///
    reg llength unsafe hot massage_cl {
      egen mean_`x'=mean(`x'), by(id)
      gen demean_`x'=`x'-mean_`x'
      drop mean_*
  }
. xi: reg lnw age asq bmi hispanic black other asian schooling cohab married divorced separated ///
    age_cl asq_cl appearance_cl provider_second asian_cl black_cl hispanic_cl othrace_cl ///
    reg llength unsafe hot massage_cl, robust
. xi: xtreg lnw age asq bmi hispanic black other asian schooling cohab married divorced separated ///
    age_cl asq_cl appearance_cl provider_second asian_cl black_cl hispanic_cl othrace_cl ///
    reg llength unsafe hot massage_cl, fe i(id) robust
. reg demean_lnw demean_age demean_asq demean_bmi demean_hispanic demean_black demean_other ///
    demean_asian demean_schooling demean_cohab demean_married demean_divorced demean_separated ///
    demean_age_cl demean_asq_cl demean_appearance_cl demean_provider_second demean_asian_cl ///
    demean_black_cl demean_hispanic_cl demean_othrace_cl demean_reg demean_llength demean_unsafe ///
    demean_hot demean_massage_cl, robust cluster(id)

Notice the first five commands created a balanced panel. Some of the respondents would leave certain questions blank, probably due to concerns about anonymity and privacy, so we have dropped anyone who had missing values for the sake of this exercise.
This leaves us with a balanced panel. You can see this yourself if after running those five lines you type xtdescribe .
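Before turning to the output, here is a minimal sketch of why the demeaned (within) regression drops u_i and the time-invariant covariates, and why it matches the fixed effects estimator. This is in Python rather than Stata, on simulated data with made-up variable names and parameter values (not the SASP data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 200, 4                      # 200 providers, 4 sessions each (balanced panel)
i = np.repeat(np.arange(n), t)     # unit index

u = rng.normal(size=n)             # unobserved, time-invariant heterogeneity u_i
x = rng.normal(size=n)             # time-invariant covariate X_i (e.g., schooling)
z = rng.normal(size=n * t) + 0.8 * u[i]   # time-varying Z_is, correlated with u_i
y = 1.0 * x[i] + 2.0 * z + u[i] + rng.normal(scale=0.1, size=n * t)  # gamma = 2.0

def demean(v):                     # subtract each unit's time mean: v_is - mean_i(v)
    return v - (np.bincount(i, weights=v) / t)[i]

# Pooled OLS on Z is biased because Z is correlated with the omitted u_i
b_pols = np.linalg.lstsq(np.column_stack([np.ones(n * t), x[i], z]), y, rcond=None)[0][2]

# Demeaned (within) OLS: no intercept, no X_i, no u_i survive the transformation
b_within = np.linalg.lstsq(demean(z)[:, None], demean(y), rcond=None)[0][0]

# Fixed effects via unit dummies gives the identical slope on Z
D = (i[:, None] == np.arange(n)[None, :]).astype(float)
b_fe = np.linalg.lstsq(np.column_stack([D, z]), y, rcond=None)[0][-1]

assert np.allclose(demean(x[i]), 0)              # time-invariant X_i is wiped out
assert np.isclose(b_within, b_fe, atol=1e-4)     # within and dummy-variable FE coincide
assert abs(b_within - 2.0) < abs(b_pols - 2.0)   # FE is closer to the true gamma
print(round(b_pols, 2), round(b_within, 2))
```

The same equivalence is what you should see when comparing the xtreg, fe output to the regression on the manually demeaned data.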

I have organized the output into Table 31. There's a lot of interesting information in these three columns, some of which may surprise you, if only for the novelty of the regressions. So let's talk about the statistically significant ones. The pooled OLS regressions, recall, do not control for unobserved heterogeneity, because by definition those are unobservable. So these are potentially biased by the unobserved heterogeneity, which is a kind of selection bias, but we will discuss them anyhow.

Table 31: POLS, FE and Demeaned OLS Estimates of the Determinants of Log Hourly Price for a Panel of Sex Workers

Depvar: log hourly wage                     POLS        FE          Demeaned OLS
Unprotected sex with client of any kind     0.013       0.051*      0.051*
                                            (0.026)     (0.028)     (0.028)
Ln(Length)                                  -0.308***   -0.435***   -0.435***
                                            (0.019)     (0.028)     (0.024)
Client was a Regular                        -0.047*     -0.037**    -0.037**
                                            (0.028)     (0.019)     (0.017)
Age of Client                               -0.001      0.002       0.002
                                            (0.009)     (0.007)     (0.006)
Age of Client Squared                       0.000       -0.000      -0.000
                                            (0.000)     (0.000)     (0.000)
Client Attractiveness (Scale of 1 to 10)    0.020***    0.006       0.006
                                            (0.005)     (0.007)     (0.006)
Second Provider Involved                    0.055       0.113*      0.113*
                                            (0.048)     (0.067)     (0.060)
Asian Client                                -0.014      -0.010      -0.010
                                            (0.030)     (0.049)     (0.034)
Black Client                                0.092       0.027       0.027
                                            (0.073)     (0.042)     (0.037)
Hispanic Client                             -0.052      -0.062      -0.062
                                            (0.045)     (0.080)     (0.052)
Other Ethnicity Client                      0.156**     0.142***    0.142***
                                            (0.068)     (0.049)     (0.045)
Met Client in Hotel                         0.133***    0.052*      0.052*
                                            (0.024)     (0.029)     (0.027)
Gave Client a Massage                       -0.134***   -0.001      -0.001
                                            (0.024)     (0.029)     (0.028)
Age of provider                             0.003       0.000       0.000
                                            (0.012)     (.)         (.)
Age of provider squared                     -0.000      0.000       0.000
                                            (0.000)     (.)         (.)
Body Mass Index                             -0.022***   0.000       0.000
                                            (0.002)     (.)         (.)
Hispanic                                    -0.226***   0.000       0.000
                                            (0.082)     (.)         (.)
Black                                       0.028       0.000       0.000
                                            (0.064)     (.)         (.)
Other                                       -0.112      0.000       0.000
                                            (0.077)     (.)         (.)
Asian                                       0.086       0.000       0.000
                                            (0.158)     (.)         (.)
Imputed Years of Schooling                  0.020**     0.000       0.000
                                            (0.010)     (.)         (.)
Cohabitating (living with a partner)        -0.054      0.000       0.000
  but unmarried                             (0.036)     (.)         (.)
Currently married and living with           0.005       0.000       0.000
  your spouse                               (0.043)     (.)         (.)
Divorced and not remarried                  -0.021      0.000       0.000
                                            (0.038)     (.)         (.)
Married but not currently living with       -0.056      0.000       0.000
  your spouse                               (0.059)     (.)         (.)
N                                           1,028       1,028       1,028
Mean of dependent variable                  5.57        5.57        0.00

Heteroskedastic robust standard errors in parentheses, clustered at the provider level. * p < 0.10, ** p < 0.05, *** p < 0.01.

First, a simple scan of the second and third columns will show that the fixed effects regression, which included (not shown) dummies for

the individual herself, is equivalent to a regression on the demeaned data. This should help persuade you that the fixed effects and the demeaned (within) estimators are yielding the same coefficients.

But second, let's dig into the results. One of the first things we observe is that in the pooled OLS model, there is not a detectable compensating wage differential for having unprotected sex with a client.132 But notice that in the fixed effects model, unprotected sex has a premium. This is consistent with Rosen [1986], who posited the existence of risk premia, as well as Gertler et al. [2005], who found risk premia for sex workers using panel data. Gertler et al. [2005], though, find a much larger premium of over 20% for unprotected sex, whereas I am finding only a mere 5%. This could be because a large number of the unprotected instances are fellatio, which carries a much lower risk of infection than unprotected receptive intercourse. Nevertheless, it is interesting that unprotected sex, under the assumption of strict exogeneity, appears to cause wages to rise by approximately 5%, which is statistically significant at the 10% level. Given an hourly wage of $262, this amounts to a mere $13 additional dollars per hour. The lack of a finding in the pooled OLS model seems to suggest that the unobserved heterogeneity was masking the effect.

132 There were three kinds of sexual encounter: vaginal receptive sex, anal receptive sex, and fellatio. Unprotected sex is coded as any sex act without a condom.

Next we look at the session length. Note that I have already adjusted the price the client paid for the length of the session, so that the outcome is a log wage, as opposed to a log price. As this is a log-log regression, we can interpret the coefficient on log length as an elasticity. When we use fixed effects, the elasticity increases in magnitude from -0.308 to -0.435.
The significance of this result, in economic terms, though, is that there appear to be "volume discounts" in sex work. That is, longer sessions are more expensive, but at a decreasing rate.

Another interesting result is whether the client was a "regular", which means that she had seen him before in another session. In our pooled OLS model, regulars paid 4.7% less, and this shrinks slightly in our fixed effects model to a 3.7% reduction. Economically, this could be because new clients pose risks that repeat customers do not pose. Thus, if we expect prices to move closer to marginal cost, the disappearance of some of the risk from the repeated session should lower the price, which it appears to do.

Another factor related to price is the attractiveness of the client. Interestingly, this does not go in the direction we might have expected. One might expect that the more attractive the client, the less he pays. But in fact it is the opposite. Given other research that finds beautiful people earn more money [Hamermesh and Biddle, 1994], it's possible that sex workers are price discriminating. That is, when they see a handsome client, they deduce he earns more, and therefore charge him more. This result does not hold up when including fixed effects,

though, suggesting that it is due to unobserved heterogeneity, at least in part.

Similar to unprotected sex, a second provider present has a positive effect on price which is only detectable in the fixed effects model. Controlling for unobserved heterogeneity, the presence of a second provider increases prices by 11.3%. We also see that she discriminates against clients of "other" ethnicity, who pay 14.2% more than White clients. There's a premium associated with meeting in a hotel, which is smaller by almost a third when controlling for provider fixed effects. This positive effect, even in the fixed effects model, may simply represent the higher costs associated with meeting in a hotel room. The other coefficients are not statistically significant.

Many of the time-invariant results are also interesting, though. For instance, perhaps not surprisingly, women with higher BMI earn less. Hispanics earn less than White sex workers. And women with more schooling earn more, something which is explored in greater detail in Cunningham and Kendall [2016].

Conclusion

In conclusion, we have been exploring the usefulness of panel data for estimating causal effects. We noted that the fixed effects (within) estimator is a very useful method for addressing a very specific form of endogeneity, with some caveats. First, it will eliminate any and all unobserved and observed time-invariant covariates correlated with the treatment variable. So long as the treatment and the outcome vary over time, and strict exogeneity holds, then the fixed effects (within) estimator will identify the causal effect of the treatment on the outcome. But this came with certain qualifications. For one, the method couldn't handle time-variant unobserved heterogeneity.
It's thus the burden of the researcher to determine which type of unobserved heterogeneity problem they face; if they face the latter, then the panel methods reviewed here are neither unbiased nor consistent. Second, when there exist strong reverse causality pathways, panel methods are biased. Thus we cannot solve the problem of simultaneity, such as what Wright faced when estimating the price elasticity of demand, using the fixed effects (within) estimator. Most likely, we are going to have to move into a different framework when facing that kind of problem. Still, many problems in the social sciences may credibly be characterized by a time-invariant unobserved heterogeneity problem, in which case the fixed effects (within) panel estimator is useful and appropriate.


Differences-in-differences

"What's the difference between me and you? About five bank accounts, three ounces, and two vehicles." - Dr. Dre

Introduction

In 2002, Craigslist opened a new section on its front page called "erotic services" in San Francisco, California. The section would end up being used by sex workers exclusively to advertise to and solicit clients. Sex workers claimed it made them safer, because instead of working on street corners and for pimps, they could solicit indoors from their computers, which, as a bonus, also gave them the chance to learn more about the men contacting them. But activists and law enforcement worried that it was facilitating sex trafficking and increasing violence against women.

Which was it? Was erotic services (ERS) making women safer, or was it placing them in harm's way? This is ultimately an empirical question. We want to know the effect of ERS on female safety, but the fundamental problem of causal inference says that we can't know what effect it had, because we are missing the data necessary to make the calculation. That is,

δ = E[M^1] - E[M^0]

where M^1 is women murdered in a world where San Francisco has ERS, and M^0 is women murdered in a world where San Francisco does not have ERS at the exact same moment in time. In 2002, only the first occurred, as the second was a counterfactual. So how do we proceed?

The standard way to evaluate interventions such as this is the standard differences-in-differences strategy, or DD.133 DD is basically a version of panel fixed effects, but it can also be used with repeated cross-sections. Let's look at this example using some tables, which hopefully will help give you an idea of the intuition behind DD, as well as some of its identifying assumptions.

133 You'll sometimes see the acronyms DiD, Diff-in-diff, or even DnD.

Let's say that the intervention is erotic services, or E, and we want to know the causal effect of E on female murders, M. Couldn't we just compare San Francisco murders in, say, 2003 with some other city, like Waco, Texas, where the author lives? Let's look at that.

Table 32: Compared to what? Different cities

Cities           Outcome
San Francisco    M = SF + E
Waco, Texas      M = W

where SF is an unobserved San Francisco fixed effect and W is a Waco fixed effect. When we make a simple comparison between Waco and San Francisco, we get a causal effect equalling E + (SF - W). Thus the simple difference is biased because of SF - W. Notice that the SF - W term is akin to our selection bias term in the decomposition of the simple difference in outcomes. It's the underlying difference in murder rates between the two cities in a world where neither gets treated. So if our goal is to get an unbiased estimate of E, then that simple difference won't work unless W and SF are the same.

But what if we compared San Francisco to itself? Say, compared 2003 to two years earlier in 2001? Let's look at that simple before and after difference.

Table 33: Compared to what? Before and after

Cities           Time     Outcome
San Francisco    Before   M = SF
                 After    M = SF + E + T

Again, this doesn't lead to an unbiased estimate of E, even if it does eliminate the fixed effect. That's because such differences can't control for or net out natural changes in the murder rate over time. I can't compare San Francisco before and after (E + T) because of T, which is a kind of omitted variable bias. If we could control for T, then it'd be fine, though.

The intuition of the DD strategy is simple: all you do is combine these two simpler approaches so that you can eliminate both the selection bias and the effect of time. Let's look at it in the following table. The first difference, D1, does the simple before and after difference.
This ultimately eliminates the unit-specific fixed effects. Then, once those differences are made, we difference the differences (hence the name) to get the unbiased estimate of E. But there are a couple of key assumptions in a standard DD model. First, we are assuming that there are no time-variant, city-specific unobservables: nothing unobserved in San Francisco that is changing over time that also determines murders. And secondly, we are assuming that T is the same for all units. This second assumption is called the parallel

trends assumption, which I'll discuss in more detail later.

Table 34: Compared to what? Subtract each city's differences

Cities           Time     Outcome          D1       D2
San Francisco    Before   M = SF
                 After    M = SF + E + T   E + T
Waco             Before   M = W                     E
                 After    M = W + T        T

DD is a powerful, yet amazingly simple, strategy. It is a kind of panel estimator in the sense that it utilizes repeated observations on the same unit to eliminate the unobserved heterogeneity confounding the estimate of the treatment effect. But here we treat it separately because of the amount of focus it has gotten in the literature.

Background

You see traces of this kind of strategy in Snow's cholera study, though technically Snow only did a simple difference. He just had every reason to believe that, absent the treatment, the two parts of London would've had similar underlying cholera rates, since the two groups were so similar ex ante. The first time I ever saw DD in its current form was Card and Krueger [1994], a famous minimum wage study. This was a famous study primarily because of its use of an explicit counterfactual for estimation.

Suppose you are interested in the effect of minimum wages on employment. Theoretically, you might expect that in competitive labor markets, an increase in the minimum wage would move us up a downward sloping demand curve, causing employment to fall. But Card and Krueger [1994] were interested in quantifying this, and approached it furthermore as though it were purely an empirical question. Their strategy was to do a simple DD between two neighboring states, a strategy we would see again in minimum wage research with Dube et al. [2010]. New Jersey was set to experience an increase in the state minimum wage from $4.25 to $5.05, but neighboring Pennsylvania's minimum wage was staying at $4.25 (see Figure 80).
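The arithmetic behind the "subtract each city's differences" table can be verified in a few lines. This is an illustrative sketch in Python, with made-up values for the fixed effects SF and W, the common time effect T, and the treatment effect E (none of these numbers are estimates from the chapter):

```python
# Illustrative 2x2 difference-in-differences, following Table 34.
# SF and W are city fixed effects, T is a common time effect, and E is
# the treatment effect; all values are made up for illustration.
SF, W, T, E = 10.0, 4.0, 3.0, -2.0

sf_before = SF
sf_after  = SF + E + T   # treated city gets the time effect plus the treatment
w_before  = W
w_after   = W + T        # control city gets only the time effect

d1_sf = sf_after - sf_before   # = E + T (kills the SF fixed effect)
d1_w  = w_after - w_before     # = T     (kills the W fixed effect)
d2    = d1_sf - d1_w           # = E     (kills the common time effect)

print(d2)   # -2.0
```

The second difference recovers E exactly because both the city fixed effects and the common time effect cancel, which is precisely why the parallel trends assumption (a common T) is doing all the work.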
They surveyed about 400 fast food stores in both New Jersey and Pennsylvania before and after the minimum wage increase. This was used to measure the outcomes they cared about (i.e., employment).

Let Y¹_ist be employment at restaurant i, in state s, at time t with a high minimum wage, and let Y⁰_ist be employment at restaurant i, state s, time t with a low minimum wage. As we've said repeatedly through this book, we only see one or the other because the

Figure 80: Locations of restaurants in NJ and PA (Card and Krueger 2000).

switching equation selects one or the other based on the treatment assignment. But we can then assume that

 E[Y⁰_ist | s, t] = γ_s + λ_t

In the absence of a minimum wage change, in other words, employment in a state will be determined by the sum of a time-invariant state fixed effect, γ_s, that is idiosyncratic to the state, and a time effect λ_t that is common across all states. Let D_st be a dummy for high-minimum-wage states and periods. Under the conditional independence assumption, we can write out the average treatment effect as

 E[Y¹_ist − Y⁰_ist | s, t] = δ

and observed employment can be written as

 Y_ist = γ_s + λ_t + δD_st + ε_ist

Figure 81 shows the distribution of wages in November 1992 after the minimum wage hike. As can be seen, the minimum wage hike was binding, as evidenced by the mass of wages at the minimum wage in New Jersey. Now how do we take all this information and precisely calculate the treatment effect? One way is to do what we did earlier in our San

Figure 81: Distribution of wages for NJ and PA in November 1992.

Francisco and Waco example: compute before and after differences for each state, and then difference those differences. In New Jersey:

• Employment in February is E(Y_ist | s = NJ, t = Feb) = γ_NJ + λ_Feb
• Employment in November is E(Y_ist | s = NJ, t = Nov) = γ_NJ + λ_Nov + δ
• Difference between November and February: E(Y_ist | s = NJ, t = Nov) − E(Y_ist | s = NJ, t = Feb) = λ_N − λ_F + δ

And in Pennsylvania:

• Employment in February is E(Y_ist | s = PA, t = Feb) = γ_PA + λ_Feb
• Employment in November is E(Y_ist | s = PA, t = Nov) = γ_PA + λ_Nov

• Difference between November and February: E(Y_ist | s = PA, t = Nov) − E(Y_ist | s = PA, t = Feb) = λ_N − λ_F

Once we have those two before-and-after differences, we simply difference them to net out the time effects. The DD strategy amounts to comparing the change in employment in NJ to the change in employment in PA. The population DD is:

 δ̂ = [E(Y_ist | s = NJ, t = Nov) − E(Y_ist | s = NJ, t = Feb)] − [E(Y_ist | s = PA, t = Nov) − E(Y_ist | s = PA, t = Feb)]
   = (λ_N − λ_F + δ) − (λ_N − λ_F)
   = δ

This is estimated using the sample analog of the population means (see Figure 82).

Figure 82: Simple DD using sample averages (Card and Krueger's Table 3: average FTE employment per store before and after the rise in the New Jersey minimum wage).

What made this study so controversial was less its method and more its failure to find the negative effect on employment predicted by a neoclassical perfect competition model. In fact, not only did employment not fall; their DD showed it rose relative to the counterfactual.
This paper started a new wave of studies on the minimum wage, which continues to this day.¹³⁴

134: A review of that literature is beyond the scope of this chapter, but you can find a relatively recent review by Neumark et al. [2014].

Simple differencing is one way to do it, but it's not the only way. We can also directly estimate this using a regression framework. The advantages of the regression approach are that we can control for other variables (which may reduce the residual variance, leading to smaller standard errors), it's easy to include multiple time periods, and we can study

treatments with different treatment intensity (e.g., varying increases in the minimum wage for different states). The typical regression model we estimate is:

 Y_it = α + β₁D_s + β₂Post_t + δ(D × Post)_st + ε_it

where D is a dummy for whether the unit is in the treatment group or not, Post is a post-treatment dummy, and the interaction is the DD coefficient of interest.

One way to build this is to have as a separate variable the date on which a unit (e.g., state) received the treatment, and then generate a new variable equalling the difference between the current date and the date of treatment. So for instance, say that the current date is 2001 and the treatment occurred in 2004. Then 2001 − 2004 equals −3. This new variable would be a re-centering of the time period such that each unit was given a date from the point it received the treatment. Then one could define the post-treatment period as all periods where the recentered variable exceeded zero for those treatment units.

In the Card and Krueger case, the equivalent regression would be:

 Y_its = α + γNJ_s + λd_t + δ(NJ × d)_st + ε_its

where NJ is a dummy equal to 1 if the observation is from NJ, and d is a dummy equal to 1 if the observation is from November (the post period). This equation takes the following values:

• PA Pre: α
• PA Post: α + λ
• NJ Pre: α + γ
• NJ Post: α + γ + λ + δ

The DD estimate is (NJ Post − NJ Pre) − (PA Post − PA Pre) = δ. We can see this visually in Figure 83. Notice that the regression identifies a vertical bar in the post-treatment period marked by the δ. What's important to notice is that algebraically this is only the actual treatment effect if the declining line for NJ is exactly equal to the declining line for PA. In other words, it's because of these parallel trends that the object identified by the regression equals in expectation the true parameter.
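As a sketch of the regression approach (simulated data with an assumed true effect of 2.5, not any real data set), the OLS coefficient on the interaction reproduces the difference of the four cell means exactly, because the model is saturated in the two dummies:

```python
import random

# Simulated repeated cross-section: group effect 0.5, time effect 0.8,
# and an assumed treatment effect of 2.5 in the treated-post cells.
rng = random.Random(1)
data = []
for _ in range(4000):
    d = rng.randint(0, 1)     # D_s: treatment-group dummy
    p = rng.randint(0, 1)     # Post_t: post-period dummy
    y = 1.0 + 0.5 * d + 0.8 * p + 2.5 * d * p + rng.gauss(0, 1)
    data.append((d, p, y))

# OLS of y on [1, D, Post, D*Post] via the normal equations X'Xb = X'y.
def ols(rows):
    X = [(1.0, d, p, d * p) for d, p, _ in rows]
    y = [r[2] for r in rows]
    k = 4
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    A = [row + [b] for row, b in zip(xtx, xty)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

beta = ols(data)
dd_ols = beta[3]  # coefficient on the interaction: the DD estimate

# The same number computed directly from the four cell means.
def cell_mean(dv, pv):
    ys = [y for d, p, y in data if d == dv and p == pv]
    return sum(ys) / len(ys)

dd_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
assert abs(dd_ols - dd_means) < 1e-8
```

With two binary regressors and their interaction, OLS fits each of the four cells' means exactly, which is why the interaction coefficient and the "difference the differences" calculation coincide.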
This gets to our key identifying assumption in DD strategies - the parallel trends assumption. It is an untestable assumption because, as we can see, we don't know what would've happened to employment in New Jersey had they not passed the minimum wage; that is a counterfactual state of the world. Maybe it would've

Figure 83: DD regression diagram.

evolved the same as Pennsylvania, but maybe it wouldn't have. We have no way of knowing. Empiricists faced with this untestable assumption have chosen, therefore, to use deduction as a second best for checking the assumption. By which I mean, empiricists will reason that if the pre-treatment trends were parallel between the two groups, then wouldn't it stand to reason that the post-treatment trends would have been too? Notice, this is not a test of the assumption; rather, this is a test of a possible corollary of the assumption: checking the pre-treatment trends. I emphasize this because I want you to understand that checking the parallelism of the pre-treatment trends is not equivalent to proving that the post-treatment trends would've evolved the same. But given we see that the pre-treatment trends evolved similarly, it does give some confidence that the post-treatment trends would've too (absent some unobserved group-specific time shock). That would look like this (see Figure 84).

Including leads into the DD model is an easy way to check for the pre-treatment trends. Lags can be included to analyze whether the treatment effect changes over time after treatment assignment, too. If you did this, then the estimating regression equation would be:

 Y_its = γ_s + λ_t + Σ_{τ=−q}^{−1} γ_τ D_sτ + Σ_{τ=0}^{m} δ_τ D_sτ + x_ist + ε_ist

Treatment occurs in year 0. You include q leads, or anticipatory effects, and m lags, or post-treatment effects. Boom goes the dynamite.

Autor [2003] included both leads and lags in his DD model when he studied the effect of increased employment protection on the

Figure 84: Checking the pre-treatment trends for parallelism. (Common trends does not require that the two groups have the same mean of the outcome, and even if pre-trends are the same, one still has to worry about other policies changing at the same time.)

firms' use of temporary help workers. In the US, employers can usually hire and fire at will, but some state courts have made exceptions to this "employment at will" rule and have thus increased employment protection. The standard thing in this kind of analysis is to do what I said earlier and re-center the adoption year to 0. Autor [2003] then analyzed the effects of these exemptions on the use of temporary help workers. These results are shown in Figure 85.

Figure 85: Autor [2003] leads and lags in a dynamic DD model.

Notice that the leads are very close to 0. Thus, there is no evidence for anticipatory effects (good news for the parallel trends assumption). The lags show that the treatment effect is dynamic: it increases during the first few years, and then plateaus.

Inference

Many papers using DD strategies use data from many years - not just 1 pre-treatment and 1 post-treatment period like Card and Krueger [1994]. The variables of interest in many of these setups only vary at a group level, such as the state, and outcome variables are often serially correlated. In Card and Krueger [1994], for instance, it is very likely that employment in each state is not only correlated within the state but also serially correlated. Bertrand et al. [2004] point out that the conventional standard errors often severely understate the standard deviation of the estimators, and so standard errors are biased downward (i.e., incorrectly small). Bertrand et al. [2004] therefore propose the following solutions:

1. Block bootstrapping standard errors (if you analyze states, the block should be the states, and you would sample whole states with replacement for bootstrapping).
2. Clustering standard errors at the group level (in Stata, one would simply add , cluster(state) to the regression equation if one analyzes state-level variation).
3. Aggregating the data into one pre and one post period. This literally works if there is only one treatment date. With staggered treatment dates, one should adopt the following procedure:
   • Regress Y_st onto state FE, year FE and relevant covariates
   • Obtain residuals from the treatment states only and divide them into two groups: pre-treatment and post-treatment
   • Then regress the two groups of residuals onto a post dummy

Correct treatment of standard errors sometimes makes the number of groups very small: in Card and Krueger [1994], the number of groups is only 2.
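A minimal sketch of option 1, block bootstrapping, with invented state series. The key point is that whole states are resampled with replacement, so each state's time series stays intact and within-state serial correlation is preserved:

```python
import random

# Invented state-year panel: each state is one "block" of serially
# correlated observations (numbers are made up).
panel = {
    "NJ": [20.1, 20.4, 21.0, 21.3],
    "PA": [23.3, 22.8, 21.2, 21.0],
    "NY": [25.0, 24.6, 24.9, 25.2],
    "DE": [18.0, 18.4, 18.1, 18.6],
}

def block_bootstrap_sample(panel, rng):
    """Resample entire states with replacement, keeping each state's
    time series unbroken; repeat draws get distinct keys."""
    states = list(panel)
    draws = [rng.choice(states) for _ in states]
    return {f"{s}_{i}": panel[s] for i, s in enumerate(draws)}

rng = random.Random(42)
resample = block_bootstrap_sample(panel, rng)

# Same number of blocks, and every block is an unbroken state series.
assert len(resample) == len(panel)
assert all(series in panel.values() for series in resample.values())
```

In practice one would re-estimate the DD coefficient on each such resample and use the spread of those estimates as the standard error.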
More common than not, researchers will use the second option (clustering the standard errors by group), though sometimes you'll see people do all three for robustness.

Threats to validity

There are four threats to validity in a DD strategy: (1) non-parallel trends; (2) compositional differences; (3) long-term effects vs. reliability; (4) functional form dependence. We discuss these now in order. Regarding the violation of parallel trends, one way in which that happens is through endogenous treatments. Often policymakers

273 differences - differences 273 in - will select the treatment and controls based on pre-existing differ- ences in outcomes – practically guaranteeing the parallel trends assumption will be violated. One example of this is the“Ashenfelter dip”, named after Orley Ashenfelter, labor economist at Princeton. Participants in job trainings program often experience a “dip” in earnings just prior to entering the program. Since wages have a nat- ural tendency to mean reversion, comparing wages of participants and non-participants using DD leads to an upward biased estimate of the program effect. Another example is regional targeting, like when NGOs target villages that appear most promising, or worse off. This is a form of selection bias and violates parallel trends. What can you do if you think the parallel trends assumption is violated? There’s a variety of robustness checks that have become very common. They all come down to various forms of placebo analysis. For instance, you can look at the leads like we said. Or you can use a falsification test using data for an alternative control group, which I’ll discuss in a moment. Or you can use a falsification test using alternative outcomes that shouldn’t be affected by the treatment. For instance, if Craigslist’s erotic services only helps female sex workers, then we might check that by estimating the same model against manslaughters and male murders – neither of which are predicted to be affected by ERS, but which would be affected by secular violence trends. DDD The use of the alternative control group, though, is usually 135 135 called the differences-in-differences-in-differences model, or DDD. Also called triple difference, DnDnD, or DiDiD. [ 1994 ] in his study of maternity Gruber This was first introduced by benefits. Before we dig into this paper, let’s go back to our original DD table from the start of the chapter. What if we introduced city- specific time-variant heterogeneity? Then DD is biased. Let’s see. 
Table 35: Differences-in-differences-in-differences.

 Cities          Category         Period   Outcomes                    D1                     D2              D3
 San Francisco   Female murders   Before   SF
                                  After    SF + T_t + t_SF + f_t + δ   T_t + t_SF + f_t + δ
                 Male murders     Before   SF                                                 f_t − m_t + δ
                                  After    SF + T_t + t_SF + m_t       T_t + t_SF + m_t                       δ
 Waco            Female murders   Before   W
                                  After    W + T_t + t_W + f_t         T_t + t_W + f_t
                 Male murders     Before   W                                                  f_t − m_t
                                  After    W + T_t + t_W + m_t         T_t + t_W + m_t

The way that you read this table is as follows. Female murders

in San Francisco are determined by some San Francisco fixed effect in the before period, and that same San Francisco fixed effect in the after period plus a time trend T_t, a San Francisco-specific time trend t_SF, a trend in female murders f_t separate from the national trend shaping all crimes, and the erotic services platform δ. When we difference this, we get

 δ + T_t + t_SF + f_t

Now in the normal DD, we would do the same before-and-after differencing for Waco female murders, which would be

 T_t + t_W + f_t

And if we differenced these two, we'd get

 δ + t_SF − t_W

This is the familiar selection bias term - the DD estimator would isolate the treatment effect plus the selection bias, and thus we couldn't know the effect itself. The logic of the DDD strategy is to use a within-city comparison group that experiences the same city-specific trends, as well as its own crime-specific trend, and to use these within-city controls to net them out. Go through each difference to confirm that at the third difference, you have isolated the treatment effect, δ. Note that while this seems to have solved the problem, it came at a cost, which is more parallel trends assumptions. That is, now we require that female murders have a common trend, the entire country have a common trend, and each city have a common trend. We also require that these crime outcomes be additive; otherwise the differencing would not eliminate the components from the analysis. Gruber [1994] does this exact kind of triple differencing in his original maternity mandate paper.
His main results are shown in Figure 86. These kinds of simple triple differences are useful because they explain the intuition behind triple differencing, but in practice you will usually run regressions of the following form:

 Y_ijt = α + β₁X_ijt + β₂τ_t + β₃δ_j + β₄D_i + β₅(δ × τ)_jt + β₆(τ × D)_ti + β₇(δ × D)_ij + β₈(δ × τ × D)_ijt + ε_ijt

where, in this representation, the parameter of interest is β₈. There are a few things I want to bring to your attention. First, notice the additional subscript, j. This j indexes whether it's the main category of interest (e.g., female murders) or the within-city comparison group (e.g., male murders).
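To see the triple-difference logic concretely, here is a sketch with made-up parameter values in the San Francisco/Waco notation of Table 35 (T is the common trend, t_SF and t_W the city-specific trends, f and m the crime-category trends, d the treatment effect; all values are invented):

```python
# Hypothetical components: a city-specific trend t_SF that ordinary DD
# cannot remove, and an assumed true treatment effect d = -2.0.
SF, W, T, t_SF, t_W, f, m, d = 10.0, 8.0, 1.0, 1.5, 0.2, 0.7, 0.3, -2.0

mean = {  # (city, category, period) -> murder rate
    ("SF", "F", "pre"):  SF,
    ("SF", "F", "post"): SF + T + t_SF + f + d,
    ("SF", "M", "pre"):  SF,
    ("SF", "M", "post"): SF + T + t_SF + m,
    ("W",  "F", "pre"):  W,
    ("W",  "F", "post"): W + T + t_W + f,
    ("W",  "M", "pre"):  W,
    ("W",  "M", "post"): W + T + t_W + m,
}

def dd(city, cat):
    """Within-city, within-category before/after difference."""
    return mean[(city, cat, "post")] - mean[(city, cat, "pre")]

# Ordinary DD on female murders is contaminated by t_SF - t_W ...
dd_female = dd("SF", "F") - dd("W", "F")
assert abs(dd_female - (d + t_SF - t_W)) < 1e-9

# ... but the third difference nets out the city-specific trends.
ddd = (dd("SF", "F") - dd("SF", "M")) - (dd("W", "F") - dd("W", "M"))
assert abs(ddd - d) < 1e-9
```

Each within-city comparison removes that city's trend, and differencing the two cities' comparisons then removes the category trend, leaving only d.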

Figure 86: Gruber [1994] Table 3 (triple difference estimates for mandated maternity benefits).

But sometimes it is sufficient just to use your DD model to examine the effect of the treatment on a placebo as a falsification. For instance, Cheng and Hoekstra [2013] examined the effect of castle doctrine gun laws on homicides as their main results, but they performed placebo analysis by also looking at the laws' effect on grand theft auto. Auld and Grootendorst [2004] estimated standard "rational addiction" models from Becker and Murphy [1988] on outcomes that could not possibly be considered addictive, such as eggs and milk. Since they found evidence for addiction with these models, they argued that the identification strategy that authors had been using previously to evaluate the rational addiction model was flawed. And then there is the networks literature. Several studies found significant network effects on outcomes like obesity, smoking, alcohol use and happiness, leading many researchers to conclude that these kinds of risk behaviors were "contagious" through peer effects. Cohen-Cole and Fletcher [2008] used similar models and data to study network effects for things that couldn't be transmitted between peers - acne, height, and headaches - in order to show that those research designs were flawed.

DD can be applied to repeated cross-sections, as well as panel data. But one of the risks of working with repeated cross-sections is that, unlike panel data (e.g., individual-level panel data), repeated cross-sections run the risk of compositional changes. Hong [2013] used repeated cross-sectional data from the Consumer Expenditure Survey (CEX) containing music expenditure and internet use for a random sample of households. The study exploited the emergence of Napster, the first file-sharing software widely used by Internet users, in June 1999 as a natural experiment. The study compared Internet users and Internet non-users before and after the emergence of Napster.
Figure 87 shows the main results. Notice that as Internet diffusion increased, music expenditure for the Internet user group declined relative to the non-user group, suggesting that Napster was causing people to substitute away from music purchases toward file sharing. But when we look at Figure 88, we see evidence of compositional changes in the sample itself. While music expenditure fell over the treatment period, the age of the sample grew while income fell. If older people are less likely to buy music in the first place, then this could independently explain some of the decline. This kind of compositional change is a kind of omitted variable bias caused by time-variant unobservables. Diffusion of the Internet appears to be changing the samples, as younger music fans are early adopters.
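A crude way to check for this kind of compositional drift is to compare group-level covariate means across survey waves. A sketch with invented numbers (not Hong's actual statistics):

```python
# Hypothetical covariate means for the treatment (user) group by wave.
# In a repeated cross-section, each wave is a fresh sample, so these
# can drift even though no individual household changes.
user_group = {
    1997: {"age": 40.0, "income": 53000},
    1998: {"age": 42.5, "income": 52000},
    1999: {"age": 44.0, "income": 50000},
    2000: {"age": 49.0, "income": 48000},
}

def drifting_covariates(means_by_wave, tol=0.05):
    """Flag covariates whose group mean moves more than `tol`
    (proportionally) between the first and last wave."""
    waves = sorted(means_by_wave)
    first, last = means_by_wave[waves[0]], means_by_wave[waves[-1]]
    return [k for k in first if abs(last[k] - first[k]) / abs(first[k]) > tol]

print(drifting_covariates(user_group))  # ['age', 'income']
```

Flagged covariates are candidates for the composition story: if they also predict the outcome, part of the DD estimate may reflect who is in the sample rather than the treatment.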

Figure 87: Internet diffusion and average quarterly music expenditure in the CEX, by Internet user and non-user groups, 1996-2001.
Figure 88: Comparison of Internet user and non-user groups (descriptive statistics from the CEX: average expenditure, demographics, and appliance ownership by year).

Stata exercise: Abortion legalization and long-run gonorrhea incidence

Exposition of Cunningham and Cornwell [2013]. As we have shown, estimating the DD model is straightforward, but running through an example would probably still be beneficial. And since the DDD requires reshaping the data, it would definitely be useful to run through an example that did both. The study we will be replicating is Cunningham and Cornwell [2013]. But first let's learn about the project and its background.

Gruber et al. [1999] started a controversial literature. What was the effect that abortion legalization in the 1970s had on the marginal child who would've been born 15-20 years later? The authors showed that the child who would have been born had abortion remained illegal was 60% more likely to live in a single-parent household. The most famous paper to pick up on that basic stylized fact was Donohue and Levitt [2001]. The authors link abortion legalization in the early 1970s with the decline in crime in the 1990s. Their argument was similar to Gruber et al. [1999] - the marginal child was unwanted and would've grown up in poverty, both of which they argued could predict a higher propensity to commit crime as the cohort aged throughout the age-crime profile. But abortion legalization (in Gruber et al. [1999] and Donohue and Levitt [2001]'s argument) removed these individuals, and as such the treated cohort had positive selection. Levitt [2004] attributes as much as 10% of the decline in crime between 1991 and 2001 to abortion legalization in the 1970s.

This literature was, not surprisingly, incredibly controversial, some of it unwarranted. When asked whether abortion was correct to be legalized, Levitt hedged and said those sorts of ethical questions were beyond the scope of his study. Rather, his was a positive study interested only in cause and effect. But some of the ensuing criticism was more legitimate.
Joyce [2004], Joyce [2009], and Foote and Goetz [2008] all disputed the findings - some through replication exercises using different data and different identification strategies, and some through the discovery of key coding errors. Furthermore, why look only at crime? If the effect was as large as the authors claim, then wouldn't we find effects everywhere? Cunningham and Cornwell [2013] sought to build on Joyce [2009]'s challenge - if the abortion-selection hypothesis has merit, then shouldn't we find it elsewhere? Because of my research agenda in risky sexual behavior, I chose to investigate the effect on gonorrhea incidence. Why STIs? For one, the characteristics of the marginal child could explain risky sexual behavior that leads to disease transmission. Being raised by a single parent is a strong predictor of earlier

sexual activity and unprotected sex. Levine et al. [1999] found that abortion legalization caused teen childbearing to fall by 12%. Charles and Luoh [2006] reported that children exposed in utero to a legalized abortion regime were less likely to use illegal substances, which is correlated with risky sexual behavior.

The estimating strategy that I used was conventional at the time. Five states repealed abortion laws three years before Roe v. Wade. My data from the CDC come in five-year age categories (e.g., 15-19 and 20-24 year olds). This created some challenges. First, the early repeal by some states should show declines in gonorrhea for the treated cohort three years before the Roe states (i.e., the rest of the country). Specifically, we should see lower incidence among 15-19 year olds in the repeal states during the 1986-1992 period relative to their Roe counterparts. Second, the treatment effect should be nonlinear because treated cohorts in the repeal states do not fully come of age until 1988, just when the 15-year-olds born under Roe enter the sample. Thus we should find negative effects on gonorrhea incidence lasting only briefly, for the duration of time until the Roe cohorts catch up and erase the effect.

I present a diagram of this dynamic in Figure 89. The top horizontal axis shows the year of the panel; the vertical axis shows the age in calendar years. The cells show the cohort for those individuals who are of a certain age in a given year. So for instance, a 15-year-old in 1985 was born in 1970. A 15-year-old in 1986 was born in 1971, and so forth. The highlighted blue means that person was exposed to repeal, and the highlighted yellow means that Roe catches up. This creates a very specific age pattern in the treatment effect, represented by the colored bottom row.
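The cohort bookkeeping behind Figure 89 can be computed directly. For illustration I assume repeal-state cohorts are exposed starting with the 1971 birth year and Roe cohorts starting with 1974; the actual legal timing is more complicated than this two-date simplification:

```python
# Simplified timing assumptions (not the exact legal dates): repeal
# states' exposed cohorts start with 1971 births, Roe states' with 1974.
REPEAL_FIRST, ROE_FIRST = 1971, 1974

def exposed_gap(year, ages=range(15, 20)):
    """Number of single-year birth cohorts aged 15-19 in `year` that are
    exposed to legal abortion in repeal states but not yet in Roe states."""
    births = [year - a for a in ages]
    repeal_exposed = sum(b >= REPEAL_FIRST for b in births)
    roe_exposed = sum(b >= ROE_FIRST for b in births)
    return repeal_exposed - roe_exposed

gaps = {y: exposed_gap(y) for y in range(1985, 1994)}
print(gaps)  # the gap opens in 1986, is largest 1988-1990, closes by 1993
```

Under these assumptions the repeal/Roe exposure gap is 0 in 1985, rises to its maximum over 1988-1990, fades through 1992, and is 0 from 1993 on, which is exactly the hump-shaped pattern of treatment effects the text predicts.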
We should see no effect in 1985; a slightly negative effect in 1986 as the first treated cohort reaches 15; an even more negative effect through 1987; and the most negative effect from 1988-1990. But then from 1990 through 1992, the treatment effect should gradually disappear. All subsequent DD coefficients should be zero thereafter, since there is no difference between the Roe and repeal states beyond 1992.

A simple graphic for Black 15-19 year old female incidence can help illustrate our findings. Remember, a picture speaks a thousand words, and whether it's RDD or DD, it's helpful to show pictures like these to prepare the reader for the table after table of regression coefficients. I present two pictures: one showing the raw data, and one showing the DD coefficients. The first is Figure 90. This picture captures the dynamics that we will be picking up in our DD plots. The shaded areas represent the period of time where differences between the treatment and control units should appear; beyond that window they should be the same, conditional on state and year fixed effects. And

Figure 89: Theoretical predictions of abortion legalization on age profiles of gonorrhea incidence.

Figure 90: Differences in black female gonorrhea incidence between Roe and repeal cohorts (Black 15-19 year old female gonorrhea incidence per 100,000, repeal states vs. Roe states, 1985-2000).

as you can see, Roe states experienced a large increase in gonorrhea during the window where repeal states were falling. Our estimating equation is as follows:

 Y_st = β₁Repeal_s + β₂DT_t + β₃Repeal_s × DT_t + X_st ψ + α₁DS_s + α₂t + γ DS_s × t + ε_st

where Y is the log number of new gonorrhea cases for 15-19 year olds (per 100,000 of the population); Repeal_s equals one if the state legalized abortion prior to Roe; DT_t is a year dummy; DS_s is a state dummy; t is a time trend; X is a matrix of covariates; DS_s × t are state-specific linear trends; and ε_st is an error term assumed to be conditionally independent of the regressors. We present plotted coefficients from this regression for simplicity (and because pictures can be so powerful) in Figure 91. As can be seen, there is a negative effect during the window where Roe has not fully caught up.

Figure 91: Coefficients and standard errors from the DD regression equation (estimated effect of abortion legalization on gonorrhea, Black females 15-19 year olds). Whisker plots are estimated coefficients of the DD estimator from Column b of Table 2.

The regression equation for a DDD is more complicated, as you recall from the Gruber [1994] paper. Specifically, it requires stacking new within-state comparison units who capture state-specific trends but who were technically untreated. We chose the 25-29 year olds in the same states as the within-state comparison group. We had also considered the 20-24 year olds as a within-state comparison group, but our reasoning was that that age group, while not treated, was more likely to have sex with the 15-19 year olds, who were treated, and thus SUTVA was violated. So we chose a group that was reasonably close enough to capture trends, but not so close that it violated SUTVA. The estimating

equation for this regression is

Y_ast = β1·Repeal_s + β2·DT_t + β3·(Repeal_s × DT_t) + δ1·DA_a + δ2·(DA_a × Repeal_s) + δ3·(DA_a × DT_t) + δ4·(Repeal_s × DA_a × DT_t) + X_st·ξ + γ1·(DS_s × t) + γ2·(DA_a × t) + γ3·(DS_s × DA_a) + γ4·(DS_s × DA_a × t) + ε_ast

where DA_a is a dummy for the treated (15-19 year-old) age group, and the DDD parameter we are estimating is δ4, the full interaction. In case this wasn't obvious, the reason there are so many separate dummies is that our DDD parameter requires all three interactions. Since the three binary variables generate eight combinations, we had to drop one as the omitted group and control separately for the other seven. Here we present the table of coefficients. Note that the effect should be concentrated only among the treatment years, as before. This is presented in Figure 92. [136] Column (b) controls for an age-state interaction with age-state specific linear time trends. As can be seen, we find nearly the same pattern using DDD as we found with our DD, though with less precision. I interpreted these patterns as evidence for the original Gruber et al. [1999] and Donohue and Levitt [2001] abortion-selection hypothesis.

[136]: Note, because the original table spans multiple pages, I didn't want to clutter up the page with awkwardly linked tables. But you can see the full table on pages 401-402 of Cunningham and Cornwell [2013].

Now what I'd like to do is replicate some of these results, as I want you to have handy a Stata replication file that will estimate a DD model, but also the slightly more cumbersome DDD model. Before we begin, you will need to download cgmreg.ado from Doug Miller's website, as referees asked us to implement the multi-way clustering correction for the standard errors to allow for correlation both across states and within states. That can be found at the top of http://, and, as with scuse.ado, must simply be saved into the /c subdirectory with your Stata folders. Let's begin:
. scuse abortion, clear
. xi: cgmreg lnr i.repeal*i.year i.fip acc ir pi alcohol crack poverty income ur ///
    if bf15==1 [aweight=totpop], cluster(fip year)
. test _IrepXyea_1986 _IrepXyea_1987 _IrepXyea_1988 _IrepXyea_1989 ///
    _IrepXyea_1990 _IrepXyea_1991 _IrepXyea_1992

The last line tests for the joint significance of the treatment (the repeal × year interactions). Note, for simplicity, I only estimated this for the black females (bf15==1), but you could estimate it for the black males (bm15==1), white females (wf15==1) or white males (wm15==1). We do all four in the paper, but I am just trying to give you a basic understanding of the syntax.
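The book's replication code is in Stata, but the arithmetic behind that interaction coefficient is worth seeing once in the raw. Below is a toy check in Python with fabricated data and hypothetical variable names (repeal, post, y), not the book's abortion dataset: in a saturated two-group, two-period regression, the coefficient on the treatment-by-post interaction equals the simple difference-in-differences of the four cell means.

```python
# Toy check (fabricated data, not the book's dataset): in a saturated
# 2x2 regression, the interaction coefficient equals the
# difference-in-differences of the four cell means.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
repeal = rng.integers(0, 2, n)      # "treated" group indicator
post = rng.integers(0, 2, n)        # post-treatment period indicator
# Outcomes with a built-in DD effect of -0.5
y = 1.0 + 0.3 * repeal + 0.2 * post - 0.5 * repeal * post \
    + rng.normal(0, 0.1, n)

# Saturated OLS: intercept, repeal, post, repeal x post
X = np.column_stack([np.ones(n), repeal, post, repeal * post])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Hand-computed difference-in-differences of the four cell means
cell = lambda r, p: y[(repeal == r) & (post == p)].mean()
dd = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

print(beta[3], dd)  # identical up to floating-point error
```

Because the regression is saturated, the equality is exact rather than asymptotic; the repeal × year event-study coefficients in the Stata output generalize the same logic from two periods to many.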

Figure 92: Subset of coefficients (year-repeal interactions) for the DDD model, from Table 3 of Cunningham and Cornwell [2013].

Table 3. Diff-in-diff-in-diff: panel fixed effects regressions of early repeal of abortion on in utero cohort log of 15-19 year-old gonorrhea incidence rates by race/gender, 25-29 comparison, state and age linear trends, 1985-2000, state clustering. [The table reports Repeal × 15-year-old × year coefficients for 1986-1997, with standard errors, for black females, black males, white females and white males, each in specifications (a) and (b).]

Next, we show how to use this sample so that we can estimate a DDD model. A considerable amount of reshaping had to be done earlier in the code, but it would take too long to post that here, so in v. 2.0 of this book I will provide the do file that was used to make the tables for this paper. For now, though, I will simply produce the commands that produce the black female result.

. gen yr=(repeal==1) & (younger==1)
. gen wm=(wht==1) & (male==1)
. gen wf=(wht==1) & (male==0)
. gen bm=(wht==0) & (male==1)
. gen bf=(wht==0) & (male==0)
. char year[omit] 1985
. char repeal[omit] 0
. char younger[omit] 0
. char fip[omit] 1
. char fa[omit] 0
. char yr[omit] 0
. cap n xi: cgmreg lnr i.repeal*i.year i.younger*i.repeal i.younger*i.year ///
    i.yr*i.year i.fip*t acc pi ir alcohol crack poverty income ur ///
    if `x'==1 & (age==15 | age==25) [aweight=totpop], cluster(fip year)
. test _IyrXyea_1986 _IyrXyea_1987 _IyrXyea_1988 _IyrXyea_1989 ///
    _IyrXyea_1990 _IyrXyea_1991 _IyrXyea_1992

Notice that some of these variables already are interactions (e.g., yr), which was my way to compactly include all of the interactions, since at the time my workflow used the asterisk to create interactions as opposed to the hashtag (e.g., ##). But I encourage you to study the data structure itself. Notice how I used if-statements to limit the regression analysis, which forced the data structure to shrink into either the DD matrix or the DDD matrix depending on how I did it.

Conclusion

I have a bumper sticker on my car that says "I love Federalism (for the natural experiments)" (Figure 93). [137] The reason I made this was half tongue-in-cheek, half legitimate gratitude. Because of state federalism, each American state is allowed considerable legislative flexibility to decide its own governance and laws. Yet, because of the federal government, many of our datasets are harmonized across states, making the United States even more useful for causal inference than European countries, which do not always have harmonized datasets for many interesting questions outside of macroeconomics.

[137]: Although this one has a misspelling, the real one has the plural version of experiments. I just couldn't find the image on my computer. :(

Figure 93: "I ♥ Federalism (for the natural experiments)" bumper sticker.

The reason to be grateful for federalism is that it provides a constantly evolving laboratory for applied researchers seeking to evaluate the causal effects of laws and other interventions. It has for this reason probably become one of the most popular forms of identification among American researchers, if not the most common. A Google search of the phrase "differences in differences" brought up 12 million hits. It is arguably the most common methodology you will use, more so than IV or matching or even RDD, despite RDD's greater credibility. There is simply a never-ending flow of quasi-experiments being created by our decentralized data generating process in the United States, made even more advantageous by so many federal agencies being responsible for data collection, thus ensuring improved data quality and consistency.

Study the ideas in this chapter. Review them. Review the dataset I provided and the Stata syntax. Walk yourself through the tables and figures I presented. Think carefully about why the regression analysis reproduces the exact same differencing that we presented in our DD and DDD tables. Study the DAG at the start of the chapter and the formal technical assumptions necessary for identification. Understanding what you're doing in DD and DDD is key to your career, because of its popularity if nothing else. You need to understand how it works, and under what conditions it can identify causal effects, if only to interact with colleagues' and peers' research.


Synthetic control

"I'm representin' for them gangstas all across the world" – Dr. Dre

"The synthetic control approach developed by Abadie et al. [2010, 2015] and Abadie and Gardeazabal [2003] is arguably the most important innovation in the policy evaluation literature in the last 15 years." – Athey and Imbens [2017]

In qualitative case studies, such as de Tocqueville's classic Democracy in America, the goal is to reason inductively about the causal effect of events or characteristics of a single unit on some outcome using logic and historical analysis. But this may not give a very satisfactory answer to causal questions, because oftentimes it lacks a counterfactual. As such, we are usually left with description and speculation about the causal pathways connecting various events to outcomes.

Quantitative comparative case studies are more explicitly causal designs. They usually are natural experiments, and they usually are applied to only a single unit, such as a single school, firm, state or country. These kinds of quantitative comparative case studies compare the evolution of an aggregate outcome with either some single other outcome, or, as is more often the case, a chosen set of similar units which serve as a control group.

As Athey and Imbens [2017] point out, one of the most important contributions to quantitative comparative case studies is the synthetic control model. The synthetic control model was developed in Abadie and Gardeazabal [2003] in a study of terrorism's effect on aggregate income, and was then elaborated on in a more exhaustive treatment [Abadie et al., 2010]. Synthetic control models optimally choose a set of weights which, when applied to a group of corresponding units, produce an optimally estimated counterfactual to the unit that received the treatment. This counterfactual, called the "synthetic unit", serves to outline what would have happened to the aggregate treated unit had the treatment never occurred.
It is a powerful, yet surprisingly simple, generalization of the differences-in-differences

strategy. We will discuss it now with a motivating example: the famous Mariel boatlift paper by Card [1990].

Cuba, Miami and the Mariel Boatlift

"Born in Miami, right on time / Scarface, El Mariel, Cuban crime" – Pitbull

Labor economists have debated the effect of immigration on local labor market conditions for many years [Card and Peri, 2016]. Do inflows of immigrants depress wages and the employment of natives in local labor markets? For Card [1990], this was an empirical question, and he used a natural experiment to evaluate it.

In 1980, Fidel Castro announced that anyone wishing to leave Cuba could do so if they exited from Mariel by a certain date. The resulting event, called the Mariel Boatlift, was a mass exodus from Cuba's Mariel Harbor to the United States (primarily Miami, Florida) between April and October 1980. Approximately 125,000 Cubans emigrated to Florida over this six-month period. The emigration stopped only because Cuba and the US mutually agreed to end it. The event increased the Miami labor force by 7%, largely by depositing a record number of low-skill workers into a relatively small area.

Card saw this as an ideal natural experiment. It was arguably an exogenous shift in the labor supply curve, which would allow him to determine if wages fell and employment increased, consistent with a simple competitive labor market model. He used individual-level data on unemployment from the CPS for Miami and chose four comparison cities (Atlanta, Los Angeles, Houston and Tampa-St. Petersburg). The choice of these four cities is relegated to a footnote in the paper, wherein Card argues that they were similar based on demographics and economic conditions.

Card estimated a simple DD model and found, surprisingly, no effect on wages or native unemployment. He argued that Miami's labor market was capable of absorbing the surge in labor supply because of similar surges two decades earlier.
The paper was very controversial, probably not so much because he attempted to answer empirically an important question in labor economics using a natural experiment, but rather because the result violated conventional wisdom. It would not be the last word on the subject, and I don’t take a stand on this question; rather, I introduce it to highlight a few characteristics of the study. It was a comparative case study which had certain strengths

and weaknesses. The policy intervention occurred at an aggregate level, for which aggregate data was available. But the problems with the study were, first, that the selection of the control group was ad hoc and ambiguous, and second, that the standard errors reflect sampling variance as opposed to uncertainty about the ability of the control group to reproduce the counterfactual of interest. [138]

[138]: Interestingly, a recent study replicated Card's paper using synthetic control and found similar results [Peri and Yasenov, 2018].

Abadie et al. [2010] and Abadie and Gardeazabal [2003] introduced the synthetic control estimator as a way of addressing both problems simultaneously. This method uses a weighted average of units in the donor pool to model the counterfactual. The method is based on the observation that, when the units of analysis are a few aggregate units, a combination of comparison units (the "synthetic control") often does a better job of reproducing characteristics of a treated unit than any single comparison unit alone. The comparison unit in this method is therefore selected to be the weighted average of all comparison units that best resembles the characteristics of the treated unit(s) in the pre-treatment period.

Abadie et al. [2010] argue that this method has many distinct advantages over regression-based methods. For one, the method precludes extrapolation; it uses interpolation instead, because the estimated causal effect is always based on a comparison between some outcome in a given year and a counterfactual in the same year. That is, it uses as its counterfactual a convex hull of control group units, and thus the counterfactual is based on where the data actually are, as opposed to extrapolating beyond the support of the data, which can occur in extreme situations with regression [King and Zeng, 2006].

A second advantage has to do with processing of the data.
The construction of the counterfactual does not require access to the post-treatment outcomes during the design phase of the study, unlike regression. The advantage here is that it helps the researcher avoid "peeking" at the results while specifying the model. Care and honesty must still be used, as it's just as easy to look at the outcomes during the design phase as it is not to, but the point is that it is hypothetically possible to focus just on design, and not estimation, with this method.

Another advantage, and ironically one that is oftentimes a reason people will object to the study, is that the weights which are chosen make explicit what each unit is contributing to the counterfactual. Now this is in many ways a strict advantage, except when it comes to defending those weights in a seminar. Because someone can see that Idaho is contributing 0.3 to your modeling of Florida, they are now able to argue that it's absurd to think Idaho is anything like Florida. But contrast this with regression, which also weights the data, but

does so blindly. The only reason no one objects to what regression produces as a weight is that they cannot see the weights. They are implicit, rather than explicit. So I see this explicit production of weights as a distinct advantage, because it makes synthetic control more transparent than regression-based designs.

A fourth advantage, which I think is often unappreciated, is that it bridges a gap between qualitative and quantitative types. Qualitative researchers are often the very ones focused on describing a single unit, such as a country or a prison [Perkinson, 2010], in great detail. They are usually the experts on the histories surrounding those institutions. They are usually the ones doing comparative case studies in the first place. Synthetic control places a valuable tool into their hands which enables them to choose counterfactuals, a process that in principle can improve their work insofar as they are interested in evaluating some particular intervention.

Finally, Abadie et al. [2010] argue that it removes subjective researcher bias, but I actually believe this is the most overstated benefit of the method. Through repeated iterations and changes to the matching formula, a person can just as easily introduce subjective choices into the process. Sure, the weights are optimally chosen to minimize some distance function, but through the choice of the covariates themselves, the researcher can in principle select different weights. She just doesn't have a lot of control over it, because ultimately the weights are optimal for a given set of covariates.

Formalization

"I'm the real Slim Shady; all you other Slim Shadys are just imitating" – Eminem

Let Y_jt be the outcome of interest for unit j of J+1 aggregate units at time t, and let the treatment group be j = 1.
The synthetic control estimator models the effect of the intervention at time T_0 on the treatment group using a linear combination of optimally chosen units as a synthetic control. For the post-intervention period, the synthetic control estimator measures the causal effect as

Y_1t − Σ_{j=2}^{J+1} w*_j · Y_jt

where w*_j is a vector of optimally chosen weights.

Matching variables, X_0 and X_1, are chosen as predictors of post-intervention outcomes and must be unaffected by the intervention. The weights are chosen so as to minimize the norm ||X_1 − X_0·W|| subject to weight constraints. There are two weight constraints. First, let W = (w_2, ..., w_{J+1})′ with w_j ≥ 0 for j = 2, ..., J+1. Second, let w_2 + ··· + w_{J+1} = 1. In words, no unit receives a negative weight, but a unit can receive a zero weight, and the sum of all weights must equal one. [139]

[139]: See Doudchenko and Imbens [2016] for recent work relaxing the non-negativity constraint.
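Purely to illustrate the constrained problem just stated (this is not the synth command's algorithm, and the numbers are made up), here is a minimal Python sketch. With one treated unit and two donors, the simplex constraints leave a single free weight, so a grid search over w_2 in [0, 1] with w_3 = 1 − w_2 suffices to minimize ||X_1 − X_0·W||.

```python
# Minimal sketch of the synthetic control weight problem (made-up
# numbers): choose W = (w2, w3) with w_j >= 0 and w2 + w3 = 1 that
# minimizes the distance between the treated unit's predictors and
# the weighted average of the donors' predictors.
import numpy as np

X1 = np.array([9.86, 17.40, 89.42])    # treated unit's predictors
X0 = np.array([[9.5, 10.2],            # donor predictors, one column per donor
               [16.0, 19.0],
               [85.0, 95.0]])

grid = np.linspace(0.0, 1.0, 10001)    # candidate values for w2
losses = [np.linalg.norm(X1 - X0 @ np.array([w, 1 - w])) for w in grid]
w2 = grid[int(np.argmin(losses))]
W = np.array([w2, 1 - w2])

print("weights:", W)                   # non-negative, sum to one
```

With more than two donors the same objective and constraints are normally handed to a quadratic programming routine, but nothing conceptual changes.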

As I said, Abadie et al. [2010] consider

||X_1 − X_0·W|| = √( (X_1 − X_0·W)′ V (X_1 − X_0·W) )

where V is some (k × k) symmetric and positive semidefinite matrix. Let X_jm be the value of the m-th covariate for unit j. Typically, V is diagonal with main diagonal v_1, ..., v_k. Then the synthetic control weights minimize

Σ_{m=1}^{k} v_m ( X_1m − Σ_{j=2}^{J+1} w_j X_jm )²

where v_m is a weight that reflects the relative importance that we assign to the m-th variable when we measure the discrepancy between the treated unit and the synthetic control.

The choice of V, as should be seen by now, is important because W* depends on one's choice of V: W* = W*(V). The synthetic control is meant to reproduce the behavior of the outcome variable for the treated unit in the absence of the treatment. Therefore, the weights v_1, ..., v_k should reflect the predictive value of the covariates.

Abadie et al. [2010] suggest different choices of V, but ultimately it appears from practice that most people choose the V that minimizes the mean squared prediction error:

Σ_{t=1}^{T_0} ( Y_1t − Σ_{j=2}^{J+1} w*_j(V) Y_jt )²

What about unobserved factors? Comparative case studies are complicated by unmeasured factors affecting the outcome of interest, as well as by heterogeneity in the effect of observed and unobserved factors. Abadie et al. [2010] note that if the number of pre-intervention periods in the data is "large", then matching on pre-intervention outcomes can allow us to control for the heterogeneous responses to multiple unobserved factors. The intuition here is that only units that are alike on both observables and unobservables would follow a similar trajectory pre-treatment.

California's Proposition 99

Abadie and Gardeazabal [2003] developed the synthetic control estimator so as to evaluate the impact that terrorism had on the Basque region. But Abadie et al. [2010] expound on the method using a cigarette tax in California called Proposition 99.
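Before turning to that application, the nested optimization just described, an inner weight problem W*(V) and an outer choice of V to minimize pre-treatment MSPE, can be sketched in Python. Everything here is fabricated for illustration (two predictors, two donors, three pre-treatment years); it is not the authors' data or the synth implementation.

```python
# Sketch of choosing a diagonal V by pre-treatment fit: for each
# candidate V = diag(v1, 1-v1), solve the inner weight problem, then
# keep the V whose implied weights best track the treated unit's
# pre-treatment outcomes.  All numbers are made up.
import numpy as np

X1 = np.array([10.0, 100.0])           # treated unit's two predictors
X0 = np.array([[9.0, 12.0],            # donor predictors, columns are donors
               [90.0, 130.0]])
Y1_pre = np.array([5.0, 5.5, 6.0])     # treated unit's pre-treatment outcomes
Y0_pre = np.array([[4.0, 7.0],         # donors' pre-treatment outcomes
                   [4.4, 7.6],
                   [4.8, 8.4]])

w_grid = np.linspace(0, 1, 1001)

def inner_weights(v1):
    """Minimize (X1 - X0 W)' V (X1 - X0 W) on the simplex, V = diag(v1, 1-v1)."""
    v = np.array([v1, 1 - v1])
    losses = [v @ (X1 - X0 @ np.array([w, 1 - w])) ** 2 for w in w_grid]
    w = w_grid[int(np.argmin(losses))]
    return np.array([w, 1 - w])

best = None
for v1 in np.linspace(0, 1, 101):      # outer search over V
    W = inner_weights(v1)
    mspe = np.mean((Y1_pre - Y0_pre @ W) ** 2)
    if best is None or mspe < best[0]:
        best = (mspe, v1, W)

mspe, v1, W = best
print("chosen v1:", v1, "weights:", W, "pre-treatment MSPE:", mspe)
```

In this toy example the outer loop ends up placing all of the importance on the first predictor, because it alone delivers the best pre-treatment fit; with real data the chosen V typically spreads importance across predictors.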
The cigarette tax example uses a placebo-based method for inference, which I want to explain, so let's look more closely at their paper. In 1988, California passed comprehensive tobacco control legislation called Proposition 99. Proposition 99 increased cigarette taxes by

25 cents a pack, spurred clean-air ordinances throughout the state, funded anti-smoking media campaigns, earmarked tax revenues to health and anti-smoking budgets, and produced more than $100 million a year in anti-tobacco projects. Other states had similar control programs, and they were dropped from the analysis.

Figure 94: California cigarette sales vs. the rest of the country. [Per-capita cigarette sales (in packs), 1970-2000, California vs. the rest of the U.S., with the passage of Proposition 99 marked.]

Figure 94 shows changes in cigarette sales for California and the rest of the United States annually from 1970 to 2000. As can be seen, cigarette sales fell after Proposition 99, but as they were already falling, it's not clear if there was any effect, particularly since they were falling in the rest of the country at the same time.

Using their method, though, they select an optimal set of weights that, when applied to the rest of the country, produces the figure shown in Figure 95. Notice that pre-treatment, this set of weights produces a nearly identical time path for California as the real California itself, but post-treatment the two series diverge. There appears at first glance to have been an effect of the program on cigarette sales.

The variables they used for their distance minimization are listed in Figure 96. Notice that this analysis produces values for the treatment group and control group that facilitate a simple investigation of balance. This is not a technical test, as there is only one value per variable per treatment category, but it's the best we can do with this method. And it appears that the variables used for matching are similar across the two groups, particularly for the lagged values.
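Everything in these figures is simple arithmetic once the weights exist: the synthetic series is the weighted average of donor outcomes, the gap plot is the difference between the two series, and the "synthetic" column of the balance table is the same weighted average applied to the predictors. A Python sketch with made-up numbers (three donors, three years; not the actual Proposition 99 data):

```python
# With estimated weights in hand, the synthetic unit is a weighted
# average of the donors.  All numbers here are made up.
import numpy as np

W = np.array([0.2, 0.5, 0.3])            # hypothetical donor weights (sum to 1)

Y0 = np.array([[120.0, 130.0, 110.0],    # donor outcomes: rows = years
               [115.0, 125.0, 105.0],
               [110.0, 118.0, 100.0]])
Y1 = np.array([121.0, 114.5, 90.0])      # treated unit; falls away in year 3

synthetic = Y0 @ W                        # synthetic unit's outcome path
gap = Y1 - synthetic                      # the series plotted in a gap figure

X0 = np.array([[9.5, 10.2, 9.9],          # donor predictor means
               [16.0, 19.0, 18.0]])
balance = X0 @ W                          # the "synthetic" balance-table column

print("synthetic:", synthetic)            # ~ [122, 117, 111]
print("gap:", gap)                        # ~ [-1, -2.5, -21]
print("balance:", balance)                # ~ [9.97, 18.1]
```

The large third-year gap is the kind of post-treatment divergence the figures display; the balance column is just one number per predictor, which is why the text calls it an investigation of balance rather than a technical test.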

Figure 95: California cigarette sales vs. synthetic California. [Per-capita cigarette sales (in packs), 1970-2000, California vs. synthetic California, with the passage of Proposition 99 marked.]

Figure 96: Balance table. Predictor means: actual vs. synthetic California.

Variables | Real California | Synthetic California | Average of 38 control states
Ln(GDP per capita) | 10.08 | 9.86 | 9.86
Percent aged 15-24 | 17.40 | 17.40 | 17.29
Retail price | 89.42 | 89.41 | 87.27
Beer consumption per capita | 24.28 | 24.20 | 23.75
Cigarette sales per capita 1988 | 90.10 | 91.62 | 114.20
Cigarette sales per capita 1980 | 120.20 | 120.43 | 136.58
Cigarette sales per capita 1975 | 127.10 | 126.99 | 132.81

Note: All variables except lagged cigarette sales are averaged for the 1980-1988 period (beer consumption is averaged 1984-1988).

Like RDD, synthetic control is a picture-intensive estimator. Your estimator is basically a picture of two series which, if there is a causal effect, diverge from one another post-treatment, but resemble each other pre-treatment. It is therefore common to see a picture just showing the difference between the two series (Figure 97).

Figure 97: Smoking gap between CA and synthetic CA. [Gap in per-capita cigarette sales (in packs), 1970-2000, with the passage of Proposition 99 marked.]

But so far, we have only covered estimation. How do we determine whether the observed difference between the two series is a statistically significant difference? After all, we only have two observations per year. Maybe the divergence between the two series is nothing more than prediction error, and any model chosen would've done that, even if there was no treatment effect. Abadie et al. [2010] suggest that we use an old-fashioned method to construct exact p-values, based on Fisher [1935]. This is done through "randomization" of the treatment to each unit, re-estimating the model, and calculating a set of root mean squared prediction error (RMSPE) values for the pre- and post-treatment periods. [140] We proceed as follows:

[140]: What we will do is simply reassign the treatment to each unit, putting California back into the donor pool each time, estimate the model for that "placebo", and record the information.

1. Iteratively apply the synthetic control method to each country/state in the donor pool and obtain a distribution of placebo effects from each iteration.

2. Calculate the RMSPE for each placebo for the pre-treatment period:

RMSPE = ( (1/(T − T_0)) Σ_{t=T_0+1}^{T} ( Y_1t − Σ_{j=2}^{J+1} w*_j Y_jt )² )^{1/2}

3. Calculate the RMSPE for each placebo for the post-treatment period (a similar equation, but for the post-treatment period).

4. Compute the ratio of the post- to pre-treatment RMSPE.

5. Sort these ratios in descending order, from greatest to least.

6. Calculate the treatment unit's p-value from its position in the distribution: p = RANK / TOTAL.

In other words, what we want to know is whether California's treatment effect is extreme, which is a relative concept compared to the donor pool's own placebo ratios.

There are several different ways to represent this. The first is to overlay California with all the placebos using Stata's twoway command, which I'll show later. Figure 98 shows what this looks like. And I think you'll agree, it tells a nice story. Clearly, California is in the tails of some distribution of treatment effects.

Figure 98: Placebo distribution (all states in donor pool). [Smoking gap for CA and 38 control states: gap in per-capita cigarette sales (in packs), 1970-2000, with the passage of Proposition 99 marked.]

Abadie et al. [2010] recommend iteratively dropping the states whose pre-treatment RMSPE is considerably different than California's, because, as you can see, they're kind of blowing up the scale and making it hard to see what's going on. They do this in several steps, but I'll just skip to the last one (Figure 99). In this, they've dropped any state unit from the graph whose pre-treatment RMSPE is more than two times that of California's. This therefore limits the picture to just units whose

model fit, pre-treatment, was pretty good, like California's.

Figure 99: Placebo distribution (pre-Prop. 99 MSPE ≤ 2 times pre-Prop. 99 MSPE for CA). [Smoking gap for CA and 19 control states: gap in per-capita cigarette sales (in packs), 1970-2000.]

But, ultimately, inference is based on those exact p-values. So the way we do this is we simply create a histogram of the ratios, and more or less mark the treatment group in the distribution so that the reader can see the exact p-value associated with the model. I produce that here in Figure 100. As can be seen, California is ranked 1st out of 38 state units. [141] This gives an exact p-value of 0.026, which is less than the conventional 5% most journals (arbitrarily) want to see for statistical significance.

[141]: Recall, they dropped several states who had similar legislation passed over this time period.

Figure 100: Placebo distribution (all 38 states in donor pool). [Histogram of post/pre-Proposition 99 mean squared prediction error ratios; California sits in the far right tail.]

Falsifications

In Abadie et al. [2015], the authors studied the effect of the reunification of Germany on GDP. One of the contributions this paper makes, though, is a recommendation for how to test the validity of the estimator through a falsification exercise. To illustrate this, let's walk through their basic findings. In Figure 101, the authors illustrate their main question by showing the changing trend lines for West Germany and the rest of their OECD sample. As we saw with cigarette smoking, it's difficult to make a statement about the effect of reunification given that West Germany is dissimilar from the other countries on average before reunification.

Figure 101: West Germany GDP vs. other countries. [Trends in per-capita GDP (PPP, 2002 USD), West Germany vs. the rest of the OECD sample, with reunification marked.]

In Figures 102 and 103 we see their main results. The authors then implement the placebo-based inference to calculate exact p-values and find that the estimated treatment effect from reunification is statistically significant.

Figure 102: Synthetic control graph: West Germany vs. synthetic West Germany. [Trends in per-capita GDP (PPP, 2002 USD), with reunification marked.]

Figure 103: Synthetic control graph: differences between West Germany and synthetic West Germany. [Per-capita GDP gap (PPP, 2002 USD), with reunification marked.]

The placebo-based inference suggests even further robustness checks, though. The authors specifically recommend rewinding time from the date of the treatment itself and estimating their model on an earlier (placebo) date. There should be no effect when they do this; if there is, then it calls into question the research design. The authors do this in Figure 104. Notice that when they run their model on the placebo date of 1975, they ultimately find no effect. This suggests that their model has good in- and out-of-sample predictive properties. Hence, since the model does such a good job of predicting GDP per capita, the fact that it fails to anticipate the change in the year of reunification suggests that the model was picking up a causal effect.

Figure 104: Synthetic control graph: placebo date. [Placebo reunification 1975: trends in per-capita GDP, West Germany vs. synthetic West Germany.]

We include this second paper primarily to illustrate that synthetic control methods are increasingly expected to pursue numerous falsification exercises in addition to simply estimating the causal effect itself. In this sense, researchers have pushed others to hold it to the same level of scrutiny and skepticism as they have with other methodologies such as RDD and IV. Authors using synthetic control must do more than merely run the synth command when doing comparative case studies. They must find the exact p-values through placebo-based inference, check for the quality of the pre-treatment fit,

investigate the balance of the covariates used for matching, and check the validity of the model through placebo estimation (e.g., rolling back the treatment date).

Stata exercise: Prison construction and Black male incarceration

The project that you'll be replicating here is a project I have been working on with several coauthors over the last few years. You can find one example of an unpublished manuscript, coauthored with Sam Kang, here. Here's the backdrop.

In 1980, the Texas Department of Corrections lost a major civil action lawsuit. The lawsuit was called Ruiz v. Estelle; Ruiz was the prisoner who brought the case, and Estelle was the warden. The case argued that TDC was engaging in unconstitutional practices related to overcrowding and other prison conditions. Surprisingly, Texas lost the case, and as a result, Texas was forced to enter into a series of settlements. To address the issue of overcrowding, the courts placed constraints on the number of inmates that could be housed in cells. To ensure compliance, TDC was put under court supervision until 2003.

Given these constraints, the construction of new prisons was the only way that Texas could adequately meet demand without letting prisoners go, and since the building of new prisons was erratic, the only other option was increasing the state's parole rate. That is precisely what happened: following Ruiz v. Estelle, Texas used paroles more intensively to handle the increased arrest and imprisonment flows, since it did not have the operational capacity to handle that flow otherwise.

But then the state began building prisons, starting somewhat in the late 1980s under Governor Bill Clements. The prison construction under Clements, however, was relatively modest. Not so in 1993, when Governor Ann Richards embarked on a major prison construction drive.
Under Richards, state legislators approved a billion-dollar prison construction project which doubled the state's operational capacity within 3 years. This can be seen in Figure 105. As can be seen, the Clements buildout was relatively modest, both as a percentage change and in levels. But Richards' investments in operational capacity were gigantic: the number of beds grew over 30% for three years, causing the number of beds to more than double in a short period of time.

What was the effect of building so many prisons? Just because prison capacity expands doesn't mean incarceration rates will grow. But because the state was intensively using paroles to handle the flow, that's precisely what did happen. Because our analysis in a moment will focus on African-American male imprisonment, I will show the

[Figure 105: Prison capacity (operational capacity) expansion: Texas prison growth in levels and percent changes, 1982-2004.]

effect of the prison boom on African-American male incarceration. As you can see from Figure 106, the Black male incarceration rate went from 150 to 350 in only two years. Texas basically went from being a typical, modal state when it came to incarceration rates to one of the most severe in only a few short periods of time.

What we will now do is analyze the effect that the prison construction under Richards had on Black male incarceration rates using synthetic control. The do file for this can be downloaded directly from my website, and I would probably recommend downloading it now instead of using the code I'm going to post here. But let's start now. You'll first want to look at the readme document to learn how to organize a set of subdirectories, as I use subdirectories extensively in this do file. That readme can be found there as well. The subdirectories you'll need are the following:

• Do
• Data
  – synth
• Inference
• Figures

And I recommend having a designated main directory for all this, perhaps /Texas. In other words, the Do directory would be located in

/Texas/Do.

[Figure 106: African-American male incarceration rates: Black male incarceration per 100,000, Texas vs. USA (excluding TX). The prison expansion starts in 1993.]

Now let's begin. The first step is to create the figure showing the effect of the 1993 prison construction on Black male incarceration rates. I've chosen a set of covariates and pre-treatment outcome variables for the matching; I encourage you, though, to play around with different models. We can already see, though, from Figure 106 that prior to 1993, Texas Black male incarceration rates were pretty similar to the rest of the country. What this is going to mean for our analysis is that we have every reason to believe that the convex hull likely exists in

this application.

cd "/users/scott.cunningham/downloads/texas/do"
* Estimation 1: Texas model of black male prisoners (per capita)
scuse texas.dta, replace
ssc install synth
#delimit;
synth bmprison
  bmprison(1990) bmprison(1992) bmprison(1991) bmprison(1988)
  alcohol(1990) aidscapita(1990) aidscapita(1991)
  income ur poverty black(1990) black(1991) black(1992)
  perc1519(1990),
  trunit(48) trperiod(1993) unitnames(state)
  mspeperiod(1985(1)1993) resultsperiod(1985(1)2000)
  keep(../data/synth/synth_bmprate.dta) replace fig;
mat list e(V_matrix);
#delimit cr
graph save Graph ../Figures/synth_tx.gph, replace

Note that on the first line you will need to change the path directory, but otherwise it should run, because I'm using standard Unix/DOS notation that allows you to back up and redirect to a different subdirectory using the "../" command. Now in this example there's a lot of syntax, so let me walk you through it. First, you need to install the data from my website using scuse. Second, I personally prefer to make the delimiter a semicolon because I want to have all the synth syntax on the same screen; I'm more of a visual person, so that helps me. Next is the synth syntax. The syntax goes like this: call synth, then the outcome variable (bmprison), then the variables you want to match on. Notice that you can choose either to match on the entire pre-treatment average or on particular years; I choose both. Also recall that Abadie et al. [2010] note the importance of controlling for pre-treatment outcomes to soak up the heterogeneity; I do that here as well. Once you've listed your covariates, you use a comma to move to Stata options. You first have to specify the treatment unit. The FIPS code for Texas is 48, hence trunit(48). You then specify the treatment period, which is 1993.
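To fix ideas about what this command is doing: synth searches for nonnegative donor weights that sum to one and minimize the pre-treatment mean squared prediction error. The Python sketch below illustrates that objective on made-up numbers, with just two hypothetical donor states and a coarse grid search; it is only an illustration of the idea, not the actual nested optimization (which also chooses covariate weights) that the Stata command performs.

```python
# Toy sketch of the synthetic control weight choice: pick nonnegative
# donor weights summing to one that minimize the pre-treatment MSPE.
# All series below are hypothetical, not the Texas data.

texas_pre = [100.0, 110.0, 120.0]    # treated unit, pre-treatment years
donor_a_pre = [90.0, 100.0, 110.0]   # hypothetical donor state A
donor_b_pre = [120.0, 140.0, 160.0]  # hypothetical donor state B

def pre_mspe(w):
    """Pre-treatment MSPE for weights (w, 1-w) on donors A and B."""
    synth = [w * a + (1 - w) * b for a, b in zip(donor_a_pre, donor_b_pre)]
    return sum((y - s) ** 2 for y, s in zip(texas_pre, synth)) / len(texas_pre)

# Coarse grid search over the one-dimensional simplex.
grid = [i / 1000 for i in range(1001)]
w_star = min(grid, key=pre_mspe)
print(w_star, pre_mspe(w_star))
```

With more donors the simplex is higher-dimensional and a proper constrained optimizer replaces the grid, but the objective being minimized is the same.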
You list the period of time which will be used to minimize the mean squared prediction error, as well as the years to display. Stata will produce both a figure and a dataset with the information used to create the figure. It will also list the V matrix. Finally, I change the delimiter back to carriage return and save the

figure in the /Figures subdirectory. Let's look at what these lines made (Figure 107).

[Figure 107: African-American male incarceration, Texas vs. synthetic Texas.]

This is the kind of outcome that you ideally want to see: a very similar pre-treatment trend in the synthetic Texas group compared to the actual Texas group, and a divergence in the post-treatment period. We will now plot the gap between these two lines using the following commands:

* Plot the gap in predicted error
use ../data/synth/synth_bmprate.dta, clear
keep _Y_treated _Y_synthetic _time
drop if _time==.
rename _time year
rename _Y_treated treat
rename _Y_synthetic counterfact
gen gap48=treat-counterfact
sort year
#delimit ;
twoway (line gap48 year, lp(solid) lw(vthin) lcolor(black)),
  yline(0, lpattern(shortdash) lcolor(black))
  xline(1993, lpattern(shortdash) lcolor(black))
  xtitle("", si(medsmall)) xlabel(#10)
  ytitle("Gap in black male prisoner prediction error", size(medsmall))
  legend(off);
#delimit cr
save ../data/synth/synth_bmprate_48.dta, replace
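The gap variable computed by that Stata code is just the treated series minus the weighted average of the donor series. A quick sanity-check sketch in Python, with hypothetical weights and outcome values (not the estimated Texas weights):

```python
# Synthetic outcome = weighted average of donor outcomes; gap = treated - synthetic.
# Weights and series below are hypothetical stand-ins for illustration only.

weights = {"A": 0.4, "B": 0.1, "C": 0.5}   # nonnegative, sum to one
donors = {
    "A": [100.0, 110.0],
    "B": [200.0, 210.0],
    "C": [150.0, 160.0],
}
treated = [145.0, 175.0]                    # e.g. a pre year and a post year

synthetic = [sum(weights[s] * donors[s][t] for s in weights)
             for t in range(len(treated))]
gap = [y - s for y, s in zip(treated, synthetic)]
print(synthetic, gap)
```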

The figure that this makes is basically nothing more than the gap between the actual Texas and the synthetic Texas from Figure 107.

[Figure 108: Gap between actual Texas and synthetic Texas.]

And finally, we will show the weights used to construct the synthetic Texas.

Table 36: Synthetic control weights

State name    Weight
California    0.408
Florida       0.109
Illinois      0.122
Louisiana     0.360

Now that we have our estimates of the causal effect, we move into the calculation of the exact p-value, which will be based on assigning the treatment to every state and re-estimating our model. Texas will

always be thrown back into the donor pool each time.

* Inference 1: placebo test
#delimit;
set more off;
use ../data/texas.dta, replace;
local statelist 1 2 4 5 6 8 9 10 11 12 13 15 16 17 18 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 45 46 47 48 49 51 53 55;
foreach i of local statelist {;
  synth bmprison
    bmprison(1990) bmprison(1992) bmprison(1991) bmprison(1988)
    alcohol(1990) aidscapita(1990) aidscapita(1991)
    income ur poverty black(1990) black(1991) black(1992)
    perc1519(1990),
    trunit(`i') trperiod(1993) unitnames(state)
    mspeperiod(1985(1)1993) resultsperiod(1985(1)2000)
    keep(../data/synth/synth_bmprate_`i'.dta) replace;
  matrix state`i' = e(RMSPE);  /* check the V matrix */
};
foreach i of local statelist {;
  matrix rownames state`i'=`i';
  matlist state`i', names(rows);
};
#delimit cr

This is a loop that will cycle through every state and estimate the model. It will then save the data associated with each model into the ../data/synth/synth_bmprate_`i'.dta data file, where `i' is one of the state FIPS codes listed after local statelist. Now that we have each

of these files, we can calculate the post-to-pre RMSPE ratio.

local statelist 1 2 4 5 6 8 9 10 11 12 13 15 16 17 18 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 45 46 47 48 49 51 53 55
foreach i of local statelist {
  use ../data/synth/synth_bmprate_`i', clear
  keep _Y_treated _Y_synthetic _time
  drop if _time==.
  rename _time year
  rename _Y_treated treat`i'
  rename _Y_synthetic counterfact`i'
  gen gap`i'=treat`i'-counterfact`i'
  sort year
  save ../data/synth/synth_gap_bmprate`i', replace
}
use ../data/synth/synth_gap_bmprate48.dta, clear
sort year
save ../data/synth/placebo_bmprate48.dta, replace
foreach i of local statelist {
  merge year using ../data/synth/synth_gap_bmprate`i'
  drop _merge
  sort year
  save ../data/synth/placebo_bmprate.dta, replace
}

Notice that this is going to first create the gap between the treatment state and the counterfactual state before merging each of them into

a single data file.

** Inference 2: Estimate the pre- and post-RMSPE and calculate the ratio of the post-to-pre RMSPE
set more off
local statelist 1 2 4 5 6 8 9 10 11 12 13 15 16 17 18 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 45 46 47 48 49 51 53 55
foreach i of local statelist {
  use ../data/synth/synth_gap_bmprate`i', clear
  gen gap3=gap`i'*gap`i'
  egen postmean=mean(gap3) if year>1993
  egen premean=mean(gap3) if year<=1993
  gen rmspe=sqrt(premean) if year<=1993
  replace rmspe=sqrt(postmean) if year>1993
  gen ratio=rmspe/rmspe[_n-1] if year==1994
  gen rmspe_post=sqrt(postmean) if year>1993
  gen rmspe_pre=rmspe[_n-1] if year==1994
  mkmat rmspe_pre rmspe_post ratio if year==1994, matrix(state`i')
}

In this part, we are calculating the post-RMSPE, the pre-RMSPE, and the ratio of the two. Once we have this information, we can compute a histogram. The following commands do that.
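The arithmetic here, together with the ranking-based p-value that the next set of Stata commands produces, can be sketched in a few lines of Python. The gap series below are made up for illustration: square the gaps, average within the pre- and post-periods, take square roots, divide, then rank the treated unit's ratio among all units and divide its rank by the number of units.

```python
import math

def rmspe_ratio(gaps, years, treat_year):
    """Post-treatment RMSPE divided by pre-treatment RMSPE."""
    pre = [g ** 2 for g, y in zip(gaps, years) if y <= treat_year]
    post = [g ** 2 for g, y in zip(gaps, years) if y > treat_year]
    return math.sqrt(sum(post) / len(post)) / math.sqrt(sum(pre) / len(pre))

years = [1991, 1992, 1993, 1994, 1995]

# Hypothetical gap series: "tx" mimics a treated unit whose gap explodes
# after 1993; "p1" and "p2" are placebo runs.
gaps = {
    "tx": [1.0, -1.0, 1.0, 10.0, 10.0],
    "p1": [2.0, -2.0, 2.0, 2.0, -2.0],
    "p2": [1.0, 1.0, -1.0, 3.0, 3.0],
}

ratios = {s: rmspe_ratio(g, years, 1993) for s, g in gaps.items()}

# Exact p-value: rank of the treated ratio (1 = largest) over the unit count.
rank = sorted(ratios.values(), reverse=True).index(ratios["tx"]) + 1
p_value = rank / len(ratios)
print(ratios["tx"], rank, p_value)
```

With 46 units, a treated unit ranked first would get an exact p-value of 1/46, which is the logic behind the rank/46 calculation in the Stata code.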

* Show the post/pre-expansion RMSPE ratio for all states, generate histogram
foreach i of local statelist {
  matrix rownames state`i'=`i'
  matlist state`i', names(rows)
}
#delimit ;
mat state=state1\state2\state4\state5\state6\state8\state9\state10\state11\state12\state13\state15
  \state16\state17\state18\state20\state21\state22\state23\state24\state25\state26
  \state27\state28\state29\state30\state31\state32\state33\state34\state35\state36
  \state37\state38\state39\state40\state41\state42\state45\state46\state47\state48
  \state49\state51\state53\state55;
#delimit cr
ssc install mat2txt
mat2txt, matrix(state) saving(../inference/rmspe_bmprate.txt) replace
insheet using ../inference/rmspe_bmprate.txt, clear
ren v1 state
drop v5
gsort -ratio
gen rank=_n
gen p=rank/46
export excel using ../inference/rmspe_bmprate, firstrow(variables) replace
import excel ../inference/rmspe_bmprate.xls, sheet("Sheet1") firstrow clear
histogram ratio, bin(20) frequency fcolor(gs13) lcolor(black) ylabel(0(2)6) xtitle(Post/pre RMSPE ratio) xlabel(0(1)5)
list rank p if state==48

All the looping will take a few moments to run, but once it is done, it will produce a histogram of the distribution of ratios of post-RMSPE to pre-RMSPE. As you can see from the p-value, Texas has the second-highest ratio out of 46 state units, giving it a p-value of 0.04. We can see that in Figure 110. Notice that in addition to the figure, this created an Excel spreadsheet containing information on the pre-RMSPE, the post-RMSPE, the ratio, and the rank. We will want to use that again when we limit our display to states whose pre-RMSPE is similar to that of Texas.

[Figure 109: Histogram of the distribution of ratios of post-RMSPE to pre-RMSPE. Texas is one of the ones in the far right tail.]

Now we want to create the characteristic placebo graph, where all the state placebos are laid on top of Texas. To do that we use the

[Figure 110: Histogram of the distribution of ratios of post-RMSPE to pre-RMSPE. Texas is one of the ones in the far right tail.]

following syntax:

* Inference 3: all the placeboes on the same picture
use ../data/synth/placebo_bmprate.dta, replace

* Picture of the full sample, including outlier RMSPE
#delimit;
twoway
(line gap1 year, lp(solid) lw(vthin))
(line gap2 year, lp(solid) lw(vthin))
(line gap4 year, lp(solid) lw(vthin))
(line gap5 year, lp(solid) lw(vthin))
(line gap6 year, lp(solid) lw(vthin))
(line gap8 year, lp(solid) lw(vthin))
(line gap9 year, lp(solid) lw(vthin))
(line gap10 year, lp(solid) lw(vthin))
(line gap11 year, lp(solid) lw(vthin))
(line gap12 year, lp(solid) lw(vthin))
(line gap13 year, lp(solid) lw(vthin))
(line gap15 year, lp(solid) lw(vthin))
(line gap16 year, lp(solid) lw(vthin))
(line gap17 year, lp(solid) lw(vthin))
(line gap18 year, lp(solid) lw(vthin))
(line gap20 year, lp(solid) lw(vthin))
(line gap21 year, lp(solid) lw(vthin))
(line gap22 year, lp(solid) lw(vthin))
(line gap23 year, lp(solid) lw(vthin))
(line gap24 year, lp(solid) lw(vthin))

(line gap25 year, lp(solid) lw(vthin))
(line gap26 year, lp(solid) lw(vthin))
(line gap27 year, lp(solid) lw(vthin))
(line gap28 year, lp(solid) lw(vthin))
(line gap29 year, lp(solid) lw(vthin))
(line gap30 year, lp(solid) lw(vthin))
(line gap31 year, lp(solid) lw(vthin))
(line gap32 year, lp(solid) lw(vthin))
(line gap33 year, lp(solid) lw(vthin))
(line gap34 year, lp(solid) lw(vthin))
(line gap35 year, lp(solid) lw(vthin))
(line gap36 year, lp(solid) lw(vthin))
(line gap37 year, lp(solid) lw(vthin))
(line gap38 year, lp(solid) lw(vthin))
(line gap39 year, lp(solid) lw(vthin))
(line gap40 year, lp(solid) lw(vthin))
(line gap41 year, lp(solid) lw(vthin))
(line gap42 year, lp(solid) lw(vthin))
(line gap45 year, lp(solid) lw(vthin))
(line gap46 year, lp(solid) lw(vthin))
(line gap47 year, lp(solid) lw(vthin))
(line gap49 year, lp(solid) lw(vthin))
(line gap51 year, lp(solid) lw(vthin))
(line gap53 year, lp(solid) lw(vthin))
(line gap55 year, lp(solid) lw(vthin))
(line gap48 year, lp(solid) lw(thick) lcolor(black)),  /* treatment unit, Texas */
yline(0, lpattern(shortdash) lcolor(black))
xline(1993, lpattern(shortdash) lcolor(black))
xtitle("", si(small)) xlabel(#10)
ytitle("Gap in black male prisoners prediction error", size(small))
legend(off);
#delimit cr

Here we will only display the main picture with the placebos, though one could show several cuts of the data in which you drop states whose pre-treatment fit compared to Texas is rather poor.

Now that you have seen how to use this do file to estimate a synthetic control model, you are ready to play around with the data yourself. All of the analysis so far has used black male incarceration (total counts) as the dependent variable, but perhaps the results

[Figure 111: Placebo distribution. Texas is the black line.]

would be different if we used black male incarceration rates. That information is contained in the dataset. I would like for you to do your own analysis using the black male incarceration rate variable as the dependent variable. You will need to find a new model to fit this pattern, as it's unlikely that the one we used for levels will do as good a job describing rates as it did levels. In addition, you should implement the placebo-date falsification exercise that we mentioned from Abadie et al. [2015]. Choose 1989 as your treatment date and 1992 as the end of the sample, and check whether the same model shows the same treatment effect as you found when you used the correct year, 1993, as the treatment date. I encourage you to use these data and this do file to learn the ins and outs of the procedure itself, as well as to think more deeply about what synthetic control is doing and how best to use it in research.

Conclusion

In conclusion, we have seen how to estimate synthetic control models in Stata. This model is currently an active area of research (e.g., Powell [2017]), but this is a good foundation for understanding the model. I hope that you find this useful.


Conclusion

Causal inference is a fun area. It's fun because the Rubin causal model is such a philosophically stimulating and intuitive way to think about causal effects, and because Pearl's directed acyclical graphical models are so helpful for moving between a theoretical model, or an understanding of some phenomenon, and an identification strategy to identify the causal effect you care about. From those DAGs, you will learn whether it's even possible to design such an identification strategy with the dataset you have, and while that can be disappointing, it is nonetheless a disciplined and truthful approach to identification. These DAGs are, in my experience, empowering and extremely useful for the design phase of a project.

The methods I've outlined are merely some of the most common research designs currently employed in applied microeconomics. They are not all the methods, and each method is not exhaustively plumbed either. Version 1.0 omits a lot of things, as I said in the opening chapter, such as machine learning, imperfect controls, matrix completion, and structural estimation. I do not omit these because they are unimportant; I omit them because I am still learning them myself! Version 2.0 will differ from version 1.0 primarily in that it will add some of these additional estimators and strategies. Version 2.0 will also contain more Stata exercises, and most likely I will produce a set of do files for you that will exactly reproduce the examples I go through in the book. It may be helpful for you to have a do file handy, as well as to see the programming on the page. I also would like to have more simulations, as I find that simulations are a great way to communicate the identifying assumptions for some estimator, as well as to explain basic ideas like the variance of some estimator. I hope you find this book valuable.
Please check out the many papers I’ve cited, as well as the textbooks I listed at the beginning, as they are all excellent, and you will learn more from them than you have learned from my introductory book. Good luck in your research.


Bibliography

Alberto Abadie and Javier Gardeazabal. The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113–132, March 2003.

Alberto Abadie and Guido Imbens. Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267, 2006.

Alberto Abadie and Guido Imbens. Bias-corrected matching estimators for average treatment effects. Journal of Business and Economic Statistics, 29:1–11, 2011.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, June 2010.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Comparative politics and the synthetic control method. American Journal of Political Science, 59(2):495–510, October 2015.

Eric Allen, Patricia Dechow, Devin Pope, and George Wu. Reference-dependent preferences: Evidence from marathon runners. Unpublished Manuscript, 2013.

Douglas Almond, Joseph J. Doyle, Amanda Kowalski, and Heidi Williams. Estimating returns to medical care: Evidence from at-risk newborns. The Quarterly Journal of Economics, 125(2):591–634, 2010.

Joshua D. Angrist. Lifetime earnings and the Vietnam era draft lottery: Evidence from Social Security administrative records. American Economic Review, 80(3):313–336, June 1990.

Joshua D. Angrist and Alan B. Krueger. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics, 106(4):979–1014, November 1991.

Joshua D. Angrist and Alan B. Krueger. Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4):69–85, 2001.

Joshua D. Angrist and Victor Lavy. Using Maimonides' rule to estimate the effect of class size on scholastic achievement. Quarterly Journal of Economics, 114(2):533–575, 1999.

Joshua D. Angrist and Jorn-Steffen Pischke. Mostly Harmless Econometrics. Princeton University Press, 1st edition, 2009.

Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 87:328–336, 1996.

Susan Athey and Guido W. Imbens. The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32, Spring 2017.

Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. Matrix completion methods for causal panel data models. Unpublished Manuscript, October 2017.

M. Christopher Auld and Paul Grootendorst. An empirical analysis of milk addiction. Journal of Health Economics, 23(6):1117–1133, November 2004.

David H. Autor. Outsourcing at will: The contribution of unjust dismissal doctrine to the growth of employment outsourcing. Journal of Labor Economics, 21(1):1–42, 2003.

Katherine Baicker, Sarah L. Taubman, Heidi L. Allen, Mira Bernstein, Jonathan Gruber, Joseph Newhouse, Eric Schneider, Bill Wright, Alan Zaslavsky, and Amy Finkelstein. The Oregon experiment: effects of Medicaid on clinical outcomes. New England Journal of Medicine, 368:1713–1722, May 2013.

Burt S. Barnow, Glen G. Cain, and Arthur Goldberger. Selection on observables. Evaluation Studies Review Annual, 5:43–59, 1981.

Gary Becker. Crime and punishment: An economic approach. The Journal of Political Economy, 76:169–217, 1968.
Gary Becker. Human Capital: A Theoretical and Empirical Analysis with Special Reference to Education. University of Chicago Press, 3rd edition, 1994.

Gary S. Becker. The economic way of looking at life. University of Chicago Coase-Sandor Working Paper Series in Law and Economics, 1993.

Gary S. Becker and Kevin M. Murphy. A theory of rational addiction. Journal of Political Economy, 96(4), August 1988.

Gary S. Becker, Michael Grossman, and Kevin M. Murphy. The market for illegal goods: The case of drugs. Journal of Political Economy, 114(1):38–60, 2006.

Marianne Bertrand, Esther Duflo, and Sendhil Mullainathan. How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, 119(1):249–275, February 2004.

Sandra E. Black. Do better schools matter? Parental valuation of elementary education. Quarterly Journal of Economics, 114(2):577–599, 1999.

John Bound, David A. Jaeger, and Regina M. Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430), 1995.

John M. Brooks and Robert L. Ohsfeldt. Squeezing the balloon: Propensity scores and unmeasured covariate balance. Health Services Research, 48(4):1487–1507, August 2013.

Sebastian Calonico, Matias D. Cattaneo, and Rocio Titiunik. Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326, November 2014.

David Card. The impact of the Mariel boatlift on the Miami labor market. Industrial and Labor Relations Review, 43(2):245–257, January 1990.

David Card. Aspects of Labour Economics: Essays in Honour of John Vanderkamp, chapter Using Geographic Variation in College Proximity to Estimate the Return to Schooling. University of Toronto Press, 1995.

David Card and Alan Krueger. Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review, 84:772–793, 1994.

David Card and Giovanni Peri. Immigration economics: A review. Unpublished Manuscript, April 2016.

David Card, Carlos Dobkin, and Nicole Maestas. The impact of nearly universal insurance coverage on health care utilization: Evidence from Medicare. American Economic Review, 98(5):2242–2258, December 2008.

David Card, Carlos Dobkin, and Nicole Maestas. Does Medicare save lives? The Quarterly Journal of Economics, 124(2):597–636, 2009.

David Card, David S. Lee, Zhuan Pei, and Andrea Weber. Inference on causal effects in a generalized regression kink design. Econometrica, 84(6):2453–2483, November 2015.

Christopher Carpenter and Carlos Dobkin. The effect of alcohol consumption on mortality: Regression discontinuity evidence from the minimum drinking age. American Economic Journal: Applied Economics, 1(1):164–182, January 2009.

Scott E. Carrell, Mark Hoekstra, and James E. West. Does drinking impair college performance? Evidence from a regression discontinuity approach. Journal of Public Economics, 95:54–62, 2011.

Eduardo Cavallo, Sebastian Galiani, Ilan Noy, and Juan Pantano. Catastrophic natural disasters and economic growth. Review of Economics and Statistics, 95(5):1549–1561, 2013.

Kerwin Charles and Ming Ching Luoh. Male incarceration, the marriage market and female outcomes. Unpublished Manuscript, 2006.

Cheng Cheng and Mark Hoekstra. Does strengthening self-defense law deter crime or escalate violence? Evidence from expansions to castle doctrine. Journal of Human Resources, 48(3):821–854, 2013.

W. G. Cochran. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24(2):295–313, 1968.

Ethan Cohen-Cole and Jason Fletcher. Detecting implausible social network effects in acne, height, and headaches: Longitudinal analysis. British Medical Journal, 337(a2533), 2008.

Dalton Conley and Jason Fletcher. The Genome Factor: What the Social Genomics Revolution Reveals about Ourselves, Our History, and the Future. Princeton University Press, 2017.

Thomas D. Cook. "Waiting for life to arrive": A history of the regression-discontinuity design in psychology, statistics and economics. Journal of Econometrics, 142:636–654, 2008.

Christopher Cornwell and Scott Cunningham. Mass incarceration's effect on risky sex. Unpublished Manuscript, 2016.

Christopher Cornwell and Peter Rupert. Unobservable individual effects, marriage and the earnings of young men. Economic Inquiry, 35(2):1–8, April 1997.

Christopher Cornwell and William N. Trumbull. Estimating the economic model of crime with panel data. Review of Economics and Statistics, 76(2):360–366, 1994.

Michael Craig. The Professor, the Banker and the Suicide King: Inside the Richest Poker Game of All Time. Grand Central Publishing, 2006.

Richard K. Crump, V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1):187–199, 2009.

Scott Cunningham and Christopher Cornwell. The long-run effect of abortion on sexually transmitted infections. American Law and Economics Review, 15(1):381–407, Spring 2013.

Scott Cunningham and Keith Finlay. Parental substance abuse and foster care: Evidence from two methamphetamine supply shocks. Economic Inquiry, 51(1):764–782, 2012.

Scott Cunningham and Todd D. Kendall. Prostitution 2.0: The changing face of sex work. Journal of Urban Economics, 69:273–287, 2011.

Scott Cunningham and Todd D. Kendall. Prostitution labor supply and education. Review of Economics of the Household, Forthcoming, 2016.

Stacy Berg Dale and Alan B. Krueger. Estimating the payoff to attending a more selective college: An application of selection on observables and unobservables. Quarterly Journal of Economics, 117(4):1491–1527, November 2002.

Angus Deaton and Nancy Cartwright. Understanding and misunderstanding randomized controlled trials. Social Science and Medicine, forthcoming, 2018.

Rajeev H. Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062, December 1999.

Rajeev H. Dehejia and Sadek Wahba. Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics, 84(1):151–161, February 2002.

Carlos Dobkin and Nancy Nicosia. The war on drugs: Methamphetamine, public health and crime. American Economic Review, 99(1):324–349, 2009.

John J. Donohue and Steven D. Levitt. The impact of legalized abortion on crime. The Quarterly Journal of Economics, 116(2): 379–420, May 2001.

Nikolay Doudchenko and Guido Imbens. Balancing, regression, difference-in-differences and synthetic control methods: A synthesis. NBER Working Paper 22791, 2016.

Mirko Draca, Stephen Machin, and Robert Witt. Panic on the streets of London: Police, crime, and the July 2005 terror attacks. American Economic Review, 101(5): 2157–81, August 2011.

Arindrajit Dube, T. William Lester, and Michael Reich. Minimum wage effects across state borders: Estimates using contiguous counties. Review of Economics and Statistics, 92(4): 945–964, November 2010.

William N. Evans and Emily G. Owens. Cops and crime. Journal of Public Economics, 91(1-2): 181–201, February 2007.

R. A. Fisher. The Design of Experiments. Edinburgh: Oliver and Boyd, 1935.

Ronald A. Fisher. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1925.

Christopher L. Foote and Christopher F. Goetz. The impact of legalized abortion on crime: Comment. Quarterly Journal of Economics, 123(1): 407–423, February 2008.

David A. Freedman. Statistical models and shoe leather. Sociological Methodology, 21: 291–313, 1991.

Ragnar Frisch and Frederick V. Waugh. Partial time regressions as compared with individual trends. Econometrica, 1(4): 387–401, 1933.

Carl Friedrich Gauss. Theoria Motus Corporum Coelestium. Perthes et Besser, Hamburg, 1809.

Andrew Gelman and Guido Imbens. Why higher-order polynomials should not be used in regression discontinuity designs. Journal of Business and Economic Statistics, doi: 10.1080/07350015.2017.1366909, 2017.

Andrew Gelman and Guido W. Imbens. Why high-order polynomials should not be used in regression discontinuity design. Unpublished Manuscript, September 2016.

Paul Gertler, Manisha Shah, and Stefano M. Bertozzi. Risky business: The market for unprotected commercial sex. Journal of Political Economy, 113(3): 518–550, 2005.

A. S. Goldberger. Selection bias in evaluating treatment effects: Some formal illustrations. Unpublished Manuscript, Madison, WI, 1972.

Kathryn Graddy. The Fulton Fish Market. Journal of Economic Perspectives, 20(2): 207–220, Spring 2006.

Jonathan Gruber. The incidence of mandated maternity benefits. American Economic Review, 84(3): 622–641, June 1994.

Jonathan Gruber, Phillip B. Levine, and Douglas Staiger. Abortion legalization and child living circumstances: Who is the "marginal child"? The Quarterly Journal of Economics, 114(1): 263–291, February 1999.

Trygve Haavelmo. The statistical implications of a system of simultaneous equations. Econometrica, 11(1): 1–12, 1943.

Jinyong Hahn, Petra Todd, and Wilbert van der Klaauw. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1): 201–209, January 2001.

Daniel S. Hamermesh and Jeff E. Biddle. Beauty and the labor market. American Economic Review, 84(5): 1174–1194, 1994.

Ben Hansen. Punishment and deterrence: Evidence from drunk driving. American Economic Review, 105(4): 1581–1617, 2015.

James Heckman and Rodrigo Pinto. Causal analysis after Haavelmo. Econometric Theory, 31(1): 115–151, February 2015.

James J. Heckman and Edward J. Vytlacil. Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation, volume 6B, chapter 70, pages 4779–4874. Elsevier, 2007.

Wayne H. Holtzman. The unbiased estimate of the population variance and standard deviation. The American Journal of Psychology, 63(4): 615–617, 1950.

Seung-Hyun Hong. Measuring the effect of Napster on recorded music sales: Difference-in-differences estimates under compositional changes. Journal of Applied Econometrics, 28(2): 297–324, March 2013.

Robert Hooke. How to Tell the Liars from the Statisticians. CRC Press, 1983.

David Hume. An Enquiry Concerning Human Understanding: with Hume's Abstract of A Treatise of Human Nature and A Letter from a Gentleman to His Friend in Edinburgh. Hackett Publishing Company, 2nd edition, 1993.

Stefano M. Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1): 1–24, 2012.

Kosuke Imai and In Song Kim. When should we use fixed effects regression models for causal inference with longitudinal data. Unpublished Manuscript, December 2017.

Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social and Biomedical Sciences: An Introduction. Cambridge University Press, 1st edition, 2015.

Guido W. Imbens and Joshua D. Angrist. Identification and estimation of local average treatment effects. Econometrica, 62(2): 467–475, 1994.

Guido Imbens. Better late than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Unpublished Manuscript, April 2009.

Guido Imbens and Karthik Kalyanaraman. Optimal bandwidth choice for the regression discontinuity estimator. The Review of Economic Studies, 79(3): 933–959, July 2011.

Guido W. Imbens and Thomas Lemieux. Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142: 615–635, 2008.

Brian A. Jacob and Lars Lefgren. Remedial education and student achievement: A regression-discontinuity analysis. The Review of Economics and Statistics, 86(1): 226–244, February 2004.

Ted Joyce. Did legalized abortion lower crime? The Journal of Human Resources, 39(1): 1–28, Winter 2004.

Ted Joyce. A simple test of abortion and crime. Review of Economics and Statistics, 91(1): 112–123, 2009.

Chinhui Juhn, Kevin M. Murphy, and Brooks Pierce. Wage inequality and the rise in returns to skill. Journal of Political Economy, 101(3): 410–442, June 1993.

Gary King and Langche Zeng. The dangers of extreme counterfactuals. Political Analysis, 14(2): 131–159, 2006.

Alan Krueger. Experimental estimates of education production functions. Quarterly Journal of Economics, 114(2): 497–532, May 1999.

Robert Lalonde. Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4): 604–620, 1986.

David S. Lee. Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142: 675–697, 2008.

David S. Lee and Thomas Lemieux. Regression discontinuity designs in economics. Journal of Economic Literature, 48: 281–355, June 2010.

David S. Lee, Enrico Moretti, and Matthew J. Butler. Do voters affect or elect policies? Evidence from the U.S. House. Quarterly Journal of Economics, 119(3): 807–859, August 2004.

Phillip B. Levine. Sex and Consequences: Abortion, Public Policy, and the Economics of Fertility. Princeton University Press, 1st edition, 2004.

Phillip B. Levine, Douglas Staiger, Thomas J. Kane, and David J. Zimmerman. Roe v. Wade and American fertility. American Journal of Public Health, 89(2): 199–203, February 1999.

Steven D. Levitt. Understanding why crime fell in the 1990s: Four factors that explain the decline and six that do not. Journal of Economic Perspectives, 18(1): 163–190, Winter 2004.

David Lewis. Causation. The Journal of Philosophy, 70(17): 556–567, October 1973.

John R. Lott and David B. Mustard. Crime, deterrence and the right-to-carry concealed handguns. Journal of Legal Studies, 26: 1–68, 1997.

Michael C. Lovell. Seasonal adjustment of economic time series and multiple regression analysis. Journal of the American Statistical Association, 58(304): 991–1010, 1963.

Michael C. Lovell. A simple proof of the FWL theorem. Journal of Economic Education, 39(1): 88–91, 2008.

Ross L. Matsueda. Handbook of Structural Equation Modeling, chapter "Key Advances in the History of Structural Equation Modeling". Guilford Press, 2012.

Justin McCrary. Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142: 698–714, 2008.

John Stuart Mill. A System of Logic, Ratiocinative and Inductive. FQ Books, July 2010.

Mary S. Morgan. The History of Econometric Ideas. Cambridge University Press, 1991.

Stephen L. Morgan and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, 2nd edition, 2014.

Martin Needleman and Carolyn Needleman. Marx and the problem of causation. Science and Society, 33(3): 322–339, Summer-Fall 1969.

David Neumark, J. M. Ian Salas, and William Wascher. Revisiting the minimum wage-employment debate: Throwing out the baby with the bathwater? Industrial and Labor Relations Review, 67(2.5): 608–648, 2014.

Joseph P. Newhouse. Free for All? Lessons from the RAND Health Insurance Experiment. Harvard University Press, 1993.

Judea Pearl. Causality. Cambridge University Press, 2nd edition, 2009.

Charles Sanders Peirce and Joseph Jastrow. On small differences in sensation. Memoirs of the National Academy of Sciences, 3: 73–83, 1885.

Giovanni Peri and Vasil Yasenov. The labor market effects of a refugee wave: Synthetic control method meets the Mariel boatlift. Journal of Human Resources, doi: 10.3368/jhr.54.2.0217.8561R1, 2018.

Robert Perkinson. Texas Tough: The Rise of America's Prison Empire. Picador, first edition, 2010.

David Powell. Imperfect synthetic controls: Did the Massachusetts health care reform save lives? Unpublished Manuscript, October 2017.

Rainer Maria Rilke. Letters to a Young Poet. Merchant Books, 2012.

Sherwin Rosen. Handbook of Labor Economics, volume 1, chapter The Theory of Equalizing Differences. Amsterdam: North-Holland, 1986.

Paul R. Rosenbaum. Two simple models for observational studies. Design of Observational Studies, pages 65–94, 2010.

Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1): 41–55, April 1983.

Donald Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5): 688–701, 1974.

Donald B. Rubin. Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2: 1–26, 1977.

Donald B. Rubin. Direct and indirect causal effects via potential outcomes. Scandinavian Journal of Statistics, 31: 161–170, 2004.

Donald B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469): 322–331, March 2005.

John Rust. Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica, 55(5): 999–1033, 1987.

Adam Smith. The Wealth of Nations. Bantam Classics, 2003.

Jerzy Splawa-Neyman. On the application of probability theory to agricultural experiments. Essay on principles. Annals of Agricultural Sciences, pages 1–51, 1923.

Douglas Staiger and James H. Stock. Instrumental variables regression with weak instruments. Econometrica, 65(3): 557–586, 1997.

James H. Stock and Francesco Trebbi. Who invented instrumental variable regression? The Journal of Economic Perspectives, 17(3): 177–194, Summer 2003.

James H. Stock and Motohiro Yogo. Testing for weak instruments in linear IV regression. In Donald W. K. Andrews and James H. Stock, editors, Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, 2005.

Jack Stuster and Marcelline Burns. Validation of the standardized field sobriety test battery at BACs below 0.10 percent. Technical report, US Department of Transportation, National Highway Traffic Safety Administration, August 1998.

Donald Thistlethwaite and Donald Campbell. Regression-discontinuity analysis: An alternative to the ex-post facto experiment. Journal of Educational Psychology, 51: 309–317, 1960.

Wilbert van der Klaauw. Estimating the effect of financial aid offers on college enrollment: A regression-discontinuity approach. International Economic Review, 43(4): 1249–1287, November 2002.

Jeffrey Wooldridge. Econometric Analysis of Cross Section and Panel Data. MIT Press, 2nd edition, 2010.

Jeffrey Wooldridge. Introductory Econometrics: A Modern Approach. South-Western College Pub, 6th edition, 2015.

Phillip G. Wright. The Tariff on Animal and Vegetable Oils. The Macmillan Company, 1928.

G. Udny Yule. An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades. Journal of the Royal Statistical Society, 62: 249–295, 1899.
