JUDEA PEARL

BAYESIANISM AND CAUSALITY, OR, WHY I AM ONLY A HALF-BAYESIAN

1 INTRODUCTION

I turned Bayesian in 1971, as soon as I began reading Savage's monograph Foundations of Statistical Inference [Savage, 1962]. The arguments were unassailable: (i) It is plain silly to ignore what we know, (ii) It is natural and useful to cast what we know in the language of probabilities, and (iii) If our subjective probabilities are erroneous, their impact will get washed out in due time, as the number of observations increases.

Thirty years later, I am still a devout Bayesian in the sense of (i), but I now doubt the wisdom of (ii) and I know that, in general, (iii) is false. Like most Bayesians, I believe that the knowledge we carry in our skulls, be its origin experience, schooling or hearsay, is an invaluable resource in all human activity, and that combining this knowledge with empirical data is the key to scientific enquiry and intelligent behavior. Thus, in this broad sense, I am still a Bayesian. However, in order to be combined with data, our knowledge must first be cast in some formal language, and what I have come to realize in the past ten years is that the language of probability is not suitable for the task; the bulk of human knowledge is organized around causal, not probabilistic, relationships, and the grammar of probability calculus is insufficient for capturing those relationships. Specifically, the building blocks of our scientific and everyday knowledge are elementary facts such as "mud does not cause rain" and "symptoms do not cause disease", and those facts, strangely enough, cannot be expressed in the vocabulary of probability calculus. It is for this reason that I consider myself only a half-Bayesian.
In the rest of the paper, I plan to review the dichotomy between causal and statistical knowledge, to show the limitation of probability calculus in handling the former, to explain the impact that this limitation has had on various scientific disciplines and, finally, to express my vision for future development in Bayesian philosophy: the enrichment of personal probabilities with causal vocabulary and causal calculus, so as to bring mathematical analysis closer to where knowledge resides.

2 STATISTICS AND CAUSALITY: A BRIEF SUMMARY

The aim of standard statistical analysis, typified by regression and other estimation techniques, is to infer parameters of a distribution from samples drawn from that population. With the help of such parameters, one can infer associations among variables, estimate the likelihood of past and future events, as well as update the
likelihood of events in light of new evidence or new measurements. These tasks are managed well by statistical analysis so long as experimental conditions remain the same. Causal analysis goes one step further; its aim is to infer aspects of the data generation process. With the help of such aspects, one can deduce not only the likelihood of events under static conditions, but also the dynamics of events under changing conditions. This capability includes predicting the effect of actions (e.g., treatments or policy decisions), identifying causes of reported events, and assessing responsibility and attribution (e.g., whether event x was necessary (or sufficient) for the occurrence of event y).

Almost by definition, causal and statistical concepts do not mix. Statistics deals with behavior under uncertain, yet static conditions, while causal analysis deals with changing conditions. There is nothing in the joint distribution of symptoms and diseases to tell us that curing the former would not cure the latter. In general, there is nothing in a distribution function that would tell us how that distribution would differ if external conditions were to change—say from observational to experimental setup—every conceivable difference in the distribution would be perfectly compatible with the laws of probability theory, no matter how slight the change in conditions.¹

Drawing an analogy to visual perception, the information contained in a probability function is analogous to a precise description of a three-dimensional object; it is sufficient for predicting how that object will be viewed from any angle outside the object, but it is insufficient for predicting how the object will be viewed if manipulated and squeezed by external forces.
The additional properties needed for making such predictions (e.g., the object's resilience or elasticity) are analogous to the information that causal models provide using the vocabulary of directed graphs and/or structural equations. The role of this information is to identify those aspects of the world that remain invariant when external conditions change, say due to an action.

These considerations imply that the slogan "correlation does not imply causation" can be translated into a useful principle: one cannot substantiate causal claims from associations alone, even at the population level—behind every causal conclusion there must lie some causal assumption that is not testable in observational studies. Nancy Cartwright expressed this principle as "no causes in, no causes out", meaning we cannot convert statistical knowledge into causal knowledge.

The demarcation line between causal and statistical concepts is thus clear and crisp. A statistical concept is any concept that can be defined in terms of a distribution (be it personal or frequency-based) of observed variables, and a causal concept is any concept concerning changes in variables that cannot be defined from the distribution alone. Examples of statistical concepts are: correlation, regression, dependence, conditional independence, association, likelihood, collapsibility, risk ratio, odds ratio, and so on.² Examples of causal concepts are: randomization, influence, effect, confounding, disturbance, spurious correlation, instrumental variables, intervention, explanation, attribution, and so on. The purpose of this demarcation line is not to exclude causal concepts from the province of statistical analysis but, rather, to make it easy for investigators and philosophers to trace the assumptions that are needed for substantiating various types of scientific claims.

Every claim invoking causal concepts must be traced to some premises that invoke such concepts; it cannot be derived or inferred from statistical claims alone. This principle may sound obvious, almost tautological, yet it has some far-reaching consequences. It implies, for example, that any systematic approach to causal analysis must acquire new mathematical notation for expressing causal assumptions and causal claims. The vocabulary of probability calculus, with its powerful operators of conditionalization and marginalization, is simply insufficient for expressing causal information. To illustrate, the syntax of probability calculus does not permit us to express the simple fact that "symptoms do not cause diseases", let alone draw mathematical conclusions from such facts. All we can say is that two events are dependent—meaning that if we find one, we can expect to encounter the other—but we cannot distinguish statistical dependence, quantified by the conditional probability P(disease | symptom), from causal dependence, for which we have no expression in standard probability calculus.³

¹ Even the theory of stochastic processes, which provides probabilistic characterization of certain dynamic phenomena, assumes a fixed density function over time-indexed variables. There is nothing in such a function to tell us how it would be altered if external conditions were to change. If a parametric family of distributions is used, we can represent some changes by selecting a different set of parameters. But we are still unable to represent changes that do not correspond to parameter selection; for example, restricting a variable to a certain value, or forcing one variable to equal another.
Scientists seeking to express causal relationships must therefore supplement the language of probability with a vocabulary for causality, one in which the symbolic representation for the relation "symptoms cause disease" is distinct from the symbolic representation of "symptoms are associated with disease." Only after achieving such a distinction can we label the former sentence "false," and the latter "true."

The preceding two requirements: (1) to commence causal analysis with untested,⁴ judgmentally based assumptions, and (2) to extend the syntax of probability calculus, constitute, in my experience, the two main obstacles to the acceptance of causal analysis among statisticians, philosophers and professionals with traditional training in statistics. We shall now explore in more detail the nature of these two barriers, and why they have been so tough to cross.

² The terms 'risk ratio' and 'risk factor' have been used ambivalently in the literature; some authors insist on a risk factor having causal influence on the outcome, and some embrace factors that are merely associated with the outcome.

³ Attempts to define causal dependence by conditioning on the entire past (e.g., Suppes, 1970) violate the statistical requirement of limiting the analysis to "observed variables", and encounter other insurmountable difficulties (see Eells, Pearl [2000a], pp. 249-257).

⁴ By "untested" I mean untested using frequency data in nonexperimental studies.
2.1 The Barrier of Untested Assumptions

All statistical studies are based on some untested assumptions. For example, we often assume that variables are multivariate normal, that the density function has certain smoothness properties, or that a certain parameter falls in a given range. The question thus arises why innocent causal assumptions, say, that symptoms do not cause disease or that mud does not cause rain, invite mistrust and resistance among statisticians, especially of the Bayesian school.

There are three fundamental differences between statistical and causal assumptions. First, statistical assumptions, even untested, are testable in principle, given a sufficiently large sample and sufficiently fine measurements. Causal assumptions, in contrast, cannot be verified even in principle, unless one resorts to experimental control. This difference is especially accentuated in Bayesian analysis. Though the priors that Bayesians commonly assign to statistical parameters are untested quantities, the sensitivity to these priors tends to diminish with increasing sample size. In contrast, sensitivity to priors of causal parameters, say those measuring the effect of smoking on lung cancer, remains non-zero regardless of (nonexperimental) sample size.

Second, statistical assumptions can be expressed in the familiar language of probability calculus, and thus assume an aura of scholarship and scientific respectability. Causal assumptions, as we have seen before, are deprived of that honor, and thus become immediately suspect of informal, anecdotal or metaphysical thinking. Again, this difference becomes illuminated among Bayesians, who are accustomed to accepting untested, judgmental assumptions, and should therefore invite causal assumptions with open arms—they don't. A Bayesian is prepared to accept an expert's judgment, however esoteric and untestable, so long as the judgment is wrapped in the safety blanket of a probability expression.
Bayesians turn extremely suspicious when that same judgment is cast in plain English, as in "mud does not cause rain." A typical example can be seen in Lindley and Novick's treatment of Simpson's paradox. Lindley and Novick showed that decisions on whether to use conditional or marginal contingency tables should depend on the story behind the tables, that is, on one's assumption about how the tables were generated. For example, to decide whether a treatment X = x is beneficial (Y = y) in a population, one should compare Σ_z P(y | x, z)P(z) to Σ_z P(y | x₀, z)P(z) if Z stands for the gender of patients. In contrast, if Z stands for a factor that is affected by the treatment (say blood pressure), one should compare the marginal probabilities, P(y | x) vis-à-vis P(y | x₀), and refrain from conditioning on Z (see [Pearl, 2000a; pp. 174-182] for details). Remarkably, instead of attributing this difference to the causal relationships in the story, Lindley and Novick wrote: "We have not chosen to do this; nor to discuss causation, because the concept, although widely used, does not seem to be well-defined" (p. 51). Thus, instead of discussing causation, they attribute the change in strategy to another untestable relationship in the story—exchangeability [DeFinetti, 1974]—which is cognitively formidable yet, at least formally, can be
cast in a probability expression. In Section 4.2, we will return to discuss this trend among Bayesians of equating "definability" with expressibility in probabilistic language.

The third resistance to causal (vis-à-vis statistical) assumptions stems from their intimidating clarity. Assumptions about abstract properties of density functions or about conditional independencies among variables are, cognitively speaking, rather opaque, hence they tend to be forgiven, rather than debated. In contrast, assumptions about how variables cause one another are shockingly transparent, and tend therefore to invite counter-arguments and counter-hypotheses. A co-reviewer on a paper I have read recently offered the following objection to the causal model postulated by the author: "A thoughtful and knowledgeable epidemiologist could write down two or more equally plausible models that leads to different conclusions regarding confounding." Indeed, since the bulk of scientific knowledge is organized in causal schema, scientists are incredibly creative in constructing competing alternatives to any causal hypothesis, however plausible. Statistical hypotheses, in contrast, having been several levels removed from our store of knowledge, are relatively protected from such challenges.

I conclude this subsection with a suggestion that statisticians' suspicion of causal assumptions, vis-à-vis probabilistic assumptions, is unjustified. Considering the organization of scientific knowledge, it makes perfect sense that we permit scientists to articulate what they know in plain causal expressions, and not force them to compromise reliability by converting to the "higher level" language of prior probabilities, conditional independence and other cognitively unfriendly terminology.⁵
2.2 The Barrier of New Notation

If reluctance to making causal assumptions has been a hindrance to causal analysis, finding a mathematical way of expressing such assumptions encountered a formidable mental block. The need to adopt a new notation, foreign to the province of probability theory, has been traumatic to most persons trained in statistics; partly because the adaptation of a new language is difficult in general, and partly because statisticians have been accustomed to assuming that all phenomena, processes, thoughts, and modes of inference can be captured in the powerful language of probability theory.⁶

⁵ Similar observations were expressed by J. Heckman.

⁶ Commenting on my set(x) notation [Pearl, 1995a, b], a leading statistician wrote: "Is this a concept in some new theory of probability or expectation? If so, please provide it. Otherwise, 'metaphysics' may remain the leading explanation." Another statistician, commenting on the do(x) notation used in Causality [Pearl, 2000a], insisted: "...the calculus of probability is the calculus of causality."
Not surprisingly, in the bulk of the statistical literature, causal claims never appear in the mathematics. They surface only in the verbal interpretation that investigators occasionally attach to certain associations, and in the verbal description with which investigators justify assumptions. For example, the assumption that a covariate is not affected by a treatment, a necessary assumption for the control of confounding [Cox, 1958], is expressed in plain English, not in a mathematical equation.

In some applications (e.g., epidemiology), the absence of notational distinction between causal and statistical dependencies seemed unnecessary, because investigators were able to keep such distinctions implicitly in their heads, and managed to confine the mathematics to conventional probability expressions. In others, as in economics and the social sciences, investigators rebelled against this notational tyranny by leaving mainstream statistics and constructing their own mathematical machinery (called Structural Equations Models). Unfortunately, this machinery has remained a mystery to outsiders, and eventually became a mystery to insiders as well.⁷

But such tensions could not remain dormant forever. "Every science is only so far exact as it knows how to express one thing by one sign," wrote Augustus de Morgan in 1858 — the harsh consequences of not having the signs for expressing causality surfaced in the 1980-90's. Problems such as the control of confounding, the estimation of treatment effects, the distinction between direct and indirect effects, the estimation of probability of causation, and the combination of experimental and nonexperimental data became a source of endless disputes among the users of statistics, and statisticians could not come to the rescue. [Pearl, 2000a] describes several such disputes, and why they could not be resolved by conventional statistical methodology.
3 LANGUAGES FOR CAUSAL ANALYSIS

3.1 The language of diagrams and structural equations

How can one express mathematically the common understanding that symptoms do not cause diseases? The earliest attempt to formulate such relationships mathematically was made in the 1920's by the geneticist Sewall Wright. Wright used a combination of equations and graphs to communicate causal relationships. For example, if X stands for a disease variable and Y stands for a certain symptom of the disease, Wright would write a linear equation:

y = ax + u    (1)

supplemented with the diagram X → Y, where x stands for the level (or severity) of the disease, y stands for the level (or severity) of the symptom, and u stands for all factors, other than the disease in question, that could possibly affect Y. (U is called "exogenous", "background", or "disturbance".) The diagram encodes the possible existence of (direct) causal influence of X on Y, and the absence of causal influence of Y on X, while the equation encodes the quantitative relationships among the variables involved, to be determined from the data. The parameter a in the equation is called a "path coefficient" and it quantifies the (direct) causal effect of X on Y; given the numerical value of a, the equation claims that, ceteris paribus, a unit increase in X would result in an a-unit increase of Y. If correlation between X and U is presumed possible, it is customary to add a double arrow between X and Y.

The asymmetry induced by the diagram renders the equality sign in Eq. (1) different from algebraic equality, resembling instead the assignment symbol (:=) in programming languages. Indeed, the distinctive characteristic of structural equations, setting them apart from algebraic equations, is that they stand for a value-assignment process — an autonomous mechanism by which the value of Y (not X) is determined. In this assignment process, Y is committed to track changes in X, while X is not subject to such commitment.⁸

Wright's major contribution to causal analysis, aside from introducing the language of path diagrams, has been the development of graphical rules for writing down (by inspection) the covariance of any pair of observed variables in terms of path coefficients and of covariances among disturbances. Under certain causal assumptions (e.g., Cov(U, X) = 0), the resulting equations may allow one to solve for the path coefficients in terms of observed covariance terms only, and this amounts to inferring the magnitude of (direct) causal effects from observed, nonexperimental associations, assuming of course that one is prepared to defend the causal assumptions encoded in the diagram.

The causal assumptions embodied in the diagram (e.g., the absence of an arrow from Y to X, or Cov(U, X) = 0) are not generally testable from nonexperimental data. However, the fact that each causal assumption in isolation cannot be tested does not mean that the sum total of all causal assumptions in a model does not have testable implications. The chain model Z → X → Y, for example, encodes seven causal assumptions, each corresponding to a missing arrow or a missing double-arrow between a pair of variables. None of those assumptions is testable in isolation, yet the totality of all those assumptions implies that Z is unassociated with Y, conditioned on X. Such testable implications can be read off the diagrams (see [Pearl 2000a, pp. 16–19]), and these constitute the only opening through which the assumptions embodied in structural equation models can be tested in observational studies. Every conceivable statistical test that can be applied to the model is entailed by those implications.

⁷ Most econometric texts in the last decade have refrained from defining what an economic model is, and those that attempted a definition erroneously view structural equations models as compact representations of probability density functions (see [Pearl, 2000a, pp. 135-138]).

⁸ Clearly, if we intervene on X, Y would continue to track changes in X. Not so when we intervene on Y: X will remain unchanged. Such intervention (on Y) would alter the assignment mechanism for Y and, naturally, would cause the equality in Eq. (1) to be violated.
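Wright's identification recipe for the two-variable model of Eq. (1) can be sketched numerically. The code below is my own illustration, not from the paper: it simulates y = ax + u with Cov(U, X) = 0 and recovers the path coefficient as Cov(X, Y)/Var(X); the particular numbers (a = 0.7, standard normal disturbances) are invented for the example.

```python
import random

random.seed(0)

a_true = 0.7          # hypothetical path coefficient, chosen for illustration
n = 100_000

# Generate the model of Eq. (1): y = a*x + u, with u independent of x,
# so that the causal assumption Cov(U, X) = 0 holds by construction.
xs = [random.gauss(0, 1) for _ in range(n)]
us = [random.gauss(0, 1) for _ in range(n)]
ys = [a_true * x + u for x, u in zip(xs, us)]

def cov(ps, qs):
    mp, mq = sum(ps) / len(ps), sum(qs) / len(qs)
    return sum((p - mp) * (q - mq) for p, q in zip(ps, qs)) / len(ps)

# Wright's rule: under Cov(U, X) = 0, the path coefficient is identified
# from the observed covariances as Cov(X, Y) / Var(X).
a_hat = cov(xs, ys) / cov(xs, xs)
print(round(a_hat, 2))   # close to 0.7
```

If instead U were correlated with X (the double-arrow case), the same ratio would no longer equal a, which is exactly why the diagram's assumptions must be defended before the estimate can be read causally.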
3.2 From path-diagrams to do-calculus

Structural equation modeling (SEM) has been the main vehicle for causal analysis in economics, and the behavioral and social sciences [Goldberger 1972; Duncan 1975]. However, the bulk of SEM methodology was developed for linear analysis and, until recently, no comparable methodology has been devised to extend its capabilities to models involving discrete variables, nonlinear dependencies, or situations in which the functional form of the equations is unknown. A central requirement for any such extension is to detach the notion of "effect" from its algebraic representation as a coefficient in an equation, and redefine "effect" as a general capacity to transmit changes among variables. One such extension, based on simulating hypothetical interventions in the model, is presented in Pearl [1995a, 2000a].

The central idea is to exploit the invariant characteristics of structural equations without committing to a specific functional form. For example, the non-parametric interpretation of the chain model Z → X → Y corresponds to a set of three functions, each corresponding to one of the variables:

z = f_Z(w)
x = f_X(z, v)    (2)
y = f_Y(x, u)

together with the assumption that the background variables W, V, U (not shown in the chain) are jointly independent but, otherwise, arbitrarily distributed. Each of these functions represents a causal process (or mechanism) that determines the value of the left variable (output) from the variables on the right (input). The absence of a variable from the right-hand side of an equation encodes the assumption that it has no direct effect on the left variable. For example, the absence of variable Z from the arguments of f_Y indicates that variations in Z will leave Y unchanged, as long as variables X and U remain constant.
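Read as a generative program, the three functions of Eq. (2) can be sampled directly. The particular mechanisms below and the uniform background variables are my own stand-ins, chosen only to make the model concrete; the one structural feature that matters is that f_Y takes x and u, but not z, as arguments.

```python
import random

random.seed(1)

# Hypothetical mechanisms standing in for f_Z, f_X, f_Y of Eq. (2);
# w, v, u are jointly independent background variables.
def f_Z(w):    return w > 0.5
def f_X(z, v): return z and (v > 0.2)
def f_Y(x, u): return x or (u > 0.9)     # note: z is absent from f_Y's arguments

def sample():
    w, v, u = random.random(), random.random(), random.random()
    z = f_Z(w)
    x = f_X(z, v)
    y = f_Y(x, u)
    return z, x, y

samples = [sample() for _ in range(50_000)]

# Because Z is absent from f_Y, Z can influence Y only through X:
# P(y | x, z) should not depend on z once x is held fixed.
def p_y_given(x, z):
    sel = [y for (zz, xx, y) in samples if xx == x and zz == z]
    return sum(sel) / len(sel)

print(round(p_y_given(False, True), 2),
      round(p_y_given(False, False), 2))   # both near 0.1: z has no effect given x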
A system of such functions is said to be structural (or modular) if they are assumed to be autonomous, that is, each function is invariant to possible changes in the form of the other functions [Simon 1953; Koopmans 1953].

This feature of invariance permits us to use structural equations as a basis for modeling actions and counterfactuals. This is done through a mathematical operator called do(x), which simulates physical interventions by deleting certain functions from the model, replacing them by constants, while keeping the rest of the model unchanged. For example, to represent an intervention that sets the value of X to x₀, the model for Eq. (2) would become

z = f_Z(w)
x = x₀    (3)
y = f_Y(x₀, u)

The distribution of Y and Z calculated from this modified model characterizes the effect of the action do(X = x₀) and is denoted as P(y, z | do(x₀)). It is not hard to show that, as expected, the model of Eq. (2) yields P(y | do(x₀)) = P(y | x₀) and P(z | do(x₀)) = P(z), regardless of the functions f_Z, f_X and f_Y. The general rule is simply to remove from the factorized distribution P(x, y, z) = P(z)P(x | z)P(y | x) the factor that corresponds to the manipulated variable (X in our example) and to substitute the new value of that variable (x₀ in our example) into the truncated expression — the resulting expression then gives the post-intervention distribution of the remaining variables [Pearl, 2000a; section 3.2]. Additional features of this transformation are discussed in the Appendix; see [Pearl, 2000a; chapter 7] for full details.

The main task of causal analysis is to infer causal quantities from two sources of information: (i) the assumptions embodied in the model, and (ii) the observed distribution P(x, y, z), or from samples of that distribution. Such analysis requires mathematical means of transforming causal quantities, represented by expressions such as P(y | do(x)), into do-free expressions derivable from P(z, x, y), since only do-free expressions are estimable from non-experimental data. When such a transformation is feasible, we say that the causal quantity is identifiable. A calculus for performing such transformations, called do-calculus, was developed in [Pearl, 1995a]. Remarkably, the rules governing this calculus depend merely on the topology of the diagram; it takes no notice of the functional form of the equations, nor of the distribution of the disturbance terms. This calculus permits the investigator to inspect the causal diagram and

1. Decide whether the assumptions embodied in the model are sufficient to obtain consistent estimates of the target quantity;

2. Derive (if the answer to item 1 is affirmative) a closed-form expression for the target quantity in terms of distributions of observed quantities; and

3. Suggest (if the answer to item 1 is negative) a set of observations and experiments that, if performed, would render a consistent estimate feasible.

4 ON THE DEFINITION OF CAUSALITY

In this section, I return to discuss concerns expressed by some Bayesians that causality is an undefined concept and that, although the do-calculus can be an effective mathematical tool in certain tasks, it does not bring us closer to the deep and ultimate understanding of causality, one that is based solely on classical probability theory.

4.1 Is causality reducible to probabilities?

Unfortunately, aspirations for reducing causality to probability are both untenable and unwarranted. Philosophers gave up such aspirations twenty years ago,
and were forced to admit extra-probabilistic primitives (such as "counterfactuals" or "causal relevance") into the analysis of causation (see Eells and Pearl [2000a, Section 7.5]). The basic reason was alluded to in Section 2: probability theory deals with beliefs about an uncertain, yet static world, while causality deals with changes that occur in the world itself (or in one's theory of such changes). More specifically, causality deals with how probability functions change in response to influences (e.g., new conditions or interventions) that originate from outside the probability space, while probability theory, even when given a fully specified joint density function on all (temporally-indexed) variables in the space, cannot tell us how that function would change under such external influences. Thus, "doing" is not reducible to "seeing", and there is no point trying to fuse the two together.

Many philosophers have aspired to show that the calculus of probabilities, endowed with a time dynamic, would be sufficient for causation [Suppes, 1970]. A well known demonstration of the impossibility of such reduction (following Otte) goes as follows. Consider a switch X that turns on two lights, Y and Z, and assume that, due to differences in location, Z turns on a split second before Y. Consider now a variant of this example where the switch X activates Z, and Z, in turn, activates Y. This case is probabilistically identical to the previous one, because all functional and temporal relationships are identical. Yet few people would perceive the causal relationships to be the same in the two situations; the latter represents a cascaded process, X → Z → Y, while the former represents a branching process, Y ← X → Z. The difference shows, of course, when we consider interventions; intervening on Z would affect Y in the cascaded case, but not in the branching case.
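The two switch stories can be written down as structural models and probed by intervention. This sketch is mine, not Otte's original formalization; the do_z argument is a stand-in for the do operator, overriding the mechanism for Z while leaving the other mechanisms intact.

```python
import random

random.seed(2)

def branching(do_z=None):
    # The switch X turns on both lights: Y <- X -> Z
    x = random.random() < 0.5
    z = x if do_z is None else do_z     # do(Z=z0) replaces the mechanism for Z
    y = x                               # Y still listens to X
    return x, z, y

def cascaded(do_z=None):
    # The switch X activates Z, and Z in turn activates Y: X -> Z -> Y
    x = random.random() < 0.5
    z = x if do_z is None else do_z
    y = z                               # Y listens to Z
    return x, z, y

n = 20_000

# Without intervention the two models are probabilistically identical:
# (x, z, y) is always (T,T,T) or (F,F,F) in both.
obs = [branching() for _ in range(n)] + [cascaded() for _ in range(n)]
assert all(x == z == y for x, z, y in obs)

# Under the intervention do(Z = False) they come apart.
p_y_branch = sum(y for _, _, y in (branching(do_z=False) for _ in range(n))) / n
p_y_cascade = sum(y for _, _, y in (cascaded(do_z=False) for _ in range(n))) / n
print(round(p_y_branch, 2), p_y_cascade)  # branching: Y unaffected (near 0.5); cascaded: 0.0
```

Every observational quantity agrees across the two models, so no amount of data from the undisturbed system can distinguish them; only the interventional query separates X → Z → Y from Y ← X → Z.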
The preceding example illustrates the essential role of mechanisms in defining causation. In the branching case, although all three variables are symmetrically constrained by the functional relationships X = Y, X = Z, Z = Y, these relationships in themselves do not reveal the information that the three equalities are sustained by only two mechanisms, Y = X and Z = X, and that the first equality would still be sustained when the second is violated. A set of mechanisms, each represented by an equation, is not equivalent to the set of algebraic equations that are implied by those mechanisms. Mathematically, the former is defined as n separate sets, each containing one equation, whereas the latter is defined as one set containing n equations. These are two distinct mathematical objects that admit two distinct types of solution-preserving operations. The calculus of causality deals with the dynamics of such modular systems of equations, where the addition and deletion of equations represent interventions (see Appendix).

4.2 Is causality well-defined?

From a mathematical perspective, it is a mistake to say that causality is undefined. The do-calculus, for example, is based on two well-defined mathematical objects: a probability function P and a directed acyclic graph (DAG) D; the
first is standard in statistical analysis while the second is a newcomer that tells us (in a qualitative, yet formal language) which mechanisms would remain invariant to a given intervention. Given these two mathematical objects, the definition of "cause" is clear and crisp; variable X is a probabilistic-cause of variable Y if P(y | do(x)) ≠ P(y) for some values x and y. Since each of P(y | do(x)) and P(y) is well-defined in terms of the pair (P, D), the relation "probabilistic cause" is, likewise, well-defined. Similar definitions can be constructed for other nuances of causal discourse, for example, "causal effect", "direct cause", "indirect cause", "event-to-event cause", "scenario-specific cause", "necessary cause", "sufficient cause", "likely cause" and "actual cause" (see [Pearl, 2000a, pp. 222–3, 286–7, 319]; some of these definitions invoke functional models).

Not all statisticians/philosophers are satisfied with these mathematical definitions. Some suspect definitions that are based on unfamiliar non-algebraic objects (i.e., the DAG) and some mistrust abstract definitions that are based on unverifiable models. Indeed, no mathematical machinery can ever verify whether a given DAG really represents the causal mechanisms that generate the data — such verification is left either to human judgment or to experimental studies that invoke interventions. I submit, however, that neither suspicion nor mistrust are justified in the case at hand; DAGs are no less formal than mathematical equations, and questions of model verification need be kept apart from those of conceptual definition.

Consider, for example, the concept of a distribution mean. Even non-Bayesians perceive this notion to be well-defined, for it can be computed from any given (non-pathological) distribution function, even before ensuring that we can estimate that distribution from the data.
We would certainly not declare the mean "ill-defined" if, for any reason, we find it hard to estimate the distribution from the available data. Quite the contrary; by defining the mean in the abstract, as a functional of any hypothetical distribution, we can often prove that the defining distribution need not be estimated at all, and that the mean can be estimated (consistently) directly from the data. Remarkably, by taking seriously the abstract (and untestable) notion of a distribution, we obtain a license to ignore it. An analogous logic applies to causation. Causal quantities are first defined in the abstract, using the pair (P, D), and this abstract definition then provides a theoretical framework for deciding, given the type of data available, which of the assumptions embodied in the DAG are ignorable, and which are absolutely necessary for establishing the target causal quantity from the data.[9]

The separation between concept definition and model verification is even more pronounced in the Bayesian framework, where purely judgmental concepts, such as the prior distribution of the mean, are perfectly acceptable, as long as they can be assessed reliably from one's experience or knowledge. Dennis Lindley has remarked recently (personal communication) that "causal mechanisms may be easier

[9] I have used a similar logic in defense of counterfactuals [Pearl, 2000a], which Dawid [2000] deemed dangerous on account of being untestable. (See also Dawid, this volume.) Had Bernoulli been constrained by Dawid's precautions, the notion of a "distribution" would have had to wait for another "dangerous" scientist, of Bernoulli's equal, to be created.
to come by than one might initially think". Indeed, from a Bayesian perspective, the newcomer concept of a DAG is not an alien at all; it is at least as legitimate as the probability assessments that a Bayesian decision-maker pronounces in constructing a decision tree. In such construction, the probabilities that are assigned to branches emanating from a decision variable X correspond to assessments of P(y | do(x)), and those assigned to branches emanating from a chance variable X correspond to assessments of P(y | x). If a Bayesian decision-maker is free to assess P(y | x) and P(y | do(x)) in any way, as separate evaluations, the Bayesian should also be permitted to express his/her conception of the mechanisms that entail those evaluations. It is only by envisioning these mechanisms that a decision maker can generate a coherent list of such a vast number of P(y | do(x)) type assessments.[10] The structure of the DAG can certainly be recovered from judgments of the form P(y | do(x)) and, conversely, the DAG combined with a probability function P dictates all judgments of the form P(y | do(x)). Accordingly, the structure of the DAG can be viewed as a qualitative parsimonious scheme of encoding and maintaining coherence among those assessments. And there is no need to translate the DAG into the language of probabilities to render the analysis legitimate. Adding probabilistic veneer to the mechanisms portrayed in the DAG may make the do-calculus appear more traditional, but would not change the fact that the objects of assessment are still causal mechanisms, and that these objects have their own special grammar for generating predictions about the effect of actions.
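The gap between the two kinds of branch assessments can be made concrete with a small sketch. The following Python fragment is my own illustration, not the paper's notation: the two-variable model, its mechanisms, and all probabilities are invented. It computes P(y | x) by conditioning and P(y | do(x)) by mechanism replacement in a toy model where a hidden background variable drives both X and Y:

```python
# Toy model (invented for illustration): a hidden U drives both X and Y.
# Mechanisms: X := U and Y := U, so X has no causal effect on Y.
P_U = {0: 0.5, 1: 0.5}

def f_X(u):
    return u

def f_Y(u, x):
    return u  # Y ignores x entirely

def p_y_given_x(y, x):
    """Observational P(y | x): condition on the event of seeing X = x."""
    num = sum(p for u, p in P_U.items() if f_X(u) == x and f_Y(u, f_X(u)) == y)
    den = sum(p for u, p in P_U.items() if f_X(u) == x)
    return num / den

def p_y_do_x(y, x):
    """Interventional P(y | do(x)): replace X's mechanism by the constant x."""
    return sum(p for u, p in P_U.items() if f_Y(u, x) == y)

print(p_y_given_x(1, 1))  # 1.0 -- perfect association
print(p_y_do_x(1, 1))     # 0.5 -- equals P(y): X is not a probabilistic cause of Y
```

The two quantities are assessed by different routes: conditioning filters the background cases compatible with the observation, while do() overrides a mechanism and leaves the background untouched, which is exactly the distinction between chance-branch and decision-branch assessments.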
In summary, recalling the ultimate Bayesian mission of fusing judgment with data, it is not the language in which we cast judgments that legitimizes the analysis, but whether those judgments can reliably be assessed from our store of knowledge and from the peculiar form in which this knowledge is organized.

If it were not for this concern to maintain reliability (of judgment), one could easily translate the information conveyed in a DAG into purely probabilistic formulae, using hypothetical variables. (Translation rules are provided in [Pearl, 2000a, p. 232].) Indeed, this is how the potential-outcome approach of Neyman [1923] and Rubin [1974] has achieved statistical legitimacy: judgments about causal relationships among observables are expressed as statements about probability functions that involve mixtures of observable and counterfactual variables. The difficulty with this approach, and the main reason for its slow acceptance in statistics, is that judgments about counterfactuals are much harder to assess than judgments about causal mechanisms. For instance, to communicate the simple assumption that symptoms do not cause diseases, we would have to use a rather roundabout expression and say that the probability of the counterfactual event "disease had symptoms been absent" is equal to the probability of "disease had symptoms been present". Judgments of conditional independencies among such counterfactual events are even harder for researchers to comprehend or to evaluate.

[10] Coherence requires, for example, that for any x, y, and z, the inequality P(y | do(x), do(z)) ≥ P(y, x | do(z)) be satisfied. This follows from the property of composition (see Appendix, Eq. (6), or [Pearl, 2000a, p. 229]).
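The roundabout potential-outcome phrasing can be checked mechanically in a toy structural model. In the sketch below (the names, mechanisms, and numbers are all invented for illustration), the disease mechanism simply ignores the symptom variable; the counterfactual equality that encodes "symptoms do not cause disease" then falls out of the definitions:

```python
# Invented toy model: background u determines disease; symptoms follow disease.
P_U = {0: 0.8, 1: 0.2}

def f_disease(u, s):
    return u  # the mechanism ignores s: symptoms do not cause disease

def p_disease_do_symptoms(s):
    """P(disease = 1 | do(symptoms = s)): intervening on symptoms replaces
    their mechanism by the constant s; the disease mechanism is untouched."""
    return sum(p for u, p in P_U.items() if f_disease(u, s) == 1)

# "Probability of disease had symptoms been absent" equals
# "probability of disease had symptoms been present":
print(p_disease_do_symptoms(0) == p_disease_do_symptoms(1))  # True
```

The mechanism-level statement is one line (f_disease ignores s); the potential-outcome statement is the derived equality between two interventional probabilities, which is the asymmetry in assessability that the paragraph describes.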
5 SUMMARY

This paper calls attention to a basic conflict between mission and practice in Bayesian methodology. The mission is to express prior knowledge mathematically and reliably so as to assist the interpretation of data, hence the acquisition of new knowledge. The practice has been to express prior knowledge as prior probabilities, too crude a vocabulary given the grand mission. Considerations of reliability (of judgment) call for enriching the language of probabilities with causal vocabulary and for admitting causal judgments into the Bayesian repertoire. The mathematics for interpreting causal judgments has matured, and tools for using such judgments in the acquisition of new knowledge have been developed. The grounds are now ready for mission-oriented Bayesianism.

APPENDIX

CAUSAL MODELS, ACTIONS AND COUNTERFACTUALS

This appendix presents a brief summary of the structural-equation semantics of causation and counterfactuals as defined in Balke and Pearl [1995], Galles and Pearl [1997, 1998], and Halpern [1998]. For detailed exposition of the structural account and its applications see [Pearl, 2000a].

Causal models are generalizations of the structural equations used in engineering, biology, economics and social science.[11] World knowledge is represented as a modular collection of stable and autonomous relationships called "mechanisms", each represented as a function, and changes due to interventions or unmodelled eventualities are treated as local modifications of these functions. A causal model is a mathematical object that assigns truth values to sentences involving causal relationships, actions, and counterfactuals. We will first define causal models, then discuss how causal sentences are evaluated in such models. We will restrict our discussion to recursive (or feedback-free) models; extensions to non-recursive models can be found in Galles and Pearl [1997, 1998] and Halpern [1998].
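As a preview of the definitions that follow, the modular reading of mechanisms can be sketched in a few lines of Python. The representation, the function names, and the numbers are my own illustration, not the paper's formalism; an intervention is simply the replacement of one mechanism by a constant:

```python
def solve(F, order, u):
    """Solve a recursive model: evaluate each mechanism in causal order."""
    vals = {"u": u}
    for name in order:
        vals[name] = F[name](vals)
    return vals

def do(F, X, x):
    """Submodel F_x: delete the mechanism for X, install the constant X = x."""
    Fx = dict(F)
    Fx[X] = lambda vals: x
    return Fx

# The branching example discussed earlier: two mechanisms, Y := X and Z := X.
F = {"X": lambda v: v["u"],
     "Y": lambda v: v["X"],
     "Z": lambda v: v["X"]}
order = ["X", "Y", "Z"]

s = solve(do(F, "Z", 5), order, u=1)
print(s)  # {'u': 1, 'X': 1, 'Y': 1, 'Z': 5} -- Y = X survives the intervention on Z
```

Deleting the mechanism for Z violates the equality Z = X while leaving Y = X intact, which is precisely the information that the set of algebraic equations alone cannot express.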
DEFINITION 1 (Causal model). A causal model is a triple

M = ⟨U, V, F⟩

where

(i) U is a set of variables, called exogenous. (These variables will represent background conditions, that is, variables whose values are determined outside the model.)

[11] Similar models, called "neuron diagrams" [Lewis, 1986, p. 200; Hall, 1998], are used informally by philosophers to illustrate chains of causal processes.
(ii) V is an ordered set {V_1, V_2, ..., V_n} of variables, called endogenous. (These represent variables that are determined in the model, namely, by variables in U ∪ V.)

(iii) F is a set of functions {f_1, f_2, ..., f_n} where each f_i is a mapping from U × (V_1 × ... × V_{i-1}) to V_i. In other words, each f_i tells us the value of V_i given the values of U and all predecessors of V_i. Symbolically, the set of equations F can be represented by writing[12]

v_i = f_i(pa_i, u_i),   i = 1, ..., n,

where pa_i is any realization of the unique minimal set of variables PA_i in V (connoting parents) sufficient for representing f_i.[13] Likewise, U_i stands for the unique minimal set of variables in U that is sufficient for representing f_i.

Every causal model M can be associated with a directed graph, G(M), in which each node corresponds to a variable in V and the directed edges point from members of PA_i toward V_i (by convention, the exogenous variables are usually not shown explicitly in the graph). We call such a graph the causal graph associated with M. This graph merely identifies the endogenous variables PA_i that have direct influence on each V_i, but it does not specify the functional form of f_i.

For any causal model, we can define an action operator, do(x), which, from a conceptual viewpoint, simulates the effect of an external action that sets the value of X to x and, from a formal viewpoint, transforms the model into a submodel, that is, a causal model containing fewer functions.

DEFINITION 2 (Submodel). Let M be a causal model, X be a set of variables in V, and x be a particular assignment of values to the variables in X. A submodel M_x of M is the causal model

M_x = ⟨U, V, F_x⟩

where

(4)   F_x = {f_i : V_i ∉ X} ∪ {X = x}.

In words, F_x is formed by deleting from F all functions f_i corresponding to members of X and replacing them with the set of constant functions X = x.

If we interpret each function f_i in F as an independent physical mechanism and define the action do(X = x) as the minimal change in M required to make X

[12] We use capital letters (e.g., X, Y) as names of variables and sets of variables, and lower-case letters (e.g., x, y) for specific values (called realizations) of the corresponding variables.

[13] A set of variables X is sufficient for representing a given function y = f(x, z) if f is trivial in Z; that is, if for every x, z, z' we have f(x, z) = f(x, z').
= x hold true under any u, then M_x represents the model that results from such a minimal change, since it differs from M by only those mechanisms that directly determine the variables in X. The transformation from M to M_x modifies the algebraic content of F, which is the reason for the name modifiable structural equations used in [Galles and Pearl, 1998].[14]

DEFINITION 3 (Effect of action). Let M be a causal model, X be a set of variables in V, and x be a particular realization of X. The effect of action do(X = x) on M is given by the submodel M_x.

DEFINITION 4 (Potential response). Let Y be a variable in V, let X be a subset of V, and let u be a particular value of U. The potential response of Y to action do(X = x) in situation u, denoted Y_x(u), is the (unique) solution for Y of the set of equations F_x.

We will confine our attention to actions in the form of do(X = x). Conditional actions, of the form "do(X = x) if Z = z", can be formalized using the replacement of equations by functions of Z, rather than by constants [Pearl, 1994]. We will not consider disjunctive actions, of the form "do(X = x or X = x')", since these complicate the probabilistic treatment of counterfactuals.

DEFINITION 5 (Counterfactual). Let Y be a variable in V, and let X be a subset of V. The counterfactual expression "The value that Y would have obtained, had X been x" is interpreted as denoting the potential response Y_x(u).

Definition 5 thus interprets the counterfactual phrase "had X been x" in terms of a hypothetical external action that modifies the actual course of history and imposes the condition "X = x" with minimal change of mechanisms. This is a crucial step in the semantics of counterfactuals [Balke and Pearl, 1994], as it permits x to differ from the actual value X(u) of X without creating logical contradiction; it also suppresses abductive inferences (or backtracking) from the counterfactual antecedent X = x.[15]

It can be shown [Galles and Pearl, 1997] that the counterfactual relationship just defined, Y_x(u), satisfies the following two properties:

Effectiveness: For any two disjoint sets of variables Y and W, we have

(5)   Y_{yw}(u) = y.

[14] Structural modifications date back to Marschak [1950] and Simon [1953]. An explicit translation of interventions into "wiping out" equations from the model was first proposed by Strotz and Wold [1960] and later used in Fisher [1970], Sobel [1990], Spirtes et al. [1993], and Pearl [1995a]. A similar notion of submodel is introduced in Fine [1985], though not specifically for representing actions and counterfactuals.

[15] Simon and Rescher [1966, p. 339] did not include this step in their account of counterfactuals and noted that backward inferences triggered by the antecedents can lead to ambiguous interpretations.
In words, setting the variables in W to w has no effect on Y, once we set the value of Y to y.

Composition: For any two disjoint sets of variables X and W, and any set of variables Y,

(6)   W_x(u) = w  ⇒  Y_{xw}(u) = Y_x(u).

In words, once we set X to x, setting the variables in W to the same values, w, that they would attain (under x) should have no effect on Y. Furthermore, effectiveness and composition are complete whenever M is recursive (i.e., G(M) is acyclic) [Galles and Pearl, 1998; Halpern, 1998], that is, every property of counterfactuals that follows from the structural model semantics can be derived by repeated application of effectiveness and composition.

A corollary of composition is a property called consistency [Robins, 1987]:

(7)   (X(u) = x)  ⇒  (Y_x(u) = Y(u)).

Consistency states that, if in a certain context u we find variable X at value x, and we intervene and set X to that same value, x, we should not expect any change in the response variable Y. Composition and consistency are used in several derivations of Section 3.

The structural formulation generalizes naturally to probabilistic systems, as is seen below.

DEFINITION 6 (Probabilistic causal model). A probabilistic causal model is a pair

⟨M, P(u)⟩

where M is a causal model and P(u) is a probability function defined over the domain of U.

P(u), together with the fact that each endogenous variable is a function of U, defines a probability distribution over the endogenous variables. That is, for every set of variables Y ⊆ V, we have

(8)   P(Y = y) = Σ_{u: Y(u) = y} P(u).

The probability of counterfactual statements is defined in the same manner, through the function
Y_x(u) induced by the submodel M_x. For example, the causal effect of X on Y is defined as:

(9)   P(Y_x = y) = Σ_{u: Y_x(u) = y} P(u).

Likewise, a probabilistic causal model defines a joint distribution on counterfactual statements, i.e., P(Y_x = y, Z_w = z) is defined for any sets of variables X, Z, W, not necessarily disjoint. In particular, P(Y_x = y, X = x') and P(Y_x = y, Y_{x'} = y') are well defined for x ≠ x', and are given by
(10)   P(Y_x = y, X = x') = Σ_{u: Y_x(u) = y & X(u) = x'} P(u)

and

(11)   P(Y_x = y, Y_{x'} = y') = Σ_{u: Y_x(u) = y & Y_{x'}(u) = y'} P(u).

When x and x' are incompatible, Y_x and Y_{x'} cannot be measured simultaneously, and it may seem meaningless to attribute probability to the joint statement "Y would be y if X = x and Y would be y' if X = x'." Such concerns have been a source of recent objections to treating counterfactuals as jointly distributed random variables [Dawid, 2000]. The definition of Y_x and Y_{x'} in terms of two distinct submodels, driven by a standard probability space over U, demonstrates that joint probabilities of counterfactuals have solid mathematical and conceptual underpinning and, moreover, these probabilities can be encoded rather parsimoniously using P(u) and F.

Computer Science Department, University of California, USA.

BIBLIOGRAPHY

[Balke and Pearl, 1994] A. Balke and J. Pearl. Probabilistic evaluation of counterfactual queries. In Proceedings of the Twelfth National Conference on Artificial Intelligence, volume I, pages 230–237. MIT Press, Menlo Park, CA, 1994.
[Balke and Pearl, 1995] A. Balke and J. Pearl. Counterfactuals and policy analysis in structural models. In P. Besnard and S. Hanks, editors, Uncertainty in Artificial Intelligence 11, pages 11–18. Morgan Kaufmann, San Francisco, 1995.
[Cartwright, 1989] N. Cartwright. Nature's Capacities and Their Measurement. Clarendon Press, Oxford, 1989.
[Cox, 1958] D.R. Cox. The Planning of Experiments. John Wiley and Sons, NY, 1958.
[Dawid, 2000] A.P. Dawid. Causal inference without counterfactuals (with comments and rejoinder). Journal of the American Statistical Association, 95(450):407–448, June 2000.
[DeFinetti, 1974] B. DeFinetti. Theory of Probability: A Critical Introductory Treatment, 2 volumes (translated by A. Machi and A. Smith). Wiley, London, 1974.
[Duncan, 1975] O.D. Duncan. Introduction to Structural Equation Models. Academic Press, New York, 1975.
[Eells, 1991] E. Eells. Probabilistic Causality. Cambridge University Press, Cambridge, MA, 1991.
[Fine, 1985] K. Fine. Reasoning with Arbitrary Objects. B. Blackwell, New York, 1985.
[Fisher, 1970] F.M. Fisher. A correspondence principle for simultaneous equations models. Econometrica, 38(1):73–92, January 1970.
[Galles and Pearl, 1997] D. Galles and J. Pearl. Axioms of causal relevance. Artificial Intelligence, 97(1-2):9–43, 1997.
[Galles and Pearl, 1998] D. Galles and J. Pearl. An axiomatic characterization of causal counterfactuals. Foundation of Science, 3(1):151–182, 1998.
[Goldberger, 1972] A.S. Goldberger. Structural equation models in the social sciences. Econometrica: Journal of the Econometric Society, 40:979–1001, 1972.
[Hall, 1998] N. Hall. Two concepts of causation, 1998. In press.
[Halpern, 1998] J.Y. Halpern. Axiomatizing causal reasoning. In G.F. Cooper and S. Moral, editors, Uncertainty in Artificial Intelligence, pages 202–210. Morgan Kaufmann, San Francisco, CA, 1998.
[Heckman, 2001] J.J. Heckman. Econometrics and empirical economics. Journal of Econometrics, 100(1):1–5, 2001.
[Koopmans, 1953] T.C. Koopmans. Identification problems in econometric model construction. In W.C. Hood and T.C. Koopmans, editors, Studies in Econometric Method, pages 27–48. Wiley, New York, 1953.
[Lewis, 1986] D. Lewis. Philosophical Papers. Oxford University Press, New York, 1986.
[Lindley and Novick, 1981] D.V. Lindley and M.R. Novick. The role of exchangeability in inference. The Annals of Statistics, 9(1):45–58, 1981.
[Marschak, 1950] J. Marschak. Statistical inference in economics. In T. Koopmans, editor, Statistical Inference in Dynamic Economic Models, pages 1–50. Wiley, New York, 1950. Cowles Commission for Research in Economics, Monograph 10.
[Neyman, 1923] J. Neyman. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–480, 1990. [Translation]
[Otte, 1981] R. Otte. A critique of Suppes' theory of probabilistic causality. Synthese, 48:167–189, 1981.
[Pearl, 1994] J. Pearl. A probabilistic calculus of actions. In R. Lopez de Mantaras and D. Poole, editors, Uncertainty in Artificial Intelligence 10, pages 454–462. Morgan Kaufmann, San Mateo, CA, 1994.
[Pearl, 1995a] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–710, December 1995.
[Pearl, 1995b] J. Pearl. Causal inference from indirect experiments. Artificial Intelligence in Medicine, 7(6):561–582, 1995.
[Pearl, 2000a] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.
[Pearl, 2000b] J. Pearl. Comment on A.P. Dawid's, Causal inference without counterfactuals. Journal of the American Statistical Association, 95(450):428–431, June 2000.
[Robins, 1987] J.M. Robins. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. Journal of Chronic Diseases, 40(Suppl 2):139S–161S, 1987.
[Rubin, 1974] D.B. Rubin.
Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701, 1974.
[Savage, 1962] L.J. Savage. The Foundations of Statistical Inference. Methuen and Co. Ltd., London, 1962.
[Simon and Rescher, 1966] H.A. Simon and N. Rescher. Cause and counterfactual. Philosophy of Science, 33:323–340, 1966.
[Simon, 1953] H.A. Simon. Causal ordering and identifiability. In W.C. Hood and T.C. Koopmans, editors, Studies in Econometric Method, pages 49–74. Wiley and Sons, Inc., 1953.
[Sobel, 1990] M.E. Sobel. Effect analysis and causation in linear structural equation models. Psychometrika, 55(3):495–515, 1990.
[Spirtes et al., 1993] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, New York, 1993.
[Strotz and Wold, 1960] R.H. Strotz and H.O.A. Wold. Recursive versus nonrecursive systems: An attempt at synthesis. Econometrica, 28:417–427, 1960.
[Suppes, 1970] P. Suppes. A Probabilistic Theory of Causality. North-Holland Publishing Co., Amsterdam, 1970.
[Wright, 1921] S. Wright. Correlation and causation. Journal of Agricultural Research, 20:557–585, 1921.