Journal of Machine Learning Research 6 (2005) 1961–1998. Submitted 8/04; Revised 3/05; Published 12/05

What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks

Weng-Keen Wong (WONG@EECS.OREGONSTATE.EDU)
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330, USA

Andrew Moore (AWM@CS.CMU.EDU)
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Gregory Cooper (GFC@CBMI.PITT.EDU)
Michael Wagner (MMW@CBMI.PITT.EDU)
Center for Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15213, USA

Editor: Dale Schuurmans

Abstract

Traditional biosurveillance algorithms detect disease outbreaks by looking for peaks in a univariate time series of health-care data. Current health-care surveillance data, however, are no longer simply univariate data streams. Instead, a wealth of spatial, temporal, demographic and symptomatic information is available. We present an early disease outbreak detection algorithm called What's Strange About Recent Events (WSARE), which uses a multivariate approach to improve its timeliness of detection. WSARE employs a rule-based technique that compares recent health-care data against data from a baseline distribution and finds subgroups of the recent data whose proportions have changed the most from the baseline data. In addition, health-care data also pose difficulties for surveillance algorithms because of inherent temporal trends such as seasonal effects and day of week variations. WSARE approaches this problem using a Bayesian network to produce a baseline distribution that accounts for these temporal trends. The algorithm itself incorporates a wide range of ideas, including association rules, Bayesian networks, hypothesis testing and permutation tests, to produce a detection algorithm that is careful to evaluate the significance of the alarms that it raises.
Keywords: anomaly detection, syndromic surveillance, biosurveillance, Bayesian networks, applications

1. Introduction

Detection systems inspect routinely collected data for anomalies and raise an alert upon discovery of any significant deviations from the norm. For example, Fawcett and Provost (1997) detect cellular phone fraud by monitoring changes to a cell phone user's typical calling behavior. In intrusion detection systems, anomalies in system events might indicate a possible breach of security (Warrender et al., 1999). In a similar manner, we would like to tackle the problem of early disease outbreak detection, in which the disease outbreak can be due to either natural causes or a bioterrorist attack.

© 2005 Weng-Keen Wong, Andrew Moore, Gregory Cooper and Michael Wagner.

One of the challenges for early disease outbreak detection is finding readily available data that contains a useful signal (Tsui et al., 2001). Data sources that require definitive diagnosis of the disease, such as lab reports, can often be obtained several days to weeks after the samples are submitted. By that point, the outbreak may have already escalated into a large scale epidemic. Instead of waiting for definitive diagnostic data, we can monitor pre-diagnosis data, such as the symptoms exhibited by patients at an Emergency Department (ED). In doing so, we risk increasing the false positive rate, such as mistakenly attributing an increase in patients exhibiting respiratory problems to an anthrax attack rather than to influenza. Nevertheless, we have a potential gain in timeliness of detection. This type of surveillance of pre-diagnosis data is commonly referred to as syndromic surveillance (Mostashari and Hartman, 2003; Sosin, 2003).

In our syndromic surveillance infrastructure, we have real-time access to a database of emergency department (ED) cases from several hospitals in a city. Each record in this multivariate database contains information about the individual who is admitted to the ED. This information includes fields such as age, gender, symptoms exhibited, home zip code, work zip code, and time of arrival at the ED. In accordance with the HIPAA Privacy Rule (45 CFR Parts 160 through 164, 2003), personal identifying information, such as patient names, addresses, and identification numbers, are removed from the data set used in this research. When a severe epidemic sweeps through a region, there will obviously be extreme perturbations in the number of ED visits.
While these dramatic upswings are easily noticed during the late stages of an epidemic, the challenge is to detect the outbreak during its early stages and mitigate its effects. We would also like to detect outbreaks that are more subtle than a large scale epidemic as early as possible.

Although we have posed our problem in an anomaly detection framework, traditional anomaly detection algorithms are inappropriate for this domain. In the traditional approach, a probabilistic model of the baseline data is built using techniques such as neural nets (Bishop, 1994) or a mixture of naive Bayes submodels (Hamerly and Elkan, 2001). Anomalies are identified as individual data points with a rare attribute or rare combination of attributes. If we apply traditional anomaly detection to our ED data, we would find, for example, a patient that is over a hundred years old living in a sparsely populated region of the city. These isolated outliers in attribute space are not at all indicative of a disease outbreak. Instead of finding such unusual isolated cases, we are interested in finding anomalous patterns, which are specific groups whose profile is anomalous relative to their typical profile. Thus, in our example of using ED records, if there is a dramatic upswing in the number of children from a particular neighborhood appearing in the ED with diarrhea, then an early detection system should raise an alarm.

Another common approach to early outbreak detection is to convert the multivariate ED database into a univariate time series by aggregating daily counts of a certain attribute or combination of attributes. For instance, a simple detector would monitor the daily number of people appearing in the ED. Many different algorithms can then be used to monitor this univariate surveillance data, including methods from Statistical Quality Control (Montgomery, 2001), time series models (Box and Jenkins, 1976), and regression techniques (Serfling, 1963).
This technique works well if we know beforehand which disease to monitor, since we can improve the timeliness of detection by monitoring specific attributes of the disease. For example, if we are vigilant against an anthrax attack, we can concentrate our efforts on ED cases involving respiratory problems. In our situation, we need to perform non-specific disease monitoring because we do not know what disease to expect,

particularly in the case of a bioterrorist attack. Instead of monitoring health-care data for pre-defined patterns, we detect any significant anomalous patterns in the multivariate ED data. Furthermore, by taking a multivariate approach that inspects all available attributes in the data, particularly the temporal, spatial, demographic, and symptomatic attributes, we will show that such an approach can improve on the detection time of a univariate detection algorithm if the outbreak initially manifests itself as a localized cluster in attribute space.

Our approach to early disease outbreak detection uses a rule-based anomaly pattern detector called What's Strange About Recent Events (WSARE) (Wong et al., 2002, 2003). WSARE operates on discrete, multidimensional data sets with a temporal component. This algorithm compares recent data against a baseline distribution with the aim of finding rules that summarize significant patterns of anomalies. Each rule is made up of components of the form X_i = V_i^j, where X_i is the ith attribute and V_i^j is the jth value of that attribute. Multiple components are joined together by a logical AND. For example, a two component rule would be Gender = Male AND Home Location = NW. These rules should not be interpreted as rules from a logic-based system in which the rules have an antecedent and a consequent. Rather, these rules can be thought of as SQL SELECT queries, because they identify a subset of the data having records with attributes that match the components of the rule. WSARE finds these subsets whose proportions have changed the most between recent data and the baseline.

We will present versions 2.0 and 3.0 of the WSARE algorithm. We will also briefly describe WSARE 2.5 in order to illustrate the strengths of WSARE 3.0. These three algorithms only differ in how they create the baseline distribution; all other steps in the WSARE framework remain identical.
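To make the rule semantics described above concrete, a rule such as Gender = Male AND Home Location = NW can be viewed as a conjunctive filter over records, much like a SQL SELECT query. The sketch below uses hypothetical field names and toy records, not the actual ED data:

```python
# A WSARE rule is a conjunction of attribute = value components.
# It selects the subset of records whose fields match every component,
# analogous to a SQL SELECT ... WHERE clause.

def matches(rule, record):
    """Return True if the record satisfies every component of the rule."""
    return all(record.get(attr) == value for attr, value in rule.items())

# Toy records with illustrative fields (not from the real data set).
records = [
    {"Gender": "Male", "Home Location": "NW", "Syndrome": "Respiratory"},
    {"Gender": "Female", "Home Location": "NW", "Syndrome": "GI"},
    {"Gender": "Male", "Home Location": "SE", "Syndrome": "Respiratory"},
]

rule = {"Gender": "Male", "Home Location": "NW"}
subset = [r for r in records if matches(rule, r)]
print(len(subset))  # 1 record matches both components
```

WSARE then compares the proportion of records captured by such a filter in the recent data against the proportion captured in the baseline.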
WSARE 2.0 and 2.5 use raw historical data from selected days as the baseline, while WSARE 3.0 models the baseline distribution using a Bayesian network.

2. What's Strange About Recent Events

[Figure 1: Calendars for November and December 2003, with November 4, 11, 18, and 25 marked as the baseline days for a current day of December 30, 2003.]

Figure 1: The baseline for WSARE 2.0 if the current day is December 30, 2003

The basic question asked by all detection systems is whether anything strange has occurred in recent events. This question requires defining what it means to be recent and what it means to be strange. Our algorithm considers all patient records falling on the current day under evaluation to be recent events. Note that this definition of recent is not restrictive: our approach is fully general, and recent can be defined to include all events within some other time period, such as over the last six hours. In order to define an anomaly, we need to establish the concept of something being normal.

In WSARE version 2.0, baseline behavior is assumed to be captured by raw historical data from the same day of the week in order to avoid environmental effects such as weekend versus weekday differences in the number of ED cases. This baseline period must be chosen from a time period similar to the current day. This can be achieved by being close enough to the current day to capture any seasonal or recent trends. On the other hand, the baseline period must also be sufficiently distant from the current day. This distance is required in case an outbreak happens on the current day but remains undetected. If the baseline period is too close to the current day, the baseline period will quickly incorporate the outbreak cases as time progresses. In the description of WSARE 2.0 below, we assume that baseline behavior is captured by records that are in the set baseline days. Typically, baseline days contains the days that are 35, 42, 49, and 56 days prior to the day under consideration. We would like to emphasize that this baseline period is only used as an example; it can be easily modified to another time period without major changes to our algorithm. In Section 3 we will illustrate how version 3.0 of WSARE automatically generates the baseline using a Bayesian network.
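Because every offset in the baseline set is a multiple of seven, each baseline date falls on the same day of the week as the current day. A minimal sketch of this date arithmetic (the function name is ours, not from the paper):

```python
from datetime import date, timedelta

def baseline_days(current_day, offsets=(56, 49, 42, 35)):
    """Baseline dates used by WSARE 2.0: the days 35, 42, 49, and 56
    days prior to the current day. Since every offset is a multiple
    of 7, each date falls on the same weekday as the current day."""
    return [current_day - timedelta(days=d)
            for d in sorted(offsets, reverse=True)]

current = date(2003, 12, 30)  # a Tuesday, as in the running example
for d in baseline_days(current):
    print(d.isoformat())  # 2003-11-04, 2003-11-11, 2003-11-18, 2003-11-25

# All baseline dates share the current day's weekday.
assert all(d.weekday() == current.weekday() for d in baseline_days(current))
```

Running this for December 30, 2003 reproduces the four November Tuesdays shown in Figure 1.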
We will refer to the number of events that match a certain rule on the current day as C_recent. Similarly, the number of cases matching the same rule from the baseline period will be called C_baseline. As an example, suppose the current day is Tuesday, December 30, 2003. The baseline used for WSARE 2.0 will then be November 4, 11, 18 and 25 of 2003, as seen in Figure 1. These dates are all from Tuesdays in order to avoid day of week variations.

2.1 Overview of WSARE

Parameter Name             Description                                            Default Value
max_rule_components        Maximum number of components to a rule                 2
num_randomizations         Number of iterations of the randomization test        1000
alpha_FDR                  The significance level of the False Discovery Rate     0.05
baseline_days              Days to be used for the baseline                       35, 42, 49, and 56 days
(WSARE 2.0 only)                                                                  prior to current date
environmental_attributes   Attributes that account for temporal trends            Not applicable
(WSARE 2.5 and 3.0)
num_baseline_samples       The number of sampled records from the baseline        10000
(WSARE 3.0 only)           Bayesian network

Table 1: The main parameters in WSARE

[Figure 2: A flowchart of the WSARE algorithms. The three versions differ only in how the baseline is created: WSARE 2.0 creates the baseline from selected days in the historical data; WSARE 2.5 creates the baseline from all historical data that match the environmental attributes; WSARE 3.0 learns a Bayesian network from all historical data and samples the baseline from the learned network. All versions then find the best scoring rule using the baseline and recent data sets, calculate the p-value for the best scoring rule using a randomization test, and report the p-value and rule. When WSARE is run over a history of days, FDR is used to find the significant days.]

Figure 2: A schematic overview of the steps involved in the WSARE algorithms

We will begin this section with an overview of the general WSARE algorithm followed by a more detailed example. Figure 2 gives a pictorial overview of the three WSARE algorithms discussed in this paper. Note that the three algorithms differ only in how they create the baseline, while all of the other steps remain identical. Table 1 describes the main parameters used by the WSARE algorithms.

WSARE first finds the best scoring rule over events occurring on the current day using a greedy search. The limit to the number of components in a rule is set to the parameter max_rule_components, which is typically set to be 2 for computational reasons, although in Section 2.5 we describe a greedy procedure for n component rules. The score of a rule is determined by comparing the events on the current day against events in the past. More specifically, we are comparing if the ratio between certain events on the current day and the total number of events on the current day differs dramatically between the recent period and the past. Following the score calculation, the best rule for that day has its p-value estimated by a randomization test. The p-value for a rule is the likelihood of

finding a rule with as good a score under the hypothesis that the date and the other attributes are independent. The randomization-based p-value takes into account the effect of the multiple testing that occurs during the rule search. The number of iterations of the randomization test is determined by the parameter num_randomizations. If we are running the algorithm on a day-by-day basis, we would end at this step. However, if we are looking at a history of days and we want to control for some level of false discoveries over this group of days, we would need the additional step of using the False Discovery Rate (FDR) method (Benjamini and Hochberg, 1995) to determine which of the p-values are significant. The days with significant p-values are returned as the anomalies.

2.2 One Component Rules

In order to illustrate this algorithm, suppose we have a large database of 1,000,000 ED records over a two-year span. This database contains roughly 1370 records a day. Suppose we treat all records within the last 24 hours as "recent" events. In addition, we can build a baseline data set out of all cases from exactly 35, 42, 49, and 56 days prior to the current day. We then combine the recent and baseline data to form a record subset called DB_i, which will have approximately 5000 records. The algorithm proceeds as follows. For each day i in the surveillance period, retrieve the records belonging to DB_i. We first consider all possible one-component rules. For every possible attribute-value combination, obtain the counts C_recent and C_baseline from the data set DB_i. As an example, suppose the attribute under consideration is Age Decile for the ED case. There are 9 possible values for Age Decile, ranging from 0 to 8. We start with the rule Age Decile = 3 and count the number of cases for the current day i that have Age Decile = 3 and those that have Age Decile ≠ 3.
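The counting step for a one-component rule can be sketched as follows. This is a minimal illustration with toy records, not the authors' implementation:

```python
def contingency(recent, baseline, attr, value):
    """Build the 2x2 contingency table for the rule `attr = value`:
    rows are match / no match, columns are recent / baseline."""
    c_recent = sum(1 for r in recent if r[attr] == value)
    c_baseline = sum(1 for r in baseline if r[attr] == value)
    return [[c_recent, c_baseline],
            [len(recent) - c_recent, len(baseline) - c_baseline]]

# Toy data: 3 recent records and 6 baseline records.
recent = [{"Age Decile": 3}, {"Age Decile": 3}, {"Age Decile": 5}]
baseline = [{"Age Decile": 3}] + [{"Age Decile": d} for d in (0, 1, 2, 5, 5)]

table = contingency(recent, baseline, "Age Decile", 3)
print(table)  # [[2, 1], [1, 5]]
```

Repeating this for every attribute-value pair yields one candidate table per one-component rule.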
The cases from five to eight weeks ago are subsequently examined to obtain the counts for the cases matching the rule and those not matching the rule. The four values form a two-by-two contingency table such as the one shown in Table 2.

2.3 Scoring Each One Component Rule

The next step is to evaluate the "score" of the rule using a hypothesis test in which the null hypothesis is the independence of the row and column attributes of the two-by-two contingency table. In effect, the hypothesis test measures how different the distribution for C_recent is compared to that of C_baseline. This test will generate a p-value that determines the significance of the anomalies found by the rule. We will refer to this p-value as the score in order to distinguish this p-value from the p-value that is obtained later on from the randomization test. We use the Chi Square test for independence of variables whenever the counts in the contingency table do not violate the validity of the Chi Square test. However, since we are searching for anomalies, the counts in the contingency table frequently involve small numbers. In this case, we use Fisher's exact test (Good, 2000) to find the score for each rule, since the Chi Square test is an approximation to Fisher's exact test when counts are large. Running Fisher's exact test on Table 2 yields a score of 0.025939, which indicates that the count C_recent for cases matching the rule Home Location = NW is very different from the count C_baseline. In biosurveillance, we are usually only interested in an increase in the number of certain records. As a result, we commonly use a one-sided Fisher's exact test.
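The one-sided (upper-tail) Fisher's exact test is the tail probability of a hypergeometric distribution, so it can be computed with the standard library alone. The sketch below is our illustration, not the authors' implementation; it is applied to counts in the style of Table 2:

```python
from math import comb

def fisher_exact_one_sided(table):
    """One-sided Fisher's exact test (upper tail) on a 2x2 table
    [[a, b], [c, d]]: the probability of observing a count of a or
    higher in the top-left cell, given the fixed row/column margins."""
    a, b = table[0]
    c, d = table[1]
    row1, col1, n = a + b, a + c, a + b + c + d
    # Hypergeometric upper tail: sum P(X = k) for k = a .. max feasible.
    denom = comb(n, col1)
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1)) / denom

# Counts as in Table 2 (rule Home Location = NW):
# 6 of 46 recent cases vs. 496 of 10000 baseline cases match the rule.
p = fisher_exact_one_sided([[6, 496], [40, 9504]])
print(round(p, 6))
```

The resulting p-value is small, reflecting that the NW proportion in the recent data is well above its baseline proportion.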

                      C_recent    C_baseline
Home Location = NW        6          496
Home Location ≠ NW       40         9504

Table 2: A Sample 2x2 Contingency Table

2.4 Two Component Rules

At this point, the best one component rule for a particular day has been found. We will refer to the best one component rule for day i as BR_i^1. The algorithm then attempts to find the best two component rule for the day by adding on one extra component to BR_i^1 through a greedy search. This extra component is determined by supplementing BR_i^1 with all possible attribute-value pairs, except for the one already present in BR_i^1, and selecting the resulting two component rule with the best score. Scoring is performed in the exact same manner as before, except the counts C_recent and C_baseline are calculated by counting the records that match the two component rule. The best two-component rule for day i is subsequently found and we will refer to it as BR_i^2.

Suppose BR_i^1 has as its first component the attribute-value pair C_1 = V_1. Furthermore, let BR_i^2's components be C_1 = V_1 and C_2 = V_2. Adding the component C_2 = V_2 to BR_i^1 may not result in a better scoring rule. During our search for the best scoring two component rule, we only consider two component rules in which adding either component has a significant effect. Determining if either component has a significant effect can be done through two hypothesis tests. In the first hypothesis test, we use Fisher's exact test to determine the score of adding C_2 = V_2 to the one component rule C_1 = V_1. Similarly, in the second hypothesis test, we use Fisher's exact test to score the addition of the component C_1 = V_1 to C_2 = V_2. The 2-by-2 contingency tables used by the two hypothesis tests are shown in Table 3.
Records from Today with C_1 = V_1 and C_2 = V_2    Records from Other with C_1 = V_1 and C_2 = V_2
Records from Today with C_1 = V_1 and C_2 ≠ V_2    Records from Other with C_1 = V_1 and C_2 ≠ V_2

Records from Today with C_1 = V_1 and C_2 = V_2    Records from Other with C_1 = V_1 and C_2 = V_2
Records from Today with C_1 ≠ V_1 and C_2 = V_2    Records from Other with C_1 ≠ V_1 and C_2 = V_2

Table 3: 2x2 Contingency Tables for a Two Component Rule

Once we have the scores for both tables, we need to determine if they are significant or not. A score is considered significant if the result of a hypothesis test is significant at the α = 0.05 level. If the scores for the two tables are both significant, then the presence of both components has an effect. As a result, the best rule overall for day i is BR_i^2. On the other hand, if any one of the scores is not significant, then the best rule overall for day i is BR_i^1.

2.5 n Component Rules

Let BR_i^{k-1} be the best k-1 component rule found for day i. In the general case of finding the best n component rule, the procedure is analogous to that of the previous section. Given BR_i^{k-1}, we produce BR_i^k by greedily adding on the best component, which is found by evaluating all possible attribute-value pairs as the next component, excluding those already present in the components of BR_i^{k-1}. Starting with BR_i^1, we repeat this procedure until we reach BR_i^n.

In order to determine if the addition of a component is significant, we should in theory test all 2^n possible combinations of the n components, which requires a number of hypothesis tests that grows exponentially in n. Having this many tests is clearly computationally intensive as n increases. As an approximation, we resort to testing if adding the nth component is significant with respect to the n-1 other components. The two significance tests are as shown in Table 4, where C_n = V_n refers to the last component added and C_1 = V_1, ..., C_{n-1} = V_{n-1} refers to the conjunction of the previous n-1 components. As before, if both of the Fisher's exact tests return a score less than α = 0.05, then we consider the addition of the component significant. Due to this step, the probability of having a rule with many components is low because each component added needs to be significant at the 95% level for both of the Fisher's exact tests.

Records from Today with C_1 = V_1, ..., C_{n-1} = V_{n-1} and C_n = V_n    Records from Other with C_1 = V_1, ..., C_{n-1} = V_{n-1} and C_n = V_n
Records from Today with C_1 = V_1, ..., C_{n-1} = V_{n-1} and C_n ≠ V_n    Records from Other with C_1 = V_1, ..., C_{n-1} = V_{n-1} and C_n ≠ V_n

Records from Today with C_1 = V_1, ..., C_{n-1} = V_{n-1} and C_n = V_n    Records from Other with C_1 = V_1, ..., C_{n-1} = V_{n-1} and C_n = V_n
Records from Today with ¬(C_1 = V_1, ..., C_{n-1} = V_{n-1}) and C_n = V_n    Records from Other with ¬(C_1 = V_1, ..., C_{n-1} = V_{n-1}) and C_n = V_n

Table 4: 2x2 Contingency Tables for an N Component Rule

2.6 Finding the p-value for a Rule

The algorithm above for determining scores is prone to overfitting due to multiple hypothesis testing. Even if data were generated randomly, most single rules would have insignificant p-values, but the best rule would be significant if we had searched over 1000 possible rules. In order to illustrate this point, suppose we follow the standard practice of rejecting the null hypothesis when the p-value is < α, where α = 0.05. In the case of a single hypothesis test, the probability of a false positive under the null hypothesis would be α, which equals 0.05. On the other hand, if we perform 1000 hypothesis tests, one for each possible rule under consideration, then the probability of a false positive could be as bad as 1 − (1 − 0.05)^1000 ≈ 1, which is much greater than 0.05 (Miller et al., 2001). Thus, if our algorithm returns a significant p-value, we cannot accept it at face value without adding an adjustment for the multiple hypothesis tests we performed. This problem can be addressed using a Bonferroni correction (Bonferroni, 1936), but this approach would be unnecessarily conservative. Instead, we use a randomization test. Under the null hypothesis of this randomization test, the date and the other ED case attributes are assumed to be independent. Consequently, the case attributes in the data set DB_i remain the same for each record, but the date field is shuffled between records from the current day and records from five to eight weeks ago. The full method for the randomization test is shown below. Let UCP_i = Uncompensated p-value, i.e., the score as defined above.

For j = 1 to 1000
    Let DB_i^(j) = newly randomized data set
    Let BR_i^(j) = Best rule on DB_i^(j)
    Let UCP_i^(j) = Uncompensated p-value of BR_i^(j) on DB_i^(j)

Let the compensated p-value of BR_i be CPV_i, i.e.

    CPV_i = (# of Randomized Tests in which UCP_i^(j) < UCP_i) / (# of Randomized Tests)

CPV_i is an estimate of the chance that we would have seen an uncompensated p-value as good as UCP_i if in fact there was no relationship between date and case attributes. Note that we do not use the uncompensated p-value UCP_i after the randomization test. Instead, the compensated p-value CPV_i is used to decide if an alarm should be raised.

The bottleneck in the entire WSARE procedure is the randomization test. If implemented naively, it can be extremely computationally intense. In order to illustrate its complexity, suppose there are M attributes and each attribute can take on K possible values. In addition, let there be N_T records for today and N_B records for the baseline period. Note that typically, N_T is 4 to 20 times smaller than N_B. At iteration j of the randomization test, we need to search for the best scoring rule over DB_i^(j). Assuming we limit the number of components in a rule to be two, searching for the best rule using a greedy search requires scoring KM + K(M − 1) rules. Scoring a rule requires us to obtain the entries for the two by two contingency table by counting over N_T + N_B records. Thus, each iteration of the randomization test has a complexity of (KM + K(M − 1)) * (N_T + N_B). With Q iterations, the overall complexity of the randomization test is O(QKM(N_T + N_B)).

One of the key optimizations to speeding up the randomization test is the technique of "racing" (Maron and Moore, 1997). If BR_i is highly significant, we run the full 1000 iterations, but we stop early if we can show with very high confidence that CPV_i is going to be greater than 0.1.
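The randomization-test pseudocode above can be sketched in Python as follows (without the racing optimization). Here `best_rule_score` is a hypothetical stand-in for the greedy rule search plus Fisher's exact test; any function that returns an uncompensated p-value for the best rule on a given (recent, baseline) split would fit:

```python
import random

def compensated_p_value(recent, baseline, best_rule_score, iterations=1000,
                        rng=random.Random(0)):
    """Estimate CPV_i by shuffling the date labels between the recent
    and baseline records, as in the WSARE randomization test."""
    ucp = best_rule_score(recent, baseline)   # score on the real split
    pooled = recent + baseline
    n_recent = len(recent)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)                   # break the date/attribute link
        ucp_j = best_rule_score(pooled[:n_recent], pooled[n_recent:])
        if ucp_j < ucp:                       # randomized split looked stranger
            hits += 1
    return hits / iterations

# Toy stand-in score: one minus the fraction of "sick" records in the
# recent split (lower = stranger); it ignores the baseline argument.
def toy_score(recent, baseline):
    return 1.0 - sum(r == "sick" for r in recent) / len(recent)

recent = ["sick"] * 8 + ["well"] * 2
baseline = ["sick"] * 10 + ["well"] * 30
cpv = compensated_p_value(recent, baseline, toy_score)
print(0.0 <= cpv <= 1.0)  # True
```

With these toy counts the recent split is far sicker than random shuffles can reproduce, so the estimated CPV is near zero.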
As an i j j CPV example, suppose we have gone through iterations and let be the value of CPV on the i i j ( CPV j is calculated as the number of times so far that the best scoring rule on the current iteration i randomized data set has a lower p-value than the best scoring rule over th e original unrandomized data set). Using a normality assumption on the distribution of CPV , we can estimate the standard i deviation and form a 95% confidence interval on the true value of CPV σ . This is achieved using CPV i i σ 1 . 96 σ 96 . 1 j j CPV CPV i i √ √ ± CPV the interval − . If the lower half of this interval, namely CPV , is greater i i n n than, say 0.1, we are 95% sure that this score will be insignificant at the 0.1 level. On a typical data set where an outbreak is unlikely, the majority of days will result in insignifica nt p-values. As a result, we expect the racing optimization to allow us to stop early on many days. 2.7 Using FDR to Determine Which p-values are Significant This algorithm can be used on a day-to-day basis or it can operate over a history of several days to report all significantly anomalous patterns. When using our algorithm on a day-to-day basis, the compensated p-value CPV obtained for the current day through the randomization tests can i be interpreted at face value. However, when analyzing historical data, we need to characterize the false discovery rate over the group of days in the history, which requ ires comparing the CPV i values for each day. Comparison of multiple CPV values in the historical window results in a i 1969

second overfitting opportunity, analogous to that caused by performing multiple hypothesis tests to determine the best rule for a particular day. As an illustration, suppose we took 500 days of randomly generated data. Then, approximately 5 days would have a CPV_i value less than 0.01, and these days would naively be interpreted as being significant. Two approaches can be used to correct this problem. The Bonferroni method (Bonferroni, 1936) aims to reduce the probability of making one or more false positives to be no greater than α. However, this tight control over the number of false positives causes many real discoveries to be missed. The other alternative is Benjamini and Hochberg's False Discovery Rate method (Benjamini and Hochberg, 1995), which we will refer to as BH-FDR. BH-FDR guarantees that the false discovery rate, which is the expected fraction of the number of false positives over the number of tests in which the null hypothesis is rejected, will be no greater than α_FDR. The FDR method is more desirable as it has a higher power than the Bonferroni correction while keeping a reasonable control over the false discovery rate. We incorporate the BH-FDR method into our rule-learning algorithm by first providing an α_FDR value and then using BH-FDR to find the cutoff threshold for determining which p-values are significant.

3. WSARE 3.0

Many detection algorithms (Goldenberg et al., 2002; Zhang et al., 2003; Fawcett and Provost, 1997) assume that the observed data consist of cases from background activity, which we will refer to as the baseline, plus any cases from irregular behavior. Under this assumption, detection algorithms operate by subtracting away the baseline from recent data and raising an alarm if the deviations from the baseline are significant. The challenge facing all such systems is to estimate the baseline distribution from historical data.
In general, determining this distribution is extremely difficult due to the different trends present in surveillance data. Seasonal variations in weather and temperature can dramatically alter the distribution of surveillance data. For example, flu season typically occurs during mid-winter, resulting in an increase in ED cases involving respiratory problems. Disease outbreak detectors intended to detect epidemics such as SARS, West Nile Virus and anthrax are not interested in detecting the onset of flu season and would be thrown off by it. Day of week variations make up another periodic trend. Figure 3, which is taken from Goldenberg et al. (2002), clearly shows the periodic elements in cough syrup and liquid decongestant sales.

Choosing the wrong baseline distribution can have dire consequences for an early detection system. Consider once again a database of ED records. Suppose we are presently in the middle of flu season and our goal is to detect anthrax, not an influenza outbreak. Anthrax initially causes symptoms similar to those of influenza. If we choose the baseline distribution to be outside of the current flu season, then a comparison with recent data will trigger many false anthrax alerts due to the flu cases. Conversely, suppose we are not in the middle of flu season and that we obtain the baseline distribution from the previous year's influenza outbreak. The system would now consider high counts of flu-like symptoms to be normal. If an anthrax attack occurs, it would be detected late, if at all.

[Figure 3: A time series of cough syrup and liquid decongestant sales from 07/01/99 to 04/01/01, showing strong periodic trends.]

Figure 3: Cough syrup and liquid decongestant sales from (Goldenberg et al., 2003)

There are clearly tradeoffs when defining this baseline distribution. At one extreme, we would like to capture any current trends in the data. One solution would be to use only the most recent data, such as data from the previous day. This approach, however, places too much weight on outliers that may only occur in a short but recent time period. On the other hand, we would like the baseline to be accurate and robust against outliers. We could use data from all previous years to establish the baseline. This choice would smooth out trends in the data and likely raise alarms for events that are due to periodic trends.

In WSARE 2.0, we made the baseline distribution to be raw data obtained from selected historical days. For example, we chose data from 35, 42, 49, and 56 days prior to the current day under examination. These dates were chosen to incorporate enough data so that seasonal trends could be captured, and they were also chosen to avoid weekend versus weekday effects by making all comparisons from the same day of week. This baseline was chosen manually in order to tune the performance of WSARE 2.0 on the data set. Ideally, the detection system should determine the baseline automatically.

In this section, we describe how we use a Bayesian network to represent the joint probability distribution of the baseline. From this joint distribution, we represent the baseline distributions from the conditional distributions formed by conditioning on what we term environmental attributes. These attributes are precisely those attributes that account for trends in the data, such as the season, the current flu level and the day of week.
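The idea of conditioning the baseline on environmental attributes can be illustrated with a deliberately tiny sketch. Here a single response attribute is modeled directly given the environmental attributes; the real WSARE 3.0 instead learns a full Bayesian network over all attributes, and the attribute names and counts below are invented for illustration:

```python
import random
from collections import Counter, defaultdict

# Toy history: (season, day of week, syndrome) triples. Season and day
# of week play the role of environmental attributes; syndrome is the
# response attribute.
history = [
    ("winter", "Mon", "respiratory"), ("winter", "Mon", "respiratory"),
    ("winter", "Mon", "gi"), ("summer", "Mon", "gi"),
    ("summer", "Mon", "rash"), ("summer", "Mon", "gi"),
]

# Estimate P(Syndrome | Season, DayOfWeek) from historical counts.
cond_counts = defaultdict(Counter)
for season, dow, syndrome in history:
    cond_counts[(season, dow)][syndrome] += 1

def sample_baseline(environment, n, rng=random.Random(0)):
    """Draw n baseline records conditioned on today's environment."""
    counts = cond_counts[environment]
    values, weights = zip(*counts.items())
    return rng.choices(values, weights=weights, k=n)

# Today is a winter Monday: the baseline reflects flu-season proportions,
# so elevated respiratory counts are treated as normal.
baseline = sample_baseline(("winter", "Mon"), n=1000)
print(Counter(baseline).most_common(1)[0][0])
```

Because the baseline is sampled under today's environment, a winter Monday baseline is dominated by respiratory cases, while a summer Monday baseline would not be.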
3.1 Creating the Baseline Distribution

Learning the baseline distribution involves taking all records prior to the past 24 hours and building a Bayesian network from this subset. During the structure learning, we differentiate between environmental attributes, which are attributes that cause trends in the data, and response attributes, which are the remaining attributes. The environmental attributes are specified by the user based on the user's knowledge of the problem domain. If there are any latent environmental attributes

that are not accounted for in this model, the detection algorithm may have some difficulties. However, as will be described later on in Section 4, WSARE 3.0 was able to overcome some hidden environmental attributes in our simulator.

While learning the structure of the Bayesian network, environmental attributes are prevented from having parents because we are not interested in predicting their distributions, but rather, we want to use them to predict the distributions of the response attributes. In general, any structure learning algorithm can be used in this step as long as it follows this restriction. In fact, the structure search can even exploit this constraint by avoiding search paths that assign parents to the environmental attributes.

We experimented with using hillclimbing to learn the Bayesian network structure and found it to be both slow and prone to being trapped in local optima. As a result, we developed an efficient structure search algorithm called Optimal Reinsertion based on ADTrees (Moore and Lee, 1998). Unlike hillclimbing, which performs a single modification to a directed acyclic graph (DAG) on each step, Optimal Reinsertion is a larger scale search operator that is much less prone to local optima. Optimal Reinsertion first picks a target node T from the DAG, disconnects T from the graph, and efficiently finds the optimal way to reinsert T back into the graph according to the scoring function. The details of this algorithm can be found in (Moore and Wong, 2003).

We have often referred to environmental attributes as attributes that cause periodic trends. Environmental attributes, however, can also include any source of information that accounts for recent changes in the data. For example, suppose we detect that a botulism outbreak has occurred and we would still like to be on alert for any anthrax releases.
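The no-parents restriction on environmental attributes can be enforced by pruning the candidate edge set before any scoring is done. The sketch below illustrates only this constraint (the function and node names are ours), not the Optimal Reinsertion search itself:

```python
def candidate_edges(nodes, environmental):
    # All directed edges the structure search may consider: any edge
    # whose child is an environmental attribute is pruned up front,
    # so environmental attributes can never acquire parents.
    return [(parent, child)
            for parent in nodes for child in nodes
            if parent != child and child not in environmental]

nodes = ["Season", "Day of Week", "Symptom", "Region"]
edges = candidate_edges(nodes, environmental={"Season", "Day of Week"})
```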
We can add "Botulism Outbreak" as an environmental attribute to the network and supplement the current data with information about the botulism outbreak. Incorporating such knowledge into the Bayesian network allows WSARE to treat events due to the botulism outbreak as part of the baseline.

Once the Bayesian network is learned, we have a joint probability distribution for the data. We would like to produce a conditional probability distribution, which is formed by conditioning on the values of the environmental attributes. Suppose that today is February 21, 2003. If the environmental attributes were Season and Day of Week, we would set Season = Winter and Day of Week = Weekday. Let the response attributes in this example be X_1, ..., X_n. We can then obtain the probability distribution P(X_1, ..., X_n | Season = Winter, Day of Week = Weekday) from the Bayesian network. For simplicity, we represent the conditional distribution as a data set formed by sampling a large number of records from the Bayesian network conditioned on the environmental attributes. The number of samples is specified by the parameter num_samples_baseline, which has to be large enough to ensure that samples with rare combinations of attributes will be present. In general, this number will depend on the learned Bayesian network's structure and the parameters of the network. We chose to sample 10000 records because we determined empirically that this number is a reasonable compromise between running time and accuracy on our data. We will refer to this sampled data set as DB_baseline. The data set corresponding to the records from the past 24 hours of the current day will be named DB_recent.

We used a sampled data set instead of using inference mainly for simplicity. Inference might be faster than sampling to obtain the conditional probability P(X_1, ..., X_n | Environmental Attributes), especially when the learned Bayesian networks are simple.
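The sampling step can be sketched as forward sampling with the environmental attributes clamped. The toy network, its interface (a map from each response attribute to a sampling function standing in for its conditional distribution), and all parameter values below are illustrative assumptions, not the paper's implementation:

```python
import random

random.seed(0)

def sample_baseline(response_cpts, environment, num_samples=10000):
    # Environmental attributes are roots of the network, so conditioning
    # on them amounts to clamping their values and forward-sampling the
    # response attributes in topological order.
    records = []
    for _ in range(num_samples):
        record = dict(environment)          # clamp Season, Day of Week, ...
        for node, cpt in response_cpts.items():
            record[node] = cpt(record)      # sample response attribute
        records.append(record)
    return records

# Toy network: one response attribute whose distribution depends on Season.
cpts = {"Symptom": lambda r: random.choices(
    ["respiratory", "none"],
    weights=(7, 3) if r["Season"] == "Winter" else (2, 8))[0]}
db_baseline = sample_baseline(cpts, {"Season": "Winter"}, num_samples=1000)
```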
However, if inference is used, it is somewhat unclear how to perform the randomization test. With sampling, on the other hand, we only need to generate DB_baseline once and then we can use it for the randomization test to obtain the p-values for all the rules. In addition, sampling is easily done in an efficient manner since

environmental attributes have no parents. While a sampled data set provides the simplest way of obtaining the conditional distribution, we have not completely ignored the possibility of using inference to speed up this process. We would like to investigate this direction further in our future work.

3.2 Dealing with New Hospitals Coming Online

WSARE 3.0 assumes that the baseline distribution remains relatively stable, with the environmental attributes accounting for the only sources of variation. However, in a real life situation where data are pooled from various EDs around a city, new hospitals frequently come online and become a new source of data to be monitored. These new data sources cause a shift from the baseline distribution that is not accounted for in WSARE 3.0. For example, suppose a children's hospital begins sending data to the surveillance system. In this case, WSARE 3.0 would initially detect an anomalous pattern due to an increase in the number of cases involving children from the part of the city where the children's hospital is located. Over time, WSARE 3.0 would eventually incorporate the newly added hospital's data into its baseline.

In general, this problem of a shifted distribution is difficult to address. We approach this issue by ignoring the new data sources until we have enough data from them to incorporate them into the baseline. Our solution relies on the data containing an attribute such as Hospital ID that can identify the hospital that the case originated from. HIPAA regulations can sometimes prevent ED data from containing such identifying attributes. In this case, we recommend using WSARE 2.0 with a recent enough baseline period in order to avoid instabilities due to new data sources. Whenever the data includes a Hospital ID attribute, we first build a list of hospitals that provide data for the current day.
For each hospital in this list, we keep track of the first date a case came from that particular hospital. If the current day is less than a year after the first case date, we consider that hospital to have insufficient historical data for the baseline and we ignore all records from that hospital. For each hospital with sufficient historical records, we then build a Bayesian network using only historical data originating from that particular hospital.

In order to produce the baseline data set, we sample a total of 10000 records from all the hospital Bayesian networks. Let n_h be the number of records from hospital h on the current day and suppose there are H hospitals with sufficient historical data for the current date. Then let N = Σ_{h=1}^{H} n_h. Each hospital's Bayesian network contributes 10000 × n_h / N samples to the baseline data set. As an example, suppose we have 5 hospitals with 100 records each. Furthermore, assume that we can ignore the fourth hospital's records since its first case is less than a year prior to the current date. We are then left with 4 hospitals with 100 records each. After we build the Bayesian network for each hospital, we sample 2500 records from the Bayesian network belonging to each of the four hospitals.

4. Evaluation

Validation of early outbreak detection algorithms is generally a difficult task due to the type of data required. Health-care data during a known disease outbreak, either natural or induced by a bioagent release, are extremely limited. Even if such data were plentiful, evaluation of biosurveillance algorithms would require the outbreak periods in the data to be clearly labelled. This task requires an expert to inspect the data manually, making this process extremely slow. Consequently, such labelled data would still be scarce and making statistically significant conclusions with the results of detection algorithms would be difficult. Furthermore, even if a group of epidemiologists were to

be assembled to label the data, there would still be disagreements as to when an outbreak begins and ends.

As a result of these limitations, we validate the WSARE algorithms on data from a simulator which we will refer to as the city Bayesian network (CityBN) simulator. The CityBN simulator is based on a large Bayesian network that introduces temporal fluctuations based on a variety of factors. The structure and the parameters for this Bayesian network are created by hand. This simulator is not intended to be a realistic epidemiological model. Instead, the model is designed to produce extremely noisy data sets that are a challenge for any detection algorithm. In addition to simulated data, we also include WSARE output from ED data from an actual city. Due to the fact that epidemiologists have not analyzed this real world data set for known outbreaks, we are only able to provide annotated results from the runs of WSARE.

4.1 The CityBN Simulator

The city in the CityBN simulator consists of nine regions, each of which contains a different sized population, ranging from 100 people in the smallest area to 600 people in the largest section, as shown in Table 5. We run the simulation for a two year period starting from January 1, 2002 to December 31, 2003. The environment of the city is not static, with weather, flu levels and food conditions in the city changing from day to day. Flu levels are typically low in the spring and summer but start to climb during the fall. We make flu season strike in winter, resulting in the highest flu levels during the year. Weather, which only takes on the values of hot or cold, is as expected for the four seasons, with the additional feature that it has a good chance of remaining the same as it was yesterday. Each region has a food condition of good or bad. A bad food condition facilitates the outbreak of food poisoning in the area.
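The weather dynamics described above amount to a two-state chain with a persistence probability. The sketch below is illustrative only; in particular, the 0.8 persistence value is our own choice, as the simulator's actual parameter is not given in the text:

```python
import random

random.seed(1)

def next_weather(today, persist=0.8):
    # Weather has a good chance of remaining the same as yesterday;
    # otherwise it flips between hot and cold.  The persistence value
    # 0.8 is an assumed, illustrative parameter.
    if random.random() < persist:
        return today
    return "cold" if today == "hot" else "hot"

weather = ["hot"]
for _ in range(364):
    weather.append(next_weather(weather[-1]))
```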
Table 5: The geographic regions in the CityBN simulator with their populations in parentheses

    NW (100)   N (400)   NE (500)
    W  (100)   C (200)   E  (300)
    SW (200)   S (200)   SE (600)

Figure 4: City Status Bayesian Network

We implement this city simulation using a single large Bayesian network. For simplicity, we will describe this large Bayesian network in two parts, as shown in Figures 4 and 5. The subnetwork shown in Figure 4 is used to create the state of the city for a given day. Given the state of the city, the network in Figure 5 is used to generate records for individual patients. We use the convention that any nodes shaded black in the subnetwork are set by the system and do not have their values generated probabilistically. Due to space limitations, instead of showing

eighteen separate nodes for the current and previous food conditions of each region in Figure 4, we summarize them using the generic nodes Region Food Condition and Previous Region Food Condition respectively. This same space saving technique is used for the current and previous region anthrax concentrations. Most of the nodes in this subnetwork take on two to three values. For each day, after the black nodes have their values set, the values for the white nodes are sampled from the subnetwork. These records are stored in the City Status (CS) data set. The simulated anthrax release is selected for a random date during a specified time period. One of the nine regions is chosen randomly for the location of the simulated release. On the date of the release, the Region Anthrax Concentration node is set to have the value of High. The anthrax concentration remains high for the affected region for each subsequent day with an 80% probability. This probability is chosen in order to ensure that enough individuals in the simulation are being infected by anthrax over an extended period of time after the attack.

Figure 5: Patient Status Bayesian Network

Table 6: Examples of two records in the PS data set

    Location          N            NW
    Age               Child        Senior
    Gender            Male         Female
    Flu Level         High         None
    Day of Week       Weekday      Weekday
    Weather           Cold         Hot
    Season            Winter       Summer
    Action            Absent       ED visit
    Reported Symptom  Nausea       Rash
    Drug              None         None
    Date              Jan-01-2002  Jun-21-2002

The second subnetwork used in our simulation produces individual health care cases. Figure 5 depicts the Patient Status (PS) network.
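The release mechanics for the Region Anthrax Concentration node can be sketched as follows. One added assumption of ours: once the 80% persistence check fails, the concentration stays Low for good, which the text does not state explicitly:

```python
import random

random.seed(2)

def simulate_release_concentration(num_days, persist=0.8):
    # On the release date the affected region's concentration is set to
    # High; on each subsequent day it remains High with 80% probability.
    # We assume (illustratively) that it reverts to Low permanently once
    # the persistence check fails.
    levels = ["High"]
    while len(levels) < num_days and random.random() < persist:
        levels.append("High")
    levels += ["Low"] * (num_days - len(levels))
    return levels

levels = simulate_release_concentration(30)
```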
On each day, for each person in each region, we sample

the individual's values from this subnetwork. The black nodes first have their values assigned from the CS data set record for the current day. For the very first day, the black nodes are assigned a set of initial values. The white nodes are then sampled from the PS network. Each individual's health profile for the day is thus generated. The nodes Season, Day of Week, Weather, Flu Level, Region Grassiness, and Region Food Condition are intended to represent environmental variables that affect the upswings and downswings of a disease. The Region Grassiness nodes indicate the amount of pollen in the air and thus affect the allergies of a patient. We choose these environmental variables because they are the most common factors influencing the health of a population. Two of the environmental variables, namely Region Grassiness and Region Food Condition, are hidden from the detection algorithm while the remaining environmental attributes are observed. We choose to hide these two attributes because the remaining four attributes that are observed are typically considered when trying to account for temporal trends in biosurveillance data.

As for the other nodes, the Disease node indicates the status of each person in the simulation. We assume that a person is either healthy or they can have, in order of precedence, allergies, the cold, sunburn, the flu, food poisoning, heart problems or anthrax. If the values of the parents of the Disease node indicate that the individual has more than one disease, the Disease node picks the disease with the highest precedence. This simplification prevents individuals from having multiple diseases. A sick individual then exhibits one of the following symptoms: none, respiratory problems, nausea, or a rash.
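The Disease node's tie-breaking rule can be sketched directly. The direction of the precedence list is our assumption: we treat anthrax as the highest-precedence entry so that it is never masked by a milder illness:

```python
# Precedence from lowest to highest; the direction of the paper's list
# is our assumption for this sketch.
PRECEDENCE = ["allergies", "cold", "sunburn", "flu",
              "food poisoning", "heart problems", "anthrax"]

def disease_node(candidate_diseases):
    # A person whose parent nodes indicate several diseases is assigned
    # only the one with the highest precedence, so no individual ever
    # carries multiple diseases.
    if not candidate_diseases:
        return "healthy"
    return max(candidate_diseases, key=PRECEDENCE.index)

assigned = disease_node({"flu", "anthrax"})
```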
Note that in our simulation, as in real life, different diseases can exhibit the same symptoms; for example, a person with the flu can exhibit respiratory problems, as could a person with anthrax. The actual symptom associated with a person may not necessarily be the same as the symptom that is reported to health officials. Actions available to a sick person include doing nothing, buying medication, going to the ED, or being absent from work or school. As with the CS network, the arities for each node in the PS network are small, ranging from two to four values.

If the patient performs any action other than doing nothing, the patient's health care case is added to the PS data set. Only the attributes in Figure 5 labelled with uppercase letters are recorded, resulting in a great deal of information being hidden from the detection algorithm, including some latent environmental attributes. The number of cases the PS network generates daily is typically in the range of 30 to 50 records. Table 6 contains two examples of records in the PS data set.

We run six detection algorithms on 100 different PS data sets. Each data set is generated for a two year period, beginning on January 1, 2002 and ending December 31, 2003. The detection algorithms train on data from the first year until the day being monitored while the second year is used for evaluation. The anthrax release is randomly chosen in the period between January 1, 2003 and December 31, 2003. We try to simulate anthrax attacks that are not trivially detectable.

Figure 6 plots the total count of health-care cases on each day during the evaluation period while Figure 7 plots the total count of health-care cases involving respiratory symptoms for the same simulated data set. A naive detection algorithm would assume that the highest peak in this graph would be the date of the anthrax release. However, the anthrax release occurs on day index 74,409, which is clearly not the highest peak in either graph.
Occasionally the anthrax release affects such a limited number of people that it is undetected by all the algorithms. Consequently, we only use data sets with more than eight reported anthrax cases on any day during the attack period.

The following paragraphs describe the six detection algorithms that we run on the data sets. Three of these methods, namely the control chart, moving average, and ANOVA regression algorithms, operate on univariate data. We apply these three algorithms to two different univariate data

Figure 6: Daily counts of health-care data

Figure 7: Daily counts of health-care data involving respiratory symptoms

sets – one data set is composed of total daily counts and the other of daily counts of cases involving respiratory symptoms. The remaining three algorithms are variations on WSARE.

The Control Chart Algorithm The first algorithm used is a common anomaly detection algorithm called a control chart. This detector determines the mean and variance of the total number of records on each day in the PS data set during the training period. A threshold is calculated based on the formula below, in which Φ⁻¹ is the inverse of the cumulative distribution function of a standard normal while the p-value is supplied by the user.

    threshold = μ + σ × Φ⁻¹(1 − p-value/2)

If the aggregate daily count of health care data exceeds this threshold during the evaluation period, the control chart raises an alarm. We use a training period of January 1, 2002 to December 31, 2002.

Moving Average Algorithm The second algorithm that we use is a moving average algorithm that predicts the count for the current day as the average of counts from the previous 7 days. The window of 7 days is intended to capture any recent trends that might appear in the data. An alarm level is generated by fitting a Gaussian to data prior to the current day and obtaining a p-value for

the current day's count. The mean and standard deviation for the Gaussian are calculated using data from the 7 days before the current day.

ANOVA Regression A simple detector that accounts for environmental factors is ANOVA regression, which is simply linear regression supplemented with covariates for the environmental variables. We include 6 covariates for the days of the week, 3 for the seasons and one for the daily aggregate count from the previous day. ANOVA regression is a fairly powerful detector when temporal trends are present in the data, as was shown in (Buckeridge et al., 2005).

WSARE 2.0 WSARE 2.0 is also evaluated, using a baseline distribution of records from 35, 42, 49 and 56 days before the current day. The attributes used by WSARE 3.0 as environmental attributes are ignored by WSARE 2.0. If these attributes are not ignored, WSARE 2.0 would report many trivial anomalies. For instance, suppose that the current day is the first day of fall, making the environmental attribute Season = Fall. Furthermore, suppose that the baseline is taken from the summer season. If the environmental attributes are not ignored, WSARE 2.0 would notice that 100% of the records for the current day have Season = Fall while 0% of the records in the baseline data set match this rule.

WSARE 2.5 Instead of building a Bayesian network over the past data, WSARE 2.5 simply builds a baseline from all records prior to the current period with their environmental attributes equal to the current day's. In our simulator, we use the environmental attributes Flu Level, Season, Day of Week and Weather. To clarify this algorithm, suppose for the current day we have the following values of these environmental attributes: Flu Level = High, Season = Winter, Day of Week = Weekday and Weather = Cold. Then DB_baseline would contain only records before the current period with environmental attributes having exactly these values.
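WSARE 2.5's baseline construction is a simple exact-match filter over past records. A minimal sketch, using hypothetical records (all field values here are illustrative):

```python
ENV_ATTRS = ["Flu Level", "Season", "Day of Week", "Weather"]

def wsare25_baseline(history, today_env):
    # DB_baseline for WSARE 2.5: every past record whose environmental
    # attributes exactly match the current day's values.
    return [r for r in history
            if all(r[a] == today_env[a] for a in ENV_ATTRS)]

history = [
    {"Flu Level": "High", "Season": "Winter", "Day of Week": "Weekday",
     "Weather": "Cold", "Symptom": "respiratory"},
    {"Flu Level": "None", "Season": "Summer", "Day of Week": "Weekday",
     "Weather": "Hot", "Symptom": "rash"},
]
today = {"Flu Level": "High", "Season": "Winter",
         "Day of Week": "Weekday", "Weather": "Cold"}
matches = wsare25_baseline(history, today)
```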
It is possible that no such records exist in the past with exactly this combination of environmental attributes. If there are fewer than five records in the past that match, WSARE 2.5 cannot make an informed decision when comparing the current day to the baseline and simply reports nothing for the current day.

WSARE 3.0 WSARE 3.0 uses the same environmental attributes as WSARE 2.5 but builds a Bayesian network for all data from January 1, 2002 to the day being monitored. We hypothesize that WSARE 3.0 would detect the simulated anthrax outbreak sooner than WSARE 2.5 because 3.0 can handle the cases where there are no records corresponding to the current day's combination of environmental attributes. The Bayesian network is able to generalize from days that do not match today precisely, producing an estimate of the desired conditional distribution. For efficiency reasons, we allow WSARE 3.0 to learn the network structure from scratch once every 30 days on all data since January 1, 2002. On intermediate days, WSARE 3.0 simply updates the parameters of the previously learned network without altering its structure. In practice, we expect WSARE 3.0 to be used in this way since learning the network structure on every day may be very expensive computationally.

4.1.1 Results

In order to evaluate the performance of the algorithms, we plot an Activity Monitoring Operating Characteristic (AMOC) curve (Fawcett and Provost, 1999), which is similar to an ROC curve. On the AMOC curves to follow, the x-axis indicates the number of false positives per month while the y-axis measures the detection time in days. For a given alarm threshold, we plot the performance of the algorithm at a particular false positive level and detection time on the graph. As an example,

suppose we are dealing with an alarm threshold of 0.05. We then take all the alarms generated by an algorithm, say WSARE 3.0, that have a p-value less than or equal to 0.05. Suppose there are two such alarms, with one alarm appearing 5 days before the simulated anthrax release, which would be considered a false positive, and the other appearing 3 days after the release, making the detection time 3 days. If we run the detection algorithms for 1 month, then we would plot a point at (1, 3).

We then vary the alarm threshold in the range of 0 to 0.2 and plot points at each threshold value. For a very sensitive alarm threshold such as 0.2, we expect a higher number of false positives but a lower detection time. Hence the points corresponding to a sensitive threshold would be on the lower right hand side of the graph. Conversely, an insensitive alarm threshold like 0.01 would result in a lower number of false positives and a higher detection time. The corresponding points would appear on the upper left corner of the graph.

Figure 8: AMOC curves comparing WSARE 3.0 to univariate algorithms operating on total daily counts from the CityBN simulator

Figures 8 to 10 plot the AMOC curves, averaged over the 100 data sets, with an alarm threshold increment of 0.001. On these curves, the optimal detection time is one day, as shown by the dotted line at the bottom of the graph. We add a one day delay to all detection times to simulate reality where current data is only available after a 24 hour delay. Any alert occurring before the start of the simulated anthrax attack is treated as a false positive. Detection time is calculated as the first alert raised after the release date.
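The mapping from a set of alarms to one AMOC coordinate can be sketched as below. The function and parameter names are ours; the 14-day cap matches the convention used for undetected releases, and the one-day reporting delay is omitted for brevity:

```python
def amoc_point(alarm_days, release_day, months_run, max_delay=14):
    # False positives: alarms strictly before the release day.
    # Detection time: delay to the first alarm on or after the release,
    # capped at max_delay days if no alarm ever fires.
    false_pos = sum(1 for d in alarm_days if d < release_day)
    hits = [d - release_day for d in alarm_days if d >= release_day]
    detect = min(hits) if hits else max_delay
    return false_pos / months_run, detect

# The worked example above: one alarm 5 days before the release and one
# alarm 3 days after, over a 1-month run, gives the point (1, 3).
point = amoc_point(alarm_days=[95, 103], release_day=100, months_run=1)
```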
If no alerts are raised after the release, the detection time is set to 14 days.

Figures 8 and 9 show that WSARE 3.0 clearly outperforms the univariate algorithms when the univariate algorithms operate on the total daily counts and also when the univariate algorithms operate on the daily counts of cases involving respiratory symptoms. In Figure 10, WSARE 2.5 and WSARE 3.0 outperform the other algorithms in terms of the detection time and false positive tradeoff. For a false positive rate of one per month, WSARE 2.5 and WSARE 3.0 are able to detect the anthrax release within a period of one to two days. The control chart, moving average, ANOVA regression and WSARE 2.0 algorithms are thrown off by the periodic trends present in the PS data.

Figure 9: AMOC curves comparing WSARE 3.0 to univariate algorithms operating on cases involving respiratory symptoms from the CityBN simulator

Figure 10: AMOC curves for WSARE variants operating on CityBN data

We previously proposed that WSARE 3.0 would have a better detection time than WSARE 2.5 due to the Bayesian network's ability to produce a conditional distribution for a combination of environmental attributes that may not exist in the past data. After checking the simulation results for which WSARE 3.0 outperformed WSARE 2.5, we conclude that in some cases, our proposition is true. In others, the p-values estimated by WSARE 2.5 are not as low as those of version 3.0. The

baseline distribution of WSARE 2.5 is likely not as accurate as the baseline of WSARE 3.0 due to smoothing performed by the Bayesian network. The false positives found by WSARE 2.5 and WSARE 3.0 are likely due to other non-anthrax illnesses that are not accounted for in the Bayesian network. Had we explicitly added a Region Food Condition environmental attribute to the Bayesian network, this additional information would likely have reduced the false positive count.

Figures 11 to 14 illustrate the various outbreak sizes in the simulated data by plotting the number of anthrax cases per day during the outbreak period. Since the outbreak sizes and durations are randomly generated for each of the 100 data sets, we do not have room to show plots for each data set. Instead, we include representative plots of the outbreaks that appeared in our simulated data. Figure 11 represents a large scale outbreak which was easily detected on the first day by most algorithms. Large scale outbreaks were rare in our simulated data. Figure 12 is a representative plot of a medium scale outbreak that is most common in the data. The particular outbreak shown in Figure 12 is also detected by WSARE 3.0 on the first day for an alarm threshold of 0.005. Small scale outbreaks, as shown in Figure 13, are the most difficult to detect. WSARE 3.0 detects the outbreak in Figure 13 on the third day with a very insensitive alarm threshold of 0.005. Figure 14 contains an outbreak that WSARE 3.0 is unable to detect using an alarm threshold of 0.03.

We also conduct four other experiments to determine the effect of varying certain parameters of WSARE 3.0. In the first experiment, we use a Bonferroni correction to correct for multiple hypothesis testing instead of a randomization test. The AMOC curves for the results, as shown in Figure 15, indicate that the Bonferroni correction results are almost identical to those of the randomization test.
This similarity was expected because on each day, there are only approximately 50 hypothesis tests being performed to find the best scoring rule and the hypothesis tests are only weakly dependent on each other. However, as the number of hypothesis tests increases and as the dependence between the hypothesis tests increases, the results of the randomization test should be better than those of the Bonferroni correction.

In order to illustrate the advantages of the randomization test, we produce dependent hypothesis tests in WSARE by creating attributes that are dependent on each other. We generate a data set using a Markov chain X_0, ..., X_n in which the states of each random variable in the chain become the attributes in the data set. Each random variable X_t in the Markov chain can be in state A, B, C, or D, except for X_0, which always starts at A. At each time step t, the random variable X_t retains the state of X_{t-1} in the Markov chain with a 90% chance. With a 10% chance, X_t takes on the next state in the ordered sequence A, B, C and D. As an example, if X_{t-1} = A, X_t can remain as A or it can become B. If X_{t-1} = D, X_t can retain the same state as X_{t-1} or transition back to the state A, which is the first state of the ordered sequence. We use this model to generate 150 days worth of data in which each day contains 1000 records and each record contains 100 attributes. We then sample 14 days of data with the same characteristics except the Markov chain is altered slightly so that each random variable X_t remains in the same state as X_{t-1} with an 89% probability. Thirty data sets, each containing a total of 164 days, are produced. Two variations of WSARE 2.0, one with a randomization test and the other with a Bonferroni correction, are applied to these thirty data sets in order to detect the change. Figure 16 plots the average AMOC curve of this experiment.
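The data-generating process for this experiment can be sketched as follows (function and variable names are ours; one call produces one day's worth of records):

```python
import random

random.seed(3)

STATES = "ABCD"

def markov_day(num_records, num_attrs, stay=0.90):
    # Each record is one realization of the chain X_0 ... X_{n-1}:
    # X_0 is always 'A', and each subsequent X_t keeps the previous
    # state with probability `stay`, otherwise advances along the
    # cycle A -> B -> C -> D -> A.
    records = []
    for _ in range(num_records):
        chain = ["A"]
        for _ in range(num_attrs - 1):
            prev = chain[-1]
            if random.random() < stay:
                chain.append(prev)
            else:
                chain.append(STATES[(STATES.index(prev) + 1) % 4])
        records.append(chain)
    return records

baseline_day = markov_day(1000, 100, stay=0.90)  # one of the 150 baseline days
shifted_day = markov_day(1000, 100, stay=0.89)   # one of the 14 altered days
```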
As the graph illustrates, at a false positive rate of less than 0.4 per month, the randomization test has a much better detection time. Upon further analysis, we find that the reduced performance of the Bonferroni correction is due to a much higher number of false positives. As an example, we find that WSARE often notices that a rule such as X_96 = C AND X_27 = B produces a very good score. The Bonferroni correction deals

Figure 11: An example of a large scale outbreak in the CityBN data

Figure 12: An example of a medium scale outbreak in the CityBN data

Figure 13: An example of a small scale outbreak in the CityBN data

Figure 14: An example of an outbreak that was not detected in the CityBN data by WSARE 3.0 with an alarm threshold of 0.03

23 VENTS W TRANGE A BOUT R ECENT E S ’ HAT S Bonferroni Correction Versus Randomization Test for Greedy Rule Search with the Maximum Number of Rule Components as 2 7 Bonferroni Randomization Test Optimal 6 5 4 3 Detection Time in Days 2 1 0 6 7 0 1 2 3 4 5 False Positives per Month Figure 15: The Bonferroni correction version of WSARE versus the r andomization test version on the CityBN data Effect of Dependence among Hypothesis Tests on the Randomization Test and the Bonferroni Correction 12 Bonferroni Randomization 10 8 6 Detection Time in Days 4 2 0 2.5 3 0.5 1 0 2 1.5 False Positives per Month Figure 16: A comparison between the Bonferroni correction version of WSARE and the random- ization test version on data generated from a Markov chain with the multiple hypothesis problem by simply multiplying the score with the number of hy pothesis tests. Although there are a high number of hypothesis tests in this experiment, mu ltiplying by the number of hypothesis tests still results in a low compensated p-value. The ran domization test, on the other hand, notices that although the score is very good, the probability of finding an equal or better score for another rule, such as X is quite high because of the dependence = A AND X B = 94 46 1983

between attributes. Thus, the resulting compensated p-value from the randomization test is quite high, signifying that the pattern defined by the rule is not so unusual after all.

Figure 17: The effect of varying the maximum number of components for a rule on the AMOC curve for CityBN data

The second experiment involves varying the maximum components allowed per rule from one to three. As seen on the AMOC curve in Figure 17, the variations do not seem significantly different to the left of the one false positive per month mark. However, after this point, a version of WSARE with a three component limit outperforms the other two variations. By setting the maximum number of components per rule to be three, WSARE is capable of being more expressive in its description of anomalous patterns. On the other hand, WSARE also guards against overfitting by requiring each component added to be 95% significant for the two hypothesis tests performed in Section 2.5. This criterion makes the addition of a large number of rule components unlikely and we expect the optimal number of components to be about two or three.

The third experiment involves changing the rule search to be exhaustive rather than greedy. Note that if we compare the score of the best rule found by the exhaustive method against that found by the greedy method, the exhaustive method would unquestionably find a rule with an equal or greater score than the greedy method. In Figure 18, however, we compare the performance of the two algorithms using AMOC curves. Each coordinate on the AMOC curve is a result of a compensated p-value produced by the randomization test and not the rule score.
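As an aside, the greedy rule search and the randomization test that compensates its best score for multiple (possibly dependent) hypothesis tests can be sketched together as follows; the score function and all names here are illustrative, not the paper's implementation:

```python
import random

def greedy_rule_search(records, labels, score, max_components=2):
    """Grow a rule (a dict of attribute -> value conditions) greedily,
    one component at a time, keeping the extension with the best
    (lowest) score and stopping when no extension improves it."""
    rule, best = {}, score({}, records, labels)
    while len(rule) < max_components:
        candidate = None
        for attr in records[0]:
            if attr in rule:
                continue
            for value in {r[attr] for r in records}:
                trial = dict(rule, **{attr: value})
                s = score(trial, records, labels)
                if s < best:
                    best, candidate = s, trial
        if candidate is None:        # no extension improved the score
            break
        rule = candidate
    return rule, best

def compensated_p_value(records, labels, score, n_iter=200):
    """Randomization test: shuffle the today/baseline labels, re-run the
    whole rule search each time, and count how often the shuffled data
    yields a best score at least as good as the observed one. This
    automatically accounts for however many (dependent) hypothesis
    tests the search performs."""
    _, observed = greedy_rule_search(records, labels, score)
    hits = 0
    for _ in range(n_iter):
        shuffled = labels[:]
        random.shuffle(shuffled)
        _, s = greedy_rule_search(records, shuffled, score)
        if s <= observed:            # lower scores are better
            hits += 1
    return (hits + 1) / (n_iter + 1)

# Illustrative data: "today" has an excess of g = M cases over the baseline.
records = [{"g": "M"}] * 8 + [{"g": "F"}] * 2 + [{"g": "M"}] * 2 + [{"g": "F"}] * 8
labels = ["today"] * 10 + ["baseline"] * 10

def proportion_gap(rule, records, labels):
    """Toy score: baseline match fraction minus today's match fraction,
    so a large recent upswing gives a very negative (better) score."""
    def frac(lab):
        group = [r for r, l in zip(records, labels) if l == lab]
        return sum(all(r[a] == v for a, v in rule.items()) for r in group) / len(group)
    return frac("baseline") - frac("today")

best_rule, best_score = greedy_rule_search(records, labels, proportion_gap)
```

Because the randomization test replays the entire search on each shuffled copy, correlated candidate rules do not inflate significance the way they can under a simple Bonferroni multiplier.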
Thus, even though an exhaustive rule search will always equal or outperform a greedy rule search in terms of the best rule score, it is not guaranteed to be superior to the greedy rule search on an AMOC curve due to the fact that the randomization test adjusts the rule score for multiple hypothesis testing. In Figure 18, we plot the AMOC curves comparing the average performance for both the exhaustive and greedy algorithms over 100 experiments; we do not show the confidence intervals in order to avoid clutter. The confidence intervals for both the greedy and the exhaustive curves do overlap substantially. Therefore, there appears to be no significant difference between the two algorithms for the data from this simulator. We measure the exhaustive search to be 30 times slower than the greedy search. Since the AMOC curves are nearly identical for our simulated data, we prefer the greedy search.

Figure 18: AMOC curves for greedy versus exhaustive rule search for CityBN data

Finally, we experiment with adding noise to the data by increasing the number of ED cases due to allergies, food poisoning, sunburns and colds. We increase the noise levels by increasing the probabilities of Region Food Condition = bad, Has Allergy = true, Has Cold = true, and Has Sunburn = true, all in their respective conditional probability tables. Note that these nodes are not visible in the output data. Increasing these probabilities involves changes to many entries of the conditional probability tables and we do not have space to list all of the changes. In general, we increase the probabilities of the corresponding entries in the conditional probability tables by approximately 0.004-0.005. We cannot say specifically how many noisy cases are generated since this amount fluctuates over time.

We produce 100 data sets with increased noise levels, which we will refer to as "Noisy", and we also produce another 100 data sets with even more noise, which we will refer to as "Noisier". The "Regular" data sets are the 100 data sets used in all previous experiments. We then apply WSARE 3.0 to these three groups. The average AMOC curve for each group of 100 data sets is plotted in Figure 19. As in previous experiments, we use the environmental attributes of Season, Day of Week, Flu Level and Weather. As shown in Figure 19, both the detection time and the false positive rate degrade with increased noise levels.

Figure 19: The effect of increased noise levels in the data on WSARE 3.0

4.2 Annotated Output of WSARE 3.0 on Actual ED Data for 2001

We also test the performance of WSARE 3.0 on actual ED data from a major US city. This database contains almost seven years worth of data, with personal identifying information excluded in order to protect patient confidentiality. The attributes in this database include date of admission, coded hospital ID, age decile, gender, syndrome information, discretized home latitude, discretized home longitude, discretized work latitude, discretized work longitude and both home location and work location on a coarse latitude-longitude grid. In this data, new hospitals come online and begin submitting data during the time period that the data is collected. We use the solution described in Section 3.2 to address this problem.
WSARE operates on data from the year 2001 and is allowed to use over five full years worth of training data from the start of 1996 to the current day. The environmental attributes used are month, day of week and the number of cases from the previous day with respiratory problems. The last environmental attribute is intended to be an approximation to the flu levels in the city. We use a one-sided Fisher's exact test to score the rules such that only rules corresponding to an upswing in recent data are considered. In addition, we apply the Benjamini-Hochberg FDR procedure with α_FDR = 0.1.

The following list contains the significant anomalous patterns found in the real ED data for the year 2001.

1. 2001-02-20: SCORE = -2.15432e-07 PVALUE = 0
   15.9774% (85/532) of today's cases have Viral Syndrome = True and Respiratory Syndrome = False
   8.84% (884/10000) of baseline cases have Viral Syndrome = True and Respiratory Syndrome = False

2. 2001-06-02: SCORE = -3.19604e-08 PVALUE = 0
   1.27971% (7/547) of today's cases have Age Decile = 10 and Home Latitude = Missing
   0.02% (2/10000) of baseline cases have Age Decile = 10 and Home Latitude = Missing

3. 2001-06-30: SCORE = -2.39821e-07 PVALUE = 0
   1.44% (9/625) of today's cases have Age Decile = 10
   0.09% (9/10000) of baseline cases have Age Decile = 10

4. 2001-08-08: SCORE = -1.21558e-08 PVALUE = 0
   83.7979% (481/574) of today's cases have Unknown Syndrome = False
   73.6926% (7370/10001) of baseline cases have Unknown Syndrome = False

5. 2001-10-10: SCORE = -1.42315e-06 PVALUE = 0
   0.994036% (5/503) of today's cases have Age Decile = 10 and Home Latitude = Missing
   0.009998% (1/10002) of baseline cases have Age Decile = 10 and Home Latitude = Missing

6. 2001-12-02: SCORE = -4.31806e-07 PVALUE = 0
   14.7059% (70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
   7.73077% (773/9999) of baseline cases have Viral Syndrome = True and Encephalitic Syndrome = False

7. 2001-12-09: SCORE = -3.31973e-10 PVALUE = 0
   8.57788% (38/443) of today's cases have Hospital ID = 1 and Viral Syndrome = True
   2.49% (249/10000) of baseline cases have Hospital ID = 1 and Viral Syndrome = True

Rules 2, 3 and 5 are likely due to clerical errors in the data since the rules find an increase in the number of people between the ages of 100 and 110. Furthermore, the home zip code for these patients appears to be missing in rules 2 and 5. Rule 4 is uninteresting since it indicates that the number of cases without an unknown symptom, which is typically around 73.7%, has experienced a slight increase. For rules 1, 6 and 7 we went back to the original ED data to inspect the text descriptions of the chief complaints for the cases related to these three rules. The symptoms related to Rules 1, 6 and 7 involve dizziness, fever and sore throat. Given that Rules 1, 6 and 7 have dates in winter, along with the symptoms mentioned, we speculate that this anomalous pattern is likely caused by an influenza strain.

We also include results from WSARE 2.0 running on the same data set. Unlike WSARE 3.0, WSARE 2.0 does not have a solution similar to the approach taken in Section 3.2 to deal with new hospitals coming online. However, by using a short enough baseline period, such as the standard baseline of 35, 42, 49, and 56 days prior to the current date, we can capture fairly recent trends and deal with a changing distribution as new hospitals submit data. The results are shown below. Note that we group together identical rules from consecutive days in order to save space.

1. 2001-01-31: SCORE = -8.0763e-07 PVALUE = 0
   21.2766% (110/517) of today's cases have Unknown Syndrome = True
   12.5884% (267/2121) of baseline cases have Unknown Syndrome = True

2. 2001-05-01: SCORE = -1.0124e-06 PVALUE = 0.001998
   18.4739% (92/498) of today's cases have Gender = Male and Home Latitude > 40.5
   10.2694% (202/1967) of baseline cases have Gender = Male and Home Latitude > 40.5

Rules 3-6 from 2001-10-28 to 2001-10-31 all have PVALUE = 0 and involve rules with Hospital ID = Missing

7. 2001-11-01: SCORE = -7.78767e-21 PVALUE = 0
   5.87084% (30/511) of today's cases have Hospital ID = Missing and Hemorrhagic Syndrome = True
   0% (0/1827) of baseline cases have Hospital ID = Missing and Hemorrhagic Syndrome = True

Rules 8-14 from 2001-11-02 to 2001-11-08 all have PVALUE = 0 and have the rule Hospital ID = Missing

Rules 15-37 from 2001-11-09 to 2001-12-02 all have PVALUE = 0 and have the rule Hospital ID = 14

Rules 38-59 from 2001-12-03 to 2001-12-24 all have PVALUE = 0 and have the rule Hospital ID = 50

60. 2001-12-25: SCORE = -2.99132e-09 PVALUE = 0
    53.1835% (284/534) of today's cases have Rash Syndrome = False and Unmapped Syndrome = False
    39.2165% (911/2323) of baseline cases have Rash Syndrome = False and Unmapped Syndrome = False

Rules 61-63 from 2001-12-26 to 2001-12-30 all have PVALUE = 0 and have the rule Hospital ID = 50

64. 2001-12-31: SCORE = -7.30783e-07 PVALUE = 0
    52.071% (352/676) of today's cases have Hemorrhagic Syndrome = True and Unmapped Syndrome = False
    41.6113% (1064/2557) of baseline cases have Hemorrhagic Syndrome = True and Unmapped Syndrome = False

From the output above, WSARE 2.0 produces a large number of rules that involve hospital IDs 14 and 50 because those two hospitals start providing data in 2002. These rules typically persist for about a month, at which point the new hospitals begin to appear in the baseline of WSARE 2.0. We speculate that the missing hospital IDs in rules 3-14 are due to hospital 14 coming online and a new hospital code not being available. The other rules produced by WSARE 2.0 are very different from those generated by WSARE 3.0. This difference is likely due to the fact that WSARE 3.0 considers the effects of the environmental attributes. The most interesting rules produced by WSARE 2.0 are rules 2 and 64. Rule 2 highlights the fact that more male patients with a home zip code in the northern half of the city appear in the EDs on 2001-05-01. Rule 64 indicates that an increase in the number of hemorrhagic syndromes has occurred. Both of these rules are unlikely to have been caused by environmental trends; they are simply anomalous patterns when compared against the baseline of WSARE 2.0. From our available resources, we are unable to determine if rules 2 and 64 are truly indicative of an outbreak.
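Each rule in the listings above is scored with a one-sided Fisher's exact test on the 2x2 table of (today vs. baseline) x (matches rule vs. does not), and the resulting p-values are screened with the Benjamini-Hochberg FDR procedure. The following is a minimal stdlib sketch of both pieces (an illustration, not the authors' implementation; the displayed SCORE values follow the paper's own sign convention, which we do not reproduce):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table
        [[a, b],    row 1: today's cases (match rule / do not)
         [c, d]]    row 2: baseline cases (match rule / do not)
    returning the probability, with all margins fixed, of seeing a or
    more rule-matching cases among today's cases (the upswing side)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # hypergeometric probability of exactly k matches among today's cases
        p += comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return p

def benjamini_hochberg(p_values, alpha=0.1):
    """Benjamini-Hochberg FDR: find the largest rank k such that the k-th
    smallest p-value is at most k * alpha / m, and reject the k smallest
    p-values. Returns the set of rejected indices."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    return {order[i] for i in range(k)}

# Rule 3 above: 9 of 625 recent cases vs. 9 of 10000 baseline cases with
# Age Decile = 10; the one-sided p-value comes out very small.
p_rule3 = fisher_one_sided(9, 616, 9, 9991)
```

With α_FDR = 0.1, `benjamini_hochberg` keeps only rules whose p-values survive the step-up threshold, matching the screening described in the text.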
4.3 Results from the Israel Center for Disease Control

The Israel Center for Disease Control evaluated WSARE 3.0 retrospectively using an unusual outbreak of influenza type B that occurred in an elementary school in central Israel (Kaufman et al., 2004). WSARE 3.0 was applied to patient visits to community clinics between the dates of May 24, 2004 and June 11, 2004. The attributes in this data set include the visit date, area code, ICD-9 code, age category, and day of week. The day of week was used as the only environmental attribute. WSARE 3.0 reported two rules with p-values at 0.002 and five other rules with p-values below 0.0001. Two of the five anomalous patterns with p-values below 0.0001 corresponded to the influenza outbreak in the data. The rules that characterized the two anomalous patterns consisted of the same three attributes of ICD-9 code, area code and age category, indicating that an anomalous pattern was found involving children aged 6-14 having viral symptoms within a specific geographic area. WSARE 3.0 detected the outbreak on the second day from its onset. The authors of (Kaufman et al., 2004) found the results from WSARE 3.0 promising and concluded that the algorithm was indeed able to detect an actual outbreak in syndromic surveillance data.

4.4 Summary of Results

Overall, WSARE 2.0 and 3.0 have been demonstrated to be more effective than univariate methods at finding anomalous patterns in multivariate, categorical data. The advantage that the WSARE algorithms have over univariate methods is their ability to identify the combination of attributes that characterize the most anomalous groups in the data rather than relying on a user to specify beforehand which combination of characteristics to monitor. WSARE 3.0 has a further advantage in its ability to account for temporal trends when producing the baseline distribution, while WSARE 2.0 can be thrown off by these temporal trends when it uses raw historical data for the baseline.

We would like to emphasize the fact that WSARE 3.0 is not necessarily the best version of WSARE in all cases. WSARE 3.0 needs a large amount of data in order to learn the structure and parameters of its Bayesian network reliably, particularly if there are many attributes in the data. If WSARE 3.0 is intended to model long term trends such as seasonal fluctuations, several years worth of historical data are needed. Large amounts of historical data are not available in many cases, such as when a syndromic surveillance system needs to be set up from scratch in a few months for a major event like the Olympic games. In these scenarios, WSARE 2.0 may have an advantage over WSARE 3.0. This disadvantage of WSARE 3.0 highlights the fact that the learned Bayesian network only stores the posterior mean in the conditional probability tables of each node. Future work on WSARE 3.0 will involve accounting for the variance of the network parameters in the p-value calculation, perhaps using the approaches proposed by van Allen (2000), van Allen et al. (2001), and Singh (2004).

Moreover, WSARE 3.0 assumes that the environmental attributes are the only source of variation in the baseline distribution. If other hidden variables cause a significant amount of noise in the baseline, then WSARE 3.0 will not be very effective. In this situation, a better approach might be to use WSARE 2.0 with a baseline of raw historical data from a very recent time period. Finally, we do not recommend using WSARE 2.5 because the algorithm is unable to make predictions for days in which the combination of environmental attributes does not exist in historical data. The Bayesian network used by WSARE 3.0 is able to handle such situations and WSARE 3.0 effectively supersedes WSARE 2.5.

5. Finding Anomalous Patterns in Real-Valued Data

The WSARE algorithm can only be used on categorical data sets. If the data is entirely real-valued, the attributes can certainly be discretized in a pre-processing step before WSARE operates on the data. Discretization, however, treats all data points in the same discretization bin identically; the distances between data points in the same bin are lost. If these distances are important, then a real-valued version of WSARE is needed. Fortunately, the spatial scan statistic (Kulldorff, 1997) can be considered as the real-valued analog of WSARE.

The spatial scan statistic works on a geographic area A in which there is an underlying population n, and within this population there is a count c of interest. The distribution of the counts c is assumed to follow either a Bernoulli model or a Poisson model. A window of variable size and shape then passes through the geographic area A. The crucial characteristic of this window is that the union of the areas covered by the window is the entire area A. Existing spatial scan statistic applications typically use window shapes of circles (Kulldorff, 1999) although ellipses (Kulldorff et al., 2002) and rectangles (Neill and Moore, 2004) have also been used.

In order to set up the scan statistic, we need to define p as the probability of being a "count" within the scanning window. Furthermore, let q be the probability of being a "count" outside of the scanning window. Under the null hypothesis, p = q, while the alternative hypothesis is p > q. The spatial scan statistic then consists of the maximum likelihood ratio between L(W), the likelihood of the counts in the scanning window area W, and L_0, the likelihood under the null hypothesis. Equation 1 illustrates the spatial scan statistic in its general form, using the term W for the zone covered by a scanning window and 𝒲 for the entire collection of zones:

    S = max_{W ∈ 𝒲} L(W) / L_0.    (1)

Since an analytical form for the distribution of the spatial scan statistic is not available, a Monte Carlo simulation is needed to obtain the significance of the hypothesis test. Typically 999 or 9999 replications of the data set are used for the simulation. In terms of computational complexity, the bottleneck for the algorithm is the Monte Carlo simulation.

The spatial scan statistic has been extended to three dimensions in the space-time scan statistic (Kulldorff, 1999, 2001). Instead of using a circular window over space, the scanning window is now a cylinder, with its circular base for the spatial dimension and its height over a time interval. Cylinders of varying heights and base radii are moved through space and time to find potential disease clusters.

Naive implementations of the spatial scan statistic and the space-time scan statistic are too computationally expensive for large data sets. Assuming that the circular windows are centered on an N x N grid and the dimensionality is D, the complexity is O(RN^(2D)) where R is the number of Monte Carlo simulations. Neill et al. (2005) have developed a fast spatial scan using overlap-kd trees that can reduce the complexity to O(R(N log N)^D) in the best case. The algorithms discussed so far find abnormally high density regions in data sets that are entirely real-valued. Efficiently finding anomalous patterns in a data set with a mixture of categorical and real-valued attributes remains an open problem.

6. Related Work

The task of detecting anomalous events in data is most commonly associated with monitoring systems. As a result, related work can be found in the domains of computer security, fraud detection, Topic Detection and Tracking (TDT) and fMRI analysis.
In computer security, anomaly detection has been most prominent in intrusion detection systems, which identify intrusions by distinguishing between normal system behavior and behavior when security has been compromised (Lane and Brodley, 1999; Warrender et al., 1999; Eskin, 2000; Lee et al., 2000; Maxion and Tan, 2002; Kruegel and Vigna, 2003). In other related security work, Cabuk et al. (2004) describe methods to detect IP covert timing channels, which surreptitiously use the arrival pattern of packets to send information. As in computer security, automated fraud detection systems differentiate between normal and unusual activity on a variety of data such as cellular phone calls (Fawcett and Provost, 1997) and automobile insurance fraud (Phua et al., 2004). TDT is the task of identifying the earliest report of a previously unseen news story from a sequence of news stories. Clustering approaches are typically used in TDT (Yang et al., 1998; Zhang et al., 2005). Finally, anomalous event detection has also been used in fMRI analysis to identify regions of increased brain activity corresponding to given cognitive tasks (Neill et al., 2005).

In general, WSARE can be applied to data from these different domains as long as the data and the anomalous events satisfy several criteria. WSARE is intended to operate on categorical, case-level records in which the presence of a record can be considered an event. For instance, in ED data, an event is defined as the appearance of a person at the ED since it provides a signal of the community health and we are interested in the characteristics of that person. Secondly, WSARE only finds differences between the recent data and the baseline data. If we consider the baseline
data to be a "class", then WSARE looks for deviations from a single class. Some domains, such as TDT, require comparisons between several classes. For instance, the current news story needs to be compared against several categories of news stories. Thirdly, as was discussed in Section 2.6, WSARE's running time depends on the number of attributes and the number of values each attribute can take. If the number of attributes and the number of values for each attribute are too high, WSARE may not finish in a reasonable amount of time. Some domains require the running time of the detection algorithm to be a few seconds or less in order for the entire detection system to be effective. In these situations, using WSARE is not appropriate. On the other hand, for domains such as biosurveillance, the running time of WSARE is acceptable since it takes approximately a minute to a few minutes to complete on real ED data sets. Finally, WSARE treats each record in the data independently of the other records. If a sequence of records is highly indicative of, for instance, a security breach in a network, WSARE will not be able to detect this pattern.

Other related work can also be found in the area of stream mining. In stream mining, the focus is on the online processing of large amounts of data as it arrives. Many algorithms have been developed to detect anomalies in the current stream of data. Ma and Perkins (2003) develop a novelty detection algorithm based on online support vector regression. Anomalies can also be characterized by an abnormal burst of data. The technique described by Zhu and Shasha (2003) simultaneously monitors windows of different sizes and reports those that have an abnormal aggregation of data. A density estimation approach is used by Aggarwal (2003) to help visualize both spatial and temporal trends in evolving data streams. Finally, Hulten et al. (2001) present an efficient algorithm for mining decision trees from continuously changing data streams. While this work is primarily concerned with maintaining an up-to-date concept, detecting concept drift is similar to detecting changes in a data stream. WSARE 3.0 cannot be directly applied to stream mining because the amount of historical data needed to create the baseline distribution is typically not accessible in a stream mining context. However, WSARE 2.0 could possibly be modified for a stream mining application.

In the following paragraphs, we will briefly review methods that have been used for the detection of disease outbreaks. Readers interested in a detailed survey of biosurveillance methods can consult (Wong, 2004) and (Moore et al., 2003). The majority of detection algorithms in biosurveillance operate on univariate time series data. Many of these univariate algorithms have been taken from the field of Statistical Quality Control and directly applied to biosurveillance. The three most common techniques from Statistical Quality Control include the Shewhart control chart (Montgomery, 2001), CUSUM (Page, 1954; Hutwagner et al., 2003), and EWMA (Roberts, 1959; Williamson and Hudson, 1999). Although these three algorithms are simple to implement, they have difficulty dealing with temporal trends. Univariate algorithms based on regression and time series models, on the other hand, are able to model explicitly the seasonal and day of week effects in the data. The Serfling method (Serfling, 1963) uses sinusoidal components in its regression equation to model the seasonal fluctuations for influenza. A Poisson regression model that included a day of week term as a covariate was demonstrated to be a fairly capable detector in (Buckeridge et al., 2005).
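To make the contrast with WSARE concrete, the simplest of these univariate techniques, a one-sided CUSUM chart, can be sketched in a few lines (a generic textbook version, not tied to any of the cited implementations; the allowance k and decision interval h are illustrative):

```python
def cusum(series, mean, k=0.5, h=4.0):
    """One-sided CUSUM chart: accumulate excesses of the series over the
    in-control mean (minus the allowance k) and signal whenever the
    cumulative sum crosses the decision interval h."""
    s, alarms = 0.0, []
    for t, x in enumerate(series):
        s = max(0.0, s + (x - mean) - k)
        if s > h:
            alarms.append(t)
            s = 0.0          # restart the chart after each signal
    return alarms

# Twenty in-control days followed by a small sustained shift upward.
alarms = cusum([10] * 20 + [13] * 5, mean=10)
```

Such a chart watches one count at a time; it has no way to discover which combination of attributes is driving a shift, which is the gap the WSARE rule search fills.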
As for time series models, the ARIMA and SARIMA models (Choi and Thacker, 1981; Watier et al., 1991; Reis and Mandl, 2003) are commonly used in biosurveillance to deal with temporal trends. Recently, wavelets (Goldenberg et al., 2002; Zhang et al., 2003) have been used as a preprocessing step to handle temporal fluctuations, including unusually low values due to holidays.

The most common algorithm used in biosurveillance of spatial data is the Spatial Scan Statistic (Kulldorff, 1997), which has already been discussed. The Spatial Scan Statistic has been generalized to include a time dimension (Kulldorff, 2001) such that the algorithm searches for cylinders in spatio-temporal data. Recent work has improved the speed of the Spatial Scan method using an overlap-kd tree structure (Neill and Moore, 2004; Neill et al., 2005).

The algorithms mentioned thus far have only looked at either univariate or spatial data. Only a few multivariate biosurveillance algorithms that consider spatial, temporal, demographic, and symptomatic attributes for individual patient cases currently exist. BCD (Buckeridge et al., 2005) is a multivariate changepoint detection algorithm that monitors in a frequentist manner whether a Bayesian network learned from past data (during a "safe" training period) appears to have a distribution that differs from the distribution of more recent data. If so, then an anomaly may have occurred. The Bayesian Aerosol Release Detector (BARD) (Hogan et al., 2004) is an algorithm specifically designed to detect an outbreak of inhalational anthrax due to atmospheric dispersion of anthrax spores. BARD combines information from ED visits, recent meteorological data, and spatial and population information about the region being monitored in order to determine if an anthrax attack has occurred. Finally, PANDA (Cooper et al., 2004) is a population-based anomaly detection algorithm that uses a massive causal Bayesian network to model each individual in the region under surveillance. By modeling at the individual level, PANDA is able to coherently represent different types of background knowledge in its model. For example, spatio-temporal assumptions about a disease outbreak can be incorporated as prior knowledge. In addition, the characteristics of each individual, such as their age, gender, home zip, symptom information and admission date to the ED can be used to derive a posterior probability of an outbreak.

There are two algorithms that are similar to the approach taken by WSARE.
Con trast set mining (Bay and Pazzani, 1999) finds rules that distinguish between two or more g roups using a pruning away rules whose counts algorithm to reduce the exponential search space. This optimization prunes g to WSARE. are too small to yield a valid Chi Square test. Many of these rules are interestin Multiple hypothesis testing problems are addressed in contrast set mining thro ugh a Bonferroni correction. In health care, Brossette et al. use association rules for ho spital infection control and public health surveillance (Brossette et al., 1998). Their work is similar to WSA RE 2.0 (Wong et al., 2002), with the main difference being the additional steps of the rando mization test and FDR in WSARE. 7. Conclusions WSARE approaches the problem of early outbreak detection on multivariate surveillance data using two key components. The first component is association rule search, whic h is used to find anomalous patterns between a recent data set and a baseline data set. The contributio n of this rule search is best seen by considering the alternate approach of monitoring a univariate sign al. If an attribute or combination of attributes is known to be an effective signal for the presenc e of a certain disease, then a univariate detector or a suite of univariate detectors that monitors this signal will be an effective early warning detector for that specific disease. However, if such a signal is not known beforehand, then the association rule search will determine which attributes are of interest. We intend WSARE to be a general purpose safety net to be used in combination w ith a suite of specific disease detectors. Thus, the key to this safety net is to perform non-spe cific disease detection and notice any unexpected patterns. With this perspective in mind, the fundamental assumption to our association rule approach is that an outbreak in its early stages will manifest itself in categorical surveillan ce data as an anoma- lous cluster in attribute space. 
For instance, a localized gastrointestinal outb reak originating at a 1992

33 W HAT S S TRANGE A BOUT R ECENT E VENTS ’ cases involving peo- popular restaurant in zipcode X would likely cause an upswing in diarrhea ple with home zipcode X. These cases would appear as a cluster in the catego rical attributes of X and Symptom = Diarrhea . The rule search allows us to find the combina- Home Zi p Code = tion of attributes that characterize the set of cases from recent data that are most anomalous when troduces the problem of compared to the baseline data. The nature of the rule search, however, in scoring rule multiple hypothesis testing to the algorithm. Even with purely random data, the best te the statistical sig- may seem like a truly significant anomalous pattern. We are careful to evalua hypothesis is the nificance of the best scoring rule using a randomization test in which the null independence of date and case attributes. The second major component of WSARE is the use of a Bayesian network to mo del a base- line that changes due to temporal fluctuations such as seasonal trends an d weekend versus weekday onse attributes. Envi- effects. In WSARE 3.0, attributes are divided into environmental and resp h are responsible for the ronmental attributes, such as season and day of week, are attributes whic tes. When the Bayesian temporal trends while response attributes are the non-environmental attribu network structure is learned, the environmental attributes are not permitted to have parents because we are not interested in predicting their distributions. Instead, we want to de termine how the envi- ronmental attributes affect the distributions of the response attributes. WSA RE 3.0 operates on an assumption that the environmental attributes account for the majority of the var iation in the data. Under this assumption, the ratios compared in the rule search should remain re asonably stable over the current day historical time periods with similar environmental attribute values. 
As an example, if the current day is a winter Friday and we use season and day of week as environmental attributes, then the fraction of male senior citizens, for instance, showing up at an ED to the total number of patients should remain roughly stable over all winter Fridays in the historical period over which the Bayesian network is learned. Once the Bayesian network structure is learned, it represents the joint probability distribution of the baseline. We can then condition on the environmental attributes to produce the baseline given the environment for the current day.

Multivariate surveillance data with known outbreak periods is extremely difficult to obtain. As a result, we resorted to evaluating WSARE on simulated data. Although the simulators do not reflect real life, detecting an outbreak in our simulated data sets is a challenging problem for any detection algorithm. We evaluated WSARE on the CityBN simulator, which was implemented to generate surveillance data containing temporal fluctuations due to day of week effects and seasonal variations of background illnesses such as flu, food poisoning and allergies. Despite the fact that the environmental attributes used by WSARE 3.0 did not account for all of the variation in the data, WSARE 3.0 detected the anthrax outbreaks with nearly the optimal detection time and a very low false positive rate. We show that WSARE 3.0 outperformed three common univariate detection algorithms in terms of false positives per month and detection time. WSARE 3.0 also produced a better AMOC curve than WSARE 2.0 because the latter was thrown off by the temporal trends in the data. Finally, the Bayesian network provided some smoothing to the baseline distribution, which enhanced WSARE 3.0's detection capability as compared to that of WSARE 2.5.

WSARE has been demonstrated to outperform traditional univariate methods on simulated data in terms of false positives per month and detection time.
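The conditioning step described above can be sketched as follows. The network and its conditional probability table are toy stand-ins, not a learned WSARE 3.0 network (which would have many response attributes and a structure learned from data): the environmental attributes season and day_of_week are root nodes, a single response attribute patient_group has both as parents, and the baseline for the current day is obtained by fixing the environmental values and forward-sampling the response:

```python
import random

# Toy stand-in for a learned structure: environmental attributes
# (season, day_of_week) are roots with no parents; the single response
# attribute (patient_group) has both as parents. CPT values are invented.
CPT_PATIENT_GROUP = {
    ("winter", "fri"): {"male_senior": 0.30, "other": 0.70},
    ("winter", "mon"): {"male_senior": 0.25, "other": 0.75},
    ("summer", "fri"): {"male_senior": 0.15, "other": 0.85},
    ("summer", "mon"): {"male_senior": 0.10, "other": 0.90},
}

def sample_baseline(season, day_of_week, n, seed=0):
    """Condition on the current day's environmental attributes and
    forward-sample the response attribute n times, yielding a baseline
    data set for that environment (e.g. a winter Friday)."""
    rng = random.Random(seed)
    dist = CPT_PATIENT_GROUP[(season, day_of_week)]
    values, weights = zip(*dist.items())
    return [rng.choices(values, weights=weights)[0] for _ in range(n)]
```

Because the environmental attributes have no parents, conditioning on them reduces to selecting the matching rows of the response CPTs, so a baseline for any environment can be generated by simple forward sampling without general-purpose inference.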
Its performance on real world data requires further evaluation. Currently, WSARE is part of the collection of biosurveillance algorithms in the RODS system (Real-time Outbreak Detection System, 2004). WSARE 2.0 was deployed to monitor ED cases in western Pennsylvania and Utah. It was also used during the 2002 Salt Lake City winter Olympics. WSARE 3.0 is currently being used as a tool for analysis of public health data by several American state health departments and by the Israel Center for Disease Control.

Acknowledgments

This research was supported by DARPA grant F30602-01-2-0550, the State of Pennsylvania, and by the National Science Foundation grant IIS-0325581. Many thanks to Rich Tsui, Bob Olszewski, Jeremy Espino, Jeff Schneider and Robin Sabhnani for their helpful comments and technical expertise.

References

45 CFR Parts 160 through 164, April 2003. Available at http://www.hhs.gov/ocr/combinedregtext.pdf.

Charu C. Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 575–586, New York, NY, 2003. ACM Press.

Stephen D. Bay and Michael J. Pazzani. Detecting change in categorical data: Mining contrast sets. In Knowledge Discovery and Data Mining, pages 302–306, 1999.

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300, 1995.

Chris M. Bishop. Novelty detection and neural network validation. IEE Proceedings - Vision, Image and Signal Processing, 141(4):217–222, August 1994.

Carlo E. Bonferroni. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3–62, 1936.

George E. P. Box and Gwilym M. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, 1976.

Stephen E. Brossette, Alan P. Sprague, J. Michael Hardin, Ken B. Waites, Warren T. Jones, and Stephen A. Moser. Association rules and data mining in hospital infection control and public health surveillance. Journal of the American Medical Informatics Association, 5:373–381, 1998.

David L. Buckeridge, Howard Burkom, Murray Campbell, William R. Hogan, and Andrew W. Moore. Algorithms for rapid outbreak detection: a research synthesis. Journal of Biomedical Informatics, 38(2):99–113, 2005.

Serdar Cabuk, Carla E. Brodley, and Clay Shields. IP covert timing channels: design and detection. In Proceedings of the 11th ACM Conference on Computer and Communications Security, pages 178–187, New York, NY, 2004. ACM Press.

Keewhan Choi and Stephen B. Thacker. An evaluation of influenza mortality surveillance, 1962–1979 I. Time series forecasts of expected pneumonia and influenza deaths. American Journal of Epidemiology, 113(3):215–226, 1981.

Gregory F. Cooper, Denver H. Dash, John D. Levander, Weng-Keen Wong, William R. Hogan, and Michael M. Wagner. Bayesian biosurveillance of disease outbreaks. In Max Chickering and Joseph Halpern, editors, Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, pages 94–103, Banff, Alberta, Canada, 2004. AUAI Press.

Eleazar Eskin. Anomaly detection over noisy data using learned probability distributions. In Proceedings of the 2000 International Conference on Machine Learning (ICML-2000), Palo Alto, CA, July 2000.

Tom Fawcett and Foster Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316, 1997.

Tom Fawcett and Foster Provost. Activity monitoring: Noticing interesting changes in behavior. In Chaudhuri and Madigan, editors, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, San Diego, CA, 1999. URL citeseer.nj.nec.com/fawcett99activity.html.

Anna Goldenberg, Galit Shmueli, and Rich Caruana. Using grocery sales data for the detection of bio-terrorist attacks. Submitted to Statistics in Medicine, 2003.

Anna Goldenberg, Galit Shmueli, Richard A. Caruana, and Stephen E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences, 99(8):5237–5240, April 2002. http://www.pnas.org/cgi/doi/10.1073/pnas.042117499.

Phillip Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York, 2nd edition, 2000.

Greg Hamerly and Charles Elkan. Bayesian approaches to failure prediction for disk drives. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 202–209. Morgan Kaufmann, San Francisco, CA, 2001.

William R. Hogan, Gregory F. Cooper, Garrick L. Wallstrom, and Michael M. Wagner. The Bayesian aerosol release detector. In Proceedings of the Third National Syndromic Surveillance Conference [CD-ROM], Boston, MA, 2004. Fleetwood Multimedia, Inc.

Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106, San Francisco, CA, 2001. ACM Press. URL citeseer.ist.psu.edu/hulten01mining.html.

Lori Hutwagner, William Thompson, G. Matthew Seeman, and Tracee Treadwell. The bioterrorism preparedness and response early aberration reporting system (EARS). Journal of Urban Health, 80(2):i89–i96, 2003.

Zalman Kaufman, Erica Cohen, Tamar Peled-Leviatan, Hanna Lavi, Gali Aharonowitz, Rita Dichtiar, Michal Bromberg, Ofra Havkin, Yael Shalev, Rachel Marom, Varda Shalev, Joshua Shemer, and Manfred S. Green. Evaluation of syndromic surveillance for early detection of bioterrorism using a localized, summer outbreak of Influenza B. In Proceedings of the Third National Syndromic Surveillance Conference [CD-ROM], Boston, MA, 2004. Fleetwood Multimedia, Inc. Poster Presentation.

Christopher Kruegel and Giovanni Vigna. Anomaly detection of web-based attacks. In Proceedings of the 10th ACM Conference on Computer and Communication Security (CCS '03), pages 251–261, Washington, DC, October 2003. ACM Press.

Martin Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6):1481–1496, 1997.

Martin Kulldorff. Spatial scan statistics: models, calculations, and applications. In J. Glaz and N. Balakrishnan, editors, Scan Statistics and Applications, pages 303–322. Birkhauser, Boston, MA, 1999.

Martin Kulldorff. Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society, Series A, 164:61–72, 2001.

Martin Kulldorff, Lan Huang, and Linda Pickle. An elliptic spatial scan statistic and its application to breast cancer mortality data in the northeastern United States. In Proceedings of the National Syndromic Surveillance Conference, 2002.

Terran Lane and Carla E. Brodley. Temporal sequence learning and data reduction for anomaly detection. ACM Transactions on Information and System Security, 2:295–331, 1999.

Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok. Adaptive intrusion detection: A data mining approach. Artificial Intelligence Review, 14(6):533–567, 2000. URL citeseer.ist.psu.edu/lee00adaptive.html.

Junshui Ma and Simon Perkins. Online novelty detection on temporal sequences. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 613–618, New York, NY, 2003. ACM Press.

Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1-5):193–225, 1997.

Roy A. Maxion and Kymie M. C. Tan. Anomaly detection in embedded systems. IEEE Transactions on Computers, 51(2):108–120, 2002. ISSN 0018-9340.

Christopher J. Miller, Christopher Genovese, Robert C. Nichol, Larry Wasserman, Andrew Connolly, Daniel Reichart, Andrew Hopkins, Jeff Schneider, and Andrew Moore. Controlling the false-discovery rate in astrophysical data analysis. The Astronomical Journal, 122:3492–3505, December 2001.

Douglas C. Montgomery. Introduction to Statistical Quality Control. John Wiley and Sons, Inc., 4th edition, 2001.

Andrew Moore and Mary Soon Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:67–91, March 1998.

Andrew Moore and Weng-Keen Wong. Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In T. Fawcett and N. Mishra, editors, Proceedings of the 20th International Conference on Machine Learning, pages 552–559, Menlo Park, CA, August 2003. AAAI Press.

Andrew W. Moore, Gregory F. Cooper, Rich Tsui, and Michael M. Wagner. Summary of biosurveillance-related technologies. Technical report, Realtime Outbreak and Disease Surveillance Laboratory, University of Pittsburgh, 2003. Available at http://www.autonlab.org/autonweb/showPaper.jsp?ID=moore-biosurv.

Farzad Mostashari and Jessica Hartman. Syndromic surveillance: a local perspective. Journal of Urban Health, 80(2):i1–i7, 2003.

Daniel B. Neill and Andrew W. Moore. A fast multi-resolution method for detection of significant spatial disease clusters. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

Daniel B. Neill, Andrew W. Moore, Francisco Pereira, and Tom Mitchell. Detecting significant multidimensional spatial clusters. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.

E. S. Page. Continuous inspection schemes. Biometrika, 41(1):100–115, 1954.

Clifton Phua, Damminda Alahakoon, and Vincent Lee. Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explorations Newsletter, Special Issue on Learning from Imbalanced Datasets, 6(1):50–59, 2004.

Real-time Outbreak Detection System, 2004. Online at http://www.health.pitt.edu/rods/default.htm.

Ben Y. Reis and Kenneth D. Mandl. Time series modeling for syndromic surveillance. BMC Medical Informatics and Decision Making, 3(2), 2003. http://www.biomedcentral.com/1472-6947/3/2.

S. W. Roberts. Control chart tests based on geometric moving averages. Technometrics, 1:239–250, 1959.

Robert E. Serfling. Methods for current statistical analysis of excess pneumonia-influenza deaths. Public Health Reports, 78:494–506, 1963.

Ajit P. Singh. What to do when you don't have much data: Issues in small sample parameter learning in Bayesian networks. Master's thesis, Dept. of Computing Science, University of Alberta, 2004.

Daniel M. Sosin. Draft framework for evaluating syndromic surveillance systems. Journal of Urban Health, 80(2):i8–i13, 2003.

Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, and Chung-Chou Ho Chang. Value of ICD-9-coded chief complaints for detection of epidemics. In S. Bakken, editor, Journal of the American Medical Informatics Association, Supplement Issue on the Proceedings of the Annual Fall Symposium of the American Medical Informatics Association, pages 711–715. Hanley and Belfus, Inc., 2001.

Tim van Allen. Handling uncertainty when you're handling uncertainty: Model selection and error bars for belief networks. Master's thesis, Dept. of Computing Science, University of Alberta, 2000.

Tim van Allen, Russell Greiner, and Peter Hooper. Bayesian error-bars for belief net inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, August 2001.

Christina Warrender, Stephanie Forrest, and Barak Pearlmutter. Detecting intrusions using system calls: Alternative data models. In Proceedings of the 1999 IEEE Symposium on Security and Privacy, pages 133–145. IEEE Computer Society, 1999.

Laurence Watier, Sylvia Richardson, and Bruno Hubert. A time series construction of an alert threshold with application to S. bovismorbificans in France. Statistics in Medicine, 10:1493–1509, 1991.

G. David Williamson and Ginner Weatherby Hudson. A monitoring system for detecting aberrations in public health surveillance reports. Statistics in Medicine, 18:3283–3298, 1999.

Weng-Keen Wong. Data Mining for Early Disease Outbreak Detection. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 2004.

Weng-Keen Wong, Andrew Moore, Gregory Cooper, and Michael Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In Tom Fawcett and Nina Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

Weng-Keen Wong, Andrew W. Moore, Gregory Cooper, and Michael Wagner. Rule-based anomaly pattern detection for detecting disease outbreaks. In Proceedings of the 18th National Conference on Artificial Intelligence, pages 217–223. MIT Press, 2002.
Yiming Yang, Tom Pierce, and Jaime Carbonell. A study on retrospective and on-line event detection. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 28–36, New York, NY, 1998. ACM Press.

Jian Zhang, Zoubin Ghahramani, and Yiming Yang. A probabilistic model for online document clustering with application to novelty detection. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.

Jun Zhang, Fu-Chiang Tsui, Michael M. Wagner, and William R. Hogan. Detection of outbreaks from time series data using wavelet transform. In Proc AMIA Fall Symp, pages 748–752. Omni Press CS, 2003.

Yunyue Zhu and Dennis Shasha. Efficient elastic burst detection in data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 336–345, New York, NY, 2003. ACM Press.