“Why Should I Trust You?” Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
University of Washington, Seattle, WA 98105, USA
[email protected], [email protected], [email protected]

KDD 2016, San Francisco, CA, USA. DOI: http://dx.doi.org/10.1145/2939672.2939778

ABSTRACT

Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one.

In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.

1. INTRODUCTION

Machine learning is at the core of many recent advances in science and technology. Unfortunately, the important role of humans is an oft-overlooked aspect in the field. Whether humans are directly using machine learning classifiers as tools, or are deploying models within other products, a vital concern remains: if the users do not trust a model or a prediction, they will not use it. It is important to differentiate between two different (but related) definitions of trust: (1) trusting a prediction, i.e. whether a user trusts an individual prediction sufficiently to take some action based on it, and (2) trusting a model, i.e. whether the user trusts a model to behave in reasonable ways if deployed. Both are directly impacted by how much the human understands a model's behaviour, as opposed to seeing it as a black box.

Determining trust in individual predictions is an important problem when the model is used for decision making. When using machine learning for medical diagnosis [6] or terrorism detection, for example, predictions cannot be acted upon on blind faith, as the consequences may be catastrophic.

Apart from trusting individual predictions, there is also a need to evaluate the model as a whole before deploying it "in the wild". To make this decision, users need to be confident that the model will perform well on real-world data, according to the metrics of interest. Currently, models are evaluated using accuracy metrics on an available validation dataset. However, real-world data is often significantly different, and further, the evaluation metric may not be indicative of the product's goal. Inspecting individual predictions and their explanations is a worthwhile solution, in addition to such metrics. In this case, it is important to aid users by suggesting which instances to inspect, especially for large datasets.

In this paper, we propose providing explanations for individual predictions as a solution to the "trusting a prediction" problem, and selecting multiple such predictions (and explanations) as a solution to the "trusting the model" problem. Our main contributions are summarized as follows.

• LIME, an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.

• SP-LIME, a method that selects a set of representative instances with explanations to address the "trusting the model" problem, via submodular optimization.

• Comprehensive evaluation with simulated and human subjects, where we measure the impact of explanations on trust and associated tasks. In our experiments, non-experts using LIME are able to pick which classifier from a pair generalizes better in the real world. Further, they are able to greatly improve an untrustworthy classifier trained on 20 newsgroups, by doing feature engineering using LIME. We also show how understanding the predictions of a neural network on images helps practitioners know when and why they should not trust a model.

2. THE CASE FOR EXPLANATIONS

By "explaining a prediction", we mean presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance's components (e.g. words in text, patches in an image) and the model's prediction.

Figure 1: Explaining individual predictions. A model predicts that a patient has the flu, and LIME highlights the symptoms in the patient's history that led to the prediction. Sneeze and headache are portrayed as contributing to the "flu" prediction, while "no fatigue" is evidence against it. With these, a doctor can make an informed decision about whether to trust the model's prediction. (Diagram not reproduced: data and prediction, with features such as sneeze, weight, headache, no fatigue, age, flow into the Explainer (LIME), which produces an explanation on which the human bases a decision.)

We argue that explaining predictions is an important aspect in getting humans to trust and use machine learning effectively, if the explanations are faithful and intelligible.

The process of explaining individual predictions is illustrated in Figure 1. It is clear that a doctor is much better positioned to make a decision with the help of a model if intelligible explanations are provided. In this case, an explanation is a small list of symptoms with relative weights – symptoms that either contribute to the prediction (in green) or are evidence against it (in red). Humans usually have prior knowledge about the application domain, which they can use to accept (trust) or reject a prediction if they understand the reasoning behind it. It has been observed, for example, that providing explanations can increase the acceptance of movie recommendations [12] and other automated systems [8].

Every machine learning application also requires a certain measure of overall trust in the model. Development and evaluation of a classification model often consists of collecting annotated data, of which a held-out subset is used for automated evaluation. Although this is a useful pipeline for many applications, evaluation on validation data may not correspond to performance "in the wild", as practitioners often overestimate the accuracy of their models [21], and thus trust cannot rely solely on it. Looking at examples offers an alternative method to assess truth in the model, especially if the examples are explained. We thus propose explaining several representative individual predictions of a model as a way to provide a global understanding.

Figure 2: Explaining individual predictions of competing classifiers trying to determine if a document is about "Christianity" or "Atheism". The bar chart represents the importance given to the most relevant words, also highlighted in the text. Color indicates which class the word contributes to (green for "Christianity", magenta for "Atheism").

There are several ways a model or its evaluation can go wrong. Data leakage, for example, defined as the unintentional leakage of signal into the training (and validation) data that would not appear when deployed [14], potentially increases accuracy. A challenging example cited by Kaufman et al. [14] is one where the patient ID was found to be heavily correlated with the target class in the training and validation data. This issue would be incredibly challenging to identify just by observing the predictions and the raw data, but much easier if explanations such as the one in Figure 1 are provided, as patient ID would be listed as an explanation for predictions. Another particularly hard to detect problem is dataset shift [5], where training data is different than test data (we give an example in the famous 20 newsgroups dataset later on). The insights given by explanations are particularly helpful in identifying what must be done to convert an untrustworthy model into a trustworthy one – for example, removing leaked data or changing the training data to avoid dataset shift.

Machine learning practitioners often have to select a model from a number of alternatives, requiring them to assess the relative trust between two or more models. In Figure 2, we show how individual prediction explanations can be used to select between models, in conjunction with accuracy. In this case, the algorithm with higher accuracy on the validation set is actually much worse, a fact that is easy to see when explanations are provided (again, due to human prior knowledge), but hard otherwise. Further, there is frequently a mismatch between the metrics that we can compute and optimize (e.g. accuracy) and the actual metrics of interest, such as user engagement and retention. While we may not be able to measure such metrics, we have knowledge about how certain model behaviors can influence them. Therefore, a practitioner may wish to choose a less accurate model for content recommendation that does not place high importance on features related to "clickbait" articles (which may hurt user retention), even if exploiting such features increases the accuracy of the model in cross validation. We note that explanations are particularly useful in these (and other) scenarios if a method can produce them for any model, so that a variety of models can be compared.

Desired Characteristics for Explainers

We now outline a number of desired characteristics from explanation methods.

An essential criterion for explanations is that they must be interpretable, i.e., provide qualitative understanding between the input variables and the response. We note that interpretability must take into account the user's limitations. Thus, a linear model [24], a gradient vector [2] or an additive model [6] may or may not be interpretable. For example, if hundreds or thousands of features significantly contribute to a prediction, it is not reasonable to expect any user to comprehend why the prediction was made, even if individual weights can be inspected.

This requirement further implies that explanations should be easy to understand, which is not necessarily true of the features used by the model, and thus the "input variables" in the explanations may need to be different than the features. Finally, we note that the notion of interpretability also depends on the target audience. Machine learning practitioners may be able to interpret small Bayesian networks, but laymen may be more comfortable with a small number of weighted features as an explanation.

Another essential criterion is local fidelity. Although it is often impossible for an explanation to be completely faithful unless it is the complete description of the model itself, for an explanation to be meaningful it must at least be locally faithful, i.e. it must correspond to how the model behaves in the vicinity of the instance being predicted. We note that local fidelity does not imply global fidelity: features that are globally important may not be important in the local context, and vice versa. While global fidelity would imply local fidelity, identifying globally faithful explanations that are interpretable remains a challenge for complex models.

While there are models that are inherently interpretable [6, 17, 26, 27], an explainer should be able to explain any model, and thus be model-agnostic (i.e. treat the original model as a black box). Apart from the fact that many state-of-the-art classifiers are not currently interpretable, this also provides flexibility to explain future classifiers.

In addition to explaining predictions, providing a global perspective is important to ascertain trust in the model. As mentioned before, accuracy may often not be a suitable metric to evaluate the model, and thus we want to explain the model. Building upon the explanations for individual predictions, we select a few explanations to present to the user, such that they are representative of the model.

3. LOCAL INTERPRETABLE MODEL-AGNOSTIC EXPLANATIONS

We now present Local Interpretable Model-agnostic Explanations (LIME). The overall goal of LIME is to identify an interpretable model over the interpretable representation that is locally faithful to the classifier.

3.1 Interpretable Data Representations

Before we present the explanation system, it is important to distinguish between features and interpretable data representations. As mentioned before, interpretable explanations need to use a representation that is understandable to humans, regardless of the actual features used by the model. For example, a possible interpretable representation for text classification is a binary vector indicating the presence or absence of a word, even though the classifier may use more complex (and incomprehensible) features such as word embeddings. Likewise for image classification, an interpretable representation may be a binary vector indicating the "presence" or "absence" of a contiguous patch of similar pixels (a super-pixel), while the classifier may represent the image as a tensor with three color channels per pixel. We denote x ∈ R^d the original representation of an instance being explained, and we use x′ ∈ {0, 1}^d′ to denote a binary vector for its interpretable representation.

3.2 Fidelity-Interpretability Trade-off

Formally, we define an explanation as a model g ∈ G, where G is a class of potentially interpretable models, such as linear models, decision trees, or falling rule lists [27], i.e. a model g ∈ G can be readily presented to the user with visual or textual artifacts. The domain of g is {0, 1}^d′, i.e. g acts over absence/presence of the interpretable components. As not every g ∈ G may be simple enough to be interpretable, we let Ω(g) be a measure of complexity (as opposed to interpretability) of the explanation g ∈ G. For example, for decision trees Ω(g) may be the depth of the tree, while for linear models, Ω(g) may be the number of non-zero weights.

Let the model being explained be denoted f : R^d → R. In classification, f(x) is the probability (or a binary indicator) that x belongs to a certain class. (For multiple classes, we explain each class separately, thus f(x) is the prediction of the relevant class.) We further use π_x(z) as a proximity measure between an instance z to x, so as to define locality around x. Finally, let L(f, g, π_x) be a measure of how unfaithful g is in approximating f in the locality defined by π_x. In order to ensure both interpretability and local fidelity, we must minimize L(f, g, π_x) while having Ω(g) be low enough to be interpretable by humans. The explanation produced by LIME is obtained by the following:

    ξ(x) = argmin_{g ∈ G} L(f, g, π_x) + Ω(g)        (1)

This formulation can be used with different explanation families G, fidelity functions L, and complexity measures Ω. Here we focus on sparse linear models as explanations, and on performing the search using perturbations.

3.3 Sampling for Local Exploration

We want to minimize the locality-aware loss L(f, g, π_x) without making any assumptions about f, since we want the explainer to be model-agnostic. Thus, in order to learn the local behavior of f as the interpretable inputs vary, we approximate L(f, g, π_x) by drawing samples, weighted by π_x. We sample instances around x′ by drawing nonzero elements of x′ uniformly at random (where the number of such draws is also uniformly sampled). Given a perturbed sample z′ ∈ {0, 1}^d′ (which contains a fraction of the nonzero elements of x′), we recover the sample in the original representation z ∈ R^d and obtain f(z), which is used as a label for the explanation model. Given this dataset Z of perturbed samples with the associated labels, we optimize Eq. (1) to get an explanation ξ(x). The primary intuition behind LIME is presented in Figure 3, where we sample instances both in the vicinity of x (which have a high weight due to π_x) and far away from x (low weight from π_x). Even though the original model may be too complex to explain globally, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by π_x. It is worth noting that our method is fairly robust to sampling noise since the samples are weighted by π_x in Eq. (1). We now present a concrete instance of this general framework.
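For concreteness, the following is a minimal Python sketch of this sampling step for a bag-of-words text classifier. It is not the authors' reference implementation: the function name, the classifier_fn argument (assumed to return class probabilities), and the kernel width sigma are illustrative assumptions, and the kernel itself anticipates the exponential kernel defined in the next subsection; NumPy and scikit-learn are assumed to be available.

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def sample_locally(x_row, classifier_fn, num_samples=5000, sigma=0.25):
    """Draw perturbed samples z' in {0,1}^d' around x', recover them in the
    original bag-of-words space, label them with the black-box classifier f,
    and weight them with an exponential kernel pi_x (sketch of Section 3.3)."""
    x = np.asarray(x_row, dtype=float).ravel()
    nonzero = np.flatnonzero(x)               # interpretable components: words present in x
    d_prime = len(nonzero)

    # Each z' keeps a uniformly-sized random subset of the words present in x.
    Z_prime = np.ones((num_samples, d_prime))
    for i in range(1, num_samples):           # row 0 is the unperturbed instance
        n_off = np.random.randint(1, d_prime + 1)
        Z_prime[i, np.random.choice(d_prime, n_off, replace=False)] = 0

    # Recover z in the original representation and query the black box f.
    Z = np.zeros((num_samples, x.shape[0]))
    Z[:, nonzero] = Z_prime * x[nonzero]
    labels = classifier_fn(Z)[:, 1]           # probability of the class being explained

    # pi_x(z) = exp(-D(x, z)^2 / sigma^2), here with cosine distance as D.
    D = cosine_distances(Z, x.reshape(1, -1)).ravel()
    weights = np.exp(-(D ** 2) / sigma ** 2)
    return Z_prime, labels, weights, nonzero  # nonzero maps columns of Z_prime back to words

The perturbed dataset (Z′, labels, weights) produced this way is exactly what the sparse linear fit in the next subsection consumes.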

3.4 Sparse Linear Explanations

For the rest of this paper, we let G be the class of linear models, such that g(z′) = w_g · z′. We use the locally weighted square loss as L, as defined in Eq. (2), where we let π_x(z) = exp(−D(x, z)² / σ²) be an exponential kernel defined on some distance function D (e.g. cosine distance for text, L2 distance for images) with width σ.

    L(f, g, π_x) = Σ_{z, z′ ∈ Z} π_x(z) (f(z) − g(z′))²        (2)

For text classification, we ensure that the explanation is interpretable by letting the interpretable representation be a bag of words, and by setting a limit K on the number of words, i.e. Ω(g) = ∞ · 1[‖w_g‖₀ > K]. Potentially, K can be adapted to be as big as the user can handle, or we could have different values of K for different instances. In this paper we use a constant value for K, leaving the exploration of different values to future work. We use the same Ω for image classification, using "super-pixels" (computed using any standard algorithm) instead of words, such that the interpretable representation of an image is a binary vector where 1 indicates the original super-pixel and 0 indicates a grayed out super-pixel. This particular choice of Ω makes directly solving Eq. (1) intractable, but we approximate it by first selecting K features with Lasso (using the regularization path [9]) and then learning the weights via least squares (a procedure we call K-LASSO in Algorithm 1). Since Algorithm 1 produces an explanation for an individual prediction, its complexity does not depend on the size of the dataset, but instead on the time to compute f(x) and on the number of samples N. In practice, explaining random forests with 1000 trees using scikit-learn (http://scikit-learn.org) on a laptop with N = 5000 takes under 3 seconds without any optimizations such as using gpus or parallelization. Explaining each prediction of the Inception network [25] for image classification takes around 10 minutes.

Algorithm 1 Sparse Linear Explanations using LIME
Require: Classifier f, Number of samples N
Require: Instance x, and its interpretable version x′
Require: Similarity kernel π_x, Length of explanation K
  Z ← {}
  for i ∈ {1, 2, 3, ..., N} do
    z′_i ← sample_around(x′)
    Z ← Z ∪ ⟨z′_i, f(z_i), π_x(z_i)⟩
  end for
  w ← K-Lasso(Z, K)    ▷ with z′_i as features, f(z_i) as target
  return w

Figure 3: Toy example to present intuition for LIME. The black-box model's complex decision function f (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bold red cross is the instance being explained. LIME samples instances, gets predictions using f, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.

Any choice of interpretable representations and G will have some inherent drawbacks. First, while the underlying model can be treated as a black-box, certain interpretable representations will not be powerful enough to explain certain behaviors. For example, a model that predicts sepia-toned images to be retro cannot be explained by the presence or absence of super-pixels. Second, our choice of G (sparse linear models) means that if the underlying model is highly non-linear even in the locality of the prediction, there may not be a faithful explanation. However, we can estimate the faithfulness of the explanation on Z, and present this information to the user. This estimate of faithfulness can also be used for selecting an appropriate family of explanations from a set of multiple interpretable model classes, thus adapting to the given dataset and the classifier. We leave such exploration for future work, as linear explanations work quite well for multiple black-box models in our experiments.

3.5 Example 1: Text classification with SVMs

In Figure 2 (right side), we explain the predictions of a support vector machine with RBF kernel trained on unigrams to differentiate "Christianity" from "Atheism" (on a subset of the 20 newsgroup dataset). Although this classifier achieves 94% held-out accuracy, and one would be tempted to trust it based on this, the explanation for an instance shows that predictions are made for quite arbitrary reasons (words "Posting", "Host", and "Re" have no connection to either Christianity or Atheism). The word "Posting" appears in 22% of examples in the training set, 99% of them in the class "Atheism". Even if headers are removed, proper names of prolific posters in the original newsgroups are selected by the classifier, which would also not generalize.

After getting such insights from explanations, it is clear that this dataset has serious issues (which are not evident just by studying the raw data or predictions), and that this classifier, or held-out evaluation, cannot be trusted. It is also clear what the problems are, and the steps that can be taken to fix these issues and train a more trustworthy classifier.

3.6 Example 2: Deep networks for images

When using sparse linear explanations for image classifiers, one may wish to just highlight the super-pixels with positive weight towards a specific class, as they give intuition as to why the model would think that class may be present. We explain the prediction of Google's pre-trained Inception neural network [25] in this fashion on an arbitrary image (Figure 4a). Figures 4b, 4c, 4d show the super-pixel explanations for the top 3 predicted classes (with the rest of the image grayed out), having set K = 10. What the neural network picks up on for each of the classes is quite natural to humans – Figure 4b in particular provides insight as to why acoustic guitar was predicted to be electric: due to the fretboard. This kind of explanation enhances trust in the classifier (even if the top predicted class is wrong), as it shows that it is not acting in an unreasonable manner.

Figure 4: Explaining an image classification prediction made by Google's Inception neural network. The top 3 classes predicted are "Electric Guitar" (p = 0.32), "Acoustic guitar" (p = 0.24) and "Labrador" (p = 0.21). Panels: (a) original image; (b) explaining Electric guitar; (c) explaining Acoustic guitar; (d) explaining Labrador.
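As a concrete companion to Algorithm 1, here is a minimal Python sketch of the K-LASSO step: it selects at most K interpretable features with the Lasso regularization path [9] and then refits their weights by weighted least squares. It assumes the perturbed dataset (Z′, labels, kernel weights) has already been built, for example with the sampling sketch after Section 3.3; the function and argument names are illustrative and this is not the authors' reference implementation.

import numpy as np
from sklearn.linear_model import lars_path, LinearRegression

def k_lasso(Z_prime, labels, weights, K=10):
    """Sketch of K-LASSO: feature selection via the Lasso path, then a
    weighted least-squares refit of the selected interpretable features."""
    # Weighted least squares via row rescaling (after weighted centering).
    sw = np.sqrt(weights)
    Zw = (Z_prime - np.average(Z_prime, axis=0, weights=weights)) * sw[:, np.newaxis]
    yw = (labels - np.average(labels, weights=weights)) * sw

    # Walk the LARS/Lasso path backwards until at most K coefficients are active.
    _, _, coefs = lars_path(Zw, yw, method="lasso")
    for col in range(coefs.shape[1] - 1, -1, -1):
        active = np.flatnonzero(coefs[:, col])
        if len(active) <= K:
            break

    # Refit the selected features with (weighted) least squares.
    g = LinearRegression().fit(Z_prime[:, active], labels, sample_weight=weights)
    return list(zip(active.tolist(), g.coef_))   # (interpretable component, weight) pairs

Combined with a sampling routine like the one sketched earlier, this yields a K-term weighted explanation of the kind visualized in Figures 2 and 4.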

4. SUBMODULAR PICK FOR EXPLAINING MODELS

Although an explanation of a single prediction provides some understanding into the reliability of the classifier to the user, it is not sufficient to evaluate and assess trust in the model as a whole. We propose to give a global understanding of the model by explaining a set of individual instances. This approach is still model-agnostic, and is complementary to computing summary statistics such as held-out accuracy.

Even though explanations of multiple instances can be insightful, these instances need to be selected judiciously, since users may not have the time to examine a large number of explanations. We represent the time/patience that humans have by a budget B that denotes the number of explanations they are willing to look at in order to understand a model. Given a set of instances X, we define the pick step as the task of selecting B instances for the user to inspect.

The pick step is not dependent on the existence of explanations – one of the main purposes of tools like Modeltracker [1] and others [11] is to assist users in selecting instances themselves, and examining the raw data and predictions. However, since looking at raw data is not enough to understand predictions and get insights, the pick step should take into account the explanations that accompany each prediction. Moreover, this method should pick a diverse, representative set of explanations to show the user – i.e. non-redundant explanations that represent how the model behaves globally.

Figure 5: Toy example W. Rows represent instances (documents) and columns represent features (words) f1–f5. Feature f2 (dotted blue) has the highest importance. Rows 2 and 5 (in red) would be selected by the pick procedure, covering all but feature f1.

Given the explanations for a set of instances X (|X| = n), we construct an n × d′ explanation matrix W that represents the local importance of the interpretable components for each instance. When using linear models as explanations, for an instance x_i and explanation g_i = ξ(x_i), we set W_ij = |w_{g_i j}|. Further, for each component (column) j in W, we let I_j denote the global importance of that component in the explanation space. Intuitively, we want I such that features that explain many different instances have higher importance scores. In Figure 5, we show a toy example W, with n = d′ = 5, where W is binary (for simplicity). The importance function I should score feature f2 higher than feature f1, i.e. I_2 > I_1, since feature f2 is used to explain more instances. Concretely for the text applications, we set I_j = sqrt(Σ_{i=1}^n W_ij). For images, I must measure something that is comparable across the super-pixels in different images, such as color histograms or other features of super-pixels; we leave further exploration of these ideas for future work.

While we want to pick instances that cover the important components, the set of explanations must not be redundant in the components they show the users, i.e. avoid selecting instances with similar explanations. In Figure 5, after the second row is picked, the third row adds no value, as the user has already seen features f2 and f3 – while the last row exposes the user to completely new features. Selecting the second and last row results in the coverage of almost all the features. We formalize this non-redundant coverage intuition in Eq. (3), where we define coverage as the set function c that, given W and I, computes the total importance of the features that appear in at least one instance in a set V.

Algorithm 2 Submodular pick (SP) algorithm
Require: Instances X, Budget B
  for all x_i ∈ X do
    W_i ← explain(x_i, x′_i)    ▷ Using Algorithm 1
  end for
  for j ∈ {1 ... d′} do
    I_j ← sqrt(Σ_{i=1}^n |W_ij|)    ▷ Compute feature importances
  end for
  V ← {}
  while |V| < B do    ▷ Greedy optimization of Eq. (4)
    V ← V ∪ {argmax_i c(V ∪ {i}, W, I)}
  end while
  return V
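The following is a compact Python sketch of Algorithm 2, assuming the explanation matrix W has already been computed with Algorithm 1; the coverage function corresponds to Eq. (3) and the greedy loop to the optimization of Eq. (4) below. The names are illustrative, not the authors' implementation.

import numpy as np

def submodular_pick(W, B):
    """Greedy submodular pick: choose B instances whose explanations cover
    the globally important interpretable components (sketch of Algorithm 2)."""
    W = np.abs(np.asarray(W, dtype=float))
    I = np.sqrt(W.sum(axis=0))                      # global importance of each component

    def coverage(rows):
        # c(V, W, I): total importance of components appearing in at least
        # one picked explanation (Eq. (3)).
        if not rows:
            return 0.0
        covered = (W[rows] > 0).any(axis=0)
        return float(I[covered].sum())

    picked = []
    while len(picked) < min(B, W.shape[0]):
        gains = np.array([coverage(picked + [i]) - coverage(picked)
                          for i in range(W.shape[0])])
        if picked:
            gains[picked] = -np.inf                 # never pick the same instance twice
        picked.append(int(np.argmax(gains)))        # greedy step for Eq. (4)
    return picked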

    c(V, W, I) = Σ_{j=1}^{d′} 1[∃ i ∈ V : W_ij > 0] I_j        (3)

The pick problem, defined in Eq. (4), consists of finding the set V, |V| ≤ B, that achieves highest coverage.

    Pick(W, I) = argmax_{V, |V| ≤ B} c(V, W, I)        (4)

The problem in Eq. (4) is maximizing a weighted coverage function, and is NP-hard [10]. Let c(V ∪ {i}, W, I) − c(V, W, I) be the marginal coverage gain of adding an instance i to a set V. Due to submodularity, a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain to the solution offers a constant-factor approximation guarantee of 1 − 1/e to the optimum [15]. We outline this approximation in Algorithm 2, and call it submodular pick.

5. SIMULATED USER EXPERIMENTS

In this section, we present simulated user experiments to evaluate the utility of explanations in trust-related tasks. In particular, we address the following questions: (1) Are the explanations faithful to the model, (2) Can the explanations aid users in ascertaining trust in predictions, and (3) Are the explanations useful for evaluating the model as a whole. Code and data for replicating our experiments are available at https://github.com/marcotcr/lime-experiments.

5.1 Experiment Setup

We use two sentiment analysis datasets (books and DVDs, 2000 instances each) where the task is to classify product reviews as positive or negative [4]. We train decision trees (DT), logistic regression with L2 regularization (LR), nearest neighbors (NN), and support vector machines with RBF kernel (SVM), all using bag of words as features. We also include random forests (with 1000 trees) trained with the average word2vec embedding [19] (RF), a model that is impossible to interpret without a technique like LIME. We use the implementations and default parameters of scikit-learn, unless noted otherwise. We divide each dataset into train (1600 instances) and test (400 instances).

To explain individual predictions, we compare our proposed approach (LIME) with parzen [2], a method that approximates the black box classifier globally with Parzen windows, and explains individual predictions by taking the gradient of the prediction probability function. For parzen, we take the K features with the highest absolute gradients as explanations. We set the hyper-parameters for parzen and LIME using cross validation, and set N = 15,000. We also compare against a greedy procedure (similar to Martens and Provost [18]) in which we greedily remove features that contribute the most to the predicted class until the prediction changes (or we reach the maximum of K features), and a random procedure that randomly picks K features as an explanation. We set K to 10 for our experiments.

For experiments where the pick procedure applies, we either do random selection (random pick, RP) or the procedure described in §4 (submodular pick, SP). We refer to pick-explainer combinations by adding RP or SP as a prefix.

5.2 Are explanations faithful to the model?

We measure faithfulness of explanations on classifiers that are by themselves interpretable (sparse logistic regression and decision trees). In particular, we train both classifiers such that the maximum number of features they use for any instance is 10, and thus we know the gold set of features that are considered important by these models. For each prediction on the test set, we generate explanations and compute the fraction of these gold features that are recovered by the explanations. We report this recall averaged over all the test instances in Figures 6 and 7. We observe that the greedy approach is comparable to parzen on logistic regression, but is substantially worse on decision trees since changing a single feature at a time often does not have an effect on the prediction. The overall recall by parzen is low, likely due to the difficulty in approximating the original high-dimensional classifier. LIME consistently provides > 90% recall for both classifiers on both datasets, demonstrating that LIME explanations are faithful to the models.

Figure 6: Recall on truly important features for two interpretable classifiers on the books dataset. (Bar chart of recall (%) for random, parzen, greedy, and LIME; panels: (a) Sparse LR, (b) Decision Tree.)

Figure 7: Recall on truly important features for two interpretable classifiers on the DVDs dataset. (Bar chart of recall (%) for random, parzen, greedy, and LIME; panels: (a) Sparse LR, (b) Decision Tree.)

5.3 Should I trust this prediction?

In order to simulate trust in individual predictions, we first randomly select 25% of the features to be "untrustworthy", and assume that the users can identify and would not want to trust these features (such as the headers in 20 newsgroups, leaked data, etc). We thus develop an oracle notion of "trustworthiness" by labeling test set predictions from a black box classifier as "untrustworthy" if the prediction changes when untrustworthy features are removed from the instance, and "trustworthy" otherwise. In order to simulate users, we assume that users deem predictions untrustworthy from LIME and parzen explanations if the prediction from the linear approximation changes when all untrustworthy features that appear in the explanations are removed (the simulated human "discounts" the effect of untrustworthy features). For greedy and random, the prediction is mistrusted if any untrustworthy features are present in the explanation, since these methods do not provide a notion of the contribution of each feature to the prediction. Thus for each test set prediction, we can evaluate whether the simulated user trusts it using each explanation method, and compare it to the trustworthiness oracle.
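One way to read this protocol is as the following Python sketch; it is an interpretation of the simulation under stated assumptions (dense binary instances, a 0.5 decision threshold for the local linear model, and explanations stored as feature-to-weight dictionaries), not the authors' code.

import numpy as np

def simulate_trust(predict_fn, explanations, X_test, untrustworthy, threshold=0.5):
    """Sketch of the Section 5.3 simulation: the oracle trusts a prediction if
    it does not change when untrustworthy features are zeroed in the instance;
    the simulated LIME user trusts it if the local linear approximation does
    not change when untrustworthy features in the explanation are discounted."""
    untrustworthy = set(untrustworthy)
    oracle, user = [], []
    for x, expl in zip(X_test, explanations):
        # Oracle: remove untrustworthy features from the raw instance.
        x_clean = x.copy()
        x_clean[list(untrustworthy)] = 0
        oracle.append(predict_fn(x) == predict_fn(x_clean))

        # Simulated user: expl maps feature index -> weight ("intercept" for the bias).
        b = expl.get("intercept", 0.0)
        full = b + sum(w for j, w in expl.items() if j != "intercept")
        kept = b + sum(w for j, w in expl.items()
                       if j != "intercept" and j not in untrustworthy)
        user.append((full > threshold) == (kept > threshold))
    return np.array(oracle), np.array(user)   # compare these two to obtain the F1 of Table 1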

Using this setup, we report the F1 on the trustworthy predictions for each explanation method, averaged over 100 runs, in Table 1.

Table 1: Average F1 of trustworthiness for different explainers on a collection of classifiers and datasets.

            Books                          DVDs
            LR     NN     RF     SVM       LR     NN     RF     SVM
  Random    14.2   14.3   14.5   14.4      14.6   14.8   14.7   14.7
  Parzen    87.0   81.7   94.2   87.3      84.0   87.6   94.3   92.3
  Greedy    52.4   58.1   46.6   55.1      53.7   47.4   45.0   53.3
  LIME      96.6   94.5   96.2   96.7      96.6   91.8   96.1   95.6

The results indicate that LIME dominates the others (all results are significant at p = 0.01) on both datasets, and for all of the black box models. The other methods either achieve a lower recall (i.e. they mistrust predictions more than they should) or lower precision (i.e. they trust too many predictions), while LIME maintains both high precision and high recall. Even though we artificially select which features are untrustworthy, these results indicate that LIME is helpful in assessing trust in individual predictions.

5.4 Can I trust this model?

In the final simulated user experiment, we evaluate whether the explanations can be used for model selection, simulating the case where a human has to decide between two competing models with similar accuracy on validation data. For this purpose, we add 10 artificially "noisy" features. Specifically, on the training and validation sets (80/20 split of the original training data), each artificial feature appears in 10% of the examples in one class, and 20% of the other, while on the test instances, each artificial feature appears in 10% of the examples in each class. This recreates the situation where the models use not only features that are informative in the real world, but also ones that introduce spurious correlations. We create pairs of competing classifiers by repeatedly training pairs of random forests with 30 trees until their validation accuracy is within 0.1% of each other, but their test accuracy differs by at least 5%. Thus, it is not possible to identify the better classifier (the one with higher test accuracy) from the accuracy on the validation data.

The goal of this experiment is to evaluate whether a user can identify the better classifier based on the explanations of B instances from the validation set. The simulated human marks the set of artificial features that appear in the B explanations as untrustworthy, following which we evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy). Then, we select the classifier with fewer untrustworthy predictions, and compare this choice to the classifier with higher held-out test set accuracy.

We present the accuracy of picking the correct classifier as B varies, averaged over 800 runs, in Figure 8. We omit SP-parzen and RP-parzen from the figure since they did not produce useful explanations, performing only slightly better than random. LIME is consistently better than greedy, irrespective of the pick method. Further, combining submodular pick with LIME outperforms all other methods; in particular, it is much better than RP-LIME when only a few examples are shown to the users. These results demonstrate that the trust assessments provided by SP-selected LIME explanations are good indicators of generalization, which we validate with human experiments in the next section.

Figure 8: Choosing between two classifiers, as the number of instances shown to a simulated user is varied. Averages and standard errors from 800 runs. (Curves: SP-LIME, RP-LIME, SP-greedy, RP-greedy; y-axis: % correct choice; x-axis: # of instances seen by the user; panels: (a) Books dataset, (b) DVDs dataset.)
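For concreteness, the artificial "noisy" features described at the start of Section 5.4 could be injected as in the following sketch, which assumes a dense 0/1 feature matrix X, labels y in {0, 1}, and an arbitrary assignment of which class gets the 10% rate; these are assumptions of the sketch, not details given in the paper.

import numpy as np

def add_noisy_features(X, y, n_noisy=10, train=True, rng=np.random):
    """Append artificial binary features that are spuriously correlated with
    the label on train/validation data (present in 10% of one class and 20%
    of the other) but uninformative on test data (10% of each class)."""
    n = X.shape[0]
    noisy = np.zeros((n, n_noisy))
    for j in range(n_noisy):
        if train:
            # Which class gets 10% vs 20% is an arbitrary choice in this sketch.
            p = np.where(y == (j % 2), 0.10, 0.20)
        else:
            p = np.full(n, 0.10)              # no correlation with the label at test time
        noisy[:, j] = rng.binomial(1, p)
    return np.hstack([X, noisy])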

6. EVALUATION WITH HUMAN SUBJECTS

In this section, we recreate three scenarios in machine learning that require trust and understanding of predictions and models. In particular, we evaluate LIME and SP-LIME in the following settings: (1) can users choose which of two classifiers generalizes better, based on the explanations (§6.2), (2) can users perform feature engineering to improve the model (§6.3), and (3) are users able to identify and describe classifier irregularities by looking at explanations (§6.4).

6.1 Experiment setup

For the experiments in §6.2 and §6.3, we use the "Christianity" and "Atheism" documents from the 20 newsgroups dataset mentioned beforehand. This dataset is problematic since it contains features that do not generalize (e.g. very informative header information and author names), and thus validation accuracy considerably overestimates real-world performance. In order to estimate the real world performance, we create a new religion dataset for evaluation. We download Atheism and Christianity websites from the DMOZ directory and human curated lists, yielding 819 webpages in each class. High accuracy on this dataset by a classifier trained on 20 newsgroups indicates that the classifier is generalizing using semantic content, instead of placing importance on the data specific issues outlined above. Unless noted otherwise, we use SVM with RBF kernel, trained on the 20 newsgroups data, with hyper-parameters tuned via cross-validation.

6.2 Can users select the best classifier?

In this section, we want to evaluate whether explanations can help users decide which classifier generalizes better, i.e., which classifier the user would deploy "in the wild". Specifically, users have to decide between two classifiers: SVM trained on the original 20 newsgroups dataset, and a version of the same classifier trained on a "cleaned" dataset where many of the features that do not generalize have been manually removed. The original classifier achieves an accuracy score of 57.3% on the religion dataset, while the "cleaned" classifier achieves a score of 69.0%. In contrast, the test accuracy on the original 20 newsgroups split is 94.0% and 88.6%, respectively – suggesting that the worse classifier would be selected if accuracy alone is used as a measure of trust.

We recruit human subjects on Amazon Mechanical Turk – by no means machine learning experts, but instead people with basic knowledge about religion. We measure their ability to choose the better algorithm by seeing side-by-side explanations with the associated raw data (as shown in Figure 2). We restrict both the number of words in each explanation (K) and the number of documents that each person inspects (B) to 6. The position of each algorithm and the order of the instances seen are randomized between subjects. After examining the explanations, users are asked to select which algorithm will perform best in the real world. The explanations are produced by either greedy (chosen as a baseline due to its performance in the simulated user experiment) or LIME, and the instances are selected either by random (RP) or submodular pick (SP). We modify the greedy step in Algorithm 2 slightly so it alternates between explanations of the two classifiers. For each setting, we repeat the experiment with 100 users.

The results are presented in Figure 9. Note that all of the methods are good at identifying the better classifier, demonstrating that the explanations are useful in determining which classifier to trust, while using test set accuracy would result in the selection of the wrong classifier. Further, we see that the submodular pick (SP) greatly improves the user's ability to select the best classifier when compared to random pick (RP), with LIME outperforming greedy in both cases.

Figure 9: Average accuracy of human subjects (with standard errors) in choosing between two classifiers. (Bar chart comparing random pick (RP) and submodular pick (SP) for greedy and LIME; y-axis: % correct choice.)

6.3 Can non-experts improve a classifier?

If one notes that a classifier is untrustworthy, a common task in machine learning is feature engineering, i.e. modifying the set of features and retraining in order to improve generalization. Explanations can aid in this process by presenting the important features, particularly for removing features that the users feel do not generalize.

We use the 20 newsgroups data here as well, and ask Amazon Mechanical Turk users to identify which words from the explanations should be removed from subsequent training, for the worse classifier from the previous section (§6.2). In each round, the subject marks words for deletion after observing B = 10 instances with K = 10 words in each explanation (an interface similar to Figure 2, but with a single algorithm). As a reminder, the users here are not experts in machine learning and are unfamiliar with feature engineering, thus they are only identifying words based on their semantic content. Further, users do not have any access to the religion dataset – they do not even know of its existence. We start the experiment with 10 subjects. After they mark words for deletion, we train 10 different classifiers, one for each subject (with the corresponding words removed). The explanations for each classifier are then presented to a set of 5 users in a new round of interaction, which results in 50 new classifiers. We do a final round, after which we have 250 classifiers, each with a path of interaction tracing back to the first 10 subjects.

The explanations and instances shown to each user are produced by SP-LIME or RP-LIME. We show the average accuracy on the religion dataset at each interaction round for the paths originating from each of the original 10 subjects (shaded lines), and the average across all paths (solid lines) in Figure 10. It is clear from the figure that the crowd workers are able to improve the model by removing features they deem unimportant for the task. Further, SP-LIME outperforms RP-LIME, indicating that the selection of the instances to show the users is crucial for efficient feature engineering.

Figure 10: Feature engineering experiment. Each shaded line represents the average accuracy of subjects in a path starting from one of the initial 10 subjects. Each solid line represents the average across all paths per round of interaction. (Curves: SP-LIME, RP-LIME, and no cleaning; y-axis: real world accuracy; x-axis: rounds of interaction.)

Each subject took an average of 3.6 minutes per round of cleaning, resulting in just under 11 minutes to produce a classifier that generalizes much better to real world data. Each path had on average 200 words removed with SP, and 157 with RP, indicating that incorporating coverage of important features is useful for feature engineering. Further, out of an average of 200 words selected with SP, 174 were selected by at least half of the users, while 68 were selected by all the users. Along with the fact that the variance in the accuracy decreases across rounds, this high agreement demonstrates that the users are converging to similar correct models. This evaluation is an example of how explanations make it easy to improve an untrustworthy classifier – in this case easy enough that machine learning knowledge is not required.

6.4 Do explanations lead to insights?

Often artifacts of data collection can induce undesirable correlations that the classifiers pick up during training. These issues can be very difficult to identify just by looking at the raw data and predictions. In an effort to reproduce such a setting, we take the task of distinguishing between photos of Wolves and Eskimo Dogs (huskies). We train a logistic regression classifier on a training set of 20 images, hand selected such that all pictures of wolves had snow in the background, while pictures of huskies did not. As the features for the images, we use the first max-pooling layer of Google's pre-trained Inception neural network [25]. On a collection of 60 additional images, the classifier predicts "Wolf" if there is snow (or a light background at the bottom), and "Husky" otherwise, regardless of animal color, position, pose, etc. We trained this bad classifier intentionally, to evaluate whether subjects are able to detect it.

The experiment proceeds as follows: we first present a balanced set of 10 test predictions (without explanations), where one wolf is not in a snowy background (and thus the prediction is "Husky") and one husky is (and is thus predicted as "Wolf"). We show the "Husky" mistake in Figure 11a. The other 8 examples are classified correctly. We then ask the subject three questions: (1) Do they trust this algorithm to work well in the real world, (2) why, and (3) how do they think the algorithm is able to distinguish between these photos of wolves and huskies?

Figure 11: Raw data and explanation of a bad model's prediction in the "Husky vs Wolf" task. (a) Husky classified as wolf; (b) explanation.

Table 2: "Husky vs Wolf" experiment results.

                                   Before          After
  Trusted the bad model            10 out of 27    3 out of 27
  Snow as a potential feature      12 out of 27    25 out of 27

After getting these responses, we show the same images with the associated explanations, such as in Figure 11b, and ask the same questions.

Since this task requires some familiarity with the notion of spurious correlations and generalization, the set of subjects for this experiment were graduate students who have taken at least one graduate machine learning course. After gathering the responses, we had 3 independent evaluators read their reasoning and determine if each subject mentioned snow, background, or equivalent as a feature the model may be using. We pick the majority to decide whether the subject was correct about the insight, and report these numbers before and after showing the explanations in Table 2.

Before observing the explanations, more than a third trusted the classifier, and a little less than half mentioned the snow pattern as something the neural network was using – although all speculated on other patterns. After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially. Although our sample size is small, this experiment demonstrates the utility of explaining individual predictions for getting insights into classifiers, and for knowing when not to trust them and why.

7. RELATED WORK

The problems with relying on validation set accuracy as the primary measure of trust have been well studied. Practitioners consistently overestimate their model's accuracy [21], propagate feedback loops [23], or fail to notice data leaks [14]. In order to address these issues, researchers have proposed tools like Gestalt [20] and Modeltracker [1], which help users navigate individual instances. These tools are complementary to LIME in terms of explaining models, since they do not address the problem of explaining individual predictions. Further, our submodular pick procedure can be incorporated in such tools to aid users in navigating larger datasets.

Some recent work aims to anticipate failures in machine learning, specifically for vision tasks [3, 29]. Letting users know when the systems are likely to fail can lead to an increase in trust, by avoiding "silly mistakes" [8]. These solutions either require additional annotations and feature engineering that is specific to vision tasks or do not provide insight into why a decision should not be trusted. Furthermore, they assume that the current evaluation metrics are reliable, which may not be the case if problems such as data leakage are present. Other recent work [11] focuses on exposing users to different kinds of mistakes (our pick step). Interestingly, the subjects in their study did not notice the serious problems in the 20 newsgroups data even after looking at many mistakes, suggesting that examining raw data is not sufficient. Note that Groce et al. [11] are not alone in this regard; many researchers in the field have unwittingly published classifiers that would not generalize for this task. Using LIME, we show that even non-experts are able to identify these irregularities when explanations are present. Further, LIME can complement these existing systems, and allow users to assess trust even when a prediction seems "correct" but is made for the wrong reasons.

Recognizing the utility of explanations in assessing trust, many have proposed using interpretable models [27], especially for the medical domain [6, 17, 26]. While such models may be appropriate for some domains, they may not apply equally well to others (e.g. a supersparse linear model [26] with 5–10 features is unsuitable for text applications). Interpretability, in these cases, comes at the cost of flexibility, accuracy, or efficiency. For text, EluciDebug [16] is a full human-in-the-loop system that shares many of our goals (interpretability, faithfulness, etc.). However, they focus on an already interpretable model (Naive Bayes). In computer vision, systems that rely on object detection to produce candidate alignments [13] or attention [28] are able to produce explanations for their predictions. These are, however, constrained to specific neural network architectures or incapable of detecting "non object" parts of the images. Here we focus on general, model-agnostic explanations that can be applied to any classifier or regressor that is appropriate for the domain – even ones that are yet to be proposed.

A common approach to model-agnostic explanation is learning a potentially interpretable model on the predictions of the original model [22, 7, 2]. Having the explanation be a gradient vector [2] captures a similar locality intuition to that of LIME. However, interpreting the coefficients on the gradient is difficult, particularly for confident predictions (where the gradient is near zero). Further, these explanations approximate the original model globally, thus maintaining local fidelity becomes a significant challenge, as our experiments demonstrate. In contrast, LIME solves the much more feasible task of finding a model that approximates the original model locally. The idea of perturbing inputs for explanations has been explored before [24], where the authors focus on learning a specific contribution model, as opposed to our general framework. None of these approaches explicitly take cognitive limitations into account, and thus may produce non-interpretable explanations, such as gradients or linear models with thousands of non-zero weights. The problem becomes worse if the original features are nonsensical to humans (e.g. word embeddings). In contrast, LIME incorporates interpretability both in the optimization and in our notion of interpretable representation, such that domain and task specific interpretability criteria can be accommodated.

8. CONCLUSION AND FUTURE WORK

In this paper, we argued that trust is crucial for effective human interaction with machine learning systems, and that explaining individual predictions is important in assessing trust. We proposed LIME, a modular and extensible approach to faithfully explain the predictions of any model in an interpretable manner. We also introduced SP-LIME, a method to select representative and non-redundant predictions, providing a global view of the model to users. Our experiments demonstrated that explanations are useful for a variety of models in trust-related tasks in the text and image domains, with both expert and non-expert users: deciding between models, assessing trust, improving untrustworthy models, and getting insights into predictions.

There are a number of avenues of future work that we would like to explore. Although we describe only sparse linear models as explanations, our framework supports the exploration of a variety of explanation families, such as decision trees; it would be interesting to see a comparative study on these with real users. One issue that we do not mention in this work was how to perform the pick step for images, and we would like to address this limitation in the future. The domain and model agnosticism enables us to explore a variety of applications, and we would like to investigate potential uses in speech, video, and medical domains, as well as recommendation systems. Finally, we would like to explore theoretical properties (such as the appropriate number of samples) and computational optimizations (such as using parallelization and GPU processing), in order to provide the accurate, real-time explanations that are critical for any human-in-the-loop machine learning system.

Acknowledgements

We would like to thank Scott Lundberg, Tianqi Chen, and Tyler Johnson for helpful discussions and feedback. This work was supported in part by ONR awards #W911NF-13-1-0246 and #N00014-13-1-0023, and in part by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

9. REFERENCES

[1] S. Amershi, M. Chickering, S. M. Drucker, B. Lee, P. Simard, and J. Suh. Modeltracker: Redesigning performance analysis tools for machine learning. In Human Factors in Computing Systems (CHI), 2015.
[2] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11, 2010.
[3] A. Bansal, A. Farhadi, and D. Parikh. Towards transparent systems: Semantic characterization of failure modes. In European Conference on Computer Vision (ECCV), 2014.
[4] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics (ACL), 2007.
[5] J. Q. Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT, 2009.
[6] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Knowledge Discovery and Data Mining (KDD), 2015.
[7] M. W. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In Neural Information Processing Systems (NIPS), pages 24–30, 1996.
[8] M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck. The role of trust in automation reliance. Int. J. Hum.-Comput. Stud., 58(6), 2003.
[9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
[10] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4), July 1998.
[11] A. Groce, T. Kulesza, C. Zhang, S. Shamasunder, M. Burnett, W.-K. Wong, S. Stumpf, S. Das, A. Shinsel, F. Bice, and K. McIntosh. You are the only possible oracle: Effective test selection for end users of interactive machine learning systems. IEEE Trans. Softw. Eng., 40(3), 2014.
[12] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Conference on Computer Supported Cooperative Work (CSCW), 2000.
[13] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[14] S. Kaufman, S. Rosset, and C. Perlich. Leakage in data mining: Formulation, detection, and avoidance. In Knowledge Discovery and Data Mining (KDD), 2011.
[15] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, February 2014.
[16] T. Kulesza, M. Burnett, W.-K. Wong, and S. Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Intelligent User Interfaces (IUI), 2015.
[17] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics, 2015.
[18] D. Martens and F. Provost. Explaining data-driven document classifications. MIS Q., 38(1), 2014.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS), 2013.
[20] K. Patel, N. Bancroft, S. M. Drucker, J. Fogarty, A. J. Ko, and J. Landay. Gestalt: Integrated support for implementation and analysis in machine learning. In User Interface Software and Technology (UIST), 2010.
[21] K. Patel, J. Fogarty, J. A. Landay, and B. Harrison. Investigating statistical machine learning as a tool for software development. In Human Factors in Computing Systems (CHI), 2008.
[22] I. Sanchez, T. Rocktaschel, S. Riedel, and S. Singh. Towards extracting faithful and descriptive representations of latent variable models. In AAAI Spring Symposium on Knowledge Representation and Reasoning (KRR): Integrating Symbolic and Neural Approaches, 2015.
[23] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, and J.-F. Crespo. Hidden technical debt in machine learning systems. In Neural Information Processing Systems (NIPS), 2015.
[24] E. Strumbelj and I. Kononenko. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11, 2010.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[26] B. Ustun and C. Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 2015.
[27] F. Wang and C. Rudin. Falling rule lists. In Artificial Intelligence and Statistics (AISTATS), 2015.
[28] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.
[29] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh. Predicting failures of vision systems. In Computer Vision and Pattern Recognition (CVPR), 2014.
