Smart Reply: Automated Response Suggestion for Email

Anjuli Kannan*, Karol Kurach*, Sujith Ravi*, Tobias Kaufmann*, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter Young, Vivek Ramavajjala
Google
{anjuli, kkurach, sravi, snufkin}@google.com
* Equal contribution.

KDD '16, August 13-17, 2016, San Francisco, CA, USA.

ABSTRACT
In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning. We describe the architecture of the system as well as the challenges that we faced while building it, like response diversity and scalability. We also introduce a new method for semantic clustering of user-generated content that requires only a modest amount of explicitly labeled data.

Keywords
Email; LSTM; Deep Learning; Clustering; Semantics

1. INTRODUCTION
Email is one of the most popular modes of communication on the Web. Despite the recent increase in usage of social networks, email continues to be the primary medium for billions of users across the world to connect and share information [2]. With the rapid increase in email overload, it has become increasingly challenging for users to process and respond to incoming messages. It can be especially time-consuming to type email replies on a mobile device.

An initial study covering several million email-reply pairs showed that ~25% of replies have 20 tokens or less. Thus we raised the following question: can we assist users with composing these short messages? More specifically, would it be possible to suggest brief responses when appropriate, just one tap away?

Figure 1: Example Smart Reply suggestions.

To address this problem, we leverage the sequence-to-sequence learning framework [23], which uses long short-term memory networks (LSTMs) [10] to predict sequences of text. Consistent with the approach of the Neural Conversation Model [24], our input sequence is an incoming message and our output distribution is over the space of possible replies. Training this framework on a large corpus of conversation data produces a fully generative model that can produce a response to any sequence of input text. As [24] demonstrated on a tech support chat corpus, this distribution can be used to decode coherent, plausible responses.

However, in order to deploy such a model into a product used globally by millions, we faced several challenges not considered in previous work:

• Response quality: How to ensure that the individual response options are always high quality in language and content.
• Utility: How to select multiple options to show a user so as to maximize the likelihood that one is chosen.
• Scalability: How to efficiently process millions of messages per day while remaining within the latency requirements of an email delivery system.
• Privacy: How to develop this system without ever inspecting the data except in aggregate statistics.

To tackle these challenges, we propose Smart Reply (Figure 1), a novel method and system for automated email response suggestion. Smart Reply consists of the following components, which are also shown in Figure 2:

1. Response selection: At the core of our system, an LSTM neural network processes an incoming message, then uses it to predict the most likely responses. LSTM computation can be expensive, so we improve scalability by finding only the approximate best responses. We explain the model in detail in Section 3.
2. Response set generation: To deliver high response quality, we only select responses from a response space which is generated offline using a semi-supervised graph learning approach. This is discussed in Section 4.
3. Diversity: After finding a set of most likely responses from the LSTM, we would like to choose a small set to show to the user that maximizes the total utility. We found that enforcing diverse semantic intents is critical to making the suggestions useful. Our method for this is described further in Section 5.
4. Triggering model: A feedforward neural network decides whether or not to suggest responses. This further improves utility by not showing suggestions when they are unlikely to be used. We break this out into a separate component so that we have the option to use a computationally cheaper architecture than what is used for the scoring model; this keeps the system scalable. This model is described in Section 6.

Figure 2: Life of a message. The figure presents an overview of inference: an incoming email is preprocessed; the triggering model decides whether to suggest replies at all (if not, no Smart Reply is shown); the LSTM response selection step scores the permitted responses and clusters; and the diversity selection step picks the suggested Smart Replies.
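As a purely illustrative outline of how these components fit together at inference time, the sketch below mirrors the flow of Figure 2. It is a hypothetical skeleton, not the production code: the component interfaces (trigger_model, scorer, diversifier), the threshold, and the field names are assumptions made only for this example.

```python
# Minimal sketch of the Smart Reply inference flow (hypothetical interfaces).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Suggestion:
    text: str
    cluster: str   # semantic intent cluster from the response set (Section 4)
    score: float


def smart_reply(message: str,
                trigger_model,      # cheap feedforward classifier (Section 6)
                scorer,             # LSTM scoring over the response set (Section 3)
                diversifier,        # intent-based selection (Section 5)
                threshold: float = 0.5,
                num_suggestions: int = 3) -> Optional[List[Suggestion]]:
    """Return up to num_suggestions suggestions, or None if we do not trigger."""
    # 1. Triggering: skip the expensive LSTM when a reply is unlikely.
    if trigger_model.predict_proba(message) < threshold:
        return None
    # 2. Response selection: approximate best responses from the
    #    pre-built response set (beam search restricted to that set).
    candidates = scorer.top_responses(message)          # sorted by score
    # 3. Diversity: keep at most one response per semantic intent,
    #    enforcing a negative (or positive) option when needed.
    return diversifier.select(candidates, k=num_suggestions)
```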
The combination of these components is a novel end-to-end method for generating short, complete responses to emails, going beyond previous works. For response selection it exploits state-of-the-art deep learning models trained on billions of words, and for response set generation it introduces a new semi-supervised method for semantic understanding of user-generated content.

Moreover, since it directly addresses all four challenges mentioned above, it has been successfully deployed in Inbox. Currently, the Smart Reply system is responsible for assisting with 10% of email replies for Inbox on mobile.

Next, we discuss related work in Section 2, followed by a description of our core system components in Sections 3, 4, 5, and 6. We close by showing modeling results in Section 7 and our conclusions in Section 8.

2. RELATED WORK
As we consider related work, we note that building an automated system to suggest email responses is not a task for which there is existing literature or benchmarks, nor is this a standard machine learning problem to which existing algorithms can be readily applied. However, there is work related to two of our core components, which we review here: predicting responses and identifying a target response space.

Predicting full responses. Much work exists on analyzing natural language dialogues in public domains such as Twitter, but it has largely focused on social media tasks like predicting whether or not a response is made [3], predicting the next word only [14], or curating threads [4]. Full response prediction was initially attempted in [16], which approached the problem from the perspective of machine translation: given a Twitter post, "translate" it into a response using phrase-based statistical machine translation (SMT). Our approach is similar, but rather than using SMT we use the neural network machine translation model proposed in [23], called "sequence-to-sequence learning".

Sequence-to-sequence learning, which makes use of long short-term memory networks (LSTMs) [10] to predict sequences of text, was originally applied to machine translation but has since seen success in other domains such as image captioning [25] and speech recognition [6].

Other recent works have also applied recurrent neural networks (RNNs) or LSTMs to full response prediction [21], [20], [19], [24]. In [21] the authors rely on having an SMT system to generate n-best lists, while [19] and [24], like this work, develop fully generative models. Our approach is most similar to the Neural Conversation Model [24], which uses sequence-to-sequence learning to model tech support chats and movie subtitles. The primary difference of our work is that it was deployed in a production setting, which raised the challenges of response quality, utility, scalability, and privacy. These challenges were not considered in any of these related works and led to the novel solutions explained in the rest of this paper. Furthermore, in this work we approach a different domain than [21], [20], [19], and [24], which primarily focus on social media and movie dialogues. In both of those domains it can be acceptable to provide a response that is merely related or on-topic. Email, on the other hand, frequently expresses a request or intent which must be addressed in the response.

Identifying a target response space. Our approach here builds on the Expander graph learning approach [15], since it scales well to both large data (vast amounts of email) and large output sizes (many different underlying semantic intents). While Expander was originally proposed for knowledge expansion and classification tasks [26], our work is the first to use it to discover semantic intent clusters from user-generated content. Other graph-based semi-supervised learning techniques have been explored in the past for more traditional classification problems [27, 5]. Other related works have explored tasks involving semantic classification [12] or identifying word-level intents [17] targeted towards Web search queries and other forums [7]. However, the problem settings and tasks themselves are significantly different from what is addressed in our work.

Finally, we note that Smart Reply is the first work to address these tasks together and solve them in a single end-to-end, deployable system.

3. SELECTING RESPONSES
The fundamental task of the Smart Reply system is to find the most likely response given an original message. In other words, given an original message o and the set of all possible responses R, we would like to find:

  r^* = \operatorname{argmax}_{r \in R} P(r \mid o)

To find this response, we will construct a model that can score responses and then find the highest scoring response. We will next describe how the model is formulated, trained, and used for inference. Then we will discuss the core challenges of bringing this model to produce high quality suggestions on a large scale.

3.1 LSTM model
Since we are scoring one sequence of tokens r, conditional on another sequence of tokens o, this problem is a natural fit for sequence-to-sequence learning [23]. The model itself is an LSTM. The input is the tokens of the original message {o_1, ..., o_n}, and the output is the conditional probability distribution of the sequence of response tokens given the input:

  P(r_1, ..., r_m \mid o_1, ..., o_n)

As in [23], this distribution can be factorized as:

  P(r_1, ..., r_m \mid o_1, ..., o_n) = \prod_{i=1}^{m} P(r_i \mid o_1, ..., o_n, r_1, ..., r_{i-1})

First, the sequence of original message tokens, including a special end-of-message token o_n, is read in, such that the LSTM's hidden state encodes a vector representation of the whole message. Then, given this hidden state, a softmax output is computed and interpreted as P(r_1 | o_1, ..., o_n), the probability distribution for the first response token. As response tokens are fed in, the softmax at each timestep t is interpreted as P(r_t | o_1, ..., o_n, r_1, ..., r_{t-1}). Given the factorization above, these softmaxes can be used to compute P(r_1, ..., r_m | o_1, ..., o_n).

Training. Given a large corpus of messages, the training objective is to maximize the log probability of observed responses, given their respective originals:

  \sum_{(o, r)} \log P(r_1, ..., r_m \mid o_1, ..., o_n)

We train against this objective using stochastic gradient descent with AdaGrad [8]. Ten epochs are run over a message corpus which will be described in Section 7.1. Due to the size of the corpus, training is run in a distributed fashion using the TensorFlow library [1].

Both our input and output vocabularies consist of the most frequent English words in our training data after preprocessing (steps described in Section 7.1). In addition to the standard LSTM formulation, we found that the addition of a recurrent projection layer [18] substantially improved both the quality of the converged model and the time to converge. We also found that gradient clipping (with the value of 1) was essential to stable training.

Inference. At inference time we can feed in an original message and then use the output of the softmaxes to get a probability distribution over the vocabulary at each timestep. These distributions can be used in a variety of ways:

1. To draw a random sample from the response distribution P(r_1, ..., r_m | o_1, ..., o_n). This can be done by sampling one token at each timestep and feeding it back into the model.
2. To approximate the most likely response, given the original message. This can be done greedily by taking the most likely token at each time step and feeding it back in. A less greedy strategy is to use a beam search, i.e., take the top b tokens and feed them in, then retain the b best response prefixes and repeat.
3. To determine the likelihood of a specific response candidate r ∈ R. This can be done by feeding in each token of the candidate and using the softmax output to get the likelihood of the next candidate token.
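To make the factorization and its uses concrete, the following sketch shows how per-step softmax outputs could be combined to score a candidate response and to decode greedily. This is an illustration only: the step function, which returns a next-token distribution given the tokens consumed so far, is a hypothetical stand-in for the trained LSTM, and the special tokens are assumptions of this sketch.

```python
import math
from typing import Callable, Dict, List, Sequence

# step(prefix) is assumed to wrap the trained model: it consumes the original
# message tokens plus any response tokens generated so far and returns the
# softmax distribution over the vocabulary for the next response token.
NextTokenDist = Callable[[Sequence[str]], Dict[str, float]]


def log_prob_of_response(original: List[str], response: List[str],
                         step: NextTokenDist) -> float:
    """log P(r_1..r_m | o_1..o_n) = sum_i log P(r_i | o, r_1..r_{i-1})."""
    prefix = list(original) + ["<eom>"]            # assumed end-of-message token
    total = 0.0
    for token in response + ["<eos>"]:             # assumed end-of-response token
        dist = step(prefix)
        total += math.log(dist.get(token, 1e-12))  # one factorized softmax term
        prefix.append(token)
    return total


def greedy_decode(original: List[str], step: NextTokenDist,
                  max_len: int = 10) -> List[str]:
    """Approximate the most likely response by taking the argmax at each step."""
    prefix = list(original) + ["<eom>"]
    response: List[str] = []
    for _ in range(max_len):
        dist = step(prefix)
        token = max(dist, key=dist.get)
        if token == "<eos>":
            break
        response.append(token)
        prefix.append(token)
    return response
```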

Table 1 shows some examples of generating the approximate most likely responses using a beam search.

Table 1: Generated response examples.
Query: "Hi, I thought it would be great for us to sit down and chat. I am free Tuesday and Wednesday. Can you do either of those days? Thanks! - Alice"
Top generated responses:
  I can do Tuesday.
  I can do Wednesday.
  How about Tuesday?
  I can do Tuesday!
  I can do Tuesday. What time works for you?
  I can do Wednesday!
  I can do Tuesday or Wednesday.
  How about Wednesday?
  I can do Wednesday. What time works for you?
  I can do either.

3.2 Challenges
As described thus far, the model can generate coherent and plausible responses given an incoming email message. However, several key challenges arise when bringing this model into production.

Response quality. In order to surface responses to users, we need to ensure that they are always high quality in style, tone, diction, and content. Given that the model is trained on a corpus of real messages, we have to account for the possibility that the most probable response is not necessarily a high quality response. Even a response that occurs frequently in our corpus may not be appropriate to surface back to users. For example, it could contain poor grammar, spelling, or mechanics (your the best!); it could convey a familiarity that is likely to be jarring or offensive in many situations (thanks hon!); it could be too informal to be consistent with other Inbox intelligence features (yup, got it thx); or it could convey a sentiment that is politically incorrect, offensive, or otherwise inappropriate (Leave me alone). While restricting the model vocabulary might address simple cases such as profanity and spelling errors, it would not be sufficient to capture the wide variability with which, for example, politically incorrect statements can be made. Instead, we use semi-supervised learning (described in detail in Section 4) to construct a target response space R comprising only high quality responses. Then we use the model described here to choose the best response in R, rather than the best response from any sequence of words in its vocabulary.

Utility. Our user studies showed that suggestions are most useful when they are highly specific to the original message and express diverse intents. However, as column 1 in Table 2 shows, the raw output of the model tends to (1) favor common but unspecific responses and (2) have little diversity. First, to improve the specificity of responses, we apply some light normalization that penalizes responses which are applicable to a broad range of incoming messages. The results of this normalization can be seen in column 2 of Table 2. For example, the very generic "Yes!" has fallen out of the top ten. Second, to increase the breadth of options shown to the user, we enforce diversity by exploiting the semantic structure of R, as we will discuss in Section 5. The results of this are also shown at the bottom of Table 2. We further improve the utility of suggestions by first passing each message through a triggering model (described in Section 6) that determines whether suggestions should be generated at all. This reduces the likelihood that we show suggestions when they would not be used anyway.

Scalability. Our model needs to be deployed in a production setting and cannot introduce latency to the process of email delivery, so scalability is critical. Exhaustively scoring every response candidate r ∈ R would require O(|R| l) LSTM steps, where l is the length of the longest response. In previous work [23], this could be afforded because computations were performed in a batch process offline. However, in the context of an email delivery pipeline, time is a much more precious resource. Furthermore, given the tremendous diversity with which people communicate and the large number of email scenarios we would like to cover, we can assume that R is very large and only expected to grow over time. For example, in a uniform sample of 10 million short responses (say, responses with at most 10 tokens), more than 40% occur only once. Therefore, rather than performing an exhaustive scoring of every candidate r ∈ R, we would like to efficiently search for the best responses such that complexity is not a function of |R|.

Our search is conducted as follows. First, the elements of R are organized into a trie. Then, we conduct a left-to-right beam search, but only retain hypotheses that appear in the trie. This search process has complexity O(bl) for beam size b and maximum response length l. Both b and l are typically in the range of 10-30, so this method dramatically reduces the time to find the top responses and is a critical element of making this system deployable. In terms of quality, we find that, although this search only approximates the best responses in R, its results are very similar to what we would get by scoring and ranking all r ∈ R, even for small b. At b = 128, for example, the top scoring response found by this process matches the true top scoring response 99% of the time. Results for various beam sizes are shown in Figure 3. Additionally, requiring that each message first pass through a triggering model, as mentioned above, has the added benefit of reducing the total amount of LSTM computation.
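The trie-restricted beam search just described can be sketched as follows. The step function is the same hypothetical wrapper around the LSTM softmax used in the earlier sketch, and the trie is a plain nested dictionary built from the response set R; this is an illustration, not the production implementation.

```python
import math
from typing import List, Tuple

END = "<eos>"   # assumed end-of-response marker


def build_trie(response_set: List[List[str]]) -> dict:
    """Organize the elements of R (token lists) into a trie."""
    root: dict = {}
    for tokens in response_set:
        node = root
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return root


def trie_beam_search(original: List[str], trie: dict, step,
                     b: int = 16, max_len: int = 30) -> List[Tuple[float, List[str]]]:
    """Left-to-right beam search that only retains prefixes present in the trie,
    so the cost is O(b * l) rather than O(|R| * l)."""
    # Each hypothesis: (log-prob so far, generated tokens, current trie node).
    beams = [(0.0, [], trie)]
    completed: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates = []
        for logp, tokens, node in beams:
            dist = step(list(original) + ["<eom>"] + tokens)
            for tok, child in node.items():          # only tokens allowed by the trie
                score = logp + math.log(dist.get(tok, 1e-12))
                if tok == END:
                    completed.append((score, tokens))
                else:
                    candidates.append((score, tokens + [tok], child))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:b]
    return sorted(completed, key=lambda c: c[0], reverse=True)[:b]
```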
Privacy. Note that all email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user. Also, only frequent words are retained. As a result, verifying the model's quality and debugging it is more complex.

Our solutions for the first three challenges are described further in Sections 4, 5, and 6.

Figure 3: Effectiveness of searching the response space R. For a sample of messages we compute the frequency with which the best candidate found by a beam search over R matches the best candidate found by exhaustively scoring all members of R, comparing various beam sizes (up to about 140). At a beam size of 16, these two methods find the same best response 93% of the time.

4. RESPONSE SET GENERATION
Two of the core challenges we face when building the end-to-end automated response system are response quality and utility. Response quality comes from suggesting "high quality" responses that deliver a positive user experience. Utility comes from ensuring that we don't suggest multiple responses that capture the same intent (for example, minor lexical variations such as "Yes, I'll be there." and "I will be there."). We can consider these two challenges jointly.

We first need to define a target response space that comprises high quality messages which can be surfaced as suggestions. The goal here is to generate a structured response set that effectively captures the various intents conveyed by people in natural language conversations. The target response space should capture both variability in language and intents. The result is used in two ways downstream: (a) to define a response space for scoring and selecting suggestions using the model described in Section 3, and (b) to promote diversity among chosen suggestions as discussed in Section 5.

We construct a response set using only the most frequent anonymized sentences aggregated from the preprocessed data (described in Section 7.1). This process yields a few million unique sentences.

4.1 Canonicalizing email responses
The first step is to automatically generate a set of canonical response messages that capture the variability in language. For example, responses such as "Thanks for your kind update.", "Thank you for updating!", "Thanks for the status update." may appear slightly different on the surface but in fact convey the same information. We parse each sentence using a dependency parser and use its syntactic structure to generate a canonicalized representation. Words (or phrases) that are modifiers or unattached to head words are ignored.

4.2 Semantic intent clustering
In the next step, we want to partition all response messages into "semantic" clusters where a cluster represents a meaningful response intent (for example, "thank you" type of response versus "sorry" versus "cannot make it"). All messages within a cluster share the same semantic meaning but may appear very different. For example, "Ha ha", "lol" and "Oh that's funny!" are associated with the funny cluster.

This step helps to automatically digest the entire information present in frequent responses into a coherent set of semantic clusters. If we were to build a semantic intent prediction model for this purpose, we would need access to a large corpus of sentences annotated with their corresponding semantic intents. However, this is neither readily available for our task nor available at this scale. Moreover, unlike typical machine learning classification tasks, the semantic intent space cannot be fully defined a priori. So instead, we model the task as a semi-supervised machine learning problem and use scalable graph algorithms [15] to automatically learn this information from data and a few human-provided examples.

4.3 Graph construction
We start with a few manually defined clusters sampled from the top frequent messages (e.g., thanks, i love you, sounds good). A small number of example responses are added as "seeds" for each cluster (for example, thanks -> "Thanks!", "Thank you."). In practice, we pick 100 clusters and on average 3-5 labeled seed examples per cluster.

We then construct a base graph with frequent response messages as nodes (V_R). For each response message, we further extract a set of lexical features (ngrams and skip-grams of length up to 3) and add these as "feature" nodes (V_F) to the same graph. Edges are created between a pair of nodes (u, v), where u ∈ V_R and v ∈ V_F, if v belongs to the feature set for response u. We follow the same process and create nodes for the manually labeled examples (V_L).

We make the observation that in some cases an incoming original message could potentially be treated as a response to another email depending on the context. For example, consider the following (original, response) message pairs:

  When should we meet? -> How about Friday?
  Let us get together soon. -> When should we meet?

Inter-message relations, as shown in the above example, can be modeled within the same framework by adding extra edges between the corresponding message nodes in the graph.
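As a simplified illustration of this construction, the sketch below builds a bipartite graph between response nodes and lexical feature nodes. It assumes whitespace tokenization and a reduced feature set (n-grams up to length 3 plus two-token skip-grams); the exact feature extraction and edge weighting used in production are not specified beyond the description above.

```python
from itertools import combinations
from typing import Dict, List, Set


def lexical_features(tokens: List[str], max_n: int = 3) -> Set[str]:
    """Simplified lexical features: n-grams up to length 3 plus 2-token skip-grams."""
    feats: Set[str] = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.add("ngram:" + " ".join(tokens[i:i + n]))
    for i, j in combinations(range(len(tokens)), 2):
        if j - i > 1:                                   # at least one token skipped
            feats.add("skip:" + tokens[i] + " _ " + tokens[j])
    return feats


def build_response_graph(responses: List[str]) -> Dict[str, Set[str]]:
    """Bipartite adjacency: response nodes (V_R) connected to feature nodes (V_F)."""
    edges: Dict[str, Set[str]] = {}
    for resp in responses:
        feats = lexical_features(resp.lower().split())
        edges["resp:" + resp] = feats
        for f in feats:
            edges.setdefault(f, set()).add("resp:" + resp)
    return edges


# Example usage on a few frequent responses:
graph = build_response_graph(["Thanks for the update.", "Thank you for updating!"])
```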

Figure 4: Semantic clustering of response messages.

4.4 Semi-supervised learning
The constructed graph captures relationships between similar canonicalized responses via the feature nodes. Next, we learn a semantic labeling for all response nodes by propagating semantic intent information from the manually labeled examples through the graph. We treat this as a semi-supervised learning problem and use the distributed EXPANDER [15] framework for optimization. The learning framework is scalable and naturally suited for semi-supervised graph propagation tasks such as the semantic clustering problem described here. We minimize the following objective function for response nodes in the graph:

  s_i \lVert \hat{C}_i - C_i \rVert^2 + \mu_{pp} \lVert \hat{C}_i - U \rVert^2 + \mu_{np} \Big( \sum_{j \in \mathcal{N}_F(i)} w_{ij} \lVert \hat{C}_i - \hat{C}_j \rVert^2 + \sum_{k \in \mathcal{N}_R(i)} w_{ik} \lVert \hat{C}_i - \hat{C}_k \rVert^2 \Big)    (1)

where s_i is an indicator function equal to 1 if the node i is a seed and 0 otherwise, \hat{C}_i is the learned semantic cluster distribution for response node i, C_i is the true label distribution (i.e., for manually provided examples), \mathcal{N}_F(i) and \mathcal{N}_R(i) represent the feature and message neighborhood of node i, \mu_{np} is the predefined penalty for neighboring nodes with divergent label distributions, \hat{C}_j is the learned label distribution for feature neighbor j, w_{ij} is the weight of feature j in response i, and \mu_{pp} is the penalty for a label distribution deviating from the prior, a uniform distribution U.

The objective function for a feature node is similar, except that there is no first term, as there are no seed labels for feature nodes:

  \mu_{pp} \lVert \hat{C}_i - U \rVert^2 + \mu_{np} \sum_{j \in \mathcal{N}(i)} w_{ij} \lVert \hat{C}_i - \hat{C}_j \rVert^2    (2)

The objective function is jointly optimized for all nodes in the graph.

The output from EXPANDER is a learned distribution of semantic labels for every node in the graph. We assign the top scoring output label as the semantic intent for the node; labels with low scores are filtered out. Figure 4 illustrates this process.

To discover new clusters which are not covered by the labeled examples, we run the semi-supervised learning algorithm in repeated phases. In the first phase, we run the label propagation algorithm for 5 iterations. We then fix the cluster assignment and randomly sample 100 new responses from the remaining unlabeled nodes in the graph. The sampled nodes are treated as potential new clusters and labeled with their canonicalized representation. We rerun label propagation with the new labeled set of clusters and repeat this procedure until convergence (i.e., until no new clusters are discovered and the members of a cluster do not change between iterations). This iterative propagation method thereby allows us both to expand cluster membership and to discover (up to 5X) new clusters, where each cluster has an interpretable semantic interpretation.
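Objectives (1) and (2) are optimized with the distributed EXPANDER framework; as a rough single-machine illustration only, the toy update below nudges each node's label distribution toward a weighted blend of its neighbors, its seed label (if any), and the uniform prior. It is a simplification for exposition, not the streaming approximation of [15] or the production algorithm, and its hyperparameter defaults are arbitrary.

```python
from typing import Dict, List

Dist = Dict[str, float]   # cluster label -> probability


def propagate(neighbors: Dict[str, Dict[str, float]],   # node -> {neighbor: weight w_ij}
              seeds: Dict[str, Dist],                    # seed nodes -> true label dist C_i
              labels: List[str],
              mu_np: float = 1.0, mu_pp: float = 0.1,
              iterations: int = 5) -> Dict[str, Dist]:
    """Toy iterative update for objectives (1)-(2): each learned distribution
    C_hat_i becomes a normalized weighted average of seed, neighbors, and prior.
    `neighbors` must contain every node (response and feature) as a key."""
    uniform = {l: 1.0 / len(labels) for l in labels}
    c_hat: Dict[str, Dist] = {n: dict(seeds.get(n, uniform)) for n in neighbors}
    for _ in range(iterations):
        updated: Dict[str, Dist] = {}
        for node, nbrs in neighbors.items():
            agg = {l: mu_pp * uniform[l] for l in labels}      # prior term
            total_weight = mu_pp
            if node in seeds:                                  # seed term (s_i = 1)
                for l in labels:
                    agg[l] += seeds[node].get(l, 0.0)
                total_weight += 1.0
            for nbr, w in nbrs.items():                        # neighbor terms
                for l in labels:
                    agg[l] += mu_np * w * c_hat[nbr].get(l, 0.0)
                total_weight += mu_np * w
            updated[node] = {l: v / total_weight for l, v in agg.items()}
        c_hat = updated
    return c_hat
```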
4.5 Cluster Validation
Finally, we extract the top k members of each semantic cluster, sorted by their label scores. The set of (response, cluster label) pairs is then validated by human raters. The raters are provided with a response R_i and a corresponding cluster label C (e.g., thanks), as well as a few example responses belonging to the cluster (e.g., "Thanks!", "Thank you."), and are asked whether R_i belongs to C.

The result is an automatically generated and validated set of high quality response messages labeled with semantic intent. This is subsequently used by the response scoring model to search for approximate best responses to an incoming email (described earlier in Section 3) and further to enforce diversity among the top responses chosen (Section 5).

5. SUGGESTION DIVERSITY
As discussed in Section 3, the LSTM first processes an incoming message and then selects the (approximate) best responses from the target response set created using the method described in Section 4. Recall that we follow this with some light normalization to penalize responses that may be too general to be valuable to the user. The effect of this normalization can be seen by comparing columns 1 and 2 of Table 2. For example, the very generic "Yes!" falls out of the top ten responses.

Next, we need to choose a small number of options to show the user. A straightforward approach would be to just choose the N top responses and present them to the user. However, as column 2 of Table 2 shows, these responses tend to be very similar. The likelihood of at least one response being useful is greatest when the response options are not redundant, so it would be wasteful to present the user with three variations of, say, I'll be there. The job of the diversity component is to select a more varied set of suggestions using two strategies: omitting redundant responses and enforcing negative or positive responses.

Table 2: Different response rankings for the message "Can you join tomorrow's meeting?"

  Unnormalized responses | Normalized responses
  Sure, I'll be there.   | Yes, I'll be there.
  Yes, I will be there.  | Yes, I can.
  Yes, I can be there.   | I'll be there.
  Yes, I can.            | Yes, I'll be there.
  Sure, I can be there.  | What time?
  Yeah, I can.           | I'll be there!
  Yeah, I'll be there.   | I will be there.
  Sure, I can.           | Sure, I'll be there.
  Yes, I can be there.   | Yes. I can.
  Yes!                   | Yes, I will be there.

  Normalized negative responses:
  Sorry, I won't be able to make it tomorrow.
  Unfortunately I can't.
  Sorry, I won't be able to join you.
  Sorry, I can't make it tomorrow.
  No, I can't.
  Sorry, I won't be able to make it today.
  Sorry, I can't.
  I will not be available tomorrow.
  I won't be available tomorrow.
  Unfortunately, I can't.

  Final suggestions:
  Sure, I'll be there.
  Yes, I can.
  Sorry, I won't be able to make it tomorrow.

5.1 Omitting Redundant Responses
This strategy assumes that the user should never see two responses of the same intent. An intent can be thought of as a cluster of responses that have a common communication purpose, e.g. confirming, asking for time or rejecting participation. In Smart Reply, every target response is associated with exactly one intent. Intents are defined based on automatic clustering followed by human validation, as discussed in Section 4.

The actual diversity strategy is simple: the top responses are iterated over in order of decreasing score. Each response is added to the list of suggestions, unless its intent is already covered by a response on the suggestion list. The resulting list contains only the highest-scored representative of each intent, and these representatives are ordered by decreasing score.

5.2 Enforcing Negatives and Positives
We have observed that the LSTM has a strong tendency towards producing positive responses, whereas negative responses such as I can't make it or I don't think so typically receive low scores. We believe that this tendency reflects the style of email conversations: positive replies may be more common, and where negative responses are appropriate, users may prefer a less direct wording.

Nevertheless, we think that it is important to offer negative suggestions in order to give the user a real choice. This policy is implemented through the following strategy:

  If the top two responses (after omitting redundant responses) contain at least one positive response and none of the top three responses are negative, the third response is replaced with a negative one.

A positive response is one which is clearly affirmative, e.g. one that starts with Yes, Sure or Of course. In order to find the negative response to be included as the third suggestion, a second LSTM pass is performed. In this second pass, the search is restricted to only the negative responses in the target set (refer to Table 2 for scored negative response examples). This is necessary since the top responses produced in the first pass may not contain any negatives. Even though missing negatives are more common, there are also cases in which an incoming message triggers exclusively negative responses. In this situation, we employ an analogous strategy for enforcing a positive response.

The final set of top scoring responses (bottom row in Table 2) is then presented to the user as suggestions.
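The two strategies above can be condensed into a short sketch: iterate over candidates in score order, keep at most one per intent, and swap a top-scoring negative into the third slot when the policy calls for it. The Candidate fields, helper flags, and the top_negative callback (standing in for the second, negatives-only LSTM pass) are illustrative assumptions, not the production API.

```python
from typing import Callable, List, NamedTuple, Optional


class Candidate(NamedTuple):
    text: str
    intent: str      # semantic cluster label (Section 4)
    score: float
    positive: bool   # clearly affirmative, e.g. starts with "Yes", "Sure", "Of course"
    negative: bool   # e.g. "Sorry, I can't make it."


def select_diverse(candidates: List[Candidate], k: int = 3,
                   top_negative: Optional[Callable[[], Optional[Candidate]]] = None
                   ) -> List[Candidate]:
    """Pick up to k suggestions: one per intent, enforcing a negative in slot 3
    when the top two are positive and none of the top three is negative."""
    chosen: List[Candidate] = []
    seen_intents = set()
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if cand.intent in seen_intents:
            continue                        # omit redundant responses
        chosen.append(cand)
        seen_intents.add(cand.intent)
        if len(chosen) == k:
            break
    has_positive = any(c.positive for c in chosen[:2])
    has_negative = any(c.negative for c in chosen[:k])
    if has_positive and not has_negative and top_negative is not None:
        neg = top_negative()                # second pass restricted to negatives
        if neg is not None and len(chosen) == k:
            chosen[-1] = neg                # replace the third suggestion
    return chosen
```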
6. TRIGGERING
The triggering module is the entry point of the Smart Reply system. It is responsible for filtering messages that are bad candidates for suggesting responses. This includes emails for which short replies are not appropriate (e.g., those containing open-ended questions or sensitive topics), as well as emails for which no reply is necessary at all (e.g., promotional emails and auto-generated updates).

The module is applied to every incoming email just after the preprocessing step. If the decision is negative, we finish the execution and do not show any suggestions (see Figure 2). Currently, the system decides to produce a Smart Reply for roughly 11% of messages, so this process vastly reduces the number of useless suggestions seen by the users. An additional benefit is that it decreases the number of calls to the more expensive LSTM inference, which translates into smaller infrastructure cost.

There are two main requirements for the design of the triggering component. First, it has to be good enough to figure out cases where a response is not expected. Note that this is a very different goal than just scoring a set of responses. For instance, we could propose several valid replies to a newsletter containing the sentence "Where do you want to go today?", but most likely all of the responses would be useless for our users. Second, it has to be fast: it processes hundreds of millions of messages daily, so we aim to process each message within milliseconds.

The main part of the triggering component is a feedforward neural network which produces a probability score for every incoming message. If the score is above some threshold, we trigger and run the LSTM scoring. We have adopted this approach because feedforward networks have repeatedly been shown to outperform linear models such as SVM or linear regression on various NLP tasks (see for example [9]).

6.1 Data and Features
In order to label our training corpus of emails, we use as positive examples those emails that have been responded to. More precisely, out of the data set described in Section 7.1, we create a training set that consists of pairs (o, y), where o is an incoming message and y ∈ {true, false} is a boolean label, which is true if the message had a response and false otherwise. For the positive class, we consider only messages that were replied to from a mobile device, while for the negative class we use a subset of all messages.

We downsample the negative class to balance the training set. Our goal is to model P(y = true | o), the probability that message o will have a response on mobile.

After preprocessing (described in Section 7.1), we extract content features (e.g. unigrams, bigrams) from the message body, subject and headers. We also use various social signals, like whether the sender is in the recipient's address book, whether the sender is in the recipient's social network and whether the recipient has responded to this sender in the past.

6.2 Network Architecture and Training
We use a feedforward multilayer perceptron with an embedding layer (for a vocabulary of roughly one million words) and three fully connected hidden layers. We use feature hashing to bucket rare words that are not present in the vocabulary. The embeddings are separate for each sparse feature type (e.g. unigram, bigram), and within one feature type we aggregate embeddings by summing them up. Then, all sparse feature embeddings are concatenated with each other and with the vector of dense features (the real numbers and boolean values mentioned in Section 6.1).

We use the ReLU [13] activation function for non-linearity between layers. A dropout [22] layer is applied after each hidden layer. We train the model using the AdaGrad [8] optimization algorithm with a logistic loss cost function. Similarly to the LSTM, the training is run in a distributed fashion using the TensorFlow library [1].
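A comparable architecture can be written in a few lines of Keras, shown below purely as an illustration. The original implementation used TensorFlow directly, and the layer sizes, embedding dimension, dense-feature count, dropout rate and learning rate here are placeholders, not the production values; hashing of rare words is assumed to happen upstream, and the separate per-feature-type embeddings are collapsed into one for brevity.

```python
import tensorflow as tf

VOCAB_SIZE = 1_000_000    # "roughly one million words" (Section 6.2)
EMBED_DIM = 64            # placeholder: layer sizes are not published
NUM_DENSE_FEATURES = 8    # placeholder count of dense social signals (Section 6.1)

# Sparse text features: a bag of hashed unigram/bigram ids, embedded and summed.
token_ids = tf.keras.Input(shape=(None,), dtype="int32", name="hashed_token_ids")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(token_ids)
summed = tf.keras.layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(embedded)

# Dense features: real-valued and boolean social signals.
dense_feats = tf.keras.Input(shape=(NUM_DENSE_FEATURES,), name="dense_features")

x = tf.keras.layers.Concatenate()([summed, dense_feats])
for units in (256, 128, 64):                    # three hidden layers; sizes assumed
    x = tf.keras.layers.Dense(units, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)         # dropout after each hidden layer
prob = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # P(y = true | o)

trigger_model = tf.keras.Model([token_ids, dense_feats], prob)
trigger_model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
                      loss="binary_crossentropy")           # logistic loss
```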
The perplexity on a set of N 7.1 Data samples is computed using the following formula: To generate the training data for all Smart Reply models from sampled accounts, we extracted all pairs of an incom- N ∑ 1 i i i i ˆ ing message and the user’s response to that message. For r | P ( ln( ,...,r o ,...,o ))) P − = exp( r 1 n 1 m W training the triggering model (see Section 6), we addition- =1 i ally sampled a number of incoming personal messages which ˆ where W is the total number of words in all N samples, P the user didn’t reply to. At the beginning of Smart Reply i i -th response is the learned distribution and are the r , o i pipeline (Figure 2), data is preprocessed in the following and original message. Note that in the equation above only way: . The perplexity of the response terms are factored into P r . Smart Reply LSTM is 17 0. By comparison, an n-grams Language detection The language of the message is iden- language model with Katz backoff [11] and a maximum or- tified and non-English messages are discarded. 4 on the same data (again, . der of 5 has a perplexity of 31 Tokenization Subject and message body are broken into computed only from response terms). words and punctuation marks. Sentences boundaries are identi- Sentence segmentation Response ranking. fied in the message body. While perplexity is a quality indicator, it does not actually Infrequent words and entities like personal Normalization measure performance at the scoring task we are ultimately names, URLs, email addresses, phone numbers etc. are interested in. In particular, it does not take into account replaced by special tokens. the constraint of choosing a response in R . Therefore we also evaluate the model on a response ranking task: for each Quotation removal Quoted original messages and forwarded of N test message pairs ( o,r ) for which r ∈ R , we compute messages are removed. ), where o | w ( -th element i is the s = P w r | o ) and ∀ ( x P = i i i i Salutation/close removal Hi John Salutations like and } in descending of ,...,x . Then we sort the set R = { s,x R 1 N closes such as are removed. Best regards, Mary order. Finally, we define rank s = argmin = ( R R | ). Put j j i j simply, we are finding the rank of the actual response with After the preprocessing steps, the size of the training set . R respect to all elements in is 238 million messages, which include 153 million messages that have no response.

Using this value, we can compute the Mean Reciprocal Rank:

  MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}

Additionally, we can compute Precision@K. For a given value of K, it is computed as the number of cases for which the target response r was within the top K responses ranked by the model.
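From the per-example ranks, both metrics follow directly; a small illustrative sketch (with made-up example ranks):

```python
from typing import List


def mean_reciprocal_rank(ranks: List[int]) -> float:
    """MRR = (1/N) * sum_i 1 / rank_i."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def precision_at_k(ranks: List[int], k: int) -> float:
    """Fraction of cases where the target response was ranked within the top K."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Example: ranks of the true response for three hypothetical test pairs.
print(mean_reciprocal_rank([1, 4, 10]))   # 0.45
print(precision_at_k([1, 4, 10], k=5))    # 0.666...
```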
We compare the Smart Reply response selection model to three baselines on the same ranking task. The Random baseline ranks R randomly. The Frequency baseline ranks the elements of R in order of their frequency in the training corpus; this baseline captures the extent to which we can simply suggest highly frequent responses without regard for the contents of the original message. The Multiclass-BOW baseline ranks R using a feedforward neural network whose input is the original message, represented with bag of words features, and whose output is a distribution over the elements of R (a softmax).

Table 3: Response ranking.
  Model          | P@10    | P@50    | MRR
  Random         | 5.58e-4 | 1.12e-3 | 3.64e-4
  Frequency      | 0.321   | 0.368   | 0.155
  Multiclass-BOW | 0.345   | 0.425   | 0.197
  Smart Reply    | 0.483   | 0.579   | 0.267

As shown in Table 3, the Smart Reply LSTM significantly improves on the Frequency baseline, demonstrating that conditioning on the original message is effective; the model successfully extracts information from the original message and uses it to rank responses more accurately.

It also significantly outperforms the Multiclass-BOW baseline. There are a few possible explanations for this. First, the recurrent architecture allows the model to learn more sophisticated language understanding than bag of words features. Second, when we pose this as a multiclass prediction problem, we can only train on messages whose response is in R, a small fraction of our data. The sequence-to-sequence framework, on the other hand, allows us to take advantage of all the data in our corpus: the model can learn a lot about original-response relationships even when the response does not appear in R exactly.

Note that an added disadvantage of the multiclass formulation is that it tightly couples the training of the model to the construction of R. We expect R to grow over time, given the incredible diversity with which people communicate. While a simpler application such as chat might only need a small number of possible responses, we find that for email we will need a tremendous number of possible suggestions to really address users' needs.

7.2.3 Diversity results
We justify the need for both the diversity component and a sizable response space R by reporting statistics around unique suggestions and clusters in Table 4. The Smart Reply system generates 12.9k unique suggestions daily that belong to 376 unique semantic clusters. Out of those, people decide to use 4,115, or 31.9%, of the unique suggestions and 313, or 83.2%, of the unique clusters. Note, however, that many suggestions are never seen, as the Seen column of Table 4 shows: the user may not open an email, may use the web interface instead of mobile, or may simply not scroll down to the bottom of the message. Also, only one of the three displayed suggestions will be selected by the user. These statistics demonstrate the need to go well beyond a simple system with 5 or 10 canned responses.

Table 4: Unique cluster/suggestion usage per day.
                       | Daily Count | Seen  | Used
  Unique Clusters      | 376         | 97.1% | 83.2%
  Unique Suggestions   | 12.9k       | 78%   | 31.9%

Figure 5 and Figure 6 present, respectively, the distribution of the rank for suggested responses and the distribution of suggested clusters. The tail of the cluster distribution is long, which explains the poor performance of the Frequency baseline described in Section 7.2.2.

Figure 5: Usage distribution for top suggested responses (relative usage, as a percentage of the total usage count for all suggested responses, versus the rank of used responses; log scale).

Figure 6: Usage distribution for semantic clusters corresponding to top suggested responses (relative usage, as a percentage of the total usage count for semantic clusters, for clusters such as thanks, sounds good, i'll be there, looks good, will do, no problem, see you, thanks for the update; log scale).

We also measured how Smart Reply suggestions are used based on their location on the screen. Recall that Smart Reply always presents 3 suggestions, where the first suggestion is the top one. We observed that, out of all used suggestions, 45% were from the 1st position, 35% from the 2nd position and 20% from the 3rd position. Since the third position is usually used for diverse responses, we conclude that the diversity component is crucial for the system quality.

Finally, we measured the impact of enforcing a diverse set of responses (e.g., by not showing two responses from the same semantic cluster) on user engagement: when we completely disabled the diversity component and simply suggested the three responses with the highest scores, the click-through rate decreased by roughly 7.5% relative.

8. CONCLUSIONS
We presented Smart Reply, a novel end-to-end system for automatically generating short, complete email responses. The core of the system is a state-of-the-art deep LSTM model that can predict full responses, given an incoming email message. To successfully deploy this system in Inbox by Gmail, we addressed several challenges:

• We ensure that individual response options deliver quality by selecting them from a carefully constructed response space. The responses are identified by a novel method for semantic clustering.
• We increase the total utility of our chosen combination of suggestions by enforcing diversity among them, and by filtering traffic for which suggestions would not be useful.

• We build a scalable system by efficiently searching the target response space.

Our clearest metric of success is the fact that 10% of mobile replies in Inbox are now composed with assistance from the Smart Reply system. Furthermore, we have designed the system in such a way that it is easily extendable to address additional user needs; for instance, the architecture of our core response scoring model is language agnostic and therefore accommodates extension to other languages in addition to English.

9. ACKNOWLEDGMENTS
The authors would like to thank Oriol Vinyals and Ilya Sutskever for many helpful discussions and insights, as well as Prabhakar Raghavan for his guidance and support.

10. REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. 2015.
[2] Ipsos Global Public Affairs. Interconnected world: Communication & social networking. Press release, March 2012. http://www.ipsos-na.com/news-polls/pressrelease.aspx?id=5564.
[3] Y. Artzi, P. Pantel, and M. Gamon. Predicting responses to microblog posts. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 602-606, Montréal, Canada, June 2012. Association for Computational Linguistics.
[4] L. Backstrom, J. Kleinberg, L. Lee, and C. Danescu-Niculescu-Mizil. Characterizing and curating conversation threads: Expansion, focus, volume, re-entry. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13), pages 13-22, 2013.
[5] Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, pages 193-216. MIT Press, 2006.
[6] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals. Listen, attend, and spell. CoRR, abs/1508.01211, 2015.
[7] Z. Chen, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh. Identifying intention posts in discussion forums. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1041-1050, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12, 2011.
[9] Y. Goldberg. A primer on neural network models for natural language processing. CoRR, abs/1510.00726, 2015.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[11] S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recogniser. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35:400-401, 1987.
[12] X. Li. Understanding the semantic structure of noun phrase queries. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1337-1345, Uppsala, Sweden, July 2010. Association for Computational Linguistics.
[13] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.
[14] B. Pang and S. Ravi. Revisiting the predictability of language: Response completion in social media. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1489-1499, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
[15] S. Ravi and Q. Diao. Large scale distributed semi-supervised learning using streaming approximation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
[16] A. Ritter, C. Cherry, and W. B. Dolan. Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, July 2011. Association for Computational Linguistics.
[17] R. Saha Roy, R. Katare, N. Ganguly, S. Laxman, and M. Choudhury. Discovering and understanding word level user intent in web search queries. Web Semantics, 30(C):22-38, Jan. 2015.
[18] H. Sak, A. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[19] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Hierarchical neural network generative models for movie dialogues. arXiv preprint arXiv:1507.04808, 2015.
[20] L. Shang, Z. Lu, and H. Li. Neural responding machine for short-text conversation. In Proceedings of ACL-IJCNLP, 2015.
[21] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. A neural network approach to context-sensitive generation of conversation responses. In Proceedings of NAACL-HLT, 2015.
[22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.
[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[24] O. Vinyals and Q. V. Le. A neural conversation model. In ICML Deep Learning Workshop, 2015.
[25] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[26] J. B. Wendt, M. Bendersky, L. Garcia-Pueyo, V. Josifovski, B. Miklos, I. Krka, A. Saikia, J. Yang, M.-A. Cartright, and S. Ravi. Hierarchical label propagation and discovery for machine generated email. In Proceedings of the International Conference on Web Search and Data Mining (WSDM), 2016.
[27] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the International Conference on Machine Learning (ICML), pages 912-919, 2003.
