Finding Deceptive Opinion Spam by Any Stretch of the Imagination

Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 309-319, Portland, Oregon, June 19-24, 2011. (c) 2011 Association for Computational Linguistics.

Myle Ott, Yejin Choi, Claire Cardie
Department of Computer Science, Cornell University, Ithaca, NY 14853
{myleott,ychoi,cardie}@cs.cornell.edu

Jeffrey T. Hancock
Department of Communication, Cornell University, Ithaca, NY 14853
[email protected]

Abstract

Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008). Consequently, websites containing consumer reviews are becoming targets of opinion spam. While recent work has focused primarily on manually identifiable instances of opinion spam, in this work we study deceptive opinion spam: fictitious opinions that have been deliberately written to sound authentic. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. Based on feature analysis of our learned models, we additionally make several theoretical contributions, including revealing a relationship between deceptive opinions and imaginative writing.

1 Introduction

With the ever-increasing popularity of review websites that feature user-generated opinions (e.g., TripAdvisor [1] and Yelp [2]), there comes an increasing potential for monetary gain through opinion spam: inappropriate or fraudulent reviews. Opinion spam can range from annoying self-promotion of an unrelated website or blog to deliberate review fraud, as in the recent case [3] of a Belkin employee who hired people to write positive reviews for an otherwise poorly reviewed product. [4]

While other kinds of spam have received considerable computational attention, regrettably there has been little work to date (see Section 2) on opinion spam detection. Furthermore, most previous work in the area has focused on the detection of DISRUPTIVE OPINION SPAM: uncontroversial instances of spam that are easily identified by a human reader, e.g., advertisements, questions, and other irrelevant or non-opinion text (Jindal and Liu, 2008). And while the presence of disruptive opinion spam is certainly a nuisance, the risk it poses to the user is minimal, since the user can always choose to ignore it.

We focus here on a potentially more insidious type of opinion spam: DECEPTIVE OPINION SPAM, fictitious opinions that have been deliberately written to sound authentic, in order to deceive the reader. For example, one of the following two hotel reviews is truthful and the other is deceptive opinion spam:

1. I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travellers and couples.

2. My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn't ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.

Footnotes:
[1] http://tripadvisor.com
[2] http://yelp.com
[3] http://news.cnet.com/8301-1001_3-10145399-92.html
[4] It is also possible for opinion spam to be negative, potentially in order to sully the reputation of a competitor.

Typically, these deceptive opinions are neither easily ignored nor even identifiable by a human reader; [5] consequently, there are few good sources of labeled data for this research. Indeed, in the absence of gold-standard data, related studies (see Section 2) have been forced to utilize ad hoc procedures for evaluation. In contrast, one contribution of the work presented here is the creation of the first large-scale, publicly available [6] dataset for deceptive opinion spam research, containing 400 truthful and 400 gold-standard deceptive reviews.

To obtain a deeper understanding of the nature of deceptive opinion spam, we explore the relative utility of three potentially complementary framings of our problem. Specifically, we view the task as: (a) a standard text categorization task, in which we use n-gram-based classifiers to label opinions as either deceptive or truthful (Joachims, 1998; Sebastiani, 2002); (b) an instance of psycholinguistic deception detection, in which we expect deceptive statements to exemplify the psychological effects of lying, such as increased negative emotion and psychological distancing (Hancock et al., 2008; Newman et al., 2003); and (c) a problem of genre identification, in which we view deceptive and truthful writing as sub-genres of imaginative and informative writing, respectively (Biber et al., 1999; Rayson et al., 2001).

We compare the performance of each approach on our novel dataset. Particularly, we find that machine learning classifiers trained on features traditionally employed in (a) psychological studies of deception and (b) genre identification are both outperformed at statistically significant levels by n-gram-based text categorization techniques. Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task. In contrast, we find deceptive opinion spam detection to be well beyond the capabilities of most human judges, who perform roughly at-chance, a finding that is consistent with decades of traditional deception detection research (Bond and DePaulo, 2006).

Additionally, we make several theoretical contributions based on an examination of the feature weights learned by our machine learning classifiers. Specifically, we shed light on an ongoing debate in the deception literature regarding the importance of considering the context and motivation of a deception, rather than simply identifying a universal set of deception cues. We also present findings that are consistent with recent work highlighting the difficulties that liars have encoding spatial information (Vrij et al., 2009). Lastly, our study of deceptive opinion spam detection as a genre identification problem reveals relationships between deceptive opinions and imaginative writing, and between truthful opinions and informative writing.

The rest of this paper is organized as follows: in Section 2, we summarize related work; in Section 3, we explain our methodology for gathering data and evaluate human performance; in Section 4, we describe the features and classifiers employed by our three automated detection approaches; in Section 5, we present and discuss experimental results; finally, conclusions and directions for future work are given in Section 6.

2 Related Work

Spam has historically been studied in the contexts of e-mail (Drucker et al., 2002) and the Web (Gyöngyi et al., 2004; Ntoulas et al., 2006). Recently, researchers have begun to look at opinion spam as well (Jindal and Liu, 2008; Wu et al., 2010; Yoo and Gretzel, 2009).

Jindal and Liu (2008) find that opinion spam is both widespread and different in nature from either e-mail or Web spam. Using product review data, and in the absence of gold-standard deceptive opinions, they train models using features based on the review text, reviewer, and product, to distinguish between duplicate opinions [7] (considered deceptive spam) and non-duplicate opinions (considered truthful). Wu et al. (2010) propose an alternative strategy for detecting deceptive opinion spam in the absence of gold-standard data, based on the distortion of popularity rankings.

Footnotes:
[5] The second example review is deceptive opinion spam.
[6] Available by request at: http://www.cs.cornell.edu/~myleott/op_spam
[7] Duplicate (or near-duplicate) opinions are opinions that appear more than once in the corpus with the same (or similar) text. While these opinions are likely to be deceptive, they are unlikely to be representative of deceptive opinion spam in general. Moreover, they are potentially detectable via off-the-shelf plagiarism detection software.

Both of these heuristic evaluation approaches are unnecessary in our work, since we compare gold-standard deceptive and truthful opinions.

Yoo and Gretzel (2009) gather 40 truthful and 42 deceptive hotel reviews and, using a standard statistical test, manually compare the psychologically relevant linguistic differences between them. In contrast, we create a much larger dataset of 800 opinions that we use to develop and evaluate automated deception classifiers.

Research has also been conducted on the related task of psycholinguistic deception detection. Newman et al. (2003), and later Mihalcea and Strapparava (2009), ask participants to give both their true and untrue views on personal issues (e.g., their stance on the death penalty). Zhou et al. (2004; 2008) consider computer-mediated deception in role-playing games designed to be played over instant messaging and e-mail. However, while these studies compare n-gram-based deception classifiers to a random guess baseline of 50%, we additionally evaluate and compare two other computational approaches (described in Section 4), as well as the performance of human judges (described in Section 3.3).

Lastly, automatic approaches to determining review quality have been studied, both directly (Weimer et al., 2007) and in the contexts of helpfulness (Danescu-Niculescu-Mizil et al., 2009; Kim et al., 2006; O'Mahony and Smyth, 2009) and credibility (Weerkamp and De Rijke, 2008). Unfortunately, most measures of quality employed in those works are based exclusively on human judgments, which we find in Section 3 to be poorly calibrated to detecting deceptive opinion spam.

3 Dataset Construction and Human Performance

While truthful opinions are ubiquitous online, deceptive opinions are difficult to obtain without resorting to heuristic methods (Jindal and Liu, 2008; Wu et al., 2010). In this section, we report our efforts to gather (and validate with human judgments) the first publicly available opinion spam dataset with gold-standard deceptive opinions.

Following the work of Yoo and Gretzel (2009), we compare truthful and deceptive positive reviews for hotels found on TripAdvisor. Specifically, we mine all 5-star truthful reviews from the 20 most popular hotels on TripAdvisor [8] in the Chicago area. [9] Deceptive opinions are gathered for those same 20 hotels using Amazon Mechanical Turk [10] (AMT). Below, we provide details of the collection methodologies for deceptive (Section 3.1) and truthful opinions (Section 3.2). Ultimately, we collect 20 truthful and 20 deceptive opinions for each of the 20 chosen hotels (800 opinions total).

3.1 Deceptive opinions via Mechanical Turk

Crowdsourcing services such as AMT have made large-scale data annotation and collection efforts financially affordable by granting anyone with basic programming skills access to a marketplace of anonymous online workers (known as Turkers) willing to complete small tasks.

To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITs) and allocate them evenly across our 20 chosen hotels. To ensure that opinions are written by unique authors, we allow only a single submission per Turker. We also restrict our task to Turkers who are located in the United States, and who maintain an approval rating of at least 90%. Turkers are allowed a maximum of 30 minutes to work on the HIT, and are paid one US dollar for an accepted submission.

Each HIT presents the Turker with the name and website of a hotel. The HIT instructions ask the Turker to assume that they work for the hotel's marketing department, and to pretend that their boss wants them to write a fake review (as if they were a customer) to be posted on a travel review website; additionally, the review needs to sound realistic and portray the hotel in a positive light. A disclaimer indicates that any submission found to be of insufficient quality (e.g., written for the wrong hotel, unintelligible, unreasonably short, [11] plagiarized, [12] etc.) will be rejected.

Footnotes:
[8] TripAdvisor utilizes a proprietary ranking system to assess hotel popularity. We chose the 20 hotels with the greatest number of reviews, irrespective of the TripAdvisor ranking.
[9] It has been hypothesized that popular offerings are less likely to become targets of deceptive opinion spam, since the relative impact of the spam in such cases is small (Jindal and Liu, 2008; Lim et al., 2010). By considering only the most popular hotels, we hope to minimize the risk of mining opinion spam and labeling it as truthful.
[10] http://mturk.com
[11] A submission is considered unreasonably short if it contains fewer than 150 characters.
[12] Submissions are individually checked for plagiarism at http://plagiarisma.net.
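The screening just described was largely manual in the paper; purely as an illustration, its mechanical parts can be expressed as a filter along the following lines. The Submission type, the is_plagiarized stub, and all names here are ours, not the authors'.

```python
# Minimal sketch of the submission quality gates described above.
from dataclasses import dataclass

MIN_CHARS = 150  # footnote 11: shorter submissions are "unreasonably short"

@dataclass
class Submission:
    worker_id: str
    hotel: str
    text: str

def is_plagiarized(text: str) -> bool:
    # Placeholder: the authors checked each submission manually at
    # http://plagiarisma.net; an automated equivalent would go here.
    return False

def accept(sub: Submission, target_hotel: str, seen_workers: set) -> bool:
    """Return True if a submission passes the paper's quality gates."""
    if sub.worker_id in seen_workers:   # one submission per Turker
        return False
    if sub.hotel != target_hotel:       # written for the wrong hotel
        return False
    if len(sub.text) < MIN_CHARS:       # unreasonably short
        return False
    if is_plagiarized(sub.text):
        return False
    seen_workers.add(sub.worker_id)
    return True
```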

It took approximately 14 days to collect 400 satisfactory deceptive opinions. Descriptive statistics appear in Table 1. Submissions vary quite dramatically both in length and in time spent on the task. Particularly, nearly 12% of the submissions were completed in under one minute. Surprisingly, an independent two-tailed t-test between the mean length of these submissions (mean length for t < 1) and the other submissions (mean length for t >= 1) reveals no significant difference (p = 0.83). We suspect that these "quick" users may have started working prior to having formally accepted the HIT, presumably to circumvent the imposed time limit. Indeed, the quickest submission took just 5 seconds and contained 114 words.

Table 1: Descriptive statistics for the 400 deceptive opinion spam submissions gathered using AMT; s corresponds to the sample standard deviation.

Group               count   Statistic                min    max     mean     s
All submissions     400     Time spent t (minutes)   0.08   29.78   8.06     6.32
All submissions     400     Length l (words)         25     425     115.75   61.30
Time spent t < 1    47      Length l (words)         39     407     113.94   66.24
Time spent t >= 1   353     Length l (words)         25     425     115.99   60.71

3.2 Truthful opinions from TripAdvisor

For truthful opinions, we mine all 6,977 reviews from the 20 most popular Chicago hotels on TripAdvisor. From these we eliminate:

- 3,130 non-5-star reviews;
- 41 non-English reviews; [13]
- 75 reviews with fewer than 150 characters since, by construction, deceptive opinions are at least 150 characters long (see footnote 11 in Section 3.1);
- 1,607 reviews written by first-time authors (new users who have not previously posted an opinion on TripAdvisor), since these opinions are more likely to contain opinion spam, which would reduce the integrity of our truthful review data (Wu et al., 2010).

Finally, we balance the number of truthful and deceptive opinions by selecting 400 of the remaining 2,124 truthful reviews, such that the document lengths of the selected truthful reviews are similarly distributed to those of the deceptive reviews. Work by Serrano et al. (2009) suggests that a log-normal distribution is appropriate for modeling document lengths. Thus, for each of the 20 chosen hotels, we select 20 truthful reviews from a log-normal (left-truncated at 150 characters) distribution fit to the lengths of the deceptive reviews. [14] Combined with the 400 deceptive reviews gathered in Section 3.1, this yields our final dataset of 800 reviews.

Footnotes:
[13] Language is determined using http://tagthe.net.
[14] We use the R package GAMLSS (Rigby and Stasinopoulos, 2005) to fit the left-truncated log-normal distribution.
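The length-matching selection above was done in R with the GAMLSS package (footnote 14). As a rough sketch of the same idea in Python: fit a log-normal distribution to the deceptive review lengths with scipy, left-truncate it at 150 by rejection sampling, and pick the closest-length truthful reviews. All function and variable names here are illustrative, not from the paper.

```python
# Simplified sketch: sample target lengths from a log-normal fit to the
# deceptive reviews (left-truncated at 150), then greedily pick the
# truthful review closest in length to each sampled target.
import numpy as np
from scipy import stats

def select_truthful(truthful_lengths, deceptive_lengths, k=20, min_len=150, seed=0):
    rng = np.random.default_rng(seed)
    # Fit a log-normal to the deceptive lengths (location fixed at 0).
    shape, loc, scale = stats.lognorm.fit(deceptive_lengths, floc=0)
    dist = stats.lognorm(shape, loc=loc, scale=scale)

    chosen, available = [], list(range(len(truthful_lengths)))
    for _ in range(k):
        # Rejection-sample to left-truncate the distribution at min_len.
        target = 0.0
        while target < min_len:
            target = dist.rvs(random_state=rng)
        # Take the as-yet-unchosen truthful review closest to the target.
        best = min(available, key=lambda i: abs(truthful_lengths[i] - target))
        available.remove(best)
        chosen.append(best)
    return chosen
```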

3.3 Human performance

Assessing human deception detection performance is important for several reasons. First, there are few other baselines for our classification task; indeed, related studies (Jindal and Liu, 2008; Mihalcea and Strapparava, 2009) have only considered a random guess baseline. Second, assessing human performance is necessary to validate the deceptive opinions gathered in Section 3.1. If human performance is low, then our deceptive opinions are convincing, and therefore, deserving of further attention.

Our initial approach to assessing human performance on this task was with Mechanical Turk. Unfortunately, we found that some Turkers selected among the choices seemingly at random, presumably to maximize their hourly earnings by obviating the need to read the review. While a similar effect has been observed previously (Akkaya et al., 2010), there remains no universal solution.

Instead, we solicit the help of three volunteer undergraduate university students to make judgments on a subset of our data. This balanced subset, corresponding to the first fold of our cross-validation experiments described in Section 5, contains all 40 reviews from each of four randomly chosen hotels. Unlike the Turkers, our student volunteers are not offered a monetary reward. Consequently, we consider their judgements to be more honest than those obtained via AMT.

Additionally, to test the extent to which the individual human judges are biased, we evaluate the performance of two virtual meta-judges. Specifically, the MAJORITY meta-judge predicts "deceptive" when at least two out of three human judges believe the review to be deceptive, and the SKEPTIC meta-judge predicts "deceptive" when any human judge believes the review to be deceptive.

Human and meta-judge performance is given in Table 2. It is clear from the results that human judges are not particularly effective at this task. Indeed, a two-tailed binomial test fails to reject the null hypothesis that JUDGE 2 and JUDGE 3 perform at-chance (p = 0.003, 0.10, 0.48 for the three judges, respectively). Furthermore, all three judges suffer from truth-bias (Vrij, 2008), a common finding in deception detection research in which human judges are more likely to classify an opinion as truthful than deceptive. In fact, JUDGE 2 classified fewer than 12% of the opinions as deceptive! Interestingly, this bias is effectively smoothed by the SKEPTIC meta-judge, which produces nearly perfectly class-balanced predictions. A subsequent reevaluation of human performance on this task suggests that the truth-bias can be reduced if judges are given the class-proportions in advance, although such prior knowledge is unrealistic; and ultimately, performance remains similar to that of Table 2.

Table 2: Performance of three human judges and two meta-judges on a subset of 160 opinions, corresponding to the first fold of our cross-validation experiments in Section 5. The largest value in each column is marked with an asterisk.

                      TRUTHFUL                 DECEPTIVE
          Accuracy    P      R      F          P      R      F
JUDGE 1   61.9%*      57.9   87.5   69.7*      74.4   36.3   48.7
JUDGE 2   56.9%       53.9   95.0*  68.8       78.9*  18.8   30.3
JUDGE 3   53.1%       52.3   70.0   59.9       54.7   36.3   43.6

MAJORITY  58.1%       54.8   92.5   68.8       76.0   23.8   36.2
SKEPTIC   60.6%       60.4*  61.3   60.8       60.9   60.0*  60.5*

Inter-annotator agreement among the three judges, computed using Fleiss' kappa, is 0.11. While there is no precise rule for interpreting kappa scores, Landis and Koch (1977) suggest that scores in the range (0.00, 0.20] correspond to "slight agreement" between annotators. The largest pairwise Cohen's kappa is 0.12, between JUDGE 2 and JUDGE 3, a value far below generally accepted pairwise agreement levels. We suspect that agreement among our human judges is so low precisely because humans are poor judges of deception (Vrij, 2008), and therefore they perform nearly at-chance respective to one another.

4 Automated Approaches to Deceptive Opinion Spam Detection

We consider three automated approaches to detecting deceptive opinion spam, each of which utilizes classifiers (described in Section 4.4) trained on the dataset of Section 3. The features employed by each strategy are outlined here.

4.1 Genre identification

Work in computational linguistics has shown that the frequency distribution of part-of-speech (POS) tags in a text is often dependent on the genre of the text (Biber et al., 1999; Rayson et al., 2001). In our genre identification approach to deceptive opinion spam detection, we test if such a relationship exists for truthful and deceptive reviews by constructing, for each review, features based on the frequencies of each POS tag. [15] These features are also intended to provide a good baseline with which to compare our other automated approaches.

Footnotes:
[15] We use the Stanford Parser (Klein and Manning, 2003) to obtain the relative POS frequencies.
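For illustration, relative POS-tag frequencies of the kind used here can be computed as follows. The paper obtains them from the Stanford Parser (footnote 15); this sketch substitutes NLTK's off-the-shelf tagger, so the exact tag inventory and counts will differ somewhat.

```python
# Illustrative POS-frequency features (the paper used the Stanford Parser;
# NLTK's averaged-perceptron tagger is substituted here for brevity).
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
from collections import Counter
import nltk

def pos_features(review_text, tagset):
    """Map a review to a vector of relative POS-tag frequencies."""
    tokens = nltk.word_tokenize(review_text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    counts = Counter(tags)
    total = max(len(tags), 1)
    return [counts[t] / total for t in tagset]

# A fixed tag inventory keeps feature vectors aligned across reviews.
TAGSET = sorted({"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "IN", "DT",
                 "CC", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD",
                 "RB", "RBR", "RBS", "PRP", "PRP$", "PDT"})
vec = pos_features("The rooms are modern and very comfortable.", TAGSET)
```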

4.2 Psycholinguistic deception detection

The Linguistic Inquiry and Word Count (LIWC) software (Pennebaker et al., 2007) is a popular automated text analysis tool used widely in the social sciences. It has been used to detect personality traits (Mairesse et al., 2007), to study tutoring dynamics (Cade et al., 2010), and, most relevantly, to analyze deception (Hancock et al., 2008; Mihalcea and Strapparava, 2009; Vrij et al., 2007).

While LIWC does not include a text classifier, we can create one with features derived from the LIWC output. In particular, LIWC counts and groups the number of instances of nearly 4,500 keywords into 80 psychologically meaningful dimensions. We construct one feature for each of the 80 LIWC dimensions, which can be summarized broadly under the following four categories:

1. Linguistic processes: functional aspects of text (e.g., the average number of words per sentence, the rate of misspelling, swearing, etc.)
2. Psychological processes: includes all social, emotional, cognitive, perceptual and biological processes, as well as anything related to time or space.
3. Personal concerns: any references to work, leisure, money, religion, etc.
4. Spoken categories: primarily filler and agreement words.

While other features have been considered in past deception detection work, notably those of Zhou et al. (2004), early experiments found LIWC features to perform best. Indeed, the LIWC2007 software used in our experiments subsumes most of the features introduced in other work. Thus, we focus our psycholinguistic approach to deception detection on LIWC-based features.

4.3 Text categorization

In contrast to the other strategies just discussed, our text categorization approach to deception detection allows us to model both content and context with n-gram features. Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, and TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.

4.4 Classifiers

Features from the three approaches just introduced are used to train Naïve Bayes and Support Vector Machine classifiers, both of which have performed well in related work (Jindal and Liu, 2008; Mihalcea and Strapparava, 2009; Zhou et al., 2008).

For a document x, with label y, the Naïve Bayes (NB) classifier gives us the following decision rule:

    y_hat = argmax_c  Pr(y = c) * Pr(x | y = c)                    (1)

When the class prior is uniform, for example when the classes are balanced (as in our case), (1) can be simplified to the maximum likelihood classifier (Peng and Schuurmans, 2003):

    y_hat = argmax_c  Pr(x | y = c)                                (2)

Under (2), both the NB classifier used by Mihalcea and Strapparava (2009) and the language model classifier used by Zhou et al. (2008) are equivalent. Thus, following Zhou et al. (2008), we use the SRI Language Modeling Toolkit (Stolcke, 2002) to estimate individual language models, Pr(x | y = c), for truthful and deceptive opinions. We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).

We also train Support Vector Machine (SVM) classifiers, which find a high-dimensional separating hyperplane between two groups of data. To simplify feature analysis in Section 5, we restrict our evaluation to linear SVMs, which learn a weight vector w and bias term b, such that a document x can be classified by:

    y_hat = sign(w . x + b)                                        (3)

We use SVMlight (Joachims, 1999) to train our linear SVM models on all feature sets from the three approaches described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+. We also evaluate every combination of these features, but for brevity include only LIWC+BIGRAMS+, which performs best. Following standard practice, document vectors are normalized to unit-length. For LIWC+BIGRAMS+, we unit-length normalize LIWC and BIGRAMS+ features individually before combining them.
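To make decision rule (2) concrete, the sketch below classifies a tokenized review by comparing two class-conditional unigram language models. The paper estimates its models with SRILM and interpolated Kneser-Ney smoothing; add-one smoothing is substituted here only to keep the example self-contained, and all names are ours.

```python
# Maximum likelihood classification (decision rule 2) using two unigram
# language models, one per class. Add-one smoothing stands in for the
# interpolated Kneser-Ney smoothing used in the paper.
import math
from collections import Counter

class UnigramLM:
    def __init__(self, documents):
        self.counts = Counter(w for doc in documents for w in doc)
        self.total = sum(self.counts.values())
        self.vocab = set(self.counts)

    def log_prob(self, doc, vocab_size):
        # log Pr(x | y = c) under add-one smoothing
        return sum(math.log((self.counts[w] + 1) / (self.total + vocab_size))
                   for w in doc)

def classify(doc, lm_truthful, lm_deceptive):
    v = len(lm_truthful.vocab | lm_deceptive.vocab)
    lp_t = lm_truthful.log_prob(doc, v)
    lp_d = lm_deceptive.log_prob(doc, v)
    return "truthful" if lp_t >= lp_d else "deceptive"

# Toy usage; real training documents would be tokenized, lowercased reviews.
lm_t = UnigramLM([["the", "room", "was", "small"]])
lm_d = UnigramLM([["my", "husband", "loved", "the", "luxury"]])
print(classify(["the", "luxury", "room"], lm_t, lm_d))
```

Because the classes are balanced, dropping the prior from rule (1) changes nothing; with unbalanced classes a log-prior term would be added to each class score.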

5 Results and Discussion

The deception detection strategies described in Section 4 are evaluated using a 5-fold nested cross-validation (CV) procedure (Quadrianto et al., 2009), where model parameters are selected for each test fold based on standard CV experiments on the training folds. Folds are selected so that each contains all reviews from four hotels; thus, learned models are always evaluated on reviews from unseen hotels.

Results appear in Table 3. We observe that automated classifiers outperform human judges for every metric, except truthful recall, where JUDGE 2 performs best. [16] However, this is expected given that untrained humans often focus on unreliable cues to deception (Vrij, 2008). For example, one study examining deception in online dating found that humans perform at-chance detecting deceptive profiles because they rely on text-based cues that are unrelated to deception, such as second-person pronouns (Toma and Hancock, In Press).

Table 3: Automated classifier performance for three approaches based on nested 5-fold cross-validation experiments. Reported precision, recall and F-score are computed using a micro-average, i.e., from the aggregate true positive, false positive and false negative rates, as suggested by Forman and Scholz (2009). Human performance is repeated here for JUDGE 1, JUDGE 2 and the SKEPTIC meta-judge, although they cannot be directly compared, since the 160-opinion subset on which they are assessed only corresponds to the first cross-validation fold.

                                                              TRUTHFUL             DECEPTIVE
Approach                 Features                Accuracy     P     R     F        P     R     F
GENRE IDENTIFICATION     POS_SVM                 73.0%        75.3  68.5  71.7     71.1  77.5  74.2
PSYCHOLINGUISTIC         LIWC_SVM                76.8%        77.2  76.0  76.6     76.4  77.5  76.9
DECEPTION DETECTION
TEXT CATEGORIZATION      UNIGRAMS_SVM            88.4%        89.9  86.5  88.2     87.0  90.3  88.6
                         BIGRAMS+_SVM            89.6%        90.1  89.0  89.6     89.1  90.3  89.7
                         LIWC+BIGRAMS+_SVM       89.8%        89.8  89.8  89.8     89.8  89.8  89.8
                         TRIGRAMS+_SVM           89.0%        89.0  89.0  89.0     89.0  89.0  89.0
                         UNIGRAMS_NB             88.4%        92.5  83.5  87.8     85.0  93.3  88.9
                         BIGRAMS+_NB             88.9%        89.8  87.8  88.7     88.0  90.0  89.0
                         TRIGRAMS+_NB            87.6%        87.7  87.5  87.6     87.5  87.8  87.6
HUMAN / META             JUDGE 1                 61.9%        57.9  87.5  69.7     74.4  36.3  48.7
                         JUDGE 2                 56.9%        53.9  95.0  68.8     78.9  18.8  30.3
                         SKEPTIC                 60.6%        60.4  61.3  60.8     60.9  60.0  60.5

Among the automated classifiers, baseline performance is given by the simple genre identification approach (POS_SVM) proposed in Section 4.1. Surprisingly, we find that even this simple automated classifier outperforms most human judges (one-tailed sign test p = 0.06, 0.01, 0.001 for the three judges, respectively, on the first fold). This result is best explained by theories of reality monitoring (Johnson and Raye, 1981), which suggest that truthful and deceptive opinions might be classified into informative and imaginative genres, respectively. Work by Rayson et al. (2001) has found strong distributional differences between informative and imaginative writing, namely that the former typically consists of more nouns, adjectives, prepositions, determiners, and coordinating conjunctions, while the latter consists of more verbs, [17] adverbs, [18] pronouns, and pre-determiners. Indeed, we find that the weights learned by POS_SVM (found in Table 4) are largely in agreement with these findings, notably except for adjective and adverb superlatives, the latter of which was found to be an exception by Rayson et al. (2001). However, that deceptive opinions contain more superlatives is not unexpected, since deceptive writing (but not necessarily imaginative writing in general) often contains exaggerated language (Buller and Burgoon, 1996; Hancock et al., 2008).

Table 4: Average feature weights learned by POS_SVM. Based on work by Rayson et al. (2001), we expect weights on the left to be positive (predictive of truthful opinions), and weights on the right to be negative (predictive of deceptive opinions). Entries marked with an asterisk are at odds with these expectations. We report average feature weights of unit-normalized weight vectors, rather than raw weight vectors, to account for potential differences in magnitude between the folds.

TRUTHFUL / INFORMATIVE                      DECEPTIVE / IMAGINATIVE
Category       Variant            Weight    Category          Variant                    Weight
NOUNS          Singular            0.041    VERBS             Base                        0.008*
NOUNS          Plural              0.002    VERBS             Past tense                 -0.057
NOUNS          Proper, singular   -0.089*   VERBS             Present participle         -0.041
NOUNS          Proper, plural      0.091    VERBS             Singular, present          -0.031
ADJECTIVES     General             0.002    VERBS             Third person sing., pres.   0.026*
ADJECTIVES     Comparative         0.058    VERBS             Modal                      -0.063
ADJECTIVES     Superlative        -0.164*   ADVERBS           General                     0.001*
PREPOSITIONS   General             0.064    ADVERBS           Comparative                -0.035
DETERMINERS    General             0.009    PRONOUNS          Personal                   -0.098
COORD. CONJ.   General             0.094    PRONOUNS          Possessive                 -0.303
VERBS          Past participle     0.053    PRE-DETERMINERS   General                     0.017*
ADVERBS        Superlative        -0.094*

Footnotes:
[16] As mentioned in Section 3.3, JUDGE 2 classified fewer than 12% of opinions as deceptive. While achieving 95% truthful recall, this judge's corresponding precision was not significantly better than chance (two-tailed binomial p = 0.4).
[17] Past participle verbs were an exception.
[18] Superlative adverbs were an exception.
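The evaluation protocol at the start of this section can be sketched as follows: folds are grouped by hotel so that models are always tested on reviews from unseen hotels, and precision, recall and F-score are computed from predictions pooled across folds (i.e., micro-averaged from aggregate counts). The inner parameter-selection loop of the nested procedure is elided, and scikit-learn stands in for the paper's tooling; all names are ours.

```python
# Hotel-grouped 5-fold cross-validation with pooled (micro-averaged) metrics.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

def evaluate(X, y, hotels):
    """X: feature matrix; y: 1 = deceptive, 0 = truthful; hotels: group ids."""
    all_true, all_pred = [], []
    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=hotels):
        clf = LinearSVC()  # a linear SVM, as in the paper
        clf.fit(X[train_idx], y[train_idx])
        all_true.extend(y[test_idx])
        all_pred.extend(clf.predict(X[test_idx]))
    # Per-class P/R/F for "deceptive", computed from the aggregate
    # TP/FP/FN counts accumulated over all five folds.
    p, r, f, _ = precision_recall_fscore_support(all_true, all_pred,
                                                 average="binary")
    acc = np.mean(np.asarray(all_true) == np.asarray(all_pred))
    return acc, p, r, f
```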

Both remaining automated approaches to detecting deceptive opinion spam outperform the simple genre identification baseline just discussed. Specifically, the psycholinguistic approach (LIWC_SVM) proposed in Section 4.2 performs 3.8% more accurately (one-tailed sign test p = 0.02), and the standard text categorization approach proposed in Section 4.3 performs between 14.6% and 16.6% more accurately. However, best performance overall is achieved by combining features from these two approaches. Particularly, the combined model LIWC+BIGRAMS+_SVM is 89.8% accurate at detecting deceptive opinion spam. [19]

Surprisingly, models trained only on UNIGRAMS, the simplest n-gram feature set, outperform all non-text-categorization approaches, and models trained on BIGRAMS+ perform even better (one-tailed sign test p = 0.07). This suggests that a universal set of keyword-based deception cues (e.g., LIWC) is not the best approach to detecting deception, and a context-sensitive approach (e.g., BIGRAMS+) might be necessary to achieve state-of-the-art deception detection performance.

To better understand the models learned by these automated approaches, we report in Table 5 the top 15 highest weighted features for each class (truthful and deceptive) as learned by LIWC+BIGRAMS+_SVM and LIWC_SVM.

Table 5: Top 15 highest weighted truthful and deceptive features learned by LIWC+BIGRAMS+_SVM and LIWC_SVM. Ambiguous features are subscripted to indicate the source of the feature. LIWC features correspond to groups of keywords as explained in Section 4.2; more details about LIWC and the LIWC categories are available at http://liwc.net.

LIWC+BIGRAMS+_SVM                    LIWC_SVM
TRUTHFUL         DECEPTIVE           TRUTHFUL     DECEPTIVE
...              chicago             hear         perspron
hotel .          i                   number       negemo
allpunct_LIWC    my husband          allpunct     pronoun
on               luxury              see          leisure
location         experience          dash         exclampunct
)                hilton              sixletters   sexual
floor            business            period       posemo
(                vacation            otherpunct   cause
the hotel        spa                 comma        auxverb
bathroom         looking             space        future
small            while               human        inhibition
helpful          husband             past         feel
$                my                  perceptual   i
-                even                assent       we
, and            hotel               exclusive    family

In agreement with theories of reality monitoring (Johnson and Raye, 1981), we observe that truthful opinions tend to include more sensorial and concrete language than deceptive opinions; in particular, truthful opinions are more specific about spatial configurations (e.g., small, bathroom, on, location). This finding is also supported by recent work by Vrij et al. (2009) suggesting that liars have considerable difficulty encoding spatial information into their lies. Accordingly, we observe an increased focus in deceptive opinions on aspects external to the hotel being reviewed (e.g., husband, business, vacation).

Footnotes:
[19] The result is not significantly better than BIGRAMS+_SVM.
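Rankings like those in Tables 4 and 5 fall directly out of a linear model's weight vector. A minimal sketch, assuming a scikit-learn linear SVM in place of SVMlight and following the Table 4 caption's convention of unit-normalizing each fold's weight vector before averaging:

```python
# Rank features by averaged, unit-normalized weights across CV folds.
# Assumes the positive class is "truthful", so positive weights indicate
# truthful features and negative weights indicate deceptive ones.
import numpy as np

def top_features(weight_vectors, feature_names, k=15):
    """weight_vectors: list of per-fold weight arrays (e.g., clf.coef_[0])."""
    W = np.vstack([w / np.linalg.norm(w) for w in weight_vectors])
    avg = W.mean(axis=0)
    order = np.argsort(avg)
    deceptive = [feature_names[i] for i in order[:k]]        # most negative
    truthful = [feature_names[i] for i in order[-k:][::-1]]  # most positive
    return truthful, deceptive
```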

We also acknowledge several findings that, on the surface, are in contrast to previous psycholinguistic studies of deception (Hancock et al., 2008; Newman et al., 2003). For instance, while deception is often associated with negative emotion terms, our deceptive reviews have more positive and fewer negative emotion terms. This pattern makes sense when one considers the goal of our deceivers, namely to create a positive review (Buller and Burgoon, 1996).

Deception has also previously been associated with decreased usage of first person singular, an effect attributed to psychological distancing (Newman et al., 2003). In contrast, we find increased first person singular to be among the largest indicators of deception, which we speculate is due to our deceivers attempting to enhance the credibility of their reviews by emphasizing their own presence in the review. Additional work is required, but these findings further suggest the importance of moving beyond a universal set of deceptive language features (e.g., LIWC) by considering both the contextual (e.g., BIGRAMS+) and motivational parameters underlying a deception as well.

6 Conclusion and Future Work

In this work we have developed the first large-scale dataset containing gold-standard deceptive opinion spam. With it, we have shown that the detection of deceptive opinion spam is well beyond the capabilities of human judges, most of whom perform roughly at-chance. Accordingly, we have introduced three automated approaches to deceptive opinion spam detection, based on insights coming from research in computational linguistics and psychology. We find that while standard n-gram-based text categorization is the best individual detection approach, a combination approach using psycholinguistically-motivated features and n-gram features can perform slightly better.

Finally, we have made several theoretical contributions. Specifically, our findings suggest the importance of considering both the context (e.g., BIGRAMS+) and motivations underlying a deception, rather than strictly adhering to a universal set of deception cues (e.g., LIWC). We have also presented results based on the feature weights learned by our classifiers that illustrate the difficulties faced by liars in encoding spatial information. Lastly, we have discovered a plausible relationship between deceptive opinion spam and imaginative writing, based on POS distributional similarities.

Possible directions for future work include an extended evaluation of the methods proposed in this work on both negative opinions and opinions coming from other domains. Many additional approaches to detecting deceptive opinion spam are also possible, and a focus on approaches with high deceptive precision might be useful for production environments.

Acknowledgments

This work was supported in part by National Science Foundation Grants BCS-0624277, BCS-0904822, HSD-0624267, IIS-0968450, and NSCC-0904822, as well as a gift from Google, and the Jack Kent Cooke Foundation. We also thank, alphabetically, Rachel Boochever, Cristian Danescu-Niculescu-Mizil, Alicia Granstein, Ulrike Gretzel, Danielle Kirshenblat, Lillian Lee, Bin Lu, Jack Newton, Melissa Sackler, Mark Thomas, and Angie Yoo, as well as members of the Cornell NLP seminar group and the ACL reviewers for their insightful comments, suggestions and advice on various aspects of this work.

References

C. Akkaya, A. Conrad, J. Wiebe, and R. Mihalcea. 2010. Amazon Mechanical Turk for subjectivity word sense disambiguation. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Los Angeles, pages 195-203.

D. Biber, S. Johansson, G. Leech, S. Conrad, E. Finegan, and R. Quirk. 1999. Longman Grammar of Spoken and Written English, volume 2. MIT Press.

C.F. Bond and B.M. DePaulo. 2006. Accuracy of deception judgments. Personality and Social Psychology Review, 10(3):214.

D.B. Buller and J.K. Burgoon. 1996. Interpersonal deception theory. Communication Theory, 6(3):203-242.

W.L. Cade, B.A. Lehman, and A. Olney. 2010. An exploration of off topic conversation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 669-672. Association for Computational Linguistics.

S.F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 310-318. Association for Computational Linguistics.

C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and L. Lee. 2009. How opinions are received by online communities: a case study on amazon.com helpfulness votes. In Proceedings of the 18th International Conference on World Wide Web, pages 141-150. ACM.

H. Drucker, D. Wu, and V.N. Vapnik. 2002. Support vector machines for spam categorization. Neural Networks, IEEE Transactions on, 10(5):1048-1054.

G. Forman and M. Scholz. 2009. Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. ACM SIGKDD Explorations, 12(1):49-57.

Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. 2004. Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 576-587. VLDB Endowment.

J.T. Hancock, L.E. Curry, S. Goorha, and M. Woodworth. 2008. On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes, 45(1):1-23.

J. Jansen. 2010. Online product research. Pew Internet & American Life Project Report.

N. Jindal and B. Liu. 2008. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 219-230. ACM.

T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, pages 137-142.

T. Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in Kernel Methods, page 184. MIT Press.

M.K. Johnson and C.L. Raye. 1981. Reality monitoring. Psychological Review, 88(1):67-85.

S.M. Kim, P. Pantel, T. Chklovski, and M. Pennacchiotti. 2006. Automatically assessing review helpfulness. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 423-430. Association for Computational Linguistics.

D. Klein and C.D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1, pages 423-430. Association for Computational Linguistics.

J.R. Landis and G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159.

E.P. Lim, V.A. Nguyen, N. Jindal, B. Liu, and H.W. Lauw. 2010. Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 939-948. ACM.

S.W. Litvin, R.E. Goldsmith, and B. Pan. 2008. Electronic word-of-mouth in hospitality and tourism management. Tourism Management, 29(3):458-468.

F. Mairesse, M.A. Walker, M.R. Mehl, and R.K. Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30(1):457-500.

R. Mihalcea and C. Strapparava. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 309-312. Association for Computational Linguistics.

M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. 2003. Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29(5):665.

A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, pages 83-92. ACM.

M.P. O'Mahony and B. Smyth. 2009. Learning to recommend helpful hotel reviews. In Proceedings of the Third ACM Conference on Recommender Systems, pages 305-308. ACM.

F. Peng and D. Schuurmans. 2003. Combining naive Bayes and n-gram language models for text classification. Advances in Information Retrieval, pages 547-547.

J.W. Pennebaker, C.K. Chung, M. Ireland, A. Gonzales, and R.J. Booth. 2007. The development and psychometric properties of LIWC2007. Austin, TX: LIWC.net.

N. Quadrianto, A.J. Smola, T.S. Caetano, and Q.V. Le. 2009. Estimating labels from label proportions. The Journal of Machine Learning Research, 10:2349-2374.

P. Rayson, A. Wilson, and G. Leech. 2001. Grammatical word class variation within the British National Corpus sampler. Language and Computers, 36(1):295-306.

R.A. Rigby and D.M. Stasinopoulos. 2005. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507-554.

F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47.

M.Á. Serrano, A. Flammini, and F. Menczer. 2009. Modeling statistical properties of written text. PLoS One, 4(4):5372.

A. Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing, volume 3, pages 901-904. Citeseer.

C. Toma and J.T. Hancock. In Press. What lies beneath: The linguistic traces of deception in online dating profiles. Journal of Communication.

A. Vrij, S. Mann, S. Kristen, and R.P. Fisher. 2007. Cues to deception and ability to detect lies as a function of police interview styles. Law and Human Behavior, 31(5):499-518.

A. Vrij, S. Leal, P.A. Granhag, S. Mann, R.P. Fisher, J. Hillman, and K. Sperry. 2009. Outsmarting the liars: The benefit of asking unanticipated questions. Law and Human Behavior, 33(2):159-166.

A. Vrij. 2008. Detecting Lies and Deceit: Pitfalls and Opportunities. Wiley-Interscience.

W. Weerkamp and M. De Rijke. 2008. Credibility improves topical blog post retrieval. ACL-08: HLT, pages 923-931.

M. Weimer, I. Gurevych, and M. Mühlhäuser. 2007. Automatically assessing the post quality in online discussions on software. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 125-128. Association for Computational Linguistics.

G. Wu, D. Greene, B. Smyth, and P. Cunningham. 2010. Distortion as a validation criterion in the identification of suspicious reviews. Technical report, UCD-CSI-2010-04, University College Dublin.

K.H. Yoo and U. Gretzel. 2009. Comparison of deceptive and truthful travel reviews. Information and Communication Technologies in Tourism 2009, pages 37-47.

L. Zhou, J.K. Burgoon, D.P. Twitchell, T. Qin, and J.F. Nunamaker Jr. 2004. A comparison of classification methods for predicting deception in computer-mediated communication. Journal of Management Information Systems, 20(4):139-166.

L. Zhou, Y. Shi, and D. Zhang. 2008. A statistical language modeling approach to online deception detection. IEEE Transactions on Knowledge and Data Engineering, 20(8):1077-1081.
