1 Session: Similarity Search WWW 2007 / Track: Data Mining Scaling Up All Pairs Similarity Search * Yiming Ma Roberto J. Bayardo Ramakrishnan Srikant U. California, Irvine Google, Inc. Google, Inc. [email protected] [email protected] [email protected] One solution often applied to deal with data of this scale is to ABSTRACT apply approximation techniques. While in theory many a high dimensional vector data in Given a large collection of sparse making the probability of error approximation techniques allow for of finding all pairs of vectors space, we investigate the problem rs, in practice these negligibly small by t uning appropriate paramete whose similarity score (as determin ed by a function such as cosine algorithms are applied in a manner that results in a non-trivial distance) is above a given threshol d. We propose a simple algorithm amount of error (e.g. see [ 4 , 8 ].) Recent work from the database based on novel indexing and optimization strategies that solves this 1 21 ] on finding all similar pa irs has focused instead , community [ problem without relying on appr oximation methods or extensive within the context of a database on solving the problem exactly, and parameter tuning. We show the approach efficiently handles a opose an exact solution to the management system. We also pr variety of datasets across a wide setting of similarity thresholds, problem, though we ignore issues of DBMS integration and focus with large speedups over previous rt approaches. state-of-the-a solely on performance issues. We show that a parsimonious several other subtle yet simple indexing approach combined with H.3.3 [ ]: Search Information Search and Retrieval Categories: rformance improvements. We optimizations yield dramatic pe Process, Clustering comprised of publicly available validate our algorithms on a dataset Algorithms, Performance General terms: and on two real-world web data from the DBLP server, similarity search, similarity join, data mining Keywords: applications: ge for the Orkut social nerating recommendations 1. INTRODUCTION network, and computing pairs of similar queries among the 5 million most frequently issued Google queries represented as ions require solving a similarity search Many real-world applicat r search result snippets. vectors generated from thei in all pairs of objects whose problem where one is interested We give a formal problem Paper outline: statement and define threshold. Consider for example: similarity is above a specified our terminology in we describe related Section 3 . In Section 2 • Search engines often suggest Query refinement for web search: work. We present our algorithm in Section 4 , where we limit ations. One approach to generating such alternate query formul attention to the cosine similarity metric. In some applications, the suggestions is to find all pairs of similar queries based on the vector elements are binary; we therefore specialize the algorithms similarity of the search results for those queries . Since the [19] of this section, we extend the for this case. In the final part goal is to offer only high quality suggestions, we only need to dent data. We present the results of our algorithms to disk resi whose similarity score is above a threshold. find pairs of queries empirical evaluation on large-scal e memory and disk resident • Collaborative filtering: Collaborative filtering algorithms make datasets in Section 5 , and generalize our resu lts to other similarity ng which users have similar recommendations by determini Section 7 . In Section 6 metrics in we conclude with a summary of tastes. Thus the algorithms need to compute pairs of similar users our contributions. whose similarity is above some threshold. 2. PROBLEM STATEMENT • Especially Near duplicate document detection & elimination: it is important to detect and in the area of document indexing, ... of fixed = {} v ,,, Given a set of real-valued vectors v Vv n 2 1 nt. In many cases, the presence purge documents that are equivale () xy sim , a similarity function m dimensionality , , and a similarity of trivial modifications make su ch detection difficult, since a t , we wish to compute the set of all pairs threshold and their xy , () simple equality test no longer su icate detection ffices. Near dupl ∈ V , xy such that () , xy sim similarity values ≥ t () , xy sim and . is made possible through similarity search with a very high simi - We assume the similarity function is commutative. Thus, if the pair larity threshold. , yx meets the threshold, so does xy () , , and we need only () Recent work has applied algorithms for • Coalition detection: assume vector values are non- include one in the result. We also finding all similar pairs within an application for identifying coa - negative. . [13] litions of click fraudsters focus on the case of unit-length For concreteness, we initially normalized vector input and cosine similarity. Cosine similarity is While these applications are not a ll new, the scale of the problem widely used, and produces high quality results across several has increased dramatically due to the web. The number of distinct domains [ 19 x t-length input vectors ]. Restricted to uni 20 8 , 9 , and , search queries issued over a single week to any large search engine is in the tens of millions. Similarly, if one wishes to perform , cosine similarity is simply the vector dot product: y collaborative filtering on data from sites such as Amazon or ⋅ xi [] yi [] dot = , () xy NetFlix, the algorithms need to scale to tens of millions of users. ∑ Depending on the appl ication, thes e domains could involve i dimensionality equal to if not larger than the number of input For many problem domains, especi ally those i nvolving textual vectors. in that a vast ma jority of vector sparse data, input vectors are weights are 0. A sparse vector representation for a vector x is the * , [] 0 > over all i 1 [] m = . We () ... such that set of all pairs xi ixi Work done while at Google. size of the vector. The features sometimes refer to such pairs as the Copyright is held by the Intern ational World Wide Web Conference of a vector x , which we denote as x , is the number of such pairs. Committee (IW3C2). Distribution of th ese papers is limited to classroom Vector size should no t be confused with vector length, or use, and personal use by others. magnitude , which is typically denoted by x . WWW 2007 , May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 131

2 Session: Similarity Search WWW 2007 / Track: Data Mining Given a set of sparse vectors V , an inverted list representation 4. ALGORITHMS lists (one for each I I ,,, m of the set consists of ... I 2 1 m that can be used to exactly We now describe our algorithm consists of all pairs such that I () dimension), where list xw , i all-pairs similarity search (without approximation) solve the vector is a list V , xi [] w = , and w is non-zero. Each x I is in i as a series of refinements to problem. We present the algorithm and not a set since we may impose requirements on the order in various optimiz simplify understanding of the ations and which its elements appear. engineering choices we made. We initially assume memory over xi we denote the maximum value x For a given vector [] ndex structures, which allows each residence of the vectors and/or i all as x . For a given dimension maxweight i , we denote the i () algorithm to solve the all-pair s problem with only a single pass as maximum value xi [] over all vectors x in the dataset V over the data. Extensions for datasets that are larger than available . maxweight V () i memory appear in . Section 4.6 3. RELATED WORK 4.1 A Basic Inverted Index-Based Approach The all-pairs similarity search pr oblem is a generalization of the One can naively solve the all-pairs similarity search problem by well-known nearest neighbor problem in which the goal is to find inverted list index of the input employing an IR system to build an the nearest neighbors of a given point query. There is a wide body vectors, and issuing each input vect or as if it were a query to find of work on this problem, with many recent works considering the set of matching documents. Ma tching documents could then be various approximation techniques [ 10 , 11 , 12 ]. Approximation 6 , filtered according to the similarity threshold, or the IR system itself methods aim to reduce th e dimensionality and/or size of the input could be modified to only retu rn documents that meet the vectors. The all-pairs similarity search problem has been directly threshold. It is easy to see that such an algorithm produces correct addressed by Broder et al. in the context of identifying near- results for any similarity function that must compare only those duplicate web pages [4] . In this work, the problem is solved nputs have non-zero weights. dimensions for which both i etching function based on min-wise approximately by applying a sk Unfortunately, there are several inef ficiencies to this approach: (1) der to compress document vectors independent permutations in or For each similar pair xy , it wastes computation effort also () n-grams (called whose dimensions co ). shingles rrespond to distinct . (2) It builds a full inverted list index yx , generating the pair () To further improve performance, and to avoid issues of false prior to generating any output, an d (3) it requires both the index resemblance, the implementation removes the most frequent and the vectors themselves rema in memory resident for high shingles, though the actual effect of this heuristic is not quantified. performance. Our methods do not resort to approximation or discarding of d build the index dynamically as A better approach is to instea contribution is frequent features, and thus our to scale exact suggested in , and to store the vect or weights within the [21] datasets. Additionally , the techniques we methods to larger inverted index itself. Score accumulation [14] can then be applied propose are orthogonal to some approximation approaches that tion using the inverted index to compute the similarity func reduce vector size such as min hashing, and can be combined with il is exactly how the index is structure alone. An important deta them for even greater improvements in scalability. accessed in order to accumulate the similarity score. Previous Our work is also related to work in information retrieval (IR) works in IR [23] and similarity join processing [21] use a heap data optimization [ 5 , 14 , 15 16 , 17 , 22 , 23 ]. In IR, each document can , ts corresponding to each feature structure to merge the inverted lis or of weighted terms, indexed be represented by a sparse vect of a given vector x . Such an approach is appealing in IR when , itself a list of through inverted lists. The user query terms, can be answers may be required, since only the initial n only the first those terms with certain (usually represented by a sparse vector of portion of many lists may need to be accessed. However, in the ery then amounts to finding all, equal) weights. Answering the qu case where all answers (meeting a particular score threshold) are th non-zero similarity to the or the top k, document vectors wi list participating in the merge may often be required, each inverted query vector, ranked in order of their similarity. The above-cited scanned in its entirety. Our approach at exploiting the inverted lists works propose various optimizations for this problem, which dually and accumulate scores in a is to instead scan each one indivi Turtle and Flood classify into two categories: term-at-a-time [23] d locality and hash-based map. This approach offers improve . Optimization strategies for and document-at-a-time approaches avoids the logarithmic overh ead of the heap structure. t the fact that the essing typically exploi document-at-a-time proc score accumulation approach, Pseudo-code for our map-based e do not directly documents, and henc r user is looking for the top Figure 1 . The top-level irs-0, appears in which we call All-Pa apply to our problem domain. The term-at-a-time strategies that crementally builds the inverted function scans the dataset and in term-at-a-time Turtle and Flood’s are “safe” (complete), such as e scans the inverted lists to lists. The Find-Matches-0 subroutin optimization, are more appl icable. Our work develops max_score y that is added to We call any vector perform score accumulation. more powerful optimizations th at exploit the particular the weight accumulation map a candidate vector for x , and the similarity search problem. requirements of the all-pairs vector pair ecks the score . The algorithm ch candidate pair xy () , a The all-pairs similarity search pr oblem has also been addressed of each candidate pair, and adds the pair in the result set if it meets similarity join in the database community, where it is known as the the threshold. , 7 , 21 ]. The techniques proposed in this work fall into 1 problem [ solutions convert an imprecise two categories. Signature based 4.2 Exploiting the Threshold During Indexing matching problem to a precise matching problem followed by a e threshold only e similarity scor Up to now we have consulted th filtering phase to eliminate fals e positives. Inverted list based to determine which candidate pairs make it into the result set. ch as those discussed above. We solutions exploit IR techniques su Previous work has described how to exploit a similarity threshold leverage and extend upon some of the ideas proposed in this work, more aggressively in order to limit the set of candidate pairs that and use them as comparison points in our experiments. are considered [ ], but these approaches 21 still involve building 23 , Other related work includes cluste ring of web data [2, 8, 19, 20]. e vector input. Our approach is the complete inverted index over th ng typically employ relatively These applications of clusteri different than previous work in th at rather than simply exploiting straightforward exact algorithm s or approximation methods for the threshold to reduce candidate pa irs, we exploit the threshold to could be leveraged by these computing similarity. Our work ndexed in the first place. This reduce the amount of information i applications for improved perform ance or higher accuracy through the number of candidate pairs approach dramatically reduces less reliance on approximation. 132

3 WWW 2007 / Track: Data Mining Session: Similarity Search , , 0( , ) A AIRS - t ,, I ... LL ) -P 0( - ATCHES -M IND F xI Vt 1 m A empty map from vector id to weight ← ← ∅ O ∅ ← M I ,,, ← ∅ I ... I m 1 2 do for each do s.t. for each ixi [] 0 > xV ∈ do for each F I ... ,, ∈ t IND -M ATCHES -0( , , ) () yyi [] , I ∪ ← xI OO 1 i m do ixi [] xi [] [] Ay 0 ← [] Ay > + for each s.t. ⋅ [] yi () xxi [] , I {} ∪ ← for each do with non-zero weight in yA I i i O then if t ≥ Ay return [] ,, () {} ∪ ← MM xyAy [] return M Figure 1. A basic inverted index based approach. ) F IND A LL -P AIRS - 1( , ,, -M t I ... ATCHES 1( ) , , - xI Vt 1 m Reorder the dimensions such that dimensions with A ← empty map from vector id to weight m ... 1 the most non-zero entries in appear first. ← ∅ M V . Denote the max. of over all as V () for each s.t. do > [] ixi 0 [] xV ∈ maxweight xi i ∈ O ∅ ← for each do yyi [] , () I i ,,, I ... I ← ∅ ⋅ Ay [] Ay ← + [] [] yi [] xi I m 2 1 xV ∈ yA for each do for each do with non-zero weight in t ... I -1( , , ) IND -M ATCHES ,, OO ∪ ← xI y y ' F e unindexed portion of ;; Recall that is th m 1 ← 0 b , ← xy dot + () [] sAy ' for each do in increasing order of s.t. [] i > ixi 0 ≥ st if then + ← ⋅ [] xi () V bb maxweight {} MM xys ← () ,, ∪ i bt ≥ if then M return , ← xxi [] I () {} ∪ I i i xi [] 0 ← x ' ;; create O return Figure 2. An algorithm that expl oits the threshol d during indexing. such as index considered and reduces overhead x . Before we begin, recall that we denote the construction and as a candidate of inverted list scanning during score accumulation. unindexed features from vector y as y ' . Let us further denote the as The threshold-based refinement indexed features of vector y to our algorithm appears in y '' . Now, to see why (1) holds, note that after accumulating scor es over the indexed portion of y . irs function now iterates over Figure 2 The main loop of the All-Pa , we have accumulated in Ay ds indexing any features from most to least given . Since () ' frequent, and avoi ' xy dot the value [] x , dot , () dot xy ' , () dot xy '' , xy + = , the final sum in Find-Matches- () condition is met. The result is to vector features until a particular features to ensure that any index just enough of the least frequent 1 (indicated by the arrow) computes the actual cosine similarity vector score. y that has the potential of meeting the similarity threshold given during matching. x must be identified as a candidate of x To see why (2) holds, note that when indexing a vector, All- Pairs-1 maintains a trivial upperbound on the score attainable by b ordering is not required for The frequency-based feature correctness; its effect is to heuristically minimize the length of the matching the first features of the current vector against any other , it t inverted lists. soon as this upperbound exceeds vector in the dataset. As ining features. It thus begins indexing the rema follows that for any By indexing only a subset of vect or features, All-Pairs-1 reduces indexed vector the overall number of candidate pa irs that are considered by Find- ' y < t () , . We xy dot , we have that x and vector , ' , () dot xy '' , () + = , hence for any xy dot () again note that dot xy Matches. Also note that vector features that are indexed are removed from the vector itself in order to preserve memory and x cos for which y and indexed vector , , we have xy () t ≥ vector speed up the computation that re mains after accumulating weights 0 , () '' xy cos > . It follows that at least one indexed feature of y that from the partial index. Ov erall, each vector feature is stored only is produced as a is in common with some feature of x , so y once, either within the partial index or within the vector itself. x . candidate for tion of each vector is indexed, Because of the fact that only a por 4.3 Exploiting a Specific Sort Order after processing the inverted li sts within Find-Matches, only a Recall that the correctness of All-Pairs-1 relies on the fact that it portion of the similarity score will be accumulated. is produced y y to guarantee that indexes enough of each vector must therefore compute cosine similarity over the Find-Matches-1 as a candidate when matche d against any other vector x in the ctor in order to complete the unindexed portion of any candidate ve x satisfy the similarity threshold. Note y and dataset such that Even though explicit similarity e computation. similarity scor however that correctness is maintained even under a more nput vectors is required by this computation over a pair of i restrictive condition: a vector must be produced as a candidate y ially faster than All-Pairs-0 always substant algorithm, it is almost only when matched against those vectors that follow it in the order because of the vast re duction in the maximum size of its inverted . This suggests that by exploiting in which the dataset is processed lists. We further optimize away this overhead in subsequent may be able to further reduce the a particular dataset sort order, we refinements. ly further improve amount of features i ndexed, and consequent approach, we must establish To demonstrate correctness of this has noted that [21] While previous work algorithm performance. two facts: (1) the algorithm correctly computes the similarity score sorting the dataset according to particular criteria such as vector between any candidate pair x , and (2) for the current vector and y size can improve algorithm perfor mance due to the nature of the , xy dot such that y , any indexed vector ≥ x must be produced t () 133

4 Session: Similarity Search WWW 2007 / Track: Data Mining LL -P AIRS -2( , ) A Vt such that dimensions with Reorder the dimensions 1 ... m the most non-zero entries in appear first. V () V as . Denote the max. of over all xi [] xV ∈ maxweight i () [] x =maxweight m xi ... i 1 Denote the max. of for as . V maxweight x () Sort in decreasing order of . O ∅ ← I ,,, ← ∅ I ... I 2 m 1 xV ∈ do for each I ,, t -2( , , ) ATCHES -M IND ... ← xI ∪ OO F m 1 b 0 ← i > 0 [] ixi for each do in increasing order of s.t. V () maxweight ⋅ + x () , () xi [] ← min maxweight bb i bt ≥ if then , I ← ∪ {} () [] xxi I i i ← 0 [] xi O return input vectors to further reduce the that exploits a sort order of the Figure 3a. A version of All-Pairs amount of indexed data, and hence candidate pairs. xI IND -M ATCHES -2( , , ) t ,, I F ... m 1 A ← empty map from vector id to weight 1 M ∅ ← 2 () V ⋅ maxweight [] xi 3 ← remscore ∑ i i maxweight x () ⁄ ← minsize t 4 0 [] ixi > for each s.t. do 5 I yw , () 6 Iteratively remove from the front of while < minsize y . i 7 [] do I () , yyi ∈ for each i 8 if or then [] Ay remscore t ≠ ≥ 0 9 [] + ← ⋅ Ay [] Ay [] xi [] yi ⋅ ← () V maxweight – [] remscore x i remscore 10 i do with non-zero weight in for each yA 11 then 12 if maxweight t ≥ Ay min y ' x , () [] x () maxweight y ⋅ ' () ⋅ + ' 13 [] dot xy sAy , () + ← 14 if then ≥ st 15 MM xys () {} ∪ ← ,, return M 16 ts the similarity threshold. -Matches that directly exploi Figure 3b. A version of Find rk did not otherwise exploit the data structures employed, this wo exploited more directly in orde r to further improve performance. sort order. Our final refinement of Find-Matches ( Figure 3 b) exploits the The first two ways we describe do threshold directly in three ways. Figure 3a provides our refinement of the All-Pairs procedure order, while the third does. not rely on the dataset sort to further minimize the ar dataset sort order that exploits a particul e ordering over the vectors number of indexed features. Th xploiting the threshold relies on the fact The first method of e x that is produced after a vector y has a x , we eventually get to the guarantees any vector that as we iterate over the features of point where if a vector has not already been identified as a lower maximum weight. Significantly fewer features of each eeing candidacy of a vector can now be indexed while still guarant candidate of x , then there is no way it can meet the score vector when matched against those that follow it in the iteration threshold. Find-Matches-2 therefore keeps track of such a point by order. To show why, we apply an argument analogous to the maintaining an upperbound on the score attainable remscore correctness claim we made for All-Pairs-1. Note that All-Pairs-2 x that shares none of the “already y and any vector between now maintains an upperbound 3 & 10). Once the upperbound drops processed” features (lines b on the score attainable by beneath the score threshold, the algorithm switches to a phase matching the first features of a ve ctor against any other vector that new candidates into th where it avoids putting e map, and instead follows it in the dataset. As before, as soon as this upperbound those vectors already in the only updates accumulated scores of exceeds t it begins indexing the remaining features. Thus, for any map (line 8). that follows it in the dataset x and vector y indexed vector , from which correctness ordering, we have that ' , () t < xy The second method of exploiting th dot e threshold takes place in the follows. that Find-Matches-2 does not candidate iteration loop. Note now and each partial x unconditionally compute the dot product over 4.4 Exploiting the Threshold During Matching . Instead, it computes a cheap upperbound on candidate vector y ' The modifications we made to Find-Matches in Find-Matches-1 dot xy , () , and only explicitly computes the dot product should the indirectly exploit the similarity threshold to reduce candidate pairs 12). With this optimization, upperbound meet the threshold (line by virtue of the partially indexed vectors, but the threshold can be 134

5 Session: Similarity Search WWW 2007 / Track: Data Mining I -B ) , ( INARY -B AIRS -P LL A t IND F INARY ( , , ) ,, ATCHES -M ... Vt xI 1 m A empty map from vector id to int ← Reorder the dimensions ch that dimensions with su m ... 1 ∅ x ← ← M remscore , the most non-zero entries in appear first. V 2 ← ⋅ x t minsize . Sort in increasing order of Vx ← O 1 [] ixi = ∅ for each do s.t. ,,, I ← ∅ y minsize < ... I yI I Remove all from s.t. . m i 1 2 ∈ yI xV ∈ for each do do for each i ≥ remscore minsize ≠ [] 0 Ay IND ATCHES ... INARY t ,, -M -B I (, , ) OO ∪ ← xI if or then F m 1 Ay [] Ay 1 + ← [] b 0 ← ← – 1 remscore remscore i s.t. for each do in increasing order of ixi [] = 1 yA 1 + ← bb do with non-zero count in for each if then bx ⁄ t ≥ Ay + [] y ' ------------------------ ≥ t if then x ← I ∪ I {} i i ⋅ xy xi [] 0 ← () Ay [] dot xy ' + , O return -------------------------------------- - d ← xy ⋅ dt ≥ then if ,, ← ∪ {} () MM xyd M return Figure 4. The All-Pairs algorithm specialized for binary vector input. vectors are binary, there is no need to store vector weights within are quickly excluded without iterating over most candidates of x n get at the vector sizes in order the inverted index as long as we ca their (potentially numerous) unindexed features. rity function. In the case of to appropriately compute the simila , we x The third optimization exploits the fact that for a vector ty between two non-normalized cosine similarity, the similari y can show that any vector that satisfies the similarity score binary valued vectors is as follows: threshold must meet the follo wing minimum size (“minsize”) constraint: x y ⋅ i i ∑ maxweight yt () ≥ ⁄ x () , dot xy i ----------------------- - ----------------------- - cos () xy , == To see why, by definition, if two vectors x and y meet the ⋅ xy ⋅ xy xy dot similarity score threshold then . Now, note that ≥ t () , Because binary input vectors are no longer of unit length, , < y () x maxweight () xy dot ⋅ . Thus, we have that instead of sorting the input and computing bounds based on the ≥ t ⋅ y () x maxweight , from which the claim follows. vector property, the property of interest becomes the maxweight sed by All-Pairs-2 in decreasing Now, because vectors are proces size of the input vectors. Recall th at a vector’s size is simply its x maxweight () , the minimum allowable matching vector order of number of features, and hence af ter unit-length normalization of e algorithm progresses. We can size increases monotonically as th weight of a vector is inversely binary valued vectors, the maximum tly remove from the index any therefore safely and permanen ecialization of proportional to its size. The binary -valued input sp vector encountered that fails to meet the size constraint at any All-Pairs algorithm appears in Figure 4 . given point. We remove such vect ors lazily during Find-Matches at Correctness of the reduced indexi ng condition used in All-Pairs- the time a particular inverted list is required for score accumulation Binary follows from the fact that ndexes all but a the algorithm i (line 6). Note that for efficiency we only remove such vectors from b y of a given vector < t ⁄ by . such that number of features the beginning of each inverted list, which implies we do not Recall that we denote the uninde y and xed and indexed features of necessarily remove all vectors that fail meet the size constraint. x satisfy y and y respectively. We show that if vectors '' y and ' But note that the inverted lists ar e themselves grown in decreasing the similarity threshold, they share at least one indexed term. Note order of maxweight , and that is highly inversely maxweight first that since : xy ≥ ginning of each list will thus correlated with vector size. The be typically contain most of the vectors eligible for removal. ' , dot xy () b b b ----------------------- - - ----------------------- - ----------------------- ----- ≤≤ < t = In implementations (such as ours ) that use arrays for inverted y ⋅ xy ⋅ xy ⋅ yy ginning of a list lists, removing elements from the be can be costly due to the shifting of the remaining elements. In such a case, one xy ' , () dot xy () , '' dot ----------------------- - ----------------------- - , ≥ t and cos Next, since () xy , () xy , cos = + can instead maintain a start offset into each inverted list array that xy ⋅ xy ⋅ can be incrementally advanced over vectors that no longer meet we must have that: the size constraint. If desired, inverted lists could still be , '' xy dot () periodically filtered in order to reclaim memory occupied by the ----------------------- - 0 > vectors too small to meet the similarity threshold. xy ⋅ Hence one indexed term. share at least y and x 4.5 Specializations for Binary Vector Data We have highlighted the mins ize index pruni ng optimization is most naturally represented Many applications have data that used by Find-Matches-Binary, sinc e it has been strengthened for d directly apply using binary valued sparse vector s. While one coul the case of binary data. First, note that since inverted lists are by simply normalizing the inputs to All-Pairs to binary vector data grown in increasing order of vector size, all vectors that fail to resulting weight ed vectors, unit length and operating on the meet the minimum size constraint a ppear at the front of the list. All specialization avoids the need to perform any such conversion, and ficiently removed or advanced such vectors can therefore be ef also allows for other optimizati ons. For example, when the input over, instead of just most of them as was the case with weighted 135

6 Session: Similarity Search WWW 2007 / Track: Data Mining 1e+007 1e+006 DBLP DBLP QSC QSC 1e+006 100000 100000 10000 10000 1000 1000 Freqency Freqency 100 100 10 10 1 1 100000 0 100 100 1 100 1000 10000 10 1e+0 06 10 1 In Degree Out Degree d query snippet containment data. ee distributions of the DBLP an Figure 5. Outdegree and Indegr x and vector data. Now, for any pair of vectors y that meet a 5. EXPERIMENTS , we exploit the following minsize t cosine similarity threshold In this section we compare All-Pairs to previous inverted list constraint, which as before in creases monotonically as the ]. We run these algorithms and signature based methods [ 1 , 11 [21] algorithm progresses: on three real world datasets which we describe in Section 5.1. Two 2 of these datasets come from web applications, while the third is a yxt ⋅ ≥ much smaller but publicly availa ble dataset comp iled from DBLP lds, we have by definition: To see why this condition ho data. All our implemen tations are in C++, and use the standard () , xy dot r inverted lists and most other template library vector class fo - ----------------------- ≥ t e dense_hash_map class for structures. We used the Googl ⋅ xy performing score accumulatio n in Find-Matches and the and hence, dense_hash_set class for signature matching (both freely available experiments in this subsection from http://code.google.com/). All 2 , () xy dot 2 ntium-4 class machine with 3 were performed on a 3.4 GHz Pe ---------------------- ≥ ⋅ xt y Gbytes of RAM and a 7200 RPM SATA-IDE hard drive. ≤ xy Since dot , we must also have that , () y 5.1 Datasets 2 The first dataset we used was motivated by applications QSC: y 2 -------- ⋅ ≥ = , and the condition is established. yxt ine semantic similarity between of query snippet data to determ y queries [19] . We selected roughly [9] and for taxonomy generation 4.6 Extensions for Disk Resident Data the 5 million most frequent queries that were submitted to the Google search engine from the Un ited States during one week in Pairs as storing each feature in RAM at We have described All- January 2006. Queries in the li st were then normalized (e.g. most once during its execution. For datasets whose features exceed excessive whitesp ace stripped). We then used the Google SOAP available memory, virtual memory thrashing would make such an m/apis/) to generate the top-20 search API (http://www.google.co ituations, we have developed an approach impractical. For such s search results and their snippet data for each query. We used the set irs algorithm. For starters, All- out-of-core variant of the All-Pa terms in the vector for each query, i.e., if of queries as candidate Pairs must use an out-of-core sorting algorithm to impose the query (and also met the appeared in the snippets for query x x ining problem is to prevent the desired vector ordering. The rema j i x j = 1 . We only included term criteria describe d next), then [] x ctures used for vector matching various memory resident data stru j i x ’s snippets, in the vector for x x if appeared at least twice in from exceeding available memory. Our solution is to have the i j i x x ’s snippets was si gnificantly higher and the frequency of in algorithm index newly encountered vectors only up to the point i j across all queries. than the background frequency of x where further indexing would excee d available memory. At this j The indegree (feature frequency ) and outdegree (features per point, the algorithm switches to a “matching only” phase in which query snippet containment vector) distributions for this particular memory only for the duration of each remaining vector is kept in (QSC) dataset appear in . Figure 5 While our significance the Find-Matches call. To ensure completeness, once the end of the y restrained outdegree with an thresholding resulted in a relativel the algorithm dataset is reached during the ma tching only phase, s a power law distribution. This average of 36, the indegree follow clears out the index and other memo ry resident vector features, and te process space, so the runtimes dataset fits entirely in a 2 gigaby to resume indexing anew from the performs a seek into the dataset we report for it are almost en tirely dominated by CPU overhead. last point it previously sto pped indexing. This strategy is Orkut (http://orkut.com/) The second dataset is the Orkut: -known block nested loops strategy conceptually similar to the well user is represented by a binary social network, in which each employed by database system s for out-of-core table joins [18] . Due A user’s vector has a 1 in any vector over the space of all users. to the nature in which vectors are removed from the index during user has listed as a dimension that represents himself or anyone the matching, it is possible for the partial index to become completely hip is treated cted since friends “friend.” The Orkut graph is undire implementation immediately empty. Should this happen, our as a symmetric relationship. It c onsists of almost 20 million nodes an and proceeds to indexing and terminates the current dataset sc (vectors) and 2 billion links (non- zero weights), yielding roughly matching against the next dataset block. 136

7 Session: Similarity Search WWW 2007 / Track: Data Mining 1e+007 1e+008 QSC Orkut Orkut DBLP 1e+006 1e+007 100000 1e+006 10000 Freqency Result pairs 100000 1000 10000 100 100 1000 1 0. 0.5 0.6 0.7 0.8 9 10 Degree Cosine Similarity social network. n of the Orkut Figure 6. Degree distributio size for a given threshold. Figure 7. Algorithm output by having ProbeOpt-sort apply implementation for cosine distance e. Similar pairs from this data 100 features per vector on averag the same cosine distance upp erbounding technique used by All- reflect users who have a signifi cant overlap in friends. This te evaluation loop. We did not Pairs-Binary within its candida a variety of ways. For example, similarity data could be used in compare against a variant of ProbeOpt-sort called ProbeOpt- users with high similarity scores who are not already friends may cluster since ProbeOpt-cluster was found to be only slightly faster. be good targets for introductions. To shed more insight on the impact of the optimizations within The degree distribution for the Or kut graph appear in Figure 6. ion of the algorithm which, like All-Pairs, we implemented a vers As can be clearly seen from this figure, Orkut limits the number of ly on the dataset sort order. The ProbeOpt-sort, did not explicitly re dataset is too large to fit into friends of each user to 1000. This algorithm is called All-Pairs-Unsorted within the graphs. The so all experimental results on main memory on our test machine, difference between All-Pairs-Unsorted and All-Pairs is that All- ime of the out-of-core algorithm this dataset report the wall-clock t explicitly expl oit the sort order of the Pairs-Unsorted does not igured to halt indexing and enter variants. These variants were conf on the sorted datasets.) So All- records (even though we still run it loading 250 million non-zero the matching only phase after Pairs-Unsorted uses only () maxweight V () and not maxweight x weights. Each algorithm thus requ ired 8 passes over the dataset to i ctor needs to be indexed. For to determine how much of the ve process it in its entirety. binary datasets, the reduced i ndexing condition of All-Pairs- DBLP: The third dataset is compiled from a snapshot of the 2 Unsorted is t < . All-Pairs-Unsorted ⁄ bv ⁄ t < bv instead of DBLP data as described in [1] . Our snapshot consisted of almost pruning method that iteratively also cannot exploit the minsize r vector on average. The total 800,000 vectors and 14 features pe removes small records from the front of the inverted lists. number of dimensi ons was roughly 530,000. Results for all three datasets appear in Figure 8. Even though it All datasets are sorted in in vector size. creasing order of endent optimizations, All-Pairs- lacks any explicit sort-order-dep Runtimes are for the sorted dataset even if the algorithm does not Unsorted always outperforms ProbeOp t-sort, typically by at least a explicitly exploit the sort orde r. We confirmed the finding from factor of 2. Much of the speedu p is due to indexing only part of previous work [21] that performance was always better on the ed score accumulation in place each input vector, though hash-bas sorted dataset. We did not include the time required to perform the ctor. Note that for all datasets, of queue-based merging is also a fa dataset sort, which for QSC and DBLP was negligible. Because the sort-dependent optimizations lead to an even larger Orkut required an out-of-core sort (the linux “sort” command was performance gain. On the QSC dataset, All-Pa irs outperforms used with a 2 gigabyte memory buffer), sorting required 21 ProbeOpt-sort from a factor of 22x at .9 threshold to a factor of 8x minutes. at .7. ProbeOpt-sort failed to complete at lower threshold settings We use similarity thresholds between .5 and .9 in our within the 1000 minute imposed. time limit we experiments. We chose .5 as th e lowest setting based on previous On the Orkut dataset, All-Pairs is much faster (2-3x) than All- that found semantically similar work on semantic query similarity Pairs-Unsorted, but the performance difference is less than query pairs to typically have scores above .5 [19] . The number of datasets for the following reasons: (1) The witnessed on the other Figure 7 result pairs for each threshold and each dataset appears in . artificial 1000 friend limit prevents highly frequent features -- a 5.2 Comparison of Inverted List-Based Approaches key source of inefficiency, and (2) the fact that the data was disk- based approach, We compared All-Pairs to anothe r inverted-list resident led to some IO overhead, which was similar across all [21] . ProbeOpt-sort is an algorithm that ProbeOpt-sort riant modified to handle out-of- algorithms. Our ProbeOpt-sort va dynamically builds an inverted i ndex and consults the index to the approach described core datasets based on was not Section 4.6 All-Pairs. Unlike All-Pairs, it identify candidate vectors much like es even with a .9 similarity able to complete within 1000 minut s use the similarity threshold to indexes every feature, but it doe threshold. We suspect ProbeOpt-sort performed poorly on this that must be scanned in full in order to reduce the inverted lists dataset due to the longer average vector size. generate candidates. For each candidate identified by the inverted Performance characteristics on the DBLP data was similar to lists are probed using a list scans, any remain ing relevant inverted QSC, with All-Pairs providing a speedup of approximately 6x over doubling binary search. Before each binary search, a score ProbeOpt-Sort across all similarity thresholds. upperbound is computed, and if the score upperbound falls beneath 5.3 Comparison with Signature-Based Methods the threshold, the algorithm mo ves on to the next candidate. two signature-based methods Here we compare All-Pairs with ProbeOpt-sort was presented in for the Overlap similarity [21] for finding all similar pairs: LSH [11] and PartEnum [1] . LSH es comparison, we modified our metric, so for an apples-to-appl 137

8 WWW 2007 / Track: Data Mining Session: Similarity Search re based algorithm that like PE (PartEnum) is another signatu QSC All-Pairs is guaranteed to genera te all pairs that satisfy the similarity threshold. The intuition is that a similarity threshold t 1000 on the Hamming distance k can be converted to a threshold e generated in such that any two between two vectors. Signatures ar stance is less than vectors whose Hamming di k will have at least one matching signature. ard similarity to The conversion of Jacc 100 : [1] Hamming distance is covered in 1 t – () --------------- k ⋅ xy + () = + () 1 t 10 Time (minutes) To use PartEnum with the cosine similarity metric, we derived the following conversion from co sine similarity to Hamming ProbeCount-sort 1 distance: All-Pairs-Unsorted All-Pairs = ⋅⋅ ⋅ txy + – kxy 2 1 0. 9 0.8 0.7 0.6 0.5 Given a vector l x on the size of any vector , an upper bound u Cosine Similarity that matches x with cosine similarity at least t is as follows: 2 Orkut l xt ⁄ = u 1000 We use this bound to partition the vectors into groups by size, All-Pairs-Unsorted such that only the vectors in each group or in adjacent groups need All-Pairs e same approach as in Figure 6 to be compared. Since we follow th of cept to point out that the , we omit further details, ex [1] 100 ) and (c) in Figure 6 for cosine equations corresponding to lines (b 2 = t = ⁄ ⋅⋅ r l r () – t 21 k in line (b), and similarity are: i i i i in line (c). s for PartEnum, LSH and All- Figure 9 shows the running time 10 Pairs on the DBLP data using both cosine similarity (left graph) Time (minutes) and Jaccard similarity (right gra ph). With cosine similarity, all but the highest 2 threshold PartEnum fails to complete for settings within the 200 minute time limit we imposed; at these settings, All-Pairs is 16 to 700 times faster. The speedup given 1 0.6 0.5 9 0.8 0.7 0. Jaccard similarity is less dramatic, but still at least an order of magnitude. The reason for the wi der performance discrepancy Cosine Similarity when using cosine distance is that the bound on Hamming distance DBLP is a factor of + t 1 higher. Without tight bounds on Hamming 100 candidates to distance, PartEnum needs to generate many ProbeCount-sort though LSH only provides an guarantee completeness. Even All-Pairs-Unsorted approximate solution to the probl em, All-Pairs outperforms it by 2 All-Pairs to 15 times for cosine similarity , and 1.3 to 6 times for Jaccard 10 similarity. This discrepancy in spee dups is due to the fact that at the same threshold, LSH needs to use more signatures for the cosine similarity than Jaccard similarity to ensure a 5% or lower 1 false negative rate. To understand the performance advantage of All-Pairs over the Time (minutes) signature based schemes, note that signature based algorithms fully 0.1 scan each vector in a candidate pair to compute their similarity score. So for example if a vector appears in 10 candidate pairs, x x will be scanned in its entirety 10 times. In contrast, All-Pairs 0.01 once for all vectors shorter than y . For vectors only scans x x 0.5 0.7 0. 0.8 0.6 9 ) that x , only the portions of x longer than x (the indexed part of ' Cosine Similarity intersect y are scanned. The unindexed portion of x is rarely scanned due to the effectivenes Figure 8. Inverted List Algorithm Performance s of the score bound condition. All- Pairs also exhibits better cache locality, since it sequentially scans inverted lists, rather than jumping between candidate pairs. Finally, is a well-known approximate (Locality Sensitive Hashing) gnature based schemes are less and perhaps most importantly, si d random projections m. It first generates algorithm for the proble effective in reducing the numbe r of candidate pairs at lower for each input vector and each of the set of dimensions. Next, random projection, the algorit hm generates a signature by combining the vector fe atures belonging to th at projection. Thus d 1 We give the derivation for completene ss. Hamming distance is defined as signatures are generated for each vector. If two vectors have a that satisfy a cosine sim y and x . Given vectors () - xy 2 – +dot xy , ⋅ the same projection, a candidate pair is matching signature along t ilarity threshold : generated. The false nega tive rate is controlled using the number of = , and thus: cos xy , () dot xy , () xy ⋅ () ⁄ t ≥ projections to allow for at most 5% . In our experiment, we set d d () txy , ⋅ ≥ . Finally, we get: xy dot ⋅ false negatives measured as actual error. txy +dot , () ⋅ xy 2 – + xy ⋅ ⋅⋅ ≤ – 2 xy 138

9 Session: Similarity Search WWW 2007 / Track: Data Mining DBLP DBLP 1000 1000 100 100 10 10 1 1 Time (minutes) Time (minutes) 0.1 0.1 PartEnum PartEnum LSH LSH All-Pairs All-Pairs 0.01 0.01 0.75 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.7 0.5 0.65 0.9 0.85 0.8 0.6 0.55 Jaccard Similarity Cosine Similarity Figure 9. Signature-Ba sed Algorithm Performance Jaccard coefficient: worth noting that both PartEnum similarity thresholds. It is also and LSH performance are highly dependent on their parameter ' , dot () xy b ------------------------------------------- ------------------------------------------- ≤ settings. We spent considerable time tuning parameters to optimize – + xy xy dot xy dot xy , () – + , () their runtime. x dot xy , () Note that ≤ , hence: 6. OTHER SIMILARITY MEASURES b b b -------------------------------------------- - ---------------------------- ----- ≤ < t = icitly exploits the cosine Although our main algorithm expl + xy ' , – xyx dot xy + – () y similarity metric, it can be genera lized to several other similarity Given that the existing reduced indexing condition is sufficient, bility with the measures. We demonstrate this ge neralization capa the only modifications required are of Find-Matches-Binary where following common similarity measures: Jaccard coefficient, dice upperbound the similarity score, we compute the similarity score, coefficient, and overlap coefficient. These similarity measures are and compute a monotone minimum size constraint on candidate defined over binary vect or inputs as follows: vectors. We forego providing detail s to these variations due to space constraints, though they can be derived by appropriately adapting the techniques used to handle cosine distance. Similar xy dot , () - ------------------- () , xy sim = Cosine similarity functions over e also possible for other generalizations ar xy ⋅ weighted vector data. Performance of the All-Pairs algorithm when using these dot xy () , Figure alternate similarity metrics appears in the left-hand graph of --------------------------- xy , () sim = Overlap , xy min () . Note that there is a modest increase in runtime for some of 10 these alternate metrics. The explanation for this runtime increase can be seen from the accompanying graph which reports the xy 2 dot , () ⋅ --------------------------- sim = () , xy Dice output. For the same threshold number of similar pairs in the + xy setting, the overlap measure allows many more pairs, which results ector intersections that must be in an increase in the number of v , xy dot () ------------------------------------------- () , xy sim = Jaccard slower than cosine distance performed by Find-Matches. Dice is xy xy – + , dot () despite producing similar size output because the minsize of as effectively. matching vectors cannot be bounded Table 1: Similarity Measures 7. CONCLUSIONS AND FUTURE WORK ng condition used in All-Pairs- Conveniently, the reduced indexi of similar vectors arises in The problem of finding all pairs Binary already preserves comp leteness given any of these finement for search engines and many applications such as query re similarity measures. Recall that the algorithm indexes all but a collaborative filtering. We presente d two key insights in this paper number of features t ⁄ by such that y of a given vector b . We < not identified in previous work: therefore need only show that the unindexed portion of the vector • An inverted list based approach to the problem need not build a ), which consists of the last b (denoted y ' y unit valued weights in complete inverted index over the vector input. , contributes an amount to the sim ilarity score that alone is not y • Appropriately ordering the vectors in addition to the dimensions and any y enough to meet the threshold. For an indexed vector can vastly reduce the search space. ), the following xy following it in the ordering ( x vector ≥ these insights to produce an We aggressively exploited inequalities establish these results. to implement and does not require algorithm, All-Pairs, that is easy Overlap coefficient: any parameter tuning. The simplicity of the approach is combined xy ' , () b dot b We found All-Pairs to be 5 to 22 with dramatic performance gains: --------------------------- --------------------------- ----- ≤ = t < min , xy () min () , xy y between 10 to over 700 times faster than ProbeCount-sort, and times faster than PartEnum. All-Pairs is even 1.3 to 15 times faster Dice coefficient: LSH tuned to have a 5% false than an approximate algorithm, ⋅ () 2dot xy ' , b b 2 ---------------------------- ----- ----------------- t = ≤ < negative rate. xy + yy + y 139

10 Session: Similarity Search WWW 2007 / Track: Data Mining DBLP DBLP 1e+08 100 Overlap Overlap Dice Dice Cosine Cosine Jaccard Jaccard 1e+07 10 1e+06 Result pairs 1 Time (minutes) 100000 10000 0.1 0.65 0.7 0.75 0.8 0.85 0.9 0.5 0.55 0.6 0.7 0.5 0.85 0.8 0.75 0.9 0.55 0.65 0.6 th alternate similarity functions. Figure 10. All-Pairs performance wi While we have focused primarily on scalability with respect to [10] R. Fagin, R. Kumar, & D. Sivakumar (2003). Efficient cation via Rank Aggregation. In Similarity Search and Classifi the number of records in sparse da ta, we believe scalability of all- Proc. of the 2003 ACM-SIGMOD Int’l Conf. on Management pairs similarity search algorithms with respect to feature density , 301-312. of Data merits further study. Finally, in addition to developing further [11] A. Gionis, P. Indyk, & R. Mo twani (1999). Similarity Search performance improvements for th e all-pairs similarity search in High Dimensions via Hashing. In Proc. of the 25th Int’l problem, we believe future work might address how algorithms for Conf. on Very Large Data Bases , 518-529. imitives for other mining tasks. the problem can serve as useful pr [12] P. Indyk, & R. Motwani (1998). Approximate Nearest Interesting possibilities include exploiting all similar pairs for Neighbors: Towards Removing th e Curse of Dimensionality. istic clustering approaches, improving the quality of heur , Proc. of the 30th Symposium on the Theory of Computing In performing deeper social network analysis, or in improving 604-613. performance of related problems [3] . l, & A. El Abbadi (2007). [13] A. Metwally, D. Agrawa DETECTIVES: DETEcting Coalition hiT Inflation attacks in 8. ACKNOWLEDGEMENTS adVertising nEtworks Streams. In Proc. of the 16th Int’l Conf. We thank Ellen Spertus for her assistance with the Orkut data, on the World Wide Web , to appear. and Tim Heilman for his assistance with generating datasets for Wilkinson, & J. Zobel (1994). [14] A. Moffat, R. Sacks-Davis, R. semantic query similarity. Retrieval of partial documents. In The Second Text REtrieval 181-190. Conference, 9. REFERENCES . Self-indexing inverted files for [15] A. Moffat & J. Zobel (1996) ACM Transactions on Information Systems fast text retrieval. , [1] A. Arasu, V. Ganti, & R. Ka ushik (2006). Efficient Exact Set- 14(4):349-379. Similarity Joins. In Proc. of the 32nd Int’l Conf. on Very Large Data Bases st ranking. In [16] M. Persin (1994). Document filtering for fa , 918-929. Proc. of the 17th Annual Int’l Conf. on Research and [2] D. Beeferman & A. Berger (2 000). Agglomerative Clustering 339-348. Development in Information Retrieval, of a Search Engine Query Log. In Proc. of the 6th ACM- Sacks-Davis (1994). Fast document [17] M. Persin, J. Zobel, & R. SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining ranking for large scale info rmation retrieval. In Proc. of the , 407-416. First Int’l Conf. on Applications of Databases, Lecture Notes [3] C. Böhm, B. Braunmüller, M. Breunig, & H.-P. Kriegel , 253-266. in Computer Science v819 ring Based on the Similarity (2000). High Performance Cluste [18] R. Ramakrishnan & Join. In Database Management Proc. of the 2000 ACM CIKM International J. Gehrke (2002). Systems. Conference on Information and Knowledge Manageme nt, McGraw-Hill; 3rd edition. 298-305. [19] M. Sahami & T. Heilma n (2006). A Web-based Kernel Function for Measuring the Simila [4] A. Z. Broder, S. C. Glassman, M. S. Manasse, & G. Zweig rity of Short Text Snippets. Proc. of the 15th Int’l Conf. on the World Wide Web , 377- Proc. of the 6th (1997). Syntactic clustering of the Web. In In 386. Int’l World Wide Web Conference , 391-303. [20] E. Spertus, M. Sahami, & O. Buyukkokten (2005). Evaluating [5] C. Buckley & A. F. Lewit (1985). Optimization of Inverted Proc. of the Eight Annual Int’l Conf. on Similarity Measures: A Large Sc ale Study in the Orkut Social Vector Searches. In Network. In Research and Development in Information Retrieval , 97-110. Proc. of the 11th ACM-SIGKDD Int’l Conf. on Knowledge Discovery in Data Mining , 678-684. [6] M. S. Charikar (2002). Similarity Estimation Techniques from (2004). Efficient Set Joins on [21] S. Sarawagi & A. Kirpal Rounding Algorithms. In Proc. of the 34th Annual Symposium Proc. of the ACM SIGMOD Int’l Similarity Predicates. In on Theory of Computing , 380-388. Conf. on Management of Data , 743-754. R. Kaushik (2006). A Primitive [7] S. Chaudhuri, V. Ganti, & Proc. of the Operator for Similarity Joins in Data Cleaning. In B. Croft (2005). Optimization [22] T. Strohman, H. Turtle, & W. 22nd Int’l Conf on Data Engineering, Strategies for Complex Queries. In Proc. of the 28th Annual 5. Int’l ACM-SIGIR Conf. on Research and Development in [8] S. Chien & N. Immorlica (2005). Semantic Similarity Information Retrieval, 219-225. Between Search Engine Querie s Using Temporal Correlation. , 2-11. Proc. of the 14th Int’l World Wide Web Conference ery Evaluation: Strategies and In [23] H. Turtle & J. Flood (1995). Qu Optimizations. In Information Processing & Management , [9] S.-L. Chuang & L.-F. Chien (2005). Taxonomy Generation for 31(6), 831-850. ACM Text Segments: A Practical Web-Based Approach. In Transactions on Information Systems , 23(4), 363-396. 140

LANDS AVAILABLE FOR TAXES Please Allow up to 30 days for a response to your request for an updated minimum bid. Information must be obtained from the Tax Collector’s Office to determine the current bi...

More info »