The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels


Deepayan Chakrabarti
IROM, McCombs School of Business, University of Texas at Austin
[email protected]

Purnamrita Sarkar
Department of Statistics, University of Texas at Austin
[email protected]

Peter Bickel
Department of Statistics, University of California, Berkeley
[email protected]

Abstract

Link prediction and clustering are key problems for network-structured data. While spectral clustering has strong theoretical guarantees under the popular stochastic blockmodel formulation of networks, it can be expensive for large graphs. On the other hand, the heuristic of predicting links to nodes that share the most common neighbors with the query node is much faster, and works very well in practice. We show theoretically that the common neighbors heuristic can extract clusters with high probability when the graph is dense enough, and can do so even in sparser graphs with the addition of a "cleaning" step. Empirical results on simulated and real-world data support our conclusions.

1 Introduction

Networks are the simplest representation of relationships between entities, and as such they have attracted significant attention recently. Their applicability ranges from social networks such as Facebook, to collaboration networks of researchers, citation networks of papers, trust networks such as Epinions, and so on. Common applications on such data include ranking, recommendation, and user segmentation, which have seen wide use in industry. Most of these applications can be framed in terms of two problems: (a) link prediction, where the goal is to find a few nodes most similar to a given query node, and (b) clustering, where we want to find groups of similar individuals, either around a given seed node or as a full partitioning of all nodes in the network.

An appealing model of networks is the stochastic blockmodel, which posits the existence of a latent cluster for each node, with link probabilities between nodes being simply functions of their clusters. Inference of the latent clusters allows one to solve both the link prediction problem and the clustering problem (predict all nodes in the query node's cluster). Strong theoretical and empirical results have been achieved by spectral clustering, which uses the singular value decomposition of the network followed by a clustering step on the eigenvectors to determine the latent clusters. However, the singular value decomposition can be expensive, particularly (a) for large graphs and (b) when many eigenvectors are desired. Unfortunately, both of these are common requirements. Instead, many fast heuristic methods are often used, and are empirically observed to yield good results [8]. One particularly common and effective method is to predict links to nodes that share many "common neighbors" with the query node q, i.e., to rank nodes by |CN(q, i)|, where CN(q, i) = {u : q ~ u ~ i} (i ~ j represents an edge between i and j).
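As a concrete illustration (ours, not from the paper), ranking by |CN(q, i)| can be computed directly from adjacency sets; the representation and function names below are our own:

    def common_neighbor_scores(adj, q):
        # Rank candidate nodes i by |CN(q, i)|, given a dict of adjacency sets adj[v].
        scores = {}
        for i in adj:
            if i == q or i in adj[q]:
                continue  # skip the query node and its existing neighbors
            scores[i] = len(adj[q] & adj[i])  # |{u : q ~ u ~ i}|
        return sorted(scores.items(), key=lambda kv: -kv[1])

    # Example: node 4 shares the two neighbors {2, 3} with node 1.
    adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
    print(common_neighbor_scores(adj, 1))  # [(4, 2)]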

The intuition is that q probably has many links to others in its cluster, and hence probably also shares many common friends with others in its cluster. Counting common neighbors is also particularly fast (it is a Join operation supported by all databases and Map-Reduce systems).

In this paper, we study the theoretical properties of the common neighbors heuristic. Our contributions are the following: (a) We present, to our knowledge, the first theoretical analysis of common neighbors under the stochastic blockmodel. (b) We demarcate two regimes, which we call semi-dense and semi-sparse, under which common neighbors can be successfully used for both link prediction and clustering. (c) In particular, in the semi-dense regime, the number of common neighbors between the query node q and another node within its cluster is significantly higher than that with a node outside its cluster; hence, a simple threshold on the number of common neighbors suffices for both link prediction and clustering. (d) In the semi-sparse regime, however, there are too few common neighbors with any node, and hence the plain heuristic does not work; we show that with a simple additional "cleaning" step, we regain the theoretical properties shown for the semi-dense case. (e) We empirically demonstrate the effectiveness of counting common neighbors followed by the "cleaning" post-processing step on a variety of simulated and real-world datasets.

2 Related Work

Link prediction has recently attracted a lot of attention, because of its relevance to important practical problems like recommendation systems, predicting future connections in friendship networks, better understanding the evolution of complex networks, and the study of missing or partial information in networks [9, 8]. Algorithms for link prediction fall into two main groups: similarity-based and model-based.

Similarity-based methods: These methods use similarity measures based on network topology for link prediction. Some methods look at nodes two hops away from the query node: counting common neighbors, the Jaccard index, the Adamic-Adar score [1], etc. More complex methods include nodes farther away, such as the Katz score [7], and methods based on random walks [16, 2]. These are often intuitive, easily implemented, and fast, but they typically lack theoretical guarantees.

Model-based methods: The second approach estimates parametric models for predicting links. Many popular network models fall in the latent variable model category [12, 3]. These models assign n latent random variables Z := (Z_1, Z_2, ..., Z_n) to the n nodes in a network. These variables take values in a general space Z, and the probability of linkage between two nodes is specified via a symmetric map h : Z x Z → [0, 1]. The Z_i's can be i.i.d. Uniform(0, 1) [3], or positions in some d-dimensional latent space [12]. In [5], a mixture of multivariate Gaussian distributions is used, one for each cluster. A stochastic blockmodel [6] is a special class of these models, where Z_i is a binary length-K vector encoding the cluster membership of node i. In a well-known special case (the planted partition model), all nodes in the same cluster connect to each other with probability α, whereas all pairs in different clusters connect with probability γ. In fact, under broad parameter regimes, the blockmodel approximation of networks has recently been shown to be analogous to the use of histograms as non-parametric summaries of an unknown probability distribution [11].
Varying the number of bins or the bandwidth corresponds to varying the number or size of communities. Thus blockmodels can be used to approximate more complex models (under broad smoothness conditions) if the number of blocks is allowed to increase with n.

Empirical results: As the models become more complex, they also become computationally demanding. It has been commonly observed that simple and easily computable measures like common neighbors often have competitive performance with more complex methods.
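Since the planted partition model recurs throughout, here is a minimal sampler sketch (our own illustration; for simplicity it assigns clusters uniformly at random rather than in exactly equal sizes):

    import numpy as np

    def sample_planted_partition(n, K, alpha, gamma, seed=0):
        # Within-cluster edges appear with probability alpha, across-cluster with gamma.
        rng = np.random.default_rng(seed)
        clusters = rng.integers(K, size=n)                # latent cluster of each node
        same = clusters[:, None] == clusters[None, :]
        probs = np.where(same, alpha, gamma)
        A = (rng.random((n, n)) < probs).astype(int)
        A = np.triu(A, k=1)                               # one draw per pair, no self-loops
        return A + A.T, clusters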

This behavior has been empirically established across a variety of networks, from co-authorship networks [8] to router-level internet connections, protein-protein interaction networks, and the electrical power grid network [9].

Theoretical results: Spectral clustering has been shown to asymptotically recover cluster memberships for variations of stochastic blockmodels [10, 4, 13]. However, apart from [15], there is little understanding of why simple methods such as common neighbors perform so well empirically.

Given their empirical success and computational tractability, the common neighbors heuristic is widely applied to large networks. Understanding the reasons for the accuracy of common neighbors under the popular stochastic blockmodel setting is the goal of our work.

3 Proposed Work

Many link prediction methods ultimately make two assumptions: (a) each node belongs to a latent "cluster", where nodes in the same cluster have similar behavior; and (b) each node is very likely to connect to others in its cluster, so link prediction is equivalent to finding other nodes in the cluster. These assumptions can be relaxed: instead of belonging to the same cluster, nodes could have "topic distributions", with links being more likely between pairs of nodes with similar topical interests. However, we will focus on the assumptions stated above, since they are clean and the relaxations appear to be fundamentally similar.

Model. Specifically, consider a stochastic blockmodel where each node i belongs to an unknown cluster c_i ∈ {C_1, ..., C_K}. We assume that the number of clusters K is fixed as the number of nodes n increases. We also assume that each cluster has nπ = n/K members, though this can be relaxed easily. The probability P(i ~ j) of a link between nodes i and j (i ≠ j) depends only on the clusters of i and j: P(i ~ j) = B_{c_i c_j} = α·1{c_i = c_j} + γ·1{c_i ≠ c_j} for some α > γ > 0; in other words, the probability of a link is α between nodes in the same cluster, and γ otherwise. By definition, P(i ~ i) = 0. If the nodes were arranged so that all nodes in a cluster are contiguous, then the corresponding matrix, when plotted, attains a block-like structure, with the diagonal blocks (corresponding to links within clusters) being denser than the off-diagonal blocks (since α > γ). Under these assumptions, we ask the following two questions:

Problem 1 (Link Prediction and Recommendation). Given node i, how can we identify at least a constant number of nodes from c_i?

Problem 2 (Local Cluster Detection). Given node i, how can we identify all nodes in c_i?

Problem 1 can be considered the problem of finding good recommendations for a given node i. Here, the goal is to find a few good nodes that i could connect to (e.g., recommending a few possible friends on Facebook, or a few movies to watch next on Netflix). Since within-cluster links have higher probability than across-cluster links (α > γ), predicting nodes from c_i gives the optimal answer. Crucially, it is unnecessary to find all good nodes. As against that, Problem 2 requires us to find everyone in the given node's cluster. This is the problem of detecting the entire cluster corresponding to a given node. Note that Problem 2 is clearly harder than Problem 1. We next present a summary of our results and the underlying intuition before delving into the details.
3.1 Intuition and Result Summary

Current approaches. Standard approaches to inference for the stochastic blockmodel attempt to solve an even harder problem:

Problem 3 (Full Cluster Detection). How can we identify the latent clusters c_i for all i?

A popular solution is via spectral clustering, involving two steps: (a) computing the top K eigenvectors of the graph Laplacian, and (b) clustering the projections of each node onto the corresponding eigenspace via an algorithm like k-means [13].
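For reference, the two steps can be rendered schematically as follows (our own sketch; this is not the precise variant analyzed in [10, 13]):

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(A, K):
        # Top-K eigenvectors of the normalized Laplacian, then k-means on the rows.
        deg = A.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1))    # guard against isolated nodes
        L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        vals, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
        U = vecs[:, :K]                                   # K smallest eigenvalues of L
        return KMeans(n_clusters=K, n_init=10).fit_predict(U)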

A slight variation of this has been shown to work as long as (α − γ)/√α = Ω(log n/√n) and the average degree grows faster than poly-logarithmic powers of n [10].

However, (a) spectral clustering solves a harder problem than Problems 1 and 2, and (b) eigen-decompositions can be expensive, particularly for very large graphs. Our claim is that a simpler operation, counting common neighbors between nodes, can yield results that are almost as good in a broad parameter regime.

Common neighbors. Given a node i, link prediction via common neighbors follows a simple prescription: predict a link to the node j that shares the maximum number of friends with i, where CN(i, j) = {u : i ~ u ~ j}. The usefulness of common neighbors has been observed in practice [8] and justified theoretically for the latent distance model [15]. However, its properties under the stochastic blockmodel remained unknown. Intuitively, we would expect a pair of nodes i and j from the same cluster to have many common neighbors u from the same cluster, since both links i ~ u and u ~ j occur with probability α, whereas for c_u ≠ c_i, at least one of the edges i ~ u and u ~ j must have the lower probability γ:

    P(u ∈ CN(i, j) | c_i = c_j) = α² P(c_u = c_i | c_i = c_j) + γ² P(c_u ≠ c_i | c_i = c_j) = πα² + (1 − π)γ²

    P(u ∈ CN(i, j) | c_i ≠ c_j) = αγ P(c_u = c_i or c_u = c_j | c_i ≠ c_j) + γ² P(c_u ≠ c_i, c_u ≠ c_j | c_i ≠ c_j) = 2παγ + (1 − 2π)γ²

so that P(u ∈ CN(i, j) | c_i = c_j) − P(u ∈ CN(i, j) | c_i ≠ c_j) = π(α − γ)² > 0.

Thus the expected number of common neighbors E[|CN(i, j)|] is higher when c_i = c_j. If we can show that the random variable |CN(i, j)| concentrates around its expectation, node pairs with the most common neighbors would belong to the same cluster. Thus, common neighbors would offer a good solution to Problem 1. We show conditions under which this is indeed the case. There are three key points regarding our method: (a) handling dependencies between common neighbor counts, (b) defining the graph density regime under which common neighbors is consistent, and (c) proposing a variant of common neighbors which significantly broadens this region of consistency.

Dependence. CN(i, j) and CN(i, j') are dependent; hence, distinguishing between within-group and outside-group nodes can be complicated even if each |CN(i, j)| concentrates around its expectation. We handle this via a careful conditioning step.

Dense versus sparse graphs. In general, the parameters α and γ can be functions of n, and we can try to characterize the parameter settings under which common neighbors consistently returns nodes from the same cluster as the input node. We show that when the graph is sufficiently "dense" (the average degree grows faster than √(n log n)), common neighbors is powerful enough to answer Problem 2; moreover, (α − γ)/α can go to zero at a suitable rate. On the other hand, for sparser graphs, the expected number of common neighbors between nodes tends to zero, irrespective of whether the nodes are in the same cluster or not. Further, the standard deviation is of a higher order than the expectation, so there is no concentration. In this case, counting common neighbors fails, even for Problem 1.
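The two conditional probabilities displayed above are easy to sanity-check numerically; the following Monte Carlo sketch is ours, with arbitrary parameter values:

    import numpy as np

    def cn_prob(pi, alpha, gamma, same_cluster, trials=10**6, seed=0):
        # Monte Carlo estimate of P(u in CN(i, j)) for a uniformly random third node u.
        rng = np.random.default_rng(seed)
        c_u = rng.random(trials)                  # u's cluster, as a uniform label in [0, 1)
        p_iu = np.where(c_u < pi, alpha, gamma)   # u shares i's cluster with probability pi
        if same_cluster:                          # c_i = c_j: the link u~j behaves like u~i
            p_uj = p_iu
        else:                                     # c_i != c_j: j occupies a disjoint slice
            p_uj = np.where((c_u >= pi) & (c_u < 2 * pi), alpha, gamma)
        both = (rng.random(trials) < p_iu) & (rng.random(trials) < p_uj)
        return both.mean()

    pi, alpha, gamma = 0.25, 0.10, 0.02
    print(cn_prob(pi, alpha, gamma, True), pi * alpha**2 + (1 - pi) * gamma**2)          # ~0.0028
    print(cn_prob(pi, alpha, gamma, False), 2*pi*alpha*gamma + (1 - 2*pi) * gamma**2)    # ~0.0012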
A variant with better consistency properties. However, we show that the addition of an extra post-processing step (henceforth, the "cleaning" step) still enables common neighbors to identify nodes from the query node's own cluster, while reducing the number of off-cluster nodes to zero with probability tending to one as n → ∞. This requires a stronger separation condition between α and γ. Moreover, such "strong consistency" is only possible when the average degree grows faster than n^{1/3} log n. Thus, the cleaning step extends the consistency of common neighbors beyond the O(1/√n) range.

4 Main Results

We first split the edge set of the complete graph on n nodes into two sets: K_1 and its complement K_2 (independent of the given graph G). We compute common neighbors on G_1 = G ∩ K_1 and perform a "cleaning" process on G_2 = G ∩ K_2. The adjacency matrices of G_1 and G_2 are denoted by A_1 and A_2. We will fix a reference node q, which without loss of generality belongs to class C_1 (recall that there are K clusters C_1, ..., C_K, each of size nπ).

Let X_i (i ≠ q) denote the number of common neighbors between q and i. Algorithm 1 computes the set S = {i : X_i ≥ t_n} of nodes that have at least t_n common neighbors with q on A_1, whereas Algorithm 2 does a further degree thresholding on A_2 to refine S into S_1.

Algorithm 1 Common neighbors screening algorithm
1: procedure Scan(A_1, q, t_n)
2:     for 1 ≤ i ≤ n: X_i ← A_1²(q, i)
3:     X_q ← 0
4:     S ← {i : X_i ≥ t_n}
5:     return S

Algorithm 2 Post-selection cleaning algorithm
1: procedure Clean(S, A_2, q, s_n)
2:     S_1 ← {i : Σ_{j ∈ S} A_2(i, j) ≥ s_n}
3:     return S_1
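A direct Python rendering of Algorithms 1 and 2 (a sketch under our own conventions: A1 and A2 are dense 0/1 numpy arrays, and the split below uses an independent fair coin per pair rather than a fixed K_1):

    import numpy as np

    def split_edges(A, seed=0):
        # Split G into G1 and G2 via an independent random mask over node pairs.
        rng = np.random.default_rng(seed)
        mask = np.triu(rng.random(A.shape) < 0.5, k=1)
        mask = mask | mask.T
        return A * mask, A * ~mask

    def scan(A1, q, t_n):
        # Algorithm 1: nodes with at least t_n common neighbors with q on A1.
        X = (A1 @ A1)[q]           # X_i = A1^2(q, i), the number of paths q-u-i
        X[q] = 0
        return np.flatnonzero(X >= t_n)

    def clean(S, A2, q, s_n):
        # Algorithm 2: keep nodes with at least s_n edges into S on A2.
        return np.flatnonzero(A2[:, S].sum(axis=1) >= s_n)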

To analyze the algorithms, we must specify conditions on graph densities. Recall that α and γ represent the within-cluster and across-cluster link probabilities. We assume that α/γ is constant while α, γ → 0; equivalently, assume that both α and γ are constant multiples of some ρ, where ρ → 0. The analysis of graphs has typically been divided into two regimes. The dense regime consists of graphs where the expected degree nρ is a fraction of n as n grows. In the sparse regime, nρ = O(1), so the degree is roughly constant. Our work explores a finer gradation, which we call semi-dense and semi-sparse, defined next.

Definition 4.1 (Semi-dense graph). A sequence of graphs is called semi-dense if nρ²/log n → ∞ as n → ∞.

Definition 4.2 (Semi-sparse graph). A sequence of graphs is called semi-sparse if nρ² → 0 but n^{2/3}ρ/log n → ∞ as n → ∞.

Our first result is that common neighbors suffice to solve not only the link prediction problem (Problem 1) but also the local clustering problem (Problem 2) in the semi-dense case. This is because even though nodes both within and outside the query node's cluster have a growing number of common neighbors with q, there is a clear distinction in the expected number of common neighbors between the two classes. Also, since the standard deviation is of a smaller order than the expectation, the random variables concentrate. Thus, we can pick a threshold t_n such that Scan(A_1, q, t_n) yields just the nodes in the same cluster as q with high probability. Note that the cleaning step (Algorithm 2) is not necessary in this case.

Theorem 4.1 (Algorithm 1 solves Problem 2 in semi-dense graphs). Let t_n = n(π(α + γ)²/2 + (1 − 2π)γ²). Let S be the set of nodes returned by Scan(A_1, q, t_n), and let n_w and n_o denote the number of nodes in S ∩ C_1 and S \ C_1 respectively. If the graph is semi-dense, and if (α − γ)/α ≥ (2√2/√π)(log n/(nα²))^{1/4}, then P(n_w = nπ) → 1 and P(n_o = 0) → 1.

Proof Sketch. We only sketch the proof here, deferring details to the supplementary material. Let d_qa = Σ_{i ∈ C_a} A_1(q, i) be the number of links from the query node q to nodes in cluster C_a, and let d_q = (d_q1, ..., d_qK). We first show that

    P(d_q ∈ Good) ≥ 1 − 1/n², where Good := { d_q1 ∈ nπα(1 ± ψ_n); d_qa ∈ nπγ(1 ± ψ_n) ∀ a ≠ 1 },    (1)
    ψ_n := √(6 log n/(nπγ)) = Θ(√(log n/(nρ))) → 0.    (2)

Conditioned on d_q, X_i is the sum of K independent Binomial(d_qb, B_{b,c_i}) random variables (b = 1, ..., K), representing the number of common neighbors between q and i via nodes in each of the K clusters, so E[X_i | d_q, i ∈ C_a] = d_qa α + (Σ_{b ≠ a} d_qb) γ. We have:

    ℓ_n := E[X_i | d_q ∈ Good, i ∈ C_1] ≥ n(πα² + (1 − π)γ²)(1 − ψ_n) =: η_1(1 − ψ_n),
    u_n := E[X_i | d_q ∈ Good, i ∈ C_a, a ≠ 1] ≤ n(2παγ + (1 − 2π)γ²)(1 + ψ_n) =: η_a(1 + ψ_n).

Note that t_n = (η_1 + η_a)/2, u_n ≤ t_n ≤ ℓ_n ≤ nα², and η_1 − t_n = t_n − η_a = nπ(α − γ)²/2 ≥ 4√(ℓ_n log n), where the last inequality uses the condition on (α − γ)/α in the theorem statement. We show:

    P(X_i ≤ t_n | d_q ∈ Good, i ∈ C_1) ≤ n^{−4/3+o(1)},
    P(X_i ≥ t_n | d_q ∈ Good, i ∈ C_a, a ≠ 1) ≤ n^{−4/3+o(1)}.

Conditioned on d_q, both n_w and n_o are sums of conditionally independent and identically distributed Bernoullis. Hence,

    P(n_w = nπ) ≥ P(d_q ∈ Good) · P(n_w = nπ | d_q ∈ Good) ≥ (1 − 1/n²)(1 − nπ · n^{−4/3+o(1)}) → 1,
    P(n_o = 0) ≥ P(d_q ∈ Good) · P(n_o = 0 | d_q ∈ Good) ≥ 1 − Θ(n^{−1/3+o(1)}) → 1.

There are two major differences between the semi-sparse and semi-dense cases. First, in the semi-sparse case, both expectations η_1 and η_a are of the order O(nρ²), which tends to zero. Second, the standard deviations of the common neighbor counts are of a larger order than the expectations. Together, this means that the numbers of common neighbors to within-cluster and outside-cluster nodes can no longer be separated; hence, Algorithm 1 by itself cannot work. However, after cleaning, the entire cluster of the query node q can still be recovered.

Theorem 4.2 (Algorithm 1 followed by Algorithm 2 solves Problem 2 in semi-sparse graphs). Let t_n = 1 and s_n = n(πα² + (1 − π)γ²)(α + γ)/2. Let S = Scan(A_1, q, t_n) and S^(c) = Clean(S, A_2, q, s_n), and let n_w^(c) and n_o^(c) denote the number of nodes in S^(c) ∩ C_1 and S^(c) \ C_1 respectively. If the graph is semi-sparse and πα ≥ 3(1 − π)γ, then P(n_w^(c) = nπ) → 1 and P(n_o^(c) = 0) → 1.

Proof Sketch. We only sketch the proof here, with details deferred to the supplementary material. The degree bounds of Eq. 1 and the expressions for E[X_i | d_q ∈ Good] hold even in the semi-sparse case. We can also bound the variances of the X_i (which are sums of conditionally independent Bernoullis):

    var(X_i | d_q ∈ Good, i ∈ C_1) ≤ E[X_i | d_q ∈ Good, i ∈ C_1].

Since the expected number of common neighbors vanishes and the standard deviation is of a larger order than the expectation, there is no hope for concentration; however, there are slight differences in the probability of having at least one common neighbor. First, by an application of the Paley-Zygmund inequality, we find:

    p_1 := P(X_i ≥ 1 | d_q ∈ Good, i ∈ C_1)
         ≥ E[X_i | d_q ∈ Good, i ∈ C_1]² / ( var(X_i | d_q ∈ Good, i ∈ C_1) + E[X_i | d_q ∈ Good, i ∈ C_1]² )
         ≥ η_1²(1 − ψ_n)²/(η_1 + η_1²) ≥ (1 − ψ_n)(1 − η_1)η_1, since η_1 → 0.

For a > 1, Markov's inequality gives:

    p_a := P(X_i ≥ 1 | d_q ∈ Good, i ∈ C_a, a ≠ 1) ≤ E[X_i | d_q ∈ Good, i ∈ C_a, a ≠ 1] = η_a(1 + ψ_n).

Even though p_a = Θ(nρ²) → 0, we have nπp_a → ∞, so we can again use concentration inequalities like the Chernoff bound to control n_w and n_o:

    P( n_w ≥ nπp_1 (1 − √(6 log n/(nπp_1))) ) ≥ 1 − n^{−3/4},
    P( n_o ≤ n(1 − π)p_a (1 + √(6 log n/(n(1 − π)p_a))) ) ≥ 1 − n^{−3/4}.

Unlike in the denser regime, n_w and n_o can be of the same order here. Hence, the candidate set S returned by thresholding the common neighbor counts has a non-vanishing fraction of nodes from outside q's community. However, this fraction is relatively small, which is what we exploit in the cleaning step.

Let θ_w and θ_o denote the expected number of edges in A_2 to S from a within-cluster node and an outside-cluster node, respectively. The separation condition in the theorem statement gives θ_w − θ_o ≥ 4√(θ_w log n). Setting the degree threshold s_n = (θ_w + θ_o)/2, we bound the probability of mistakes in the cleaning step:

    P( ∃ i ∈ C_1 s.t. Σ_{j ∈ S} A_2(i, j) ≤ s_n | d_q ∈ Good ) ≤ n^{−1/3+o(1)},
    P( ∃ i ∉ C_1 s.t. Σ_{j ∈ S} A_2(i, j) ≥ s_n | d_q ∈ Good ) ≤ n^{−1/3+o(1)}.

Removing the conditioning on d_q ∈ Good (as in Theorem 4.1) yields the desired result.

5 Experiments

We present our experimental results in two parts. First, we use simulations to support our theoretical claims. Next, we present link prediction accuracies on real-world collaboration networks to show that common neighbors indeed perform close to gold-standard algorithms like spectral clustering and the Katz score.

Implementation details: Recall that our algorithms are based on thresholding. When there is a large gap between the common neighbor counts of nodes inside and outside q's cluster (e.g., in the semi-dense regime), this thresholding is equivalent to running the k-means algorithm with k = 2 on the counts to find S in Algorithm 1; the same holds for finding S_1 in Algorithm 2. When the number of nodes with more than two common neighbors is less than ten, we instead define S as all nodes with at least one common neighbor (as in the semi-sparse regime). On the other hand, since the cleaning step works only when S is sufficiently large (so that degrees concentrate), we do not perform any cleaning when |S| < 30. While we used the split sample graph A_2 in the cleaning step for ease of analysis, we did the cleaning using the same network in the experiments.

Experimental setup for simulations: We use a stochastic blockmodel with 2000 nodes split into 4 equal-sized clusters. For each value of (α, γ) we pick 50 query nodes at random, and calculate the precision and recall of the result against the nodes of the query node's cluster (for any subset S and true cluster C, precision = |S ∩ C|/|S| and recall = |S ∩ C|/|C|). We report the mean precision and recall over 50 randomly generated graph instances.
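The k-means-based thresholding described above amounts to a two-group split of the one-dimensional common neighbor counts; a minimal sketch (our own rendering of these implementation details):

    import numpy as np
    from sklearn.cluster import KMeans

    def threshold_by_2means(X, q):
        # Pick S as the high-count group of a 2-means split of the counts X_i.
        X = np.asarray(X, dtype=float).copy()
        X[q] = 0.0                                        # exclude the query node
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X.reshape(-1, 1))
        high = np.argmax([X[labels == c].mean() for c in (0, 1)])
        return np.flatnonzero((labels == high) & (np.arange(len(X)) != q))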
Accuracy on simulated data: Figure 1 shows the precision and recall as the degree grows, with the parameters (α, γ) satisfying the condition πα ≥ 3(1 − π)γ of Thm. 4.2. We see that cleaning helps both precision and recall, particularly in the medium-degree range (the semi-sparse regime). As a reference, we also plot the precision of spectral clustering when it is given the correct number of clusters (K = 4). Above an average degree of 10, spectral clustering gives perfect precision, whereas common neighbors identify a large fraction of the true cluster once the average degree is above 25. On the other hand, for average degrees less than seven, spectral clustering performs poorly, whereas the precision of common neighbors is remarkably higher. Precision is relatively higher than recall over a broad degree range, which explains why common neighbors are a popular choice for link prediction.

Figure 1: (A) Recall and (B) precision versus average degree. When the degree is very small, none of the methods work well. In the medium-degree range (semi-sparse regime), common neighbors gets increasingly better precision and recall, and cleaning helps. With high enough degrees (semi-dense regime), common neighbors alone is sufficient and achieves excellent accuracy.

Table 1: AUC scores for co-authorship networks

    Dataset    n     Mean degree   Time-steps   CN    CN-clean   SPEC   Katz   Random
    HepTH      5969  4             6            .70   .74        .82    .82    .49
    Citeseer   4520  5             11           .88   .89        .89    .95    .52
    NIPS       1222  3.95          9            .63   .69        .68    .78    .47

On a side note, it is not surprising that in a very sparse graph common neighbors cannot identify the whole cluster, since not everyone can be reached in two hops.

Accuracy on real-world data: We used publicly available co-authorship datasets over time, where nodes represent authors and an edge represents a collaboration between two authors. In particular, we used subgraphs of the High Energy Physics (HepTH) co-authorship dataset (6 timesteps), the NIPS dataset (9 timesteps), and the Citeseer dataset (11 timesteps). We obtain the training graph by merging the first T−2 networks, use the (T−1)-th step for cross-validation, and use the last timestep as the test graph. The numbers of nodes and average degrees are reported in Table 1. We merged 1-2 years of papers to create one timestep (so that the median degree of the test graph is at least 1). We compare our algorithm (CN and CN-clean) with the Katz score, which is widely used in link prediction [8], and with spectral clustering of the network. Spectral clustering is carried out on the giant component of the network, and we cross-validate the number of clusters using the held-out graph. Our setup is very similar to the link prediction experiments in the related literature [14].

Since these datasets are unlabeled, we cannot calculate precision or recall as before. Instead, for any score or affinity measure, we perform link prediction experiments as follows. For a randomly picked node, we calculate the score from that node to everyone else, and compute the AUC of this score vector against the edges in the test graph. We report the average AUC over 100 randomly picked nodes. Table 1 shows that even in sparse regimes, common neighbors performs comparably to the benchmark algorithms.
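The per-node AUC evaluation just described can be sketched as follows (ours; we assume a user-supplied scoring function and use sklearn's roc_auc_score):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def mean_node_auc(score_fn, A_test, query_nodes):
        # Average AUC of score_fn(q) against the test-graph edges of q.
        aucs = []
        for q in query_nodes:
            scores = np.asarray(score_fn(q))       # score from q to every other node
            labels = np.asarray(A_test[q])         # 1 if q ~ i in the test graph
            mask = np.arange(len(labels)) != q     # exclude the query node itself
            if 0 < labels[mask].sum() < mask.sum():    # AUC needs both classes present
                aucs.append(roc_auc_score(labels[mask], scores[mask]))
        return float(np.mean(aucs))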

6 Conclusions

Counting common neighbors is a particularly useful heuristic: it is fast and works well empirically. We prove the effectiveness of common neighbors for link prediction as well as local clustering around a query node, under the stochastic blockmodel setting. In particular, we show the existence of a semi-dense regime where common neighbors yields the right cluster w.h.p., and a semi-sparse regime where an additional "cleaning" step is required. Experiments with simulated as well as real-world datasets show the efficacy of our approach, including the importance of the cleaning step.

References

[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230, 2003.

[2] L. Backstrom and J. Leskovec. Supervised random walks: Predicting and recommending links in social networks. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 635–644, New York, NY, USA, 2011. ACM.

[3] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences of the United States of America, 106(50):21068–21073, 2009.

[4] K. Chaudhuri, F. C. Graham, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research - Proceedings Track, 23:35.1–35.23, 2012.

[5] M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301–354, 2007.

[6] P. W. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[7] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39–43, 1953.

[8] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Conference on Information and Knowledge Management. ACM, 2003.

[9] L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A, 390(6):1150–1170, 2011.

[10] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.

[11] S. C. Olhede and P. J. Wolfe. Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences of the United States of America, 111(41):14722–14727, 2014.

[12] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

[13] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Annals of Statistics, 39:1878–1915, 2011.

[14] P. Sarkar and P. J. Bickel. Role of normalization in spectral clustering for stochastic blockmodels. Annals of Statistics (to appear), 2014.

[15] P. Sarkar, D. Chakrabarti, and A. Moore. Theoretical justification of popular link prediction heuristics. In Conference on Learning Theory. ACM, 2010.

[16] P. Sarkar and A. Moore. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In Proc. UAI, 2007.
