# The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels

## Transcript

corresponding eigenspace via an algorithm like k-means. A slight variation of this has been shown to work as long as $(\alpha - \gamma)/\sqrt{\alpha} = \Omega(\log n/\sqrt{n})$ and the average degree grows faster than poly-logarithmic powers of $n$. However, (a) spectral clustering solves a harder problem than Problems 1 and 2, and (b) eigen-decompositions can be expensive, particularly for very large graphs. Our claim is that a simpler operation, counting common neighbors between nodes, can yield results that are almost as good in a broad parameter regime.

**Common neighbors.** Given a node $i$, link prediction via common neighbors follows a simple prescription: predict a link to the node $j$ such that $i$ and $j$ have the maximum number of shared friends $|CN(i,j)|$, where $CN(i,j) = \{u \mid i \sim u \sim j\}$. The usefulness of common neighbors has been observed in practice and justified theoretically for the latent distance model. However, its properties under the stochastic blockmodel remained unknown.

Intuitively, we would expect a pair of nodes $i$ and $j$ from the same cluster to have many common neighbors $u$ from the same cluster, since both the links $i \sim u$ and $u \sim j$ occur with probability $\alpha$, whereas for $c_i \neq c_j$, at least one of the edges $i \sim u$ and $u \sim j$ must have the lower probability $\gamma$:

$$P(u \in CN(i,j) \mid c_i = c_j) = \alpha^2 P(c_u = c_i \mid c_i = c_j) + \gamma^2 P(c_u \neq c_i \mid c_i = c_j) = \pi\alpha^2 + (1-\pi)\gamma^2$$

$$P(u \in CN(i,j) \mid c_i \neq c_j) = \alpha\gamma\, P(c_u = c_i \text{ or } c_u = c_j \mid c_i \neq c_j) + \gamma^2 P(c_u \neq c_i, c_u \neq c_j \mid c_i \neq c_j)$$
$$= 2\pi\alpha\gamma + (1-2\pi)\gamma^2 = P(u \in CN(i,j) \mid c_i = c_j) - \pi(\alpha - \gamma)^2 \leq P(u \in CN(i,j) \mid c_i = c_j)$$

Thus the expected number of common neighbors $E[|CN(i,j)|]$ is higher when $c_i = c_j$. If we can show that the random variable $CN(i,j)$ concentrates around its expectation, node pairs with the most common neighbors would belong to the same cluster. Thus, common neighbors would offer a good solution to Problem 1.
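The two conditional probabilities above, and the gap of $\pi(\alpha-\gamma)^2$ between them, can be checked numerically. The sketch below (our own illustration; the parameter values are not from the paper) encodes both formulas directly:

```python
# Sketch: probability that a third node u is a common neighbor of i and j,
# under an SBM with equal-sized clusters of fraction pi.
# Parameter values below are illustrative, not taken from the paper.

def p_same(pi, alpha, gamma):
    # P(u in CN(i,j) | c_i = c_j): u shares the common cluster w.p. pi
    return pi * alpha**2 + (1 - pi) * gamma**2

def p_diff(pi, alpha, gamma):
    # P(u in CN(i,j) | c_i != c_j): u matches one of the two clusters w.p. 2*pi
    return 2 * pi * alpha * gamma + (1 - 2 * pi) * gamma**2

pi, alpha, gamma = 0.25, 0.10, 0.02
gap = p_same(pi, alpha, gamma) - p_diff(pi, alpha, gamma)
# The gap is exactly pi * (alpha - gamma)^2, so it is positive iff alpha != gamma.
assert abs(gap - pi * (alpha - gamma) ** 2) < 1e-12
```

In words: same-cluster pairs always have at least as many expected common neighbors, with equality only when $\alpha = \gamma$ (no community structure).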
We show conditions under which this is indeed the case. There are three key points regarding our method: (a) handling dependencies between common neighbor counts, (b) defining the graph density regime under which common neighbors is consistent, and (c) proposing a variant of common neighbors which significantly broadens this region of consistency.

**Dependence.** $CN(i,j)$ and $CN(i,j')$ are dependent; hence, distinguishing between within-group and outside-group nodes can be complicated even if each $CN(i,j)$ concentrates around its expectation. We handle this via a careful conditioning step.

**Dense versus sparse graphs.** In general, the parameters $\alpha$ and $\gamma$ can be functions of $n$, and we can try to characterize parameter settings when common neighbors consistently returns nodes from the same cluster as the input node. We show that when the graph is sufficiently "dense" (average degree growing faster than $\sqrt{n \log n}$), common neighbors is powerful enough to answer Problem 2. Also, $(\alpha - \gamma)/\alpha$ can go to zero at a suitable rate. On the other hand, for sparser graphs the expected number of common neighbors between nodes tends to zero, irrespective of whether the nodes are in the same cluster or not. Further, the standard deviation is of a higher order than the expectation, so there is no concentration. In this case, counting common neighbors fails, even for Problem 1.

**A variant with better consistency properties.** However, we show that the addition of an extra post-processing step (henceforth, the "cleaning" step) still enables common neighbors to identify nodes from the query node's own cluster, while reducing the number of off-cluster nodes to zero with probability tending to one as $n \to \infty$. This requires a stronger separation condition between $\alpha$ and $\gamma$. However, such "strong consistency" is only possible when the average degree grows faster than $(n \log n)^{1/3}$. Thus, the cleaning step extends the consistency of common neighbors below the average-degree threshold of order $\sqrt{n \log n}$.

## 4 Main Results

We first split the edge set of the complete graph $K_n$ on $n$ nodes into two sets: $K_1$ and its complement $K_2$ (independent of the given graph $G$). We compute common neighbors on $G_1 = G \cap K_1$ and perform a "cleaning" process on $G_2 = G \cap K_2$. The adjacency matrices of $G_1$ and $G_2$ are denoted by $A_1$ and $A_2$. We will fix a reference node $q$, which without loss of generality belongs to class $C_1$ (recall that there are $K$ clusters $C_1 \ldots C_K$, each of size $n\pi$).

Let $X_i$ ($i \neq q$) denote the number of common neighbors between $q$ and $i$. Algorithm 1 computes on $A_1$ the set $S = \{i : X_i \geq t_n\}$ of nodes that have at least $t_n$ common neighbors with $q$, whereas Algorithm 2 does a further degree thresholding on $A_2$ to refine $S$ into $S_1$.

**Algorithm 1** Common neighbors screening algorithm
1: procedure Scan($A_1, q, t_n$)
2: $\quad$ For $1 \leq i \leq n$, $X_i \leftarrow A_1^2(q, i)$
3: $\quad$ $X_q \leftarrow 0$
4: $\quad$ $S \leftarrow \{i : X_i \geq t_n\}$
5: $\quad$ return $S$

**Algorithm 2** Post selection cleaning algorithm
1: procedure Clean($S, A_2, q, s_n$)
2: $\quad$ $S_1 \leftarrow \{i : \sum_{j \in S} A_2(i,j) \geq s_n\}$
3: $\quad$ return $S_1$

To analyze the algorithms, we must specify conditions on graph densities. Recall that $\alpha$ and $\gamma$ represent within-cluster and across-cluster link probabilities. We assume that $\alpha/\gamma$ is constant while $\alpha, \gamma \to 0$; equivalently, assume that both $\alpha$ and $\gamma$ are some constant times $\rho$, where $\rho \to 0$.

The analysis of graphs has typically been divided into two regimes. The dense regime consists of graphs where the expected degree $n\rho$ grows as a fraction of $n$. In the sparse regime, $n\rho = O(1)$, so degree is roughly constant. Our work explores a finer gradation, which we call semi-dense and semi-sparse, defined next.

**Definition 4.1 (Semi-dense graph).** A sequence of graphs is called semi-dense if $n\rho^2/\log n \to \infty$ as $n \to \infty$.

**Definition 4.2 (Semi-sparse graph).** A sequence of graphs is called semi-sparse if $n\rho^2 \to 0$ and $n^{2/3}\rho/\log n \to \infty$ as $n \to \infty$.
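As a concrete reading of the pseudocode, here is a minimal NumPy sketch of Scan and Clean (the function and variable names are ours; the paper specifies only the pseudocode above). Note that $A_1^2(q,i)$ is simply the $(q,i)$ entry of the squared adjacency matrix, i.e. the number of length-2 paths from $q$ to $i$:

```python
import numpy as np

def scan(A1, q, t_n):
    """Algorithm 1 sketch: threshold common neighbor counts X_i = (A1^2)(q, i)."""
    X = (A1 @ A1)[q].astype(float)  # number of length-2 paths q -> i
    X[q] = 0.0                      # the query node itself is excluded
    return {i for i in range(A1.shape[0]) if X[i] >= t_n}

def clean(S, A2, q, s_n):
    """Algorithm 2 sketch: keep nodes with at least s_n edges (in A2) into S."""
    idx = sorted(S)
    return {i for i in range(A2.shape[0]) if A2[i, idx].sum() >= s_n}
```

For example, on a 5-node graph where nodes 2 and 3 are both linked to 0 and 1, `scan(A1, 0, 2)` returns `{1}`: node 1 is the only node with two common neighbors with the query node 0.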
Our first result is that common neighbors suffice to solve not only the link-prediction problem (Problem 1) but also the local clustering problem (Problem 2) in the semi-dense case. This is because even though nodes both within and outside the query node $q$'s cluster have a growing number of common neighbors with $q$, there is a clear distinction in the expected number of common neighbors between the two classes. Also, since the standard deviation is of a smaller order than the expectation, the random variables concentrate. Thus, we can pick a threshold $t_n$ such that Scan($A_1, q, t_n$) yields just the nodes in the same cluster as $q$ with high probability. Note that the cleaning step (Algorithm 2) is not necessary in this case.

**Theorem 4.1 (Algorithm 1 solves Problem 2 in semi-dense graphs).** Let $t_n = n\left(\pi(\alpha+\gamma)^2/2 + (1-2\pi)\gamma^2\right)$. Let $S$ be the set of nodes returned by Scan($A_1, q, t_n$). Let $n_w$ and $n_o$ denote the number of nodes in $S \cap C_1$ and $S \setminus C_1$ respectively. If the graph is semi-dense, and if
$$\frac{\alpha - \gamma}{\alpha} \geq \frac{2}{\sqrt{\pi}}\left(\frac{\log n}{n\alpha^2}\right)^{1/4},$$
then $P(n_w = n\pi) \to 1$ and $P(n_o = 0) \to 1$.

*Proof Sketch.* We only sketch the proof here, deferring details to the supplementary material. Let $d_{qa} = \sum_{i \in C_a} A_1(q,i)$ be the number of links from the query node $q$ to nodes in

cluster $C_a$. Let $\mathbf{d}_q = \{d_{q1}, \ldots, d_{qK}\}$ and $d_q = \sum_a d_{qa}$. We first show that

$$P(\mathbf{d}_q \in Good) \geq 1 - \frac{2}{n}, \qquad Good \triangleq \left\{ d_{q1} \in n\pi\alpha(1 \pm \psi_n), \ d_{qa} \in n\pi\gamma(1 \pm \psi_n)\ \forall a \neq 1 \right\}, \tag{1}$$

$$\psi_n = \sqrt{6\log n/(n\pi\gamma)} = \Theta\left(\sqrt{\log n/(n\rho)}\right) \to 0. \tag{2}$$

Conditioned on $\mathbf{d}_q$, $X_i$ is the sum of $K$ independent Binomial random variables $B_1, \ldots, B_K$, representing the number of common neighbors between $i$ and $q$ via nodes in each of the $K$ clusters: $E[X_i \mid \mathbf{d}_q, i \in C_a] = d_{qa}\alpha + (d_q - d_{qa})\gamma$. We have:

$$E[X_i \mid \mathbf{d}_q \in Good, i \in C_1] \geq n\left(\pi\alpha^2 + (1-\pi)\gamma^2\right)(1 - \psi_n) \triangleq \eta_1(1 - \psi_n) \triangleq \ell_n,$$
$$E[X_i \mid \mathbf{d}_q \in Good, i \in C_a, a \neq 1] \leq n\left(2\pi\alpha\gamma + (1-2\pi)\gamma^2\right)(1 + \psi_n) \triangleq \eta_a(1 + \psi_n) \triangleq u_n.$$

Note that $t_n = (\eta_1 + \eta_a)/2$, that $u_n \leq t_n \leq \ell_n$ for large enough $n$, and that $\ell_n - u_n = n\pi(\alpha - \gamma)^2 - \psi_n(\eta_1 + \eta_a) \geq 4\sqrt{n\alpha^2 \log n}\,(1 - o(1))$, where we applied the condition on $(\alpha - \gamma)/\alpha$ noted in the theorem statement. We show:

$$P(X_i \leq t_n \mid \mathbf{d}_q \in Good, i \in C_1) \leq n^{-4/3 + o(1)}$$
$$P(X_i \geq t_n \mid \mathbf{d}_q \in Good, i \in C_a, a \neq 1) \leq n^{-4/3 + o(1)}$$

Conditioned on $\mathbf{d}_q$, both $n_w$ and $n_o$ are sums of conditionally independent and identically distributed Bernoullis. Hence,

$$P(n_w = n\pi) \geq P(\mathbf{d}_q \in Good) \cdot P(n_w = n\pi \mid \mathbf{d}_q \in Good) \geq \left(1 - \frac{2}{n}\right)\left(1 - n\pi \cdot n^{-4/3 + o(1)}\right) \to 1$$
$$P(n_o = 0) \geq P(\mathbf{d}_q \in Good) \cdot P(n_o = 0 \mid \mathbf{d}_q \in Good) \geq 1 - \Theta(n^{-1/3 + o(1)}) \to 1$$

There are two major differences between the semi-sparse and semi-dense cases. First, in the semi-sparse case, both expectations $\eta_1$ and $\eta_a$ are of the order $O(n\rho^2)$, which tends to zero. Second, the standard deviations of the number of common neighbors are of a larger order than the expectations. Together, this means that the numbers of common neighbors to within-cluster and outside-cluster nodes can no longer be separated; hence, Algorithm 1 by itself cannot work. After cleaning, however, the entire cluster of the query node $q$ can still be recovered.
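As a quick sanity check on the semi-dense analysis, the simulation sketch below (our own, with illustrative parameters, not the paper's experiment) generates a well-separated 4-block SBM, applies the threshold $t_n = n(\pi(\alpha+\gamma)^2/2 + (1-2\pi)\gamma^2)$ to the common neighbor counts of a query node, and checks that the recovered set essentially matches the query's cluster:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative well-separated SBM: 4 equal blocks, alpha = 0.30, gamma = 0.05.
n, K, alpha, gamma = 2000, 4, 0.30, 0.05
pi = 1.0 / K
z = np.repeat(np.arange(K), n // K)                # cluster labels
P = np.where(np.equal.outer(z, z), alpha, gamma)   # edge probabilities
A = np.triu((rng.random((n, n)) < P).astype(int), 1)
A = A + A.T                                        # symmetric, no self-loops

q = 0
X = (A @ A)[q]                                     # common neighbor counts with q
X[q] = 0
t_n = n * (pi * (alpha + gamma) ** 2 / 2 + (1 - 2 * pi) * gamma ** 2)
S = set(np.flatnonzero(X >= t_n)) - {q}
C1 = set(np.flatnonzero(z == z[q])) - {q}

precision = len(S & C1) / len(S)
recall = len(S & C1) / len(C1)
```

With these parameters the within-cluster counts concentrate near $\eta_1 \approx 48.75$ and the off-cluster counts near $\eta_a \approx 17.5$, both well away from $t_n \approx 33$, so precision and recall both come out close to 1.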
**Theorem 4.2 (Algorithm 1 followed by Algorithm 2 solves Problem 2 in semi-sparse graphs).** Let $t_n = 1$ and $s_n = n^2\left(\pi\alpha^2 + (1-\pi)\gamma^2\right)(\alpha + \gamma)/2$. Let $S = \text{Scan}(A_1, q, t_n)$ and $S_1 = \text{Clean}(S, A_2, q, s_n)$. Let $n_w^{(c)}$ and $n_o^{(c)}$ denote the number of nodes in $S_1 \cap C_1$ and $S_1 \setminus C_1$ respectively. If the graph is semi-sparse and $\pi\alpha \geq 3(1-\pi)\gamma$, then $P(n_w^{(c)} = n\pi) \to 1$ and $P(n_o^{(c)} = 0) \to 1$.

*Proof Sketch.* We only sketch the proof here, with details deferred to the supplementary material. The degree bounds of Eq. 1 and the bounds on $E[X_i \mid \mathbf{d}_q \in Good]$ hold even in the semi-sparse case. We can also bound the variances of the $X_i$ (which are sums of conditionally independent Bernoullis):

$$\text{var}[X_i \mid \mathbf{d}_q \in Good, i \in C_1] \leq E[X_i \mid \mathbf{d}_q \in Good, i \in C_1]$$

Since the expected number of common neighbors vanishes and the standard deviation is of a larger order than the expectation, there is no hope for concentration; however, there are slight differences in the probability of having at least one common neighbor. First, by an application of the Paley–Zygmund inequality, we find:

$$p_1 \triangleq P(X_i \geq 1 \mid \mathbf{d}_q \in Good, i \in C_1) \geq \frac{E[X_i \mid \mathbf{d}_q \in Good, i \in C_1]^2}{\text{var}(X_i \mid \mathbf{d}_q \in Good, i \in C_1) + E[X_i \mid \mathbf{d}_q \in Good, i \in C_1]^2} \geq \frac{\eta_1^2}{\eta_1 + \eta_1^2} \geq (1 - \psi_n)(1 - \eta_1)\eta_1,$$

where the last step uses $\eta_1 \to 0$.
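The Paley–Zygmund step can be illustrated numerically. The sketch below (our own illustration, using a Binomial stand-in for the conditional common neighbor count) checks that $E[X]^2/(\mathrm{var}(X) + E[X]^2)$ lower-bounds $P(X \geq 1)$ while Markov's inequality upper-bounds it by $E[X]$, which is exactly the two-sided control used for $p_1$ and $p_a$:

```python
# X ~ Binomial(n, p) with vanishing mean, mimicking the semi-sparse regime
# where E[X] -> 0 but P(X >= 1) is still informative.
n, p = 10_000, 5e-5
mean = n * p                            # E[X] = 0.5
var = n * p * (1 - p)                   # Binomial variance
pz_lower = mean**2 / (var + mean**2)    # Paley-Zygmund: P(X >= 1) >= this
exact = 1 - (1 - p) ** n                # P(X >= 1) in closed form
markov_upper = mean                     # Markov: P(X >= 1) <= E[X]
assert pz_lower <= exact <= markov_upper
```

Here the lower bound is about 0.33 and Markov's bound is 0.5, bracketing the exact value of roughly 0.39: even when the count itself does not concentrate, the probability of at least one common neighbor is pinned down to within a constant factor.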

For $a \neq 1$, Markov's inequality gives:

$$p_a \triangleq P(X_i \geq 1 \mid \mathbf{d}_q \in Good, i \in C_a, a \neq 1) \leq E[X_i \mid \mathbf{d}_q \in Good, i \in C_a, a \neq 1] \leq (1 + \psi_n)\eta_a$$

Even though $p_a = \Theta(n\rho^2) \to 0$, we have $n\pi p_a \to \infty$, so we can use concentration inequalities like the Chernoff bound again to bound $n_w$ and $n_o$:

$$P\left(n_w \geq n\pi p_1\left(1 - \sqrt{6\log n/(n\pi p_1)}\right)\right) \geq 1 - n^{-3/4}$$
$$P\left(n_o \leq n(1-\pi)p_a\left(1 + \sqrt{6\log n/(n(1-\pi)p_a)}\right)\right) \geq 1 - n^{-3/4}$$

Unlike the denser regime, $n_w$ and $n_o$ can be of the same order here. Hence, the candidate set $S$ returned by thresholding the common neighbor counts has a non-vanishing fraction of nodes from outside $q$'s community. However, this fraction is relatively small, which is what we exploit in the cleaning step.

Let $\theta_w$ and $\theta_o$ denote the expected number of edges in $A_2$ from a within-cluster and an outside-cluster node to $S$, respectively. The separation condition in the theorem statement gives $\theta_w - \theta_o \geq 4\sqrt{\theta_w \log n}$. Setting the degree threshold $s_n = (\theta_w + \theta_o)/2$, we bound the probability of mistakes in the cleaning step:

$$P\left(\exists\, i \in C_1 \text{ s.t. } \textstyle\sum_{j \in S} A_2(i,j) \leq s_n \,\middle|\, \mathbf{d}_q \in Good\right) \leq n^{-1/3 + o(1)}$$
$$P\left(\exists\, i \notin C_1 \text{ s.t. } \textstyle\sum_{j \in S} A_2(i,j) \geq s_n \,\middle|\, \mathbf{d}_q \in Good\right) \leq n^{-1/3 + o(1)}$$

Removing the conditioning on $\mathbf{d}_q \in Good$ (as in Theorem 4.1) yields the desired result.

## 5 Experiments

We present our experimental results in two parts. First, we use simulations to support our theoretical claims. Next, we present link prediction accuracies on real-world collaborative networks to show that common neighbors indeed performs close to gold-standard algorithms like spectral clustering and the Katz score.

**Implementation details:** Recall that our algorithms are based on thresholding. When there is a large gap between the common neighbor counts of nodes in $q$'s cluster and those outside (e.g., in the semi-dense regime), thresholding is equivalent to running the k-means algorithm with $k = 2$ to find $S$ in Algorithm 1. The same holds for finding $S_1$ in Algorithm 2.
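The k-means-with-$k = 2$ thresholding mentioned above amounts to a one-dimensional Lloyd iteration on the common neighbor counts; a minimal sketch (our own, on a toy input) is:

```python
# Sketch (assumption: a simple 1-D 2-means stands in for the k-means
# thresholding of common neighbor counts described in the text).
def two_means_split(xs, iters=50):
    """Split 1-D values into low/high groups via Lloyd's algorithm with k=2;
    return the decision threshold (midpoint of the two cluster means)."""
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [x for x in xs if x <= mid]
        high = [x for x in xs if x > mid]
        if not low or not high:
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return (lo + hi) / 2

counts = [0, 1, 1, 2, 9, 10, 11]       # toy common neighbor counts
t = two_means_split(counts)
assert all(x > t for x in [9, 10, 11]) and all(x < t for x in [0, 1, 1, 2])
```

On well-separated counts the threshold lands in the gap between the two groups, so no hand-tuned $t_n$ is needed in practice.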
When the number of nodes with more than two common neighbors is less than ten, we define the set $S$ by finding all nodes with at least one common neighbor (as in the semi-sparse regime). On the other hand, since the cleaning step works only when $S$ is sufficiently large (so that degrees concentrate), we do not perform any cleaning when $|S| < 30$. While we used the split sample graph $A_2$ in the cleaning step for ease of analysis, we did the cleaning using the same network in the experiments.

**Experimental setup for simulations:** We use a stochastic blockmodel of 2000 nodes split into 4 equal-sized clusters. For each value of $(\alpha, \gamma)$ we pick 50 query nodes at random, and calculate the precision and recall of the result against nodes from the query node's cluster (for any subset $S$ and true cluster $C$, precision $= |S \cap C|/|S|$ and recall $= |S \cap C|/|C|$). We report mean precision and recall over 50 randomly generated graph instances.

**Accuracy on simulated data:** Figure 1 shows the precision and recall as degree grows, with the parameters $(\alpha, \gamma)$ satisfying the condition $\pi\alpha \geq 3(1-\pi)\gamma$ of Thm. 4.2. We see that cleaning helps both precision and recall, particularly in the medium-degree range (the semi-sparse regime). As a reference, we also plot the precision of spectral clustering when it is given the correct number of clusters ($K = 4$). Above an average degree of 10, spectral clustering gives perfect precision, whereas common neighbors can identify a large fraction of the true cluster once the average degree is above 25. On the other hand, for average degree less than seven, spectral clustering performs poorly, whereas the precision of common neighbors is remarkably higher. Precision is relatively higher than recall over a broad degree regime, which explains why common neighbors is a popular choice for link prediction. On a side

**Figure 1:** (A, B) Recall and precision versus average degree. When degree is very small, none of the methods work well. In the medium-degree range (semi-sparse regime), common neighbors gets increasingly better precision and recall, and cleaning helps. With high enough degrees (semi-dense regime), common neighbors alone is sufficient and gets excellent accuracy.

**Table 1:** AUC scores for co-authorship networks

| Dataset | $n$ | Mean degree | Time-steps | CN | CN-clean | SPEC | Katz | Random |
|---|---|---|---|---|---|---|---|---|
| HepTH | 5969 | 4 | 6 | .70 | .74 | .82 | .82 | .49 |
| Citeseer | 4520 | 5 | 11 | .88 | .89 | .89 | .95 | .52 |
| NIPS | 1222 | 3.95 | 9 | .63 | .69 | .68 | .78 | .47 |

note, it is not surprising that in a very sparse graph common neighbors cannot identify the whole cluster, since not everyone can be reached in two hops.

**Accuracy on real-world data:** We used publicly available co-authorship datasets over time, where nodes represent authors and an edge represents a collaboration between two authors. In particular, we used subgraphs of the High Energy Physics (HepTH) co-authorship dataset (6 timesteps), the NIPS dataset (9 timesteps) and the Citeseer dataset (11 timesteps). We obtain the training graph by merging the first $T-2$ networks, use the $(T-1)$th step for cross-validation, and use the last timestep as the test graph. The number of nodes and average degrees are reported in Table 1. We merged 1-2 years of papers to create one timestep (so that the median degree of the test graph is at least 1). We compare our algorithm (CN and CN-clean) with the Katz score, which is widely used in link prediction, and with spectral clustering of the network. Spectral clustering is carried out on the giant component of the network, and we cross-validate the number of clusters using the held-out graph. Our setup is very similar to link prediction experiments in the related literature. Since these datasets are unlabeled, we cannot calculate precision or recall as before.
Instead, for any score or affinity measure, we perform link prediction experiments as follows. For a randomly picked node we calculate the score from that node to everyone else, and compute the AUC of this score vector against the edges in the test graph. We report the average AUC over 100 randomly picked nodes. Table 1 shows that even in sparse regimes, common neighbors performs comparably to the benchmark algorithms.

## 6 Conclusions

Counting common neighbors is a particularly useful heuristic: it is fast and works well empirically. We prove the effectiveness of common neighbors for link prediction as well as local clustering around a query node under the stochastic blockmodel. In particular, we show the existence of a semi-dense regime where common neighbors yields the right cluster w.h.p., and a semi-sparse regime where an additional "cleaning" step is required. Experiments with simulated as well as real-world datasets show the efficacy of our approach, including the importance of the cleaning step.
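The per-node AUC evaluation described in the experiments can be sketched as follows (a comparison-counting formulation equivalent to the Mann–Whitney statistic; the toy scores and node IDs are illustrative, not from the datasets):

```python
def node_auc(scores, test_neighbors):
    """AUC of a score vector for one query node: the probability that a
    random test-graph neighbor (positive) outscores a random non-neighbor
    (negative), counting ties as 1/2."""
    pos = [s for i, s in scores.items() if i in test_neighbors]
    neg = [s for i, s in scores.items() if i not in test_neighbors]
    wins = sum(1.0 * (sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Toy example: nodes 1 and 3 are neighbors of the query in the test graph.
scores = {1: 3.0, 2: 2.0, 3: 2.0, 4: 0.5}
assert node_auc(scores, test_neighbors={1, 3}) == 0.875
```

Averaging `node_auc` over many random query nodes gives the kind of AUC reported in Table 1; a random scorer averages 0.5.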

## References

- L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230, 2003.
- L. Backstrom and J. Leskovec. Supervised random walks: Predicting and recommending links in social networks. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 635–644, New York, NY, USA, 2011. ACM.
- P. J. Bickel and A. Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences of the United States of America, 106(50):21068–21073, 2009.
- K. Chaudhuri, F. C. Graham, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research, Proceedings Track, 23:35.1–35.23, 2012.
- M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301–354, 2007.
- P. W. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
- L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39–43, 1953.
- D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Conference on Information and Knowledge Management. ACM, 2003.
- L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A, 390(6):1150–1170, 2011.
- F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.
- S. C. Olhede and P. J. Wolfe. Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences of the United States of America, 111(41):14722–14727, 2014.
- A. E. Raftery, M. S. Handcock, and P. D. Hoff. Latent space approaches to social network analysis. Journal of the American Statistical Association, 15:460, 2002.
- K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Annals of Statistics, 39:1878–1915, 2011.
- P. Sarkar and P. J. Bickel. Role of normalization in spectral clustering for stochastic blockmodels. To appear in the Annals of Statistics, 2014.
- P. Sarkar, D. Chakrabarti, and A. Moore. Theoretical justification of popular link prediction heuristics. In Conference on Learning Theory. ACM, 2010.
- P. Sarkar and A. Moore. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In Proc. UAI, 2007.
