ijcai01 linkanalysis

Transcript

1 Link and Stability vectors Eigen Analysis, I. Jordan Michael Alice Zheng X. Andr ew Y. Ng Div. & of CS Statistics Dept. Science Division Computer Computer Science Division y ele Berk U.C. y U.C. Berk ele ele U.C. y Berk Berk y, CA ele 94720 94720 ele Berk y, CA 94720 y, CA ele Berk Abstract authoritati ve or influential, then surely the ad- article is truly should not dition of a few links e us mak or a few citations The algorithms PageRank the and HITS eigen- are or sites these about minds our change been having articles identifying vector methods or ve” “authoritati for the in ver, even Moreo influential. very link ed a fix of xt conte in- citation or hyperlink given articles, “influential” unreliable a dynamic, structure, infrastructure such as the web give con- formation. That such algorithms should of ws vie ferent dif give us occa- may dif on structure the ferent this in and a desideratum, is surely sistent answers should insensiti ve Ideally sions. be algorithm analysis , a link address be the when paper , we of the question y can perturbations. to such - per small under rankings give stable to expected matrix techniques use , we paper this In perturba- from patterns. Using tools turbations to the hyperlink ov chain Mark coupled and theory tion characterize to theory from matrix perturbation theory and Mark ov chain the of stability the HITS PageRank. and by assigned ranks these which under conditions vide pro , we theory HITS ving impro of ways Some briefly the stability of also are stable, of examples give specific and methods are are changes algorithmic these metioned; studied in more de- these conditions are violated. when We instability ] [ Ng in tail . et al. , 2001 briefly describe that HITS to a modification also stability ves impro . its Example An 2 begin us Let with a Cor The example. empirical an oduction Intr 1 ] [ containing al. , 2000 database is a collection McCallum et have seen years Recent for algorithms in interest wing gro AI. the citation information from several thousand papers in web- identifying “authoritati ve” or “influential” articles from HITS algorithms PageRank on the We ran subset and the page - par In data. citation other structures from or hyperlink database Learning all of consisting Machine a Cor the of its g [1998] and Google’ s algorithm HITS , the ticular Kleinber of papers. two algorithms, the of To evaluate the stability we [ ] Brin Page, algorithm PageRank 1998 have attracted the and five perturbed of a set constructed which also in databases [ ] 1996 attention y researchers for (see also of Osareh, man were from 30% of the papers the base set randomly deleted. literature). of Both earlier developments in the bibliometrics a Cor wl, what if, cra a web via database its obtained (“Since “au- assign to calculations vector eigen use algorithms these mishap, only instead of it had ved or 70% chance by retrie in thority” weights to articles, and while originally designed ve, If papers?”) these might we authoritati truly is a paper on xt conte of link analysis the the web, both algorithms can identify to possible be it would that hope it as such with only to be readily applied citation patterns in academic papers and a subset the set. base of other graphs. citation follo the in wn sho are HITS from table. wing results The a link evaluation the to aspects several analy- are There of In column from rank the reports first the table, this HITS relates algorithm such as HITS or PageRank. One aspect sis of five papers, Learning Machine whereas set full the on the the to “authoritati notion specific veness” embodied of by an rightmost report the ranks in runs on the perturbed columns of algorithm. Thus specific users may have an understanding databases. dif the across variation substantial We see ferent a what constitutes an authoritati ve web page or document in runs: HITS or PageRank can be domain, given the output of and ”, Goldber 1 1 3 1 g 1 optimization... search, in algorithms 1 “Genetic natural 3 3 5 2 2 Holland systems”, artificial and in “Adaptation 2 evaluated by such users. While useful, analyses often such programming: “Genetic 3 3 6 6 12 3 ”, Koza of... programming the On subjecti ve flavor. A more objecti ve criterion— have a rather 23 20 52 4 Jong ”, De genetic... of a class of vior beha the “Analysis 4 of 4 99 5 “Uniform crosso ver in genetic algorithms”, Syswerda 5 171 5 119 the of a link of stability focus —concerns paper current the the 56 135 6 ”, Fogel simulated... through intelligence “Artificial 6 8 40 Does results similar return algorithm an algorithm. analysis evolution ey of 100 159 179 10 Back+al “A surv gies”, strate 7 7 the a small perturbation of upon link structure or docu- the 8 “Optimization of control parameters for genetic... ”, Grefenstette 8 316 141 170 6 pressure”, 9 GENIT OR algorithm and selection “The Whitle y 9 257 107 72 9 a desirable a of as w stability We vie collection? ment feature wicz 10 algorithms + Data Structures = ...”, Michale “Genetic 13 170 80 69 18 analysis link particular the beyond ve and abo algorithm, no- vey...”, Koza 11 programming II: Automatic disco “Genetic 7 - - - 10 algorithm the that veness authoritati of If tion an embodies. ...”, Rumelhart+al 2060 internal representations by error “Learning - 1 2 2 -

2 2061 to by the method of temporal... ”, Sutton - 9 4 5 - “Learning ” means page links to page ) to obtain the predict (where “ )(*   ,+ + - - studies 2063 in machine learning 10 using check ers”, 10 Samuel - “Some 21 1 ! ! .-&/0 .-&/&0 ed-points fix (with and   2065 - - 8 - - ”, Barto+Sutton ficult... e dif solv can that e elements “Neuronlik   renormalized to unit length). The abo the ve equations vectors Tesauro 2066 issues in TD learning”, “Practical - - 9 9 - 4 2071 “Pattern classification and - 7 7 scene - Duda+Hart analysis”, can also be written: 5 Breiman+al - 2 2075 4 - “Classification and regression trees”, 543 '4 2117 7 - databases”, 8 - Murphy+Aha learning of repository - “UCI machine !3 ! !    features selection... 2174 subset the - - vant - “Irrele 8 - ”, John+al and   ' 4 algorithm”, Clark+Niblett - 6 - - - 2184 “The CN2 induction !3 !$ !6    2222 “Probabilistic systems”, in intelligent Pearl - 10 - - - reasoning   When ones iterations are initialized with the vector of the is intrinsic Although it might be thought that this variability 78 8<; 4 696:6 , this the obtaining of method prin- power is the   sho is not the case, as to problem, this wn by the results the [ ] vector of a matrix cipal Golub , eigen 1996 Loan, Van and the stable: more much were which algorithm, from PageRank + =+ so conditions) mild (under and principal are and the Optimization Search, 1 1 1 g ”, Goldber and... 1 1 in Algorithms “Genetic 1 '   4 4  2 2 2 2 2 “Learning internal representations by ...”, Rumelhart+al 2 error respecti and vely . The “authori- of vectors eigen “Adaptation 3 Natural 5 4 6 5 3 Holland Systems”, Artificial and in ,+  is then ewise lik , and for be to tati veness” of page en tak  gression 3 4 Breiman+al Trees”, 4 “Classification Re and 5 5 4 + Pearl 6 5 Systems”, 5 “Probabilistic Reasoning in Intelligent 3 6 3 and hubs . Programming On of Programming: “Genetic 6 6 ...”, Koza 4 4 3 6 the  Temporal 7 7 7 ...”, Sutton 7 “Learning 7 7 of Methods the by Predict to 3.2 algorithm PageRank analysis”, Duda+Hart 8 8 8 8 9 8 classification “Pattern and scene  8 10 ”, Dempster+al via... data incomplete 11 9 9 9 “Maximum lik elihood from and y matrix adjacenc the (de- pages web a set of Given 10 9 9 databases”, learning machine of repository 11 10 10 “UCI Murphy+Aha [ ] and Brin pre PageRank viously), con- first fined Page, 1998 - - - - Rumelhart+McClelland 10 uted Distrib “Parallel 11 Processing”, “Introduction to Neural the Theory of 12 Computation”, Hertz+al - 10 - - - structs a probability transition matrix by renormalizing > 8  in are 6. It detail more in discussed results These Section web a random imagines then . One to sum to each row of be should outset, the howe ver, that our conclusion at stated surfer who time step is at some web page, and decides at each that HITS is unstable while PageRank is not. The is- is not probabil- page which to visit on the next step as follo ws: with [email protected]?BA considerations sue is more subtle than that, involving as such the hyperlinks - the , she randomly on picks cur one of ity A the between multiple eigen vectors and invariant relationships jumps to the and page, rent to; with probability it links , page subspaces. We do wish to suggest, howe ver, that stability is she pick by jumping to a web page ed uniformly and “resets” A 3 We now issue a brief to turn attention. an needs that indeed , typi- is a parameter collection. random from at the Here, description PageRank, and follo wed by HITS our analysis. of 0.1-0.2. on ov chain cally set to defines This process a Mark 8F?GA A  , where the web pages, with transition matrix C transition matrix of uniform transition probabilities is the Ov erview of HITS and 3 PageRank 8 I C H  is all ( ). The vector of PageRank scores for  J of web pages or academic papers linking Given a collection be Mark ution distrib stationary the of to defined then ov this and PageRank algorithms each to/citing each other , the HITS J tran- the of vector eigen principal is the , valently Equi chain. 8?EA AK (see, Van and Golub e.g. matrix sition determines and authorities by computing the principal eigen- Loan, ution distrib stationary the definition by since 1996), 1 vector of the matrix. satisfies 8F?LA AK  J J  ] [ that HITS algorithm g, Kleinber 1998 The posits an article  by y pages man to ed if it is link weight “authority” high has page , is then chance of visiting The is, , that asymptotic  J and with high “hub” weight, that a page has high hub weight .  the be to en “quality” authoritati veness of page or tak man if it links to y authoritati ve pages. More precisely , given , retrie web a search pages (say to response in ved a set of Analysis of Algorithms 4 y -by- adjacenc forms query), the HITS algorithm first the  a simple with addi- how a small wing We begin sho example is 1 if page page to links , -element , whose matrix    2 can pages web of a collection change to tion ge a lar in result equations: wing follo the iterates It then and 0 otherwise. vectors eigen the have a collec- we Suppose to returned.   !$ ! #" %     tion to linking pages web contains web of that pages 100 &'      another , and pages web 103 http://www.algore.com 1 3 is typically described HITS as running on It is worth that noting of various to treat the case ways pages with no out- There are to a query), a small collection of articles (say retrie ved in response simple links (leaf nodes). In this paper we utilize a particularly entire web the of terms in is described PageRank while algo- . Either web the a page, next the picks surfer approach—upon reaching such distinction this be plays run in can either setting, howe rithm ver, and has all zero uniformly at random. page means that if a row of This N no our analysis. in role O of then the corresponding row entries, is constructed to have all 2 [ Page in described PRQ:S . The PageRank algorithm to equal entries regarding Kleinber several other heuristics g [1998] is- discusses ] a dif utilizes a at ving arri upon ution distrib reset et al. , 1998 ferent intra-domain ignored which sec- references, are as this such sues in of leaf node. It is possible to sho w, howe ver, that every instantiation tion experiments). See also used (but are simplicity in our for Bharat to the of valent is equi algorithm instantiation the of an our variant to vements other for It should HITS. [1998] Henzinger and impro the of value original algorithm on the same graph with a dif ferent none of noted that these fundamentally change the be of the spirit . probability reset underlying eigen vector calculations HITS.

3 Bush and Gore Bush Bush 1 1 Principal Principal Eigenvector Eigenvector 0 0 Gore Gore (a) (b) 1 0 1 0 (b) (a) dif with two matrices of Contours 2: Figure eigengaps. ferent graph. hyperlink of scatterplot Jittered 1: Figure Gaussians miliar with of think also plots variate multi of can perturbations these as the contours of a Gaussian with small ad- The linking to http://www.georgewbush.com . matrix.) (in imposed on the verse) covariance  y matrix jacenc two columns the for except zeros all has In quantities. perturbed denote to a tilde use we sequel, the these princi- the corresponding to therefore two web pages, instance, (For Y .) g Y denotes a perturbed version of We now ,+ for values pal eigen vector al- will have non-zero only [ positi first, give our so ve result, long as that eigengap the is presents 1(a) . Figure gore.com and georgewbush.com 6 ve to lar perturbations. small is insensiti HITS then ge, a jittered of links to along these two web pages, scatterplot   4 ,+  be given. Let be the princi- Let 1. Theor em Y with por non-zero the the first two eigen vectors. (Only - eig the max- the . Assume of engap eig envector and pal Y [ eigen the of tions No wn.) sho are vectors w, suppose five web gree out-de of every imum page is bounded by h . For web trickle into our collection, to happen new pages which i_j by perturb , suppose the web/citation graph we any . Fig- georgewbush.com link to both algore.com and ^ deleting page, wher adding one from links e most at or k ure vec- that see and plot, new eigen the the ws we sho 1(b) D p DqpG? I Z a  n n T kGlmon h h ir[ ir the e , wher . Then eigen tors vector dramatically , with the principal have changed ,+ TVUXW Thus, to line. perturbation small vely a relati the now near Y g of envector eig the perturbed matrix perturbed principal g 4 collection caused vectors. eigen to the change ge a has our lar satisfies: s&s s&s phenomenon to be is perv If this it needs ve, then asi addressed ? + + i (2) Z]t by uses eigen vectors to determine author - any algorithm that g the next two sections, we give characterizations of ity. In is big, eigengap if the So, HITS will be insensiti ve to small from whether when algorithms can be expected to suf fer and ved ec- This perturbations. result is pro dir by sho wing i) the problems. these tion of the principal eigen vector does not change too much, eigen gnitudes ii) and of the rele vant the values do not ma HITS 4.1 Analysis of change the second eigen vector does not “over- too so much,   4 w s&svus&s  Y to - deter HITS uses the principal eigen vector of the eigen tak new principal e” the first and become vector . this section, we sho w authorities. mine In that the stability 7 Frobenius The- denote We apply the norm. . Let Pr oof small under vector eigen perturbations this is determined of [ matrix V.2.8 and from wart orem perturbation theory Sun, Ste Y dif the be to defined is , which ference by eig engap of the ] with matrix is a symmetric : Suppose prin- 1990 YyxLz|{~}={ the values. the and gest between eigen lar gest lar second + ,+ eigengap , and  vector eigen and . [€j value eigen cipal ^ on the importance light shed may that Here example is an  Y to a symmetric perturbation wing be . Then the follo Let eigengap. the of with associated contours the 2 plots Figure + + eigenpair hold inequalities for and principal old the  ‚ before and two matrices after and lines) solid (with Y [email protected]  g new eigenpair ( some . ƒ w s&s ss g (with additi dashed have been lines) the ve perturbation same ss s&s 5 s&s ss w of the matrices are indicated them. made to The eigen values ?  T + (3) Z t ? the The directions ellipses. of the the principal of by axes ma- g w s s ss s&s a n  [ in sho wn Figure 2a has eigengap , and a small trix [ Y  ]\_^ ? + a vec- in results ellipse) the hence perturbation eigen Y (and to n (4)  g   t  Y T U`W tors away from the eigen vectors; the matrix original Z positi the (assuming denominator Let ve). that is (3) in ba perturbed , and 2b Figure in wn sho the eigengap has [:Z + +   by represented to eigenspace complementary the be same eigen original vectors the as eigen vectors. are nearly the … M… „ all columns , i.e. its and is orthonormal, contain Z Z Z of the eigengap example, how, in see we So, the this size ,+ and contains except the ; is diagonal of vectors eigen „†Z Y fa- (Readers vectors. eigen the of stability the affects directly which the values, all at least are corresponding of eigen less [ 4 about the number 5 here; There is nothing a smaller special 6 to directly apply also analyses Our hub-weight calculations, lar ge swings number of the relati results eigen vectors. in also vely N N link directions and interchanging reversing simply by and . d and to vector eigen principal the 4 causes 3, 2, 1, 5 with Replacing 7 by Frobenius The norm is defined degrees, 63, lie at 73, 58 . vely respecti and 55  ”v•”M–‚—$• ŠŒ‹Ž ‡H‡ ˆ‰‡H‡ ‘’“ˆ 5 f [email protected],fc . . More , these are contours precisely form the quadratic of f

4 +  ; and than (4) . A bound similar to Equation …€Z:„†Z Y˜…BZ  a h : for holds ™ „šZ w w s&s s&s s&s s&s ? a n g (5) „šZ  „šZ t . . . . . . g „ g  the . Using lar gest eigen value Let of be Corollary Z Z 20 Nodes 20 Nodes Ste Equa- and one wart w that Sun from IV.3.6 sho can [1990], Figure a web of Picture community . 3: implies (5) tion w s&s ss the connected formally (more the with component component D 9 a n g (6) ~Z  Z  t lar eigen value). gest ss w s&s wn in Figure web/citation-graph which the Consider sho 3, I a a n  l›[ turn If to- in and (4) Equations , then (6) Solid a small to be imagine we subset of a much lar ger graph. g g g ensure will principal the , i.e. is gether that    Z œj hyperlinks; of set ar- dashed the arro ws denote the original g The row represents the link principal original we add. will eigenpair . of Y g components two connected the of each for wn eigen sho value or adding we Since one only are page, from links deleting ss s&s w  ža is addition of link, it is easy to the ; with a single   ^ one row of Y , so that g let  denote the perturbation to D D   4 4 w w s&s s&s s&s ss °a g U  . Suppose commu- that verify that this jumps to the       w ard to sho . It is straightforw t   4 4 is part wn of a lar ger web/citation graph with multiple sho nity  n thus . We can bound the and h k  k  t subcommunities, biggest subcommu- and that originally the norm of : the perturbation to Y w w s&s ss s&s ss a a eigen value nity had l± adding one l link, U . By the ^  D ?  a Figure in wn sho graph , subcommunity biggest the 3 becomes n  k Y Y g (7) h k t the and principal eigen vector now has ve values only for positi Using Equations (3) and (7) to determine when we may guar - else figure, and this zeros in wn sho nodes where. kžl bound (2) to hold, Equation arri ve at the antee we D I p DŸp ? w ss s&s Z  a PageRank 4.2 of Analysis n n i` ir[ can easily , where T h . One h n I a a s authority PageRank’ of sensiti the analyze now We vity n ensures that the same bound on verify also l¡[ k  the of web/citation-graph. perturbations to J scores denominator also posi- is (3) in the guarantees that (which ,+  stated. as viously pre tive), hence g g given, principal the Let 3. em Theor be be let and > J AK if the that result, this of verse con give the xt we Ne eigengap 696:6 then perturbations. ve to sensiti be can vectors eigen is small, e- g be > the  MoZ` corr $³ , and way any in ed chang be  w) PageRank new the Then matrix. transition (ne sponding matrix is a symmetric engap eig with em 2. Suppose Theor Y 8 J satisfies: es scor g to ther a ¢B[r . Then [ e exists Y that causes a perturbation 8 ss s&s ³ £] e in eig the principal ) chang envector . ge ( lar ¶ a† ́ Mμ J ?  4 A J J (8) t   Y Y . oof Pr be Since diagonalized: , it can g A  is not that so ws Thus, assuming sho 0, too close to this ^  ^ C¥¤ C 4   Z Y not pages did as have high long the perturbed/modified web ^ ^ to with measured (as scores PageRank overall the un- respect ^ ^ ¦†§ C J PageRank perturbed the ), then perturbed scores PageRank columns are ’s eigen- the whose and is orthogonal, where Y C not original. be will from the far scores J ss s&s g  the of We pick denote column -th . vectors. Let  ̈ We construct . oof Pr ov chain Mark a coupled º¹r»º¼ ·V…  ̧ D 8 w ss s&s 4    a  [ ̈ g Y Y ̈ ̈ the - . Since pertur norm , the of Z Z Z  Z ̧~¾ pages/documents web of pairs over follo ws. …B¾ as ^=½ 4 a #a bation is only . Moreo ver, [ ̈©ZR ̈ [ Z J the probability vector wn is, from , that according to is dra PageRank the of distrib the ution surfer” “random stationary  ^  ^ D C Cb¤ 4 follo as work ws: transitions state The model. On step , we » A a   [ Z g Y ^ ^ with probability decide “reset” both chains, to in which ^ ¦†§ ^ … ̧ same chosen page and we uniformly to the set case   D s&s s&s the and at if If occurs, collection. random “reset” from no  a g g As is the , eigen- new principal  ~Z ~Z  ZX$ ̈©Z9 [ªj_  ? 8  is one then pages, unperturbed the … of ̧ and … o¿@ o¿À o¿À  , so ̈ to is orthogonal ̈©Z . But pair . ̈ Z ̈©Z £]    … ̧ to the is chosen page be by a random ed link page to   why 1 re- Theorem results these illustrate To ground and … … is chosen to be all page other cases, a random . In o¿@  give another we grees, of example quires a bound h on out-de link independently , and is chosen it, page by to ed of … ̧  o¿@ where a single link—can have a perturbation—adding a small page by . ̧ to ed link page a random be to o¿À this lar ge effect. In example we use has if a graph the fact that 9 value multiple connected components, then the principal eigen a graph of component connected A [1994]. Chung e.g. See, is only gest” entries in nodes have non-zero from the “lar will ÁqP whose elements are connected via length a subset paths to each graph. the value eigen The a connected other , but not to the rest of of 8 e « e of , , there formally , denoted More exists a perturbed version à Nºd~N  Nºd Ã@N value gest eigen (cf. of by is the lar used component ” ‡H‡ ‡H‡ ŠŒ‹®­]“ ̄ à « eª¬ e that so HITS), where N . , a submatrix of N , is the adjacenc y matrix of  .

5 Second Eigenvector Principal Eigenvector Thus, we now have two “coupled” and ov chains Mark … 1 1  AK 2g >K tran- latter their , and that so , but 0.6 0.6 are “correlated. ” For instance, the “resets” to both sitions is each since But lock-step. chain in occur always chains 0.4 0.4 own state transition distrib ution, the follo wing its asymptotic 0.2 0.2 distrib utions of w, be and and must respecti vely . No J J ̧ …   g  ±Å   ̧ h … … h ̧ , since . Note always. let ¾ ¾ ¾ ^ Æ   0 0 Italian French English French English Italian Letting have: denote the set of Ç we pages, perturbed (a) (b) ” Ë]“ˆ ‹ ‹LÏ ÈrÉ“Ê É“Ê ÉÌÊ – – –ÎÍ Randomized HITS Thrid Eigenvector 0.04 1 ” ” ‹LÏ Ë]ÌÐÑÒvÑÓ ‹ ‡ ÐÑÒvÑӃÔ<Ó©Õ~Ö Ë]“ˆ É“Ê ÉÌÊ –ÎÍ – P ” ” 0.8 ‡ ØVÙÚÐÑÒvÑÓÀÔ<Ó©Õ~Ö Ö×Ë]“ˆ ‹®Ï Ë]“ØV٘ÐÑÒvÑMÓ É“Ê ÉÌÊ – –FÍ P 0.03 ” ” ‹®Ï ‡ Ý ØXٚÐÑÒvÑӃÔ:Ó~Õ~Ö Ë]“ˆ Û×ÜRÝ@ÖL ‹ ÉÌÊ É“Ê 0.6 – – Í PÚ¬ P ”Þ ”  ˆ ‹LÏ ‹ Ý ‹®Ï ‡ Ë]“ˆ ØVÙÚÐÑÒvÑÓÀÔ<Ó©Õ~Ö É É“Ê É ÉÌÊ 0.4 –ß Í –FÍ P PÚ¬ 0.02 ”‚à ØVÙÚÐÑÒvÑӃÔ<Ó©Õ Ö ˆ ‡ ‹®Ï Ö×Ë]“ˆ ‹®Ï É É É“Ê ÉÌÊ – – ß Í 0.2 P ” ”Þ Ý Ë]“ˆ ØVÙÚÐÑÒvÑÓÀÔ<Ó©Õ~Ö ‡  ‹®Ï É É á Í 0.01 0 PÚ¬ P English French Italian French Italian English ”‚à ˆ ˆ ØVÙÚÐÑÒvÑÓ@Ô:Ó©Õ~Ö ‡ ։Ë]“ˆ ‹LÏ ‹LÏ ÉâBã É É“Ê É“Ê É ß –ÎÍ –ß (d) (c) P ”v” ” ” ØXٚÐÑÒvÑMÓÀÔ:Ó©Õ,Ö Ö Ë]“ˆ ‡ Ý  Ë]“ˆ ‹LÏ É É É âBã á on random Results corpora. Figure 4: Í P PÚ¬ ” ”   Ý  Ö  ÈrÉ á y matrix the be ê adjacenc in. it appears nodes document Let f PÚ¬ fä9åFæ of find we graph, this only HITS apply If we graph. this to the where to deri by that fact used , we inequality first ve the the of none (since weights hub have non-zero word-nodes   … construction, event “ ” is possible the M… ̧ ̧    3 Æ the document-nodes link to anything) and only the document- … pages. Using the fact that is one of the perturbed only if  nodes have non-zero ver, the vector weights. Moreo authority  of and in terms by iterating this , bound on h h hX¾   ^ é , the first weights of the word-nodes is exactly of HITS hub I’A  1 ́ çrè we . an obtain asymptotic -bound: upper  J h t found by LSI. vector singular left 1 1 stationary the from wn dra is Thus, if u- distrib …  ̧ ws transfer to us - allo connection This from insight exper correlated chains—so the mar distrib utions ginal tion of the HITS. iments our understanding of to In this vein, on LSI 1 1 J … ̧ J respecti given —then are by and and vely of g conducted an we in which random experiment corpora were IrA  1 1 1 ́ Å   çrè if two random . But  J ̧ h … t Æ sampling from a set of English, French, and generated by 1 10 taking h val- chance of dif ferent a small variables have only Italian that these are random corpora documents. Given must their . More similar be precisely utions ues, then , distrib the solution to combinations In- three distinct languages, of s s the Coupling Lemma (e.g., see Aldous, 1983) the varia- by ym- synon or clustering as such val problems Retrie formation 8’I ?   a ́  tional utions between the distrib distance  J J g we simple. exceedingly are identification are that The issue s&s s&s 1 sho . This h ws the by bounded be also must quantity same , we howe in, ver, is stability interested . To study gen- stability ? 1 a J h J the concludes , which proof. t  g such the erated 15 collections and examined the direction of found by HITS. principal eigen vectors 5 LSI and HITS The high the dimensional in lies vector eigen principal ocab To display languages. three the of space ulary our joint-v this In section we present an interesting connection between “di- we results, therefore defined English, French, and Italian ] [ , 1990 et al. Deerwester Inde xing Semantic Latent and HITS eigen ” and rections, the vector degree to which the measured pro (LSI) that results stability our into insight additional vides 11 directions. repetitions independent these Fifteen in each lies (see and Cohn also Chang, 2000). In LSI a collection of doc-   H the were results of this process in Fig- and plotted out, carried is 1 if docu- , where is represented uments a matrix as the ure As in the see, despite 4a. presence of clear clusters we 0 other , and ulary vocab the of word -th  the contains  ment -  highly eigen the corpora, variable. Moreo ver, this are vectors of wise. vectors singular LSI and left the computes right  '  4 4 in persists (Fig- vectors eigen third variability and second the vectors exam- (equi and of ). For eigen , the valently 4b,c). ures , has left principal singular vector , which we denote ple, the é  ulary measures vocab the equal the é and size, dimension to 10 The corpora were novels by taking paragraphs from generated  é of word “strength” -dimension. ’s membership along the words, 25–150 had “documents” Typical languages. three the in yms the informal hope is that synon The will be grouped into ulary the per words 1500 common most the of consisted vocab and singular same vectors, so that when a document (represented  language. was equally to “balanced” manually also collection The of a column by subspace the onto ) is projected spanned by represent language. each be vectors, “expanded” to the singular it will automatically 11 c=ë of unit-norm and whose was This done by picking a vector ì ì include im- synon to leading in yms of words document, the -th word y of in the English frequenc to the element is proportional pro information retrie ved val. c=ë of English “canonical” the thought be should as collection—thus, No consider constructing the graph follo wing citation w English direc- the in lies that amount the direction—and taking í,î a set from each a node be there Let for documents. of doc- c ë of be the absolute magnitude to the dot-product between tion and a word links to the ument and for each word. The node of í , and similarly for French and Italian. î

6 inherent Note is not an variability feature of the that the of the HITS authority weights revie ws Closer examination problem. algo- a dif of a run display we 4d, Figure In ferent are jumps lar to due indeed ge rankings in changes its that of algorithm the HITS rithm that we briefly describe (a variant whereas the PageRank scores tended to in authority weights, 12 [ ] 7, and in Section is studied detail in more in remain fairly stable. et al. Ng , 2001 ). Here variable. less significantly are results the web Given a on pages. experiments out carried We also for obtaining a , Kleinber g [1998] describes query a method on which to run HITS. We use ex- collection of web pages 6 Experiments Further method there, described and the actly perturbed it in a natural - we section this In exper perturbation of results further report 13 , we give the results of two way. For the sak e of bre vity only We also experiment Cor a database. the describe iments an on here. “mp3 query the On HITS’ results experiments players”, web pages. using were are truncated): (long ws follo as URLs methodology the Recall in experiments with the Cor a our .freecode.com/ 1 82 1 1 1 82 http://www a subset database the from papers of database: We choose http://www .html works.com/ 85 2 2 2 83 2 randomly generate a set and of perturbations to this subset by .internettraf 85 3 3 4 3 86 http://www ficreport.com/ 4 86 5 5 4 88 g/ http://slashdot.or Our deleting of the papers. 30% used first of all experiment http://windo 5 ws.da 84 4 3 5 87 vecentral.com/ the gely AI papers in Cora as the base set. Our repli- results lar orks.com/ .gifw http://www 87 6 6 6 6 84 sev- those of Cohn and Chang [2000]—HITS returned cated 7 http://www .thinkgeek.com/ 91 7 7 7 88 http://www 8 89 8 8 9 89 .com/ actory .animf as Genetics ed top-rank the ones. papers (GA) Algorithms eral http://freshmeat.net/ 90 9 9 8 90 9 re- ver, these With the database perturbed as described, howe ver.net/membership.htm 10 10 10 http://subscribe.ando 92 91 10 1385 x.htm 1 - - - 1 http://ourstory .about.com/inde sults were variable, very and HITS often returned seminal x.htm 1386 http://home.about.com/inde 2 - - - 2 ed from papers broader AI areas as its top-rank documents. x.htm - http://home.about.com/musicperform/inde - 3 - 1387 3 Repeating excluding all the GA papers, HITS experiment the http://home.about.com/teens/inde x.htm 4 - - 1388 - 4 1389 5 - - - 5 x.htm http://home.about.com/sports/inde on did five independent slightly better; the results trials are 6 1390 x.htm - - - http://home.about.com/autos/inde 6 wn belo sho w: - 7 x.htm http://home.about.com/style/inde 1391 7 - - 1 Brieman+al 1 1 1 1 1 “Classification and Re gression Trees”, 8 - - 8 1392 - x.htm http://home.about.com/careers/inde classification 2 Duda+Hart analysis”, “Pattern 2 scene and 2 2 3 2 9 - - - 9 x.htm wns/inde http://home.about.com/citiesto 1393 “UCI 3 3 3 7 3 4 Murphy+Aha databases”, learning machine of repository 1394 10 - - - 10 x.htm vel/inde http://home.about.com/tra representations internal 13 28 20 2 3 ...”, Rumelhart+al error by 4 “Learning Selection and Subset vant “Irrele 5 Features the 4 4 12 4 7 John+al Problem”, well perform rules classification simple ery “V 6 5 5 15 5 8 ”, Holte on... contrast, returned: PageRank In 14 10 11 Quinlan Learning”, Machine for Programs “C4.5: 7 6 10 1 1 http://www .team-mp3.com/ * 1 1 1 “Probabilistic 8 461 462 4 459 6 Pearl Systems”, Intelligent in Reasoning 2 http://click.linksyner gy.com/fs-bin/click 1 3 2 4 9 105 CN2 9 induction algorithm”, Clark+Niblett 9 54 11 78 “The 3 http://www .elizandra.com/ 2 2 2 3 2 the 11 9 34 14 ...”, Almuallim+Dietterich 13 in Concepts Boolean “Learning 10 4 4 14 http://stores.yahoo.com/help.html 5 10 11 comparison... ”, Thrun - 9 - 6 7 11 “The MONK’ s problems: A performance 5 3 10 4 12 13 http://shopping.yahoo.com/ 8 7 - the - Quinlan Principle”, MDL using trees decision “Inferring 12 8 wcase/phdss/ .netins.net/sho http://www 6 3 3 6 8 * ”, Fayyad+Irani “Multi-interv - - - - continuous... of al discretization 10 13 http://www 7 .thecounter .com/ 13 6 9 8 7 Relations “Learning 14 Richards+Moon - - - 6 - Pathfinding”, by 4 5 7 4 5 x.htm .about.com/inde http://ourstory 8 - fer - Schaf performance”, generalization law for 8 - 7 ation “A conserv 15 6 x.htm http://a-zlist.about.com/inde 6 6 10 9 5 ” Kira+Randall 20 Traditional... Problem: Selection Feature “The - - - - 9 5 7 8 9 * wcase/phdss/getm .netins.net/sho http://www 10 via... from ” Dempster+al 10 - 5 - lik “Maximum 21 - incomplete elihood data 8 - - 7 7 are/ are.mp3.com/softw http://softw 11 23 to - - 6 - 5 ”, Sutton Temporal... of Method the by Predict “Learning .winamp.com/ 8 - - - - http://www 12 “Introduction 8 - - Hertz+al Computation”, Neural - 36 - to the Theory of - - - 10 .nullsoft.com/ http://www 13 - w”, vie - - 10 - - Mitchell 49 “Explanation-based generalization: a unifying 9 - - 9 .consumerspot.com/redirect/1cac http://www 14 10 - 282“ A rob ust layered control system for a mobile robot”, Brooks - - 9 - changes, small go under s rankings PageRank’ While HITS’ that, We see 2-3 apart from the top rank ed papers, the re- pertur - vior beha “flipping” a mass display rankings . Similar results maining are still rather unstable. For example, Pearl’ s patterns bation for PageRank w) belo example the (and this to on book rank ed 8th; originally the second trial, it dropped was nineteen of out fourteen in ed observ are HITS queries. and rank 282, paper , Brooks’ Similarly 459. rank to and was “flips” mass such displayed results HITS’ Furthermore, in we ver, this variability jumped rank up 9 on trial 3. Ho to is roughly of the trials, which 20% in accordance with the is the problem, as sho wn by not PageRank re- intrinsic our to 20% val rate. remo results PageRank (all with generated were in this section sults A 6  a ): result, this time is another the query Here typical web on ^ “*” that ” Note recipes. “italian means re- was page the that 1 1 gression Trees”, Breiman+al 1 Re 1 1 2 and “Classification “Probabilistic Reasoning in Intelligent Systems”, Pearl 3 2 2 2 1 2 and ved that trial’ s perturbation, by mo has no rank. therefore 3 3 3 3 “Learning internal representations by error ...”, Rumelhart+al 2 3 were: HITS’ results 4 “Pattern classification and 4 analysis”, Duda+Hart 4 4 4 4 scene layered ust robot”, Brooks 5 6 7 5 5 a mobile for system control 5 “A rob 12 HITS eigen higher and vectors second the of Examination in via... data incomplete from elihood lik “Maximum 6 6 6 7 6 ’ Dempster+al 6 by “Learning 7 Predict 7 5 5 7 7 ”, Sutton Method of Temporal... the to vary that to trial from substantially trial. can y, too, the ws sho “UCI repository of machine learning databases”, Murphy+Aha 8 9 8 9 9 11 13 [1998] search Kleinber g engine first uses a web 8 Press+al C”, in Recipes “Numerical 9 8 12 10 11 vista.com (www documents 200 ve retrie to case) our in to .alta 9 Rumelhart+al 14 13 Processing”, uted Distrib “Parallel 10 10 9 of a theory 12 “An implementation of acti vity”, Agre+Chapmanre - 8 10 8 - a “root (and expanded is then ” which set, form further processed) to Neural Theory “Introduction 13 - the of to Computation”, Hertz+al - 10 - - Our operates. HITS which on perturbations the define web-graph and “A Representation 22 - - ”, Valente+al - 10 - in... ves Objecti for Library at randomly set were arri ved (i.e. by deleting 20% of the root from was a document’ in a drop change gest lar The s rank search the imagining that the web of engine had only returned 80% and did), it actually pages g’s procedure. Kleinber wing follo then are for to 12—these results 10 much more stable than HITS.

7 1 .about.com/inde * 1 1 1 1 http://ourstory x.htm the rel- be to appears algorithm PageRank The fact that 17 * http://home.about.com/culture/inde 2 x.htm 2 2 2 immune concerns is a matter of consid- stability atively to 3 http://home.about.com/inde 3 x.htm * 3 3 25 is It interest. erable “reset-to-uniform- the that belief our 4 2 4 4 * x.htm http://home.about.com/food/inde 4 5 5 5 x.htm http://home.about.com/science/inde 5 3 * aspect of PageRank is a critical feature in this ution” distrib 6 http://home.about.com/shopping/inde x.htm * 6 6 6 4 Indeed, HITS al- one can explore a variation of the regard. 7 5 7 7 * x usiness/inde http://home.about.com/smallb 7 incorporates such a feature. Suppose that we gorithm which 8 http://home.about.com/sports/inde x.htm * 8 8 8 6 9 9 http://home.about.com/arts/inde x.htm 7 * 9 9 construct proba- with which, in a Mark the on ov chain web 8Ú?‰A http://home.about.com/style/inde 10 10 10 8 * x.htm 10 randomly , we follo w a hyperlink from the current bility - - - 9 http://home.about.com/autos/inde - 11 x.htm 12 - x.htm http://home.about.com/teens/inde 10 - - - steps), odd we time page in the (on forw ard and direction 479 http://bestbrandrecipe.com/def 1 - - - - ault.asp backw ards direction (on randomly follo w a hyperlink in the A - - - - 480 http://myrecipe.com/help/shopping.asp 2 time even probability With steps). to , we reset a uniformly - - - 3 - ault.asp 481 http://v egetarianrecipe.com/def 482 - - - - 5 ault.asp http://holidayrecipe.com/def web-page u- distrib chosen page. The asymptotic visitation 4 - - - ault.asp http://beefrecipe.com/def 483 - odd the and tion on weights, be steps is defined authority to http://be veragerecipe.com/def 7 - - ault.asp - 484 - the even on can hub steps As in Theorem 3, we weights. ault.asp - - - - 485 6 http://appetizerrecipe.com/def 486 http://pierecipe.com/def ault.asp 8 - - - - w this small sho (but perturbations ve to is insensiti algorithm - - - - 9 ault.asp http://seafoodrecipe.com/def 487 authority scores). e PageRank, we obtain hub as well as unlik http://barbequerecipe.com/def 488 10 - - - ault.asp - languages” the on algorithm this running of results The “three other hand, returned: the PageRank, on problem in Figure 4d, where we see are sho wn it is in- that 1 1 1 * x.htm .about.com/inde http://ourstory 1 1 than stable more significantly deed HITS basic algorithm. the 2 2 http://a-zlist.about.com/inde x.htm * 2 2 2 [ This al. et Ng more in detail , in explored is also algorithm 3 1 .apple.com/ 3 http://www 3 3 3 g/ 13 4 http://www .tznet.com/isenber 9 2 4 4 ] 2001 . http://frontier .userland.com/ 3 4 7 5 5 5 .mikrostore.com/ http://www 6 5 6 * 4 6 6 * 7 http://www .amazinggiftsonline.com/ 5 7 7 wledgments Ackno .peck.it/peckshop/home.asp?pro 8 8 * v http://www 8 7 4 citation viding w McCallum Andre We thank for the Cora pro 9 http://geocities.yahoo.com/addons/interac 6 9 9 8 29 5 10 * 10 7 x.html vita.com/inde http://dvs.dolce 10 experiments. our in used thank data Zimdars We also Andy 6 9 10 - - .net/ .dossier http://www 11 by ONR supported was work This comments. helpful for 12 - - - 8 vita.com/ .dolce http://www 8 http://www .q-d.com/ 9 - - - 10 14 MURI NSF grant IIS-9988642. N00014-00-1-063 7 and - - - - http://www .silesk y.com/ 10 15 ences Refer ] [ 1983 Random Da vid Aldous. walks on finite Aldous, 7 Discussion ov chains. Dold A. In mark mixing rapidly and groups the in wn kno It is well algebra numerical linear community XVII es Probabilit e de eminair S editors, Eckmann, B. and ́ ́ k vec- ) eigen spanned first the that a subspace (e.g. several by Vol. pages 1981/1982 , Lecture Notes in Mathematics, 986, vidual indi perturbation, while eigen- tors may be stable under 1983. 243–297. Springer -Verlag, [ ] may not Ste wart and Sun, 1990 vectors . Our results—both [ ] fact. general this empirical—reflect and theoretical Bharat and , 1998 Henzinger K. Bharat and M. R. Hen- If the stability the then is a subspace, algorithm an of output zinger ved algorithms for topic distillation in a hy- . Impro considerations of a matter be not may have discussed we that Annual ACM perlink ed environment. In Proc. 21st Intl. for primary concern. Such is the case, for example, the LSI Confer ence , pages 104–111. ACM, 1998. SIGIR where a data set algorithm, the goal is generally to project [ ] The anatomy of and L. Brin and Page, 1998 Page. S. Brin onto -dimensional subspace. a lower The In engine. a lar ge-scale hyperte search eb) (W xtual eigen specific interpret vectors, howe ver, then to wish If we Wide World International Seventh , 1998. ence Confer Web the issue becomes a matter of more serious concern. stability [ ] Chung, 1994 Gr al Spectr . K. R. Fan aph Theory Chung. basic HITS This where pri- is the for situation the algorithm, , 1994. Society American Mathematical terms of in interpreted have been vectors eigen mary a set of ] [ have seen, are there “hubs” we ” As theoreti- and “authorities. Chang. Probabilis- 2000 Cohn Cohn and Chang, D. and H. caution cal and empirical reasons for exercising considerable Proc. tically identifying authoritati ve documents. In 17th interpretations. such making in , 2000. Learning hine Mac on ence Confer International Given that principal the have a reliable not eigen vector may [ ] Fur Dumais, Deerwester Deerwester G. S. , S. - , 1990 et al. ap- HITS the of variations consider can one interpretation, by latent , and T. Landauer nas, xing Inde R. Harshman. g Kleinber Indeed, vectors. eigen multiple utilize that proach the American Society for In- semantic analysis. Journal of a way as [1998] suggested examining multiple eigen vectors 1990. formation Science , 41(6):391–407, within of communities. Again, authorities obtaining multiple ] [ Golub C. and Golub H. G. F. Van 1996 Loan, Van and vidual indi interpret howe ver, it may be problematic to eigen- Computations Matrix Loan. v. Press, Uni Hopkins . Johns in significant and in fact vectors, our experiments we found 1996. alternati An vectors. eigen third and second in variability ve [ ] approach be to automatically combine multiple eigen- may Kleinber in a J. g. ve sources g, 1998 Kleinber Authoritati subspaces a way vectors in that explicitly within identifies 9th hyperlink ed environment. Proc. Sympo- ACM-SIAM ] [ Ng , 2001 HITS frame work. This is explored in the et al. sium on Discr ete Algorithms , 1998.

8 [ ] , 2000 McCallum Andre w McCallum, Kamal Nigam, et al. Jason Rennie, Kristie Seymore. Automating the con- and ï Internet learning. with machine of Infor - truction portals val Journal , 3:127–163, 2000. mation Retrie [ ] , 2001 Ng et al. w Y. Ng, Alice X. Zheng, and Andre Michael Jordan. Stable algorithms for link analysis. I. Confer Proc. In Intl. ACM SIGIR 24th ence . ACM, Annual 2001. [ ] 1996 Farideh Osareh, Osareh. Bibliometrics, citation anal- ysis and analysis: A revie w of literature I. co-citation , 46:149–158, Libri 1996. [ ] , 1998 Page La et al. Page, Ser gey Brin, Rajee v wrence Motw ani, and Terry Winograd. The PageRank cita- Unpublished tion order to the web. Bringing ranking: 1998. Manuscript, [ ] wart and Sun, 1990 Ste G. W. Ste wart and Ji-Guang Sun. Matrix Perturbation Theory . Academic Press, 1990.

Related documents

Corporate Surveillance in Everyday Life

Corporate Surveillance in Everyday Life

Wolfie Christl CORPORATE SURVEILLANCE IN EVERYDAY LIFE How Companies Collect, Combine, Analyze, Trade, and Use Personal Data on Billions EPORT BY CRACKED LABS A R Vienna, June 2017 Author: Wolfie Chri...

More info »
SEIRiP

SEIRiP

W. Bruce Croft Donald Metzler Trevor Strohman Search Engines Information Retrieval in Practice ©W.B. Croft, D. Metzler, T. Strohman, 2015 This book was previously published by: Pearson Education, Inc.

More info »