

Authoritative Sources in a Hyperlinked Environment*

Jon M. Kleinberg†

Abstract

The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.

* Preliminary versions of this paper appear in the Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998, and as IBM Research Report RJ 10076, May 1997.
† Dept. of Computer Science, Cornell University, Ithaca NY 14853. Email: [email protected]. This work was performed in large part while on leave at the IBM Almaden Research Center, San Jose CA 95120. The author is currently supported by an Alfred P. Sloan Research Fellowship and by NSF Faculty Early Career Development Award CCR-9701399.

1 Introduction

The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. In this work, we develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web (www) [4]. In particular, we focus on the use of links for analyzing the collection of pages relevant to a broad search topic, and for discovering the most "authoritative" pages on such topics.

While our techniques are not specific to the www, we find the problems of search and structural analysis particularly compelling in the context of this domain. The www is a hypertext corpus of enormous complexity, and it continues to expand at a phenomenal rate. Moreover, it can be viewed as an intricate form of populist hypermedia, in which millions of on-line participants, with diverse and often conflicting goals, are continuously creating hyperlinked content. Thus, while individuals can impose order at an extremely local level, its global organization is utterly unplanned; high-level structure can emerge only through a posteriori analysis.

Our work originates in the problem of searching on the www, which we could define roughly as the process of discovering pages that are relevant to a given query. The quality of a search method necessarily requires human evaluation, due to the subjectivity inherent in notions such as relevance. We begin from the observation that improving the quality of search methods on the www is, at the present time, a rich and interesting problem that is in many ways orthogonal to concerns of algorithmic efficiency and storage. In particular, consider that current search engines typically index a sizable portion of the www and respond on the order of seconds. Although there would be considerable utility in a search tool with a longer response time, provided that the results were of significantly greater value to a user, it has typically been very hard to say what such a search tool should be computing with this extra time. Clearly we are lacking objective functions that are both concretely defined and correspond to human notions of quality.

Queries and Authoritative Sources. We view searching as beginning from a user-supplied query. It seems best not to take too unified a view of the notion of a query; there is more than one type of query, and the handling of each may require different techniques. Consider, for example, the following types of queries.

- Specific queries. E.g., "Does Netscape support the JDK 1.1 code-signing API?"
- Broad-topic queries. E.g., "Find information about the Java programming language."
- Similar-page queries. E.g., "Find pages 'similar' to java.sun.com."

Concentrating on just the first two types of queries for now, we see that they present very different sorts of obstacles.

The difficulty in handling specific queries is centered, roughly, around what could be called the Scarcity Problem: there are very few pages that contain the required information, and it is often difficult to determine the identity of these pages.

For broad-topic queries, on the other hand, one expects to find many thousands of relevant pages on the www; such a set of pages might be generated by variants of term-matching (e.g. one enters a string such as "Gates," "search engines," or "censorship" into a search engine such as AltaVista [17]), or by more sophisticated means. Thus, there is not an issue of scarcity here. Instead, the fundamental difficulty lies in what could be called the Abundance Problem: The number of pages that could reasonably be returned as relevant is far too large for a human user to digest. To provide effective search methods under these conditions, one needs a way to filter, from among a huge collection of relevant pages, a small set of the most "authoritative" or "definitive" ones.

This notion of authority, relative to a broad-topic query, serves as a central focus in our work. One of the fundamental obstacles we face in addressing this issue is that of accurately modeling authority in the context of a particular query topic. Given a particular page, how do we tell whether it is authoritative? It is useful to discuss some of the complications that arise here.

First, consider the natural goal of reporting www.harvard.edu, the home page of Harvard University, as one of the most authoritative pages for the query "Harvard". Unfortunately, there are over a million pages on the www that use the term "Harvard," and www.harvard.edu is not the one that uses the term most often, or most prominently, or in any other way that would favor it under a text-based ranking function. Indeed, one suspects that there is no purely endogenous measure of the page that would allow one to properly assess its authority. Second, consider the problem of finding the home pages of the main www search engines. One could begin from the query "search engines", but there is an immediate difficulty in the fact that many of the natural authorities (Yahoo!, Excite, AltaVista) do not use the term on their pages. This is a fundamental and recurring phenomenon; as another example, there is no reason to expect the home pages of Honda or Toyota to contain the term "automobile manufacturers."

Analysis of the Link Structure. Analyzing the hyperlink structure among www pages gives us a way to address many of the difficulties discussed above. Hyperlinks encode a considerable amount of latent human judgment, and we claim that this type of judgment is precisely what is needed to formulate a notion of authority. Specifically, the creation of a link on the www represents a concrete indication of the following type of judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q. Moreover, links afford us the opportunity to find potential authorities purely through the pages that point to them; this offers a way to circumvent the problem, discussed above, that many prominent pages are not sufficiently self-descriptive.

Of course, there are a number of potential pitfalls in the application of links for such a purpose.

First of all, links are created for a wide variety of reasons, many of which have nothing to do with the conferral of authority. For example, a large number of links are created primarily for navigational purposes ("Click here to return to the main menu"); others represent paid advertisements.

Another issue is the difficulty in finding an appropriate balance between the criteria of relevance and popularity, each of which contributes to our intuitive notion of authority. It is instructive to consider the serious problems inherent in the following simple heuristic for locating authoritative pages: Of all pages containing the query string, return those with the greatest number of in-links. We have already argued that for a great many queries ("search engines", "automobile manufacturers", ...), a number of the most authoritative pages do not contain the associated query string. Conversely, this heuristic would consider a universally popular page such as www.yahoo.com or www.netscape.com to be highly authoritative with respect to any query string that it contained.

In this work, we propose a link-based model for the conferral of authority, and show how it leads to a method that consistently identifies relevant, authoritative www pages for broad search topics. Our model is based on the relationship that exists between the authorities for a topic and those pages that link to many related authorities; we refer to pages of this latter type as hubs. We observe that a certain natural type of equilibrium exists between hubs and authorities in the graph defined by the link structure, and we exploit this to develop an algorithm that identifies both types of pages simultaneously. The algorithm operates on focused subgraphs of the www that we construct from the output of a text-based www search engine; our technique for constructing such subgraphs is designed to produce small collections of pages likely to contain the most authoritative pages for a given topic.

Overview. Our approach to discovering authoritative www sources is meant to have a global nature: We wish to identify the most central pages for broad search topics in the context of the www as a whole. Global approaches involve basic problems of representing and filtering large volumes of information, since the entire set of pages relevant to a broad-topic query can have a size in the millions. This is in contrast to local approaches that seek to understand the interconnections among the set of www pages belonging to a single logical site or intranet; in such cases the amount of data is much smaller, and often a different set of considerations dominates.

It is also important to note the sense in which our main concerns are fundamentally different from problems of clustering. Clustering addresses the issue of dissecting a heterogeneous population into sub-populations that are in some way more cohesive; in the context of the www, this may involve distinguishing pages related to different meanings or senses of a query term. Thus, clustering is intrinsically different from the issue of distilling broad topics via the discovery of authorities, although a subsequent section will indicate some connections.

For even if we were able perfectly to dissect the multiple senses of an ambiguous query term (e.g. "Windows" or "Gates"), we would still be left with the same underlying problem of representing and filtering the vast number of pages that are relevant to each of the main senses of the query term.

The paper is organized as follows. Section 2 discusses the method by which we construct a focused subgraph of the www with respect to a broad search topic, producing a set of relevant pages rich in candidate authorities. Sections 3 and 4 discuss our main algorithm for identifying hubs and authorities in such a subgraph, and some of the applications of this algorithm. Section 5 discusses the connections with related work in the areas of www search, bibliometrics, and the study of social networks. Section 6 describes how an extension of our basic algorithm produces multiple collections of hubs and authorities within a common link structure. Finally, Section 7 investigates the question of how "broad" a topic must be in order for our techniques to be effective, and Section 8 surveys some work that has been done on the evaluation of the method presented here.

2 Constructing a Focused Subgraph of the WWW

We can view any collection V of hyperlinked pages as a directed graph G = (V, E): the nodes correspond to the pages, and a directed edge (p, q) ∈ E indicates the presence of a link from p to q. We say that the out-degree of a node p is the number of nodes it has links to, and the in-degree of p is the number of nodes that have links to it. From a graph G, we can isolate small regions, or subgraphs, in the following way. If W βŠ† V is a subset of the pages, we use G[W] to denote the graph induced on W: its nodes are the pages in W, and its edges correspond to all the links between pages in W.

Suppose we are given a broad-topic query, specified by a query string Οƒ. We wish to determine authoritative pages by an analysis of the link structure; but first we must determine the subgraph of the www on which our algorithm will operate. Our goal here is to focus the computational effort on relevant pages. Thus, for example, we could restrict the analysis to the set Q_Οƒ of all pages containing the query string; but this has two significant drawbacks. First, this set may contain well over a million pages, and hence entail a considerable computational cost; and second, we have already noted that some or most of the best authorities may not belong to this set.

Ideally, we would like to focus on a collection S_Οƒ of pages with the following properties.

(i) S_Οƒ is relatively small.
(ii) S_Οƒ is rich in relevant pages.
(iii) S_Οƒ contains most (or many) of the strongest authorities.

By keeping S_Οƒ small, we are able to afford the computational cost of applying non-trivial algorithms; by ensuring it is rich in relevant pages we make it easier to find good authorities, as these are likely to be heavily referenced within S_Οƒ.

How can we find such a collection of pages?

[Figure 1: Expanding the root set into a base set.]

For a parameter t (typically set to about 200), we first collect the t highest-ranked pages for the query Οƒ from a text-based search engine such as AltaVista [17] or Hotbot [57]. We will refer to these t pages as the root set R_Οƒ. This root set satisfies (i) and (ii) of the desiderata listed above, but it generally is far from satisfying (iii). To see this, note that the top t pages returned by the text-based search engines we use will all contain the query string Οƒ, and hence R_Οƒ is clearly a subset of the collection Q_Οƒ of all pages containing Οƒ. Above we argued that even Q_Οƒ will often not satisfy condition (iii). It is also interesting to observe that there are often extremely few links between pages in R_Οƒ, rendering it essentially "structureless". For example, in our experiments, the root set for the query "java" contained 15 links between pages in different domains; the root set for the query "censorship" contained 28 links between pages in different domains. These numbers are typical for a variety of the queries tried; they should be compared with the 200 Γ— 199 = 39800 potential links that could exist between pages in the root set.

We can use the root set R_Οƒ, however, to produce a set of pages S_Οƒ that will satisfy the conditions we're seeking. Consider a strong authority for the query topic: although it may well not be in the set R_Οƒ, it is quite likely to be pointed to by at least one page in R_Οƒ. Hence, we can increase the number of strong authorities in our subgraph by expanding R_Οƒ along the links that enter and leave it. In concrete terms, we define the following procedure.

Subgraph(Οƒ, E, t, d)
  Οƒ: a query string.
  E: a text-based search engine.
  t, d: natural numbers.
  Let R_Οƒ denote the top t results of E on Οƒ.
  Set S_Οƒ := R_Οƒ
  For each page p ∈ R_Οƒ
    Let Ξ“+(p) denote the set of all pages p points to.
    Let Ξ“-(p) denote the set of all pages pointing to p.
    Add all pages in Ξ“+(p) to S_Οƒ.
    If |Ξ“-(p)| ≀ d then
      Add all pages in Ξ“-(p) to S_Οƒ.
    Else
      Add an arbitrary set of d pages from Ξ“-(p) to S_Οƒ.
  End
  Return S_Οƒ

Thus, we obtain S_Οƒ by growing R_Οƒ to include any page pointed to by a page in R_Οƒ and any page that points to a page in R_Οƒ, with the restriction that we allow a single page in R_Οƒ to bring at most d pages pointing to it into S_Οƒ. This latter point is crucial since a number of www pages are pointed to by several hundred thousand pages, and we can't include all of them in S_Οƒ if we wish to keep it reasonably small.

We refer to S_Οƒ as the base set for Οƒ; in our experiments we construct it by invoking the Subgraph procedure with the search engine AltaVista, t = 200, and d = 50. We find that S_Οƒ typically satisfies points (i), (ii), and (iii) above: its size is generally in the range 1000-5000; and, as we discussed above, a strong authority need only be referenced by any one of the 200 pages in the root set R_Οƒ in order to be added to S_Οƒ.

In the next section, we describe our algorithm to compute hubs and authorities in the base set S_Οƒ. Before turning to this, we discuss a heuristic that is very useful for offsetting the effect of links that serve purely a navigational function. First, let G[S_Οƒ] denote, as above, the subgraph induced on the pages in S_Οƒ. We distinguish between two types of links in G[S_Οƒ]. We say that a link is transverse if it is between pages with different domain names, and intrinsic if it is between pages with the same domain name. By "domain name" here, we mean the first level in the URL string associated with a page. Since intrinsic links very often exist purely to allow for navigation of the infrastructure of a site, they convey much less information than transverse links about the authority of the pages they point to. Thus, we delete all intrinsic links from the graph G[S_Οƒ], keeping only the edges corresponding to transverse links; this results in a graph G_Οƒ.

This is a very simple heuristic, but we find it effective for avoiding many of the pathologies caused by treating navigational links in the same way as other links. There are other simple heuristics that can be valuable for eliminating links that do not seem intuitively to confer authority. One that is worth mentioning is based on the following observation. Suppose a large number of pages from a single domain all point to a single page p. Quite often this corresponds to a mass endorsement, advertisement, or some other type of "collusion" among the referring pages, e.g. the phrase "This site designed by ..." and a corresponding link at the bottom of each page in a given domain.
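For concreteness, the base-set construction can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the helpers search_top(sigma, t), out_links(p), and in_links(p), returning the top-t search-engine results, the pages p points to, and the pages pointing to p, are hypothetical and would have to be supplied by the caller.

    import random

    def expand_root_set(root, out_links, in_links, d=50):
        # Grow a root set R_sigma into a base set S_sigma, as in the Subgraph procedure:
        # add every page a root page points to, and at most d of the pages pointing to it.
        base = set(root)
        for p in root:
            base.update(out_links(p))                    # all of Gamma+(p)
            pointers = list(in_links(p))                 # Gamma-(p)
            if len(pointers) <= d:
                base.update(pointers)
            else:
                base.update(random.sample(pointers, d))  # an arbitrary d-element subset
        return base

    def subgraph(sigma, search_top, out_links, in_links, t=200, d=50):
        # Root set: the top-t text-search results for the query string sigma.
        return expand_root_set(set(search_top(sigma, t)), out_links, in_links, d)

The intrinsic-link deletion described above would then be applied to the links among the returned pages before any weights are computed.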

To eliminate this collusion phenomenon, we can fix a parameter m (typically m β‰ˆ 4-8) and only allow up to m pages from a single domain to point to any given page p. Again, this can be an effective heuristic in some cases, although we did not employ it when running the experiments that follow.

3 Computing Hubs and Authorities

The method of the previous section provides a small subgraph G_Οƒ that is relatively focused on the query topic: it has many relevant pages, and strong authorities. We now turn to the problem of extracting these authorities from the overall collection of pages, purely through an analysis of the link structure of G_Οƒ.

The simplest approach, arguably, would be to order pages by their in-degree, the number of links that point to them, in G_Οƒ. We rejected this idea earlier, when it was applied to the collection of all pages containing the query term Οƒ; but now we have explicitly constructed a small collection of relevant pages containing most of the authorities we want to find. Thus, these authorities both belong to G_Οƒ and are heavily referenced by pages within G_Οƒ.

Indeed, the approach of ranking purely by in-degree does typically work much better in the context of G_Οƒ than in the earlier settings we considered; in some cases, it can produce uniformly high-quality results. However, the approach still retains some significant problems. For example, on the query "java", the pages with the largest in-degree consisted of www.gamelan.com and java.sun.com, together with pages advertising for Caribbean vacations, and the home page of Amazon Books. This mixture is representative of the type of problem that arises with this simple ranking scheme: While the first two of these pages should certainly be viewed as "good" answers, the others are not relevant to the query topic; they have large in-degree but lack any thematic unity. The basic difficulty this exposes is the inherent tension that exists within the subgraph G_Οƒ between strong authorities and pages that are simply "universally popular"; we expect the latter type of pages to have large in-degree regardless of the underlying query topic.

One could wonder whether circumventing these problems requires making further use of the textual content of pages in the base set, rather than just the link structure of G_Οƒ. We now show that this is not the case; it is in fact possible to extract information more effectively from the links, and we begin from the following observation. Authoritative pages relevant to the initial query should not only have large in-degree; since they are all authorities on a common topic, there should also be considerable overlap in the sets of pages that point to them. Thus, in addition to highly authoritative pages, we expect to find what could be called hub pages: these are pages that have links to multiple relevant authoritative pages. It is these hub pages that "pull together" authorities on a common topic, and allow us to throw out unrelated pages of large in-degree. (A skeletal example is depicted in Figure 2; in reality, of course, the picture is not nearly this clean.)

[Figure 2: A densely linked set of hubs and authorities, together with an unrelated page of large in-degree.]

Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs. Clearly, if we wish to identify hubs and authorities within the subgraph G_Οƒ, we need a method for breaking this circularity.

An Iterative Algorithm. We make use of the relationship between hubs and authorities via an iterative algorithm that maintains and updates numerical weights for each page. Thus, with each page p, we associate a non-negative authority weight x⟨p⟩ and a non-negative hub weight y⟨p⟩. We maintain the invariant that the weights of each type are normalized so their squares sum to 1: Ξ£_{p ∈ S_Οƒ} (x⟨p⟩)^2 = 1, and Ξ£_{p ∈ S_Οƒ} (y⟨p⟩)^2 = 1. We view the pages with larger x- and y-values as being "better" authorities and hubs respectively.

Numerically, it is natural to express the mutually reinforcing relationship between hubs and authorities as follows: If p points to many pages with large x-values, then it should receive a large y-value; and if p is pointed to by many pages with large y-values, then it should receive a large x-value. This motivates the definition of two operations on the weights, which we denote by I and O. Given weights {x⟨p⟩}, {y⟨p⟩}, the I operation updates the x-weights as follows:

  x⟨p⟩ ← Ξ£_{q : (q,p) ∈ E} y⟨q⟩.

The O operation updates the y-weights as follows:

  y⟨p⟩ ← Ξ£_{q : (p,q) ∈ E} x⟨q⟩.

Thus I and O are the basic means by which hubs and authorities reinforce one another. (See Figure 3.)

Now, to find the desired "equilibrium" values for the weights, one can apply the I and O operations in an alternating fashion, and see whether a fixed point is reached. Indeed, we can now state a version of our basic algorithm.

[Figure 3: The basic operations. x[p] := sum of y[q], over all q pointing to p; y[p] := sum of x[q], over all q pointed to by p.]

We represent the set of weights {x⟨p⟩} as a vector x with a coordinate for each page in G_Οƒ; analogously, we represent the set of weights {y⟨p⟩} as a vector y.

Iterate(G, k)
  G: a collection of n linked pages
  k: a natural number
  Let z denote the vector (1, 1, 1, ..., 1) ∈ R^n.
  Set x_0 := z.
  Set y_0 := z.
  For i = 1, 2, ..., k
    Apply the I operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i.
    Apply the O operation to (x'_i, y_{i-1}), obtaining new y-weights y'_i.
    Normalize x'_i, obtaining x_i.
    Normalize y'_i, obtaining y_i.
  End
  Return (x_k, y_k).
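A compact rendering of Iterate in numpy may also be useful. It is a sketch rather than the paper's code, and it assumes the links of G_Οƒ have already been collected into a 0/1 adjacency matrix A with A[i, j] = 1 exactly when page i links to page j.

    import numpy as np

    def iterate(A, k=20):
        n = A.shape[0]
        x = np.ones(n)                   # authority weights x<p>
        y = np.ones(n)                   # hub weights y<p>
        for _ in range(k):
            x = A.T @ y                  # I operation: x<p> = sum of y<q> over links q -> p
            y = A @ x                    # O operation: y<p> = sum of x<q> over links p -> q
            x = x / np.linalg.norm(x)    # normalize so the squares sum to 1
            y = y / np.linalg.norm(y)
        return x, y

The Filter procedure defined next simply reports the pages with the c largest coordinates in each of the returned vectors.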

The Iterate procedure can be applied to filter out the top c authorities and top c hubs in the following simple way.

Filter(G, k, c)
  G: a collection of n linked pages
  k, c: natural numbers
  (x_k, y_k) := Iterate(G, k).
  Report the pages with the c largest coordinates in x_k as authorities.
  Report the pages with the c largest coordinates in y_k as hubs.

We will apply the Filter procedure with G set equal to G_Οƒ, and typically with c β‰ˆ 5-10. To address the issue of how best to choose k, the number of iterations, we first show that as one applies Iterate with arbitrarily large values of k, the sequences of vectors {x_k} and {y_k} converge to fixed points x* and y*.

We require the following notions from linear algebra, and refer the reader to a text such as [30] for more comprehensive background. Let M be a symmetric n Γ— n matrix. An eigenvalue of M is a number Ξ» with the property that, for some vector Ο‰, we have MΟ‰ = λω. The set of all such Ο‰ is a subspace of R^n, which we refer to as the eigenspace associated with Ξ»; the dimension of this space will be referred to as the multiplicity of Ξ». It is a standard fact that M has at most n distinct eigenvalues, each of them a real number, and the sum of their multiplicities is exactly n. We will denote these eigenvalues by Ξ»_1(M), Ξ»_2(M), ..., Ξ»_n(M), indexed in order of decreasing absolute value, and with each eigenvalue listed a number of times equal to its multiplicity. For each distinct eigenvalue, we choose an orthonormal basis of its eigenspace; considering the vectors in all these bases, we obtain a set of eigenvectors Ο‰_1(M), Ο‰_2(M), ..., Ο‰_n(M) that we can index in such a way that Ο‰_i(M) belongs to the eigenspace of Ξ»_i(M).

For the sake of simplicity, we will make the following technical assumption about all the matrices we deal with:

  (†)  |Ξ»_1(M)| > |Ξ»_2(M)|.

When this assumption holds, we refer to Ο‰_1(M) as the principal eigenvector, and all other Ο‰_i(M) as non-principal eigenvectors. When the assumption does not hold, the analysis becomes less clean, but it is not affected in any substantial way.

We now prove that the Iterate procedure converges as k increases arbitrarily.

Theorem 3.1 The sequences x_1, x_2, x_3, ... and y_1, y_2, y_3, ... converge (to limits x* and y* respectively).

Proof. Let G = (V, E), with V = {p_1, p_2, ..., p_n}, and let A denote the adjacency matrix of the graph G; the (i, j)th entry of A is equal to 1 if (p_i, p_j) is an edge of G, and is equal to 0 otherwise. One easily verifies that the I and O operations can be written x ← A^T y and y ← Ax respectively. Thus x_k is the unit vector in the direction of (A^T A)^{k-1} A^T z, and y_k is the unit vector in the direction of (A A^T)^k z.

Now, a standard result of linear algebra (e.g. [30]) states that if M is a symmetric n Γ— n matrix, and v is a vector not orthogonal to the principal eigenvector Ο‰_1(M), then the unit vector in the direction of M^k v converges to Ο‰_1(M) as k increases without bound. Also (as a corollary), if M has only non-negative entries, then the principal eigenvector of M has only non-negative entries.

Consequently, z is not orthogonal to Ο‰_1(A A^T), and hence the sequence {y_k} converges to a limit y*. Similarly, one can show that if Ξ»_1(A^T A) β‰  0 (as dictated by Assumption (†)), then A^T z is not orthogonal to Ο‰_1(A^T A). It follows that the sequence {x_k} converges to a limit x*.

The proof of Theorem 3.1 yields the following additional result (in the above notation).

Theorem 3.2 (Subject to Assumption (†).) x* is the principal eigenvector of A^T A, and y* is the principal eigenvector of A A^T.

In our experiments, we find that the convergence of Iterate is quite rapid; one essentially always finds that k = 20 is sufficient for the c largest coordinates in each vector to become stable, for values of c in the range that we use. Of course, Theorem 3.2 shows that one can use any eigenvector algorithm to compute the fixed points x* and y*; we have stuck to the above exposition in terms of the Iterate procedure for two reasons. First, it emphasizes the underlying motivation for our approach in terms of the reinforcing I and O operations. Second, one does not have to run the above process of iterated I/O operations to convergence; one can compute weights {x⟨p⟩} and {y⟨p⟩} by starting from any initial vectors x_0 and y_0, and performing a fixed bounded number of I and O operations.
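As one illustration of this remark, the fixed points can be read off directly from a standard eigensolver. The sketch below is ours, not the paper's, and uses the same adjacency-matrix convention as the earlier iterate sketch; the sign fix relies on the fact, noted in the proof above, that the principal eigenvector of a non-negative symmetric matrix can be taken to be non-negative.

    import numpy as np

    def fixed_points(A):
        # Theorem 3.2: x* is the principal eigenvector of A^T A and
        # y* is the principal eigenvector of A A^T.
        _, vx = np.linalg.eigh(A.T @ A)   # eigh returns eigenvalues in ascending order
        _, vy = np.linalg.eigh(A @ A.T)
        x_star = np.abs(vx[:, -1])        # last column = principal eigenvector; fix its sign
        y_star = np.abs(vy[:, -1])
        return x_star, y_star

Up to this choice of sign, the output should agree with iterate(A, k) for moderate k such as 20.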

Basic Results. We now give some sample results obtained via the algorithm, using some of the queries discussed in the introduction.

(java) Authorities
.328  http://www.gamelan.com/  Gamelan
.251  http://java.sun.com/  JavaSoft Home Page
.190  http://www.digitalfocus.com/digitalfocus/faq/howdoi.html  The Java Developer: How Do I...
.190  http://lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html  The Java Book Pages
.183  http://sunsite.unc.edu/javafaq/javafaq.html  comp.lang.java FAQ

(censorship) Authorities
.378  http://www.eff.org/  EFFweb - The Electronic Frontier Foundation
.344  http://www.eff.org/blueribbon.html  The Blue Ribbon Campaign for Online Free Speech
.238  http://www.cdt.org/  The Center for Democracy and Technology
.235  http://www.vtw.org/  Voters Telecommunications Watch
.218  http://www.aclu.org/  ACLU: American Civil Liberties Union

("search engines") Authorities
.346  http://www.yahoo.com/  Yahoo!
.291  http://www.excite.com/  Excite
.239  http://www.mckinley.com/  Welcome to Magellan!
.231  http://www.lycos.com/  Lycos Home Page
.231  http://www.altavista.digital.com/  AltaVista: Main Page

(Gates) Authorities
.643  http://www.roadahead.com/  Bill Gates: The Road Ahead
.458  http://www.microsoft.com/  Welcome to Microsoft
.440  http://www.microsoft.com/corpinfo/bill-g.htm

Among all these pages, the only one which occurred in the corresponding root set R_Οƒ was www.roadahead.com/, under the query "Gates"; it was ranked 123rd by AltaVista. This is natural in view of the fact that many of these pages do not contain any occurrences of the initial query string.

It is worth reflecting on two additional points here. First, our only use of the textual content of pages was in the initial "black-box" call to a text-based search engine, which produced the root set R_Οƒ. Following this, the analysis ignored the textual content of pages. The point we wish to make here is not that text is best ignored in searching for authoritative pages; there is clearly much that can be accomplished through the integration of textual and link-based analysis, and we will be commenting on this in a subsequent section. However, the results above show that a considerable amount can be accomplished through essentially a "pure" analysis of link structure.

Second, for many broad search topics, our algorithm produces pages that can legitimately be considered authoritative with respect to the www as a whole, despite the fact that it operates without direct access to a large-scale index of the www. Rather, its only "global" access to the www is through a text-based search engine such as AltaVista, from which it is very difficult to directly obtain reasonable candidates for authoritative pages on most queries. What the results imply is that it is possible to reliably estimate certain types of global information about the www using only a standard search engine interface; a global analysis of the full www link structure can be replaced by a much more local method of analysis on a small focused subgraph.

4 Similar-Page Queries

The algorithm developed in the preceding section can be applied to another type of problem: that of using link structure to infer a notion of "similarity" among pages. Suppose we have found a page p that is of interest, perhaps an authoritative page on a topic of interest, and we wish to ask the following type of question: What do users of the www consider to be related to p, when they create pages and hyperlinks?

If p is a highly referenced page, we have a version of the Abundance Problem: the surrounding link structure will implicitly represent an enormous number of independent opinions about the relation of p to other pages. Using our notion of hubs and authorities, we can provide an approach to the issue of page similarity, asking: In the local region of the link structure near p, what are the strongest authorities? Such authorities can potentially serve as a broad-topic summary of the pages related to p.

In fact, the method of Sections 2 and 3 can be adapted to this situation with essentially no modification. Previously, we initiated our search with a query string Οƒ; our request from the underlying search engine was "Find t pages containing the string Οƒ." We now begin with a page p and pose the following request to the search engine: "Find t pages pointing to p." Thus, we assemble a root set R_p consisting of t pages that point to p; we grow this into a base set S_p as before; and the result is a subgraph G_p in which we can search for hubs and authorities.
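In code, the only change from the broad-topic construction of Section 2 is the source of the root set. A sketch, reusing expand_root_set from the earlier illustration and assuming a hypothetical backlink_search(p, t) helper that asks the search engine for up to t pages pointing to p:

    def similar_page_base_set(p, backlink_search, out_links, in_links, t=200, d=50):
        # Root set R_p: pages that point to p, rather than pages containing a query string.
        root = set(backlink_search(p, t))
        return expand_root_set(root, out_links, in_links, d)   # grow R_p into S_p as before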

Superficially, the set of issues in working with a subgraph G_p are somewhat different from those involved in working with a subgraph defined by a query string; however, we find that most of the basic conclusions we drew in the previous two sections continue to apply. First, we observe that ranking pages of G_p by their in-degrees is still not satisfactory; consider for example the results of this heuristic when the initial page p was www.honda.com, the home page of Honda Motor Company.

http://www.honda.com  Honda
http://www.ford.com/  Ford Motor Company
http://www.eff.org/blueribbon.html  The Blue Ribbon Campaign for Online Free Speech
http://www.mckinley.com/  Welcome to Magellan!
http://www.netscape.com  Welcome to Netscape
http://www.linkexchange.com/  LinkExchange | Welcome
http://www.toyota.com/  Welcome to @Toyota
http://www.pointcom.com/  PointCom
http://home.netscape.com/  Welcome to Netscape
http://www.yahoo.com  Yahoo!

In many cases, the top hubs and authorities computed by our algorithm on a graph of the form G_p can be quite compelling. We show the top authorities obtained when the initial page p was www.honda.com and www.nyse.com, the home page of the New York Stock Exchange.

(www.honda.com) Authorities
.202  http://www.toyota.com/  Welcome to @Toyota
.199  http://www.honda.com/  Honda
.192  http://www.ford.com/  Ford Motor Company
.173  http://www.bmwusa.com/  BMW of North America, Inc.
.162  http://www.volvocars.com/  VOLVO
.158  http://www.saturncars.com/  Welcome to the Saturn Web Site
.155  http://www.nissanmotors.com/  NISSAN - ENJOY THE RIDE
.145  http://www.audi.com/  Audi Homepage
.139  http://www.4adodge.com/  1997 Dodge Site
.136  http://www.chryslercars.com/  Welcome to Chrysler

(www.nyse.com) Authorities
.208  http://www.amex.com/  The American Stock Exchange - The Smarter Place to Be
.146  http://www.nyse.com/  New York Stock Exchange Home Page
.134  http://www.liffe.com/  Welcome to LIFFE
.129  http://www.cme.com/  Futures and Options at the Chicago Mercantile Exchange
.120  http://update.wsj.com/  The Wall Street Journal Interactive Edition
.118  http://www.nasdaq.com/  The Nasdaq Stock Market Home Page - Reload Often
.117  http://www.cboe.com/  CBOE - The Chicago Board Options Exchange
.116  http://www.quote.com/  1- Quote.com - Stock Quotes, Business News, Financial Market
.113  http://networth.galt.com/  NETworth
.109  http://www.lombard.com/  Lombard Home Page

Note the difficulties inherent in compiling such lists through text-based methods: many of the above pages consist almost entirely of images, with very little text; and the text that they do contain has very little overlap. Our approach, on the other hand, is determining, via the presence of links, what the creators of www pages tend to "classify" together with the given pages www.honda.com and www.nyse.com.

5 Connections with Related Work

The analysis of link structures with the goal of understanding their social or informational organization has been an issue in a number of overlapping areas. In this section, we review some of the approaches that have been proposed, divided into three main areas of focus. First, and most closely related to our work here, we discuss research on the use of a link structure for defining notions of standing, impact, and influence, measures with the same motivation as our notion of authority. We then discuss other ways in which links have been integrated into hypertext and www search techniques. Finally, we review some work that has made use of link structures for explicit clustering of data.

Standing, Impact, and Influence

Social Networks. The study of social networks has developed several ways to measure the relative standing (roughly, "importance") of individuals in an implicitly defined network. We can represent the network, as above, by a graph G = (V, E); an edge (i, j) corresponds roughly to an "endorsement" of j by i. This is in keeping with the intuition we have already invoked regarding the role of www hyperlinks as conferrors of authority. Links may have different (non-negative) weights, corresponding to the strength of different endorsements; let A denote the matrix whose (i, j)th entry represents the strength of the endorsement from a node i ∈ V to a node j ∈ V.

Katz [35] proposed a measure of standing based on path-counting, a generalization of ranking based on in-degree. For nodes i and j, let P^r_{ij} denote the number of paths of length exactly r from i to j. Let b < 1 be a constant chosen to be small enough that Q_{ij} = Ξ£_{r=1}^∞ b^r P^r_{ij} converges for each pair (i, j). Now Katz defines s_j, the standing of node j, to be Ξ£_i Q_{ij}; in this model, standing is based on the total number of paths terminating at node j, weighted by an exponentially decreasing damping factor. It is not difficult to obtain a direct matrix formulation of this measure: s_j is proportional to the jth column sum of the matrix (I - bA)^{-1} - I, where I denotes the identity matrix and all entries of A are 0 or 1.

Hubbell [32] proposed a similar model of standing by studying the equilibrium of a certain weight-propagation scheme on the nodes of the network. Recall that A_{ij}, the (i, j)th entry of our matrix A, represents the strength of the endorsement from i to j. Let e_j denote an a priori estimate of the standing of node j. Then Hubbell defines the standings {s_j} to be a set of values so that the process of endorsement maintains a type of equilibrium: the total "quantity" of endorsement entering a node j, weighted by the standings of the endorsers, is equal to the standing of j. Thus, the standings are the solutions to the system of equations s_j = e_j + Ξ£_i A_{ij} s_i, for j = 1, ..., n. If e denotes the vector of values {e_j}, then the vector of standings in this model can be shown to be (I - A^T)^{-1} e.
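Both measures have the direct matrix formulations noted above; a small numpy sketch (ours, not from [35] or [32]), taking A as the 0/1 adjacency matrix for Katz and as the weighted endorsement matrix for Hubbell:

    import numpy as np

    def katz_standing(A, b):
        # s_j is proportional to the j-th column sum of (I - bA)^{-1} - I;
        # b must be small enough that the path-counting series converges.
        n = A.shape[0]
        Q = np.linalg.inv(np.eye(n) - b * A) - np.eye(n)
        return Q.sum(axis=0)

    def hubbell_standing(A, e):
        # Standings solve s_j = e_j + sum_i A_ij s_i, i.e. s = (I - A^T)^{-1} e.
        n = A.shape[0]
        return np.linalg.solve(np.eye(n) - A.T, e)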

Before discussing the relation of these measures to our work, we consider the way in which they were extended by research in the field of bibliometrics.

Scientific Citations. Bibliometrics [22] is the study of written documents and their citation structure. Research in bibliometrics has long been concerned with the use of citations to produce quantitative estimates of the importance and "impact" of individual scientific papers and journals, analogues of our notion of authority. In this sense, they are concerned with evaluating standing in a particular type of social network, that of papers or journals linked by citations.

The most well-known measure in this field is Garfield's impact factor [26], used to provide a numerical assessment of journals in Journal Citation Reports of the Institute for Scientific Information. Under the standard definition, the impact factor of a journal j in a given year is the average number of citations received by papers published in the previous two years of journal j [22]. Disregarding for now the question of whether two years is the appropriate period of measurement (see e.g. Egghe [21]), we observe that the impact factor is a ranking measure based fundamentally on a pure counting of the in-degrees of nodes in the network.

Pinski and Narin [45] proposed a more subtle citation-based measure of standing, stemming from the observation that not all citations are equally important. They argued that a journal is "influential" if, recursively, it is heavily cited by other influential journals. One can recognize a natural parallel between this and our self-referential construction of hubs and authorities; we will discuss the connections below. The concrete construction of Pinski and Narin, as modified by Geller [27], is the following. The measure of standing of journal j will be called its influence weight and denoted w_j. The matrix A of connection strengths will have entries specified as follows: A_{ij} denotes the fraction of the citations from journal i that go to journal j. Following the informal definition above, the influence of j should be equal to the sum of the influences of all journals citing j, with the sum weighted by the amount that each cites j. Thus, the set of influence weights {w_j} is designed to be a non-zero, non-negative solution to the system of equations w_j = Ξ£_i A_{ij} w_i; and hence, if w is the vector of influence weights, one has w β‰₯ 0, w β‰  0, and A^T w = w. This implies that w is a principal eigenvector of A^T. Geller [27] observed that the influence weights correspond to the stationary distribution of the following random process: beginning with an arbitrary journal j, one chooses a random reference that has appeared in j and moves to the journal specified in the reference.

Doreian [19, 20] showed that one can obtain a measure of standing that corresponds very closely to the influence weights by repeatedly iterating the computation underlying Hubbell's measure of standing: In the first iteration one computes Hubbell standings {s_j} from the a priori weights {e_j}; the {s_j} then become the a priori estimates for the next iteration. Finally, there has been work aimed at the troublesome issue of how to handle journal self-citations (the diagonal elements of the matrix A); see e.g. de Solla Price [15] and Noma [42].

Let us consider the connections between this previous work and our algorithm to compute hubs and authorities. We also begin by observing that pure in-degree counting, as manifested by the impact factor, is too crude a measure for our purposes, and we seek a type of link-based equilibrium among relative node rankings. But the World Wide Web and the scientific literature are governed by very different principles, and this contrast is nicely captured in the distinction between the Pinski-Narin influence weights and the hub/authority weights that we compute. Journals in the scientific literature have, to a first approximation, a common purpose, and traditions such as the peer review process typically ensure that highly authoritative journals on a common topic reference one another extensively. Thus it makes sense to consider a one-level model in which authorities directly endorse other authorities. The www, on the other hand, is much more heterogeneous, with www pages serving many different functions: individual AOL subscribers have home pages, and multinational corporations have home pages. Moreover, for a wide range of topics, the strongest authorities consciously do not link to one another; consider, for example, the home pages of search engines and automobile manufacturers listed above. Thus, they can only be connected by an intermediate layer of relatively anonymous hub pages, which link in a correlated way to a thematically related set of authorities; and our model for the conferral of authority on the www takes this into account. This two-level pattern of linkage exposes structure among both the set of hubs, who may not know of one another's existence, and the set of authorities, who may not wish to acknowledge one another's existence.
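For comparison with the hub/authority computation, here is a sketch (ours, not from [45] or [27]) of the Pinski-Narin influence weights defined earlier in this subsection, computed by power iteration on A^T; it assumes every journal has at least one outgoing citation and that the iteration converges for the citation matrix at hand.

    import numpy as np

    def influence_weights(C, iters=200):
        # C[i, j] = number of citations from journal i to journal j.
        A = C / C.sum(axis=1, keepdims=True)   # A[i, j]: fraction of i's citations going to j
        w = np.ones(A.shape[0])
        w = w / w.sum()
        for _ in range(iters):
            w = A.T @ w                        # redistribute standing along citations
            w = w / w.sum()
        return w                               # at the fixed point, A^T w = w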

Hypertext and WWW Rankings. There have been several approaches to ranking pages in the context of hypertext and the www. In work predating the emergence of the www, Botafogo, Rivlin, and Shneiderman [7] worked with focused, stand-alone hypertext environments. They defined the notions of index nodes and reference nodes: an index node is one whose out-degree is significantly larger than the average out-degree, and a reference node is one whose in-degree is significantly larger than the average in-degree. They also proposed measures of centrality based on node-to-node distances in the graph defined by the link structure.

Carriere and Kazman [9] proposed a ranking measure on www pages, for the goal of re-ordering search results. The rank of a page in their model is equal to the sum of its in-degree and its out-degree; thus, it makes use of a "directionless" version of the www link structure.

Both of these approaches are based principally on counting node degrees, parallel to the structure of Garfield's impact factor. In contrast, Brin and Page [8] have recently proposed a ranking measure based on a node-to-node weight-propagation scheme and its analysis via eigenvectors. Specifically, they begin from a model of a user randomly following hyperlinks: at each page, the user either selects an outgoing link uniformly at random, or (with some probability p < 1) jumps to a new page selected uniformly at random from the entire www. The stationary probability of node i in this random process will correspond to the "rank" of i, referred to as its page-rank.

Alternately, one can view page-ranks as arising from the equilibrium of a process analogous to that used in the definition of the Pinski-Narin influence weights, with the incorporation of a term that captures the "random jump" to a uniformly selected page. Specifically, assuming the www contains n pages, letting A denote the n Γ— n adjacency matrix of the www, and letting d_i denote the out-degree of node i, the probability of a transition from page i to page j in the Brin-Page model is seen to be equal to A'_{ij} = p n^{-1} + (1 - p) d_i^{-1} A_{ij}. Let A' denote the matrix whose entries are A'_{ij}. The vector of ranks is then a non-zero, non-negative solution to (A')^T r = r, and hence it corresponds to the principal eigenvector of (A')^T.

One of the main contrasts between our approach and the page-rank methodology is that the latter, like Pinski and Narin's formulation of influence weights, is based on a model in which authority is passed directly from authorities to other authorities, without interposing a notion of hub pages. Brin and Page's use of random jumps to uniformly selected pages is a way of dealing with the resulting problem that many authorities are essentially "dead-ends" in their conferral process.
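The transition matrix just described is easy to write down explicitly; the following is our sketch of the model as summarized above, not Brin and Page's implementation. Pages with no outgoing links are handled crudely (their rows are left deficient and the rank vector is renormalized each step), which is an assumption beyond the text.

    import numpy as np

    def page_rank(A, p=0.15, iters=100):
        n = A.shape[0]
        d = A.sum(axis=1, keepdims=True)       # out-degrees d_i
        d[d == 0] = 1                          # avoid division by zero for dangling pages
        Aprime = p / n + (1 - p) * A / d       # A'[i, j] = p/n + (1 - p) A[i, j] / d_i
        r = np.ones(n) / n
        for _ in range(iters):
            r = Aprime.T @ r                   # at the fixed point, (A')^T r = r
            r = r / r.sum()
        return r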

It is also worth noting a basic contrast in the application of these approaches to www search. In [8], the page-rank algorithm is applied to compute ranks for all the nodes in a 24 million page index of the www; these ranks are then used to order the results of subsequent text-based searches. Our use of hubs and authorities, on the other hand, proceeds without direct access to a www index; in response to a query, our algorithm first invokes a text-based search and then computes numerical scores for the pages in a relatively small subgraph constructed from the initial search results.

Other Link-Based Approaches to WWW Search

Frisse [25] considered the problem of document retrieval in singly-authored, stand-alone works of hypertext. He proposed basic heuristics by which hyperlinks can enhance notions of relevance and hence the performance of retrieval heuristics. Specifically, in his framework, the relevance of a page in hypertext to a particular query is based in part on the relevance of the pages it links to. Marchiori's HyperSearch algorithm [39] is based on such a methodology applied to www pages: A relevance score for a page p is computed by a method that incorporates the relevance of pages reachable from p, diminished by a damping factor that decays exponentially with distance from p.

In our construction of focused subgraphs from search engine results in Section 2, the underlying motivation ran also in the opposite direction. In addition to looking at where a page p pointed to increase our understanding of its contents, we implicitly used the text on pages that pointed to p. (For if pages in the root set for "search engines" pointed to www.yahoo.com, then we included www.yahoo.com in our subgraph.) This notion is related to that of searching based on anchor text, in which one treats the text surrounding a hyperlink as a descriptor of the page being pointed to when assessing the relevance of that page. The use of anchor text appeared in one of the oldest www search engines, McBryan's World Wide Web Worm [40]; it is also used in [8, 11, 10].

Another direction of work on the integration of links into www search is the construction of search formalisms capable of handling queries that involve predicates over both text and links. Arocena, Mendelzon, and Mihaila [1] have developed a framework supporting www queries that combines standard keywords with conditions on the surrounding link structure.

Clustering of Link Structures

Link-based clustering in the context of bibliometrics, hypertext, and the www has focused largely on the problem of decomposing an explicitly represented collection of nodes into "cohesive" subsets. As such, it has mainly been applied to moderately sized sets of objects, for example, a focused collection of scientific journals, or the set of pages on a single www site. Earlier we indicated a sense in which the issues we study here are fundamentally different from those encountered in this type of clustering: Our primary concern is that of representing an enormous collection of pages implicitly, through the construction of hubs and authorities for this collection. We now discuss some of the prior work on citation-based and hypertext clustering so as to better elucidate its connections to the techniques we develop here.

In particular, this will also be useful in Section 6 when we discuss methods for computing multiple sets of hubs and authorities within a single link structure; this can be viewed as a way of representing multiple, potentially very large clusters implicitly.

At a very high level, clustering requires an underlying similarity function among objects, and a method for producing clusters from this similarity function. Two basic similarity functions on documents to emerge from the study of bibliometrics are bibliographic coupling (due to Kessler [36]) and co-citation (due to Small [52]). For a pair of documents p and q, the former quantity is equal to the number of documents cited by both p and q, and the latter quantity is the number of documents that cite both p and q. Co-citation has been used as a measure of the similarity of www pages by Larson [37] and by Pitkow and Pirolli [47]. Weiss et al. [56] define link-based similarity measures for pages in a hypertext environment that generalize co-citation and bibliographic coupling to allow for arbitrarily long chains of links.

Several methods have been proposed in this context to produce clusters from a set of nodes annotated with such similarity information. Small and Griffith [54] use breadth-first search to compute the connected components of the undirected graph in which two nodes are joined by an edge if and only if they have a positive co-citation value. Pitkow and Pirolli [47] apply this algorithm to study the link-based relationships among a collection of www pages.

One can also use principal components analysis [31, 34] and related dimension-reduction techniques such as multidimensional scaling to cluster a collection of nodes. In this framework, one begins with a matrix M containing the similarity information between pairs of nodes, and a representation (based on this matrix) of each node i as a high-dimensional vector {v_i}. One then uses the first few non-principal eigenvectors of the similarity matrix M to define a low-dimensional subspace into which the vectors {v_i} can be projected; a variety of geometric or visualization-based techniques can be employed to identify dense clusters in this low-dimensional space. Standard theorems of linear algebra (e.g. [30]) in fact provide a precise sense in which projection onto the first k eigenvectors produces the minimum distortion over all k-dimensional projections of the data. Small [53], McCain [41], and others have applied this technique to journal and author co-citation data. The application of dimension-reduction techniques to cluster www pages based on co-citation has been employed by Larson [37] and by Pitkow and Pirolli [47].

The clustering of documents or hyperlinked pages can of course rely on combinations of textual and link-based information. Combinations of such measures have been studied by Shaw [50, 51] in the context of bibliometrics. More recently, Pirolli, Pitkow, and Rao [46] have used a combination of link topology and textual similarity to group together and categorize pages on the www.
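In matrix terms the two similarity functions defined above are immediate to compute. A sketch (ours), with A[p, q] = 1 when document p cites document q; note that these are the same matrices A A^T and A^T A whose principal eigenvectors gave the hub and authority weights of Section 3.

    import numpy as np

    def bibliographic_coupling(A):
        # (A A^T)[p, q] = number of documents cited by both p and q (Kessler).
        return A @ A.T

    def co_citation(A):
        # (A^T A)[p, q] = number of documents that cite both p and q (Small).
        return A.T @ A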

Finally, we discuss two other general eigenvector-based approaches to clustering that have been applied to link structures. The area of spectral graph partitioning was initiated by the work of Donath and Hoffman [18] and Fiedler [23]; see the recent book by Chung [12] for an overview. Spectral graph partitioning methods relate sparsely connected partitions of an undirected graph G to the eigenvalues and eigenvectors of its adjacency matrix A. Each eigenvector of A has a single coordinate for each node of G, and thus can be viewed as an assignment of weights to the nodes of G. Each non-principal eigenvector has both positive and negative coordinates; one fundamental heuristic to emerge from the study of these spectral methods is that the nodes corresponding to the large positive coordinates of a given eigenvector tend to be very sparsely connected to the nodes corresponding to the large negative coordinates of the same eigenvector.

In a different direction, centroid scaling is a clustering method designed for representing two types of objects in a common space [38]. Consider, for example, a set of people who have provided answers to the questions of a survey; one may wish to represent both the people and the possible answers in a common space, in a way so that each person is "close" to the answers he or she chose, and each answer is "close" to the people that chose it. Centroid scaling provides an eigenvector-based method for accomplishing this. In its formulation, it thus resembles our definitions of hubs and authorities, which used an eigenvector approach to produce related sets of weights for two distinct types of objects. A fundamental difference, however, is that centroid scaling methods are typically not concerned with interpreting only the largest coordinates in the representations they produce; rather, the goal is to infer a notion of similarity among a set of objects by geometric means. Centroid scaling has been applied to citation data by Noma [43], for jointly clustering citing and cited documents. In the context of information retrieval, the Latent Semantic Indexing methodology of Deerwester et al. [16] applied a centroid scaling approach to a vector-space model of documents [48, 49]; this allowed them to represent terms and documents in a common low-dimensional space, in which natural geometrically defined clusters often separate multiple senses of a query term.

6 Multiple Sets of Hubs and Authorities

The algorithm in Section 3 is, in a sense, finding the most densely linked collection of hubs and authorities in the subgraph G_Οƒ defined by a query string Οƒ. There are a number of settings, however, in which one may be interested in finding several densely linked collections of hubs and authorities among the same set S_Οƒ of pages. Each such collection could potentially be relevant to the query topic, but they could be well-separated from one another in the graph G_Οƒ for a variety of reasons. For example,

(1) The query string Οƒ may have several very different meanings. E.g. "jaguar" (a useful example we learned from Chandra Chekuri [13]).

(2) The string may arise as a term in the context of multiple technical communities. E.g. "randomized algorithms".

(3) The string may refer to a highly polarized issue, involving groups that are not likely to link to one another. E.g. "abortion".

In each of these examples, the relevant documents can be naturally grouped into several clusters. The issue in the setting of broad-topic queries, however, is not simply how to achieve a dissection into reasonable clusters; one must also deal with this in the presence of the Abundance Problem. Each cluster, in the context of the full www, is enormous, and so we require a way to distill a small set of hubs and authorities out of each one. We can thus view such collections of hubs and authorities as implicitly providing broad-topic summaries of a collection of large clusters that we never explicitly represent. At a very high level, our motivation in this sense is analogous to that of an information retrieval technique such as Scatter/Gather [14], which seeks to represent very large document clusters through text-based methods.

In Section 3, we related the hubs and authorities we computed to the principal eigenvectors of the matrices A^T A and A A^T, where A is the adjacency matrix of G_Οƒ. The non-principal eigenvectors of A^T A and A A^T provide us with a natural way to extract additional densely linked collections of hubs and authorities from the base set S_Οƒ. We begin by noting the following basic fact.

Proposition 6.1 A^T A and A A^T have the same multiset of eigenvalues, and their eigenvectors can be chosen so that Ο‰_i(A A^T) = A Ο‰_i(A^T A).

Thus, each pair of eigenvectors x*_i = Ο‰_i(A^T A), y*_i = Ο‰_i(A A^T), related as in Proposition 6.1, has the following property: applying an I operation to (x*_i, y*_i) keeps the x-weights parallel to x*_i, and applying an O operation to (x*_i, y*_i) keeps the y-weights parallel to y*_i. Hence, each pair of weights (x*_i, y*_i) has precisely the mutually reinforcing relationship that we are seeking in authority/hub pairs. Moreover, applying IO (resp. OI) multiplies the magnitude of x*_i (resp. y*_i) by a factor of |Ξ»_i|; thus |Ξ»_i| gives precisely the extent to which the hub weights y*_i and authority weights x*_i reinforce one another.

Now, unlike the principal eigenvector, the non-principal eigenvectors have both positive and negative entries. Hence each pair (x*_i, y*_i) provides us with two densely connected sets of hubs and authorities: those pages that correspond to the c coordinates with the most positive values, and those pages that correspond to the c coordinates with the most negative values. These sets of hubs and authorities have the same intuitive meaning as those produced in Section 3, although the algorithm to find them, based on non-principal eigenvectors, is less clean conceptually than the method of iterated I and O operations. Note also that since |Ξ»_i| gives the extent to which the weights in x*_i and y*_i reinforce each other, the hubs and authorities associated with eigenvectors of larger absolute value will typically be "denser" as subgraphs in the link structure, and hence will often have more intuitive meaning.
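One possible rendering of this in code, under the same adjacency-matrix convention as before (the paper itself does not prescribe an implementation):

    import numpy as np

    def multiple_hub_authority_sets(A, num_vectors=5, c=10):
        # Non-principal eigenvectors of A^T A supply additional authority vectors x*_i;
        # the corresponding hub vectors would be A x*_i, by Proposition 6.1.
        vals, vecs = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
        collections = []
        for i in range(1, num_vectors + 1):
            x_i = vecs[:, -1 - i]              # the i-th non-principal eigenvector
            pos = np.argsort(-x_i)[:c]         # indices of the c most positive coordinates
            neg = np.argsort(x_i)[:c]          # indices of the c most negative coordinates
            collections.append((pos, neg))
        return collections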

In Section 5, we observed that spectral heuristics for partitioning undirected graphs [12, 18, 23] have suggested that nodes assigned large positive coordinates in a non-principal eigenvector are often well-separated from nodes assigned large negative coordinates in the same eigenvector. Adapted to our context, which deals with directed rather than undirected graphs, one can ask whether there is a natural "separation" between the two collections of authoritative sources associated with the same non-principal eigenvector. We will see that in some cases there is a distinction between these two collections, in a sense that has meaning for the query topic. It is worth noting here that the signs of the coordinates in any non-principal eigenvector represent a purely arbitrary resolution of the following symmetry: if x_i* and y_i* are eigenvectors associated with λ_i, then so are -x_i* and -y_i*.

Basic Results. We now give some examples of the way in which the application of non-principal eigenvectors produces multiple collections of hubs and authorities. One interesting phenomenon that arises is the following. The pages with large coordinates in the first few non-principal eigenvectors tend to recur, so that essentially the same collection of hubs and authorities will often be generated by several of the strongest non-principal eigenvectors. (Despite being similar in their large coordinates, these eigenvectors remain orthogonal due to differences in the coordinates of smaller absolute value.) As a result, one obtains fewer distinct collections of hubs and authorities than might otherwise be expected from a set of non-principal eigenvectors. This notion is also reflected in the output below, where we have selected (by hand) several distinct collections from among the first few non-principal eigenvectors.

We issue the first query as "jaguar*", simply as one way to search for either the word or its plural. For this query, the strongest collections of authoritative sources concerned the Atari Jaguar product, the NFL football team from Jacksonville, and the automobile.

(jaguar*) Authorities: principal eigenvector
.370  http://www2.ecst.csuchico.edu/~jschlich/Jaguar/jaguar.html
.347  http://www-und.ida.liu.se/~t94patsa/jserver.html
.292  http://tangram.informatik.uni-kl.de:8001/~rgehm/jaguar.html
.287  http://www.mcc.ac.uk/~dlms/Consoles/jaguar.html  Jaguar Page

(jaguar jaguars) Authorities: 2nd non-principal vector, positive end
.255  http://www.jaguarsnfl.com/  Official Jacksonville Jaguars NFL Website
.137  http://www.nando.net/SportServer/football/nfl/jax.html  Jacksonville Jaguars Home Page
.133  http://www.ao.net/~brett/jaguar/index.html  Brett's Jaguar Page
.110  http://www.usatoday.com/sports/football/sfn/sfn30.htm  Jacksonville Jaguars

(jaguar jaguars) Authorities: 3rd non-principal vector, positive end
.227  http://www.jaguarvehicles.com/  Jaguar Cars Global Home Page
.227  http://www.collection.co.uk/  The Jaguar Collection - Official Web site
.211  http://www.moran.com/sterling/sterling.html
.211  http://www.coys.co.uk/

For the query "randomized algorithms", none of the strongest collections of hubs and authorities could be said to be precisely on the query topic, though they all consisted of thematically related pages on a closely related topic. They included home pages of theoretical computer scientists, compendia of mathematical software, and pages on wavelets.

("randomized algorithms") Authorities: 1st non-principal vector, positive end
.125  http://theory.lcs.mit.edu/~goemans/  Michel X. Goemans
.122  http://theory.lcs.mit.edu/~spielman/  Dan Spielman's Homepage
.122  http://www.nada.kth.se/~johanh/  Johan Hastad
.122  http://theory.lcs.mit.edu/~rivest/  Ronald L. Rivest : HomePage

("randomized algorithms") Authorities: 1st non-principal vector, negative end
-.00116  http://lib.stat.cmu.edu/  StatLib Index
-.00115  http://www.geo.fmi.fi/prog/tela.html  Tela
-.00107  http://gams.nist.gov/  GAMS : Guide to Available Mathematical Software
-.00107  http://www.netlib.org  Netlib

("randomized algorithms") Authorities: 4th non-principal vector, negative end
-.176  http://www.amara.com/current/wavelet.html  Amara's Wavelet Page
-.172  http://www-ocean.tamu.edu/~baum/wavelets.html  Wavelet sources
-.161  http://www.mathsoft.com/wavelets.html  Wavelet Resources
-.143  http://www.mat.sbg.ac.at/~uhl/wav.html  Wavelets

We also encounter examples where pages from the positive and negative ends of the same non-principal eigenvector exhibit a natural separation. One case in which the meaning of this separation is particularly striking is for the query "abortion". The natural question is whether one of the non-principal eigenvectors produces a division between pro-choice and pro-life authorities. The issue is complicated by the existence of hub pages that link extensively to pages from both sides; but in fact the 2nd non-principal eigenvector produces a very clear separation:

(abortion) Authorities: 2nd non-principal vector, positive end
.321  http://www.caral.org/abortion.html  Abortion and Reproductive Rights Internet Resources
.219  http://www.plannedparenthood.org/  Welcome to Planned Parenthood
.195  http://www.gynpages.com/  Abortion Clinics OnLine
.172  http://www.oneworld.org/ippf/  IPPF Home Page
.162  http://www.prochoice.org/naf/  The National Abortion Federation
.161  http://www.lm.com/~lmann/feminist/abortion.html

(abortion) Authorities: 2nd non-principal vector, negative end
-.197  http://www.awinc.com/partners/bc/commpass/lifenet/lifenet.htm  LifeWEB
-.169  http://www.worldvillage.com/wv/square/chapel/xwalk/html/peter.htm  Healing after Abortion
-.164  http://www.nebula.net/~maeve/lifelink.html
-.150  http://members.aol.com/pladvocate/  Pro-Life Advocate
-.144  http://www.clark.net/pub/jeffd/factbot.html  The Right Side of the Web
-.144  http://www.catholic.net/HyperNews/get/abortion.html

7 Diffusion and Generalization

Let us return to the method of Section 3, in which we identified a single collection of hubs and authorities in the subgraph G_σ associated with a query string σ. The algorithm computes a densely linked collection of pages without regard to their contents; the fact that these pages are relevant to the query topic in a wide range of cases is based on the way in which we construct the subgraph G_σ, ensuring that it is rich in relevant pages. We can view the issue as follows: Many different topics are represented in G_σ, and each is centered around a competing collection of densely linked hubs and authorities. Our method of producing a focused subgraph G_σ aims at ensuring that the most relevant such collection is also the "densest" one, and hence will be found by the method of iterated I and O operations.

When the initial query string σ specifies a topic that is not sufficiently broad, however, there will often not be enough relevant pages in G_σ from which to extract a sufficiently dense subgraph of relevant hubs and authorities. As a result, authoritative pages corresponding to competing, "broader" topics will win out over the pages relevant to σ, and be returned by the algorithm. In such cases, we will say that the process has diffused from the initial query.

Although it limits the ability of our algorithm to find authoritative pages for narrow or specific query topics, diffusion can be an interesting process in its own right. In particular, the broader topic that supplants the original, too-specific query σ very often represents a natural generalization of σ. As such, it provides a simple way of abstracting a specific query topic to a broader, related one.

Consider, for example, the query "WWW conferences". At the time we tried this query, AltaVista indexed roughly 300 pages containing the string; however, the resulting subgraph G_σ contained pages concerned with a host of more general www-related topics, and the main authorities were in fact very general www resources.

("WWW conferences") Authorities: principal eigenvector
.088  http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html  The What's New Archive
.088  http://www.w3.org/hypertext/DataSources/WWW/Servers.html  World-Wide Web Servers: Summary
.087  http://www.w3.org/hypertext/DataSources/bySubject/Overview.html  The World-Wide Web Virtual Library

In the context of similar-page queries, a query that is "too specific" corresponds roughly to a page p that does not have sufficiently high in-degree. In such cases, the process of diffusion can also provide a broad-topic summary of more prominent pages related to p. Consider, for example, the results when p was sigact.acm.org, the home page of the ACM Special Interest Group on Algorithms and Computation Theory, which focuses on theoretical computer science.

(sigact.acm.org) Authorities: principal eigenvector
.197  http://www.siam.org/  Society for Industrial and Applied Mathematics
.166  http://dimacs.rutgers.edu/  Center for Discrete Mathematics and Theoretical Computer Science
.150  http://www.computer.org/  IEEE Computer Society
.148  http://www.yahoo.com/  Yahoo!
.145  http://e-math.ams.org/  e-MATH Home Page
.141  http://www.ieee.org/  IEEE Home Page
.140  http://glimpse.cs.arizona.edu:1994/bib/  Computer Science Bibliography Glimpse Server
.129  http://www.eccc.uni-trier.de/eccc/  ECCC - The Electronic Colloquium on Computational Complexity
.129  http://www.cs.indiana.edu/cstr/search  UCSTRI - Cover Page
.118  http://euclid.math.fsu.edu/Science/math.html  The World-Wide Web Virtual Library: Mathematics

The problem of returning more specific answers in the presence of this phenomenon is the subject of on-going work; in Sections 8 and 9, we briefly discuss current work on the use of textual content for the purpose of focusing our approach to link-based analysis [6, 10, 11]. The use of non-principal eigenvectors, combined with basic term-matching, can be a simple way to extract collections of authoritative pages that are more relevant to a specific query topic. For example, consider the following fact: Among the sets of hubs and authorities corresponding to the first 20 non-principal eigenvectors, the one in which the pages collectively contained the string "WWW conferences" the most was the following (an illustrative sketch of this selection step appears after the list).

("WWW conferences") Authorities: 11th non-principal vector, negative end
-.097  http://www.igd.fhg.de/www95.html  Third International World-Wide Web Conference
-.091  http://www.csu.edu.au/special/conference/WWWWW.html  AUUG'95 and Asia-Pacific WWW'95 Conference
-.090  http://www.ncsa.uiuc.edu/SDG/IT94/IT94Info.html  The Second International WWW Conference '94
-.083  http://www.w3.org/hypertext/Conferences/WWW4/  Fourth International World Wide Web Conference
-.079  http://www.igd.fhg.de/www/www95/papers/  WWW'95: Papers
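The selection step just described is easy to state in code. The following Python fragment is a sketch only, not taken from the paper: it assumes the per-eigenvector collections have already been computed (for instance with the earlier sketch) and that the text of each page has been fetched; the dictionary layout, names, and toy data are invented for illustration.

def most_on_topic(collections, query):
    """Pick the hub/authority collection whose pages collectively contain
    the query string most often (the term-matching heuristic above).

    collections -- dict mapping (eigenvector index, "positive"/"negative")
                   to a list of page texts, assumed already fetched
    query       -- the original query string, e.g. "WWW conferences"
    """
    def total_matches(pages):
        return sum(text.lower().count(query.lower()) for text in pages)
    return max(collections.items(), key=lambda item: total_matches(item[1]))

# Toy illustration (invented data): the (11, "negative") collection is
# selected because its pages mention the query string most often.
toy = {
    (2, "positive"): ["general www resources, no conference mentions"],
    (11, "negative"): ["WWW conferences listing ... WWW conferences '95"],
}
best = most_on_topic(toy, "WWW conferences")
print(best[0])   # -> (11, 'negative')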

8 Evaluation

The evaluation of the methods presented here is a challenging task. First, of course, we are attempting to define and compute a measure, "authority," that is inherently based on human judgment. Moreover, the nature of the www adds complexity to the problem of evaluation: it is a new domain, with a shortage of standard benchmarks; the diversity of authoring styles is much greater than for comparable collections of printed, published documents; and it is highly dynamic, with new material being created rapidly and no comprehensive index of its full contents.

In the earlier sections of the paper, we have presented a number of examples of the output from our algorithm. This was both to show the reader the type of results that are produced, and because we believe that there is, and probably should be, an inevitable component of res ipsa loquitur in the overall evaluation; our feeling is that many of the results are quite striking at an obvious level.

However, there are also more principled ways of evaluating the algorithm. Since the appearance of the conference version of this paper, three distinct user studies performed by two different groups [6, 10, 11] have helped assess the value of our technique in the context of a tool for locating information on the www. Each of these studies used a system built primarily on top of the basic algorithm described here, for locating hubs and authorities in a subgraph G_σ via the methods discussed in Sections 2 and 3. However, each of these systems also employed additional heuristics to further enhance relevance judgments. Most significantly, they incorporated text-based measures such as anchor text scores to weight the contribution of individual links differentially. As such, the results of these studies should not be interpreted as providing a direct evaluation of the pure link-based method described here; rather, they assess its performance as the core component of a www search tool.

We briefly survey the structure and results of the most recent of these three user studies, involving the Clever system of Chakrabarti et al. [10], and refer the reader to that work for more details. The basic task in this study was automatic resource compilation: the construction of lists of high-quality www pages related to a broad search topic. The goal was to see how the output of Clever compared to that of a manually generated compilation such as the www search service Yahoo! [58] for a set of 26 topics.

Thus, for each topic, the output of the Clever system was a list of ten pages: its five top hubs and five top authorities. Yahoo! was used as the main point of comparison, since its manually compiled resource lists can be viewed as representing judgments of "authority" by the human ontologists who compile them. The top ten pages returned by AltaVista were also selected, so as to provide representative pages produced by a fully automatic text-based search engine. All these pages were collected into a single topic list for each topic in the study, without an indication of which method produced which page.

A collection of 37 users was assembled; the users were required to be familiar with the use of a Web browser, but were not experts in computer science or in the 26 search topics. The users were then asked to rank the pages they visited from the topic lists as "bad," "fair," "good," or "fantastic," in terms of their utility in learning about the topic. This yielded 1369 responses in all, which were then used to assess the relative quality of Clever, Yahoo!, and AltaVista on each topic. For approximately 31% of the topics, the evaluations of Yahoo! and Clever were equivalent to within a threshold of statistical significance; for approximately 50% Clever was evaluated higher; and for the remaining 19% Yahoo! was evaluated higher.
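To make the shape of this comparison concrete, here is a small Python sketch, not taken from the study, of how per-topic ratings like these could be tabulated. The rating scale encoding, the input format, and the simple mean-difference threshold are assumptions made for illustration; the actual study applied its own test of statistical significance.

from collections import defaultdict

# Toy encoding of the four-point rating scale used in the study.
SCALE = {"bad": 0, "fair": 1, "good": 2, "fantastic": 3}

def compare_systems(responses, a="Clever", b="Yahoo!", threshold=0.25):
    """Return per-topic outcomes: which of systems a and b scored higher on
    average, treating small differences (below `threshold`) as a tie.
    This is only an illustrative stand-in for a real significance test."""
    sums = defaultdict(lambda: [0.0, 0])          # (topic, system) -> [total, count]
    for _user, topic, system, rating in responses:
        cell = sums[(topic, system)]
        cell[0] += SCALE[rating]
        cell[1] += 1
    outcomes = {}
    for topic in {t for (t, _sys) in sums}:
        mean = lambda sys: sums[(topic, sys)][0] / max(sums[(topic, sys)][1], 1)
        diff = mean(a) - mean(b)
        outcomes[topic] = "tie" if abs(diff) < threshold else (a if diff > 0 else b)
    return outcomes

# Example with a handful of invented responses on two invented topics.
demo = [
    ("u1", "bicycling", "Clever", "good"), ("u1", "bicycling", "Yahoo!", "fair"),
    ("u2", "bicycling", "Clever", "fantastic"), ("u2", "bicycling", "Yahoo!", "good"),
    ("u1", "sushi", "Clever", "fair"), ("u1", "sushi", "Yahoo!", "fair"),
]
print(compare_systems(demo))   # e.g. {'bicycling': 'Clever', 'sushi': 'tie'} (order may vary)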

Of course, it is difficult to draw definitive conclusions from these studies. A service such as Yahoo! is indeed providing, by its very nature, a type of human judgment as to which pages are "good" for a particular topic. But even the nature of the quality judgment is not well-defined, of course. Moreover, many of the entries in Yahoo! are drawn from outside submissions, and hence represent less directly the "authority" judgments of Yahoo!'s staff.

Many of the users in these studies reported that they used the lists as starting points from which to explore, but that they visited many pages not on the original topic lists generated by the various techniques. This is, of course, a natural process in the exploration of a broad topic on the www, and the goal of resource lists appears to be generally for the purpose of facilitating this process rather than for replacing it.

9 Conclusion

We have discussed a technique for locating high-quality information related to a broad search topic on the www, based on a structural analysis of the link topology surrounding "authoritative" pages on the topic. It is useful to highlight four basic components of our approach.

• For broad topics on the www, the amount of relevant information is growing extremely rapidly, making it continually more difficult for individual users to filter the available resources. To deal with this problem, one needs notions beyond those of relevance and clustering; one needs a way to distill a broad topic, for which there may be millions of relevant pages, down to a representation of very small size. It is for this purpose that we define a notion of "authoritative" sources, based on the link structure of the www.

• We are interested in producing results that are of as high a quality as possible in the context of what is available on the www globally. Our underlying domain is not restricted to a focused set of pages, or those residing on a single Web site.

• At the same time, we infer global notions of structure without directly maintaining an index of the www or its link structure. We require only a basic interface to any of a number of standard www search engines, and use techniques for producing "enriched" samples of www pages to determine notions of structure and quality that make sense globally. This helps to deal with problems of scale in handling topics that have an enormous representation on the www.

• We began with the goal of discovering authoritative pages, but our approach in fact identifies a more complex pattern of social organization on the www, in which hub pages link densely to a set of thematically related authorities. This equilibrium between hubs and authorities is a phenomenon that recurs in the context of a wide variety of topics on the www. Measures of impact and influence in bibliometrics have typically lacked, and arguably not required, an analogous formulation of the role that hubs play; the www is very different from the scientific literature, and our framework seems appropriate as a model of the way in which authority is conferred in an environment such as the Web.

This work has been extended in a number of ways since its initial conference appearance. In Section 8 we mentioned systems for compiling high-quality www resource lists that have been built using extensions to the algorithms developed here; see Bharat and Henzinger [6] and Chakrabarti et al. [10, 11].

The implementation of the Bharat-Henzinger system made use of the recently developed Connectivity Server (Bharat et al. [5]), which provides very efficient retrieval of the linkage information contained in the AltaVista index. With Gibson and Raghavan, we have used the algorithms described here to explore the structure of "communities" of hubs and authorities on the www [28]. We find that the notion of topic generalization discussed in Section 7 provides one valuable perspective from which to view the overlapping organization of such communities. In a separate direction, also with Gibson and Raghavan, we have investigated extensions of the present work to the analysis of relational data, and considered a natural, non-linear analogue of spectral heuristics in this setting [29].

There are a number of interesting further directions suggested by this research, in addition to the currently on-going work mentioned above. We will restrict ourselves here to three such directions.

First, we have used structural information about the graph defined by the links of the www, but we have not made use of its patterns of traffic, and the paths that users implicitly traverse in this graph as they visit a sequence of pages. There are a number of interesting and fundamental questions that can be asked about www traffic, involving both the modeling of such traffic and the development of algorithms and tools to exploit information gained from traffic patterns (see e.g. [2, 3, 33]). It would be interesting to ask how the approach developed here might be integrated into a study of user traffic patterns on the www.

Second, the power of eigenvector-based heuristics is not something that is fully understood at an analytical level, and it would be interesting to pursue this question in the context of the algorithms presented here. One direction would be to consider random graph models that contain enough structure to capture certain global properties of the www, and yet are simple enough so that the application of our algorithms to them could be analyzed. More generally, the development of clean yet reasonably accurate random graph models for the www could be extremely valuable for the understanding of a range of link-based algorithms. Some work of this type has been undertaken in the context of the latent semantic indexing technique in information retrieval [16]: Papadimitriou et al. [44] have provided a theoretical analysis of latent semantic indexing applied to a basic probabilistic model of term use in documents. In another direction, motivated in part by our work here, Frieze, Kannan, and Vempala have analyzed sampling methodologies capable of approximating the singular value decomposition of a large matrix very efficiently [24]; understanding the concrete connections between their work and our sampling methodology in Section 2 would be very interesting.

Finally, the further development of link-based methods to handle information needs other than broad-topic queries on the www poses many interesting challenges. As noted above, work has been done on the incorporation of textual content into our framework as a way of "focusing" a broad-topic search [6, 10, 11], but one can ask what other basic informational structures one can identify, beyond hubs and authorities, from the link topology of hypermedia such as the www.

The means by which interaction with a link structure can facilitate the discovery of information is a general and far-reaching notion, and we feel that it will continue to offer a range of fascinating algorithmic possibilities.

Acknowledgements

In the early stages of this work, I benefited enormously from discussions with Prabhakar Raghavan and with Robert Kleinberg; I thank Soumen Chakrabarti, Byron Dom, David Gibson, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins for on-going collaboration on extensions and evaluations of this work; and I thank Rakesh Agrawal, Tryg Ager, Rob Barrett, Marshall Bern, Tim Berners-Lee, Ashok Chandra, Monika Henzinger, Alan Hoffman, David Karger, Lillian Lee, Nimrod Megiddo, Christos Papadimitriou, Peter Pirolli, Ted Selker, Eli Upfal, and the anonymous referees of this paper, for their valuable comments and suggestions.

References

[1] G.O. Arocena, A.O. Mendelzon, G.A. Mihaila, "Applications of a Web query language," Proc. 6th International World Wide Web Conference, 1997.

[2] R. Barrett, P. Maglio, D. Kellem, "How to personalize the Web," Proc. Conf. on Human Factors in Computing Systems, 1997.

[3] O. Berman, M.J. Hodgson, D. Krass, "Flow-interception problems," in Facility Location: A Survey of Applications and Methods, Z. Drezner, Ed., Springer, 1995.

[4] T. Berners-Lee, R. Cailliau, A. Luotonen, H.F. Nielsen, A. Secret, "The World-Wide Web," Communications of the ACM, 37(1994), pp. 76-82.

[5] K. Bharat, A. Broder, M.R. Henzinger, P. Kumar, S. Venkatasubramanian, "Connectivity Server: Fast Access to Linkage Information on the Web," Proc. 7th Intl. World Wide Web Conf., 1998.

[6] K. Bharat, M.R. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment," Proc. ACM Conf. Res. and Development in Information Retrieval, 1998.

[7] R. Botafogo, E. Rivlin, B. Shneiderman, "Structural analysis of hypertext: Identifying hierarchies and useful metrics," ACM Trans. Inf. Sys., 10(1992), pp. 142-180.

[8] S. Brin, L. Page, "Anatomy of a Large-Scale Hypertextual Web Search Engine," Proc. 7th International World Wide Web Conference, 1998.

[9] J. Carrière, R. Kazman, "WebQuery: Searching and visualizing the Web through connectivity," Proc. 6th International World Wide Web Conference, 1997.

[10] S. Chakrabarti, B. Dom, D. Gibson, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, "Experiments in Topic Distillation," ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.

[11] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, S. Rajagopalan, "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text," Proc. 7th International World Wide Web Conference, 1998.

[12] F.R.K. Chung, Spectral Graph Theory, AMS Press, 1997.

[13] C. Chekuri, M. Goldwasser, P. Raghavan, E. Upfal, "Web search using automated classification," poster at 6th International World Wide Web Conference, 1997.

[14] D.R. Cutting, J. Pedersen, D.R. Karger, J.W. Tukey, "Scatter/gather: A cluster-based approach to browsing large document collections," Proc. ACM Conf. Res. and Development in Information Retrieval, 1992.

[15] D. de Solla Price, "The analysis of square matrices of scientometric transactions," Scientometrics, 3(1981), pp. 55-63.

[16] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, R. Harshman, "Indexing by latent semantic analysis," J. American Soc. Info. Sci., 41(1990), pp. 391-407.

[17] Digital Equipment Corporation, AltaVista search engine, http://altavista.digital.com/.

[18] W.E. Donath, A.J. Hoffman, "Lower bounds for the partitioning of graphs," IBM Journal of Research and Development, 17(1973).

[19] P. Doreian, "Measuring the relative standing of disciplinary journals," Inf. Proc. and Management, 24(1988), pp. 45-56.

[20] P. Doreian, "A measure of standing for citation networks within a wider environment," Inf. Proc. and Management, 30(1994), pp. 21-31.

[21] L. Egghe, "Mathematical relations between impact factors and average number of citations," Inf. Proc. and Management, 24(1988), pp. 567-576.

[22] L. Egghe, R. Rousseau, Introduction to Informetrics, Elsevier, 1990.

[23] M. Fiedler, "Algebraic connectivity of graphs," Czech. Math. J., 23(1973), pp. 298-305.

[24] A. Frieze, R. Kannan, S. Vempala, "Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations," Proc. 39th IEEE Symp. on Foundations of Computer Science, 1998.

[25] M.E. Frisse, "Searching for information in a hypertext medical handbook," Communications of the ACM, 31(7), pp. 880-886.

[26] E. Garfield, "Citation analysis as a tool in journal evaluation," Science, 178(1972), pp. 471-479.

[27] N. Geller, "On the citation influence methodology of Pinski and Narin," Inf. Proc. and Management, 14(1978), pp. 93-95.

[28] D. Gibson, J. Kleinberg, P. Raghavan, "Inferring Web Communities from Link Topology," Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.

[29] D. Gibson, J. Kleinberg, P. Raghavan, "Clustering Categorical Data: An Approach Based on Dynamical Systems," Proc. 24th Intl. Conf. on Very Large Databases, 1998.

[30] G. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.

[31] H. Hotelling, "Analysis of a complex statistical variable into principal components," J. Educational Psychology, 24(1933), pp. 417-441.

[32] C.H. Hubbell, "An input-output approach to clique identification," Sociometry, 28(1965), pp. 377-399.

[33] B. Huberman, P. Pirolli, J. Pitkow, R. Lukose, "Strong Regularities in World Wide Web Surfing," Science, 280(1998).

[34] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.

[35] L. Katz, "A new status index derived from sociometric analysis," Psychometrika, 18(1953), pp. 39-43.

[36] M.M. Kessler, "Bibliographic coupling between scientific papers," American Documentation, 14(1963), pp. 10-25.

[37] R. Larson, "Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace," Ann. Meeting of the American Soc. Info. Sci., 1996.

[38] J.H. Levine, "Joint-space analysis of 'pick-any' data: Analysis of choices from an unconstrained set of alternatives," Psychometrika, 44(1979), pp. 85-92.

[39] M. Marchiori, "The quest for correct information on the Web: Hyper search engines," Proc. 6th International World Wide Web Conference, 1997.

[40] O. McBryan, "GENVL and WWWW: Tools for taming the Web," Proc. 1st International World Wide Web Conference, 1994.

[41] K. McCain, "Co-cited author mapping as a valid representation of intellectual structure," J. American Soc. Info. Sci., 37(1986), pp. 111-122.

[42] E. Noma, "An improved method for analyzing square scientometric transaction matrices," Scientometrics, 4(1982), pp. 297-316.

[43] E. Noma, "Co-citation analysis and the invisible college," J. American Soc. Info. Sci., 35(1984), pp. 29-33.

[44] C.H. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, "Latent semantic indexing: A probabilistic analysis," Proc. ACM Symp. on Principles of Database Systems, 1998.

[45] G. Pinski, F. Narin, "Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics," Inf. Proc. and Management, 12(1976), pp. 297-312.

[46] P. Pirolli, J. Pitkow, R. Rao, "Silk from a sow's ear: Extracting usable structures from the Web," Proceedings of ACM SIGCHI Conference on Human Factors in Computing, 1996.

[47] J. Pitkow, P. Pirolli, "Life, death, and lawfulness on the electronic frontier," Proceedings of ACM SIGCHI Conference on Human Factors in Computing, 1997.

[48] C.J. van Rijsbergen, Information Retrieval, Butterworths, 1979.

[49] G. Salton, Automatic Text Processing, Addison-Wesley, Reading, MA, 1989.

[50] W.M. Shaw, "Subject and Citation Indexing. Part I: The clustering structure of composite representations in the cystic fibrosis document collection," J. American Soc. Info. Sci., 42(1991), pp. 669-675.

[51] W.M. Shaw, "Subject and Citation Indexing. Part II: The optimal, cluster-based retrieval performance of composite representations," J. American Soc. Info. Sci., 42(1991), pp. 676-684.

[52] H. Small, "Co-citation in the scientific literature: A new measure of the relationship between two documents," J. American Soc. Info. Sci., 24(1973), pp. 265-269.

[53] H. Small, "The synthesis of specialty narratives from co-citation clusters," J. American Soc. Info. Sci., 37(1986), pp. 97-110.

[54] H. Small, B.C. Griffith, "The structure of the scientific literatures I. Identifying and graphing specialties," Science Studies, 4(1974), pp. 17-40.

[55] E. Spertus, "ParaSite: Mining structural information on the Web," Proc. 6th International World Wide Web Conference, 1997.

[56] R. Weiss, B. Velez, M. Sheldon, C. Nemprempre, P. Szilagyi, D.K. Gifford, "HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering," Proceedings of the Seventh ACM Conference on Hypertext, 1996.

[57] Wired Digital, Inc., Hotbot, http://www.hotbot.com.

[58] Yahoo! Corporation, Yahoo!, http://www.yahoo.com.
