ht98 comm.dvi

Transcript

1 Inf unities Topology Comm Web erring from Link berg Prabhak vid Jon Klein Da Gibson van Ragha ar h Center Science of Computer Researc Almaden Science Dept. Dept. of Computer eley ersity Univ IBM Cornell UC Berk USA eley, CA 94720 NY 14853 Ithaca, CA 95120 USA Berk USA San Jose, +1 510 643 5425 +1 408 927 1804 +1 607 254 4636 [email protected] klein erkeley.edu [email protected] [email protected] individuals t. Thus, ten con ed link while can imp or- CT ABSTRA ose level, its local extremely at an der is organization global The tralized, Wide Web gro ws a decen through World in some | \unplanned" utterly high-lev el struc- sense, and has this cess, pro hic anarc almost in a large resulted a posteriori analysis. ture can emerge only through organiza- kind the without ed corpus of logical hyperlink into more traditionally-created hy- tion that can be built permedia. suc under structure meaningful To extract h ts a consider- link structure of the www represen The erlinke circumstances, d com- hyp of a notion elop we dev able thus o ers and annotation, t human t of laten amoun of the link munities on the www through an analysis a promising point for structural studies of the starting top . By invoking a simple, mathematically clean ology t of work Web. There has been a gro wing directed amoun de ning the of these structure metho osing exp and d for link at the t and ten con of textual integration informa- we are unities, ber of themes: e a num to deriv able comm purp ose of organizing [2, 4, 13, 24], visualiz- the for tion as con taining a core of ed be view can unities comm The ing [7, 21] and searc hing [1, 5, 6, 14, 17, 19, 22, 23, 25, 26] tral, cen \authoritativ by \hub together ed link pages e" presen in hypermedia suc h as t work the . The www pages"; type of hierarc a natural exhibit hical they and ; originates from the problem of searc on the www hing be inferred directly from topic generalization that can to deal explicitly with de n- on this, it attempts building pattern sho of link age. Our investigation the al- that ws a meaningful notion of structur e in suc h an environ- ing create cess pro the though h users Web of the by whic h as navigation men t, as a way of addressing issues suc to understand and pages links is very dicult at a \lo- and information very. disco cal" in a much greater degree of orderly level, it results high-lev el structure has typically been than assumed. link an is on here emphasis Our investigation of the www pervasiv of the ology e themes top fairly , and some ords: Hyp ertext comm unities, information explo- Keyw structure the about ti ed we have iden of hypertextual e annotation. ration, World Wide Web, collab orativ We will communities have dev elop ed in the Web. that notion vides a surpris- community of a pro this that see ODUCTION INTR clear seemingly persp ectiv ingly the h to view whic e from complex- As hyperlink ed environmen ts gro w in size and t of the Web's haphazard dev elopmen infrastructure. high-lev meaningful ting represen el vering ity, disco and in them structure challenging increasingly an becomes ber of re- in a num valuable are that themes The emerge This is particularly eviden t in the case of the problem. Our spects. of the link structure of the www analysis Web h for structure searc the ), and www World Wide ( on-going and creation of page suggests cess the pro that The reasons. eral sev for elling is comp here is a www a local understand to dicult very while age, link at it con- hypertext corpus of enormous complexit y, and in structure results or- more is considerably that level, to ver, Moreo rate. a phenomenal at expand it tinues a derly than is typically assumed. Thus, it gives us form be view ed hyper- as an intricate can of \populist global understanding of the ways in whic h indep enden t y ts, participan of on-line man h millions in whic media," users build connections to one another in hypermedia con icting goals, are con tinuously creating hyper- with that arises fashion, and it pro vides a in a distributed for h on-line the way in whic basis comm uni- predicting ties computer-orien ted disciplines will dev elop as in less they become increasingly \wired." It also suggests some of the higher-lev el information that types of structured, designers of information disco very tools may be able to about, pro both for, and vide user populations on the Web.

2 imp ortan t point: our use of the term community does Our a hyperlink-orien ted is based study on exp with erience imply constructed have been structures these that not metho in [17], by Klein introduced hing searc d for berg Rather, tralized our exp er- fashion. or planned in a cen hits Topic with and erlink-Induced an ex- h), Searc (Hyp that ws sho hits with tation imen of unities h comm suc perimen tal system built around un- The technique. this hubs e of the and are a recurring authorities conse quenc technique is discussed in detail in [17]. Here we derlying way on the www link to one of pages creators in which technique, invoke this eloping it for our study of com- dev another in the context of topics est. ead inter of widespr the Web. We begin with a brief summary on munities from (in the follo wing section) of the main concepts [17] OVER VIEW HITS OF necessary understanding our study . Follo w- are that for with a summary of the main concepts from We begin ing of the a discussion er is then pap of the bulk the this, t [17] that are necessary for understanding the presen our and in- dology metho motiv ation basic underlying work. vestigation www comm unities, and the main themes of have emerged. that is concerned with the iden ti cation hits authoritative of set T sources broad-topic information disc ov- hypermedia for . We brie y ery illustrate these notions with a set of set S examples. Consider, on the one hand, an individual the who wishes to use Web to nd the phone num ber Law School. Harv of a friend who is a studen t at ard to information desired The be con tained on is likely two pages, or one and hence the main issue is only On to locate the this num ber of small relevant pages. use hand, consider an individual who wishes to other Harv the Web to nd information about ard y. ersit Univ pages 000 ; 800 than more are there Since the www on con taining the term \Harv ard," a shortage of relev ant from a bles assem hits , query 1. Starting a user-supplied a broad-topic Rather, problem. the is no longer pages for root set of pages: typically , up to 200 pages returned S suc h as this, the user requires a way of iden tify- search [9] on that suc by a searc h engine h as AltaVista query . It authori- or tral, cen most of the collection a small ing expands by adding T set in any base to a larger then this , pages standard Most ard." \Harv topic tative the on . S pages that point to, or are pointed to by, any page in e authoritativ searc h engines do not, for example, return size ent the (To prev of from a xed- T explo ding, only interesting . The www.harvard.edu h as ques- suc pages to by, any or pointed to, pages of the subset size pointing how can tion arising in this con text is: one determine, in See the accompan ying is considered.) S page single without ention, that www.harvard.edu interv is human of the gure, root in whic relationship the h we depict be considered indeed a page that authoritative should and S set . T base set the ard"? \Harv topic the for a ( p h page ) and p eac with ciate 2. We asso h weight hub stems The technique underlying hits from two premises initialized to 1. Let p ! q an weight a ( p ), all authority vided by hu- pro annotation implicit the that [17]: rst, ". then hits q to page a hyperlink has p \page denote t informa- of hyperlinks sucien tains creators con man 's as follo iterativ 's and h the updates ws: ely a to infer tion and second, build- y"; of \authorit a notion X em- on this, that ing sucien tly broad topics con tain p ( ) := a ; ) q h ( of communities bedded We view pages. ed hyperlink ! q p X h comm suc inter- but two distinct, taining as con unities : h ( p ) := a ( q ) (highly-referenced related, authorities types of pages: ! q p as well that pages numerous as the on pages) topic, y of the authorities, and thus serv e to \point" to man a single of the ) by the p ( a replaces iteration sum Thus, \pull" them together. We refer to pages of the latter ( h replaces then , and p to pointing of pages ()'s h ) by p tral cen e as strong serv they since type as \hubs," points . of the p to by pointed of pages ()'s a sum the y is \conferred" h authorit ant pages. whic on from relev Hubs could be called a what exhibit authorities and pages, 3. The updating operations are performed all the for : a good relationship ly reinfor mutual hub points to cing (normalizing weigh ts af- the eated is rep cess and the pro man a good to by authority is pointed y good authorities; ter eac e It can be pro ved that this iterativ h iteration). this We break apparen y good hubs. man t circularit y by pro cess con verges to stable sets of authorit y and hub in the ed section. d describ e metho iterativ an An next weigh ts [17]. We declare the 10 pages with the highest

3 a with the 10 pages with the highest () values together unities as well [17]. For example, hits also ingful comm of a comm ber y. (The core num to be the () values h unit , a large ti es, in the base iden for the topic "Harvard" set is not , and arbitrary or less here is more 10 for crucial dra wn bio-medical on y of pages unit comm topics, into the our , we wish tially essen discussion; wing follo com- these between age link strong of the because set base the human for that to have a size munities is \manageable" man the and pages ci- asso labs medical and y biological users.) Harv ard. This is a case in whic h the main with ated y can be considered \on-topic," with sev eral comm unit arising smaller \o -topic" comm unities as well. values of the equilibrium ved that be pro it can the In fact ts and to coor- ond ts corresp hub weigh authorit y weigh M out discussed It turns M the matrices that and auth hub of matrices dinates in the principal eigen vectors of a pair to allo dis- the w for information enough tain above con M [17]. structure link the ed from deriv M and hub auth h additional covery of suc the that Recall unities. comm algorithm basic unit for be tifying a comm iden y could mathe- Note that the metho d is extremely simple and of the princip al eigenve ctors of M analyzed in terms hub con vergence analyze prop- can its matically clean: one M few rst the that out . It turns non-princip and al auth in a rigorous fashion, and the only tunable pa- erties have the M and M of ctors eigenve same intu- hub auth pro cedure We feel rameter xing is the the root set. for rep- itiv e meaning as the principal eigen vector: they framew technique mak this that an app ealing es the ork resen t pairs of weigh t assignmen ts exhibiting the mutu- in Web in whic h to searc h for inheren t structure com- ally reinfor authorities and of hubs relationship cing [17]. fact to run d is designed metho the that The munities. sev by computing Thus, non-principal eral eigen- of the structure, ne-tuning or without link on an arbitrary of M M vectors and ver additional disco can hits , hub auth ert kno about the the , incorp oration of exp wledge www comm ed of link set base same the pages. within unities ations suggests that the structural emerge that observ the use We will to to al community princip term refer to Web are largely intrinsic the itself, rather than an basic the unit ciated asso algorithm, comm by the y found algorithm. \over-trained" of an artifact the M of vectors with and M ; it is eigen principal auth hub \Harv We return to our initial ard", example, the topic comm the ely intuitiv pat- densest the y exhibiting unit top The setting. in a concrete notions these to illustrate The hubs between age of link tern addi- and authorities. this the are hits by as generated authorities for topic, comm tional eigen- non-principal with ciated asso unities consisted root set (The wing. follo returned pages of 200 . Since vectors will be called non-princip al communities on by AltaVista .) "harvard" query the M can non-principal eigen vectors of the M and hub auth of their be naturally asso ci- magnitude (by the ordered www.harvard.edu eigen ordering this ated a natural values), induces on the .edu www.hbs.harvard unities comm as well. non-principal .edu www.law.harvard form comm be- set base in the Multiple also can unities edu ksgweb.harvard. term has sev a query meanings in di eren t cause eral texts; for example, the topic "geometry" pro duces con The top hubs in this case consist of pages created by comm unities on computational geometry , di eren tial ge- individuals not various necessarily located at Harv ard, ometry . geometry seismic , and to a large author- consisting num ber of the top of links ities. for a of a similar examples y other Man nature, WORK TED RELA range be found in [17]. can of topics, of partitioning The oses purp the for vectors of eigen use and by Donath was introduced a graph in [10] Ho man Note the content crucial fact that the textual of the The been has and ely since. studied d un- metho extensiv in the initial step, is only pages considered involved derlying hits is technically distinct from that of [10] (it h engine. a when a searc from bled is assem root set Fol- partition does not heuris- the though eb graph"), \W the weigh lowing the this, algorithm simply propagates t over intuition tic underlying For similar. quite is clearly both links to the without further relevanc e of the pages regard non-h retriev information an ed corp al tech- ora, yperlink fact with. it is working that hits can reliably iden- The semantic latent as wn nique kno es use mak [8] indexing only tify also but authoritative rele- not are that pages of the deriv ed from the in- vectors of a matrix singular about vant to the user's initial query implies something verted other hand, the , on hits corpus. of the index is of the breadth the root set initial since topic: was the operating the ed of a hyperlink structure link on purely densest the ant pages, rich in relev tly sucien comm u- weigh term ts. corpus, and mak es no use of matrices with authorities nity of hubs and surrounding base set in the ant as well. was relev ve searc The use of link information to impro h perfor- has www on the mance work; in previous anced adv been this In general, of course, the base set con tains not only analysis relev enhancing for used been has ance hyperlink ber of other densest comm unit y but a large num mean-

4 unit judgmen 12, 14, 19, 26], as well as for \ranking" "Harvard" topic the for authorities top y. Thus the ts [1, a single pages of pages of a mixture consist iteration for after www [5, 22, 23, 17]. Link structures have been schools researc predates the www ; in the and topics, h that on bio-medical pages ard, at Harv studied in hypertext pages Botafogo graph-theoretic [4] introduce et al. particular, home of yahoo and Microsoft. As we increase the based num we are able to watch the base set dis- de node-to-no y and densit link on measures ber of iterations, tances of hubs t comm into coheren e" itself and \resolv unities for clustering and searc hing in hypermedia. Their enc index e nodes bear a relation to authorities. of and refer notions however, here; used authorities and of hubs notions the below. and Issues (1) (2) are discussed in detail To are they based purely on the out- and in-degrees of indi- ely, we de ne study issue (3) al princip the quan titativ t. (See vidual documen ts in the hyperlink ed environmen C community top 10 au- , as before, to be the of the set of the of some in apply- diculties a discussion [17] for that thorities and top 10 hubs | 20 pages in all. Recall the degree-coun ting metho ds ing on pure to a domain a function parameter is implicitly C set this of one R , www .) of the scale \con vergence" the However to study size. root set the num , the N , we include C ber of iterations, as sec- to the of cita- patterns The eld of bibliometrics studies ond principal ) is the R; N ( C so we say that parameter; ti c tion implicit type of \link age" | among scien | an hits comm unit y obtained by running for N iterations ers. pap See [27] for mea- ber of their A num a review. set a root from starting our erience exp . In R of size sures in the con text of hypermedia; some have meaning y of topics, on trials hits a variet hundred eral with sev can in [18]. One are connections of these also studied stable the comm unities that form predictably become interpret a type of on as relying hits the beha vior of so a root set with of 200 and after 50 size iterations; [20]. et al. by Marshall , as studied memory community  for C Now, various C we de ne = (200 ; of 50) values : kind In essence, hits is searc hing for a particular of link between C ( , we can ) look at the overlap R and N R; N | and this structure | reinforcing hubs and authorities  num and C | the ber of pages they have in common. structure by comm is created unities of people collating 1{6, over- the nal page, we plot these on In Figures useful information, whether for their own bene t or for N 10 ; 50 and for R = 25 ; 50 ; 100 ; 200, for six laps 3 ; ; = 1 others'. tativ study: in our e topics represen OBSER METHODOLOGY AND VATIONS Harvard emergence We now discuss our use of hits to study the cryptography we also a high-lev el Web; the on unities of comm pro vide literature English next In the ations. section main of our summary observ skiing ations we discuss in greater detail. these observ optimization research operations issues: three We focus on What (1) non-principal) comm and (principal the are u- ect (Eac h curv e in these plots measures overlap with resp nities disco vered by hits ? a xed = 50 N . Again, N of value for size, to root set end dep vered disco comm unities the on (2) How do the case.) essen tially represen ts con vergence in our ort choice of root set? We rep results on two kinds of root set: of the root set the bling assem (i) variations have di eren t lev- tativ represen our that Note e topics from di eren t sources (AltaVista vs InfoSeek), or from of computer els science, of connection to the broad area a query issued in di eren t languages "astrophysics" ( from ranging that are hea vily computer-orien ted topics vs "astrophysique" varying the size of the root ); (ii) ( "cryptography" ) to entirely di eren t academic disci- 25, root sets 50, we consider comp set: of the top osed plines ( "English literature" ). Suc h topics also ex- and the has This by AltaVista. returned results 200 100 a range hibit quan tity of link- and t patterns of di eren e ect ts; exten of \fo cusing" the text query to varying has this and h the age, at whic prin- rate the on e ect an that unities we can then analyze the comm result. of the in terms y \crystallizes," unit comm cipal ber num (3) How quic kly do the comm unities crystallize as the size. will trasts con and These root set the of iterations num h iteration eac that (Recall ws? gro ber of iterations The in In Figures plots below. further be discussed 1{6 ts in terms of one y weigh authorit and the hub updates only of using danger a single the clearly illustrate also most (emerging tly-knit tigh unities comm Are another.) iteration. after e or does it tak ber of iterations), num a small only emerge some themes that We now summarize principal them y iterations for e? For instance e shap man to tak unities su- comm that We nd analysis. on our from are top after a single iteration, the authorities simply tly \robust" to have a fairly tend struc- topics broad cien the pages in the base set with the largest num ber of to be rel- ture: the groupings of pages disco vered tend indeed highly are pages these While links. incoming ref- choice exact We t of the of root set. ativ ely indep enden tend erenced (by de nition), they to lack any thematic

5 these exp erimen ts, regardless of how the root set is con- also that nd the of hits dep ends on both the success since or less have more unities comm However, structed. kno of human discipline breadth the and of a topic wl- iden the sets, t base in di eren tation represen tity of the edge This under densit y h it falls. is because the whic what princip comm unit y is not alw ays the same. al But in greater is far of hyperlinking eness comprehensiv and erimen exp multiple ts with di eren t is that suggests this computer some disciplines (e.g., science) than in others root sets are pro viding us with multiple, altered tly sligh (e.g., in the academic humanities, whic h are sub jects views set of underlying comm unities, h whic of a small moving onto the Web more slowly .) On topics h do whic choices of root sets. by the sampled being are various have a sucien tends hits y of hyperlinking, t densit not alize gener that in comm to nd topic initial the unities h the To illustrate this point, consider the way in whic be another of sev one senses; focus of our eral will this principal for the topic "astrophysics" recur authorities below. discussion unities for Frenc h and Ger- in non-principal comm the of the versions The top 5 authorities for man topic. been coun Note that a fairly emerg- ter-in tuitiv e point has are "astrophysics" ing from the dev elopmen t above. Speci cally: The great- est degree of orderly structur e, as extr acte d by hits , is the found in communities for which numb er of relevant fits.cv.nrao.edu/ www/astronomy.html and pages, the density of hyp erlinking, is the largest. We cdsweb.u-strasbg. fr/Simbad.html have seen with "cryptography" and this phenomenon www.aas.org/ literature" is in con trast "English stan- to the . This .gov heasarc.gsfc.nasa point of view the www is becoming increas- dard that adsabs.harvard.ed u/abstract-service.html to mo dicult it suggests \chaotic" ingly that del; and technique underlying hits more becoming is actually the th top For "astrophysique" , the 8 in the 5 authorities to increase. the e as size of the Web con tinues e ectiv non-principal unit y are comm exp our eri- A consequence of this is that we can use ) to "cryptography" (e.g. ed topics highly-link with ence cdsweb.u-strasbg. fr/CDS.html structure ts about e statemen e predictiv of com- the mak u/ adswww.harvard.ed setting in the h up" have yet to fully that munities \catc cdsweb.u-strasbg. fr/Simbad.html of the . www adsabs.harvard.ed u/abstract-service.html fits.cv.nrao.edu/ www/astronomy.html hits on highly-link ed comm unities is akin The power of man to a \law of large num bers": y indep enden t, largely th another will meta- and random annotations reinforce one top 5 authorities in the 7 For , the non- "astrophysik" with morphose into a broad-topic comm unit y replete y are comm unit principal struc- of regular emergence the that We note structure. h of researc e topic is an works net in random ture activ adswww.harvard.ed u/ e.g. [3]), and we feel that it would in com (see binatorics fr/Simbad.html cdsweb.u-strasbg. h connections further in to investigate be interesting suc u/abstract-service.html adsabs.harvard.ed hits and the www . the con text of bonn.de:8000/ aibn55.astro.uni- www.univ-rennes1. fr/ASTRO/astro.english.html COMMUNITIES THE STR OF UCTURE We now detail our main ndings on the structure of There Generalization. Topic boundary be- sharp is no comm unities. that aren't; tween topics that are \broad" and those those For pro duces stable, hits topics, broad ustness. Rob our one of the from emerges but that themes primary a very robust comm unities despite from starting small that exp topics to \generalize" tends hits is that erience of relev sample in the root set. We have ant pages initial sucien tly that broad. By this we mean the are not hits this by sev eral direct metho ds, pro viding explored comm unit y of hubs and authorities will principal be rel- ant to the same with a variet y of di eren t root sets relev is larger the evant to a topic whic h includes, but than, we issue the same query string to topic. For example, hits to vided pro topic initial . multiple [16], searc h engines (e.g. AltaVista [9], Infoseek To mak on three we focus concrete, e this examples. basic and root sets duces pro typically [11]); this Excite with de ned er by prop of topics notion the rst Consider little very , one can Similarly intersection. obtain root names. etball bask (the Jordan" "Michael topic The term that are nearly disjoin t by issuing a query sets there nu- broad: turns are star) out to be sucien tly vs di eren t languages eral (e.g., in sev "astrophysics" Mic merous hub pages con taining links to pages on hael "astrophysique" "astrophysik" vs ). Jordan and his team, and hence the principal comm u- topic the hand, nity is highly other the On ant. relev to recur We nd that the main comm unities tend in all

6 "Dennis author of the C programming (an Ritchie" www.crs4.it/HTML/ Literature.html pro the top 3 authorities: follo wing duces language) humanities.uchica go.edu/ARTFL/ARTFL.html un2sg1.unige.ch/w ww/athena/html/francaut.html www.cm.cf.ac.uk /Dave/C/CE.html www.cyberdiem.c om/vin/learn.html of the two link level, examination an At the highest www.lysator.liu .se/c/index.html structures indicates that is much age of link density the greater the on pages for on those for than topic former the pages on highly-referenced are of these All C pro- We latter, the vior. t beha di eren to their leads this and principal the while Thus, itself. language gramming below. trast con this to say about have more shall comm y is relev topic of the alization gener ant to a unit interesting Generalization is an feature of hits , since the hie," Ritc \Dennis swal- been has himself individual the it allo automatic characterization of certain ws for tly lowed sub prominen by the ject to whic h he up most For ex- generalizations. of their in terms topics speci c belongs. an through purely it was of the link analysis ample, pushes One sees a sense in whic h the mec hanism of hits that hits placed structure topic "Dennis Ritchie" the wards" implicit in an hy, hierarc topic speci c topics \up language. in a comm unit y on C programming the points on tly topics broad represen t xed while sucien of Con ver gent Generalization and a ìTree Topics. î The not that ant authorities overly relev nds hits h whic are topic be explored can discussed hy just hierarc implicit "optimization" topic the example, for Consider, general. . . fully through further exp erimen tation with hits more In- top three authorities are the home pages of the The It is natural cess pro the to picture of abstraction and Op Researc h and the Managemen t stitute for erations most : the tree of topics a on as occurring generalization the Sciences, CMU Graduate School of Industrial Ad- (e.g. topics general Art, closest are Recreation) Science, and y for Industrial and Applied Societ the ministration, t sub-topics. to the root, and their descendan ts represen Mathematics. been has idea h an Suc the , by human www on realized through hier- searchable of construction the ontologists, www.informs.org hierarc archies suc h as yahoo . A searc hable hy includes mat.gsia.cmu.ed u hand-annotation and of the various topics (e.g. Science www.siam.org a purely through even automated however, Art); anal- gain of the ysis h suc about information links, we can examination of these t migh authorities and other An of topics. trees the that suggest ant to a larger relev y is in fact unit comm eld topic that includes much of the of optimization explanation The hypothesis of an underlying tree-based | y is sup- researc namely possibilit h. This , operations phe- for generalization is supp orted by the follo wing topic the y for unit comm principal the by nding ported alization gent nomenon of conver gener , whic h hits ex- vered have sig- "operations research" ; the pages disco that, We have found hibits. topic on a broad given "optimization" for those with t overlap ni can . The generalize, does not dis- technique often h the whic one are top three authorities unities hits by applying comm similar very covers to a range speci c sub-topics. of more www.informs.org clear example . "cryptography" topic by the is pro vided One mat.gsia.cmu.ed u reliably This topic is sucien hits general that nds tly www.gams.com focused hub and authorities very ose supp Now, pages. more speci c sub-topics of cryptograph y | we choose h would whic nat- topics that Finally , it is interesting One cryptographers. , names speci cally of individual \parallel" be considered ve di eren tly can urally beha nds generalized are sub-topics of these most that in be with resp ect to this type of generalization. This can unities have signi can that ways, similar very to comm t t of hyperlinking amoun in the to di erences due among the topic larger with and h other eac with both overlap the For example, the top authorities for ant pages. relev . "cryptography" focused: remain literature" "English topic tree of the Intuitiv ely, we are looking at a portion u/Shakespeare/works.html the-tech.mit.ed in the y"; vicinit we and \cryptograph y of the node for nn.edu/~jlynch/Lit www.english.upe y of the that man child nd nodes generalize upward to humanitas.ucsb. edu/shuttle/eng-mod.html this common paren t. the "German literature" top , on The authorities for Other Factor s Aff ecting Generalization. In addition to other are ant to a range relev ean of topics in Europ hand, the natural notion of generalization discussed above, we literature . more generally have iden ti ed a num ber of other factors that cause hits

7 pdg.lbl.gov/ to con completely fo- unities to comm verge that are not initial factors topic. recur in a num- the on cused These highligh ber of di eren t t con texts, and again seem to One in the con- phenomenon at work witness can this of hyper- fundamen tal structure the about some themes and "English between trast literature" and underlying in- the link ed comm unities on the Web, "German link greater the . In particular, literature" In particular, of user populations in this domain. terests explained be partially can topic densit y of the former two factors: discuss we will tric" eb-cen \W wing the follo rep- ber of in uences, including the increased by a num sub-topics, and commercialization. American of North tation resen y home pages, ersit univ and whic h tend to be highly link the fact that the ed, The of these rst is the ultimately that, notion , what eare de- is a much more ositories rep esp of Shak creation determines \generalit is y" of a topic in this setting the the ed enterprise on comparable is the than www velop www the on representation its topics . Thus, certain activit authors. German In the other direction, we y for broad, seem can and others arti cially nar- arti cially unit comm a large that interesting it quite nd ci- y asso interest row, because they are of more or less to creators the with ated was in literature" "German of output pages. ectiv e from persp another vides pro of Web This around relates this literature; vian focused Scandina fact tends hits that notion the view h to whic be most to about ations observ anecdotal numerous the to closely e within topics that have the most \wired" com- e ectiv the unusually high densit among Web pages y of linking munities. Norw ay. Sweden, Finland, from and The is the principle of this manifestation simplest fol- One way to view this general issue (though it is an over- lowing: in a large Web that topic the of cases, fraction simpli cation): at the root of our conceptual tree of most are pages itself; Web involves the with concerned Wide World the sits topics itself. Web structure and , the subtly or less more in uence, can this disco hits that vers. comm of the In particular, unities The main in uence we touc other h on here is that of ear to be a spe- app at rst the principal y can unit comm on commercial/adv link structure. ertising in uences the cialization , rather than a generalization, of the initial on The man creation of page the www has cess pro y topic; u- this on focused has hits y, in realit but comm bring ts, and sim some of these can participan ultaneous nity because version tric" eb-cen \W ts a more it represen on engineering to bear man structure link the y resources of the topic. topics with both in a way that favors Thus, them. for commercial and individual involvemen t, the authorities examples. We consider For the topic "linguistics" , three in the principal to tend overwhelmingly y will unit comm are authorities top the be highly-commercialized pages. A simple is example : the topic "tennis" www.cs.columbia. edu/~acl/home.html www.cs.columbia. edu/~radev/cgi-bin/universe.cgi com/ www.tennisserver. www.ling.rochest er.edu/linglinks.html www.uspta.org/ .com/ten/ espnet.sportszone eld the rst two of these are strong authorities for The is a sub-topic of computational linguistics . While this to mis- we stress However, it is considerably that harder of the query , hits has con verged it because to initial lead hits by \spamming" than to mislead a term-based that typically cause it to gen- in uences same of the y searc collectiv h engine: hits measures the e authorit setting in the eralize: www it is a topic with of the vested it is by man in con trast, y people; in a resource a considerably age densit y of link than classical greater \boost- pages of popular creators to for common to try we consider Next, linguistics. as of "Harvard" topic the by including engines term-based on ranking their ing" are 1997. Jan Included among the top authorities uary num large a (As ords. keyw hosen of judiciously-c bers Harv the Inter- on Conference ard the for pages home pragmatic way to limit one of self-conferral the matter, net and Societ in y, a very held conference Internet large propagation authorit y is to limit { or even ignore { the www the on From ectiv persp the y 1996. Ma , e of pages own in its pages from into a page links ts along of weigh this is a prominen t topic closely related to the initial sub-domain.) internet whic query , "Harvard" . Finally , an observ ation h should topic be obvious in hindsigh t: The top authorit y for the these in uences, both bines A phenomenon whic h com "physics" is CERN, the location where the Web was up frequen whic h sho ws and quite recurring tly, is the in a sense \born": presence of pages suc h as AltaVista, yahoo , and within www.microsoft.c om and a comm unit y of hubs authorities, regardless of the topic. The point is that www.cern.ch/ all these pages are link ed to so hea vily , from essen tially aps.org/

8 ing from a root set of pages in the 2-step vicinit y of portions tly sho w up in of the www frequen , that they consist authorities principal , the www.nytimes.com of a yahoo was the example, for us, (Th of authorities. lists news organizations of on-line mixture and popular In- .) the We authorit th-rank nin y for ed topic "physics" The rst two non-principal comm unities sites. ternet tion us it tells about because men something this issue sharply: mixture this separate very h suc the exten t to whic hyper- h pages have in ltrated as a pragmatic link disciplines; in all unities comm ed is- from be remo sue, the output of they hits could ved in first community: non-principal Authorities by essen t of as the alen -equiv www tially them treating / www.microsoft.com \stop words" in information retriev al. www.ibm.com/ www.apple.com/ Temporal Issues. Comparing the principal authorities www.hp.com/ 1997, for "Harvard" uary , Jan , August and "Harvard" www.sun.com/ 1997, brings up an additional theme concerning hyper- unities. There can often be substan tial link ed comm Authorities in second non-principal community: factors temp that of short-term orarily set the in uence www.nytimes.com/ principal the in a topic; above exam- for authorities www.usatoday.com/ Internet on the Conference ard Societ Harv the ple, y and www.cnn.com/ ard" in was still a prominen t feature of the topic \Harv / www.sjmercury.com 1997, uary Another Jan 1997. but not in August exam- ne.com/ www.chicago.tribu "comets" vered disco ple is the topic hits ; in Ma y 1997 with alien a large comm unit y concerned visits the and ven's \Hea group. Gate" FUR THER DIRECTIONS wth gro dynamic The Web immediately suggests of the Short-term die out as pages and links are re- in uences be should we have performed type of analysis the that though over time; moved can be arti cially kept they over time eated rep results compared. Suc h an the and inclusion in the indices of searc h en- \curren t" by their vides approac a way of studying the temp oral evolu- h pro gines. long- the obtaining d for metho interesting An the of understanding and Web, on unities of comm tion of of a topic is to sup erimp \core" term the results ose techniques considered here will adapt as the how the over sev on the same topic, spaced out th eral-mon hits size w in both to gro tinues con Web y. complexit and perio ds. tly investigating ways of using hits to curren We are Earlier, a Topic. on Inuences the Dissecting we indi- by al tasks retriev information on ve performance impro hits underlying , through the technique the that cated com work of hyperlinks; structure the and text bining for to use of eigen vectors, is able disco ver multiple com- direction, goal here is to create a see in this [6]. One munities prin- a single topic: a given with ciated asso whose disco very system h clien t-side information searc unit an with y, together ber of comm cipal num arbitrary parameters | at both the text-based and link-based al non-princip sparser typi- unities. comm We nd that | are fully \tunable" by individual users, levels cally in uences leading to topic generalization, in the the CONCLUSION not forms above, are concen trated in some but discussed has gro wn into a hypertext environmen www t of The unities by all of the disco vered hits ; and that some comm underlying process the and y; complexit enormous its \purely" fo- of the non-princip al comm unities are more been has en in a chaotic fashion by the indi- wth gro driv cused of the all examining By h topic. searc the on participan vidual actions Our of numerous erience ts. exp assem a par- one together, unities comm strongest bles ects the however, that in man y resp with hits suggests, t of the arrangemen overlapping partially nested, tially migh end ose: as one as chaotic is not duct pro the t supp the di eren t hyperlink ed sub-comm unities that mak e up egate vior www the on populations of user aggr beha can h topic. searc , one "physics" topic in the For example, a mathematically be studied for technique clean through whose unities comm strong nds com- are authorities top top this can one ology use link Web's the analyzing , and entirely of (i) academic departmen ts, (ii) high- posed d commu- erlinke hyp about themes tify to iden technique and (iii) energy accelerators and colliders, professional and of interests range a wide to span ear app that nities 1997), topic societies. For the "Harvard" (Aug. the disciplines. of bio-medical pages can likewise be easily iden- group y, separate ti ed unit as a coheren t non-principal comm ACKNO WLEDGEMENTS ard. at Harv from the home pages of schools on We have bene ted from discussions with man y people be an also type of dissection e ectiv e way to This can the topics covered here. In particular, we thank Rak esh a topic. separate out the \W eb-cen tric" in uences on Agra wal, Soumen Chakrabarti, Byron Dom, Nimro d Megiddo, An start- very this ws [17] sho cleanly: from example Christos Papadimitriou, Sridhar Rajagopalan, Ted Selk er

9 Proc. 8th ACM Confer enc e on Hyp ertext , 67{74, and work rst and second authors The Upfal. Eli of the 1997. Almaden IBM the visiting while in part was performed ter. Researc in is supp author The second orted h Cen Golub, 15. G. C.F. Van Loan, Matrix Computations , by an part by and wship h Fello Researc P. Sloan Alfred Johns Hopkins Univ ersit y Press, 1989. y Early Facult NSF ard CCR- elopmen Dev t Aw Career search 16. Corp oration, Infose ek Infoseek engine , 9701399. . www.infoseek.com \Authoritativ berg, J. Klein 17. a hyper- in e sources REFERENCES link environmen t," Symp o- Proc. ACM-SIAM ed Mendelzon, A.O. cena, Aro 1. G.O. Mihaila, G.A. ete on Discr sium ears as app Also , 1998. Algorithms language," Proc. 6th \Applications of a Web query h Rep Ma 10076(91892) RJ ort y 1997, Researc IBM World , 1997. e enc Confer International Web Wide . at www.cs.cornell.ed u/home/kleinber/ and \Sense- 2. M. Wang Baldonado, T. Q Winograd, 18. World Wide of the \Bibliometrics Larson, R. Web: Mak er: An information-exploration interface sup- structure intellectual of the analysis An exploratory porting the con textual evaluation of a user's inter- an Soc. erspace," Ann. Meeting of the Americ of cyb enc e on Proc. ACM Human ests," SIGCHI Confer Sci. Info. , 1996. , 1997. in Computing Factors hiori, M. correct for quest Marc \The 19. information Press, , Academic 3. B. Bollob as, Random Graphs In- on the Web: Hyp er searc h engines," Proc. 6th 1985. World ternational , 1997. e enc Confer Web Wide \Struc- B. Rivlin, E. Botafogo, 4. R. Shneiderman, F. M. III, R. Shipman Marshall, C. C. 20. J. McCall, hypertext: tifying tural analysis Iden of hierar- \Putting Issues from Ex- Libraries Digital to Work: ACM metrics," useful and chies Sys. Inf. , Trans. unit y Memories", Proc. First perience with Comm 10(1992), 142{180. pp. and ory the e on enc Confer Annual e of Practic The ebQuery: \W Kazman, R. ere, 5. J. Carri hing Searc , 1994. Libr Digital aries connectivit y," Web the visualizing and through and Views text Focus+Con Hara. Y. S. Mukherjea 21. Confer- Wide World International Proc. 6th Web Con- of World-Wide Web Nodes. Proc. 8th ACM , 1997. e enc , 187{196, Hyp e on ferenc 1997. ertext 6. S. Chakrabarti, B. Dom, D. Gibson, berg, J. Klein L. Page, \PageRank: Bringing order Web," to the 22. van, P. Ragha \Automatic resource S. Rajagopalan. 1997- Stanford Digital Libraries working pap er by analyzing list structure hyperlink compilation 0072. World text." Proc. 7th International and asso ciated Mot R. S. Brin, L. Page, 23. \The Winograd, T. wani, , 1998. e enc Confer Web Wide ranking: PageRank Bringing order to the citation visualizing 7. C. Chen. WWW the Structuring and Web," for submitted publication. ACM by generalised similarit y analysis. Proc. 8th a sow's ow, Rao, R. \Silk from 24. P. Pirolli, J. Pitk Confer Hyp ertext , 177{186, 1997. enc e on the Web," structures from usable ear: Extracting Furnas, T. Landauer, S. Dumais, ester, 8. S. Deerw G. SIGCHI Proc. ACM Factors e on Human enc Confer tic R. Harshman, \Indexing by laten t seman analy- , 1996. in Computing sis," J. Americ an Soc. Info. Sci. , 41(1990). \ParaSite: Spertus, E. 25. informa- Mining structural search t Corp Equipmen 9. Digital AltaVista oration, the Web," Proc. 6th International tion World on . l.com/ altavista.digita , engine Confer enc Web Wide , 1997. e par- 10. W.E. A.J. Ho man, \Algorithms Donath, for 26. R. Weiss, B. Velez, M. C. Nemprempre, P. Sheldon, of graphs titioning computer logic based on and hical A Hierarc \HyPursuit: Gi ord, D.K. Szilagyi, of connections eigen IBM Tech- vectors matrices," work Net t-Link ten Con Exploits that h Engine Searc , 15(1972). letin e Bul nical Disclosur Hyp Proceedings of the Seventh ertext Clustering," Inc., Excite . www.excite.com , 11. Excite , 1996. ertext Hyp e on enc Confer ACM 12. for information in a hyper- \Searc Frisse, hing M.E. in \Bibliometrics," H.D. McCain, K.W. White, 27. ook," medical text Communic ations of the handb Sci. and Technolo gy , Elsevier, Info. Ann. Rev. ACM pp. 880{886. , 31(7), 119-186. pp. 1989, M. F. III, C. C. Marshall, 13. R. Furuta, Shipman o! Yaho o! Corp. Yaho 28. . www.yahoo.com , D. ertext and H-W. Hsieh. paths and Hyp Brenner Walden's with eriences exp web: world-wide the paths. ertext Hyp e on enc , ACM Proc. 8th Confer 167{176, 1997. 14. query the link: the told the . What vchinsky Golo G. Retriev integration of Hyp ertext and Information al.

10 1 iteration 1 iteration 3 iterations 3 iterations 10 iterations 10 iterations 50 iterations 50 iterations 20 20 15 15 10 10 5 Overlap with full community Overlap with full community 5 25 200 100 50 200 100 50 25 Root set size Root set size Figure 1. ‘‘Harvard’’ Figure 2. ‘‘Cryptography’’ 1 iteration 1 iteration 3 iterations 3 iterations 10 iterations 10 iterations 50 iterations 50 iterations 20 20 15 15 10 10 5 Overlap with full community Overlap with full community 5 200 100 100 50 50 25 25 200 Root set size Root set size Figure 4. ‘‘Skiing’’ Figure 3. ‘‘English Literature’’ 1 iteration 1 iteration 3 iterations 3 iterations 10 iterations 10 iterations 50 iterations 50 iterations 20 20 15 15 10 10 5 5 Overlap with full community Overlap with full community 200 100 50 25 200 25 100 50 Root set size Root set size Figure 6. ‘‘Operations Research’’ Figure 5. ‘‘Optimization’’

Related documents

MPI: A Message Passing Interface Standard

MPI: A Message Passing Interface Standard

MPI : A Message-Passing Interface Standard Version 3.0 Message Passing Interface Forum September 21, 2012

More info »
DoD7045.7H

DoD7045.7H

DoD 7045.7-H EPARTMENT OF D EFENSE D F UTURE Y EARS D EFENSE P ROGRAM (FYDP) S TRUCTURE Codes and Definitions for All DoD Components Office of the Director, Program Analysis and Evaluation A pril 2004

More info »
WEF GGGR 2017

WEF GGGR 2017

Insight Report The Global Gender Gap Report 2 017

More info »
book

book

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference...

More info »
insurance codes

insurance codes

INSURANCE COMPANIES AUTHORIZED TO DO BUSINESS IN THE STATE OF NEW JERSEY AND THEIR CODE NUMBERS 10, 2019 January LAST UPDATED NAME OF INSURANCE COMPANY ADDRESS ID NO ADDRESS CITY ST ZIP NOTES ADDRESS ...

More info »
Microsoft Word   04 17 Print Version Report of the APA Task Force on Ads an…

Microsoft Word 04 17 Print Version Report of the APA Task Force on Ads an…

Report of the APA Task Force on Advertising and Children Submitted by Brian L. Wilcox, PhD Dale Kunkel, PhD Joanne Cantor, PhD Peter Dowrick, PhD Susan Linn, EdD Edward Palmer, PhD February 20, 2004 F...

More info »
June2018CUR

June2018CUR

CHANCELLOR'S UNIVERSITY REPORT JUNE 25 2018

More info »
ayout 1

ayout 1

0465039146-FM:FM 12/5/06 12:25 AM Page i C O D E

More info »
Assessment of Minority Voting Rights Access

Assessment of Minority Voting Rights Access

AN ASSESSMENT OF MINORITY VOTING RIGHTS ACCESS IN THE UNITED STATES An Assessment of MINORITY VOTING RIGHTS ACCESS in the United States ★ ★ ★ ★ ★ ★ ★ ★ ★ ★ 018 STATUTORY REPORT 2 U.S. COMMISSION ON CI...

More info »
Protecting Statutory Conscience Rights in Health Care; Delegations of Authority

Protecting Statutory Conscience Rights in Health Care; Delegations of Authority

This HHS‐approved document is b eing submitted to the Office of publication and has not the Federal Register (OFR) for y or published in the Federal R yet been placed on public displa egister. This do...

More info »
ASHACAADoE

ASHACAADoE

CAA Accredited Program Listing as of 3/27/2019 Alabama Alabama Alabama A&M University Samford University 4900 Meridian Street Department of Communication Sciences & Disorders 800 Lakeshore Dr Normal, ...

More info »
The Fire and Fuels Extension to the Forest Vegetation Simulator: Updated Model Documentation

The Fire and Fuels Extension to the Forest Vegetation Simulator: Updated Model Documentation

United States Department of The Fire and Fuels Extension Agriculture to the Forest Vegetation Forest Service Forest Management : Updated Model Simulator Service Center Fort Collins, CO Documentation 2...

More info »
fragstats.help.4.2

fragstats.help.4.2

FRAGSTAT S HELP 21 April 2015 Kevin McGariga l, onsulting Sole proprietor, LandEco C onservation essor Prof , Department of Environmental C University of Mas sachusetts , Amherst [email protected] mcgariga o.umas...

More info »
NI XNET Hardware and Software Manual   National Instruments

NI XNET Hardware and Software Manual National Instruments

XNET NI-XNET Hardware and Software Manual NI-XNET Hardware and Software Manual July 2014 372840H-01

More info »
CityNT2019TentRoll 1

CityNT2019TentRoll 1

STATE OF NEW YORK 2 0 1 9 T E N T A T I V E A S S E S S M E N T R O L L PAGE 1 VALUATION DATE-JUL 01, 2018 COUNTY - Niagara T A X A B L E SECTION OF THE ROLL - 1 CITY - North Tonawanda TAX MAP NUMBER ...

More info »
action sd

action sd

A ETERMINANTS OF HEALTH : CTI ON ON THE SOCIAL D IOUS EXPERI ENCES LEARNING FROM PREV A background paper prepared for the Comm eterm inants of Health ission on Social D March 2005 World Health Or gani...

More info »
Microsoft Word   A) Division 245.docx

Microsoft Word A) Division 245.docx

tables Attachment Division 245, including A: Nov. 15-16, 2018, EQC meeting 1 of 121 Page Division 245 CLEANER AIR OREGON 340-245-0005 Purpose and Overview (1) This statement of purpose and overview is...

More info »
Programming Guide for ZPL II, ZBI 2, Set Get Do, Mirror, WML (en)

Programming Guide for ZPL II, ZBI 2, Set Get Do, Mirror, WML (en)

Programming Guide ZPL II ZBI 2 Set-Get-Do Mirror WML

More info »
520 UM001I EN E PowerFlex 520 Series Adjustable Frequency AC Drive User Manual

520 UM001I EN E PowerFlex 520 Series Adjustable Frequency AC Drive User Manual

User Manual PowerFlex 520-Series Adjustable Frequency AC Drive PowerFlex 523 Catalog Number 25A, Series B PowerFlex 525 Catalog Number 25B Original Instructions

More info »
Layout 1

Layout 1

Id an d Action: eas Ad dr essi ng the So cial Factors that Influence Sex ua l a nd R eproductive Health

More info »