ROUGE: A Package for Automatic Evaluation of Summaries

Chin-Yew Lin
Information Sciences Institute
University of Southern California
4676 Admiralty Way
Marina del Rey, CA 90292
[email protected]

Abstract

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S included in the ROUGE summarization evaluation package and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.

1 Introduction

Traditionally, evaluation of summarization involves human judgments of different quality metrics, for example, coherence, conciseness, grammaticality, readability, and content (Mani, 2001). However, even simple manual evaluation of summaries on a large scale over a few linguistic quality questions and content coverage, as in the Document Understanding Conference (DUC) (Over and Yen, 2003), would require over 3,000 hours of human effort. This is very expensive and difficult to conduct on a frequent basis. Therefore, how to evaluate summaries automatically has drawn a lot of attention in the summarization research community in recent years. For example, Saggion et al. (2002) proposed three content-based evaluation methods that measure similarity between summaries. These methods are: cosine similarity, unit overlap (i.e. unigram or bigram), and longest common subsequence. However, they did not show how the results of these automatic evaluation methods correlate to human judgments. Following the successful application of automatic evaluation methods, such as BLEU (Papineni et al., 2001), in machine translation evaluation, Lin and Hovy (2003) showed that methods similar to BLEU, i.e. n-gram co-occurrence statistics, could be applied to evaluate summaries. In this paper, we introduce a package, ROUGE, for automatic evaluation of summaries and its evaluations. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes several automatic evaluation methods that measure the similarity between summaries. We describe ROUGE-N in Section 2, ROUGE-L in Section 3, ROUGE-W in Section 4, and ROUGE-S in Section 5. Section 6 shows how these measures correlate with human judgments using DUC 2001, 2002, and 2003 data. Section 7 concludes this paper and discusses future directions.

2 ROUGE-N: N-gram Co-Occurrence Statistics

Formally, ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries. ROUGE-N is computed as follows:

$$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} Count(gram_n)} \tag{1}$$

Where n stands for the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries.

It is clear that ROUGE-N is a recall-related measure because the denominator of the equation is the total sum of the number of n-grams occurring at the reference summary side. A closely related measure, BLEU, used in automatic evaluation of machine translation, is a precision-based measure. BLEU measures how well a candidate translation matches a set of reference translations by counting the percentage of n-grams in the candidate translation overlapping with the references. Please see Papineni et al. (2001) for details about BLEU.

Note that the number of n-grams in the denominator of the ROUGE-N formula increases as we add more references. This is intuitive and reasonable because there might exist multiple good summaries.
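To make Equation 1 concrete, the following is a minimal Python sketch, not the ROUGE package's own implementation. The function names `ngrams` and `rouge_n` are hypothetical, tokenization is plain whitespace splitting with no stemming or stopword handling, and Count_match is assumed to be the clipped count, i.e. the smaller of the candidate and reference counts for each n-gram.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n):
    """Sketch of Equation 1: n-gram recall summed over all references.

    Assumption: Count_match is the clipped count min(candidate, reference).
    """
    cand_counts = ngrams(candidate.split(), n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngrams(reference.split(), n)
        total += sum(ref_counts.values())          # denominator of Equation 1
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
    return matched / total if total else 0.0


if __name__ == "__main__":
    reference = "police killed the gunman"
    # One of three reference bigrams ("the gunman") is matched.
    print(rouge_n("police kill the gunman", [reference], 2))
    # Three of four reference unigrams are matched.
    print(rouge_n("police kill the gunman", [reference], 1))
```

Because both sums run over every reference, adding a reference grows the denominator, and an n-gram that appears in several references is counted once per reference in the numerator.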

Every time we add a reference into the pool, we expand the space of alternative summaries. By controlling what types of references we add to the reference pool, we can design evaluations that focus on different aspects of summarization. Also note that the numerator sums over all reference summaries. This effectively gives more weight to matching n-grams occurring in multiple references. Therefore a candidate summary that contains words shared by more references is favored by the ROUGE-N measure. This is again very intuitive and reasonable because we normally prefer a candidate summary that is more similar to consensus among reference summaries.

2.1 Multiple References

So far, we only demonstrated how to compute ROUGE-N using a single reference. When multiple references are used, we compute pairwise summary-level ROUGE-N between a candidate summary s and every reference, r_i, in the reference set. We then take the maximum of pairwise summary-level ROUGE-N scores as the final multiple reference ROUGE-N score. This can be written as follows:

$$\text{ROUGE-N}_{multi} = \max_i \text{ROUGE-N}(r_i, s)$$

This procedure is also applied to computation of ROUGE-L (Section 3), ROUGE-W (Section 4), and ROUGE-S (Section 5). In the implementation, we use a Jackknifing procedure. Given M references, we compute the best score over M sets of M-1 references. The final ROUGE-N score is the average of the M ROUGE-N scores using different M-1 references. The Jackknifing procedure is adopted since we often need to compare system and human performance and the reference summaries are usually the only human summaries available. Using this procedure, we are able to estimate average human performance by averaging M ROUGE-N scores of one reference vs. the rest M-1 references. Although the Jackknifing procedure is not necessary when we just want to compute ROUGE scores using multiple references, it is applied in all ROUGE score computations in the ROUGE evaluation package.

In the next section, we describe a ROUGE measure based on longest common subsequences between two summaries.

3 ROUGE-L: Longest Common Subsequence

A sequence Z = [z_1, z_2, ..., z_k] is a subsequence of another sequence X = [x_1, x_2, ..., x_m], if there exists a strictly increasing sequence [i_1, i_2, ..., i_k] of indices of X such that for all j = 1, 2, ..., k, we have x_{i_j} = z_j (Cormen et al., 1989). Given two sequences X and Y, the longest common subsequence (LCS) of X and Y is a common subsequence with maximum length. LCS has been used in identifying cognate candidates during construction of N-best translation lexicon from parallel text. Melamed (1995) used the ratio (LCSR) between the length of the LCS of two words and the length of the longer word of the two words to measure the cognateness between them. He used LCS as an approximate string matching algorithm. Saggion et al. (2002) used normalized pairwise LCS to compare similarity between two texts in automatic summarization evaluation.

3.1 Sentence-Level LCS

To apply LCS in summarization evaluation, we view a summary sentence as a sequence of words. The intuition is that the longer the LCS of two summary sentences is, the more similar the two summaries are. We propose using LCS-based F-measure to estimate the similarity between two summaries X of length m and Y of length n, assuming X is a reference summary sentence and Y is a candidate summary sentence, as follows:

$$R_{lcs} = \frac{LCS(X,Y)}{m} \tag{2}$$

$$P_{lcs} = \frac{LCS(X,Y)}{n} \tag{3}$$

$$F_{lcs} = \frac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \tag{4}$$

Where LCS(X,Y) is the length of a longest common subsequence of X and Y, and β = P_lcs/R_lcs when ∂F_lcs/∂R_lcs = ∂F_lcs/∂P_lcs. In DUC, β is set to a very big number (→ ∞). Therefore, only R_lcs is considered. We call the LCS-based F-measure, i.e. Equation 4, ROUGE-L. Notice that ROUGE-L is 1 when X = Y; while ROUGE-L is zero when LCS(X,Y) = 0, i.e. there is nothing in common between X and Y. F-measure or its equivalents has been shown to have met several theoretical criteria in measuring accuracy involving more than one factor (Van Rijsbergen, 1979). The composite factors are LCS-based recall and precision in this case. Melamed et al. (2003) used unigram F-measure to estimate machine translation quality and showed that unigram F-measure was as good as BLEU.

One advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence level word order as n-grams. The other advantage is that it automatically includes longest in-sequence common n-grams, therefore no predefined n-gram length is necessary.

ROUGE-L as defined in Equation 4 has the property that its value is less than or equal to the minimum of unigram F-measure of X and Y. Unigram recall reflects the proportion of words in X (reference summary sentence) that are also present in Y (candidate summary sentence); while unigram precision is the proportion of words in Y that are also in X. Unigram recall and precision count all co-occurring words regardless of their order; while ROUGE-L counts only in-sequence co-occurrences.

By only awarding credit to in-sequence unigram matches, ROUGE-L also captures sentence level structure in a natural way. Consider the following example:

S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police

We only consider ROUGE-2, i.e. N=2, for the purpose of explanation. Using S1 as the reference and S2 and S3 as the candidate summary sentences, S2 and S3 would have the same ROUGE-2 score, since they both have one bigram, i.e. "the gunman". However, S2 and S3 have very different meanings. In the case of ROUGE-L, S2 has a score of 3/4 = 0.75 and S3 has a score of 2/4 = 0.5, with β = 1. Therefore S2 is better than S3 according to ROUGE-L. This example also illustrates that ROUGE-L can work reliably at sentence level.

However, LCS suffers one disadvantage: it only counts the main in-sequence words; therefore, other alternative LCSes and shorter sequences are not reflected in the final score. For example, given the following candidate sentence:

S4. the gunman police killed

Using S1 as its reference, LCS counts either "the gunman" or "police killed", but not both; therefore, S4 has the same ROUGE-L score as S3. ROUGE-2 would prefer S4 over S3.

3.2 Summary-Level LCS

The previous section described how to compute the sentence-level LCS-based F-measure score. When applying it to summary-level, we take the union LCS matches between a reference summary sentence, r_i, and every candidate summary sentence, c_j. Given a reference summary of u sentences containing a total of m words and a candidate summary of v sentences containing a total of n words, the summary-level LCS-based F-measure can be computed as follows:

$$R_{lcs} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{m} \tag{5}$$

$$P_{lcs} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{n} \tag{6}$$

$$F_{lcs} = \frac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \tag{7}$$

Again β is set to a very big number (→ ∞) in DUC, i.e. only R_lcs is considered. LCS_∪(r_i, C) is the LCS score of the union longest common subsequence between reference sentence r_i and candidate summary C. For example, if r_i = w1 w2 w3 w4 w5, and C contains two sentences: c1 = w1 w2 w6 w7 w8 and c2 = w1 w3 w8 w9 w5, then the longest common subsequence of r_i and c1 is "w1 w2" and the longest common subsequence of r_i and c2 is "w1 w3 w5". The union longest common subsequence of r_i, c1, and c2 is "w1 w2 w3 w5" and LCS_∪(r_i, C) = 4/5.

3.3 ROUGE-L vs. Normalized Pairwise LCS

The normalized pairwise LCS proposed by Radev et al. (page 51, 2002) between two summaries S1 and S2, LCS(S1, S2)_MEAD, is written as follows:

$$\frac{\sum_{s_i \in S_1} \max_{s_j \in S_2} LCS(s_i, s_j) + \sum_{s_j \in S_2} \max_{s_i \in S_1} LCS(s_i, s_j)}{\sum_{s_i \in S_1} length(s_i) + \sum_{s_j \in S_2} length(s_j)} \tag{8}$$

Assuming S1 has m words and S2 has n words, Equation 8 can be rewritten as Equation 9 due to symmetry:

$$\frac{2 \sum_{s_i \in S_1} \max_{s_j \in S_2} LCS(s_i, s_j)}{m + n} \tag{9}$$

We then define MEAD LCS recall (R_lcs-MEAD) and MEAD LCS precision (P_lcs-MEAD) as follows:

$$R_{lcs\text{-}MEAD} = \frac{\sum_{s_i \in S_1} \max_{s_j \in S_2} LCS(s_i, s_j)}{m} \tag{10}$$

$$P_{lcs\text{-}MEAD} = \frac{\sum_{s_i \in S_1} \max_{s_j \in S_2} LCS(s_i, s_j)}{n} \tag{11}$$

We can rewrite Equation 9 in terms of R_lcs-MEAD and P_lcs-MEAD with a constant parameter β = 1 as follows:

$$LCS(S_1, S_2)_{MEAD} = \frac{(1+\beta^2) R_{lcs\text{-}MEAD} P_{lcs\text{-}MEAD}}{R_{lcs\text{-}MEAD} + \beta^2 P_{lcs\text{-}MEAD}} \tag{12}$$

Equation 12 shows that normalized pairwise LCS as defined in Radev et al. (2002) and implemented in MEAD is also an F-measure with β = 1. Sentence-level normalized pairwise LCS is the same as ROUGE-L with β = 1. Besides setting β = 1, summary-level normalized pairwise LCS is different from ROUGE-L in how a sentence gets its LCS score from its references. Normalized pairwise LCS takes the best LCS score while ROUGE-L takes the union LCS score.

4 ROUGE-W: Weighted Longest Common Subsequence

LCS has many nice properties as we have described in the previous sections. Unfortunately, the basic LCS also has a problem: it does not differentiate LCSes of different spatial relations within their embedding sequences. For example, given a reference sequence X and two candidate sequences Y1 and Y2 as follows:

X:  [A B C D E F G]
Y1: [A B C D H I K]
Y2: [A H B K C I D]

Y1 and Y2 have the same ROUGE-L score. However, in this case, Y1 should be the better choice than Y2 because Y1 has consecutive matches. To improve the basic LCS method, we can simply remember the length of consecutive matches encountered so far in a regular two-dimensional dynamic programming table computing LCS. We call this weighted LCS (WLCS) and use k to indicate the length of the current consecutive matches ending at words x_i and y_j. Given two sentences X and Y, the WLCS score of X and Y can be computed using the following dynamic programming procedure:

(1) For (i = 0; i <= m; i++)
      For (j = 0; j <= n; j++)
        c(i,j) = 0  // initialize c-table
        w(i,j) = 0  // initialize w-table
(2) For (i = 1; i <= m; i++)
      For (j = 1; j <= n; j++)
        If x_i = y_j Then
          // the length of consecutive matches at
          // position i-1 and j-1
          k = w(i-1,j-1)
          c(i,j) = c(i-1,j-1) + f(k+1) - f(k)
          // remember the length of consecutive
          // matches at position i, j
          w(i,j) = k+1
        Otherwise
          If c(i-1,j) > c(i,j-1) Then
            c(i,j) = c(i-1,j)
            w(i,j) = 0  // no match at i, j
          Else
            c(i,j) = c(i,j-1)
            w(i,j) = 0  // no match at i, j
(3) WLCS(X,Y) = c(m,n)

Where c is the dynamic programming table, c(i,j) stores the WLCS score ending at word x_i of X and y_j of Y, w is the table storing the length of consecutive matches ended at c table position i and j, and f is a function of consecutive matches at the table position, c(i,j). Notice that by providing different weighting function f, we can parameterize the WLCS algorithm to assign different credit to consecutive in-sequence matches.

The weighting function f must have the property that f(x+y) > f(x) + f(y) for any positive integers x and y. In other words, consecutive matches are awarded more scores than non-consecutive matches. For example, f(k) = αk - β when k >= 0, and α, β > 0. This function charges a gap penalty of -β for each non-consecutive n-gram sequence. Another possible function family is the polynomial family of the form k^α where α > 1. However, in order to normalize the final ROUGE-W score, we also prefer to have a function that has a closed-form inverse function. For example, f(k) = k^2 has a closed-form inverse function f^-1(k) = k^(1/2). F-measure based on WLCS can be computed as follows, given two sequences X of length m and Y of length n:

$$R_{wlcs} = f^{-1}\left(\frac{WLCS(X,Y)}{f(m)}\right) \tag{13}$$

$$P_{wlcs} = f^{-1}\left(\frac{WLCS(X,Y)}{f(n)}\right) \tag{14}$$

$$F_{wlcs} = \frac{(1+\beta^2) R_{wlcs} P_{wlcs}}{R_{wlcs} + \beta^2 P_{wlcs}} \tag{15}$$

Where f^-1 is the inverse function of f. In DUC, β is set to a very big number (→ ∞). Therefore, only R_wlcs is considered. We call the WLCS-based F-measure, i.e. Equation 15, ROUGE-W. Using Equation 15 and f(k) = k^2 as the weighting function, the ROUGE-W scores for sequences Y1 and Y2 are 0.571 and 0.286 respectively. Therefore, Y1 would be ranked higher than Y2 using WLCS. We use the polynomial function of the form k^α in the ROUGE evaluation package. In the next section, we introduce the skip-bigram co-occurrence statistics.

5 ROUGE-S: Skip-Bigram Co-Occurrence Statistics

Skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. Using the example given in Section 3.1:

S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S4. the gunman police killed

each sentence has C(4,2)¹ = 6 skip-bigrams. For example, S1 has the following skip-bigrams:

("police killed", "police the", "police gunman", "killed the", "killed gunman", "the gunman")

S2 has three skip-bigram matches with S1 ("police the", "police gunman", "the gunman"), S3 has one skip-bigram match with S1 ("the gunman"), and S4 has two skip-bigram matches with S1 ("police killed", "the gunman"). Given translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, we compute skip-bigram-based F-measure as follows:

$$R_{skip2} = \frac{SKIP2(X,Y)}{C(m,2)} \tag{16}$$

$$P_{skip2} = \frac{SKIP2(X,Y)}{C(n,2)} \tag{17}$$

$$F_{skip2} = \frac{(1+\beta^2) R_{skip2} P_{skip2}}{R_{skip2} + \beta^2 P_{skip2}} \tag{18}$$

Where SKIP2(X,Y) is the number of skip-bigram matches between X and Y, β controls the relative importance of P_skip2 and R_skip2, and C is the combination function. We call the skip-bigram-based F-measure, i.e. Equation 18, ROUGE-S.

Using Equation 18 with β = 1 and S1 as the reference, S2's ROUGE-S score is 0.5, S3 is 0.167, and S4 is 0.333. Therefore, S2 is better than S3 and S4, and S4 is better than S3. This result is more intuitive than using BLEU-2 and ROUGE-L. One advantage of skip-bigram vs. BLEU is that it does not require consecutive matches but is still sensitive to word order. Comparing skip-bigram with LCS, skip-bigram counts all in-order matching word pairs while LCS only counts one longest common subsequence.

Applying skip-bigram without any constraint on the distance between the words, spurious matches such as "the the" or "of in" might be counted as valid matches. To reduce these spurious matches, we can limit the maximum skip distance, d_skip, between two in-order words that is allowed to form a skip-bigram. For example, if we set d_skip to 0 then ROUGE-S is equivalent to bigram overlap F-measure. If we set d_skip to 4 then only word pairs of at most 4 words apart can form skip-bigrams.

Adjusting Equations 16, 17, and 18 to use maximum skip distance limit is straightforward: we only count the skip-bigram matches, SKIP2(X,Y), within the maximum skip distance and replace the denominators of Equation 16, C(m,2), and Equation 17, C(n,2), with the actual numbers of within-distance skip-bigrams from the reference and the candidate respectively.

5.1 ROUGE-SU: Extension of ROUGE-S

One potential problem for ROUGE-S is that it does not give any credit to a candidate sentence if the sentence does not have any word pair co-occurring with its references. For example, the following sentence has a ROUGE-S score of zero:

S5. gunman the killed police

S5 is the exact reverse of S1 and there is no skip-bigram match between them. However, we would like to differentiate sentences similar to S5 from sentences that do not have single word co-occurrence with S1. To achieve this, we extend ROUGE-S with the addition of unigram as counting unit. The extended version is called ROUGE-SU. We can also obtain ROUGE-SU from ROUGE-S by adding a begin-of-sentence marker at the beginning of candidate and reference sentences.

6 Evaluations of ROUGE

To assess the effectiveness of ROUGE measures, we compute the correlation between ROUGE assigned summary scores and human assigned summary scores. The intuition is that a good evaluation measure should assign a good score to a good summary and a bad score to a bad summary. The ground truth is based on human assigned scores. Acquiring human judgments is usually very expensive; fortunately, we have DUC 2001, 2002, and 2003 evaluation data that include human judgments for the following:

• Single document summaries of about 100 words: 12 systems² for DUC 2001 and 14 systems for 2002. 149 single document summaries were judged per system in DUC 2001 and 295 were judged in DUC 2002.
• Single document very short summaries of about 10 words (headline-like, keywords, or phrases): 14 systems for DUC 2003. 624 very short summaries were judged per system in DUC 2003.
• Multi-document summaries of about 10 words: 6 systems for DUC 2002; 50 words: 14 systems for DUC 2001 and 10 systems for DUC 2002; 100 words: 14 systems for DUC 2001, 10 systems for DUC 2002, and 18 systems for DUC 2003; 200 words: 14 systems for DUC 2001 and 10 systems for DUC 2002; 400 words: 14 systems for DUC 2001. 29 summaries were judged per system per summary size in DUC 2001, 59 were judged in DUC 2002, and 30 were judged in DUC 2003.

¹ C(4,2) = 4!/(2!*2!) = 6.
² All systems include 1 or 2 baselines. Please see DUC website for details.
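As a concrete check of the LCS-based measures of Sections 3.1 and 4, the following is a minimal Python sketch of sentence-level ROUGE-L (Equations 2 to 4) and of the WLCS recall (Equation 13). It is an illustration under stated assumptions rather than the package's code: the function names are hypothetical, tokenization is plain whitespace splitting, and the weighting function is assumed to be the polynomial f(k) = k^α with α = 2, matching the worked example in Section 4.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of two sequences."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]


def rouge_l(reference, candidate, beta=1.0):
    """Sentence-level LCS F-measure (Equation 4) on whitespace tokens."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(x), lcs / len(y)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


def wlcs(x, y, alpha=2.0):
    """Weighted LCS score with the assumed weighting f(k) = k ** alpha."""
    def f(k):
        return k ** alpha

    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # WLCS scores
    w = [[0] * (n + 1) for _ in range(m + 1)]    # consecutive-match lengths
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                # No match at (i, j): w[i][j] stays 0.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]


def rouge_w_recall(reference, candidate, alpha=2.0):
    """Equation 13 with f(k) = k ** alpha, so f^-1(v) = v ** (1 / alpha)."""
    if not reference:
        return 0.0
    return (wlcs(reference, candidate, alpha) / len(reference) ** alpha) ** (1 / alpha)


if __name__ == "__main__":
    s1 = "police killed the gunman"
    print(rouge_l(s1, "police kill the gunman"))   # S2
    print(rouge_l(s1, "the gunman kill police"))   # S3
    x = list("ABCDEFG")
    print(round(rouge_w_recall(x, list("ABCDHIK")), 3))  # Y1
    print(round(rouge_w_recall(x, list("AHBKCID")), 3))  # Y2
```

With these assumptions the sketch gives 0.75 and 0.5 for S2 and S3, and 0.571 and 0.286 for Y1 and Y2, which agrees with the values quoted in Sections 3.1 and 4. Only the recall is computed for the weighted variant because β is set to a very large number in DUC.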

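The skip-bigram F-measure of Section 5 (Equations 16 to 18) can be sketched in Python as follows. This is a hedged illustration, not the package's implementation: `skip_bigrams` and `rouge_s` are hypothetical names, tokenization is whitespace splitting, repeated skip-bigrams are matched with clipped counts, and the maximum skip distance is assumed to mean the number of words allowed between the two words of a pair, so that a distance of 0 reduces to ordinary bigrams.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens, max_skip=None):
    """Count in-order word pairs; max_skip limits the words in between."""
    pairs = Counter()
    for i, j in combinations(range(len(tokens)), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs[(tokens[i], tokens[j])] += 1
    return pairs


def rouge_s(reference, candidate, beta=1.0, max_skip=None):
    """Skip-bigram F-measure (Equation 18) for one reference sentence."""
    ref_pairs = skip_bigrams(reference.split(), max_skip)
    cand_pairs = skip_bigrams(candidate.split(), max_skip)
    # SKIP2(X, Y): clipped number of shared skip-bigrams (an assumption).
    matches = sum(min(count, cand_pairs[pair])
                  for pair, count in ref_pairs.items())
    if matches == 0:
        return 0.0
    # Without a distance limit these totals equal C(m,2) and C(n,2).
    recall = matches / sum(ref_pairs.values())
    precision = matches / sum(cand_pairs.values())
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    s1 = "police killed the gunman"
    for candidate in ("police kill the gunman",     # S2
                      "the gunman kill police",     # S3
                      "the gunman police killed",   # S4
                      "gunman the killed police"):  # S5
        print(candidate, round(rouge_s(s1, candidate), 3))
```

Under these assumptions the sketch yields 0.5, 0.167, 0.333, and 0.0 for S2 to S5 against S1, in line with the scores stated in Sections 5 and 5.1. Dividing by the actual number of within-distance pairs, rather than by C(m,2) and C(n,2), is what lets the same function cover the distance-limited variant.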
Besides these human judgments, we also have 3 sets of manual summaries for DUC 2001, 2 sets for DUC 2002, and 4 sets for DUC 2003. Human judges assigned content coverage scores to a candidate summary by examining the percentage of content overlap between a manual summary unit, i.e. elementary discourse unit or sentence, and the candidate summary using Summary Evaluation Environment³ (SEE) developed by the University of Southern California's Information Sciences Institute (ISI). The overall candidate summary score is the average of the content coverage scores of all the units in the manual summary. Note that human judges used only one manual summary in all the evaluations although multiple alternative summaries were available.

With the DUC data, we computed Pearson's product moment correlation coefficients, Spearman's rank order correlation coefficients, and Kendall's correlation coefficients between systems' average ROUGE scores and their human assigned average coverage scores using single reference and multiple references. To investigate the effect of stemming and inclusion or exclusion of stopwords, we also ran experiments over original automatic and manual summaries (CASE set), stemmed⁴ version of the summaries (STEM set), and stopped version of the summaries (STOP set). For example, we computed ROUGE scores for the 12 systems that participated in the DUC 2001 single document summarization evaluation using the CASE set with single reference and then calculated the three correlation scores for these 12 systems' ROUGE scores vs. human assigned average coverage scores. After that we repeated the process using multiple references and then using STEM and STOP sets. Therefore, 2 (multi or single) x 3 (CASE, STEM, or STOP) x 3 (Pearson, Spearman, or Kendall) = 18 data points were collected for each ROUGE measure and each DUC task. To assess the significance of the results, we applied bootstrap resampling technique (Davison and Hinkley, 1997) to estimate 95% confidence intervals for every correlation computation.

17 ROUGE measures were tested for each run using ROUGE evaluation package v1.2.1: ROUGE-N with N = 1 to 9, ROUGE-L, ROUGE-W with weighting factor α = 1.2, ROUGE-S and ROUGE-SU with maximum skip distance d_skip = 1, 4, and 9. Due to limitation of space, we only report correlation analysis results based on Pearson's correlation coefficient. Correlation analyses based on Spearman's and Kendall's correlation coefficients are tracking Pearson's very closely and will be posted later at the ROUGE website⁵ for reference. The critical value⁶ for Pearson's correlation is 0.632 at 95% confidence with 8 degrees of freedom.

³ SEE is available online at http://www.isi.edu/~cyl.
⁴ Porter's stemmer was used.
⁵ ROUGE website: http://www.isi.edu/~cyl/ROUGE.
⁶ The critical values for Pearson's correlation at 95% confidence with 10, 12, 14, and 16 degrees of freedom are 0.576, 0.532, 0.497, and 0.468 respectively.

[Table 1: Pearson's correlations of 17 ROUGE measure scores vs. human judgments for the DUC 2001 and 2002 100 words single document summarization tasks. Columns: DUC 2001 (1 REF, 3 REFS) and DUC 2002 (1 REF, 2 REFS), each over the CASE, STEM, and STOP sets. Rows: R-1 to R-9, R-L, R-S*, R-S4, R-S9, R-SU*, R-SU4, R-SU9, R-W-1.2.]

Table 1 shows the Pearson's correlation coefficients of the 17 ROUGE measures vs. human judgments on DUC 2001 and 2002 100 words single document summarization data. The best values in each column are marked with dark (green) color and statistically equivalent values to the best values are marked with gray. We found that correlations were not affected by stemming or removal of stopwords in this data set, ROUGE-2 performed better among the ROUGE-N variants, ROUGE-L, ROUGE-W, and ROUGE-S were all performing well, and using multiple references improved performance though not much. All ROUGE measures achieved very good correlation with human judgments in the DUC 2002 data. This might be due to the double sample size in DUC 2002 (295 vs. 149 in DUC 2001) for each system.

[Table 2: Pearson's correlations of 17 ROUGE measure scores vs. human judgments for the DUC 2003 very short summary task. Columns: DUC 2003 10 words single document, CASE, STEM, and STOP sets, each with 1 REF and 4 REFS. Rows: R-1 to R-9, R-L, R-S*, R-S4, R-S9, R-SU*, R-SU4, R-SU9, R-W-1.2.]

Table 2 shows the correlation analysis results on the DUC 2003 single document very short summary data. We found that ROUGE-1, ROUGE-L, ROUGE-SU4 and 9, and ROUGE-W were very good measures in this category, ROUGE-N with N > 1 performed significantly worse than all other measures, and exclusion of stopwords improved performance in general except for ROUGE-1. Due to the large number of samples (624) in this data set, using multiple references did not improve correlations.

[Table 3: Pearson's correlations of 17 ROUGE measure scores vs. human judgments for the DUC 2001, 2002, and 2003 multi-document summarization tasks. Panels: (A1) DUC 2001 100 words multi (1 REF, 3 REFS); (A2) DUC 2002 100 words multi (1 REF, 2 REFS); (A3) DUC 2003 100 words multi (1 REF, 4 REFS); (C) DUC02 10; (D1) DUC01 50; (D2) DUC02 50; (E1) DUC01 200; (E2) DUC02 200; (F) DUC01 400. Each panel covers the CASE, STEM, and STOP sets. Rows: R-1 to R-9, R-L, R-S*, R-S4, R-S9, R-SU*, R-SU4, R-SU9, R-W-1.2.]

In Table 3 A1, A2, and A3, we show correlation analysis results on DUC 2001, 2002, and 2003 100 words multi-document summarization data. The results indicated that using multiple references improved correlation and exclusion of stopwords usually improved performance. ROUGE-1, 2, and 3 performed fine but were not consistent. ROUGE-1, ROUGE-S4, ROUGE-SU4, ROUGE-S9, and ROUGE-SU9 with stopword removal had correlation above 0.70. ROUGE-L and ROUGE-W did not work well in this set of data.

Table 3 C, D1, D2, E1, E2, and F show the correlation analyses using multiple references on the rest of DUC data. These results again suggested that exclusion of stopwords achieved better performance especially in multi-document summaries of 50 words. Better correlations (> 0.70) were observed on long summary tasks, i.e. 200 and 400 words summaries. The relative performance of ROUGE measures followed the pattern of the 100 words multi-document summarization task.

Comparing the results in Table 3 with Tables 1 and 2, we found that correlation values in the multi-document tasks rarely reached high 90% except in long summary tasks. One possible explanation of this outcome is that we did not have a large number of samples for the multi-document tasks. In the single document summarization tasks we had over 100 samples; while we only had about 30 samples in the multi-document tasks. The only tasks that had over 30 samples were from DUC 2002 and the correlations of ROUGE measures with human judgments on the 100 words summary task were much better and more stable than similar tasks in DUC 2001 and 2003. Statistically stable human judgments of system performance might not be obtained due to lack of samples and this in turn caused instability of correlation analyses.

7 Conclusions

In this paper, we introduced ROUGE, an automatic evaluation package for summarization, and conducted comprehensive evaluations of the automatic measures included in the ROUGE package using three years of DUC data. To check the significance of the results, we estimated confidence intervals of correlations using bootstrap resampling. We found that (1) ROUGE-2, ROUGE-L, ROUGE-W, and ROUGE-S worked well in single document summarization tasks, (2) ROUGE-1, ROUGE-L, ROUGE-W, ROUGE-SU4, and ROUGE-SU9 performed very well in evaluating very short summaries (or headline-like summaries), (3) correlation of high 90% was hard to achieve for multi-document summarization tasks but ROUGE-1, ROUGE-2, ROUGE-S4, ROUGE-S9, ROUGE-SU4, and ROUGE-SU9 worked reasonably well when stopwords were excluded from matching, (4) exclusion of stopwords usually improved correlation, and (5) correlations to human judgments were increased by using multiple references.

In summary, we showed that the ROUGE package could be used effectively in automatic evaluation of summaries. In a separate study (Lin and Och 2004),

ROUGE-L, W, and S were also shown to be very effective in automatic evaluation of machine translation. The stability and reliability of ROUGE at different sample sizes was reported by the author in (Lin, 2004). However, how to achieve high correlation with human judgments in multi-document summarization tasks, as ROUGE already did in single document summarization tasks, is still an open research topic.

8 Acknowledgements

The author would like to thank the anonymous reviewers for their constructive comments, Paul Over at NIST, U.S.A, and ROUGE users around the world for testing and providing useful feedback on earlier versions of the ROUGE evaluation package, and the DARPA TIDES project for supporting this research.

References

Cormen, T. R., C. E. Leiserson, and R. L. Rivest. 1989. Introduction to Algorithms. The MIT Press.

Davison, A. C. and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge University Press.

Lin, C.-Y. and E. H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada.

Lin, C.-Y. 2004. Looking for a few good metrics: ROUGE and its evaluation. In Proceedings of NTCIR Workshop 2004, Tokyo, Japan.

Lin, C.-Y. and F. J. Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of 42nd Annual Meeting of ACL (ACL 2004), Barcelona, Spain.

Mani, I. 2001. Automatic Summarization. John Benjamins Publishing Co.

Melamed, I. D. 1995. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. In Proceedings of the 3rd Workshop on Very Large Corpora (WVLC3). Boston, U.S.A.

Melamed, I. D., R. Green and J. P. Turian. 2003. Precision and recall of machine translation. In Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada.

Over, P. and J. Yen. 2003. An introduction to DUC 2003 – Intrinsic evaluation of generic news text summarization systems. http://www-nlpir.nist.gov/projects/duc/pubs/2003slides/duc2003intro.pdf

Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: A method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022).

Saggion H., D. Radev, S. Teufel, and W. Lam. 2002. Meta-evaluation of summaries in a cross-lingual environment using content-based metrics. In Proceedings of COLING-2002, Taipei, Taiwan.

Radev, D., S. Teufel, H. Saggion, W. Lam, J. Blitzer, A. Gelebi, H. Qi, E. Drabek, and D. Liu. 2002. Evaluation of Text Summarization in a Cross-Lingual Information Retrieval Framework. Technical report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA.

Van Rijsbergen, C. J. 1979. Information Retrieval. Butterworths. London.
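
The analysis summarized in Sections 6 and 7 rests on two steps: computing Pearson's correlation between automatic scores and human judgments, and estimating a confidence interval for that correlation by bootstrap resampling. The following is a minimal sketch of those two steps, not the procedure used by the ROUGE package itself. The function names (pearson, bootstrap_ci), the percentile method for the interval, the number of resamples, and the example scores are all assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant sample is reported as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def bootstrap_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation of paired samples.

    Pairs (xs[i], ys[i]) are resampled together, with replacement.
    """
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical per-system scores (not taken from the DUC data).
metric_scores = [0.31, 0.35, 0.28, 0.40, 0.37, 0.33, 0.29, 0.42]
human_scores = [0.45, 0.52, 0.41, 0.60, 0.50, 0.49, 0.40, 0.63]

r = pearson(metric_scores, human_scores)
low, high = bootstrap_ci(metric_scores, human_scores)
print(f"r = {r:.2f}, 95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```

With only a handful of systems the interval is typically wide, which is consistent with the observation in Section 6 that small sample sizes make correlation estimates unstable.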
