
Meteor Universal: Language Specific Translation Evaluation for Any Target Language

Michael Denkowski and Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
{mdenkows,alavie}@cs.cmu.edu

Abstract

This paper describes Meteor Universal, released for the 2014 ACL Workshop on Statistical Machine Translation. Meteor Universal brings language specific evaluation to previously unsupported target languages by (1) automatically extracting linguistic resources (paraphrase tables and function word lists) from the bitext used to train MT systems and (2) using a universal parameter set learned from pooling human judgments of translation quality from several language directions. Meteor Universal is shown to significantly outperform baseline BLEU on two new languages, Russian (WMT13) and Hindi (WMT14).

1 Introduction

Recent WMT evaluations have seen a variety of metrics employ language specific resources to replicate human translation rankings far better than simple baselines (Callison-Burch et al., 2011; Callison-Burch et al., 2012; Macháček and Bojar, 2013; Snover et al., 2009; Denkowski and Lavie, 2011; Dahlmeier et al., 2011; Chen et al., 2012; Wang and Manning, 2012, inter alia). While the wealth of linguistic resources for the WMT languages allows the development of sophisticated metrics, most of the world's 7,000+ languages lack the prerequisites for building advanced metrics. Researchers working on low resource languages are usually limited to baseline BLEU (Papineni et al., 2002) for evaluating translation quality.

Meteor Universal brings language specific evaluation to any target language by combining linguistic resources automatically learned from MT system training data with a universal metric parameter set that generalizes across languages. Given only the bitext used to build a standard phrase-based translation system, Meteor Universal learns a paraphrase table and function word list, two of the most consistently beneficial language specific resources employed in versions of Meteor. Whereas previous versions of Meteor require human ranking judgments in the target language to learn parameters, Meteor Universal uses a single parameter set learned from pooling judgments from several languages. This universal parameter set captures general preferences shown by human evaluators across languages. We show this approach to significantly outperform baseline BLEU for two new languages, Russian and Hindi. The following sections review Meteor's scoring function (§2), describe the automatic extraction of language specific resources (§3), discuss training of the universal parameter set (§4), report experimental results (§5), describe released software (§6), and conclude (§7).

2 Meteor Scoring

Meteor evaluates translation hypotheses by aligning them to reference translations and calculating sentence-level similarity scores. For a hypothesis-reference pair, the space of possible alignments is constructed by exhaustively identifying all possible matches between the sentences according to the following matchers (a schematic sketch of this stage follows the list):

Exact: Match words if their surface forms are identical.

Stem: Stem words using a language appropriate Snowball Stemmer (Porter, 2001) and match if the stems are identical.

Synonym: Match words if they share membership in any synonym set according to the WordNet database (Miller and Fellbaum, 2007).

Paraphrase: Match phrases if they are listed as paraphrases in a language appropriate paraphrase table (described in §3.2).
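To make the matching stage concrete, the following is a minimal Python sketch of how the four matchers could enumerate candidate matches. It is illustrative only, not Meteor's released implementation: stem, synsets, and paraphrases are hypothetical stand-ins for the Snowball stemmer, the WordNet database, and the paraphrase table of §3.2.

# Illustrative sketch of match identification (not the released code).
# hyp, ref: token lists; stem(word) -> stem string; synsets(word) -> set
# of synonym set IDs; paraphrases: set of (phrase, phrase) string pairs.
def find_matches(hyp, ref, stem, synsets, paraphrases):
    matches = []  # (hyp_start, hyp_len, ref_start, ref_len, matcher)
    for i, h in enumerate(hyp):
        for j, r in enumerate(ref):
            if h == r:
                matches.append((i, 1, j, 1, "exact"))
            if stem(h) == stem(r):
                matches.append((i, 1, j, 1, "stem"))
            if synsets(h) & synsets(r):
                matches.append((i, 1, j, 1, "synonym"))
    # Paraphrase matches are phrase-to-phrase: check every span pair.
    for i in range(len(hyp)):
        for i2 in range(i + 1, len(hyp) + 1):
            for j in range(len(ref)):
                for j2 in range(j + 1, len(ref) + 1):
                    pair = (" ".join(hyp[i:i2]), " ".join(ref[j:j2]))
                    if pair in paraphrases:
                        matches.append((i, i2 - i, j, j2 - j, "paraphrase"))
    return matches

All possible matches are kept at this stage; the alignment resolution described next selects a consistent subset.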

All matches are generalized to phrase matches with a span in each sentence. Any word occurring within the span is considered covered by the match. The final alignment is then resolved as the largest subset of all matches meeting the following criteria in order of importance:

1. Require each word in each sentence to be covered by zero or one matches.

2. Maximize the number of covered words across both sentences.

3. Minimize the number of chunks, where a chunk is defined as a series of matches that is contiguous and identically ordered in both sentences.

4. Minimize the sum of absolute distances between match start indices in the two sentences. (Break ties by preferring to align phrases that occur at similar positions in both sentences.)

Alignment resolution is conducted as a beam search using a heuristic based on the above criteria.

The Meteor score for an aligned sentence pair is calculated as follows. Content and function words are identified in the hypothesis (h_c, h_f) and reference (r_c, r_f) according to a function word list (described in §3.1). For each of the matchers (m_i), count the number of content and function words covered by matches of this type in the hypothesis (m_i(h_c), m_i(h_f)) and reference (m_i(r_c), m_i(r_f)). Calculate weighted precision and recall using matcher weights (w_1...w_n) and content-function word weight (δ):

P = \frac{\sum_i w_i \cdot [\delta \cdot m_i(h_c) + (1 - \delta) \cdot m_i(h_f)]}{\delta \cdot |h_c| + (1 - \delta) \cdot |h_f|}

R = \frac{\sum_i w_i \cdot [\delta \cdot m_i(r_c) + (1 - \delta) \cdot m_i(r_f)]}{\delta \cdot |r_c| + (1 - \delta) \cdot |r_f|}

The parameterized harmonic mean of P and R (van Rijsbergen, 1979) is then calculated:

F_{mean} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}

To account for gaps and differences in word order, a fragmentation penalty is calculated using the total number of matched words (m, averaged over hypothesis and reference) and number of chunks (ch):

Pen = \gamma \cdot \left( \frac{ch}{m} \right)^{\beta}

The Meteor score is then calculated:

Score = (1 - Pen) \cdot F_{mean}

The parameters α, β, γ, δ, and w_1...w_n are tuned to maximize correlation with human judgments.
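As a worked illustration, the scoring formulas above translate directly into the following Python sketch. It is a sketch of the math as stated, not the released implementation; the argument layout (per-matcher count tuples and pre-computed totals) is a hypothetical simplification.

# Illustrative sentence-level Meteor scoring from pre-computed counts.
# counts[i] = (m_i(h_c), m_i(h_f), m_i(r_c), m_i(r_f)) for matcher i;
# hc, hf, rc, rf = content/function word totals in hypothesis/reference;
# m = matched words (averaged over both sentences); ch = chunk count.
def meteor_score(counts, hc, hf, rc, rf, m, ch,
                 alpha, beta, gamma, delta, weights):
    p_num = sum(w * (delta * m_hc + (1 - delta) * m_hf)
                for w, (m_hc, m_hf, _, _) in zip(weights, counts))
    r_num = sum(w * (delta * m_rc + (1 - delta) * m_rf)
                for w, (_, _, m_rc, m_rf) in zip(weights, counts))
    P = p_num / (delta * hc + (1 - delta) * hf)
    R = r_num / (delta * rc + (1 - delta) * rf)
    if P == 0.0 or R == 0.0:
        return 0.0  # no overlap: harmonic mean is zero
    f_mean = (P * R) / (alpha * P + (1 - alpha) * R)
    pen = gamma * (ch / m) ** beta  # fragmentation penalty
    return (1 - pen) * f_mean

With the universal parameters of Table 2, for example, this would be called with alpha=0.70, beta=1.40, gamma=0.30, delta=0.70, and weights [1.00, 0.60] for the exact and paraphrase matchers.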
3 Language Specific Resources

Meteor uses language specific resources to dramatically improve evaluation accuracy. While some resources such as WordNet and the Snowball stemmers are limited to one or a few languages, other resources can be learned from data for any language. Meteor Universal uses the same bitext used to build statistical translation systems to learn function words and paraphrases. Used in conjunction with the universal parameter set, these resources bring language specific evaluation to new target languages.

3.1 Function Word Lists

The function word list is used to discriminate between content and function words in the target language. Meteor Universal counts words in the target side of the training bitext and considers any word with relative frequency above 10^{-3} to be a function word. This list is used only during the scoring stage of evaluation, where the tunable δ parameter controls the relative weight of content versus function words. When tuned to match human judgments, this parameter usually reflects a greater importance for content words.
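The function word criterion is simple enough to sketch directly. A minimal illustration, assuming a pre-tokenized target-side corpus with one sentence per line (the file handling is a hypothetical simplification):

from collections import Counter

# Illustrative extraction of the function word list: every word whose
# relative frequency in the target-side bitext exceeds 10^-3.
def function_words(corpus_path, threshold=1e-3):
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())  # assumes pre-tokenized text
    total = sum(counts.values())
    return {word for word, c in counts.items() if c / total > threshold}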

3.2 Paraphrase Tables

Paraphrase tables allow many-to-many matches that can encapsulate any local language phenomena, including morphology, synonymy, and true paraphrasing. Identifying these matches allows far more sophisticated evaluation than is possible with simple surface form matches. In Meteor Universal, paraphrases act as the catch-all for non-exact matches. Paraphrases are automatically extracted from the training bitext using the translation pivot approach (Bannard and Callison-Burch, 2005). First, a standard phrase table is learned from the bitext (Koehn et al., 2003). Paraphrase extraction then proceeds as follows. For each target language phrase (e_1) in the table, find each source phrase f that e_1 translates. Each alternate phrase (e_2 ≠ e_1) that translates f is considered a paraphrase with probability P(f|e_1) · P(e_2|f). The total probability of e_2 being a paraphrase of e_1 is the sum over all possible pivot phrases f:

P(e_2 | e_1) = \sum_f P(f | e_1) \cdot P(e_2 | f)

To improve paraphrase precision, we apply several language independent pruning techniques. The following are applied to each paraphrase instance (e_1, f, e_2):

• Discard instances with very low probability (P(f|e_1) · P(e_2|f) < 0.001).

• Discard instances where e_1, e_2, or f contain punctuation characters.

• Discard instances where e_1, e_2, or f contain only function words (relative frequency above 10^{-3} in the bitext).

The following are applied to each final paraphrase (e_1, e_2) after summing over all instances:

• Discard paraphrases with very low probability (P(e_2|e_1) < 0.01).

• Discard paraphrases where e_2 is a sub-phrase of e_1.

This constitutes the full Meteor paraphrasing pipeline that has been used to build tables for fully supported languages (Denkowski and Lavie, 2011). Paraphrases for new languages have the added advantage of being extracted from the same bitext that MT systems use for phrase extraction, resulting in ideal paraphrase coverage for evaluated systems.
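To illustrate the pivot computation and probability pruning, the following sketch builds paraphrase probabilities from two phrase translation probability tables. The nested-dict table layout is a hypothetical simplification of a real Moses phrase table, and only the probability thresholds are shown; the punctuation and function word filters described above are omitted for brevity.

from collections import defaultdict

# Illustrative pivot-based paraphrase extraction.
# p_f_given_e[e][f] and p_e_given_f[f][e] hold phrase translation
# probabilities P(f|e) and P(e|f) read from a phrase table.
def pivot_paraphrases(p_f_given_e, p_e_given_f,
                      instance_min=0.001, final_min=0.01):
    para = defaultdict(float)
    for e1, pivots in p_f_given_e.items():
        for f, p1 in pivots.items():
            for e2, p2 in p_e_given_f.get(f, {}).items():
                if e2 == e1 or p1 * p2 < instance_min:
                    continue  # skip self-paraphrases and weak instances
                para[(e1, e2)] += p1 * p2  # sum over pivot phrases f
    # Final pruning: drop low-probability paraphrases and sub-phrases
    # (the substring test approximates the sub-phrase check).
    return {(e1, e2): p for (e1, e2), p in para.items()
            if p >= final_min and e2 not in e1}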
4 Universal Parameter Set

Traditionally, building a version of Meteor for a new target language has required a set of human-scored machine translations, most frequently in the form of WMT rankings. The general lack of availability of these judgments has severely limited the number of languages for which Meteor versions could be trained. Meteor Universal addresses this problem with the introduction of a "universal" parameter set that captures general human preferences that apply to all languages for which judgment data does exist. We learn this parameter set by pooling over 100,000 binary ranking judgments from WMT12 (Callison-Burch et al., 2012) that cover 8 language directions (details in Table 1). Data for each language is scored using the same resources (function word list and paraphrase table only) and scoring parameters are tuned to maximize agreement (Kendall's τ) over all judgments from all languages, leading to a single parameter set.

Direction   Judgments
cs-en        11,021
de-en        11,934
es-en         9,796
fr-en        11,594
en-cs        18,805
en-de        14,553
en-es        11,834
en-fr        11,562
Total       101,099

Table 1: Binary ranking judgments per language direction used to learn parameters for Meteor Universal

The universal parameter set encodes the following general human preferences:

• Prefer recall over precision.

• Prefer word choice over word order.

• Prefer correct translations of content words over function words.

• Prefer exact matches over paraphrase matches, while still giving significant credit to paraphrases.

Table 2 compares the universal parameters to those learned for specific languages in previous versions of Meteor. Notably, the universal parameter set is more balanced, showing a normalizing effect from generalizing across several language directions.

Language     α     β     γ     δ    w_exact  w_stem  w_syn  w_par
English    0.85  0.20  0.60  0.75    1.00     0.60    0.80   0.60
Czech      0.95  0.20  0.60  0.80    1.00      –       –     0.40
German     0.95  1.00  0.55  0.55    1.00     0.80     –     0.20
Spanish    0.65  1.30  0.50  0.80    1.00     0.80     –     0.60
French     0.90  1.40  0.60  0.65    1.00     0.20     –     0.40
Universal  0.70  1.40  0.30  0.70    1.00      –       –     0.60

Table 2: Comparison of parameters for language specific and universal versions of Meteor
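The tuning objective, agreement with binary ranking judgments measured as Kendall's τ, can be sketched as follows. This follows a common WMT formulation (concordant minus discordant pairs over the total) and is illustrative rather than the actual tuning code; judgments is a hypothetical list of score pairs.

# Illustrative tuning objective: Kendall's tau over binary rankings.
# Each judgment is (score of human-preferred output, score of the other).
def kendall_tau(judgments):
    concordant = sum(1 for better, worse in judgments if better > worse)
    discordant = sum(1 for better, worse in judgments if better < worse)
    return (concordant - discordant) / len(judgments)

Tuning then searches the space of (α, β, γ, δ, w_1...w_n) for the single setting that maximizes this agreement over the pooled judgments from all eight language directions.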

5 Experiments

We evaluate the Universal version of Meteor against full language dedicated versions of Meteor and baseline BLEU on the WMT13 rankings. Results for English, Czech, German, Spanish, and French are biased in favor of Meteor Universal since rankings for these target languages are included in the training data, while Russian constitutes a true held out test. We also report the results of the WMT14 Hindi evaluation task.

WMT13     M-Full  M-Universal  BLEU
English    0.214     0.206     0.124
Czech      0.092     0.085     0.044
German     0.163     0.157     0.097
Spanish    0.106     0.101     0.068
French     0.150     0.137     0.099
Russian      –       0.128     0.068

WMT14     M-Full  M-Universal  BLEU
Hindi        –       0.264     0.227

Table 3: Sentence-level correlation with human rankings (Kendall's τ) for Meteor (language specific versions), Meteor Universal, and BLEU

Shown in Table 3, Meteor Universal significantly outperforms baseline BLEU in all cases while suffering only slight degradation compared to versions of Meteor tuned for individual languages. For Russian, correlation is nearly double that of BLEU. This provides substantial evidence that Meteor Universal will further generalize, bringing improved evaluation accuracy to new target languages currently limited to baseline language independent metrics.

For the WMT14 evaluation, we use the traditional language specific versions of Meteor for all language directions except Hindi. This includes Russian, for which additional language specific resources (a Snowball word stemmer) help significantly. For Hindi, we use the release version of Meteor Universal to extract linguistic resources from the constrained training bitext provided for the shared translation task. These resources are used with the universal parameter set to score all system outputs for the English–Hindi direction.

6 Software

Meteor Universal is included in Meteor version 1.5, which is publicly released for WMT14. Meteor 1.5 can be downloaded from the official webpage[1] and a full tutorial for Meteor Universal is available online.[2] Building a version of Meteor for a new language requires a training bitext (corpus.f, corpus.e) and a standard Moses format phrase table (phrase-table.gz) (Koehn et al., 2007). To extract linguistic resources for Meteor, run the new language script:

$ python scripts/new_language.py out \
    corpus.f corpus.e phrase-table.gz

To use the resulting files to score translations with Meteor, use the new language option:

$ java -jar meteor-*.jar test ref -new out/meteor-files

Meteor 1.5, including Meteor Universal, is free software released under the terms of the GNU Lesser General Public License.

7 Conclusion

This paper describes Meteor Universal, a version of the Meteor metric that brings language specific evaluation to any target language using the same resources used to build statistical translation systems. Held out tests show Meteor Universal to significantly outperform baseline BLEU on WMT13 Russian and WMT14 Hindi. Meteor version 1.5 is freely available open source software.

Acknowledgements

This work is supported in part by the National Science Foundation under grant IIS-0915327, by the Qatar National Research Fund (a member of the Qatar Foundation) under grant NPRP 09-1140-1-177, and by the NSF-sponsored XSEDE program under grant TG-CCR110017.

[1] http://www.cs.cmu.edu/~alavie/METEOR/
[2] http://www.cs.cmu.edu/~mdenkows/meteor-universal.html

References

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 597–604, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June. Association for Computational Linguistics.

Boxing Chen, Roland Kuhn, and George Foster. 2012. Improving AMBER, an MT evaluation metric. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 59–63, Montréal, Canada, June. Association for Computational Linguistics.

Daniel Dahlmeier, Chang Liu, and Hwee Tou Ng. 2011. TESLA at WMT 2011: Translation evaluation and tunable metric. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 78–84, Edinburgh, Scotland, July. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland, July. Association for Computational Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL/HLT 2003.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.

Matouš Macháček and Ondřej Bojar. 2013. Results of the WMT13 metrics shared task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 45–51, Sofia, Bulgaria, August. Association for Computational Linguistics.

George Miller and Christiane Fellbaum. 2007. WordNet. http://wordnet.princeton.edu/.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July.

Martin Porter. 2001. Snowball: A language for stemming algorithms. http://snowball.tartarus.org/texts/.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268, Athens, Greece, March. Association for Computational Linguistics.

C. J. van Rijsbergen. 1979. Information Retrieval, chapter 7. Butterworths, London, UK, 2nd edition.

Mengqiu Wang and Christopher Manning. 2012. SPEDE: Probabilistic edit distance metrics for MT evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 76–83, Montréal, Canada, June. Association for Computational Linguistics.
