Machine Translation Evaluation

September 5, 2016

The Meteor automatic evaluation metric scores machine translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym, and paraphrase matches between words and phrases. Segment- and system-level metric scores are calculated from the alignments between hypothesis-reference pairs. The metric includes several free parameters that are tuned to emulate various human judgment tasks, including WMT ranking and NIST adequacy. The current version also includes a tuning configuration for use with MERT and MIRA. Meteor has extended support (paraphrase matching and tuned parameters) for the following languages: English, Czech, German, French, Spanish, and Arabic. Meteor is implemented in pure Java and requires no installation or dependencies to score MT output. On average, hypotheses are scored at a rate of 500 segments per second per CPU core. Meteor consistently demonstrates high correlation with human judgments in independent evaluations such as EMNLP WMT 2011 and NIST Metrics MATR 2010.
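
At the segment level, Meteor combines a weighted harmonic mean of unigram precision and recall with a fragmentation penalty based on how many contiguous chunks the matched words form. The sketch below (in Java, like Meteor itself) illustrates that scoring formula using exact matches only; the greedy alignment, the chunk counting, and the alpha/beta/gamma values are simplified placeholders for illustration, not Meteor's full alignment algorithm or its tuned, language-specific parameters.

  import java.util.Arrays;
  import java.util.List;

  public class MeteorSketch {

      // Simplified Meteor-style segment score over exact unigram matches only.
      // alpha weights precision vs. recall; beta and gamma shape the
      // fragmentation penalty. The values passed in main are illustrative.
      public static double score(String hypothesis, String reference,
                                 double alpha, double beta, double gamma) {
          List<String> hyp = Arrays.asList(hypothesis.toLowerCase().split("\\s+"));
          List<String> ref = Arrays.asList(reference.toLowerCase().split("\\s+"));

          // Greedy exact alignment: each reference word is used at most once.
          boolean[] refUsed = new boolean[ref.size()];
          int[] hypToRef = new int[hyp.size()];
          Arrays.fill(hypToRef, -1);
          int matches = 0;
          for (int i = 0; i < hyp.size(); i++) {
              for (int j = 0; j < ref.size(); j++) {
                  if (!refUsed[j] && hyp.get(i).equals(ref.get(j))) {
                      refUsed[j] = true;
                      hypToRef[i] = j;
                      matches++;
                      break;
                  }
              }
          }
          if (matches == 0) {
              return 0.0;
          }

          // Chunks: maximal runs of matched hypothesis words whose reference
          // positions are contiguous and in the same order.
          int chunks = 0;
          int prevRef = -2;
          for (int i = 0; i < hyp.size(); i++) {
              if (hypToRef[i] == -1) {
                  prevRef = -2;
                  continue;
              }
              if (hypToRef[i] != prevRef + 1) {
                  chunks++;
              }
              prevRef = hypToRef[i];
          }

          double precision = (double) matches / hyp.size();
          double recall = (double) matches / ref.size();
          double fMean = precision * recall / (alpha * precision + (1 - alpha) * recall);
          double penalty = gamma * Math.pow((double) chunks / matches, beta);
          return (1.0 - penalty) * fMean;
      }

      public static void main(String[] args) {
          System.out.printf("%.4f%n",
              score("the cat sat on the mat", "the cat is on the mat", 0.85, 0.2, 0.6));
      }
  }

Real Meteor additionally matches stems, synonyms, and paraphrases and resolves ambiguous alignments rather than matching greedily, which this sketch omits.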

Meteor X-ray uses XeTeX and Gnuplot to create visualizations of alignment matrices and score distributions from the output of Meteor. These visualizations allow easy comparison of MT systems or system configurations and support in-depth performance analysis by examining the underlying Meteor alignments. Final output is in PDF form, with intermediate TeX and Gnuplot files preserved for inclusion in reports or presentations. The Examples section includes sample alignment matrices and score distributions from Meteor X-ray.
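
As a rough illustration of what an alignment matrix encodes (reference words along one axis, hypothesis words along the other, with marks at matched positions), here is a small text-only sketch; it is a hypothetical helper, not part of Meteor or Meteor X-ray, which render these matrices with XeTeX and Gnuplot.

  // Hypothetical example: prints a plain-text alignment matrix with reference
  // words as rows, hypothesis word indices as columns, and an 'x' wherever the
  // two words match exactly.
  public class AlignmentMatrixSketch {
      public static void main(String[] args) {
          String[] hyp = "the cat sat on the mat".split("\\s+");
          String[] ref = "the cat is on the mat".split("\\s+");

          // Header row: hypothesis word indices.
          System.out.print("              ");
          for (int j = 0; j < hyp.length; j++) {
              System.out.printf("%3d", j);
          }
          System.out.println();

          // One row per reference word, marking exact matches.
          for (int i = 0; i < ref.length; i++) {
              System.out.printf("%-14s", ref[i]);
              for (int j = 0; j < hyp.length; j++) {
                  System.out.print(ref[i].equals(hyp[j]) ? "  x" : "  .");
              }
              System.out.println();
          }
      }
  }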

Meteor (current version, including paraphrase tables and X-ray):
  • Michael Denkowski and Alon Lavie, "Meteor Universal: Language Specific Translation Evaluation for Any Target Language", Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014 [PDF] [bib]
Older versions of Meteor:
  • Michael Denkowski and Alon Lavie, "Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems", Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation, 2011 [PDF] [bib]

  • Michael Denkowski and Alon Lavie, "METEOR-NEXT and the METEOR Paraphrase Tables: Improved Evaluation Support For Five Target Languages", Proceedings of the ACL 2010 Joint Workshop on Statistical Machine Translation and Metrics MATR, 2010 [PDF] [bib]

  • Michael Denkowski and Alon Lavie, "Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level", Proceedings of NAACL/HLT, 2010 [PDF] [bib]

  • Alon Lavie and Michael Denkowski, "The METEOR Metric for Automatic Evaluation of Machine Translation", Machine Translation, 2010 [PDF]

  • Abhaya Agarwal and Alon Lavie, "METEOR, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output", Proceedings of the ACL 2008 Workshop on Statistical Machine Translation, 2008 [PDF]

  • Alon Lavie and Abhaya Agarwal, "METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments", Proceedings of the ACL 2007 Workshop on Statistical Machine Translation, 2007 [PDF]

  • Satanjeev Banerjee and Alon Lavie, "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments", Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 2005 [PDF]

