As the use of machine translation (MT) becomes more popular in the global economy and MT engines continue to develop, it becomes crucial to have reliable methods to evaluate their performance. Evaluation helps improve the engines’ performance, customize them for specific purposes, and ensure they follow the best translation and localization practices.
There are two main approaches to MT evaluation: human and automatic. Human evaluation involves professional translators assessing the quality of the translated text. While this is the most reliable method, it can be expensive and time-consuming. Automatic evaluation, on the other hand, uses computer programs to score the translation quality based on predefined criteria. While not perfect, automatic evaluation offers several advantages: it is fast, less expensive, and allows easy comparison between different machine translation systems.
However, evaluating the quality of machine translation can be challenging. Ideally, we would want an evaluation metric that perfectly aligns with human judgment of translation quality. The challenge lies precisely in how to measure “quality”. It is safe to say that quality goes beyond accuracy, fluency, naturalness, and formality level; but then, is there a method that can account for all of that?
In this blog post, we’ll take a look at the types of MT evaluation and go over some of the most renowned automated metrics to answer that question.
Types of MT Evaluation
There are three main categories of automatic MT evaluation (Source: MachineTranslation.com):
- Sentence-based: This approach compares individual sentences in the machine translation output to sentences in a reference translation.
- Document-based: This method looks at the overall quality of a translated document, rather than focusing on individual sentences.
- Context-based: The context-based approach considers the purpose of the translation, such as the genre of the text and the target audience.
Within these categories, there are both traditional metrics and newer neural metrics. Traditional metrics, like BLEU and TER, have been widely used for many years despite their limitations. Neural metrics, based on deep learning techniques, are a promising new direction in MT evaluation.
Traditional Metrics
Some of the most common traditional metrics include (Source: Medium.com):
BLEU (Bilingual Evaluation Understudy)
The BLEU metric evaluates MT outputs by measuring how many word sequences (n-grams) in the translated text match those in one or more human reference translations. Although BLEU is a common standard, it does not consistently align with human evaluation. For example, a translation might receive a high BLEU rating for being grammatically sound, yet it could still come across as unnatural or awkward.
Another downside is that BLEU penalizes differences in word order and sentence length, which makes some language pairs score lower than others. However, it remains the most widely used MT evaluation method due to its low computational requirements and cost. Overall, BLEU gives a quick indication of vocabulary overlap and accuracy.
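As a rough illustration, here is a minimal sketch of computing a corpus-level BLEU score with the sacreBLEU Python library; the example sentences are invented, and the exact API may vary slightly between versions.

```python
# Minimal sketch: corpus-level BLEU with sacreBLEU (pip install sacrebleu).
# The hypothesis and reference sentences below are invented for illustration.
import sacrebleu

hypotheses = ["The cat sat on the mat.", "He is reading a book in the garden."]
references = ["The cat is sitting on the mat.", "He reads a book in the garden."]

# sacreBLEU takes a list of hypothesis strings and a list of reference streams
# (one inner list per set of references, aligned with the hypotheses).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")  # higher means more n-gram overlap with the references
```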
METEOR (Metric for Evaluation of Translation with Explicit Ordering)
METEOR goes beyond just comparing word sequences by also taking into account synonyms, the base forms of words, and the sequence in which words appear. This word-based metric aims to make up for BLEU’s shortcomings by including precision and recall weights, which allows it to account for word variations and word order.
In comparison with BLEU, this approach seems more nuanced and correlates better with human judgment. Nevertheless, METEOR has its own set of constraints, including its dependence on existing lists of synonyms to function.
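Here is a minimal sketch of sentence-level METEOR using NLTK; note that recent NLTK versions expect pre-tokenized input and need the WordNet data for synonym matching, and the sentences are invented for illustration.

```python
# Minimal sketch: sentence-level METEOR with NLTK (pip install nltk).
# Synonym and stem matching rely on WordNet, so the corpus data must be downloaded.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # WordNet for synonym lookups
nltk.download("omw-1.4", quiet=True)   # multilingual WordNet data used by newer NLTK

reference = "The cat is sitting on the mat".split()
hypothesis = "The cat sits on the mat".split()

# METEOR accepts one or more tokenized references per hypothesis.
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.3f}")
```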
TER (Translation Error Rate)
TER determines how many changes need to be made to a machine translation output to match a standard reference translation. These changes can include adding, removing, or replacing words.
TER is appreciated for its straightforwardness, ease of use, and low cost. It is particularly useful when the translation is destined for post-editing; however, it does not consider how smooth or natural the translation sounds, and it does not perform well with morphologically rich languages.
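As a sketch, TER can be computed with sacreBLEU's TER implementation; the sentences below are invented, and lower scores are better since TER counts edits rather than matches.

```python
# Minimal sketch: corpus-level TER with sacreBLEU (pip install sacrebleu).
# TER is roughly the number of edits needed to turn the hypothesis into the
# reference, divided by the reference length, so lower is better.
from sacrebleu.metrics import TER

hypotheses = ["The weather is nice today in the Paris."]
references = ["The weather in Paris is nice today."]

ter = TER()
result = ter.corpus_score(hypotheses, [references])
print(f"TER: {result.score:.2f}")  # expressed as a percentage of reference words
```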
Neural Metrics and Recent Developments
Newer neural metrics, based on deep learning techniques, offer a more nuanced approach to MT evaluation. Let’s take a look at some of them:
COMET (Crosslingual Optimized Metric for Evaluation of Translation)
The COMET metric leverages a pre-trained multilingual language model (XLM-RoBERTa) to predict how human evaluators would score the translation on factors like fluency, adequacy, and grammar. Unlike previous metrics, this approach also takes the source text and human quality ratings into account. Built on multilingual language models, COMET aims to capture more linguistic understanding and to directly mimic human judgment.
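A rough sketch of how COMET scoring might look with the open-source unbabel-comet package is shown below; the model name and exact API can differ between versions, and the example segment is invented, so treat this as an outline rather than definitive usage.

```python
# Sketch: scoring with the Unbabel COMET package (pip install unbabel-comet).
# The model name is one commonly used reference-based checkpoint; it may change
# between releases. The example segment is invented.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Der Zug kommt um 9 Uhr an.",        # source sentence
        "mt":  "The train arrives at 9 o'clock.",   # machine translation
        "ref": "The train is arriving at 9 am.",    # human reference
    }
]

# predict() returns segment-level scores plus a system-level average.
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```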
BLEURT (Bilingual Evaluation Understudy with Representations from Transformers)
This metric is designed to detect complex semantic parallels between sentences. BLEURT provides strong and reliable assessments that closely mirror human evaluations, achieving a new standard of excellence. It leverages the latest breakthroughs in transfer learning to recognize common language patterns, e.g., rephrasing.
Because BLEURT is trained on machine translation outputs, it is better equipped to recognize “good” MT; the trade-off is that it is more expensive to train.
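The sketch below outlines how BLEURT scoring typically looks with Google's reference implementation; the checkpoint name is a placeholder for a checkpoint that must be downloaded separately, and the sentences are invented.

```python
# Sketch: scoring with the BLEURT reference implementation
# (installed from the google-research/bleurt GitHub repository).
# "BLEURT-20" stands in for the path to a checkpoint downloaded beforehand.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(
    references=["The cat is sitting on the mat."],
    candidates=["The cat sat on the mat."],
)
print(scores)  # one learned quality score per candidate/reference pair
```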
hLEPOR (harmonic mean of enhanced Length Penalty, Precision, n-gram Position difference Penalty and Recall)
The enhanced LEPOR metric, known as hLEPOR, addresses earlier criticism that the metric was insensitive to syntax and meaning. It incorporates linguistic elements such as part-of-speech (POS) tags as a feature that reflects both syntactic and semantic considerations. For example, if a word in the translated sentence is a verb when it should be a noun, a penalty is applied.
Conversely, if the POS matches but the specific word differs, like “good” versus “nice”, the translation is still awarded some points. hLEPOR derives the final score from a weighted combination of the scores at both the word and POS levels. It also addresses the weak spots of BLEU (sentence length, word order, etc.) and has shown better correlation with human judgment.
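To make the idea of mixing word-level and POS-level evidence concrete, here is a deliberately simplified toy sketch; it is not the actual hLEPOR formula (which also includes length penalties, position-difference penalties, and harmonic weighting), and the weights and example tags are invented.

```python
# Toy illustration only: a weighted mix of word-level and POS-level matches.
# This is NOT the real hLEPOR formula; it only shows how a POS match can earn
# partial credit when the surface word differs.
def toy_pos_aware_score(hyp, ref, word_weight=0.7, pos_weight=0.3):
    """hyp and ref are equal-length lists of (word, pos) pairs."""
    word_hits = sum(hw == rw for (hw, _), (rw, _) in zip(hyp, ref))
    pos_hits = sum(hp == rp for (_, hp), (_, rp) in zip(hyp, ref))
    n = len(ref)
    return word_weight * (word_hits / n) + pos_weight * (pos_hits / n)

hyp = [("this", "DET"), ("movie", "NOUN"), ("is", "VERB"), ("nice", "ADJ")]
ref = [("this", "DET"), ("film", "NOUN"), ("is", "VERB"), ("good", "ADJ")]

# "movie"/"film" and "nice"/"good" miss at the word level but still earn
# partial credit because the parts of speech line up.
print(round(toy_pos_aware_score(hyp, ref), 3))
```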
Considerations for MT Evaluation
Choosing the right MT evaluation metric depends on the specific purpose of the translation.
For general-purpose evaluation, BLEU remains a popular choice due to its simplicity and ease of interpretation. However, it is important to be aware of its limitations and to consider using it alongside other metrics. For tasks where fluency is crucial, metrics like COMET and BLEURT can be valuable tools.
For highly specialized domains, human evaluation may still be the best option. Neural metrics may still not be able to capture the nuances of domain-specific language.
It is important to note that there is no single “best” metric for MT evaluation. The most effective approach often involves combining different metrics depending on the purpose of the translation. Consider using both traditional and neural metrics, alongside human evaluation where feasible, and use the resulting data to fine-tune your systems and improve the performance of your MT engines.