The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification
What is this paper about?
Text Simplification consists of rewriting sentences to make them easier to read and understand, while preserving as much of their original meaning as possible. Simplified texts can help non-native speakers, people with low literacy levels, and people with certain cognitive impairments. Researchers in Natural Language Processing have developed a variety of models that “learn” to simplify sentences automatically. Whatever the approach, it is important to measure the quality of the simplifications that the models generate. Ideally, this would always be done through human evaluation: ask users to rate the quality of the automatic simplifications by comparing them to the original sentences. At development time, however, this is impractical because it is expensive and time-consuming. Researchers therefore rely on metrics that attempt to assess the quality of the simplifications automatically by comparing them to reference simplifications previously produced by human editors. To trust the score that a metric computes, we need to know that it correlates with human assessments (i.e. if a metric gives a simplification a good score, a human would also have rated it highly). In this paper, we present the first comprehensive study on the correlation between automatic metrics and human judgements on simplicity.
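To make the idea of "correlating with human assessments" concrete, here is a minimal sketch that computes a correlation coefficient between metric scores and human ratings for the same simplifications. The choice of Pearson's r, the function name `pearson`, and the toy numbers are all assumptions made for illustration; they are not data or code from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one automatic metric score and one human
# simplicity rating per simplified sentence (not taken from the paper).
metric_scores = [0.21, 0.35, 0.48, 0.62, 0.80]
human_ratings = [1.0, 2.0, 2.5, 4.0, 4.5]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r between metric and human ratings: {r:.3f}")
```

A value of r close to 1 would suggest that the metric ranks simplifications much as the human raters do, while a value near 0 would suggest that its scores carry little information about human judgements.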
Why is the research important?
Studies on the correlation between human judgements on simplicity and automatic scores have been performed when introducing new metrics or data sets. However, these studies did not analyse whether the absolute correlations varied across different subgroups of the data. In contrast, our study shows that correlations are affected by the perceived quality of the simplifications, the type of simplification system, and the set of manual references used to compute the metrics. For instance, we show that: (a) metrics can more reliably score low-quality simplifications; (b) most metrics are better at scoring system outputs from neural models (the current trend in the area); and (c) computing metrics using all available manual references for each original sentence does not significantly improve their correlations. Based on all these findings, we also provide a set of guidelines on which metrics to compute and how to interpret their scores.
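The subgroup analysis described above can be sketched as follows: rather than computing a single correlation over all the data, the records are split by some attribute (here a hypothetical quality band) and the correlation is computed within each group. The record layout, the band labels, the helper names, and the toy values are assumptions for illustration only, and the numbers are constructed rather than taken from the study.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_group(records):
    """Group (group, metric_score, human_rating) records and correlate each group."""
    grouped = defaultdict(lambda: ([], []))
    for group, metric_score, human_rating in records:
        grouped[group][0].append(metric_score)
        grouped[group][1].append(human_rating)
    return {group: pearson(xs, ys) for group, (xs, ys) in grouped.items()}


# Hypothetical toy records: (quality band, metric score, human rating).
records = [
    ("low", 0.1, 1), ("low", 0.2, 2), ("low", 0.3, 3), ("low", 0.4, 4),
    ("high", 0.6, 4), ("high", 0.7, 5), ("high", 0.8, 4), ("high", 0.9, 5),
]

by_group = correlation_by_group(records)
for group, r in by_group.items():
    print(f"{group}: r = {r:.3f}")
```

The same grouping function could be keyed on system type or on the reference set used, which is the kind of breakdown that a single overall correlation hides.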
Anything else that you would like to highlight about the paper?
Our study contributes to the ongoing conversation on the reliability of automatic metrics used to evaluate Natural Language Generation tasks (e.g. Machine Translation, Summarisation, Caption Generation). Like other research in this area, we show that computing absolute correlations is insufficient to verify the reliability of automatic metrics, and that more detailed analyses are needed to identify the specific strengths and weaknesses of each metric. For Text Simplification in particular, we hope our study motivates the development of new metrics for the task.