This document describes the Correlation Calculator, the "calc_corr" Perl script, designed to score correlation for the NIST Metric MATR 08. The script scores files in the "NIST" output format (system01.seg.scr, system01.doc.scr, system01.sys.scr, etc.).
The script and its supporting files are bundled into a tar.gz file.
usage:
calc_corr.pl [-hmsfFw] <human-judgement-csv> <segment-wgt> <nist-prefix> [segfitfname]
if segfitfname is specified then the segment level fit values will be written
to that file, where each line is of the form: <id> <hj_val> <eval_val> <target_eval_val>
-h : Show this help message
-m : Show min and max individual systems
-s : Show all individual systems
-f : Show linear fit equation (EVAL = a + b*HJ)
-F : Show linear fit equation, but reverse it so that we have (HJ = a + b*EVAL)
-w : Show weighted segment and document correlations (pearson only)
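The segment fit file described in the usage text can be read back for further analysis. The following Python sketch is illustrative only: it assumes the four fields are separated by whitespace and that the id contains no spaces, neither of which the usage text states. The names SegFit and parse_segfit are hypothetical and are not part of the script.

```python
from typing import Iterable, List, NamedTuple


class SegFit(NamedTuple):
    """One line of the segment fit file: <id> <hj_val> <eval_val> <target_eval_val>."""
    seg_id: str
    hj_val: float           # human judgment value
    eval_val: float         # value output by the evaluation metric
    target_eval_val: float  # value predicted by the linear fit


def parse_segfit(lines: Iterable[str]) -> List[SegFit]:
    """Parse segment fit lines, skipping any line that does not have four fields.

    Assumption: fields are whitespace-separated (not confirmed by the source).
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # blank or malformed line
        seg_id, hj, ev, target = parts
        rows.append(SegFit(seg_id, float(hj), float(ev), float(target)))
    return rows


# Made-up sample line, for illustration only.
for row in parse_segfit(["seg0001 3.5 0.42 0.40"]):
    print(row.seg_id, row.hj_val, row.eval_val, row.target_eval_val)
```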
{user@blackbox}% calc_corr_v3.1.st.pl -ms adequacy.csv seg_wgt.txt output/MATR08_MT06/terp_norm3/terp_norm3_system0
Loading segment weights from seg_wgt.txt
Loading judgment scores from adequacy.csv
Loading automatic scores
Type  System    Pearson   95%-Interval      Spearman  95%-Interval      Tao     N
SEG   ALL       -0.77519  (-0.792 -0.757)   -0.746    (-0.765 -0.726)   -0.750  1992
      system01  -0.63667  (-0.705 -0.556)   -0.642    (-0.710 -0.563)   -0.657   249
      system02  -0.68002  (-0.742 -0.607)   -0.605    (-0.678 -0.520)   -0.624   249 (max)
      system03  -0.62484  (-0.695 -0.543)   -0.607    (-0.680 -0.522)   -0.609   249
      system04  -0.45713  (-0.550 -0.353)   -0.411    (-0.509 -0.302)   -0.650   249
      system05  -0.45458  (-0.548 -0.350)   -0.459    (-0.552 -0.355)   -0.656   249
      system06  -0.45192  (-0.546 -0.347)   -0.440    (-0.535 -0.334)   -0.627   249 (min)
      system07  -0.46653  (-0.558 -0.363)   -0.275    (-0.386 -0.156)   -0.726   249
      system08  -0.59841  (-0.673 -0.512)   -0.588    (-0.664 -0.501)   -0.734   249
DOC   ALL       -0.94422  (-0.958 -0.927)   -0.885    (-0.912 -0.851)   -0.718   200
      system01  -0.74043  (-0.879 -0.488)   -0.738    (-0.877 -0.483)   -0.567    25
      system02  -0.75035  (-0.884 -0.505)   -0.748    (-0.882 -0.500)   -0.587    25
      system03  -0.71658  (-0.866 -0.448)   -0.618    (-0.814 -0.294)   -0.460    25
      system04  -0.58269  (-0.795 -0.244)   -0.615    (-0.812 -0.290)   -0.447    25 (min)
      system05  -0.72267  (-0.870 -0.458)   -0.595    (-0.801 -0.261)   -0.440    25
      system06  -0.79221  (-0.904 -0.578)   -0.708    (-0.862 -0.434)   -0.533    25 (max)
      system07  -0.76538  (-0.891 -0.531)   -0.730    (-0.873 -0.471)   -0.567    25
      system08  -0.68431  (-0.850 -0.396)   -0.606    (-0.808 -0.277)   -0.460    25
SYS   ALL       -0.99370  (-0.999 -0.964)   -0.929    (-0.987 -0.648)   -0.857     8
The ALL lines show the correlation when the data from all systems is combined; the remaining lines give the correlation for each individual system. The Pearson correlation determines which systems are marked (min) and (max) at each level. If a correlation is not significant (that is, not significantly different from the null hypothesis of no correlation), a "!" is placed next to the score. This is done only for the Pearson and Spearman correlations.
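The source does not say how the 95% intervals are computed. The intervals in the sample output are, however, consistent with the standard Fisher z-transformation, so the following Python sketch uses that method as an assumption. The function names are hypothetical. The significance check shown (the interval excludes zero) is likewise only one plausible reading of how the "!" marker could be decided.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.959964) -> tuple:
    """Approximate confidence interval for a correlation r over n points.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1/sqrt(n - 3). z_crit = 1.959964 gives a 95% interval.
    """
    if n <= 3:
        raise ValueError("need more than 3 data points")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def is_significant(r: float, n: int) -> bool:
    """Assumed rule: significant when the 95% interval excludes zero."""
    low, high = pearson_ci(r, n)
    return low > 0.0 or high < 0.0


# Values taken from the SEG ALL line of the sample output.
low, high = pearson_ci(-0.77519, 1992)
print(round(low, 3), round(high, 3))  # rounds to -0.792 -0.757
```

With so few points at the system level (N = 8), the interval becomes very wide, which is why system level correlations should be read with care.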
One useful feature of this scoring package is its ability to calculate a weighted Pearson correlation. In the standard Pearson correlation, all data points are treated equally. However, short segments contribute very little to document or system level assessments of translation quality, while longer segments matter more. Short segments also tend to be difficult both to translate and to score well, which makes them outliers from the overall distribution. Although these short segments matter little when averaged into the judgments at the document or system level, they exert a great deal of pressure on the correlation coefficient calculated at the segment level. A more desirable measure of correlation weights longer segments more heavily than shorter ones when calculating the Pearson correlation coefficient.
In short, this is done by multiplying each score (both the human judgment score and the score output by the evaluation metric) by the number of words in the segment. The scoring package supports this weighted correlation and provides weighted Pearson correlations at the segment and document level when it is used with the "-w" flag. Note that the confidence intervals and significance scores output in these cases are not necessarily accurate. Weighted Pearson correlations (as well as Spearman correlations) are generally supported by the MattStat Perl module.
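The weighting is described only informally here. As an illustration, the Python sketch below implements one common definition of a weighted Pearson correlation, in which each data point contributes to the means, variances, and covariance in proportion to its weight (here, the segment word count). Whether the script and the MattStat module compute exactly this quantity is an assumption, and the name weighted_pearson is hypothetical.

```python
import math
from typing import Sequence


def weighted_pearson(x: Sequence[float], y: Sequence[float],
                     w: Sequence[float]) -> float:
    """Weighted Pearson correlation of x and y with non-negative weights w.

    x: human judgment scores, y: metric scores, w: segment lengths in words.
    With equal weights this reduces to the ordinary Pearson correlation.
    """
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances.
    cov = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: the last point is a short outlier segment (2 words).
hj = [4.0, 3.0, 2.0, 5.0]
metric = [0.8, 0.6, 0.4, 0.1]
lengths = [30, 25, 40, 2]
print(weighted_pearson(hj, metric, lengths))
print(weighted_pearson(hj, metric, [1, 1, 1, 1]))  # unweighted, for comparison
```

In the made-up data the short outlier segment has little influence on the weighted result but a large influence on the unweighted one, which is the effect the paragraph above describes.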
This weighting can have a significant effect on the reported correlation. For example, here are correlations calculated after running several baseline evaluation metrics on the MATR08 TransTac and MATR08 MT06 development data. The Pearson, Spearman, and Tao correlation coefficients are shown for each evaluation metric at the segment (SEG), document (DOC), and system (SYS) level. Adequacy scores at the document level were found by taking the segment-length weighted mean of the adequacy scores of the segments in each document. The system level adequacy scores were found similarly, by taking the weighted mean of the document adequacy scores. A '*' next to a correlation number indicates that the correlation is not significant.
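The segment-length weighted mean used for the document level adequacy scores can be sketched in Python as follows. This is a minimal illustration, not the script's code: the function name and the (doc_id, length, score) tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_doc_scores(
        segments: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Return the length-weighted mean score per document.

    segments: (doc_id, segment_length_in_words, adequacy_score) tuples.
    Each document score is sum(length * score) / sum(length).
    """
    weighted_sum = defaultdict(float)
    total_length = defaultdict(float)
    for doc_id, length, score in segments:
        weighted_sum[doc_id] += length * score
        total_length[doc_id] += length
    return {doc_id: weighted_sum[doc_id] / total_length[doc_id]
            for doc_id in weighted_sum if total_length[doc_id] > 0}


# Made-up data: the 30-word segment dominates the score for doc1.
segments = [("doc1", 10, 4.0), ("doc1", 30, 2.0), ("doc2", 5, 3.0)]
print(weighted_doc_scores(segments))  # {'doc1': 2.5, 'doc2': 3.0}
```

The same function could be applied one level up, with documents in place of segments, to obtain system level scores, assuming document length is the weight used there.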
Case-insensitive BLEU, as calculated by IBM BLEU, was used for the "bleu" metric. The "bleu_n2" metric is BLEU limited to n-grams of size 2. METEOR with the exact matching, Porter stemming, and WordNet synonym modules was used for the "meteor" metric. Case-insensitive TER was used for the "ter" metric. The "ter_np" metric is case-insensitive TER with punctuation stripping enabled. The unweighted correlations are shown below:
MATR08_transtac
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.384   0.345*  0.052  |  0.150*  0.300*  0.200  |  0.150*  0.300*  0.200  |
| bleu_n2  |  0.325*  0.345*  0.054  | -0.011*  0.100*  0.000  | -0.011*  0.100*  0.000  |
| meteor   |  0.464   0.531   0.184  |  0.156*  0.300*  0.200  |  0.156*  0.300*  0.200  |
| ter      | -0.344* -0.351* -0.408  | -0.015*  0.100*  0.000  | -0.015*  0.100*  0.000  |
| ter_np   | -0.351* -0.356  -0.415  | -0.037*  0.100*  0.000  | -0.037*  0.100*  0.000  |
+----------+-------------------------+-------------------------+-------------------------+

MATR08_MT06
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.603   0.606   0.192  |  0.861   0.794   0.598  |  0.954   0.738*  0.643  |
| bleu_n2  |  0.637   0.614   0.198  |  0.883   0.799   0.604  |  0.952   0.738*  0.643  |
| meteor   |  0.739   0.730   0.290  |  0.898   0.876   0.693  |  0.958   0.922   0.786  |
| ter      | -0.609  -0.631  -0.663  | -0.863  -0.801  -0.618  | -0.961  -0.786* -0.643  |
| ter_np   | -0.630  -0.652  -0.678  | -0.870  -0.812  -0.631  | -0.963  -0.833* -0.714  |
+----------+-------------------------+-------------------------+-------------------------+

Here are the same results when weighted correlations are used for the segment and document level Pearson correlation scores:
MATR08_transtac
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.398   0.345*  0.052  |  0.150*  0.300*  0.200  |  0.150*  0.300*  0.200  |
| bleu_n2  |  0.327*  0.345*  0.054  | -0.011*  0.100*  0.000  | -0.011*  0.100*  0.000  |
| meteor   |  0.441   0.531   0.184  |  0.156*  0.300*  0.200  |  0.156*  0.300*  0.200  |
| ter      | -0.395  -0.351* -0.408  | -0.015*  0.100*  0.000  | -0.015*  0.100*  0.000  |
| ter_np   | -0.408  -0.356  -0.415  | -0.037*  0.100*  0.000  | -0.037*  0.100*  0.000  |
+----------+-------------------------+-------------------------+-------------------------+

MATR08_MT06
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |  PEAR    SPEAR    TAO    |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.662   0.606   0.192  |  0.889   0.794   0.598  |  0.954   0.738*  0.643  |
| bleu_n2  |  0.696   0.614   0.198  |  0.902   0.799   0.604  |  0.952   0.738*  0.643  |
| meteor   |  0.769   0.730   0.290  |  0.914   0.876   0.693  |  0.958   0.922   0.786  |
| ter      | -0.705  -0.631  -0.663  | -0.890  -0.801  -0.618  | -0.961  -0.786* -0.643  |
| ter_np   | -0.707  -0.652  -0.678  | -0.891  -0.812  -0.631  | -0.963  -0.833* -0.714  |
+----------+-------------------------+-------------------------+-------------------------+
The TransTac data set is too small to draw conclusions from, but the MT06 data is much more informative.
According to the unweighted correlation on MT06, "bleu_n2" has a slightly higher segment level Pearson correlation than "ter_np" (0.637 versus 0.630 in magnitude). Under the weighted correlation, however, both "ter" (0.705) and "ter_np" (0.707) exceed "bleu_n2" (0.696). The "ter" measures seem especially prone to misestimating translation quality on short segments. In addition, "ter_np" clearly outperforms "ter" under the unweighted correlation (0.630 versus 0.609) but is only marginally better under the weighted correlation (0.707 versus 0.705), indicating that the benefit of stripping punctuation falls primarily on short segments.