Correlation Calculator


Written by Matthew Snover.

This document describes the Correlation Calculator, "calc_corr", a Perl script designed to score correlation for the NIST Metric MATR 08. The script scores files in the "NIST" output format (system01.seg.scr, system01.doc.scr, system01.sys.scr, etc.).

The following files have been bundled into the tar.gz file.

  1. README.html, which you're reading right now.
  2. The license for the Correlation Calculator
  3. calc_corr_v3.1.st.pl - This is the scoring script that reads in the NIST format and outputs a table of correlations.
    It takes three required arguments: the human judgment CSV file (provided by NIST/LDC as adequacy.csv); a file containing the lengths of the segments (not provided by LDC; ours for the MT06 data is included), which lets it compute document and system level correlations; and the prefix of the NIST format files to score.
    If your files are output/system01.sys.scr, output/system01.seg.scr, output/system02.seg.scr, ..., you could give "output/" as the prefix, or even "output/system"; the script uses a Perl glob to find all the {seg,doc,sys}.scr files that start with that prefix.
    You can also provide a filename to which it writes the segment level line-fitted projected scores, which is probably a diagnostic you don't need.
    The usage statement also describes the other options.
    usage:
    calc_corr.pl [-hmsfFw] <human-judgement-csv> <segment-wgt> <nist-prefix> [segfitfname]
      if segfitfname is specified then the segment level fit values will be written
        to that file, where line is of the form: <id> <hj_val> <eval_val> <target_eval_val>
      -h  : Show this help message
      -m  : Show min and max indiviual systems
      -s  : Show all individual systems
      -f  : Show linear fit equation (EVAL = a + b*HJ)
      -F  : Show linear fit equation, but reverse it so that we have (HJ = a + b*EVAL)
      -w  : Show weighted segment and document correlations (pearson only)
    
  4. The last file you'll need is a list of segment lengths. A list for MT06 (mt06_seg_wgt.txt) is provided to show the proper format. These lengths are used to generate the document and system level adequacy scores (since NIST only provides segment level scores) and to compute weighted correlations. They were generated by the software's author (using mteval tokenization as performed by TER); they are not from NIST.
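
To illustrate how segment lengths can turn segment level adequacy scores into document and system level scores, here is a minimal Python sketch of a segment-length weighted mean. It is not taken from the Perl script: the function names, the (doc_id, adequacy, length) tuple layout, and the toy numbers are assumptions made for illustration only, as is the choice of weighting each document by its total length when forming the system score.

```python
from collections import defaultdict


def weighted_mean(values, weights):
    """Return sum(v * w) / sum(w); raise if the weights sum to zero."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


def document_scores(segments):
    """Map doc_id -> (length-weighted adequacy, total length).

    `segments` is an iterable of (doc_id, adequacy, length) tuples.
    This layout is hypothetical, not the script's input format.
    """
    by_doc = defaultdict(list)
    for doc_id, adequacy, length in segments:
        by_doc[doc_id].append((adequacy, length))
    result = {}
    for doc_id, rows in by_doc.items():
        scores = [a for a, _ in rows]
        lengths = [n for _, n in rows]
        result[doc_id] = (weighted_mean(scores, lengths), sum(lengths))
    return result


# Toy data: two documents belonging to one system.
segments = [
    ("doc1", 4.0, 10),
    ("doc1", 2.0, 30),
    ("doc2", 5.0, 20),
]
docs = document_scores(segments)

# System score: weighted mean of the document scores, each document
# weighted by its total length (an assumption of this sketch).
system_score = weighted_mean(
    [score for score, _ in docs.values()],
    [length for _, length in docs.values()],
)
print(docs["doc1"][0], docs["doc2"][0], system_score)
```

In this toy data the long, low-scoring segment dominates doc1, so its score lands closer to 2.0 than to 4.0.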


Running the script with a command line like the following should produce output similar to this:
{user@blackbox}% calc_corr_v3.1.st.pl -ms adequacy.csv seg_wgt.txt output/MATR08_MT06/terp_norm3/terp_norm3_system0
Loading segment weights from seg_wgt.txt
Loading judgment scores from adequacy.csv
Loading automatic scores
Type System        Pearson      95%-Interval  Spearman     95%-Interval  Tao          N
SEG  ALL           -0.77519  (-0.792 -0.757)    -0.746  (-0.765 -0.726)    -0.750  1992
     system01      -0.63667  (-0.705 -0.556)    -0.642  (-0.710 -0.563)    -0.657   249
     system02      -0.68002  (-0.742 -0.607)    -0.605  (-0.678 -0.520)    -0.624   249 (max)
     system03      -0.62484  (-0.695 -0.543)    -0.607  (-0.680 -0.522)    -0.609   249
     system04      -0.45713  (-0.550 -0.353)    -0.411  (-0.509 -0.302)    -0.650   249
     system05      -0.45458  (-0.548 -0.350)    -0.459  (-0.552 -0.355)    -0.656   249
     system06      -0.45192  (-0.546 -0.347)    -0.440  (-0.535 -0.334)    -0.627   249 (min)
     system07      -0.46653  (-0.558 -0.363)    -0.275  (-0.386 -0.156)    -0.726   249
     system08      -0.59841  (-0.673 -0.512)    -0.588  (-0.664 -0.501)    -0.734   249
DOC  ALL           -0.94422  (-0.958 -0.927)    -0.885  (-0.912 -0.851)    -0.718   200
     system01      -0.74043  (-0.879 -0.488)    -0.738  (-0.877 -0.483)    -0.567    25
     system02      -0.75035  (-0.884 -0.505)    -0.748  (-0.882 -0.500)    -0.587    25
     system03      -0.71658  (-0.866 -0.448)    -0.618  (-0.814 -0.294)    -0.460    25
     system04      -0.58269  (-0.795 -0.244)    -0.615  (-0.812 -0.290)    -0.447    25 (min)
     system05      -0.72267  (-0.870 -0.458)    -0.595  (-0.801 -0.261)    -0.440    25
     system06      -0.79221  (-0.904 -0.578)    -0.708  (-0.862 -0.434)    -0.533    25 (max)
     system07      -0.76538  (-0.891 -0.531)    -0.730  (-0.873 -0.471)    -0.567    25
     system08      -0.68431  (-0.850 -0.396)    -0.606  (-0.808 -0.277)    -0.460    25
SYS  ALL           -0.99370  (-0.999 -0.964)    -0.929  (-0.987 -0.648)    -0.857     8

The ALL lines show the correlation when the data from all systems is combined. You can also look at the individual systems. Pearson correlation is used to determine the min and max system for each level of correlation. If a correlation is not significant (meaning it is not significantly different from the null hypothesis that the scores are uncorrelated), a "!" is placed next to the score. This is only done for Pearson and Spearman correlations.
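
The 95% intervals in the sample output are consistent with the standard Fisher z-transform interval for a Pearson correlation. The Python sketch below shows that calculation; that the script computes its intervals exactly this way is an assumption, and the function name and the normal critical value used here are illustrative choices.

```python
import math


def pearson_confidence_interval(r, n, z_crit=1.959964):
    """Approximate confidence interval for a Pearson correlation r
    computed from n data points, using the Fisher z-transform.

    z_crit is the two-sided normal critical value (about 1.96 for 95%).
    """
    if n <= 3:
        raise ValueError("need more than 3 data points")
    z = math.atanh(r)               # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error in z space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Values taken from the "SEG ALL" line of the sample output above.
low, high = pearson_confidence_interval(-0.77519, 1992)
print(f"({low:.3f} {high:.3f})")
```

For the "SEG ALL" line (r = -0.77519, N = 1992) this gives roughly (-0.792 -0.757), which agrees with the interval printed by the script.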


Weighted Correlations

One useful feature of this scoring package is the ability to calculate a weighted Pearson correlation. In the standard Pearson correlation all of the data points are treated equally. However, short segments contribute very little to the document or system level assessments of translation quality, while longer segments are of greater importance. Short segments also tend to be very difficult both to translate and to score well, which makes them outliers from the overall distribution. While these short segments do not matter much when averaged into the judgments at the document or system level, they exert a great deal of pressure on the correlation coefficient calculated at the segment level. A more desirable measure of correlation would weight the longer segments more heavily than the shorter segments when calculating the Pearson correlation coefficient.

This is done, in short, by multiplying each score (both the human judgment score and the score output by the evaluation metric) by the number of words in the segment. The scoring package supports this weighted correlation, and provides weighted Pearson correlations at the segment and document level when it is used with the "-w" flag. Note that the confidence intervals and significance scores output in these cases are not necessarily accurate. Weighted Pearson correlations (as well as Spearman correlations) are generally supported by the MattStat Perl module.
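
As a rough illustration of the idea, the Python sketch below computes the textbook weighted Pearson correlation, with weighted means, weighted covariance, and weighted variances. It is not the MattStat implementation, which may differ in detail; the function name and the toy data are assumptions made for this example.

```python
import math


def weighted_pearson(x, y, w):
    """Textbook weighted Pearson correlation of x and y with weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total == 0:
        raise ValueError("weights sum to zero")
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy data: four long segments on which the metric tracks the human
# judgment, plus one 2-word segment on which it does not.
human = [1.0, 2.0, 3.0, 4.0, 4.0]
metric = [1.0, 2.0, 3.0, 4.0, 0.0]
lengths = [20, 25, 30, 25, 2]

unweighted = weighted_pearson(human, metric, [1] * len(human))
weighted = weighted_pearson(human, metric, lengths)
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

With equal weights the function reduces to the ordinary Pearson correlation. In the toy data the single short outlier drags the unweighted value down sharply, while the length-weighted value stays high, which is the effect described above.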

An Example

This weighting can have a significant effect on the reported correlation. For example, here are correlations calculated after running several baseline evaluation metrics on the MATR08 TransTac and MATR08 MT06 development data. The Pearson, Spearman, and Tao correlation coefficients are shown for each evaluation metric at the segment (SEG), document (DOC), and system (SYS) level. Document level adequacy scores were found by taking the segment-length weighted mean of the adequacy scores of the segments in each document. The weighted mean of the document adequacy scores was used in the same way to obtain the system level adequacy scores. A '*' next to a correlation number indicates that the correlation is not significant.

Case insensitive BLEU as calculated by IBM BLEU was used for the "bleu" metric. BLEU limited to n-grams of size 2 was used for the "bleu_n2" metric. METEOR using the exact matching, Porter stemming, and WordNet synonym modules was used to calculate the "meteor" metric. Case insensitive TER was used to calculate the "ter" score. Case insensitive TER with punctuation stripping enabled was used to calculate the "ter_np" metric. The unweighted correlations are shown below:

MATR08_transtac
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.384   0.345*  0.052  |  0.150*  0.300*  0.200  |  0.150*  0.300*  0.200  |
| bleu_n2  |  0.325*  0.345*  0.054  | -0.011*  0.100*  0.000  | -0.011*  0.100*  0.000  |
| meteor   |  0.464   0.531   0.184  |  0.156*  0.300*  0.200  |  0.156*  0.300*  0.200  |
| ter      | -0.344* -0.351* -0.408  | -0.015*  0.100*  0.000  | -0.015*  0.100*  0.000  |
| ter_np   | -0.351* -0.356  -0.415  | -0.037*  0.100*  0.000  | -0.037*  0.100*  0.000  |
+----------+-------------------------+-------------------------+-------------------------+

MATR08_MT06
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.603   0.606   0.192  |  0.861   0.794   0.598  |  0.954   0.738*  0.643  |
| bleu_n2  |  0.637   0.614   0.198  |  0.883   0.799   0.604  |  0.952   0.738*  0.643  |
| meteor   |  0.739   0.730   0.290  |  0.898   0.876   0.693  |  0.958   0.922   0.786  |
| ter      | -0.609  -0.631  -0.663  | -0.863  -0.801  -0.618  | -0.961  -0.786* -0.643  |
| ter_np   | -0.630  -0.652  -0.678  | -0.870  -0.812  -0.631  | -0.963  -0.833* -0.714  |
+----------+-------------------------+-------------------------+-------------------------+

Here are the same results when weighted correlations are used for the segment and document level Pearson correlation scores:

MATR08_transtac
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.398   0.345*  0.052  |  0.150*  0.300*  0.200  |  0.150*  0.300*  0.200  |
| bleu_n2  |  0.327*  0.345*  0.054  | -0.011*  0.100*  0.000  | -0.011*  0.100*  0.000  |
| meteor   |  0.441   0.531   0.184  |  0.156*  0.300*  0.200  |  0.156*  0.300*  0.200  |
| ter      | -0.395  -0.351* -0.408  | -0.015*  0.100*  0.000  | -0.015*  0.100*  0.000  |
| ter_np   | -0.408  -0.356  -0.415  | -0.037*  0.100*  0.000  | -0.037*  0.100*  0.000  |
+----------+-------------------------+-------------------------+-------------------------+

MATR08_MT06
+----------+-------------------------+-------------------------+-------------------------+
|          |           SEG           |           DOC           |           SYS           |
| METRIC   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |  PEAR    SPEAR    TAO   |
+----------+-------------------------+-------------------------+-------------------------+
| bleu     |  0.662   0.606   0.192  |  0.889   0.794   0.598  |  0.954   0.738*  0.643  |
| bleu_n2  |  0.696   0.614   0.198  |  0.902   0.799   0.604  |  0.952   0.738*  0.643  |
| meteor   |  0.769   0.730   0.290  |  0.914   0.876   0.693  |  0.958   0.922   0.786  |
| ter      | -0.705  -0.631  -0.663  | -0.890  -0.801  -0.618  | -0.961  -0.786* -0.643  |
| ter_np   | -0.707  -0.652  -0.678  | -0.891  -0.812  -0.631  | -0.963  -0.833* -0.714  |
+----------+-------------------------+-------------------------+-------------------------+

The TransTac data set is too small to draw conclusions from, but the MT06 data is much more informative.

On the MT06 data, the unweighted results give "bleu_n2" a slightly higher segment level Pearson correlation (in magnitude) than "ter_np". When the weighted correlation is examined, however, the correlations of both "ter" and "ter_np" exceed that of "bleu_n2". The "ter" measures seem especially prone to incorrectly estimating translation quality on short segments. In addition, the "ter_np" metric outperforms the "ter" metric by a clear margin according to the unweighted correlation, but does only insignificantly better according to the weighted correlation, indicating that the benefit of stripping punctuation is primarily on short segments.