Using WFSTs for OCR Correction

An Exercise in Using WFSTs for OCR Error Correction

This assignment was created by Okan Kolak and revised by Philip Resnik. For a fuller description of the problem and a more fully elaborated WFST-based approach, see: Okan Kolak, William Byrne, Philip Resnik, A Generative Probabilistic OCR Model for NLP Applications, in Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27-June 1, 2003.

Others are welcome to use this assignment in teaching. If you do so, please cite Okan Kolak and Philip Resnik (2004), "An Exercise in Using WFSTs for OCR Error Correction", this URL.

Overview

In this homework, you will get some experience using weighted finite state transducers (WFSTs) on a practical task. The practical problem is to correct the output of an optical character recognition (OCR) system run on French text, using a noisy channel model.

The model has two components: a language model (source model) and an error model (channel model). You will use data provided with this assignment to express and combine these component models using the AT&T FSM toolkit, and then you will run the combined model on test data, also provided.

We assume you have covered the basics of weighted finite-state transducers and noisy channel modeling.

We also assume that you have access to an installed version of the AT&T FSM toolkit.

Provided Files

Each line in a .trt file contains the correct ("ground truth") version of the OCR output on the corresponding line in the .ocr file. (A typical pair might be semblable for ground truth paired with sernblable as OCR output.) The first line in train.trt is the correct version of the first line in train.ocr, etc. Tokens that are produced by word merge/split errors are eliminated, to simplify the assignment. Consequently, each line contains only one token.

Note that test.trt and small.test.trt contain the ground truth for the test sets, i.e. the answers! Please don't look at these data until you're ready to test your solution.

errorModel.probs is the error model trained on the training files; see the discussion of the error model below.

The provided language model is a character-level 3-gram model trained on the train.trt file. You can use this file as your LM FSM if you do not have time to train and encode your own LM.

Note: this LM FSM uses the tropical semiring, and the numbers on the arcs are negative logs (natural base) of the corresponding probabilities. You will need to use the same semiring (tropical) and the same cost convention (negative natural logs of probabilities) for the error model FST as well. (This would pretty much be the default approach in this framework in any case.)
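The probability-to-cost conversion is just a negative natural log. A minimal sketch (plain Python, no toolkit dependencies; the function name is my own):

```python
import math

def to_cost(p):
    """Convert a probability to a tropical-semiring arc cost:
    the negative natural log of p. Costs add along a path
    (probabilities multiply), and the best path has minimum cost."""
    return 0.0 - math.log(p)   # 0.0 - ... normalizes -0.0 to 0.0 for p = 1

cost = to_cost(0.5)   # -ln(0.5) ≈ 0.693147
```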

This is a symbol file (mapping tokens to integers) that you can use with the AT&T FSM toolkit. It includes all the symbols used in the text files and in the error model, plus a few extras.

Language Model

As an exercise, and for better performance, you may want to provide your own language model instead of using the provided LM FSA. (This shouldn't be too hard if you created an n-gram model in a previous assignment.) Since the words in the dataset are provided in isolation (without context), a word-level language model may not be very useful, and the usual approach would be to use a character-level language model. You can use the truth side of the training data (train.trt) to train your language model.

If you are interested, you are free to use a word level model, or a combination of word and character level models. Note that if you decide to use a word level model, you will need a lexicon that you have to obtain/generate, and you cannot use test.trt or small.test.trt to generate it. Therefore, you may want to think about the unknown word problem.

You are also welcome to use additional French text for language model training if you so desire. A good choice might be the Bible, or you could explore cross-corpus effects by using French text from the Web. Note that the data in this assignment were created using the French Bible, so technically speaking you'd be training on test data and this would not be a "clean" experiment. That's ok since the point of this exercise is to get experience in modeling using WFSTs. (It also means you'll be able to get away with using an unsmoothed language model if you want.)

Remember, if you decide to use your own language model, you will need to express it as a finite-state machine.
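To illustrate what "express it as a finite-state machine" can look like, here is a sketch that trains an unsmoothed character-bigram LM (a simpler stand-in for the provided 3-gram model) and emits it in AT&T FSM text format. The function name and state-numbering scheme are my own; adapt them to your setup.

```python
import math
from collections import Counter

def nlog(p):
    """Tropical cost of a probability: negative natural log (0.0 for p = 1)."""
    return 0.0 - math.log(p)

def char_bigram_fsa(tokens):
    """Unsmoothed character-bigram LM as a weighted FSA in AT&T FSM text
    format: arc lines 'src dst label cost', final-state lines 'state cost'.
    State 0 is the start state; every other state remembers the last char."""
    starts, bigrams, ends = Counter(), Counter(), Counter()
    for tok in tokens:
        starts[tok[0]] += 1
        ends[tok[-1]] += 1
        for a, b in zip(tok, tok[1:]):
            bigrams[(a, b)] += 1
    occ = Counter(c for tok in tokens for c in tok)  # occurrences of each char
    chars = sorted(occ)
    state = {c: i + 1 for i, c in enumerate(chars)}
    lines = []
    for c in chars:                                   # start arcs: P(c | <start>)
        if starts[c]:
            lines.append(f"0 {state[c]} {c} {nlog(starts[c] / len(tokens)):.6f}")
    for (a, b), k in sorted(bigrams.items()):         # bigram arcs: P(b | a)
        lines.append(f"{state[a]} {state[b]} {b} {nlog(k / occ[a]):.6f}")
    for c in chars:                                   # final states: P(<end> | c)
        if ends[c]:
            lines.append(f"{state[c]} {nlog(ends[c] / occ[c]):.6f}")
    return "\n".join(lines)
```

Each occurrence of a character either is followed by another character or ends the token, so the outgoing bigram probabilities plus the stopping probability sum to one at every state.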

Error Model

The error model in this assignment is a generalized confusion matrix based on the character edit model of Brill and Moore (2000). It provides confusion probabilities for individual pairs of characters (e.g. P(c|e), the probability of the OCR system recognizing 'c' when the true input is 'e'), as well as confusion probabilities for sequences of characters (e.g. P(am|ani)).

The error model probabilities are provided for you in the file errorModel.probs. The file has the following format:

__edit_probs__
<src chars>     <trg chars>     <P(<trg chars>|<src chars>)>
.
.
.
__end_data__
where multiple source (target) characters are separated by spaces, and the three table columns are separated by tabs. Source and target 'chars' can be one-, two-, or three-character sequences. There are special symbols for epsilon (__eps__), the start marker (__sta__), the end marker (__end__), and the space marker (__spc__, although this one does not appear in the files provided for this homework). Smoothing has already been performed on the error model, so all you need to do is convert the table into a WFST.
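Converting the table into a WFST starts with parsing it. A sketch of a parser for the format above, storing tropical costs rather than raw probabilities (the sample entries are made up for illustration):

```python
import math

def parse_error_model(text):
    """Parse an errorModel.probs-style table into a dict
    {(src_chars, trg_chars): cost}, where cost = -ln(P(trg|src))
    for use with the tropical semiring."""
    table = {}
    in_data = False
    for line in text.splitlines():
        if line == "__edit_probs__":
            in_data = True
            continue
        if line == "__end_data__":
            break
        if not in_data or not line.strip():
            continue
        # columns are tab-separated (assumed: exactly one tab per column);
        # multi-character sequences are space-separated within a column
        src, trg, p = line.split("\t")
        table[(tuple(src.split()), tuple(trg.split()))] = -math.log(float(p))
    return table

sample = "__edit_probs__\ne\tc\t0.01\na m\ta n i\t0.002\n__end_data__\n"
model = parse_error_model(sample)
```

From this dict, each entry becomes a path in the WFST: single-character pairs are one arc, and multi-character sequences require intermediate states.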

Because of the way the error model is generated, you should use a tropical semiring, not the log semiring. If you use the log semiring, you may end up with probabilities greater than one. To give an example, assume you have the following entries in the error model table:

a b c    a b c    0.8
a        a        0.7
b        b        0.9
c        c        0.85

When you are processing 'a b c -> a b c', you will get two paths: 'a b c' -> 'a b c' with probability 0.8, and 'a' -> 'a', 'b' -> 'b', 'c' -> 'c' with probability 0.7 × 0.9 × 0.85 ≈ 0.54. The combined probability is 1.34! The short explanation is this: because of a simplification in the segmentation component of Brill and Moore's model (which they also make), the two paths come from different chunkings, and this version of the model ignores the probability of a particular chunking to simplify things.
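The arithmetic behind this example can be checked directly; note how summing the paths (what the log semiring would do) exceeds 1, while the tropical semiring keeps only the min-cost path:

```python
import math

# Two ways of generating 'a b c' -> 'a b c' under the table above:
p_chunk = 0.8                # one three-character edit
p_chars = 0.7 * 0.9 * 0.85   # three single-character edits = 0.5355

# Log semiring: paths are summed, yielding a "probability" above 1.
total = p_chunk + p_chars    # ≈ 1.34

# Tropical semiring: only the best (minimum-cost) path survives.
best_cost = min(-math.log(p_chunk), -math.log(p_chars))   # = -ln(0.8)
```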

Correction (and what to turn in)

In order to perform correction on test data, you will need to convert each input token to an FSA, compose the input FSAs with the language model WFSA and the error model WFST, and extract answers from the resulting machines.
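Encoding a test token as a linear FSA is mechanical. A sketch that emits AT&T FSM text format, one arc per character, with the final state on a line by itself (the function name is my own):

```python
def token_to_fsa(token):
    """Encode an OCR token as a linear, unweighted FSA in AT&T FSM
    text format: arc lines 'src dst label', then the final state."""
    lines = [f"{i} {i + 1} {c}" for i, c in enumerate(token)]
    lines.append(str(len(token)))  # final state
    return "\n".join(lines)
```

The resulting text can then be compiled against the provided symbol file and composed with the error model and LM using the toolkit (fsmcompile, fsmcompose, and fsmbestpath for n-best extraction); check your installation's man pages for the exact flags.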

Since the search space for correction is potentially very large, you may need to employ some restrictions to make the search feasible, depending on your implementation. One such restriction is to limit the number of character errors per token. Whether you use such restrictions, and which ones, is your decision.

In order to help with the processing time constraints, you are provided with two test sets. The first set, test.ocr, contains 733 tokens. The second set, small.test.ocr, is a subset of the first one and contains only 20 tokens. The tokens in the small set are selected from corrected tokens in a test run, so you may expect to correct all 20. You are free to return results for either set. One possible approach would be to work on the small set during development, and once you have your final system, correct the large set if you have enough time. The experiments on the small set should enable you to estimate whether you will have time to correct the big set.

For each line in the test file of your choice, you should output a line that contains the top ten most likely corrections along with their probabilities. Use the following format:

Input:
token_1
token_2
.
.
.

Output:
[ correction_1_1 prob_1_1 ] [ ... ] [ correction_1_10 prob_1_10 ]
[ correction_2_1 prob_2_1 ] [ ... ] [ correction_2_10 prob_2_10 ]
.
.
.
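Once you have n-best paths with tropical costs, converting back to probabilities and producing the required line format is straightforward. A sketch (format_corrections is a hypothetical helper, not part of the toolkit):

```python
import math

def format_corrections(cands):
    """Format one output line from a list of (correction, cost) pairs,
    where cost is a tropical cost (-ln p). Sorts best-first and keeps
    at most the top ten candidates."""
    top = sorted(cands, key=lambda x: x[1])[:10]
    return " ".join(f"[ {w} {math.exp(-c):.6f} ]" for w, c in top)
```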
Getting into the FSM mindset can be a challenge, especially for people who are used to the imperative mindset of most programming languages and algorithms. If you get stuck at any point during the assignment, feel free to send an e-mail to the class mailing list.

Please turn in a short writeup of the approach you took, together with your output file.