Others are welcome to use this assignment in teaching. If you do so, please cite Okan Kolak and Philip Resnik (2004), "An Exercise in Using WFSTs for OCR Error Correction", this URL.
The model has two components: a language model (source model) and an error model (channel model). You will use data provided with this assignment to express and combine these component models using the AT&T FSM toolkit, and then you will run the combined model on test data, also provided.
We assume you have covered the basics of:
Note that test.trt and small.test.trt contain the ground truth for the test sets, i.e., the answers! Please don't look at these files until you're ready to test your solution.
This is the error model trained on the training files. See the discussion of the error model below.

This is a character-level 3-gram language model trained on the train.trt file. You can use this file as your LM FSM if you do not have time to train and encode your own LM. Note: this LM FSM uses the tropical semiring, and the numbers on the arcs are negative logs (natural base) of the corresponding probabilities. You will need to use the same semiring (tropical) and negative logs of probabilities with the same log base (natural) for the arc costs of the error model FST as well. (This would pretty much be the default approach in this framework in any case.)
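To make the cost convention concrete, here is a minimal sketch of the probability-to-cost conversion described above: in the tropical semiring with natural-base negative logs, each arc carries -ln(p), and costs along a path add where the underlying probabilities multiply. The function name is ours, not part of the toolkit.

```python
import math

def prob_to_cost(p):
    """Convert a probability to a tropical-semiring arc cost
    (negative natural log), matching the provided LM FSM."""
    return -math.log(p)

# A probability of 0.5 becomes a cost of about 0.693.
print(round(prob_to_cost(0.5), 3))  # 0.693

# Multiplying probabilities corresponds to adding costs along a path.
path_cost = prob_to_cost(0.5) + prob_to_cost(0.25)
print(round(math.exp(-path_cost), 3))  # 0.125
```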
This is a symbol file (mapping tokens to integers) that you can use with the AT&T FSM toolkit. It includes all the symbols used in the text files and in the error model, plus a few more.

If you are interested, you are free to use a word-level model, or a combination of word- and character-level models. Note that if you decide to use a word-level model, you will need a lexicon that you have to obtain or generate, and you cannot use test.trt or small.test.trt to generate it. Therefore, you may want to think about the unknown-word problem.
You are also welcome to use additional French text for language model training if you so desire. A good choice might be the Bible, or you could explore cross-corpus effects by using French text from the Web. Note that the data in this assignment were created using the French Bible, so technically speaking you'd be training on test data and this would not be a "clean" experiment. That's ok since the point of this exercise is to get experience in modeling using WFSTs. (It also means you'll be able to get away with using an unsmoothed language model if you want.)
Remember, if you decide to use your own language model, you will need to express it as a finite-state machine.
The error model probabilities are provided for you in the file errorModel.probs. The file has the following format:
    __edit_probs__
    <src chars>	<trg chars>	<P(<trg chars>|<src chars>)>
    . . .
    __end_data__

where multiple source (target) characters are separated by spaces, and the three table columns are separated by tabs. Source and target 'chars' can be one-, two-, or three-character sequences. There are special symbols for epsilon (__eps__), the start marker (__sta__), the end marker (__end__), and the space marker (__spc__, although this one does not appear in the files provided for this homework). Smoothing has already been performed on the error model, so all you need to do is convert the table into a WFST.
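One possible way to turn such table entries into a WFST is to emit arcs in the AT&T textual FSM format (one arc per line: source state, destination state, input label, output label, weight; final states on their own line), expanding multi-character entries into chains of intermediate states padded with __eps__. This is only a sketch of one encoding, not the required solution; the function and its input format are our own.

```python
import math

def error_model_to_fsm(entries):
    """Encode error-model entries as AT&T-format text arcs around a
    single hub state 0 (both initial and final). `entries` is a list
    of (src_symbols, trg_symbols, prob) tuples. Multi-symbol entries
    become chains through fresh intermediate states, with the full
    -ln(prob) cost placed on the first arc of each chain."""
    lines = []
    next_state = 1
    for src, trg, p in entries:
        n = max(len(src), len(trg))
        # Pad the shorter side with epsilons so both sides have n labels.
        src = src + ["__eps__"] * (n - len(src))
        trg = trg + ["__eps__"] * (n - len(trg))
        cost = -math.log(p)
        prev = 0
        for i in range(n):
            if i == n - 1:
                dest = 0  # chain returns to the hub state
            else:
                dest = next_state
                next_state += 1
            w = cost if i == 0 else 0.0
            lines.append(f"{prev} {dest} {src[i]} {trg[i]} {w:.6f}")
            prev = dest
    lines.append("0")  # state 0 is final with no cost
    return lines

# A tiny table: P(b|a) = 0.7 and P(a|a b) = 0.5.
for line in error_model_to_fsm([(["a"], ["b"], 0.7),
                                (["a", "b"], ["a"], 0.5)]):
    print(line)
```

The resulting text can then be compiled with the toolkit's compiler using your symbol file.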
Because of the way the error model is generated, you should use a tropical semiring, not the log semiring. If you use the log semiring, you may end up with probabilities greater than one. To give an example, assume you have the following entries in the error model table:
    a b c	a b c	0.8
    a	a	0.7
    b	b	0.9
    c	c	0.85
When you process 'a b c -> a b c', you will get two paths: 'a b c' -> 'a b c' with probability 0.8, and 'a' -> 'a', 'b' -> 'b', 'c' -> 'c' with probability 0.7 × 0.9 × 0.85 ≈ 0.54. The combined probability is 1.34! The short explanation is this: because of a simplification in the segmentation component of Brill and Moore's model (a simplification they make as well), the two paths come from different chunkings, and this version of the model ignores the probability of a particular chunking to keep things simple.
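The arithmetic above can be checked directly. In the log semiring, weights of alternative paths combine so that the underlying probabilities add, which is how the total exceeds one; in the tropical semiring, only the minimum-cost path survives, so the result stays a legitimate probability. A small numeric check, using the table entries from the example:

```python
import math

# Costs are negative natural logs of the table probabilities.
whole_chunk = -math.log(0.8)                                   # 'a b c' -> 'a b c'
char_chunks = -math.log(0.7) - math.log(0.9) - math.log(0.85)  # per-character path

# Log semiring: probabilities of alternative paths ADD: 0.8 + 0.7*0.9*0.85.
log_semiring_prob = math.exp(-whole_chunk) + math.exp(-char_chunks)

# Tropical semiring: the minimum-cost path wins (here, the 0.8 path).
tropical_cost = min(whole_chunk, char_chunks)

print(round(log_semiring_prob, 4))         # 1.3355 -- greater than one!
print(round(math.exp(-tropical_cost), 4))  # 0.8
```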
Since the search space for correction is potentially very big, you may need to impose some restrictions to make the search feasible, depending on your implementation. One such restriction is to limit the number of character errors per token. Whether you use such restrictions, and which ones, is your decision.
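As one illustration of the error-limit idea, you could compose the search with a small filter transducer that counts mismatches: it has max_errors + 1 states, identity arcs stay in the current state for free, and substitution arcs advance the error count, so no accepted path contains more than max_errors character errors. This is only a sketch of one possible restriction (the function and its zero-cost arcs are our own; it ignores insertions/deletions and epsilon symbols):

```python
def edit_limit_filter(symbols, max_errors):
    """Emit AT&T-format text arcs for a substitution-counting filter:
    states 0..max_errors track how many mismatches have occurred.
    All states are final; match arcs (a:a) loop in place, and each
    mismatch arc (a:b, b != a) moves one state to the right."""
    lines = []
    for state in range(max_errors + 1):
        for a in symbols:
            lines.append(f"{state} {state} {a} {a} 0")  # match: no error
            if state < max_errors:
                for b in symbols:
                    if b != a:
                        # One substitution consumes one unit of the budget.
                        lines.append(f"{state} {state + 1} {a} {b} 0")
        lines.append(f"{state}")  # every state is final
    return lines

# A two-symbol alphabet with at most one error allowed.
for line in edit_limit_filter(["a", "b"], 1):
    print(line)
```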
In order to help with the processing time constraints, you are provided with two test sets. The first set, test.ocr, contains 733 tokens. The second set, small.test.ocr, is a subset of the first one and contains only 20 tokens. The tokens in the small set are selected from corrected tokens in a test run, so you may expect to correct all 20. You are free to return results for either set. One possible approach would be to work on the small set during development, and once you have your final system, correct the large set if you have enough time. The experiments on the small set should enable you to estimate whether you will have time to correct the big set.
For each line in the test file of your choice, you should output a line that contains the top ten most likely corrections along with their probabilities. Use the following format:
    Input:  token_1 token_2 . . .
    Output: [ correction_1_1 prob_1_1 ] [ ... ] [ correction_1_10 prob_1_10 ]
            [ correction_2_1 prob_2_1 ] [ ... ] [ correction_2_10 prob_2_10 ]
            . . .

Getting into the FSM mindset can be a challenge, especially for people who are used to the imperative mindset of most programming languages and algorithms. If you get stuck at any point during the assignment, feel free to send an e-mail to the class mailing list.
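A small helper for producing the bracketed output format might look like the sketch below. The function name and the six-decimal probability formatting are our own choices, not a requirement of the assignment:

```python
def format_corrections(corrections):
    """Format the output for one token: `corrections` is a list of up
    to ten (candidate, probability) pairs, best first, rendered as
    '[ candidate prob ]' groups separated by spaces."""
    return " ".join(f"[ {tok} {p:.6f} ]" for tok, p in corrections)

# Hypothetical candidates for a single token.
print(format_corrections([("chat", 0.42), ("char", 0.13)]))
# [ chat 0.420000 ] [ char 0.130000 ]
```

Per-token strings for the tokens on one input line can then be joined with spaces to form the corresponding output line.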
Please turn in a short writeup of the approach you took, together with your output file.