Language ID using n-gram models

Language identification

Language ID is the problem of taking a document in an unknown language and determining what language it's written in. This is frequently a necessary step before processing the document in other ways. It turns out that n-gram models are a simple and very effective way to perform language identification. You will write a program that does language ID using a bigram model. The program (or top-level LISP function, if you're writing in LISP) should take three arguments, infile1, infile2, and testfile, which will be filenames. The first two files will contain text in two languages, e.g. infile1 could contain English text and infile2 could contain French text. (We'll call these language1 and language2 just to keep things general.) The third argument identifies a file containing text in an unknown language. Next I describe what your program or function should do with its inputs.
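
For concreteness, if you write it in LISP, the top-level call might look something like the following, where language-id and the file names are just placeholders; call your function, and your files, whatever you like:

    (language-id "english-training.txt" "french-training.txt" "mystery.txt")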

What the program computes

  1. Using infile1, create a bigram frequency table for language1. How you represent bigrams and what data structure you use for the frequency table are up to you, but one easy way would be to create a hash table where each key is a character bigram (e.g. represented as a dotted pair in LISP) and the value for that key is the number of times you've seen that bigram. For example, if the file contained "there is the dog" then the character bigram <t,h> would have frequency 2, and the character bigram <SPACE,d> would have frequency 1. (LISP programmers will find the read-char function very helpful here.) You'll also need to record the total number of bigrams seen in infile1. (One way to organize all of this is sketched after step 5 below.)

  2. Do the same thing for language2 using infile2.

  3. Compute the probability for the string of characters in testfile, using a maximum likelihood estimate based on the bigram frequencies from infile1. That is, compute Pr(testfile|language1). (Or, if you prefer, compute the log probability instead.) So, for example, testfile might contain "Mary ran to the store.\nJohn likes Mary.\n", in which case you would be estimating the bigram-model probability of seeing that 40-character string based on the frequencies from infile1. (Here \n represents the newline character.)

  4. Compute the probability for the string of characters in testfile, using a maximum likelihood estimate based on the bigram frequencies from infile2. That is, compute Pr(testfile|language2). (Or, if you prefer, compute the log probability instead.)

  5. Output "Guess LANGUAGE1" or "Guess LANGUAGE2" depending on which conditional probability is greater, or "UNDECIDED" if the two are equal.
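
In case it helps to see how steps 1-5 fit together, here is a minimal Common Lisp sketch of one possible organization. The names (bigram-counts, log-score, language-id) and the convention of returning NIL to stand for a zero probability are illustrative choices of mine, not part of the assignment. The estimate used for each bigram b is simply count(b)/N, where count(b) is that bigram's frequency in the training file and N is the total number of bigrams in that file; the score for testfile is the sum of the logs of those estimates, i.e. the log of their product.

    ;;; A minimal sketch, not a required design; all names are illustrative.

    (defun bigram-counts (filename)
      ;; Return two values: a hash table mapping character bigrams
      ;; (dotted pairs) to counts, and the total number of bigrams seen.
      (let ((table (make-hash-table :test #'equal)) ; EQUAL so pairs match by contents
            (total 0)
            (prev #\Space)) ; treat the first character as if preceded by a space
        (with-open-file (in filename)
          (loop for ch = (read-char in nil nil)
                while ch
                do (incf (gethash (cons prev ch) table 0))
                   (incf total)
                   (setf prev ch)))
        (values table total)))

    (defun log-score (filename table total)
      ;; Log probability of the characters in FILENAME under the estimate
      ;; count(bigram)/TOTAL.  An unseen bigram makes the whole product zero,
      ;; so in that case NIL is returned (standing in for negative infinity).
      (let ((score 0.0)
            (prev #\Space))
        (with-open-file (in filename)
          (loop for ch = (read-char in nil nil)
                while ch
                do (let ((count (gethash (cons prev ch) table 0)))
                     (when (zerop count)
                       (return-from log-score nil))
                     (incf score (log (/ count total)))
                     (setf prev ch))))
        score))

    (defun language-id (infile1 infile2 testfile)
      ;; Build a bigram table from each training file, score the test file
      ;; under both, and print the guess.
      (multiple-value-bind (table1 total1) (bigram-counts infile1)
        (multiple-value-bind (table2 total2) (bigram-counts infile2)
          (let ((score1 (log-score testfile table1 total1))
                (score2 (log-score testfile table2 total2)))
            (cond ((and (null score1) (null score2)) (format t "UNDECIDED~%"))
                  ((null score2)     (format t "Guess LANGUAGE1~%"))
                  ((null score1)     (format t "Guess LANGUAGE2~%"))
                  ((> score1 score2) (format t "Guess LANGUAGE1~%"))
                  ((< score1 score2) (format t "Guess LANGUAGE2~%"))
                  (t                 (format t "UNDECIDED~%")))))))

With these definitions and the data files listed in the next section, a call like (language-id "GEN.EN" "GEN.FR" "test.1") should print Guess LANGUAGE1, since test.1 is English. How you handle a bigram that never occurred in the training file is up to you; returning NIL here is just one convention (see the notes below about logprobs and smoothing).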

Data to work with

For developing and testing your program, you can work with the following data files:
  1. ftp://umiacs.umd.edu/pub/resnik/ling645/bible/GEN.EN (The Book of Genesis in English)
  2. ftp://umiacs.umd.edu/pub/resnik/ling645/bible/GEN.FR (The Book of Genesis in French)
  3. ftp://umiacs.umd.edu/pub/resnik/ling645/bible/21999.1 (a long test document in English)
  4. ftp://umiacs.umd.edu/pub/resnik/ling645/bible/21999.2 (a long test document in French)
  5. ftp://umiacs.umd.edu/pub/resnik/ling645/bible/test.1 (a short test document in English)
  6. ftp://umiacs.umd.edu/pub/resnik/ling645/bible/test.2 (a short test document in French)

Notes and recommendations

  1. For your own sanity, do not do your development using the complete GEN.EN and GEN.FR files or the long test documents! Create (much) shorter files to use as infile1 and infile2 so you can make sure your code works. On my fairly fast machine, running my program using GEN.EN and GEN.FR can take a good minute or so. (The head command in Unix is useful for creating such files; see the example at the end of these notes.)

  2. An alternative, if you're clever, would be to write one program to create the two frequency tables and write them to disk files, and then have your language ID program read in the frequency tables rather than recomputing them each time. But that's not required.

  3. When creating a bigram frequency table based on a file, the first character is obviously special since there is no preceding character. The easiest thing to do is write your code so that the first character in a file is treated as if it were preceded by a space (character #\space in LISP).

  4. If you run into underflow problems, work with logprobs instead of probabilities.

  5. If you're up for doing something more realistic, modify your program so it uses smoothed probability estimates rather than maximum likelihood estimates. (Add-one smoothing is feasible, but Good-Turing would probably be too much to attempt given the time constraints unless you're very ambitious.)
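
If you do try the smoothed version, one simple way to apply add-one smoothing under the count/total scheme used above is to replace count(b)/N with (count(b)+1)/(N+V), where V is the number of possible bigram types; for instance, if you assume an alphabet of A distinct characters, V = A*A. A hypothetical helper in the same style as the earlier sketch (again, just an illustration, not the required approach):

    (defun smoothed-log-score (filename table total alphabet-size)
      ;; Like LOG-SCORE above, but with add-one (Laplace) smoothing, so
      ;; unseen bigrams get a small nonzero probability instead of zero.
      (let ((v (* alphabet-size alphabet-size)) ; number of possible bigram types
            (score 0.0)
            (prev #\Space))
        (with-open-file (in filename)
          (loop for ch = (read-char in nil nil)
                while ch
                do (let ((count (gethash (cons prev ch) table 0)))
                     (incf score (log (/ (+ count 1) (+ total v))))
                     (setf prev ch))))
        score))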
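
Returning to note 1: as a concrete illustration, a Unix command along the lines of

    head -n 200 GEN.EN > small.en

puts just the first 200 lines of GEN.EN into a new file small.en; the exact number of lines (and the file name) is of course up to you.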

Turn in

  1. Program listing.

  2. Printout showing the program running successfully. It's OK if it's only on the small data and/or test files, but what I'd really like to see are runs using test.1, test.2, 21999.1, and 21999.2 as testfile, with infile1=GEN.EN and infile2=GEN.FR.

  3. Answer this question: When you test on files 21999.1 and 21999.2 (with GEN.EN and GEN.FR as the infiles), why does the program report "UNDECIDED", with both probabilities equal to zero (or both logprobs equal to negative infinity)? You should be able to answer this question even without running the program, just by looking at the training and test data. (Note that in some LISP implementations you might get tiny, minuscule probabilities rather than probabilities exactly equal to zero.)

For less experienced programmers

If you really have no idea where to start, you can take a look at the following LISP code, a skeleton for one way of doing language ID using bigram models. Completing all the "insert code here" sections is one way to solve the problem. I recommend against using this skeleton unless you need to: my solution matches the way I think, not necessarily the way you think, and it runs the risk of confusing you more than helping you!