Clusters have been a staple of language modeling research for almost as long as there has been language modeling research. I will present novel clustering techniques that allow us to create smaller models and to train maximum entropy models faster. First, I examine how to use clusters for language model compression, with a surprising result: I achieve my best results by first making the models larger using clustering, and then pruning them. This can reduce model size by a factor of three or more at the same perplexity. I then examine a novel way of using clustering to speed up maximum entropy training. Many people consider maximum entropy to be one of the more promising avenues of language model research, but large models are prohibitively expensive to train. I show how to use clustering to speed up training by up to a factor of 35 over standard techniques, while slightly improving perplexity. The same approach can be used to speed up some other learning algorithms that try to predict a very large number of outputs.
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).