What is the right representation for a natural language? A Markov chain? A stochastic branching process? A contingency table? Each such model describes a specific linguistic phenomenon, yet over the last forty years we have lacked a unified probabilistic framework for encoding language that simultaneously accounts for the local information inherent in Markov chain models, the hierarchical syntactic structure of sentences in stochastic branching processes, and the semantic content of documents in bag-of-words categorical mixture log-linear models. We recently proposed a latent maximum entropy principle that provides just such a framework for statistical language modeling. In this talk, I will describe this principle, which extends Jaynes' original maximum entropy principle to accommodate latent variables. I will present the problem formulation, its solution, and certain convergence properties. I will then show how to apply this machine learning technique to statistical language modeling in a principled way, using mixtures of exponential families that have rich expressive power. Finally, I will draw some conclusions and point out directions for future research.
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).