The vast amount of information now available in electronic form has led to increasing demand for applications that process natural language. (Example applications include machine translation, summarization and information extraction.) Accurate methods for parsing unrestricted text will almost certainly be a key component in these applications. Unfortunately, the traditional approach to syntactic analysis -- writing a grammar by hand -- has encountered two major problems. First, ambiguity: even moderate-length sentences often receive thousands of analyses, with no indication of which is correct. Second, coverage: constructing an exhaustive grammar of English has proved to be extremely difficult owing to the huge number of rules needed. In this talk I will describe my work on machine learning methods for parsing. A statistical model is trained on a corpus of sentences that have been annotated for syntactic structure. Competing analyses for a test sentence can then be ranked by their probability under the model; moreover, the most probable analysis can be found efficiently. I will show how careful design of the model can lead to linguistically motivated parameters and, crucially, to parameters that condition heavily on lexical information.
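As a rough illustration of what ranking competing analyses by probability can look like, here is a minimal sketch. It is not the speaker's model: it assumes a plain, unlexicalized probabilistic context-free grammar estimated by relative frequency from a three-tree toy treebank, whereas the talk concerns parameters that condition heavily on lexical information. The tree encoding (nested tuples), the function names and the example sentence are all hypothetical choices made for this sketch.

```python
from collections import Counter
from math import prod

# Assumed encoding: a tree is (label, child, child, ...);
# a leaf is (label, "word"). This is a toy format, not a treebank standard.

def rules(tree):
    """Yield every (parent, children) rule used in a tree."""
    label, *kids = tree
    if len(kids) == 1 and isinstance(kids[0], str):
        yield (label, (kids[0],))            # lexical rule: label -> word
        return
    yield (label, tuple(k[0] for k in kids))  # phrasal rule: label -> child labels
    for k in kids:
        yield from rules(k)

def train(treebank):
    """Relative-frequency estimate of P(children | parent) for each rule."""
    rule_counts = Counter(r for t in treebank for r in rules(t))
    parent_counts = Counter()
    for (parent, _), count in rule_counts.items():
        parent_counts[parent] += count
    return {r: c / parent_counts[r[0]] for r, c in rule_counts.items()}

def tree_prob(tree, model):
    """Probability of a tree: product of its rule probabilities (0 if a rule is unseen)."""
    return prod(model.get(r, 0.0) for r in rules(tree))

def rank(candidates, model):
    """Sort candidate analyses from most to least probable."""
    return sorted(candidates, key=lambda t: tree_prob(t, model), reverse=True)

# Two analyses of "she saw stars with telescopes" (prepositional phrase attachment).
pp = ("PP", ("P", "with"), ("NP", "telescopes"))
vp_attach = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars"), pp))
np_attach = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", ("NP", "stars"), pp)))
plain = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars")))

# Toy annotated corpus (hypothetical data, for illustration only).
model = train([vp_attach, plain, np_attach])
best = rank([np_attach, vp_attach], model)[0]
```

This sketch scores an explicit list of candidates, which does not scale to sentences with thousands of analyses. Finding the most probable analysis efficiently would typically rely on dynamic programming over shared substructure (chart parsing, for instance) rather than on enumeration; that is an assumption about the general technique, not a statement about the method presented in the talk.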
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Mari Broman Olsen (molsen@umiacs.umd.edu) or Philip Resnik (resnik@umiacs.umd.edu).