UMIACS Computational Linguistics Colloquium, March 4, 1999

Distributional Chunking of an English Corpus


Steve Finch


Thomson Labs, Rockville, MD


March 4, 1999, 4pm, AVW Room 2120


Using only an unannotated corpus of English, and no lexicon or other linguistically sophisticated resource, I present an algorithm that infers a simple finite-state grammar, which in turn produces a linguistically interesting "chunking" of parts of the corpus. From this chunking I show how notions of simple constituents such as noun phrase, prepositional phrase, verb phrase, and simple sentence can be derived naturally (as distinct from sequences of words that are non-constituents). I shall also show how applying various statistical analyses to the resulting chunks can derive more complex constituent-level structure.
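
The abstract does not spell out the algorithm, so the following is only a rough illustration of the kind of distributional evidence an unannotated corpus offers; it is not the speaker's method. It assumes a toy corpus, and the helper names (context_vectors, cosine) are hypothetical. Each word is described by counts of its immediate left and right neighbours, and words with similar neighbour profiles are treated as distributionally similar.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens):
    """Map each word to counts of its immediate left and right neighbours."""
    vectors = defaultdict(Counter)
    # Pad with boundary markers so the first and last words also get contexts.
    padded = ["<s>"] + tokens + ["</s>"]
    for i in range(1, len(padded) - 1):
        word = padded[i]
        vectors[word][("L", padded[i - 1])] += 1
        vectors[word][("R", padded[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus (an assumption for illustration only).
tokens = "the cat sat on the mat and the dog sat on the rug".split()
vectors = context_vectors(tokens)

print("cat ~ dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat ~ on: ", cosine(vectors["cat"], vectors["on"]))
```

In this toy corpus "cat" and "dog" occur between the same neighbours, so they score higher than "cat" and "on", which share no contexts. A real system would need far more data and a further step to turn such similarities into word classes and chunks.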


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Mari Broman Olsen (molsen@umiacs.umd.edu) or Philip Resnik (resnik@umiacs.umd.edu).