I present an algorithm that, using only an unannotated corpus of English and no lexicon or other linguistically sophisticated resource, infers a simple finite-state grammar; this grammar produces a linguistically interesting "chunking" of parts of the corpus. From this chunking, I show how notions of simple constituents such as noun phrase, prepositional phrase, verb phrase, and simple sentence can be derived naturally (as distinct from sequences of words that are non-constituents). I also show how applying various statistical analyses to the resulting chunks can yield more complex constituent-level structure.
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Mari Broman Olsen (molsen@umiacs.umd.edu) or Philip Resnik (resnik@umiacs.umd.edu).