UMIACS Computational Linguistics Colloquium, Nov 13, 2000

A Unified Approach to Statistical Language Modeling for Chinese
and
A Block-Based Robust Dependency Parser for Unrestricted Chinese Text


Dr. Jianfeng Gao and Dr. Ming Zhou


Microsoft Research China


Special day/time: Nov 13, 2000, 10am, AVW Room 2120


ABSTRACTS

Talk 1: A Unified Approach to Statistical Language Modeling for Chinese
Speaker: Dr. Jianfeng Gao, Microsoft Research China

In this talk, I present a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigrams to Chinese is challenging because (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, and segments the training data using this lexicon, all using a maximum likelihood principle, which is consistent with the trigram training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
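As a rough illustration of what segmenting text under a maximum likelihood principle can look like, here is a minimal sketch. It is not the speaker's system: it assumes a unigram word model rather than the trigram model the abstract mentions, and the lexicon entries, their probabilities, and the names LEXICON and segment are all hypothetical.

```python
import math

# Hypothetical toy lexicon with made-up unigram probabilities
# (an assumption for illustration; not taken from the talk).
LEXICON = {
    "研究": 0.3,
    "研究生": 0.1,
    "生命": 0.2,
    "生": 0.05,
    "命": 0.05,
    "起源": 0.3,
}


def segment(text, lexicon, max_word_len=4):
    """Return the segmentation of `text` whose words have the highest
    product of lexicon probabilities, or None if no segmentation exists."""
    n = len(text)
    # best[i] = (best log-probability for text[:i], start index of its last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in lexicon and best[start][0] > -math.inf:
                score = best[start][0] + math.log(lexicon[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the lexicon cannot cover the whole string
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


print(segment("研究生命起源", LEXICON))
```

With these made-up probabilities, the ambiguous prefix 研究生 is split as 研究 + 生命 because that path has the higher overall likelihood. A trigram version would score each word given its two predecessors instead of scoring words independently.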

Talk 2: A Block-Based Robust Dependency Parser for Unrestricted Chinese Text
Speaker: Dr. Ming Zhou, Microsoft Research China

This talk introduces a practical system for Chinese parsing that uses a hybrid model of phrase-structure partial parsing and dependency parsing, called "block-based dependency parsing." The system has shown good performance and high robustness in parsing unrestricted text and has been applied in a successful Chinese-Japanese machine translation product. For an input sentence, the basic components of the sentence, i.e., "blocks," are first identified by an ATN-like partial parsing procedure, which produces a clear skeleton of the sentence structure. In our phrase-structure analysis, we do not try to reduce the whole sentence to the root S; instead, we only try to identify its components, namely the blocks. This partial parsing strategy guarantees high robustness. Dependency parsing is then applied to build dependency relations among the blocks. The dependency parser skips any ungrammatical portions it encounters, which confines each ungrammatical portion and prevents errors from propagating globally. With partial parsing and the skip strategy, the parser can handle long, complicated, or even faulty sentences. Experiments show that the parser is very robust and powerful. A parser based on this approach has been developed, with 220,000 words, 5,000 part-of-speech tagging rules, over 1,000 block parsing rules, and 300 dependency parsing rules.
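The two-stage idea described above (find blocks first, then link blocks while skipping what cannot be analyzed) can be sketched in a few lines. This is only a toy under stated assumptions, not the speaker's parser: the tag set, the block types NP/VP/UNK, the nearest-verb attachment rule, and the function names find_blocks and link_blocks are all hypothetical, and an English example is used for readability.

```python
# Hypothetical coarse tags that may appear inside a nominal block.
NOMINAL = {"DET", "ADJ", "NOUN"}


def find_blocks(tagged):
    """Stage 1 (partial parsing): group (word, tag) pairs into blocks.
    Material that fits no pattern becomes an 'UNK' block instead of
    causing the whole analysis to fail."""
    blocks = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag in NOMINAL:
            j = i
            while j < len(tagged) and tagged[j][1] in NOMINAL:
                j += 1
            blocks.append(("NP", [w for w, _ in tagged[i:j]]))
            i = j
        elif tag == "VERB":
            blocks.append(("VP", [word]))
            i += 1
        else:
            blocks.append(("UNK", [word]))
            i += 1
    return blocks


def link_blocks(blocks):
    """Stage 2 (dependency parsing over blocks): attach each NP block to
    the nearest VP block. UNK blocks are skipped, so they receive no arcs
    and do not disturb the attachments around them."""
    verb_positions = [i for i, (kind, _) in enumerate(blocks) if kind == "VP"]
    arcs = []
    for i, (kind, _) in enumerate(blocks):
        if kind != "NP" or not verb_positions:
            continue
        head = min(verb_positions, key=lambda v: abs(v - i))
        label = "subj" if i < head else "obj"
        arcs.append((head, i, label))  # (head block index, dependent block index, label)
    return arcs


tagged = [("the", "DET"), ("parser", "NOUN"), ("@@", "SYM"),
          ("handles", "VERB"), ("long", "ADJ"), ("sentences", "NOUN")]
blocks = find_blocks(tagged)
print(blocks)
print(link_blocks(blocks))
```

In this example the stray token "@@" ends up in its own UNK block and is ignored during linking, so the subject and object are still attached to the verb. That is the locality property the abstract attributes to the skip strategy.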


About the speakers:

Dr. Jianfeng Gao joined Microsoft Research China as an Associate Researcher in January 1999. He received his B.S. in Industrial Design from Shanghai Jiaotong University (SJTU) in 1993, his M.S. in Computer Science in 1996, and his Ph.D. in Computer Science in 1999. While working towards his Ph.D. at SJTU, he developed BYLCAD, the first-ever commercial CAD system. Dr. Gao is the author of more than 30 papers in his field. His research interests include statistical language modeling, information retrieval, machine translation, speech recognition, natural language processing, and intelligent CAD.

Dr. Ming Zhou received his B.S. degree in computer engineering from Chongqing University in 1985, and his M.S. degree and Ph.D. in computer science from Harbin Institute of Technology in 1988 and 1991, respectively. He did post-doctoral work at Tsinghua University from 1991 to 1993, when he became an associate professor there. He visited the Chinese University of Hong Kong as a research associate in 1985 and the City University of Hong Kong as a research fellow in 1986. Between November 1996 and March 1999, he worked for Kodensha Ltd. Co. in Japan as the leader of the Chinese-Japanese machine translation project that released the "J-Beijing" commercial software in 1998. Between April and August 1999, he led the NLP research group of the Department of Computer Science, Tsinghua University. He designed the CEMT-I machine translation system, the first experiment in Chinese-English machine translation in mainland China. Dr. Zhou joined Microsoft Research China in September 1999 as a researcher and project leader in charge of projects in the natural language computing group, including machine-aided translation and authoring, cross-language information retrieval, and Chinese spelling checking.


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speakers, please contact Philip Resnik (resnik@umiacs.umd.edu).