(This is joint work with Martha Palmer, Nianwen Xue, and others.)
With growing interest in Chinese language processing, numerous NLP tools for Chinese (e.g., word segmenters, part-of-speech taggers, and parsers) have been developed all over the world. However, because no large-scale bracketed corpus is publicly available, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets, and bracketing guidelines, which makes comparisons among them difficult. As a first step towards addressing this issue, we have been preparing a 100,000-word bracketed corpus since late 1998 and plan to release it to the public this summer. In this talk, we will address two challenges in building the corpus: creating annotation guidelines and ensuring annotation accuracy. We will outline our methodology for guideline preparation and discuss some specific problems with segmentation and syntactic bracketing that we encountered while creating the guidelines. Our primary tool for evaluating annotation consistency is the Parseval software, which we use to compare annotations by different annotators during our Gold Standard preparation. We also use a grammar development tool called LexTract to check the accuracy of the annotation. We will briefly describe the main components of LexTract and its role in improving annotation accuracy. Two of the major applications of LexTract are extracting a Lexicalized Tree-Adjoining Grammar (LTAG) from an annotated corpus and building derivation trees (similar to dependency trees) for the sentences in the corpus. We have run LexTract on the Penn English Treebank and our Chinese Treebank, and will discuss the results.
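To make the annotator comparison concrete, the following is a minimal sketch of the labeled-bracket precision and recall measures commonly associated with Parseval-style evaluation. It is not the software used in this work: the function names parse_brackets and parseval are hypothetical, the two example trees are invented for illustration, and the sketch assumes every bracket carries a label and ignores the extra normalization steps that real evaluation tools apply.

```python
from collections import Counter


def parse_brackets(text):
    """Collect labeled spans (label, start, end) from a Penn-style bracketed tree.

    Assumption: every "(" is followed by a label. Preterminal nodes
    (a part-of-speech tag directly over a word) are not counted.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []  # entries are [label, start_word_index, has_child_bracket]
    pos = 0     # number of words consumed so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1][2] = True  # the parent now has a child constituent
            stack.append([tokens[i + 1], pos, False])
            i += 2  # skip the bracket and its label
        elif tok == ")":
            label, start, has_child = stack.pop()
            if has_child:
                spans[(label, start, pos)] += 1
            i += 1
        else:
            pos += 1  # a word
            i += 1
    return spans


def parseval(reference, candidate):
    """Return (precision, recall, F1) of candidate brackets against reference."""
    ref = parse_brackets(reference)
    cand = parse_brackets(candidate)
    matched = sum((ref & cand).values())  # multiset intersection
    n_ref = sum(ref.values())
    n_cand = sum(cand.values())
    precision = matched / n_cand if n_cand else 0.0
    recall = matched / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: two annotators bracket the same three-word sentence.
annotator_a = "(IP (NP (PN 他)) (VP (VV 喜欢) (NP (NN 书))))"
annotator_b = "(IP (NP (PN 他)) (VP (VV 喜欢) (NN 书)))"
precision, recall, f1 = parseval(annotator_a, annotator_b)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this invented pair, the second annotator omits the object NP bracket, so all of that annotator's brackets are found in the first annotation while one bracket from the first annotation goes unmatched. Disagreements of this kind are what a consistency check between annotators is meant to surface.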
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).