Spanish
Language Processing at University of Maryland: Building Infrastructure for
Multilingual Applications
University of Maryland, College Park, MD 20742
{clarac,bonnie,resnik}@umiacs.umd.edu
Abstract: We describe here our construction of lexical resources, tool creation,
building of an aligned parallel corpus, and an approach to automatic treebank
creation that we have been developing using Spanish data, based on projection
of English syntactic dependency information across a parallel corpus.
Introduction
NLP
researchers at the University of Maryland are currently working on the
construction of resources and tools for several multilingual applications, with
a focus on broad coverage machine translation (MT) and cross-language
information retrieval. We describe here our construction of lexical resources,
tool creation, building of an aligned parallel corpus, and an approach to
automatic treebank creation, which we have been developing using Spanish data,
based on projection of English syntactic dependency information across a
parallel corpus.
Creating lexical databases for Spanish
We have built two types of
lexical databases for Spanish: one that
is semantico-syntactic, based on a representation called Lexical Conceptual Structure (LCS), and one that is morphological,
based on Kimmo-style Spanish entries.
An LCS is a directed graph
with a root that reflects the semantics of a lexical item by a combination of
semantic structure and semantics content. LCS representations are both language
and structure independent; they were originally formulated by Jackendoff (1983,
1990) and have been used as interlingua in a number of machine translation
projects including UNITRAN and MILT (Dorr 1993; Dorr 1997).
The creation of a Spanish
LCS lexicon relied heavily on the existence of a large hand-generated database
of English LCS entries, which were ported over to Spanish LCS entries by means
of a bilingual lexicon and acquisition procedures as described in Dorr (1997).
Our Spanish morphological
lexicon was originally derived from a two-level Kimmo-based morphology system
(Dorr 1993). This lexicon contains 273 roots and 99 types of endings, with an
upper bound of possible morphological realizations of 27,027 (the product of
number of roots, multiplied by the number of endings).
The structure of the lexicon
consists of (a) root word entries, (b) continuation classes, and (c) endings.
(DEF-MORPH-ROOT language root features
(string-root1 continuation-class1 features1)
(string-root2 continuation-class2 features2))
For example, the Spanish
words ‘veo’ (‘I see’) and ‘visto’ (‘seen’), would have the following entries in
the lexicon:
(DEF-MORPH-ROOT Spanish VER [v]
(“ve” *ER-IRREG-6 NIL)
(“visto” NIL [perf-tns]))
We used this lexicon for English-to-Spanish query translation in several
cross-language information retrieval experiments. The results were presented at
the First International Conference on Language Resource Evaluation (LREC) in
Granada, Spain (Dorr and Oard 1998).
Applying
a Spanish LCS Lexicon in MT
We have experimented with an interlingual approach
to Spanish-English machine translation, using LCS representations as the
interlingua. In our most recent experiments in Spanish to English translation,
we have used LCS together with Abstract Meaning Representations (AMR) as
developed at USC/ISI (Langkilde and Knight, 1998a). AMRs are semantic-syntactic
language-specific representations.
After parsing the Spanish
sentence, we create a semantic representation (LCS), which is then transformed
into a syntactic-semantic representation of the target language sentence (AMR).
This representation serves as the input to Nitrogen, a generation tool
developed by USC/ISI (Langkilde and Knight, 1998a; Langkilde and Knight 1998b).
Nitrogen is responsible for (a) transforming the Spanish syntactic
representation into an English syntactic representation, (b) Creating a word by
generating all the possible surface orderings (linearizations) for the English
sentence, (c) Using a n-gram language model to choose the optimal
linearization, and finally (d) generating morphological realizations, i.e.
producing the surface form for the English sentence which corresponds to the
translation of the Spanish original sentence.
Acquiring
bilingual dictionary entries
In addition to building and applying the more
sophisticated LCS lexical representations, we have explored the automatic
acquisition of simple word-to-word correspondences from parallel corpora, based
on cross-language statistical association between word co-occurrences. The noisy, confidence-ranked bilingual
lexicons obtained in this way can be useful in porting LCS lexicons to new
languages, as described above, and are also useful by themselves in improving
dictionary-based cross language information retrieval (Resnik, Oard, and Levow,
2001).
Constructing an Aligned Corpus
Parallel corpora have emerged as a crucial resource
for acquiring and improving lexical resources such as bilingual lexicons, and
for developing broad coverage machine translation techniques. We have therefore devoted effort to
acquiring English-Spanish parallel text using traditional and less traditional
channels.
Collecting
Parallel Text
We have obtained parallel data in three ways. First, we have taken advantage of
community-wide corpus distribution channels, such as the Linguistic Data
Consortium (LDC), the European Language Resource Distribution Agency (ELDA) and
the Foreign Broadcast Information Service (FBIS). These sources provide data that are generally clean and often
aligned or easily alignable, and which have the advantage of being available in
common to a large community of researchers.
Second, we have collected
parallel text from the World Wide Web using the STRAND system for acquiring parallel
Web documents (Resnik, 1999). (One such
collection of Spanish-English documents is available, as a set of URL pairs, at
http://umiacs.umd.edu/~resnik/strand.) Data collected from the Web have the
advantage of great diversity in contrast to the often more domain- or
genre-specific forms of text available from standard sources; on the other
hand, they are also often of extremely diverse quality.
Third, we have obtained a
parallel English-Spanish version of the Bible as part of our general project
collecting freely available Bible versions and annotating their parallel
structure using the Corpus Encoding Standard (CES), as a parallel resource for
use in computational linguistics. Our
empirical studies of the Bible’s size and vocabulary coverage – using LDOCE and
the Brown Corpus for comparison – suggest that modern-language Bibles are a
surprisingly viable source of information about everyday language research
(Resnik, Olsen, and Diab, 1999). CES-annotated
parallel English and Spanish versions are available on the Web at http://umiacs.umd.edu/~resnik/parallel/.
In the work we describe
here, we have been focusing our development on the Spanish-English United
Nations Parallel Corpus, available from LDC, which has data generated from 1989
through 1991.
Aligning
the Text at the Sentence Level
The U.N. Parallel Corpus is already aligned at the
document level. Our alignment of the
corpus at lower levels uses a combination of existing tools and components we
have constructed.
As a first stage in below-document-level alignment,
we preprocess the text in order to obtain alignments at the paragraph level
using simple document structure.
HTML-style markup, indicating a number of within-text boundaries above
the sentence level, is introduced automatically on the basis of relevant cues
in the text. The resulting marked-up
document is passed to a structure-based alignment tool designed for use with
HTML documents (Resnik, 1999), which uses dynamic programming (Unix diff)
to generate an alignment between text chunks on the basis of
correspondences in markup. Because only
boundary markup is used, not content, the process is entirely language independent. Although the introduction of markup is
pattern-based and therefore somewhat heuristic, it succeeds well at avoiding
the introduction of spurious (intra-sentential) boundaries.
Tokenization
Our ultimate goal being word-level alignment, we
required tokenized text. We implemented a tokenizer for Spanish using a number
of Perl pattern matching rules, some of them adapted from the Spanish
Kimmo-style morphological analyzer (Dorr, 1993). In its current state, this tokenizer removes SGML tags, bad spacing
characters (tabs/spaces/ansi space/ etc.) and punctuation (in the case of
periods at the end of the sentence, it actually separates them from the
preceding word). It also merges over
2000 frequently co-occurring words that form fixed expressions, e.g. the tokens
in 'dentro de' will be merged into 'dentro_de'. Finally, it performs
morphological analysis. In the case of verbs, it uses 70 Perl substitution
rules in order to make sure that the accentuation patterns and spelling change
according to the resulting verb base form. For example, the first person
singular 'finjo' (I fake) becomes the infinitive 'fingir' and not
*'finjir'. This tokenizer has been used
in our initial dependency tree inference experiments for Spanish, described
below.
Aligning Text at the Word
Level
Once
the text has been reduced to aligned sentences, we train IBM statistical MT
models using software developed by
Al-Onaizan et al. (1999). The training process produces model parameters and,
as a side-effect, it produces the most likely word-level alignment for each
sentence pair in the training corpus.
Preliminary analysis of these alignments is what led us to move from an
extremely unsophisticated Spanish tokenizer to one that takes into account
morphology and frequent multi-word co-occurrences.
Statistical methods in NLP have led to major
advances, with supervised training methods leading the way to the greatest
improvements in performance on tasks such as part-of-speech tagging, syntactic
disambiguation, and broad-coverage parsing.
Unfortunately, the annotated data needed for supervised training are
available for only a small number of languages.
The University of Maryland
has recently begun a project in collaboration with Johns Hopkins University
aimed at breaking past this bottleneck.
A central idea in this effort is to take advantage of the rich resources
available for English, together with parallel corpora: the English side of a
parallel corpus is annotated using existing tools and resources, and the
results projected to the language on the other side using word-level alignments
as a bridge; finally supervised training is used to create tools that perform
well despite noise in the automatically annotated corpus. Yarowsky et al. (2001) have shown extremely
promising results of this annotation-projection technique for part-of-speech
tagging, named entities, and morphology, and at Maryland we have been focusing
on the challenges of projecting syntactic dependency relations.
Figure 1 shows our baseline
architecture, which includes not only the creation of a noisy treebank but also
its application in an end-to-end machine translation process. Briefly, a
word-aligned parallel corpus is created as discussed in the previous section. The English side is analyzed using Dekang
Lin’s Minipar parser (Lin, 1997), which produces syntactic dependencies, e.g.
indicating arguments of verbs, modifiers, etc.
Crucially, the resulting dependency representation is independent of
word order.
Projection of syntactic
dependencies relies on a fairly strong hypothesis: that major grammatical relations are preserved across
languages. Operationally, the transfer
process begins by assuming that if words e1 and e2 in
English correspond to s1 and s2 in Spanish, respectively,
and there is a dependency relation r between e1 and e2,
then r will hold between s1 and s2. For example, ‘black
cat’ in English corresponds to ‘gato negro’ in Spanish. Therefore the
relationship adjmod(cat,black) is transferred into the Spanish analysis as
adjmod(gato,negro). Notice that the
relationship abstracts away from word order.
These resulting representations constitute a noisy dependency treebank,
which we are using as the training set for Ratnaparkhi’s (1997) MXPOST POS
tagger and Collins’s (1997) stochastic parser.
As stated, the hypothesis of
direct dependency transfer is clearly false – indeed, the issue of divergences
in translation has been an important focus in our previous work (Dorr,
1993). However, we are optimistic that
cross-language correspondence of dependencies is a suitable starting point for
investigation on both theoretical and empirical grounds. Theoretically, grammatical relations are
closer than constituency relations to the thematic relationships underlying the
sentence meaning common to both sides of the translation pair; thus the
fundamental correspondences are likely to hold much of the time. Moreover, lexical dependencies have proven
to be instrumental in advances in monolingual syntactic analysis (e.g. Collins, 1997). These considerations distinguish our approach from Wu’s (2000) approach, which characterizes the
cross-language syntactic relationships using a non-lexicalized bilingual
grammar formalism.
Our second cause for
optimism is empirical: in preliminary efforts we have attempted the direct
dependency transfer approach with Spanish and Chinese, with bilingual speakers
and linguists inspecting the results. The
results of dependency transfer look promising, and the problems that are
evident so far tend to be linguistically interesting and amenable to
language-specific post-transfer processing.
As one example, English parses projected into Spanish will not lead to
useful dependencies involving the reflexive se when, as is often the case,
it has no lexically realized correspondent on the English side; post-processing
of the Spanish can be used to introduce a dependency relationship between the
verb and the reflexive morpheme. The
use of English-side information contrasts with the unsupervised
dependency-based translation models of Alshawi et al. (2000).
Figure 2 provides an
illustrative example using English and Basque, which have very different
linguistic properties. The figure shows
that the verb-subject, verb-object, and modification relationships (most
dependency labels suppressed) transfer directly to the Basque sentence (a
fluent translation in neutral word order).
The indirect object relationship is expressed in the English parse via prepositional
modification between ‘got’ and ‘for’, together with the relationship between
‘for’ and ‘brother’; on the Basque side the dative component of meaning and the
morpheme for ‘brother’ are conflated in the word ‘anaiari’; the resulting pattern of syntactic dependency links on
the Basque side can be post-processed, with the word-internal dependency being
converted into a lexical feature.
As an important part of our
initial efforts, we are developing rigorous evaluation criteria based on
precision and recall of dependency triples, using manually created dependencies
as a gold standard and using inter-annotator precision and recall to provide an
upper bound.
Analysis
and evaluation of MT output from existing systems (including Systran) reveals
that there is a great deal of work to be done to provide improved quality. We
are currently focusing our efforts on (a) providing linguistically motivated
knowledge to enhance our existing source-language parsing module; (b) using
additional knowledge about divergence categories to improve on alignments
between source- and target-language dependencies; and (c) conditioning
statistical translation components, including parsing to and generation from
dependency structures, on linguistic features not currently taken advantage of
in the traditional IBM-style models.
As one example, we take advantage of semantically
classed verbs (Dorr, 1997) to capture valence and other linguistic information
to improve parsing operations such as PP attachment. For example, all verbs in the class {arrange, immerse, install,
lodge, mount, place, position, put, set, situation, sling, stash, stow} take a
locative prepositional phrase as an argument; if our training data contains
only the most frequently occurring verbs in this class (such as `put’), we can
deduce, by association, that others (such as `sling’) have the same PP
attachment properties – and thus can improve parsing for these sparsely
occurring verbs.
As another example, stochastic alignment algorithms
are likely to map the English predicate `kick’ to the corresponding French
`coup’ (especially since the two words also co-occur as nouns in the absence of
`donner’ leaving the actual predicate `donner’ unaligned when we generate the
aligned dependency-tree database (to be described in the next section). This
just one instance of a more general phenomenon: languages sometimes package up
elements of meaning, particularly verb meaning, into different constituents
than English does (i.e., language divergences). To address this issue, we pre-process the English using the
semantically classed verbs, so that we automatically expand verbs in selected
divergence classes into alignable constituents.
A third example is the use of supervised word sense
disambiguation techniques in lexical selection. We have developed a set of tools for supervised WSD that uses a
combination of broad-window and local collocational features to represent
contexts for an ambiguous word. A
variety of classification algorithms can be used – we obtained promising
results for English, Spanish, and Swedish
in the recent SENSEVAL-2 evaluation exercise (Cotton et al., 2001) using
support vector machines for the classification process. We are currently investigating the adaptation
of this method to perform lexical selection, with the target English word
playing the same role as the sense tag for Spanish words.
Acknowledgements
The
authors are supported, in part, by PFF/PECASE Award IRI-9629108, DOD Contract
MDA904-96-C-1250, Office of Naval Research MURI Contract FCPO.81054826, and DARPA/ITO
Contracts N66001-97-C-8540 and N66001-00-28910.
H. Alshawi, B. Srinivas, and
S. Douglas. 2000. Learning dependency translation models as collections of finite
state head transducers. Computational Linguistics, 26.
Al-Onaizan et al.
1999., Statistical MT: Final Report, Johns Hopkins Summer Workshop 1999, CLSP
Technical report, Johns Hopkins University.
Michael Collins. 1997. Three
generative, lexicalized models for statistical
parsing. In Proceedings of the 35th Annual Meeting of the
ACL, Madrid, Spain.
S. Cotton, P. Edmonds, A.
Kilgarriff, and M. Palmer. 2001.
SENSEVAL-2: Second International Workshop on Evaluating Word Sense
Disambiguation Systems, Toulouse, July 2001.
Bonnie
J. Dorr. 1993. Machine
Translation: A View from the Lexicon. The MIT Press, Cambridge, MA.
Bonnie
J. Dorr. 1997. Large-Scale
Acquisition of LCS-based lexicons for foreign language tutoring. In Proceedings
of the ACL 5th Conference on Applied Natura lLanguage Processing (ANLP), Washington, DC.
Bonnie J. Dorr and Douglas W.
Oard. 1998. Evaluating resources for query translation in cross-language
information retrieval. In 1st International Conference on
Language Resource Evaluation (LREC), Granada, Spain.
Ray Jackendoff. 1983.
Semantics and Cognition. The MIT Press, Cambridge, MA.
Ray Jackendoff. 1990.
Semantic Structures. The MIT Press, Cambridge, MA.
Dekang Lin. 1997. Using
syntactic dependency as local context to resolve word sense ambiguity. In Proceedings
of the 35th ACL/EACL’97, Madrid, Spain, July.
Adwait Ratnaparkhi. 1996. A
Maximum entropy model for part-of-speech tagging. In Proceedings of the
Empirical Methods in Natural Language Processing Conference, University of
Pennsylvania.
Philip Resnik. 1999. Mining
the web for bilingual text. In 37th Annual Meeting of the
Association for Computational Linguistics (ACL’99), College Park, Maryland,
June.
Philip Resnik, Douglas Oard, and Gina Levow. 2001. Improved cross language retrieval using backoff translation. 2001. Proc. HLT’2001. San Diego.
Philip Resnik, Mari Broman
Olsen, and Mona Diab. 2000. The Bible
as a Parallel Corpus: Annotating ‘The
Book of 2000 Tongues’. Computers and
the Humanities, 33(1-2).
Jeffrey C. Reynar and Adwait
Ratnaparkhi. 1997. A Maximum entropy approach to identifying sentence
boundaries. Proceedings of the ACL 5th Conference on Applied
Natural Language Processing (ANLP), Washington, DC.
Dekai Wu. 2000. Alignment
using stochastic inversion transduction grammars. In Jean Veronis, editor, Parallel
Text Processing. Kluwer.
David Yarowsky, Grace Ngai, and Richard Wicentowsky,. 2001. Inducing Multilingual Text and Tools via Robust Projection Across Aligned Corpora. Proc. HLT’2001, San Diego.