UMIACS Computational Linguistics Colloquium, April 10, 2002

From Words to Corpora: Recognizing Translation


Noah Smith


Johns Hopkins University


April 10, 2002, 10am-11:30am, AVW Room 2120


I present a technique for discovering translationally equivalent texts. It consists of (a) a matching algorithm applied at two different levels of analysis and (b) a well-founded similarity score. The approach can be applied to any segmented, multilingual corpus using any kind of translation lexicon, so it adapts to many levels of multilingual resource availability. I show experimental results on two tasks: a search for matching thirty-word segments in a corpus where some segments are mutual translations, and classification of candidate pairs of web pages that may or may not be translations of each other. The latter results are competitive with previous approaches to the same problem that rely on document structure.
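To make the two-level idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method presented in the talk: the greedy linking, the particular score (linked word pairs divided by linked pairs plus unlinked words), the 0.5 threshold, and all function names are hypothetical choices made for this example. The lexicon is assumed to be a plain set of (source word, target word) pairs.

```python
from typing import List, Set, Tuple

Lexicon = Set[Tuple[str, str]]


def link_words(src: List[str], tgt: List[str], lexicon: Lexicon) -> List[Tuple[int, int]]:
    """Word level: greedily link each source token to at most one unused
    target token that the lexicon lists as its translation.
    (Greedy linking is an assumption; it stands in for any matching algorithm.)"""
    used_targets = set()
    links = []
    for i, s in enumerate(src):
        for j, t in enumerate(tgt):
            if j not in used_targets and (s, t) in lexicon:
                used_targets.add(j)
                links.append((i, j))
                break
    return links


def similarity(src: List[str], tgt: List[str], lexicon: Lexicon) -> float:
    """Hypothetical score: linked pairs / (linked pairs + unlinked tokens)."""
    k = len(link_words(src, tgt, lexicon))
    total = len(src) + len(tgt) - k  # k links plus tokens left unlinked on either side
    return k / total if total else 0.0


def match_segments(src_segs: List[List[str]], tgt_segs: List[List[str]],
                   lexicon: Lexicon, threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Segment level: the same greedy matching, now over segments,
    using the word-level score as the link weight."""
    scored = sorted(
        ((similarity(s, t, lexicon), i, j)
         for i, s in enumerate(src_segs)
         for j, t in enumerate(tgt_segs)),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining candidates score even lower
        if i not in used_src and j not in used_tgt:
            used_src.add(i)
            used_tgt.add(j)
            pairs.append((i, j, score))
    return pairs


# Toy English-French lexicon, invented for this example.
toy_lexicon = {("the", "le"), ("cat", "chat"), ("sleeps", "dort"), ("dog", "chien")}
english = [["the", "cat", "sleeps"], ["the", "dog", "runs"]]
french = [["le", "chien", "mange"], ["le", "chat", "dort"]]
print(match_segments(english, french, toy_lexicon))
```

On this toy input, the first English segment pairs with the second French segment and the second English segment with the first, since those pairs share the most lexicon entries. A pair-classification task like the web-page one could, under the same assumptions, threshold a score aggregated over the matched segments.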


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).