I present a technique for discovering translationally equivalent texts. It consists of (a) the application of a matching algorithm at two different levels of analysis and (b) a well-founded similarity score. This approach can be applied to any segmented, multilingual corpus using any kind of translation lexicon; it is therefore adaptable to many levels of multilingual resource availability. Experimental results are shown on two tasks: a search for matching thirty-word segments in a corpus where some segments are mutual translations, and classification of candidate pairs of web pages that may or may not be translations of each other. The latter results are competitive with previous, document-structure-based approaches to the same problem.
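To make the two-level idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method presented in the talk: the abstract does not specify the matching algorithm or the score, so the sketch substitutes a greedy one-to-one matching and a simple link-fraction score. The names `word_links`, `similarity`, and `match_segments`, the `threshold` parameter, and the toy lexicon are all hypothetical.

```python
def word_links(src_tokens, tgt_tokens, lexicon):
    """Word level: greedily link each source token to the first unused
    target token that the lexicon lists as a translation (assumption:
    the lexicon is a set of (source_word, target_word) pairs)."""
    used_tgt = set()
    links = []
    for i, s in enumerate(src_tokens):
        for j, t in enumerate(tgt_tokens):
            if j not in used_tgt and (s, t) in lexicon:
                links.append((i, j))
                used_tgt.add(j)
                break
    return links


def similarity(src_tokens, tgt_tokens, lexicon):
    """Stand-in score: the fraction of tokens on both sides that take
    part in a link, i.e. 2 * |links| / (|src| + |tgt|)."""
    total = len(src_tokens) + len(tgt_tokens)
    if total == 0:
        return 0.0
    return 2 * len(word_links(src_tokens, tgt_tokens, lexicon)) / total


def match_segments(src_segments, tgt_segments, lexicon, threshold=0.5):
    """Segment level: greedily pair segments one-to-one in order of
    decreasing similarity, ignoring pairs that score below threshold."""
    scored = sorted(
        ((similarity(s, t, lexicon), i, j)
         for i, s in enumerate(src_segments)
         for j, t in enumerate(tgt_segments)),
        key=lambda x: (-x[0], x[1], x[2]),
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i in used_src or j in used_tgt:
            continue
        pairs.append((i, j, score))
        used_src.add(i)
        used_tgt.add(j)
    return pairs


if __name__ == "__main__":
    # Toy English-French lexicon and segments, invented for illustration.
    lexicon = {("the", "le"), ("cat", "chat"), ("sleeps", "dort"), ("dog", "chien")}
    english = [["the", "cat", "sleeps"], ["the", "dog", "barks"]]
    french = [["le", "chien", "aboie"], ["le", "chat", "dort"]]
    for i, j, score in match_segments(english, french, lexicon):
        print(i, j, round(score, 3))
```

The greedy steps keep the sketch short; an optimal bipartite matching could be swapped in at either level without changing the overall structure.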
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).