UMIACS Computational Linguistics Colloquium, Jan 29, 2002

MALACH: Multilingual Access to Large spoken ArCHives


Bhuvana Ramabhadran


IBM


UMIACS Computational Linguistics Colloquium (joint with HCIL)

January 29, 2002,
3:45pm, AVW Room 3460


The principal goal of MALACH ("angel" in Hebrew) is to develop methods for improved access to large multilingual spoken archives by capitalizing on the unique characteristics (unconstrained natural speech) of the Survivors of the Shoah Visual History Foundation's (VHF) multimedia digital archive of oral histories. The multimedia digital archive collected by the VHF contains over 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32 languages. Four thousand of the interviews in English have been manually cataloged at great expense, producing an exceptional source of labeled training data. Automatic technologies presently have relatively limited capabilities, which must be dramatically enhanced if the full potential of digital archiving is to be realized. In this project, we seek to make just such a leap. Some of the research challenges include the automated transcription of emotional and heavily accented speech in multiple languages in the presence of spontaneous speech phenomena such as whispering and uncued language switching, followed by automated cataloging, indexing and retrieval.

MALACH is an NSF-funded joint effort with JHU and UMD. IBM's role in this project is to develop robust speech recognition and information retrieval technologies in English. This talk will present a more detailed overview of MALACH, along with preliminary ASR results and an analysis of the technical challenges ahead.


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).