
Recent Research Projects


DARPA GALE: The SRI Nightingale Team

Mary Harper (PI, University of Maryland subcontract)

This work has been supported by DARPA.

The goal of the GALE (Global Autonomous Language Exploitation) program is to develop and apply computer software technologies to absorb, analyze, and interpret huge volumes of speech and text in multiple languages, reducing the need for linguists and analysts and automatically providing relevant, distilled, actionable information to military command and personnel in a timely fashion. Automatic processing "engines" will convert and distill the data, delivering pertinent, consolidated information in easy-to-understand forms to military personnel and monolingual English-speaking analysts in response to direct or implicit requests.

SRI's team, Nightingale, is one of three GALE teams working on machine translation and distillation. The goal of Harper's UMD group is to provide structural metadata support for the MT and distillation efforts. We have developed methods that significantly improve Chinese parsing and tagging accuracy, and we have been investigating semi-supervised methods to improve the accuracy of word segmentation and parsing for Chinese, which is important because labeled data for Chinese speech genres is in short supply. We have also explored the efficacy of SParseval scores, which were designed to evaluate parses whose words and word segmentations differ from those of the gold trees, as an alternative optimization metric for MT rescoring. Finally, we are developing novel language modeling techniques that integrate the latent word class information learned when we train our parsers for a particular language and genre.
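As a concrete illustration of the semi-supervised direction, the sketch below shows a generic self-training loop, one common recipe for exploiting unlabeled transcripts; the Tagger interface and the confidence-based selection rule are illustrative assumptions, not the group's actual models.

    # A minimal self-training sketch, assuming a hypothetical Tagger with
    # fit() and tag_with_confidence(); the actual models and selection
    # criteria used in this project are described in the papers below.
    def self_train(tagger, labeled, unlabeled, threshold=0.99, rounds=3):
        train = list(labeled)                      # (sentence, tags) pairs
        for _ in range(rounds):
            tagger.fit(train)
            confident, rest = [], []
            for sent in unlabeled:
                tags, conf = tagger.tag_with_confidence(sent)
                # Keep only sentences the current model tags confidently.
                (confident if conf >= threshold else rest).append((sent, tags))
            if not confident:
                break
            train += confident                     # grow the training set
            unlabeled = [sent for sent, _ in rest]
        return tagger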

References:

[1] D. Hillard, M. Hwang, M. Harper, and M. Ostendorf, ``Parsing-based Objective Functions for Speech Recognition in Translation Applications,'' in Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, NV, March 30-April 4, 2008.

[2] M. Ostendorf, B. Favre, R. Grishman, D. Hakkani-Tur, M. Harper, D. Hillard, J. Hirschberg, H. Ji, J. G. Kahn, Y. Liu, S. Maskey, E. Matusov, H. Ney, A. Rosenberg, E. Shriberg, W. Wang, and C. Wooters, ``Speech Segmentation and its Impact on Spoken Document Processing,'' IEEE Signal Processing Magazine, 2008.

[3] W. Wang, Z. Huang, and M. Harper, ``Semi-supervised Learning for Part-of-Speech Tagging of Mandarin Transcribed Speech,'' in Proceedings of the 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, Hawaii, 2007.

[4] D. Filimonov and M. P. Harper, ``Recovery of Empty Nodes in Parse Structures,'' in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, June 28-30, 2007.

[5] Z. Huang, M. P. Harper, and W. Wang, ``Mandarin Part-of-Speech Tagging and Discriminative Reranking,'' in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, June 28-30, 2007.

[6] D. Hillard, Z. Huang, H. Ji, R. Grishman, D. Hakkani-Tur, M. Harper, M. Ostendorf, W. Wang, ``Impact of Automatic Comma Prediction on POS/Name Tagging of Speech,'' in Proceedings of the First Workshop on Spoken Language Technology (SLT), Aruba, December 10-13, 2006.


Collaborative Research: Landmark-based Robust Speech Recognition Using Prosody-Guided Models of Speech Variability

Carol Espy-Wilson (PI) and Mary P. Harper (Co-PI)

Collaborative investigators: Abeer Alwan (UCLA), Jennifer Cole (UIUC), Mark Hasegawa-Johnson (UIUC), Louis Goldstein (Yale), Elliot Saltzman (BU)

This work has been supported by the National Science Foundation.

Despite great strides in the development of automatic speech recognition technology, we do not yet have a system whose performance is comparable to that of humans in automatically transcribing unrestricted conversational speech that represents many speakers and dialects and is embedded in adverse acoustic environments. This project applies new high-dimensional machine learning techniques, constrained by empirical and theoretical studies of speech production and perception, to learn from data the information structures that human listeners extract from speech. To do this, we will develop large-vocabulary, psychologically realistic models of speech acoustics, pronunciation variability, prosody, and syntax by deriving knowledge representations that reflect those proposed for human speech production and perception, and by using machine learning techniques to adjust the parameters of all knowledge representations simultaneously in order to minimize the structural risk of the recognizer. The team will develop nonlinear acoustic landmark detectors and pattern classifiers that integrate auditory-based signal processing and acoustic-phonetic processing, are invariant to noise, changes in speaker characteristics, and reverberation, and can be learned in a semi-supervised fashion from labeled and unlabeled data. In addition, the team will use variable frame rate analysis, which allows for multi-resolution analysis, and will implement lexical access based on articulatory gesture, using a variety of training data.

The work will improve communication and collaboration between people and machines and will also improve our understanding of how humans produce and perceive speech. It brings together a team of experts in speech processing, acoustic phonetics, prosody, gestural phonology, statistical pattern matching, language modeling, and speech perception, with faculty spanning engineering, computer science, and linguistics. The project also supports and engages students and postdoctoral fellows in speech modeling and algorithm development. Finally, the proposed work will result in a set of databases and tools that will be disseminated to serve the research and education community at large.


Hierarchical Perceptual Organization with the Center-Surround Algorithm

Jeff Siskind (PI), Charles Bouman (Co-PI), Ilya Pollak (Co-PI), Mary Harper

This work has been supported by the National Science Foundation under grant IIS-0329156.

The objective of this research is to develop a general analytic and algorithmic framework for multidimensional probabilistic context-free grammars (PCFGs) that can be used to model the hierarchical structure of images and other multidimensional data sets. This framework extends the notion of a PCFG from 1D word strings to 2D image data and similarly extends the inside-outside algorithm to support training, classification, and parsing of 2D image data with these extended PCFGs. The extended framework is called spatial random trees (SRTs), and the extended algorithm is called the center-surround algorithm. The framework is both sound and efficient because of a novel notion of constituency that constrains the allowable ways to partition a parent segment into child subsegments during parsing.
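For orientation, the sketch below gives the standard 1D inside pass for a PCFG in Chomsky normal form; SRTs generalize exactly this sum, replacing the split points of a word string with admissible partitions of a 2D region. The grammar encoding is an illustrative assumption, and this is not the center-surround algorithm itself.

    from collections import defaultdict

    # 1D inside pass for a PCFG in Chomsky normal form.
    # lexical: {(A, word): P(A -> word)}; binary: {(A, B, C): P(A -> B C)}
    def inside_probability(words, lexical, binary, start="S"):
        n = len(words)
        inside = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):          # width-1 spans: lexical rules
            for (A, word), p in lexical.items():
                if word == w:
                    inside[i][i + 1][A] += p
        for width in range(2, n + 1):          # wider spans: sum over splits,
            for i in range(n - width + 1):     # the sum that SRTs redefine
                j = i + width                  # over 2D region partitions
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        inside[i][j][A] += p * inside[i][k][B] * inside[k][j][C]
        return inside[0][n][start]             # P(start derives the string)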

This research has great intellectual merit because it forms a fundamental basis for:

  • Inferring semantically meaningful hierarchical structure from low-level image properties such as edge saliency and region shape, color, texture, and relative position.
  • Discovering the common hierarchical structure shared by a collection of natural images in an unsupervised fashion from unlabeled training data.
  • Distinguishing between different natural image-scene classes on the basis of global hierarchical structure, rather than local low-level features.

This research achieves broad impact by addressing a problem that is shared among a wide array of applications in a variety of technical fields. In particular, the project will:

  • Extend the SRT framework so that it can accurately model the geometric relations between constituents in hierarchical structures. This will enhance the value of SRTs in high-level modeling of images.
  • Develop tools for combining SRT models and for merging SRT models with other available data models. This will provide a general framework for improving both the speed and the accuracy of the methods.
  • Explore the use of SRTs as a distance metric for classifying high-dimensional data. This opens the techniques to potential applications such as Web clustering.
  • Develop a unified approach to combined spatial and temporal parsing of video. These new methods can support both video indexing and surveillance tasks.
  • Develop novel approaches for the parsing and recognition of images. These can be useful in applications such as the analysis of printed information, the monitoring of surveillance video, or the analysis of medical imagery.

References:

[1] J.M. Siskind, J. Sherman, I. Pollak, M.P. Harper, and C.A. Bouman, ``Spatial Random Tree Grammars for Modeling Hierarchal Structure in Images,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1504-1519, September 2007.

[2] W. Wang, I. Pollak, T.-S. Wong, C.A. Bouman, M.P. Harper, and J.M. Siskind, ``Hierarchical Stochastic Image Grammars for Classification and Segmentation,'' IEEE Transactions on Image Processing, 15(10):3033-3052, October 2006.

[3] W. Wang, T.-S. Wong, I. Pollak, C.A. Bouman, and M.P. Harper, ``Modeling Hierarchical Structure of Images with Stochastic Grammars,'' in Computational Imaging IV, Proceedings of SPIE, Volume EI114, the Conference on Computational Imaging, IS&T/SPIE 18th Annual Symposium on Electronic Imaging Science and Technology, San Jose, CA, January 15-19, 2006.

[4] W. Wang, I. Pollak, M.P. Harper, and C.A. Bouman, ``Spatial Random Trees with Applications to Image Classification,'' in Proceedings of the Conference on Mathematical Methods in Pattern and Image Analysis, SPIE Optics & Photonics 2005 Symposium, San Diego, CA, July 31-August 4, 2005.

[5] W. Wang, I. Pollak, C.A. Bouman, and M.P. Harper, ``Classification of Images Using Spatial Random Trees,'' in Proceedings of the IEEE Statistical Signal Processing Workshop, Bordeaux, France, June 17-20, 2005.

[6] I. Pollak, J.M. Siskind, M.P. Harper, and C.A. Bouman, ``Modeling of Images with Spatial Random Trees,'' presented at the SIAM Conference on Imaging Science, Salt Lake City, UT, May 3-5, 2004.

[7] B. Bitlis, X. Feng, J.L. Harris, I. Pollak, C.A. Bouman, M.P. Harper, and J.P. Allebach, ``A Hierarchical Document Description and Comparison Method,'' in Proceedings of the Society for Imaging Science and Technology IS&T Archiving Conference, San Antonio, TX, April 20-23, 2004.

[8] I. Pollak, J.M. Siskind, M.P. Harper, and C.A. Bouman, ``Spatial Random Trees and the Center-Surround Algorithm,'' Technical Report TR-ECE-03-03, Purdue University, School of Electrical and Computer Engineering, January 2003.

[9] I. Pollak, J.M. Siskind, M.P. Harper, and C.A. Bouman, ``Multiscale Random Tree Models,'' IS&T/SPIE 15th Annual Symposium on Electronic Imaging, Santa Clara, CA, January 21-24, 2003.

[10] I. Pollak, J.M. Siskind, M.P. Harper, and C.A. Bouman, ``Modeling and Estimation of Spatial Random Trees with Application to Image Classification,'' in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. III, Hong Kong, April 6-12, 2003, pp. 305-308.

[11] I. Pollak, J.M. Siskind, M.P. Harper, and C.A. Bouman, ``Parameter Estimation for Spatial Random Trees Using the EM Algorithm,'' in Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain, September 2003, pp. 257-260.

[12] J.M. Siskind, I. Pollak, M.P. Harper, and C.A. Bouman, ``Stochastic Grammars for Images on Arbitrary Graphs,'' IEEE Workshop on Statistical Signal Processing, St. Louis, MO, September 28-October 1, 2003.


Parsing and Spoken Structural Event Detection

Mary Harper (Team Leader)

This work has been supported by the Center for Language and Speech Processing with National Science Foundation sponsorship.

Even though speech recognition accuracy has improved significantly over the past 10 years, these systems do not currently generate or model structural information (metadata) such as sentence boundaries (e.g., periods) or the form of a disfluency. For example, in "I want [to go] * {I mean} meet with Fred," the phrase "to go" is an edit, which is signaled by an interruption point (indicated as *) and an edit term ("I mean"). Automatic detection of these phenomena would simultaneously improve parsing accuracy and provide a mechanism for cleaning up transcriptions for downstream text processing modules. Similarly, constraints imposed by text processing systems such as parsers can be used to assign certain types of metadata for correct identification of disfluencies.
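To make the cleanup step concrete, the toy sketch below deletes an annotated edit region, interruption point, and edit term, leaving the fluent string that downstream modules expect; real systems must predict these spans rather than read them from markup, so the bracket notation here simply follows the example above.

    import re

    # Toy cleanup of an annotated disfluency (a sketch, not a real
    # system's output format).
    def clean_disfluency(annotated):
        cleaned = re.sub(r"\[[^\]]*\]\s*\*\s*\{[^}]*\}", "", annotated)
        return re.sub(r"\s+", " ", cleaned).strip()

    print(clean_disfluency("I want [to go] * {I mean} meet with Fred"))
    # -> "I want meet with Fred"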

The goal of this workshop was to investigate the enrichment of speech recognition output using parsing constraints and the improvement of parsing accuracy due to speech recognition enrichment. We investigated the following questions: (1) How does the incorporation of syntactic knowledge affect sentence boundary and disfluency detection accuracy? [1,2] (2) How does the availability of more accurate sentence boundaries and disfluency annotation affect parsing accuracy? [1,3,5] This workshop project was interdisciplinary, bringing together researchers from the speech recognition and natural language processing communities. The Fisher treebank was created to support this workshop [4], and a tool was developed to evaluate parse accuracy when the words and sentence segmentations differ from the gold standard [3].
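The sketch below computes ordinary PARSEVAL-style bracket F1, the quantity that SParseval generalizes when hypothesis words and segmentations differ from the gold standard (via a word alignment, omitted here); it is an illustration, not the released tool.

    # Bracket scoring sketch; brackets are (label, start, end) tuples.
    def bracket_f1(gold_brackets, hyp_brackets):
        gold, hyp = set(gold_brackets), set(hyp_brackets)
        matched = len(gold & hyp)
        precision = matched / len(hyp) if hyp else 0.0
        recall = matched / len(gold) if gold else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
    hyp = [("S", 0, 5), ("NP", 0, 3), ("VP", 2, 5)]   # one misplaced bracket
    print(bracket_f1(gold, hyp))                      # ~0.667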

References:

[1] M. Harper, B. Dorr, J. Hale, B. Roark, I. Shafran, M. Lease, Y. Liu, M. Snover, L. Yung, A. Krasnyanskaya, and R. Stewart, ``2005 Johns Hopkins Summer Workshop Final Report on Parsing and Spoken Structural Event Detection,'' Technical Report, 2005.

[2] B. Roark, Y. Liu, M. Harper, R. Stewart, M. Lease, M. Snover, I. Shafran, B. Dorr, J. Hale, A. Krasnyanskaya, and L. Yung, ``Reranking for Sentence Boundary Detection in Conversational Speech,'' in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 15-19, 2006.

[3] B. Roark, M.P. Harper, E. Charniak, B. Dorr, M. Johnson, J. Kahn, Y. Liu, M. Ostendorf, J. Hale, A. Krasnyanskaya, M. Lease, I. Shafran, M. Snover, R. Stewart, and L. Yung, ``SParseval: Evaluation Metrics for Parsing Speech,'' in Proceedings of the Language Resources and Evaluation Conference, Genoa, Italy, May 24-26, 2006.

[4] A. Bies, S. Strassel, H. Lee, K. Maeda, S. Kulick, Y. Liu, M. Harper, and M. Lease, ``Linguistic Resources for Speech Parsing,'' in Proceedings of the Language Resources and Evaluation Conference, Genoa, Italy, May 24-26, 2006.

[5] J. Hale, I. Shafran, L. Yung, B. Dorr, M. Harper, A. Krasnyanskaya, M. Lease, Y. Liu, B. Roark, M. Snover, and R. Stewart, ``PCFGs with Syntactic and Prosodic Indicators of Speech Repairs,'' in Proceedings of the Joint Meeting of the International Conference on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL), Sydney, Australia, July 2006.


Cross Modal Analysis of Signal and Sense: Multimedia Corpora and Computational Tools for Gesture, Speech, and Gaze Research

Mary Harper (PI, Purdue subcontract)

This work has been supported by the National Science Foundation and ARDA.

Human gesture, speech, and gaze function as an integrated whole, no part of which can be fully understood in isolation [1]. This multi-disciplinary, multi-institutional project investigated the fusion of visual and spoken information. People utilize a wide variety of knowledge sources, such as acoustic, lexical, syntactic, prosodic, semantic, contextual, and visual information, in order to understand speech. Past research has often focused on using knowledge sources related to the speech signal together with linguistic information such as syntax and semantics.

Speech and gesture have been shown to exhibit a synchronous relationship in human communication, and incorporating gesture is hypothesized to improve accuracy in detecting phrase, sentence, and discourse boundaries. A combination of stochastic and symbolic techniques was used to build models that combine linguistic, acoustic, and visual information to improve the accuracy of a computer language understanding system with visual and spoken inputs.
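As an illustration of this kind of model combination, the sketch below trains a maximum-entropy (logistic regression) combiner over lexical, prosodic, and gesture cues at each word boundary; the feature names and numbers are invented for illustration, and the actual feature sets are described in the papers below.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per word boundary; features are illustrative assumptions:
    # [pause duration (s), pitch reset, gestural hold, LM boundary prob]
    X = np.array([
        [0.62, 1.0, 1.0, 0.91],   # long pause + hold: likely SU boundary
        [0.05, 0.0, 0.0, 0.12],
        [0.30, 1.0, 0.0, 0.55],
        [0.02, 0.0, 1.0, 0.08],
    ])
    y = np.array([1, 0, 1, 0])    # 1 = sentence unit (SU) boundary

    model = LogisticRegression().fit(X, y)
    print(model.predict_proba([[0.45, 1.0, 1.0, 0.70]])[0, 1])  # P(boundary)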

We have investigated the detection, recognition, and understanding of audio/video events associated with formal and informal meetings. Meetings are gatherings of people for the purpose of communication, and that communication may have various purposes: planning, conflict resolution, negotiation, collaboration, confrontation, etc. Our ultimate aim was to support efficient querying, retrieval, and browsing of meeting video.

References:

[1] D. McNeill, ``Growth Points, Catchments, and Contexts,'' Cognitive Studies: Bulletin of the Japanese Cognitive Science Society, 7(1):22-36, 2000.

[2] D. McNeill, S. Duncan, A. Franklin, J. Goss, I. Kimbara, F. Parrill, H. Welji, L. Chen, M. Harper, and F. Quek, ``Mind-Merging,'' in E. Morsella (ed.), Expressing Oneself / Expressing One's Self: Communication, Language, Cognition, and Identity, London: Taylor and Francis, to appear.

[3] L. Chen, M. P. Harper, and Z. Huang, ``Using Maximum Entropy (ME) Model to Incorporate Gesture Cues for SU Detection,'' in Proceedings of the 8th International Conference on Multimodal Interfaces, Banff, Canada, November 2-4, 2006.

[4] L. Chen, M. P. Harper, A. Franklin, R. T. Travis, I. Kimbara, Z. Huang, and F. Quek, ``A Multimodal Analysis of Floor Control in Meetings,'' in Proceedings of the MLMI 2006 Workshop, Washington, DC.

[5] L. Chen, R. Travis, F. Parrill, X. Han, J. Tu, Z. Huang, I. Kimbara, H. Welji, M. Harper, F. Quek, D. McNeill, S. Duncan, R. Tuttle, and T. Huang, ``VACE Multimodal Meeting Corpus,'' in Proceedings of the MLMI 2005 Workshop, Edinburgh, July 2005.

[6] L. Chen, E. Maia, Y. Liu, and M. P. Harper, ``Evaluating Factors Impacting Forced Alignment in a Multimodal Corpus,'' in Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal, May 26-28, 2004.

[7] L. Chen, Y. Liu, M. P. Harper, and E. S. Shriberg, ``Multimodal Model Integration for Sentence Unit Detection,'' in Proceedings of the 6th International Conference on Multimodal Interfaces, State College, PA, October 13-15, 2004.

[8] L. Chen, M. P. Harper, and F. Quek, ``Gesture During Speech Repairs,'' in Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, Pittsburgh, PA, October 14-16, 2002, pp. 155-160.

[9] F. Quek, M. P. Harper, Y. Haciahmetoglu, L. Chen, and L. Ramig, ``Speech Pauses and Gestural Holds in Parkinson's Disease,'' in Proceedings of the Seventh International Conference on Spoken Language Processing, Vol. 4, Denver, CO, September 2002, pp. 2485-2488.

[10] F. Quek, D. McNeill, R. Bryll, and M. Harper, ``Gesture Spatialization in Natural Discourse Segmentation,'' in Proceedings of the Seventh International Conference on Spoken Language Processing, Vol. 1, Denver, CO, September 2002, pp. 189-192.

[11] F. Quek, R. Bryll, D. McNeill, and M. Harper, ``Gestural Origo and Loci-Transitions in Natural Discourse Segmentation,'' in Proceedings of the IEEE Workshop on Cues in Communication, Kauai, Hawaii, December 9, 2001.

[12] F. Quek, R. Bryll, M. Harper, L. Chen, and L. Ramig, ``Speech and Gesture Analysis for Evaluation of Progress in LSVT Treatment in Parkinson's Disease,'' Eleventh Biennial Conference on Motor Speech: Motor Speech Disorders, Williamsburg, VA, March 14-17, 2002.

[13] F. Quek, R. Bryll, M. P. Harper, L. Chen, and L. Ramig, ``Audio and Vision-Based Evaluation of Parkinson's Disease from Discourse Video,'' in Proceedings of the IEEE International Symposium on Bio-Informatics and Bio-Engineering (BIBE 2001), Bethesda, MD, November 4-6, 2001, pp. 245-252.


This page was last modified June 10, 2009.
