Recognizing Human Actions by Learning and Matching Shape-Motion Prototype Trees

Title: Recognizing Human Actions by Learning and Matching Shape-Motion Prototype Trees
Publication Type: Journal Articles
Year of Publication: 2012
Authors: Zhuolin Jiang, Lin Z, Davis LS
Journal: Pattern Analysis and Machine Intelligence, IEEE Transactions on
Pagination: 533 - 547
Date Published: 2012/03
ISBN Number: 0162-8828
Keywords: action prototype, actor location, brute-force computation, CMU action data set, distance measures, dynamic backgrounds, dynamic prototype sequence matching, flexible action matching, frame-to-frame distances, frame-to-prototype correspondence, hierarchical k-means clustering, human action recognition, Image matching, image recognition, Image sequences, joint probability model, joint shape, KTH action data set, large gesture data set, learning, learning (artificial intelligence), look-up table indexing, motion space, moving cameras, pattern clustering, prototype-to-prototype distances, shape-motion prototype-based approach, table lookup, training sequence, UCF sports data set, Video sequences, video signal processing, Weizmann action data set

A shape-motion prototype-based approach is introduced for action recognition. The approach represents an action as a sequence of prototypes for efficient and flexible action matching in long video sequences. During training, an action prototype tree is learned in a joint shape and motion space via hierarchical K-means clustering and each training sequence is represented as a labeled prototype sequence; then a look-up table of prototype-to-prototype distances is generated. During testing, based on a joint probability model of the actor location and action prototype, the actor is tracked while a frame-to-prototype correspondence is established by maximizing the joint probability, which is efficiently performed by searching the learned prototype tree; then actions are recognized using dynamic prototype sequence matching. Distance measures used for sequence matching are rapidly obtained by look-up table indexing, which is an order of magnitude faster than brute-force computation of frame-to-frame distances. Our approach enables robust action matching in challenging situations (such as moving cameras, dynamic backgrounds) and allows automatic alignment of action sequences. Experimental results demonstrate that our approach achieves recognition rates of 92.86 percent on a large gesture data set (with dynamic backgrounds), 100 percent on the Weizmann action data set, 95.77 percent on the KTH action data set, 88 percent on the UCF sports data set, and 87.27 percent on the CMU action data set.
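
The look-up table idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the prototypes here are hard-coded toy 2-D vectors rather than nodes learned by hierarchical K-means, the Euclidean distance is an assumed stand-in for the joint shape-motion distance, and plain dynamic time warping is an assumed stand-in for the paper's dynamic prototype sequence matching. All function names (`build_distance_table`, `dtw_distance`, `classify`) are hypothetical. The sketch only shows why matching is cheap once every frame has been mapped to a prototype index: each per-frame cost becomes a table look-up instead of a frame-to-frame distance computation.

```python
import math


def build_distance_table(prototypes):
    """Precompute the distance between every pair of prototypes.

    Assumption: Euclidean distance on toy feature vectors stands in for
    the joint shape-motion distance described in the abstract.
    """
    k = len(prototypes)
    table = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d = math.dist(prototypes[i], prototypes[j])
            table[i][j] = table[j][i] = d
    return table


def dtw_distance(seq_a, seq_b, table):
    """Dynamic time warping over two sequences of prototype indices.

    Assumption: standard DTW is used here as a generic form of dynamic
    sequence matching. Every per-frame cost is a table look-up.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = table[seq_a[i - 1]][seq_b[j - 1]]  # look-up, no recomputation
            cost[i][j] = d + min(cost[i - 1][j],      # stay on seq_b[j-1]
                                 cost[i][j - 1],      # stay on seq_a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def classify(test_seq, training, table):
    """Return the label of the training sequence closest to test_seq."""
    best = min(training, key=lambda item: dtw_distance(test_seq, item[1], table))
    return best[0]


# Toy data (hypothetical): four prototypes and two labeled training sequences,
# each already expressed as a sequence of prototype indices.
prototypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
table = build_distance_table(prototypes)
training = [("wave", [0, 1, 1, 2]), ("jump", [3, 3, 0, 3])]

# The test sequence has different timing from "wave" but the same order of
# prototypes, so DTW aligns the two despite the timing difference.
label = classify([0, 0, 1, 2], training, table)
print(label)
```

In this sketch the table is built once during training, so its cost is quadratic in the number of prototypes and independent of the length of the videos matched later.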