Representing Videos using Mid-level Discriminative Patches

Arpit Jain1, Abhinav Gupta2, Mikel Rodriguez3, Larry S. Davis1

1University of Maryland College Park, 2Carnegie Mellon University, 3Mitre

ajain[@], abhinavg[@], mdrodriguez[@], lsd[@]


How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatio-temporal patch in the video. What defines these patches is that they are both discriminative and representative. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that they establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification, where they demonstrate state-of-the-art performance on the UCF50 and Olympics datasets.

paper (pdf)



Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis. Representing Videos using Mid-level Discriminative Patches. In IEEE Conf. on Computer Vision and Pattern Recognition, 2013.













This research is partially supported by ONR N000141010766 and Google. It is also supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20071. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.