Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry Davis. "Representing Videos using Mid-level Discriminative Patches". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.


Spatio-temporal volume of a video

Introduction

Consider the video visualized as a spatio-temporal volume in the figure on the right. What does it mean to understand this video, and how might we achieve such an understanding? Currently, the most common answer to this question involves recognizing the particular event or action that occurs in the video.

For the video shown in the figure, it would simply be “clean and jerk”. But this level of description does not address issues such as the temporal extent of the action, and it typically relies only on a global feature-based representation to predict the action class. We additionally would like to determine structural properties of the video, such as the time instant when the person picks up the weight or where the weights are located. We want to understand actions at a finer level, both spatially and temporally. Instead of representing videos globally by a single feature vector, we need to decompose them into their relevant “bits and pieces”.

One way to address this is to model videos in terms of their constituent semantic actions and objects. The general framework would be to first probabilistically detect objects (e.g., weights, poles, people) and primitive actions (e.g., bending and lifting). These probabilistic detections could then be combined using Bayesian networks to build a consistent and coherent interpretation, such as a storyline. The semantic objects and actions thus form the primitives for representing videos. However, recent research in object and action recognition has shown that current computational models for identifying semantic entities are not robust enough to serve as a basis for video analysis. Therefore, such approaches have, for the most part, only been applied to restricted and structured domains such as baseball and office scenes.

Strong alignment allows us to richly annotate test videos using a simple label transfer technique.


Following recent work on discriminative patch-based representations, we represent videos in terms of discriminative spatio-temporal patches rather than global feature vectors or a set of semantic entities. These spatio-temporal patches might correspond to a primitive human action, a semantic object, a human-object pair, or perhaps a random but informative spatio-temporal patch in the video. They are selected for their discriminative properties and their ability to establish correspondences with videos from similar classes. We automatically mine these discriminative patches from training data consisting of hundreds of videos.

The figure below shows some of the mined discriminative patches for the “weightlifting” class. We show how these mined patches can act as a discriminative vocabulary for action classification, and we demonstrate state-of-the-art performance on the Olympic Sports dataset and the UCF50 dataset. More importantly, we demonstrate how these patches can be used to establish strong correspondences between spatio-temporal patches in training and test videos. We can use these correspondences to align the videos and perform tasks such as object localization and finer-level action detection using label transfer techniques. Specifically, we present an integer-programming framework for selecting the set of mutually consistent correspondences that best explains the classification of a video from a particular category. We then use these correspondences to represent the structure of a test video.
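To make the selection step concrete, below is a minimal sketch of correspondence selection posed as a binary integer program, using scipy.optimize.milp. The score matrix, the exclusivity constraints, and all names are illustrative assumptions, not the paper's exact formulation; in particular, the paper's notion of mutual consistency also involves spatio-temporal compatibility between correspondences, which would add further constraints.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def select_correspondences(scores):
    """Select a mutually exclusive set of patch correspondences.

    scores : (n_test, n_train) array of correspondence scores between
             test-video patches and training-video patches; higher is better.
    Returns a list of (test_idx, train_idx) pairs maximizing total score.
    """
    n_test, n_train = scores.shape
    n_vars = n_test * n_train          # one binary x_ij per candidate pair

    c = -scores.ravel()                # milp minimizes, so negate the scores

    # Each test patch is matched to at most one training patch ...
    rows = np.zeros((n_test, n_vars))
    for i in range(n_test):
        rows[i, i * n_train:(i + 1) * n_train] = 1
    # ... and each training patch is used at most once.
    cols = np.zeros((n_train, n_vars))
    for j in range(n_train):
        cols[j, j::n_train] = 1

    constraints = [LinearConstraint(rows, -np.inf, np.ones(n_test)),
                   LinearConstraint(cols, -np.inf, np.ones(n_train))]

    res = milp(c=c, integrality=np.ones(n_vars),
               bounds=Bounds(0, 1), constraints=constraints)

    x = res.x.reshape(n_test, n_train).round().astype(bool)
    return list(zip(*np.nonzero(x)))
```

Once correspondences are selected, annotations attached to the matched training patches (object boxes, temporal labels) can be transferred to the aligned locations in the test video.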


Mining Discriminative Patches in Video

Given a set of training videos, we first find discriminative spatio-temporal patches that are representative of each action class. These patches satisfy two conditions: 1) they occur frequently within a class; 2) they are distinct from patches in other classes. The challenge is that the space of potential spatio-temporal patches is extremely large, since patches can occur over a range of scales, and the overwhelming majority of video patches are uninteresting, consisting of background clutter (track, grass, sky, etc.).
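As a rough illustration of why exhaustive search over this space is impractical, the following sketch enumerates candidate windows over a coarse multi-scale grid; the window sizes, temporal extent, and strides are made-up values for illustration, not the paper's sampling parameters.

```python
import itertools

def candidate_patches(n_frames, height, width,
                      scales=(64, 128, 256), t_extent=30):
    """Enumerate (t, y, x, size) spatio-temporal windows on a coarse grid.

    Spatial stride is half the window size and temporal stride is half the
    temporal extent; even so, the candidate set grows quickly.
    """
    for size in scales:
        step = size // 2
        t_step = t_extent // 2
        for t, y, x in itertools.product(
                range(0, max(1, n_frames - t_extent + 1), t_step),
                range(0, max(1, height - size + 1), step),
                range(0, max(1, width - size + 1), step)):
            yield (t, y, x, size)

# A short 450-frame, 720x1280 clip already yields roughly 30,000 windows:
n_candidates = sum(1 for _ in candidate_patches(450, 720, 1280))
```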

One approach would be to follow the bag-of-words paradigm: sample a few thousand patches, perform k-means clustering to find representative clusters, and then rank these clusters based on their membership across action classes. However, this has two major drawbacks:

(a) High-dimensional distance metric: k-means uses standard distance metrics such as Euclidean distance or normalized cross-correlation. These metrics do not work well in high-dimensional spaces (in our case, each spatio-temporal patch is represented by a 1600-dimensional HOG3D descriptor), and Euclidean distance fails to retrieve visually similar patches. Instead, we learn a discriminative distance metric to retrieve similar patches and, hence, representative clusters.

(b) Partitioning: standard clustering algorithms partition the entire feature space; every data point is assigned to some cluster. However, assigning cluster memberships to rare background patches is hard, and because they are forcibly clustered, they significantly diminish the purity of the good clusters to which they are assigned.
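For reference, the baseline being criticized looks roughly like the sketch below: cluster patch descriptors with k-means, then rank clusters by how strongly they concentrate in one action class. The cluster count, purity measure, and names are our illustrative choices, not the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

def rank_clusters_by_purity(descriptors, labels, n_clusters=1000, seed=0):
    """Cluster HOG3D patch descriptors, then rank clusters by class purity.

    descriptors : (n_patches, 1600) HOG3D features
    labels      : (n_patches,) integer action-class label per patch
    Returns cluster indices sorted from most to least class-specific.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    assignments = km.fit_predict(descriptors)

    purity = np.zeros(n_clusters)
    for k in range(n_clusters):
        members = labels[assignments == k]
        if members.size:
            # Fraction of cluster members that belong to the dominant class.
            purity[k] = np.bincount(members).max() / members.size

    return np.argsort(-purity), purity
```

Both drawbacks show up directly in this pipeline: the Euclidean assignments inside KMeans group visually dissimilar patches, and every background patch is forced into some cluster, dragging its purity down.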

Figure 4. Examples of highly ranked discriminative spatio-temporal patches.

We address these issues by using an exemplar-based clustering approach which avoids partitioning the entire feature space. Every spatio-temporal patch is considered as a possible cluster center, and we determine whether or not a discriminative cluster for some action class can be formed around that patch. We use the exemplar-SVM (e-SVM) approach of Malisiewicz et al. to learn a discriminative distance metric for each cluster. However, learning an e-SVM for every spatio-temporal patch in the training dataset is computationally infeasible; instead, we use motion-based sampling to generate a set of initial cluster centers and then use simple nearest-neighbor verification to prune candidates.
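A minimal sketch of the exemplar-SVM step is shown below, assuming HOG3D descriptors and sklearn's LinearSVC; the regularization value and class weights are illustrative stand-ins for the separate positive/negative costs used by Malisiewicz et al.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svm(exemplar, negatives, pos_weight=50.0):
    """Learn a patch-specific discriminative similarity (exemplar-SVM).

    exemplar  : (1600,) HOG3D descriptor of one candidate cluster center
    negatives : (n_neg, 1600) descriptors sampled from other action classes
    Returns a function mapping patches to similarity scores (higher = closer).
    """
    X = np.vstack([exemplar[None, :], negatives])
    y = np.concatenate([[1], np.zeros(len(negatives), dtype=int)])

    # One positive against many negatives: upweight the positive so it is
    # not swamped (a stand-in for the separate C1/C2 of Malisiewicz et al.).
    clf = LinearSVC(C=0.1, class_weight={1: pos_weight, 0: 1.0})
    clf.fit(X, y)
    return clf.decision_function
```

The learned weight vector acts as a patch-specific distance metric: same-class patches that score highly under the e-SVM become the cluster's members, and clusters whose members come predominantly from one class are retained as discriminative patches.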