In this work we consider the problem of modeling and recognizing collective activities performed by groups of people sharing a common purpose. To this end, we take into account the social context of each person, expressed in terms of the relative orientation and spatial distribution of groups of people. We propose a method that processes a video stream and, at each time instant, associates a collective activity with each individual in the scene by representing the individual (the target) as part of a group of nearby people (the target group). To generalize with respect to the viewpoint, we associate each target with a reference frame based on the target's spatial orientation, which we estimate automatically by semi-supervised learning. We then model the social context of a target by organizing, in a graph structure, a set of instantaneous descriptors that capture the mutual positions and orientations within the target group. Collective activities are classified with a multi-class SVM endowed with a novel kernel function for graphs.
We report an extensive experimental analysis on benchmark datasets that validates the proposed solution and shows significant improvements with respect to state-of-the-art results.
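To illustrate the target-centered reference frame mentioned above, the following minimal sketch expresses the positions and orientations of nearby people relative to a target. It is only an illustration under stated assumptions: positions are taken to be 2D ground-plane coordinates, orientations are angles in radians, and the frame's x-axis is taken to be the target's facing direction. The function name `to_target_frame` and this particular frame convention are hypothetical and are not taken from the method itself.

```python
import math


def to_target_frame(target_xy, target_theta, others):
    """Express neighbours relative to a target-centred frame.

    Assumptions (illustrative only): 2D ground-plane positions,
    orientations in radians, frame x-axis = target's facing direction.
    `others` is a list of (x, y, theta) tuples.
    """
    tx, ty = target_xy
    # Rotating by -target_theta aligns the target's facing direction with +x.
    c, s = math.cos(-target_theta), math.sin(-target_theta)
    relative = []
    for x, y, theta in others:
        dx, dy = x - tx, y - ty
        rx = c * dx - s * dy
        ry = s * dx + c * dy
        # Wrap the relative orientation into [-pi, pi).
        rel_theta = (theta - target_theta + math.pi) % (2 * math.pi) - math.pi
        relative.append((rx, ry, rel_theta))
    return relative


# Target at (1, 1) facing +y; one neighbour two units ahead, facing back.
print(to_target_frame((1.0, 1.0), math.pi / 2, [(1.0, 3.0, -math.pi / 2)]))
```

Because both positions and orientations are measured relative to the target, rotating the whole scene leaves these values unchanged up to floating-point error, which is the sense in which such a frame generalizes across viewpoints.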