Classification based on dynamic information has attracted increasing interest in recent years, owing to the wide availability of video data. It finds application in a range of video-centered fields, from video surveillance and monitoring to activity and action recognition, scene understanding, and video mining and retrieval.
Despite intense research activity that has led to efficient solutions for rather well-defined tasks, a number of open problems still deserve attention, in particular handling huge quantities of data and coping with the high variability due to acquisition settings, illumination changes and dynamic content.
Our research aims at developing automatic video analysis modules, from low-level feature extraction to decision making, with the final goal of equipping video surveillance systems with the capability of modeling complex events with little human supervision. The term complex event may refer to the presence of different classes of moving objects (e.g. people, cars, animals), the interaction between a variable number of actors (e.g. single actor, group, crowd), or the richness of the dynamic content (i.e. the different behavioral patterns or structured activities). Complexity may also refer to the extent of the scene, so both single- and multi-camera settings are considered.
The availability of long-time observations calls for solutions that are adaptive and able to exploit knowledge from previously seen scenarios. Results obtained in recent years show that this can be achieved by combining classical computer vision techniques with statistical learning from examples. In this context, we are particularly interested in applying unsupervised methods to extract knowledge from (possibly) unlabeled data.
Figure 1: the general video surveillance architecture we consider
Our main goals are to study, develop and test robust methods able to
- solve the (possibly multi-) camera calibration problem and exploit its results
- retrieve low-level spatio-temporal information from video
- detect, describe and classify objects of interest
- compute higher-level temporal data descriptors that promote homogeneity and attenuate noise, for use in a learning framework
- perform higher-level analysis, adopting decision-making mechanisms to infer the properties of the observed scene content.
We have designed a rather general architecture for video analysis (see Figure 1), which can be instantiated in different ways depending on the specific application and goals.
Applications and results
Within the general architecture we propose, we consider specific instances that address behavior analysis of single persons or small groups (for low or medium scene density) and people counting (for crowded scenes). In particular, we address the following sub-tasks:
From videos to temporal data: we perform change detection on a video stream acquired by a single still camera (left and center) and, when the scene occupancy is appropriate, we adopt our graph-based tracking method to correlate the descriptions of moving objects over time (right), obtaining a set of trajectories.
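The change detection step can be illustrated with a minimal sketch. The text does not say which change detection technique is used, so the running-average background model below is an assumption chosen for simplicity; the names `update_background` and `change_mask` and the parameters `alpha` and `threshold` are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale image stored as rows of pixel intensities


def update_background(background: Frame, frame: Frame, alpha: float = 0.05) -> Frame:
    """Blend the new frame into the background model (running average)."""
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def change_mask(background: Frame, frame: Frame, threshold: float = 25.0) -> List[List[bool]]:
    """Mark pixels whose intensity differs from the background by more than threshold."""
    return [
        [abs(f - b) > threshold for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


# Toy 3x4 scene: a flat background and one bright pixel that appears in the new frame.
background = [[10.0] * 4 for _ in range(3)]
frame = [row[:] for row in background]
frame[1][2] = 200.0

mask = change_mask(background, frame)              # only position (1, 2) is flagged
background = update_background(background, frame)  # the model slowly absorbs the change
```

In a real setting the connected regions of the mask would be described and passed to the tracker; that part is not sketched here.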
From temporal data to strings: we map the spatio-temporal trajectories (left) into a string description based on a data-driven alphabet. The alphabet is estimated automatically by spectral clustering applied to the feature space, which is made up of all the observations lying on the input trajectories (on the right, the mapping onto the first two dimensions is visualized). By associating a label with each cluster, a string is finally obtained as the sequence of states crossed by a trajectory.
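The mapping from a trajectory to a string can be sketched as follows. This is only an illustration under stated assumptions: the states are taken as already estimated and summarized by one centroid each, and a nearest-centroid assignment stands in for the spectral clustering described above. Collapsing consecutive repeated labels is one possible reading of "the sequence of states crossed". The names `nearest_state`, `trajectory_to_string` and the toy `centroids` are hypothetical.

```python
from itertools import groupby
from math import dist
from typing import Dict, Sequence, Tuple

Point = Tuple[float, ...]


def nearest_state(obs: Point, centroids: Dict[str, Point]) -> str:
    """Return the label of the state whose centroid is closest to the observation."""
    return min(centroids, key=lambda label: dist(obs, centroids[label]))


def trajectory_to_string(trajectory: Sequence[Point], centroids: Dict[str, Point]) -> str:
    """Label every observation, then keep one symbol per run of identical labels."""
    labels = [nearest_state(obs, centroids) for obs in trajectory]
    return "".join(label for label, _ in groupby(labels))


# Toy alphabet of three states in a 2D feature space (assumed, for illustration only).
centroids = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (5.0, 5.0)}
trajectory = [(0.1, 0.2), (1.0, 0.1), (4.2, 0.3), (5.1, 1.0), (5.0, 4.6)]
print(trajectory_to_string(trajectory, centroids))  # prints "ABC"
```

Keeping repeated labels instead of collapsing them would preserve information about how long a trajectory stays in each state, at the cost of longer strings.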
From strings to behavioral patterns: we apply spectral clustering again, this time on the set of strings, to detect coherent patterns in the data. A P-Spectrum kernel is adopted. Model selection is performed by considering an environment annotation and selecting the best partitioning of the strings as the one that maximizes the correct clustering rate. A cluster refinement step (outlier removal, cluster merging) is adopted to improve the models. Each final cluster, corresponding to an observed behavior, is compactly represented by a candidate string (below, examples for different scenarios), which is used for test analysis. The divergence of a test string from all the candidates indicates that an anomaly has occurred. The use of an appropriate feature space allows us to discriminate between different classes of objects (e.g. people or cars, by using appearance features), different dynamics (e.g. walking or running, by considering motion features) and so on. The entire pipeline is independent of the choice of the input space and is thus highly adaptable to different surveillance settings.
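As a rough illustration, the P-Spectrum kernel between two strings is the inner product of their counts of contiguous substrings of length p. The sketch below computes it and uses a normalized version to flag a test string that is dissimilar from every candidate. The anomaly rule, the threshold `min_similarity` and all function names are assumptions made for the example, not the criterion actually used in the system.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def spectrum(s: str, p: int) -> Counter:
    """Count every contiguous substring of length p in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def p_spectrum_kernel(s: str, t: str, p: int = 2) -> int:
    """Inner product of the two p-gram count vectors."""
    cs, ct = spectrum(s, p), spectrum(t, p)
    return sum(count * ct[gram] for gram, count in cs.items())


def normalized_kernel(s: str, t: str, p: int = 2) -> float:
    """Kernel value rescaled to [0, 1]; 0.0 if either string is shorter than p."""
    norm = sqrt(p_spectrum_kernel(s, s, p) * p_spectrum_kernel(t, t, p))
    return p_spectrum_kernel(s, t, p) / norm if norm else 0.0


def is_anomalous(test: str, candidates: Sequence[str], p: int = 2,
                 min_similarity: float = 0.5) -> bool:
    """Flag the test string if no candidate is similar enough (hypothetical rule)."""
    best = max((normalized_kernel(test, c, p) for c in candidates), default=0.0)
    return best < min_similarity


candidates = ["ABCD", "AEFD"]  # one representative string per behavior (toy data)
print(is_anomalous("ABCD", candidates))  # prints False: matches a known behavior
print(is_anomalous("DCBA", candidates))  # prints True: shares no 2-gram with any candidate
```

Since the kernel only compares substring counts, the choice of p controls how much ordering information is taken into account: p = 1 ignores order entirely, while larger values reward longer shared sub-paths.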
People counting: when the scene is more densely occupied, we focus on groups of people rather than individuals. We exploit camera calibration to build a description of the group, and geometrical reasoning is then applied to obtain an estimate of the number of people. The system feedback is temporally filtered to increase the robustness of the estimates.
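The temporal filtering of the count estimates can be sketched with a sliding median, which suppresses isolated spikes in the per-frame output. The text does not specify the filter, so the median, the `window` parameter and the name `median_filter_counts` are assumptions made for illustration.

```python
from collections import deque
from statistics import median
from typing import Deque, Iterable, List


def median_filter_counts(raw_counts: Iterable[int], window: int = 5) -> List[int]:
    """Replace each per-frame estimate with the median of the last `window` estimates."""
    recent: Deque[int] = deque(maxlen=window)  # oldest estimate is dropped automatically
    smoothed = []
    for count in raw_counts:
        recent.append(count)
        smoothed.append(round(median(recent)))
    return smoothed


# Toy sequence of per-frame estimates with one spurious spike (12).
raw = [4, 4, 5, 12, 4, 5, 5]
print(median_filter_counts(raw, window=3))  # prints [4, 4, 4, 5, 5, 5, 5]
```

A causal window like this one only looks at past frames, so it can run online; the price is a delay of about half a window before a genuine change in the count shows up in the output.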
Our current work aims at extending the architecture in different directions, in particular:
- improving the ability to evolve the behavioral patterns over time, so as to adapt to permanent and physiological scene changes (a first attempt was in )
- enriching the group/crowd behavior analysis by including a pedestrian detection step that follows the people counting module and starts from its results. The group behavior is then modeled over time in a graph-based fashion to describe interactions and their evolution in the scene, with the final aim of detecting common patterns of activity.
Part of this research has been carried out within a technology transfer program with the company Imavis srl.
- Noceti, N. et al. "Combined motion and appearance models for robust object tracking in real-time". Proc. of AVSS, 2009.
- Odone, Francesca, Nicoletta Noceti and Matteo Santoro. "Learning Behavioral Patterns of Time Series for Video-Surveillance". Machine Learning for Vision-Based Motion Analysis. Ed. Liang Wang, Guoying Zhao, Li Cheng and Matti Pietikäinen. Springer, 2011.
- Odone, Francesca and Nicoletta Noceti. "Unsupervised video surveillance". ACCV - Workshop on, 2010.
- Zini, Luca, Nicoletta Noceti and Francesca Odone. "An adaptive video surveillance architecture for video analysis". Eurographics Italian Chapter Conference. Ed. Puppo, E., A. Brogni and L. De Floriani, 2010.