3D object identification with time-invariant features

From image sequence to time-invariant features
The object Pinocchio placed in different settings

The study and analysis of the visual information in an image can be tackled with different approaches. We prefer a local approach over global image description, since recent research has demonstrated that it leads to a more compact and robust representation of the image, even under major changes in object appearance.

We model 3D objects using a visual vocabulary whose words represent the most meaningful components of the object: the description obtained is complete and compact, and is capable of describing the object when it is seen from different points of view. The robustness of this approach is remarkable even when the object lies in a very cluttered scene and is partially occluded.

Our modeling and matching method exploits temporal coherence both in training and in testing.



From keypoints to visual vocabulary

Our method describes objects, no matter how complex, by means of local image structures. Starting from this local information, we derive descriptors of the objects that are characteristic and meaningful.

The content of an image sequence is redundant both in space and time; we therefore obtain compressed descriptions for the purpose of recognition by extracting a collection of trains of features and discarding all other information. We call this collection the model, or visual vocabulary, of the sequence.

Since the 3D object of interest is described by an image sequence, the procedure can be sketched in the following steps:

  1. Keypoint detection and description: local keypoints (such as corners or centres of blobs) are extracted and described with SIFT descriptors.

  2. Keypoint tracking: we track the keypoints over the sequence to obtain a robust clustering of similar keypoints.

  3. The visual vocabulary: each keypoint trajectory describes a feature from different viewpoints; we call it a train of features. For each train of features we compute a compact representation, the time-invariant feature, consisting of:
    • a spatial appearance descriptor, the average of all SIFT vectors along the trajectory
    • a temporal descriptor recording when the feature first appeared in the sequence and when it was last observed.
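The construction of a time-invariant feature from a tracked trajectory can be sketched as follows. This is a minimal NumPy illustration of the averaging step; the function name and the data layout are our own assumptions, not the authors' code:

```python
import numpy as np

def time_invariant_feature(trajectory):
    """Collapse a train of features into a time-invariant feature.

    trajectory: list of (frame_index, sift_descriptor) pairs gathered
    by tracking one keypoint over the sequence; each descriptor is a
    SIFT vector (128-D in practice).
    Returns (appearance, first_frame, last_frame), where `appearance`
    is the average of all SIFT vectors along the trajectory.
    """
    frames = [f for f, _ in trajectory]
    descriptors = np.array([d for _, d in trajectory], dtype=float)
    appearance = descriptors.mean(axis=0)        # spatial appearance descriptor
    return appearance, min(frames), max(frames)  # temporal descriptor

# Toy trajectory: three frames of a (shortened, 4-D) descriptor.
traj = [(3, [1.0, 0.0, 0.0, 1.0]),
        (4, [1.0, 2.0, 0.0, 1.0]),
        (5, [1.0, 1.0, 3.0, 1.0])]
app, first, last = time_invariant_feature(traj)
print(app, first, last)   # → [1. 1. 1. 1.] 3 5
```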

2-stage matching
First and second steps of the matching procedure

3D object recognition

We propose a two-step matching procedure that exploits the richness of our temporal features. The steps are:

  1. First, generate a set of hypotheses on the presence of a given object model in the test sequence, using simple nearest-neighbour matching.
  2. Then, refine these hypotheses using spatio-temporal constraints: temporal constraints allow us to focus on the most appropriate viewpoint range, discarding all information in the model that is not useful for the current match; spatial constraints expand the search area and help us confirm or reject each hypothesis.

The recognition phase is based on a one-class recognition approach: the higher the number of matches, the higher the probability that the sequence contains a given object model.
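The two-step procedure and the one-class decision can be sketched as below. This is a rough NumPy illustration under our own assumptions (fixed distance threshold, only the temporal part of the constraints, a given hypothesised viewpoint range); it is not the authors' implementation:

```python
import numpy as np

def nn_hypotheses(model, test, max_dist=0.4):
    """Step 1: hypothesis generation by nearest-neighbour matching.

    model, test: lists of time-invariant features, each a triple
    (appearance_descriptor, first_frame, last_frame).
    Returns candidate matches as (model_index, test_index) pairs.
    """
    matches = []
    for j, (td, _, _) in enumerate(test):
        dists = [np.linalg.norm(np.asarray(td) - np.asarray(md))
                 for md, _, _ in model]
        i = int(np.argmin(dists))
        if dists[i] <= max_dist:
            matches.append((i, j))
    return matches

def temporal_filter(matches, model, view_range):
    """Step 2 (temporal constraint only): given a hypothesised viewpoint
    range of the model, discard matches to model features that were
    never visible in that range."""
    lo, hi = view_range
    return [(i, j) for i, j in matches
            if model[i][1] <= hi and model[i][2] >= lo]

# Toy vocabulary with 4-D descriptors (real SIFT vectors are 128-D).
model = [([1.0, 0.0, 0.0, 0.0], 0, 10),
         ([0.0, 1.0, 0.0, 0.0], 5, 20),
         ([0.0, 0.0, 1.0, 0.0], 30, 40)]
test = [([1.0, 0.1, 0.0, 0.0], 0, 5),
        ([0.0, 0.0, 1.05, 0.0], 2, 6)]

hyps = nn_hypotheses(model, test)
kept = temporal_filter(hyps, model, view_range=(0, 15))
# One-class decision: the more surviving matches, the more likely the
# test sequence contains the modelled object.
print(len(hyps), len(kept))   # → 2 1
```

Here the second candidate match survives nearest-neighbour matching but is discarded by the temporal constraint, since its model feature was only visible far outside the hypothesised viewpoint range.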

Experiments and results

Here we present some of the experiments and results obtained with our two-stage matching procedure for 3D object recognition.

Method assessment

We test the system under simple changes in imaging conditions: illumination and scale variations, background clutter, and occlusion of the objects.

Model sequences

Sequences with changes in illumination and scale

Circles: dewey's features; squares: book; crosses: winnie; X's: goofy.

Sequences with cluttered background and occlusion

Real scene environment

We test the system with sequences of objects placed in a real scene environment: the background is highly cluttered and several objects appear together.

When the number of trajectories decreases, the number of matches decreases as well. The following graph shows the number of matches per frame, computed on a sequence in which the objects appear as in a tracking shot.

Number of matches and number of trajectories per frame

First video in a real scene environment

Second video in a real scene environment

Increasing the number of objects

We test the system as the number of objects increases, comparing recognition performance as the number of objects grows from 5 to 10 and finally to 20. Results are reported in the following video and tables.

Number of recognition experiments: 840. Precision = 59%, Recall = 84%, Specificity = 5%.
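For reference, the figures above follow the standard definitions of precision, recall and specificity over the set of recognition experiments. The sketch below uses purely hypothetical counts for illustration, not the actual confusion matrix of the 840 experiments, which is not reported here:

```python
def precision(tp, fp):
    # Fraction of positive answers that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of true object occurrences that are detected.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Fraction of negative cases correctly rejected.
    return tn / (tn + fp)

# Hypothetical counts, for illustration only.
tp, fp, tn, fn = 42, 8, 40, 10
print(round(precision(tp, fp), 2),    # 0.84
      round(recall(tp, fn), 2),       # 0.81
      round(specificity(tn, fp), 2))  # 0.83
```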

Comparing 20 objects
Matches obtained comparing 20 object models


People: E. Delponte, N. Noceti, F. Odone, A. Verri