Probabilistic models for object recognition in artificial vision

From 2007 Scientific Report

Hervé Guillaume, Nicolas Cuperlier, Philippe Tarroux



Fig. 1. An illustration of the ambiguity of perception

Role of prior knowledge for recognition: a probabilistic and inferential approach

Perception is intrinsically an ambiguous process. Ambiguity arises for two main reasons: (i) specific object attributes may be partially occluded in a given environment, and (ii) the 2D projection of real-world 3D objects may yield an ambiguous retinal image. In both situations the bottom-up information coming from the scene through the visual channel is not sufficient to identify what is really seen [1].

To resolve these ambiguities, the visual system uses prior knowledge. One striking example is presented in Fig. 1. It is easy to see, by rotating the figure by 180°, that bumps and holes cannot be distinguished using the information coming from the scene alone. However, given the additional knowledge that light usually comes from above, the ambiguity disappears. Fig. 2 shows another illustration of this kind of ambiguity: the projection of a 3D object cannot be correctly interpreted without the help of external information. Bayesian inference is a suitable framework for modeling such mechanisms because it takes prior beliefs into account when estimating the likelihood of a hypothesis. Recognition is thus viewed as an inferential process in which the brain both (i) adds prior knowledge to the information coming from the scene and (ii) actively searches for information able to fulfill the requirements and expectancies of the system. Long-term memory and previous learning, together with attentional mechanisms, are thus parts of this inferential process.

Fig. 2. 2D projections on the retinal plane may correspond to several external 3D objects. Maximum likelihood is a method of choice for achieving this kind of disambiguation. (Reprinted from [2])
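The role of the light-from-above prior in Fig. 1 can be sketched with a few lines of Bayes' rule. This is a minimal illustration, not code from the report; the probability values are invented to show the mechanism.

```python
# Minimal sketch (not from the original report) of how a light-from-above
# prior disambiguates the bump/hole percept of Fig. 1 via Bayes' rule.
# All probability values are illustrative assumptions.

def posterior(likelihood, prior):
    """Normalize P(image | shape) * P(shape) over the competing hypotheses."""
    unnorm = {h: likelihood[h] * prior[h] for h in likelihood}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# The shading pattern is equally consistent with both interpretations,
# so the likelihood alone cannot decide (the ambiguity of Fig. 1).
likelihood = {"bump": 0.5, "hole": 0.5}

# Flat prior: no preference, the percept stays ambiguous.
print(posterior(likelihood, {"bump": 0.5, "hole": 0.5}))

# "Light comes from above" prior: the same shading is far more probable
# for a bump, so the ambiguity disappears.
print(posterior(likelihood, {"bump": 0.9, "hole": 0.1}))
```

With a flat prior the posterior stays at 50/50; the informative prior alone breaks the tie, which is exactly the point made by Fig. 1.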

Deictic attributes: how prior contextual knowledge makes object recognition easier

One approach to object recognition is based on describing objects in terms of their intrinsic attributes. However, recognition then often fails when poor observation conditions alter the identification of these attributes. In contrast, studies of natural visual perception [3] show an early use of contextual information in object recognition, allowing contextual priming of objects. It is indeed well known that object recognition is facilitated when objects are presented in congruent contexts. Using a Bayesian framework, Torralba has modeled the relationship between prior contextual knowledge and the identification and location of objects in a visual scene [4].
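The core of such contextual priming can be sketched as a context-conditioned prior reweighting a local-feature likelihood, P(O | local, context) ∝ P(local | O) · P(O | context). The object names and probability values below are hypothetical, chosen only to illustrate the reweighting; this is not Torralba's actual model.

```python
# Hypothetical sketch of contextual priming in a Bayesian framework:
# a context-conditioned prior P(O | context) reweights an ambiguous
# local-feature likelihood P(features | O). All numbers are invented.

def contextual_posterior(local_likelihood, context_prior):
    """P(O | local, context) proportional to P(local | O) * P(O | context)."""
    unnorm = {o: local_likelihood[o] * context_prior[o] for o in local_likelihood}
    z = sum(unnorm.values())
    return {o: p / z for o, p in unnorm.items()}

# Locally, a small white blob is as consistent with a mug as with a car.
local_likelihood = {"mug": 0.5, "car": 0.5}

# A "kitchen" context primes mugs and suppresses cars, so fewer local
# attributes are needed to commit to an identity.
kitchen_prior = {"mug": 0.95, "car": 0.05}
print(contextual_posterior(local_likelihood, kitchen_prior))
```

The same local evidence that is useless on its own becomes decisive once the context supplies the prior, which is the sense in which context reduces the information needed for recognition.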

If we assume a relationship between context and objects, less information should be needed for object recognition when the object is presented in context. This observation leads to the conclusion that, in this case, objects are not defined by a collection of intrinsic attributes but rather by a limited set of attributes whose meaning depends on the context in which they are perceived [5]. To emphasize their contingent meaning we have called these attributes deictic attributes.

However, exploiting this relationship between objects and context raises the question of how context is defined. A first step toward an implementation of contextual priming is indeed a context representation that is independent of the objects present in the scene. By context we mean the overall behavioral conditions as they can be defined by information coming from the visual scene.

Context recognition

In robotics, most work focuses on the localization of the system rather than on the recognition of the context in which it is acting. Landmark-based models and SLAM approaches [6] belong to this kind of model. Following our definition, a context has a semantic meaning related to its usage, while a location refers to a point in space unrelated to the semantic characteristics of its surroundings. Priming for object detection can mainly be triggered by high-level contexts. One of the most striking differences between locations and contexts is that locations can be used to form metric or topological maps, but they cannot be aggregated into classes the way higher-level categories of environments can. Only these categories carry enough prior knowledge to be useful for contextual priming. This is why landmark-based models such as SLAM are of little use for our problem.

Holistic approaches

In order to learn context rather than location, and to prime object recognition, texture-based holistic models of the scene have been developed [7, 8]. However, these models are grounded in supervised learning [4], whereas the development of autonomous robotic systems requires unsupervised algorithms. To overcome this limitation, we tried to learn context categories in an unsupervised way [9]. We showed that the joint use of compact codes and self-organizing algorithms allowed us to cluster visual scenes according to intrinsic characteristics mainly based on their statistical properties (Fig. 3). With this approach, objects belonging to different contexts are grouped on different units of a self-organizing map (Fig. 4), allowing priming to be done according to the obtained clusters. However, in an indoor environment, where the perceptual variety is limited, this approach has a limitation: contrary to landmark-based models, the representation is too generic to discriminate very similar places.
Fig. 3. A simplified representation of a 10x10 map showing examples of the projection of similar images onto the same cluster. Clustering is carried out using a 20-dimension signature vector based on spatial frequency coding, with a total of 2496 images drawn from the Corel database.
Fig. 4. (a) Cumulated histograms of the number of examples of each category associated with the map units. The theoretical curve stands for the case of a uniform distribution of the examples over the map (1% per unit). (b) Self-organizing map of 10x10 units showing the dispersion of the object categories over the units.
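The clustering stage can be sketched with a small self-organizing map trained on compact signature vectors. This is an illustrative toy, not the report's implementation: the real signatures were 20-dimension spatial-frequency vectors computed from images, whereas here two synthetic "contexts" are simulated as Gaussian clouds.

```python
import numpy as np

# Toy sketch (not the report's code) of clustering compact scene signatures
# with a self-organizing map. Signatures are random stand-ins for the
# 20-dimension spatial-frequency vectors used in the report.
rng = np.random.default_rng(0)

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0):
    """Train a SOM with exponentially decaying learning rate and neighborhood."""
    h, w = grid
    weights = rng.random((h * w, data.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best matching unit
            t = step / n_steps
            lr = lr0 * (0.01 / lr0) ** t            # decay 0.5 -> 0.01
            sigma = sigma0 * (0.3 / sigma0) ** t    # decay 3.0 -> 0.3
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            nbh = np.exp(-d2 / (2 * sigma ** 2))    # Gaussian neighborhood
            weights += lr * nbh[:, None] * (x - weights)
            step += 1
    return weights

# Two synthetic "contexts": 20-dimension signatures around different means.
ctx_a = rng.normal(0.2, 0.05, size=(50, 20))
ctx_b = rng.normal(0.8, 0.05, size=(50, 20))
som = train_som(np.vstack([ctx_a, ctx_b]))

def bmu(x):
    """Index of the map unit a signature is projected onto."""
    return int(np.argmin(((som - x) ** 2).sum(axis=1)))

# Scenes from the same context land on the same region of the map,
# scenes from different contexts on different units.
print(bmu(ctx_a[0]), bmu(ctx_a[1]), bmu(ctx_b[0]))
```

The point of the sketch is the unsupervised grouping itself: nothing tells the map which signature belongs to which context, yet the two clouds end up on separate units, which is what makes cluster-based priming possible.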

An attention-based model

To overcome this problem, we have developed a model of context recognition based on attentional mechanisms. Our hypothesis is that the visual exploration of an unknown scene provides the information required to recognize it after only a few gaze fixations, depending on the information collected. Recognition of a context is thus transformed into the recognition of a temporal sequence of fixations (Fig. 5). To test this hypothesis we have endowed a robot with a biologically inspired attentional algorithm allowing the visual inspection of its environment. The attentional process links discrete landmarks to each other in a way that depends on the context, which makes it possible to recognize a context from a reduced number of aliased landmarks. More importantly, we are not compelled to learn new landmarks to characterize each new environment, but only the way in which landmarks are connected. Thus, contrary to landmark-based models of location, the model can work in an open environment. Finally, depending on which signature is learned for each salient point, fine or coarse context categories can be learned. This last property is very important for our object recognition model because it allows more or less generic context categories to be learned, and thus the links between objects and places to be generalized.
Fig. 5. The robot explores its environment according to its visual input. Multiscale features are extracted from the input images and then combined to obtain the most salient visual point. The robot moves in order to bring the winning point into the center of its visual field. The new input image allows the same process to restart. An inhibition-of-return mechanism, implemented using compass information, avoids gaze fixation on an already visited point. At the same time, a snapshot is extracted from the surroundings of the most salient visual point and a signature is computed using multiscale conspicuity maps. This signature is clustered onto a self-organizing map. The cluster sequence obtained during the robot's exploration of the environment is learned by a Markov chain model and allows the context to be recognized.
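The last stage of Fig. 5 can be sketched as follows: each context is modeled by a Markov chain over SOM cluster indices, and a new fixation sequence is assigned to the context whose chain gives it the highest likelihood. The context names, cluster indices, and training sequences below are invented for illustration; this is a sketch of the principle, not the report's implementation.

```python
import math

# Hypothetical sketch of context recognition from fixation sequences:
# one Markov chain per context over SOM cluster indices, recognition by
# maximum sequence likelihood. All sequences are invented.

def learn_chain(sequences, n_clusters, alpha=1.0):
    """Estimate transition probabilities with add-alpha smoothing."""
    counts = [[alpha] * n_clusters for _ in range(n_clusters)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]

def log_likelihood(seq, chain):
    """Log-probability of a cluster sequence under a transition matrix."""
    return sum(math.log(chain[a][b]) for a, b in zip(seq, seq[1:]))

N = 6  # number of SOM clusters (illustrative)
office = learn_chain([[0, 1, 2, 1, 0, 2], [2, 1, 0, 1, 2]], N)
corridor = learn_chain([[3, 4, 5, 4, 3, 5], [5, 4, 3, 4]], N)

# A new exploration visiting office-like clusters is scored by each chain.
test_seq = [0, 1, 2, 1]
scores = {"office": log_likelihood(test_seq, office),
          "corridor": log_likelihood(test_seq, corridor)}
print(max(scores, key=scores.get))  # -> office
```

Because only the transitions between clusters are learned, two contexts can share individual (aliased) landmarks and still be told apart by the order in which the landmarks are visited, which is the property the model relies on.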


The construction of autonomous systems able to identify objects in their environments requires efficient visual object recognition algorithms. Our knowledge of natural perception mechanisms suggests that contextual information is useful for object recognition in ecological situations. Conventional localization models are not very helpful because they are unable to endow the observed locations with semantic values. Consequently, the presence of an object in the surroundings of a given location cannot be generalized.

The model presented here is a first step toward an implementation of contextual priming in the framework of mobile robotics. We first investigated the use of holistic representations independently of the objects in the visual scene. The obtained results support the idea that the holistic representations of the context are suitable for establishing a specific relationship between the nature of the context and the object categories. However, these representations are too generic to be used in indoor environments.

To overcome this problem, we have developed an attention-based model for context recognition. Instead of being defined by static and global cues computed from the visual scene, this model is based on the information collected from one salient point to another during the visual exploration of the scene. The context is thus defined as a variable-length sequence (a dynamic scanpath) of snapshots corresponding to the regions of the scene where the information is maximized. This approach has proved flexible enough to adapt, through the signature used for each snapshot, to the abstraction level of the categories we want to extract.


1. Kersten, D., P. Mamassian, and A. Yuille, Object perception as Bayesian inference. Annual Review of Psychology, 2004. 55: p. 271-304.
2. Kersten, D. and A. Yuille, Bayesian models of object perception. Current Opinion in Neurobiology, 2003. 13(2).
3. Chun, M.M. and Y. Jiang, Contextual cueing: implicit learning and memory of visual context guides spatial attention. Cognitive Psychology, 1998. 36(1): p. 28-71.
4. Torralba, A., Contextual priming for object detection. International Journal of Computer Vision, 2003. 53(2): p. 169-191.
5. Ballard, D.H., et al., Deictic Codes for the Embodiment of Cognition. Behavioral and Brain Sciences, 1997. 20(4): p. 723-767.
6. Montemerlo, M., FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem with Unknown Data Association, in Computer Science. 2003, Carnegie Mellon University: Pittsburgh.
7. Oliva, A. and A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 2001. 42(3): p. 145-175.
8. Guérin-Dugué, A. and A. Oliva, Classification of scene photographs from local orientations features. Pattern Recognition Letters, 2000. 21: p. 1135-1140.
9. Guillaume, H., N. Denquive, and P. Tarroux. Contextual priming for artificial visual perception. in European Symposium on Artificial Neural Networks. 2005. Bruges.