Object Recognition by Fast Hypothesis Generation and Reasoning about Object Interactions

Mark Westling
Concept Five Technologies, Inc.
7525 Colshire Drive
McLean, Virginia 22102 USA
westling@concept5.com

Larry S. Davis
Center for Automation Research
University of Maryland
College Park, Maryland 20742 USA
lsd@umiacs.umd.edu

Abstract

We present a novel, two-step approach for recognizing multiple 3-D objects in single 2-D images. In the first step, hypotheses of object instances are generated using a memory-based technique. This technique relies on an array, which is computed off-line, that associates a large number of object poses with corresponding image features. During actual recognition, the array serves as a discrete approximation of the inverse projection function, and each image feature returns a set of poses that are accumulated by a Generalized Hough Transform. In the second step, the configuration of hypotheses that best interprets the image is calculated using a Bayesian network. The network represents both visual effects, such as the creation and occlusion of image features, and physical constraints, such as object interference.

1 Introduction

The generalized Hough transform (GHT) is a common approach to object recognition in transformation space. The idea is to collect evidence for transformations, e.g., poses of rigid 3-D objects, by using a accumulator array that represents a tessellation of the transformation space. Each feasible match between a set of 2-D image features and a set of 3-D object model features provides evidence for some set of transformations, and the corresponding bucket (or buckets) in the accumulator array is incremented. A final search of the accumulator array then reveals the regions in the transformation space with high degrees of evidence, in terms of individual, independent feature-to-model matches.

This independent accumulation of match evidence is one of the major shortcomings of the GHT, for it often leads to false positives, i.e., buckets that correspond to accidental alignments of features rather than actual objects. (Another cause of false positives is that multiple distinct transformation ranges may be grouped in the same bucket due the tessellation of the transformation space. For an alternative approach to the tessellation problem, see [3].) For this reason, the GHT is often used to generate only initial hypotheses as a first stage in an object recognition system. The second stage might be either verification and refinement, or a new search using the limited transformation space [4].

Alternative approaches to object recognition are view-based methods, which entail a search in feature space. Instead of using 3-D object models, view-based methods typically use appearance models that represent expected visible image features over a range of views. There are many strong arguments for the view-based approach, including ease of representing and acquiring complex models [1] and evidence that humans employ some kind of view-based approach [11]. Unlike the GHT, all matching is done in feature space, between the set of image features and 2-D transformations of each model view. Basic problems are determining how many views should be stored for each object model and the cost of matching each view to the image.

It should be noted, however, that in many applications it is practical to store all views (to within some small resolution) of a model, thereby eliminating the need for 2-D transformations at the cost of extra storage. In environments such as offices and factories, most objects do not possess six degrees of freedom; chairs, wastebaskets, walls, etc., can be effectively represented using two translations (across the floor) and one rotation (about an axis perpendicular to the floor). Such an environment might be the domain of a small, autonomous robot. By limiting the degrees of freedom, and by approximating the locations and properties of image features to some resolution, the entire range of views can be stored effectively. (Other examples of such a memory-based approach to 3-D object recognition are described in [7] and [10].)

These two approaches to 3-D object recognition - the generalized Hough transform and view-based recognition - can be combined to produce a system that quickly recognizes complex configurations of objects in many practical environments. We introduce as a recognition tool the feature-pose map, a table that maps image features to possible object poses, and show how this replaces the usual inverse projective function that is applied in the GHT when matching 2-D features to 3-D features. Secondly, instead of refining hypotheses by back-projecting the model into the image or by resorting to a correspondence space search, a more accurate pose is calculated by interpolating pose parameters around each peak in the accumulator array. We employ this method to improve overall computational efficiency of our system.

Finally, to reduce the number of false positives and, in general, to find a consistent set of hypotheses that best accounts for the image, we enforce physical and visual constraints using a Bayesian network. A final set of hypotheses must be physically realizable; for example, no pair of objects can overlap or otherwise spatially interfere with each other. Also, any occlusions should be considered in evaluating the belief in an hypothesis. Others have argued that when verifying an object, one should not consider features matches to be independent: if one feature is matched, then nearby features are more likely to be matched. This captures the notion that single, large occlusions are more likely to be found than many small, scattered occlusions [2]. We extend this by requiring that the presence of an occluding object should increase the belief in an object that it partially occludes, and vice versa.

2 Hypothesis generation

2.1 The feature-pose map

In model-based vision, 3-D object recognition by a top-down search of the transformation space has always been considered impractical. Consider, for example, a transformation space of six degrees of freedom, consisting of three translations within a volume of a meter3 and three rotations. If the resolution of each translation is 0.5 cm. and the resolution of each rotation is 5¡, the transformation space will contain over 4.6 x 1010 points. It would be clearly impractical to compute the appearance of each object within the transformation space and compare it to the image.

The size of the transformation space can, however, be reduced by restricting its degrees of freedom or increasing its granularity. In navigating an office or factory environment, a small mobile robot with a single fixed camera will encounter objects either resting in stable configurations on the floor (e.g., chairs, wastebaskets, tools) or mounted in standard configurations on walls (e.g., baseboards, electric outlets). From a camera-centered viewpoint, each object has three degrees of freedom: two translations across the ground place (i.e., the floor) and a single rotation. If we now consider a transformation space of one meter2 with a single rotation, and if we set the translational resolution to 2 cm. and the rotational resolution to 10¡, the size of the transformation space shrinks to 90,000 points.

In a space of this size, the appearance of all poses can be computed off-line and stored in a feature-pose map. The feature-pose (FP) map is a table that relates image features to object poses, as shown in Figure 1. More precisely, it maps feature bins from a spatial tessellation of the image feature space to sets of poses sampled from the transformation space. The underlying assumption is that small changes in poses usually result in small changes in the locations and properties of features; thus, we can save considerable storage without much loss of information by approximating features (through a tessellation of the feature space) and approximating poses (through a sampling of the transformation space).

Figure 1: Feature Pose Map

The map is, in effect, a discrete approximation of the inverse projection function for a single object from a single viewpoint. To build an FP map, synthetic images are generated for each pose in the transformation space by a ray tracer or some other rendering program. A similar strategy is used in [9], where "virtual examples" are synthesized to enrich a training set. Since each input image has an associated precise pose, and the number of images for a single model numbers in the tens of thousands, acquisition using real images is impractical. Features are extracted from each of these images as they would be extracted from an actual image, i.e., using the same image processing parameters that will subsequently be applied to real images of the object. The feature space, like the transformation space, is also reduced in resolution by sampling across a grid. Implicit in this definition is that every FP map is associated with a set of image processing parameters through which the image features were extracted. Thus, one map might be based on one type of edge detector (or set of parameters) while another map might be based on a different edge detector, or a texture or color analysis algorithm.

The recognition algorithm is very similar to the GHT, except that calculation of the inverse projection is replaced by indexing into the FP map. Features are extracted from the image, and an accumulator array corresponding to the transformation space is created. Each feature provides an index into the map, and from each index a list of poses is retrieved. Each pose increments a bucket in the accumulator array. Searching this array for peaks produces the initial set of hypotheses, which are then refined by interpolation with nearby bins.

2.2 Selection of image features

A persistent problem in object recognition is incomplete features, in particular, edges broken by occlusion or noise. The FP map overcomes this problem by storing local feature information, such as the presence of an edge in a small neighborhood of the image, rather than depending on extended features, such as contours. The image can be tessellated into a grid, for example, and any variety of feature can be measured within each rectangular grid element. In the implemented system, features are represented as triples , where are the Cartesian coordinates of the grid element, and is the quantized value of some measurable quantity specific to the feature type. For example, using the image in Figure 2a:

Edges. is angle of the undirected edge, in the range [0, pi), passing through the neighborhood. The angle is found by fitting lines to pixel chains in the neighborhood. (See Figure 2b.)

Junctions. is the angle of the directed edge, in the range [0, 2pi), radiating from a vertex in the neighborhood. (See Figure 2c.)

Colors. The number of colors in the image is reduced according to some given palette, and is the palette index of the dominant color, if any, in the neighborhood. (See Figure 2d.)

(a) Original (b) Edges (c) Junctions (d) Colors

Figure 2: Types of image features

This list can be expanded easily to add texture or another measurable (or classifiable) quality of a small neighborhood of the image.

2.3 Hypothesis combination

Since an object can have several FP maps, each based on different features or different image processing parameters, different sets of hypotheses can be generated for a single object. One could generate one set of hypotheses from edges and a second set from colors, with the idea of fusing the results from these different data sources. Or, one could use maps that were generated using different edge detectors or with different thresholds. In the implementation, equivalent hypotheses, i.e., identical poses of the same object, are merged automatically.

2.4 Pose refinement

Since the poses that are used to generate an FP map are sampled at intervals, a peak in the accumulator array will correspond only approximately to the exact pose of an object. Adjacent buckets, corresponding to adjacent poses, may also have relatively large counts. To find a better estimate of the pose, we perform a simple interpolation between the peak and neighboring bins. Intuitively, if the pose of an object falls between adjacent sampled poses, then the buckets corresponding to the poses will have values roughly proportional to the distance from the true pose. (A survey of approaches to this problem can be found in [5].)

Since improvements in pose estimation accuracy are tightly coupled with resolution of transformation and feature spaces, a simple method has been adopted for interpolating pose parameters in the neighborhood of a peak. Each parameter is treated independently and is assumed to be symmetric around the peak. A small neighborhood around the peak bucket is found, with a maximum radius on the order of three or four bins on each side. The neighborhood is expanded on one side if its value is not too different from the value on the opposite side. One the neighborhood has been constructed, the average value of the parameters in the neighborhood, weighted by the value of the buckets, is calculated. This is similar to the classical methods of density estimation.

2.5 Open issues

The FP map uses discrete approximations of the feature space and the transformation space, and the main issue in implementation is the degree of resolution of each. A dense sampling of each space increases the accuracy of pose recovery at the expense of memory. Memory costs can be lowered by reducing the resolution of the feature space, or the transformation space, or both, but if one is lowered without the other, oversampling or undersampling may occur. In oversampling, the feature space is so coarse (or, likewise, the transformation space is so fine) that adjacent poses produce mostly or all of the same features. Peaks in the GHT accumulator array are wide and flat, corresponding to many possible matching poses. Accuracy is reduced, and the cost of building the accumulator array is increased. In undersampling, the feature space is so fine (or the transformation space is so sparse) that ÒgapsÓ emerge between corresponding features from adjacent poses. In other words, as an object changes from one pose to the next, a feature will move from one bin to a non-adjacent one. An object that lies between the two poses will have features that will fall in these gaps in the map, and hence not correspond to either sample.

What makes this a difficult problem is that for a given object, the density of features varies with the pose of the object. For a transformation space evenly tessellated along every dimension, under perspective projection, poses further from the camera will produce denser features than poses close to the camera. The solution to this seems to be to vary the resolution of the pose space, and more study must be done.

Another issue is how to implement an FP map to minimize storage requirements. Except for a relatively small amount of support data, the FP map is an array of pose lists. In most cases, a list will contain a group of contiguous poses. For example, consider the case of constructing a map for an orthogonal projection of a 2-D rectangular object. As the object is translated across the y-axis, horizontal edges will overlap through most their extent, and each corresponding feature bin in the map will contain many collinear poses. We can take advantage of this by compressing ranges of poses within a map. In the implementation, compression is performed on a single pose parameter, rotation. Thus, each element in the pose list contains single x and y translations but a range of contiguous z rotations. A smarter (and more complicated scheme) might compress in multiple dimensions, or in an arbitrary dimension.

Compression might also be achieved by looking at the information content in each pose list. The greater the number of poses in a list, the less information contained in the corresponding feature bin, suggesting that in situations where memory is limited, large pose lists might be discarded. On the other hand, a list with only one or two poses might indicate undersampling and suggest an examination of adjacent lists for the same poses. If they are not found, then undersampling is definitely occurring at this point, and additional samples should be taken around this pose.

3 Scene interpretation

3.1 Overview

The first part of this system generates hypotheses quickly, but, as with most GHT applications, false positives occur frequently in cluttered, complicated scenes. An extra source of false positives is that for speed, the GHT is applied only once: the more traditional method for recognizing objects in cluttered environments is to apply the GHT once, find the best hypothesis, remove the matching features, apply the GHT again, and so on [4].

The goal of the second part of the system is to eliminate these false positives and return the set of hypotheses that best interprets the scene. Object recognition programs typically define ÒbestÓ in terms of degree of match, perhaps as a percentage of matched object features, although work has been done in considering spatial distribution of matches [2]. Thus, verification is performed in feature space. Transformation space, however, also provides much information that can be applied to verification, in terms of physical constraints. Here, the term physical constraint means a property of objects that must be maintained in any physically realizable scene. For example, objects cannot overlap (or in CAD terms, interfere) in space. Likewise, in a static scene, objects must be in stable poses. The implemented system currently considers only spatial interference.

Physical constraints greatly limit the combinations of objects that can be used to interpret a scene. At this stage, special purpose representations can be employed to simplify verification; for example, 3-D interference can be reduced to 2-D object overlap by using the footprints of two objects resting on the floor. In the implementation, soda cans are represented by circles, and walls are represented by half-planes. A major problem with applying physical constraints are errors in pose estimates. Without accurate poses, one cannot determine whether two hypotheses that slightly overlap are incompatible in the same interpretation or if the overlap was due to measurement error. To handle constraint uncertainty, a Bayesian network is constructed to represent physical relationships between hypotheses. (For another example of the use of Bayesian networks in object recognition, see [6].)

The Bayesian network also provides a framework for representing visual interactions between objects. Not only can it capture the direct causality between object features and image features, but it can also explain features that are incomplete and missing due to partial occlusion. If an image feature is predicted by a hypothesis, and that feature is not detected, then belief in the hypothesis should be lowered. If, however, it can be demonstrated that the presence of another hypothesis would occlude the feature, then the error has been explained and the hypotheses support each other.

3.2 Bayesian network construction and evaluation

The Bayesian network is constructed initially from two type of nodes. First, every hypothesis generated by the FP map is represented as a 0/1-valued node, all with equal prior probability. Next, every image feature predicted by each of these hypotheses is represented, again by a 0/1-valued node. For those features that were detected in the image, the prior probability is set to 1.0; for those that were predicted but not detected, the prior is set to 0.0. The next step is to created other nodes that represent physical and visual constraints, and to link these nodes together.

One can represent physical constraints between a pair of hypotheses in a Bayesian network in two ways [8]. One way is to create a single node that represents combinations of both hypotheses, and simply not allow it to have the a value representing the presence of both hypotheses. This can become complicated quickly when one has a chain of interfering hypotheses: A interferes with B, which interferes with C, which does not interfere with A but does interfere with D, etc. Also, in this representation the notion of interference is absolute and does not allow the desirable inclusion of uncertainty. The second way of representing physical constraints, and the one used in the implementation, is to introduce a 0/1-valued constraint node that is linked directly to the two interfering hypotheses. (See Figure 3.) The incoming links represent the conditional probability , and to represent interference, we let , else . The meaning of this is that the constraint node c is never true when both hypotheses are true. The value of the constraint node is then fixed to 1. This provides a simple mechanism for representing uncertainty: instead of setting to 1.0, it might be set to 0.8 to indicate some doubt about the interference.

Figure 3: Constraint between two hypotheses

The causal relationship between hypotheses and image features might be represented by direct links from the hypothesis to a feature node, but this fails to account for the possibility that more than one hypothesis could be linked to a feature (since all hypotheses are generated at once, and not iteratively with the removal of matched features). To handle this, an intermediate match node is inserted whose values indicate the source of the feature, which can be any matching hypothesis, or noise. (See Figure 4.) This formulation not only prevents one feature from matching multiple objects, but also allows the expression of occlusion. The occluding hypothesis is linked to the match node, along with the hypothesis that should have produced the feature. The link values are set to indicate that the feature should be present only if the occluding hypothesis is not present and the partially occluded hypothesis is present.

Figure 4: Representing feature matching and occlusion

Other high level knowledge about scenes can be incorporated in this framework. For example, one could add domain knowledge about prior likelihoods of objects or the likelihood of object configurations (e.g., power strips usually lie close to walls).

Once the network is generated, it can be evaluated to find the set of hypotheses that best account for the image features. Note that, in general, the formulation presented here results in a multiply-connected network, meaning that any distributed belief revision scheme applied to this network must cope with loops. Potential candidates are clustering, conditioning, and stochastic simulation [Pearl 88]. These methods are currently being investigated in the implementation.

4 Example

 

Figure 5 shows the results of applying the FP map technique to find the wall in Figure 2. The wall is recognized by detecting the baseboard. The FP map used in this example reduces the 320-by-240 pixel image from 256 colors to 16 colors, divides it into squares of 10-by-10 pixels, and replaces every pixel in the square with the most common color found in that square. The transformation space consists of a single translation along the optic axis, in increments of 1.0 cm, and a rotation from -70¡ to +70¡ at 10¡ increments. The peak in the accumulator array corresponds to a wall about 78 cm. away from the camera at an angle of about 30¡. Figure 6a shows the set of matched features and Figure 6b shows the set of unmatched features. Except for a few isolated features near the top of the image, the unmatched features can be explained by occlusion of a book and two soda cans.

(a) Original Image (b) Predicted Features

Figure 5: Estimate of wall location

(a) Matched Features

(b) Unmatched Features

Figure 6: Matched and unmatched images wall features

Figure 7 shows the result of applying FP maps for the wall and three different brands of soda to the example image in Figure 2. The new image was created by overlaying synthetic images corresponding to the four best hypotheses. Object locations are not particularly accurate, but they should suffice for purposes such as navigation.

(a) Original

(b) Interpretation (synthesized)

Figure 7: Original image and final interpretation

5 Conclusions

Fast 3-D object recognition can be accomplished by combining a view-based representation with a generalized Hough transform. In applications such as vision for a small, autonomous robot moving around a factory or an office, an objectÕs full range of views can be computed and stored off-line, thereby alleviating the need for even 2-D transformations when matching view-based representations to image features. Although this requires a large number of viewsÑin the tens of thousandsÑthe resulting table (termed the FP map) can be easily stored in a small computer. The advantages of this technique are:

Disadvantages are:

Problems that might arise from the parallel generation of hypotheses (as opposed to sequentially generating hypotheses and removing matching features), such as false positives, can be overcome by applying physical and visual constraints using a Bayesian network. The network is then evaluated to find the set of hypotheses that best explain the image. Note that this approach can be incorporated in other object recognition systems, since it is independent of the source of hypotheses.

Acknowledgments

This work was partially supported by the Defense Advanced Research Projects Agency and the Office of Naval research under Grant N00014-95-1-0521 (ARPA Order 635).

References

[1] T. M. Breuel. View-based recognition. IAPR Workshop on Machine Vision Applications, 1992.

[2] T. M. Breuel. Higher-order statistics in object recognition. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1993.

[3] T. A. Cass. Polynomial-time object recognition in the presence of clutter, occlusion, and uncertainty. Proc. Image Understanding Workshop, 1992.

[4] W. E. L. Grimson. Object Recognition by Computer: The Role of Geometric Constraints. MIT Press, 1990.

[5] V. F. Leavers. Which Hough transform? CVGIP: Image Understanding 58(2), 1993.

[6] W. B. Mann and T. O. Binford. An example of 3d interpretation of images using Bayesian networks. Proc. Image Understanding Workshop, 1992.

[7] R. C. Nelson. 3-D recognition via 2-stage associative memory. TR-565, Dept. of Computer Science, University of Rochester, 1995.

[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[9] T. Poggio and D. Beymer. Learning networks for face analysis and synthesis. Proc. International Workshop on Automatic Face and Gesture Recognition, Zurich, June, 1995.

[10] T. Poggio and A. Hurlbert. Observations on cortical mechanisms for object recognition and learning. A.I. Memo No. 1404, MIT, 1993.

[11] S. Ullman. Aligning pictorial descriptions: an approach to object recognition. Cognition 32, 1989.