|
Q. Does it make sense
to segment the entire scene or image?
A. No. The example below shows why.
The middle and rightmost images show segmentations of the leftmost
image produced by a normalized-cut-based algorithm with its parameter
(the expected number of regions) set to 10 and 60, respectively.
Which of the two segmentations is more appropriate?
  
The answer depends on the object of interest in the scene. If the tiny
horse is of interest, the segmentation on the right makes more sense,
because the horse is covered by three segmented regions. If the tree is
of interest, the segmentation in the middle is more appropriate, but
the horse is completely absent from that segmentation map. So, to
define the appropriate segmentation of a scene, it is important to
first identify the object of interest. This may appear to be a
chicken-and-egg problem, but it is not.
The human visual system has an attention module that uses low-level
information efficiently to quickly find salient locations in the image.
The human eye is drawn to these salient points (also called fixations).
This is true even when we look at a picture. So, it is natural to make
the fixation point an integral part of any vision algorithm. We
propose a segmentation algorithm that takes a fixation as input and
outputs a region containing that fixation point in the scene. For
example, if the two green crosses are fixation points on two different
objects (the horse and the tree) in the scene (see the images on the
left), the corresponding segmentations computed by our method are shown
in the images on the right.
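To make the input/output contract concrete, here is a minimal sketch of
"fixation in, region out". It is not the method from the paper: it
swaps in a plain flood fill over a boundary-probability map, and the
function name `segment_from_fixation`, the `threshold` parameter, and
the toy map are all assumptions made for illustration.

```python
from collections import deque


def segment_from_fixation(edge_prob, fixation, threshold=0.5):
    """Return the set of (row, col) pixels reachable from `fixation`
    without stepping onto a pixel whose boundary probability is
    >= `threshold` (hypothetical stand-in for the paper's method)."""
    rows, cols = len(edge_prob), len(edge_prob[0])
    r0, c0 = fixation
    if edge_prob[r0][c0] >= threshold:
        return set()  # the fixation lies on a boundary pixel
    region = {(r0, c0)}
    queue = deque([(r0, c0)])
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region \
                    and edge_prob[nr][nc] < threshold:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy boundary map: a strong vertical boundary in column 2.
toy_map = [[0.0, 0.0, 0.9, 0.0, 0.0] for _ in range(5)]
left_region = segment_from_fixation(toy_map, (2, 0))
right_region = segment_from_fixation(toy_map, (2, 4))
```

Two fixations on opposite sides of the boundary yield two different,
non-overlapping regions from the same image, which is the behaviour
described above for the horse and the tree.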
 
 
Full paper (pdf)
Download source code.
Some pertinent questions and
answers:
Q1. How is this
segmentation algorithm different from all the segmentation algorithms
proposed so far?
Ans: The proposed algorithm defines, for the first time, segmentation
in an optimal fashion for a given fixation. It is therefore an
automatic segmentation process whose only input, the fixation, comes
from an attention system that decides where the eye looks in a scene.
Besides, such a definition is close to how the human visual system
appears to work.
All segmentation algorithms so far in the vision literature depend on
some kind of user input, such as the expected number of regions, a
threshold to stop the clustering, a box around the object of interest,
or seed points on the foreground and the background. These inputs
cannot be determined automatically for a new image or scene. Please
read the related work section of the paper for a more detailed analysis
and references to previous work.
Q2. Can the boundary edge
detector be improved?
Ans: Currently, the output of the Berkeley edge detector is used as the
probabilistic estimate of a boundary at each edge pixel. This output
contains strong edges along both the real boundaries and the internal
edges. In this work, motion and/or stereo cues are used to get rid of
the internal edges. But a better strategy involving all low-level cues
and some high-level cues could be used to output the boundary edges of
the scene.
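As a rough illustration of how a motion or stereo cue can remove
internal edges, the sketch below keeps an edge pixel only if a
per-pixel cue (for example, stereo disparity) changes noticeably across
it. The rule, the function name `suppress_internal_edges`, and the
`min_jump` parameter are assumptions for illustration, not the
procedure used in the paper.

```python
def suppress_internal_edges(edge_prob, cue, min_jump=1.0):
    """Zero out edge pixels where `cue` (e.g. disparity or flow
    magnitude) does not change by at least `min_jump` across the pixel.
    Hypothetical rule: compare the cue on opposite sides of the pixel,
    horizontally and vertically, and keep the larger difference."""
    rows, cols = len(edge_prob), len(edge_prob[0])
    filtered = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if edge_prob[r][c] == 0.0:
                continue  # not an edge pixel, nothing to keep
            # Clamp neighbour indices at the image border.
            left = cue[r][max(c - 1, 0)]
            right = cue[r][min(c + 1, cols - 1)]
            up = cue[max(r - 1, 0)][c]
            down = cue[min(r + 1, rows - 1)][c]
            jump = max(abs(left - right), abs(up - down))
            if jump >= min_jump:
                filtered[r][c] = edge_prob[r][c]
    return filtered


# Toy example: edges in columns 1 and 3, but the disparity only
# changes across column 3, so only that edge should survive.
edges = [[0.0, 0.7, 0.0, 0.8, 0.0, 0.0] for _ in range(3)]
disparity = [[5.0, 5.0, 5.0, 5.0, 1.0, 1.0] for _ in range(3)]
boundary_only = suppress_internal_edges(edges, disparity)
```

In the toy example, the edge in column 1 lies inside a surface of
constant disparity and is treated as internal, while the edge in column
3 coincides with a depth discontinuity and is kept.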
Q3. What is the
application of this approach?
Ans: Segmentation reduces the complexity of visual processing. Robotic
systems can use an existing attention system to make fixations in the
scene, and the proposed method can then be used to obtain the regions
corresponding to those fixations. These regions can then be used as a
mid-level cue for further processing.
|