(Spring 2004)
Organizers: Ramani Duraiswami and David Jacobs
Webmaster: Zhiyun Li
| 01/16/04 | Unconstrained Face Recognition |
|---|---|
| Speaker | Shaohua Kevin Zhou |
| Abstract |
Current face recognition systems work well only under controlled scenarios, e.g.
the face images should have a frontal pose, frontal illumination, and
accurate localization. Also these systems handle various imageries, e.g. still
images, groups of still images, and video sequences, in an ad hoc manner. In our
attempt to build an unconstrained face recognition system that is able to handle
variations in illumination, pose, and localization and process various imageries
in a unified way, we have proposed (i) a generalized photometric stereo algorithm for face recognition across illuminations, (ii) an illuminating light field approach for face recognition across illuminations and poses, and (iii) a probabilistic characterization of identity for face recognition from still images, groups of still images, and video sequences. Experimental results on the PIE database demonstrate the effectiveness of the proposed approaches. |
| 01/30/04 | Compositional vision - the low-level interdependence of visual problems |
| Speaker | Abhijit Ogale, UMD |
| Abstract |
The difficulty in solving problems such as image correspondence lies in the fact
that these problems cannot be examined in isolation. They are inseparably mixed
with other problems such as image segmentation, shape estimation, and occlusion
detection. Therefore, in order to solve one problem, we must simultaneously
solve them all. In this talk, we shall discuss the interdependence of these
problems, and present a new algorithm for image correspondence which emerges
from this compositional philosophy. We also describe the key role of shape in finding image correspondence, and how it leads to important modifications to existing constraints such as the uniqueness constraint. The correctness of this approach will be demonstrated by experimental comparisons with the best existing algorithms. We shall also discuss the critical role of occlusions in 3D motion estimation and motion segmentation, which directly depend on establishing correct image correspondence. |
| 02/06/04 | 3D Facial Features and Motion Recovery using SVD and multi-modal information |
| Speaker | Dr. Sang Hoon Kim, Hankyong National University, Kyonggi-Do, Korea |
| Abstract |
In this talk, robust extraction of 3D facial features and global motion
information from a 2D image sequence is described. The work was started for MPEG-4
SNHC face model encoding and covers two topics. The first topic is
how to reliably detect the facial area and facial features against a complex
background. I will explain a multi-modal fusion technique which offers a way to
increase the probability of detecting faces. The multi-modal information includes normalized skin color, depth information to separate the background effectively, and moving color information (color and motion information combined). After that, principal facial features among the MPEG-4 FDP (Face Definition Parameters) are extracted automatically inside the facial region using color transforms (GSCD, BWCD) and morphological processing. My talk will focus primarily on the first topic because I think it is fundamental and very important for practical applications such as face recognition, video retrieval, video surveillance, etc. Furthermore, the results are not yet satisfactory under varying conditions. The second topic is how to recover the motion and shape information from the image sequence. The extracted facial features are used to recover the 3D facial feature locations and global motion of the face using a paraperspective camera model and the SVD (Singular Value Decomposition) factorization method. A 3D synthetic object is designed and tested to show the performance of the work, and I will show a demonstration related to the work. The recovered 3D motion information is transformed into the global motion parameters of the MPEG-4 FAP (Face Animation Parameters) to synchronize a generic face model with a real face. Additionally, I will briefly introduce myself and my future work. |
| 02/13/04 | Bayesian Approaches to Image Segmentation |
| Speaker | Daniel Cremers, UCLA |
| Abstract |
When segmenting their environment into meaningful regions, human observers
exploit a number of low-level cues (such as intensity, color, texture or motion
information) and higher level knowledge about objects of interest. In my
presentation, I will present ways to incorporate such information into image
segmentation methods. In particular, I will present:
- the 'Diffusion Snake' as a fast spline-based implementation of the Mumford-Shah functional
- 'Motion Competition' as an extension of the Mumford-Shah framework from intensity segmentation to motion segmentation. Segmenting contours are represented either by splines or by level sets.
- the integration of higher-level statistical shape priors into the segmentation processes. This makes it possible to cope with noise, background clutter and partial occlusions of the objects of interest. |
| 02/20/04 | Computational Mechanisms of Object Recognition in Cortex |
| Speaker | Max Riesenhuber, Georgetown University |
| Abstract |
Object recognition is a difficult computational problem. Nevertheless, the human
visual system can rapidly and effortlessly recognize objects in cluttered scenes
under widely varying viewing conditions, at a level of performance far beyond
that of current machine vision systems. I will present a simple model of object recognition in cortex based on just two different operations that accounts well for the complex visual task of object recognition in clutter, is biologically plausible, and makes nontrivial testable predictions. Interestingly, the model suggests that the computational strategies chosen by the visual system differ significantly from those of the best current approaches to object detection/categorization in machine vision: Machine vision systems traditionally employ an object representation based on simple, template-like features in combination with an advanced classifier (such as a Support Vector Machine). In contrast, the biological system appears to consist of a hierarchy of processing stages in which shape specificity and invariance to stimulus transformations are increased gradually, producing a sophisticated stimulus representation that permits the use of simple classifiers. I will talk about experimental collaborations designed to test model predictions regarding (i) the neural mechanism underlying scale and translation invariance (in a collaboration involving intracellular recordings from cat visual cortex), and (ii) the neural basis of recognition tasks (in a collaboration involving neural recordings from monkeys trained on a "cat/dog" categorization task). I will then demonstrate the performance of the biological model using a benchmark face detection task on natural images. We find that the biological model performs as well as or better than the comparison machine vision systems, and offers distinct computational advantages with respect to the complexity of the learning problem, transfer across different tasks, and invariance to scaling and translation. |
| 02/27/04 | Nonlinear Decomposable Generative Models for Dynamic Shape and Dynamic Appearance |
| Speaker | Ahmed Elgammal, Rutgers |
| Abstract |
Our objective is to learn representations for the shape and the appearance of
moving (dynamic) objects that support tasks such as synthesis, pose recovery,
reconstruction, and tracking. In this talk we introduce a framework for learning
generative models for dynamic appearance. We use nonlinear dimensionality
reduction to achieve an embedding of the global deformation manifold that
preserves the geometric structure of the manifold. Given such an embedding, a
nonlinear mapping is learned from the embedded space into the visual input
space. We also show how an approximate solution for the inverse mapping can be
obtained in a closed form, which facilitates recovery of the intrinsic body
configuration and therefore pose recovery. We also address the question of separating style and content on manifolds representing dynamic objects. We learn decomposable generative models that explicitly decompose the intrinsic body configuration (content) as a function of time from the appearance (style) of the person performing the action as a time-invariant parameter. This can be achieved by decomposing the style parameters in the space of nonlinear functions that map between a learned unified nonlinear embedding of multiple content manifolds and the visual input space. We use the framework to learn the gait manifold as an example of a dynamic shape manifold, as well as to learn the manifolds for some simple gestures and facial expressions as examples of dynamic appearance manifolds. |
| 03/05/04 | Creating Perceptually Valid Spatial Audio |
| Speaker | Ramani Duraiswami |
| Abstract |
Humans are very good at discerning the spatial origin of sound using a mixture
of monaural and binaural cues in disparate environments ranging from open spaces
to small crowded rooms. This ability helps us to interact with others and the
environment by sorting out individual sounds from a mixture, and helps us to
survive by warning us of danger over a wider region of space compared to vision.
These advantages of spatial sound are also important in the fields of
human-computer interaction, teleconferencing and telepresence. Our research over
the past three years has been focused on the problems of rendering and acquiring
spatial audio for such applications. A fascinating aspect of human auditory
perception is the ability to extract cues from the scattering of sound off
the listener's own body. The scattering of sound off our torso, head and especially our
external ears changes the "color" of the sound received in a way that depends on
the location of the source. When the sound is reproduced over headphones these
scattering-related modifications must be reintroduced to achieve the perception
of a real source. The "Head-Related Transfer Function" (HRTF) characterizes how
this scattering takes place off an individual. The HRTF shows significant
inter-personal variability and must be obtained separately via a tedious
measurement process for each listener. This individuality has made it difficult
to use the HRTF in applications, and has been a significant barrier to
widespread use of spatial audio. Other important cues for perception of spatial
audio are provided by the dynamics of the listener and by the scattering of
sound off the environment. In this talk I will present an introduction to the
problem of creating and acquiring spatially valid audio and summarize recent
results in
1) approaches to composition of the head-related transfer function
2) fast approaches to measurement of the head-related transfer function
3) real-time rendering of soundscapes
4) acquisition of soundscapes
|
| 03/12/04 | Information Processing Model of Handwriting Examination |
| Speaker | Sargur N. Srihari, CEDAR/SUNY-Buffalo |
| Abstract |
The goal of this research is to develop an information processing model of
handwriting examination, so as to provide quantitative measures for several
issues with legal implications, e.g., to what extent is handwriting individual?
What is the error rate in determining whether two samples of unknown writership
originated from the same writer? What is the error rate in determining the
writership of a questioned document when there are n known writers? What is the
relationship between error rate and the size/content of the questioned sample?
Is the handwriting of cohort groups any different than that of the general
population? Is it possible to extract demographic information from handwriting?
etc. An information processing model involves three components: a computational
theory, a method of representation of writer information and an implementation
to test the theory and representation. The computational theory would parallel
both human cognitive skills and the methodology/expertise of expert
practitioners. Representation involves algorithms that extract discriminating
elements of handwriting from scanned handwritten documents. The implementation
of the information processing model is a software realization. The testing of
the model would be performed on handwriting samples obtained from both a
representative population as well as from cohort groups. A software
implementation, with user interfaces and quantitative measures of the strength
of evidence, would be a useful tool for practitioners. |
| 03/29/04 | 2D-Shape Analysis using Conformal Mapping |
| Speaker | Eitan Sharon, Division of Applied Mathematics, Brown University |
| Abstract |
The study of 2D shapes and their similarities is a central problem in the field
of vision. It arises in particular from the task of classifying and recognizing
objects from their observed silhouette. Defining natural distances between 2D
shapes creates a metric space of shapes, whose mathematical structure is
inherently relevant to the classification task. One intriguing metric space
comes from using conformal mappings of 2D shapes into each other, via the theory
of Teichmuller spaces. In this space every simple closed curve in the plane (a
"shape") is represented by a 'fingerprint' which is a diffeomorphism of the unit
circle to itself (a differentiable and invertible, periodic function). The
shortest path between any two shapes is unique, and is given by a geodesic
connecting them. Their distance from each other is given by integrating the
Weil-Petersson norm along that geodesic. In this talk I will concentrate on
solving the "welding" problem of "sewing" together conformally the interior and
exterior of the unit circle, glued on the unit circle by a given diffeomorphism,
to obtain the unique 2D shape associated with this diffeomorphism. This will
allow us to go back and forth between 2D shapes and their representing
diffeomorphisms in this "space of shapes". This is joint work with David Mumford. |
| 04/02/04 | Visual 3D Modeling using Cameras and Camera Networks |
| Speaker | Marc Pollefeys, UNC-Chapel Hill |
| Abstract |
This talk consists of two main parts. First, a fully automatic approach to reconstruct detailed 3D models from camera images is presented. The approach can deal with uncalibrated image sequences acquired with a hand-held camera. Based on tracked or matched features, the relations between multiple views are computed. From this both the structure of the scene and the motion of the camera are retrieved. The ambiguity on the reconstruction is restricted from projective to metric through self-calibration. A flexible multi-view stereo matching scheme is used to obtain a dense estimation of the surface geometry. From the computed data detailed 3D surface models or alternative visual representations are constructed. Issues such as key-frame selection, dominant planes and camera auto-exposure are addressed. Next, the calibration and synchronization of camera networks is addressed. An efficient and robust algorithm that computes the epipolar geometry from silhouettes of dynamic objects is presented. This makes this approach particularly suitable for visual hull systems. Using the pairwise epipolar geometries, the complete calibration of the network is computed, and refined through bundle adjustment. A specific difficulty is that in general only two-view matches are available (frontier points). This approach is also extended to deal with unsynchronized video streams. Ongoing efforts to extend this research to active pan-tilt-zoom camera networks are also briefly discussed.
|
| 04/09/04 | Visual patterns with matching subband statistics |
| Speaker | Joshua Gluckman, Polytechnic University |
| Abstract |
Statistical representations of visual patterns are commonly used in computer vision, image processing and pattern recognition. One such representation is a statistical distribution measured from the output of a bank of filters (Gaussian, Laplacian, Gabor, wavelet, etc.). Both marginal and joint distributions of filter responses have been advocated and effectively used for a variety of vision tasks including image retrieval, object recognition, texture analysis and texture synthesis.
We begin by examining the ability of these statistical representations to discriminate between an arbitrary pair of visual stimuli. Examples of patterns are derived that provably possess the same statistical properties, yet are "visually distinct." We further analyze these representations by studying classes of patterns with matching statistical properties. In particular, we show that these representations are effectively blind to certain phase correlations. Finally, we derive "higher order whitening" transformations that remove statistical redundancies from images. These transformations remove any information from the marginal and joint subband statistics. However, these transformations keep much of the perceptually important information. Thus, they demonstrate what image properties subband statistics "don't see."
|
| 04/16/04 | A component-based approach to object detection and recognition |
| Speaker | Bernd Heisele (Honda) |
| Abstract |
I will begin with a brief description of Honda's research on its latest humanoid robot ASIMO, which is 120 cm tall, weighs around 50 kilos and has 26 DOF. Its vision system consists of a pair of color CCD cameras and a Pentium III platform for image processing. As part of the research on ASIMO's vision system, I will present a hierarchical approach for object detection and recognition. The first level of the classification hierarchy includes component detectors that locate parts of an object in the image. In the second level, a combination classifier checks if the geometrical configuration of the detected components matches a learned model of the object's geometry. The main difficulty in component-based detection is the proper choice of a set of components. I will discuss an algorithm that iteratively learns a set of components by minimizing an error bound on the component classifiers. I will show the experimental results of the algorithm applied to face detection and recognition, followed by an outline of future trends in object detection and recognition. |
| 04/23/04 | Kernel Density Approximation and Its Applications |
| Speaker | Bohyung Han |
| Abstract |
Density-based modeling of visual features is very common in computer vision, either by using non-parametric techniques or through representing the underlying density function as a weighted sum of Gaussians.
Nevertheless, current methods for density estimation either lack flexibility by fixing the number of Gaussians in the mixture, or require large amounts of memory by maintaining a non-parametric representation of the density. In this talk, I explain a new approximation based on mean-shift to represent density effectively, and present a method to propagate the density modes sequentially. While the proposed density representation is memory efficient (which is typical for mixture densities), it inherits the flexibility of non-parametric methods, by allowing the number of modes to be adaptive. I will show through several examples how this technique can be applied to various computer vision problems such as background modeling and object tracking.
|
| 04/30/04 | Defocus and conquer |
| Speaker | George Barbastathis, MIT Mechanical Engineering |
| Abstract |
We have designed a class of imaging systems that use the unique optical
properties of volume holograms to extract depth from defocus information
directly from the wavefront of coherent or incoherent light without
ambiguity and with sharp resolution at long working distances. For example, we
have demonstrated 3D imaging with resolution better than 50 micrometers in all
three dimensions with objects placed as far as 1m from the optics. We have also
demonstrated the first-ever simultaneous acquisition of 3D spatial structure and
spectral composition on a digital camera without scanning. These unique imaging
modalities are enabled by a design that properly takes into account the Bragg
selectivity and multiplexing properties of holograms, as well as digital image
post-processing techniques such as the regularized pseudo-inverse and the
Viterbi decoding algorithm, applied for the first time in the context of 3D
image restoration. In the talk we will describe the basic physical and
algorithmic considerations, demonstrate experimental results, and discuss
applications in security screening, biomedical imaging, and environmental
monitoring of the deep ocean. |
| 05/07/04 | On Identification of People and Characters in Video Archives |
| Speaker | Nevenka Dimitrova, Philips Research and Columbia University |
| Abstract |
The goal of video content analysis has been to derive automatic methods for
high-level description and annotation. Main topics in this area have been video
summarization, media identification, genre detection,
event detection and classification, and person identification. In this
presentation I will focus on our latest advancements in person identification:
online face learning, multimodal speaker identification and talking face
detection. Our overall approach to person identification uses visual, audio,
and textual analysis. As opposed to the traditional computer vision problem of identifying people in mug shots, here we have the continuous visual sequence, the audio signal, and in most cases the movie transcript and screenplay. General TV programs and personal videos represent "uncontrolled" domains because faces appear under various angles and lighting conditions (low-key, high-key lighting) and in many instances with a generous amount of makeup. Voices are mixed with the general soundtrack which includes music and environmental noise. In addition, traditional systems are usually "closed" with respect to the trained models of the faces. We have investigated an "open" system that can detect and learn new faces. We developed an online-learning face recognition system based on the Modified Probabilistic Neural Networks (MPNN) for videos. This face recognition system can detect and recognize faces, as well as automatically detect unknown faces and train the unknown faces online into the face classifier so that this "unknown face" can be recognized if it appears again. We present a multimodal speaker identification method which consists of screenplay parsing, extraction of time-stamped transcript, alignment of the screenplay with the time-stamped transcript, audio segmentation and audio speaker identification. The results of the visual and audio analysis can only give candidates for full audiovisual person identification. In addition, it is necessary to identify the talking face in order to associate the name and the face. Person identification can be used in many multimedia content retrieval applications. We have built an application, InfoSip, that performs person identification and scene annotation based on actor presence. The system links this information with actors' filmographies and biographies and produces an enriched viewing experience.
Nevenka Dimitrova is a Research Fellow at Philips Research and a Visiting Scientist at the Digital Video and Multimedia Group at Columbia University. She obtained her PhD (1995) and MS (1991) in Computer Science from Arizona State University (USA), and BS (1984) in Mathematics and Computer Science from University of Kiril and Metodij, Skopje, Macedonia. Her research passion has been in the areas of content management, content synthesis, video content navigation and retrieval, and content understanding. More recently she has become interested both in what biological systems can do for computation and in what computation can do for biological systems. Nevenka has over 25 issued patents, and over 90 publications. She has given keynote presentations at CIVR, Medi@net, and IEEE ITCC. She actively participates in IEEE, ACM and SPIE conferences, has chaired and served on 30+ different program committees, and currently serves on 3 editorial boards: ACM Multimedia Systems Journal, IEEE Multimedia and ACM Transactions on Information Systems. She serves as a Special Sessions Co-chair for ICME 2004 and General Co-chair of ACM Multimedia 2004 in NYC. She believes that the advancement of multimedia information systems can help improve quality of life (and survival). She also believes that our inspiration for moving from Hume-multimedia processing to Kant-multimedia processing should come from real life, from philosophy and psychology, but that the research should be firmly grounded on a formal mathematical basis. |
| 05/21/04 | Illumination and Human Faces |
| Speaker | Dimitris Samaras, Computer Science Department, SUNY Stony Brook |
| Abstract | Illumination effects in images greatly complicate main Computer Vision tasks such as 3D Shape Reconstruction or Face Recognition. They are also both a cause of unwanted artifacts and a source of realism in Image Based Rendering. Another area of interest both in Graphics and in Vision is modeling and analysis of human faces. We will discuss recent results in the intersection of these two areas. We will discuss a new approach for face recognition under arbitrary illumination conditions, which requires only one training image per subject and no 3D shape information. Our method is based on a recent result demonstrating that the set of images of a convex Lambertian object obtained under a wide variety of lighting conditions can be approximated accurately by a low-dimensional linear subspace. We will show that we can recover basis images spanning this space from just one image taken under arbitrary illumination conditions by making use of a statistical model for the basis images. These basis images can be used effectively for Face Recognition.
The application of illumination patterns on a face can be used to acquire the geometry and texture of the face, recently at video rates. I will address fundamental issues regarding the use of high quality dense 3-D data samples undergoing motions at video speeds, e.g. human facial expressions. In order to utilize such data for motion analysis and re-targeting, correspondences must be established between data in different frames of the same face as well as between different faces. I'll present a data driven approach that consists of four parts: 1) High speed, high accuracy capture of moving faces without the use of markers, 2) Very precise tracking of facial motion using a multi-resolution deformable mesh, 3) A unified low dimensional mapping of dynamic facial motion that can separate expression style, and 4) Synthesis of novel expressions as a combination of expression styles. The accuracy and resolution of our method allow us to capture and track subtle expression details and then retarget and morph expressions.
Dimitris Samaras is an Assistant Professor in the Computer Science Dept. of the State Univ. of New York at Stony Brook. He received his Ph.D. from the University of Pennsylvania in January 2001. His research (currently funded by the Dept. of Energy and the National Science Foundation) focuses on the effects of illumination in images in Computer Vision (shape estimation, tracking, recognition) and Computer Graphics (image relighting, augmented reality). He likes to apply his results to applications related to human faces.
|
| 06/04/04 | Context-Aware Mobile Multimedia Services |
| Speaker | Timo Ojala, MediaTeam Oulu research group, University of Oulu |
| Abstract | This talk reports the work done in
the ongoing Rotuaari project at the University of Oulu, Finland. The goal
of the project is the development and empirical evaluation of the
technology and business models of future context-aware mobile multimedia
services in the real environment of use. The project contains three main
components, the service system, field trials and a value network, of which
the first two are addressed in this talk. The service system comprises
a multi-access wireless network, versatile service platforms and a number
of prototype services. The talk discusses in detail panOULU (public access
network OULU), the WLAN 'hotcity' component of the wireless network, which
provides free wireless broadband Internet access to a large number of
people. The service system includes two service platforms, the SmartWare
architecture developed in the project and the commercial Octopus platform.
A number of context-aware mobile multimedia services have been built atop
these platforms. The services are exposed to genuine end-users in field
trials conducted in the real environment of use, which provide valuable
feedback on the functionality of technology, user experience, consumer
behavior, and viability of business models. The talk concludes with a
preview of the upcoming large-scale field trial in summer 2004. Project web site: http://www.rotuaari.net |
| 06/18/04 | Stochastic Spatio-Temporal Grammars for Images and Video |
| Speaker | Jeffrey Mark Siskind, School of Electrical and Computer Engineering, Purdue University |
| Abstract | Probabilistic Context-Free Grammars (PCFGs)
induce distributions over strings. Strings can be viewed as observations
that are maps from indices to terminals. The domains of such maps are
totally ordered and the terminals are discrete. We extend PCFGs to induce densities over observations with unordered domains and continuous-valued terminals. We call our extension Stochastic Random Tree Grammars (SRTGs). While SRTGs are context sensitive, the inside-outside algorithm can be extended to support exact likelihood calculation, MAP estimates, and ML estimation updates in polynomial time on SRTGs. We call this extension the center-surround algorithm. SRTGs extend mixture models by adding hierarchical structure that can vary across observations. The center-surround algorithm can recover the structure of observations, learn structure from observations, and classify observations based on their structure. We have used SRTGs and the center-surround algorithm to process both static images and dynamic video. In static images, SRTGs have been trained to distinguish houses from cars. In dynamic video, SRTGs have been trained to distinguish entering from exiting. We demonstrate how the structural priors provided by SRTGs support these tasks. Joint work with Charles Bouman, Shawn Brownfield, Bingrui Foo, Mary Harper, Ilya Pollak, and James Sherman. |
| 06/25/04 | Automatic Learning in Visual Surveillance by Observing Activity |
| Speaker | Dimitri Makris, University of Kingston |
| Abstract | A methodology will be presented that
provides surveillance systems with the ability to learn high-level models
of the scene and the activity. The learning process is automatic and
exploits large datasets of motion observations. This approach is
consistent with the idea of "plug'n'play" systems that can sense and learn
their environments and adapt to any possible changes. The methodology can
support the realization of a variety of applications such as event
detection, conceptual encoding, video annotation and long-term prediction.
The problem of integration of information from multiple uncalibrated
cameras is also discussed. A novel method is presented that allows
multiple surveillance systems to learn their camera topology and supports
tracking across the unseen regions of a wide-area scene. |
Questions/comments: zli(at)cs.umd.edu