W4 is a real-time system for tracking people and their body parts in monochromatic
imagery. It constructs dynamic models of
people's movements to answer questions about what they are doing, and where
and when they act. It constructs appearance models
of the people it tracks so that it can track people (who?) through occlusion
events in the imagery. In these pages, we describe the computational
models employed by W4 to detect and track people and their
parts. These models are designed to allow W4 to determine types
of interactions between people and objects, and to overcome the inevitable
errors and ambiguities that arise in dynamic image analysis (such as
instability in segmentation processes over time, splitting of objects due
to coincidental alignment of object parts with similarly colored background
regions, etc.). W4 employs a combination of shape analysis and robust techniques
for tracking to detect people, and to locate and track their body parts.
It builds "appearance" models of people so that they can be identified
after occlusions or after other interactions during which W4 cannot track
them individually.
I. Haritaoglu, D. Harwood, and L. Davis.
Third Face and Gesture Recognition Conference, pages 222-227, 1998.
(Also to appear in Image and Vision Computing Journal, January 1999.)
Goal
Outdoor Surveillance:
monitoring of a site for intrusion
depositing and removing objects
exchanging objects
theft, ……
Key Element: PEOPLE
Where are they?
What are they doing?
Who is who?
When does an action occur? ……
Work with only monochromatic video sources, either visible or infrared.
Most previous work on detection and tracking of people has relied
heavily on color cues. For outdoor surveillance tasks, and particularly
for night-time or other low-light situations, color will not be available,
and people need to be detected and tracked based on weaker appearance and
motion cues.
Real-time system. W4 is currently implemented on a dual-processor
Pentium PC and can process between 20 and 30 frames per second, depending on
the image resolution (typically lower for IR sensors than video sensors)
and the number of people in its field of view.
W4 will be extended with models to recognize the actions of the people
it tracks. Specifically, we are interested in interactions between
people and objects - e.g., people exchanging objects, leaving objects
in the scene, taking objects from the scene. The descriptions of
people - their global motions and the motions of their parts -
developed by W4 are designed to support such activity recognition.
W4 currently operates on video taken from a stationary camera.
W4S : Integration with stereo computation
Ghost: Silhouette-based body part labeling
Real-time 3D motion capture
Assumptions:
Detect and track people and their body parts
Real Time (~15-30 fps)
Monochromatic video camera (visible or infrared)
Stationary camera
Isolated, upright, and unoccluded people
No special hardware
Background Modeling and Foreground Pixel Detection
The camera is stationary; however, motion of background objects, camera
jitter, and illumination changes make detection a difficult problem.
The background scene is statistically modeled by the minimum and maximum
intensity values and maximal temporal derivative for each pixel recorded
over some period, and is updated periodically.
Foreground objects are segmented from the background by
thresholding, noise cleaning, morphological filtering, and connected component
analysis.
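The per-pixel background model and the thresholding step described above can be sketched as follows. This is a minimal illustration, not the W4 implementation: the function names are hypothetical, frames are plain nested lists of grey levels, and the decision rule (a pixel is foreground when it leaves the learned intensity range by more than the largest inter-frame change seen during training) is an assumption made for the example. Noise cleaning, morphological filtering, and connected component analysis are not shown.

```python
def build_background_model(frames):
    """Per-pixel (minimum, maximum, largest inter-frame difference).

    frames: list of equally sized 2D lists of grey levels, recorded while
    the scene contains no people.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    lo = [row[:] for row in frames[0]]
    hi = [row[:] for row in frames[0]]
    diff = [[0] * cols for _ in range(rows)]
    for prev, cur in zip(frames, frames[1:]):
        for r in range(rows):
            for c in range(cols):
                v = cur[r][c]
                lo[r][c] = min(lo[r][c], v)
                hi[r][c] = max(hi[r][c], v)
                diff[r][c] = max(diff[r][c], abs(v - prev[r][c]))
    return lo, hi, diff


def foreground_mask(frame, model):
    """Assumed rule: foreground if the pixel leaves [min - d, max + d]."""
    lo, hi, diff = model
    return [
        [int(v < lo[r][c] - diff[r][c] or v > hi[r][c] + diff[r][c])
         for c, v in enumerate(row)]
        for r, row in enumerate(frame)
    ]
```

Calling `build_background_model` again on a fresh set of frames would correspond to the periodic update mentioned above.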
Tracking
Goal: Fast and robust matching
create first order component of motion models for body parts - motion of
the torso
matching objects by finding overlapping bounding boxes.
W4 has to continue tracking objects even if low-level detection fails to segment
people as single objects
one-to-one matching
object being tracked splits into many regions
partial occlusion of a moving object or coincidental alignment with similarly
“colored” background
the original object might have been a combination of several real objects
several objects merge into one
new objects appear
occlusion
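The matching cases listed above can be illustrated with a small sketch that compares bounding boxes of tracked objects against boxes of regions detected in the current frame. The function names, the box layout `(x_min, y_min, x_max, y_max)`, and the result labels (including `lost` for a tracked object with no overlapping region) are assumptions made for the example; how W4 resolves each case afterwards is not shown.

```python
def boxes_overlap(a, b):
    """Boxes are (x_min, y_min, x_max, y_max) in inclusive pixel coordinates."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def match_regions(tracked, detected):
    """Classify correspondences between tracked objects and detected regions.

    tracked, detected: dicts mapping an id to a bounding box.
    """
    # For each tracked object, the detected regions it overlaps, and vice versa.
    fwd = {t: [d for d, db in detected.items() if boxes_overlap(tb, db)]
           for t, tb in tracked.items()}
    bwd = {d: [t for t, tb in tracked.items() if boxes_overlap(tb, db)]
           for d, db in detected.items()}

    result = {"one_to_one": [], "split": [], "merge": [], "new": [], "lost": []}
    for t, ds in fwd.items():
        if not ds:
            result["lost"].append(t)            # nothing overlaps this object
        elif len(ds) > 1:
            result["split"].append((t, ds))     # one object, several regions
        elif len(bwd[ds[0]]) == 1:
            result["one_to_one"].append((t, ds[0]))
    for d, ts in bwd.items():
        if not ts:
            result["new"].append(d)             # region with no tracked object
        elif len(ts) > 1:
            result["merge"].append((ts, d))     # several objects, one region
    return result
```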
Motion Estimation:
Two-stage motion estimation algorithm
Initial displacement estimate of an object is calculated as the displacement
of the median coordinate of the foreground region between the current frame
and the previous frame.
Best match between previous silhouette edges and current silhouette edges
is found by correlating over a 5x3 displacement mask centered at the initial
displacement estimate.
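A rough sketch of the two stages, under stated assumptions: silhouettes are represented as sets of `(x, y)` foreground pixels, silhouette edges are taken to be foreground pixels with a 4-neighbour outside the region, the correlation score is the number of coinciding edge pixels, and the 5x3 mask is read as 5 pixels wide by 3 pixels high. All names are hypothetical.

```python
from statistics import median


def silhouette_edges(fg):
    """Foreground pixels with at least one 4-neighbour outside the region."""
    return {
        (x, y) for x, y in fg
        if any(n not in fg
               for n in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)))
    }


def median_point(fg):
    return median(x for x, _ in fg), median(y for _, y in fg)


def estimate_motion(prev_fg, cur_fg, half_w=2, half_h=1):
    """Return the (dx, dy) displacement of a region between two frames."""
    # Stage 1: displacement of the median coordinate of the region.
    (px, py), (cx, cy) = median_point(prev_fg), median_point(cur_fg)
    dx0, dy0 = round(cx - px), round(cy - py)

    # Stage 2: binary edge correlation over the mask centered at (dx0, dy0).
    prev_edges, cur_edges = silhouette_edges(prev_fg), silhouette_edges(cur_fg)
    best, best_score = (dx0, dy0), -1
    for dy in range(dy0 - half_h, dy0 + half_h + 1):
        for dx in range(dx0 - half_w, dx0 + half_w + 1):
            score = sum((x + dx, y + dy) in cur_edges for x, y in prev_edges)
            if score > best_score:
                best, best_score = (dx, dy), score
    return best
```

The second stage matters when the silhouette changes shape between frames, for example when an arm is extended: the median shifts toward the new pixels, and the edge correlation pulls the estimate back toward the displacement of the body outline.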
Temporal Texture Template
Dynamic template matching is used to identify objects after occlusion,
and to determine “who is who”.
A temporal texture template is generated while a person is being tracked.
It stores the average intensity of corresponding foreground-region pixels, where
pixel positions are computed in a coordinate system centered at the median of
the moving object.
Updated as long as the person can be tracked.
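The template update can be sketched as a per-pixel running mean in median-centered coordinates. This is an illustration only: the class name, the dictionary-based pixel representation, and in particular the `distance` measure used to compare a candidate region with the template (mean absolute difference over shared coordinates) are assumptions, since the comparison measure is not specified above.

```python
from statistics import median


class TextureTemplate:
    """Per-pixel running mean of intensity in median-centered coordinates."""

    def __init__(self):
        self.total = {}  # (dx, dy) -> summed intensity
        self.count = {}  # (dx, dy) -> number of frames observed

    @staticmethod
    def _centered(pixels):
        mx = round(median(x for x, _ in pixels))
        my = round(median(y for _, y in pixels))
        return {(x - mx, y - my): v for (x, y), v in pixels.items()}

    def update(self, pixels):
        """pixels: dict mapping (x, y) foreground coordinates to intensity."""
        for key, v in self._centered(pixels).items():
            self.total[key] = self.total.get(key, 0) + v
            self.count[key] = self.count.get(key, 0) + 1

    def mean(self):
        return {k: self.total[k] / self.count[k] for k in self.total}

    def distance(self, pixels):
        """Mean absolute difference over shared coordinates (assumed measure)."""
        mean, cand = self.mean(), self._centered(pixels)
        shared = mean.keys() & cand.keys()
        if not shared:
            return float("inf")
        return sum(abs(mean[k] - cand[k]) for k in shared) / len(shared)
```

After an occlusion, each reappearing region could be compared against the stored templates and assigned to the person whose template gives the smallest distance.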
The cardboard model
Locate body parts: Head, Torso, Feet, Legs, Hands
Hands are located after the torso by finding extreme regions that are connected
to the torso and lie outside it.
The height of the bounding box of an object is taken as the height of the cardboard
model; fixed vertical scales are then used to find the approximate initial
location of the body parts.
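The use of fixed vertical scales can be illustrated as below. The band fractions are hypothetical placeholders chosen only to make the example concrete; the actual scales used by W4 are not given here, and hand localization (which depends on the silhouette, not just the bounding box) is not shown.

```python
# Hypothetical vertical bands, as fractions of the bounding-box height
# measured from the top of the box. Illustrative values only.
BANDS = {
    "head": (0.00, 0.15),
    "torso": (0.15, 0.55),
    "legs": (0.55, 0.90),
    "feet": (0.90, 1.00),
}


def cardboard_boxes(bbox, bands=BANDS):
    """Split a person's bounding box into initial body-part boxes.

    bbox: (x_min, y_min, x_max, y_max); y grows downward.
    Returns a dict mapping part name to (x_min, y_top, x_max, y_bottom).
    """
    x0, y0, x1, y1 = bbox
    height = y1 - y0
    return {
        part: (x0, y0 + round(a * height), x1, y0 + round(b * height))
        for part, (a, b) in bands.items()
    }
```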
Head & Hand Tracking
Templates are tracked by correlation. The estimated location of a template
is calculated from the global motion g(t) of the body and the local motion l(t)
of the head (hands).
Templates for the head and hands are generated and updated after they are located
by the cardboard model.
Local motion of the head (hands) is estimated by template matching.
The correlation results are monitored during tracking to determine if the
correlation is good enough to track the parts. Changes in the correlation
scores allow us to make a prediction about whether a part is becoming occluded. When
tracking fails, detection is subsequently re-initialized with the static
cardboard model.
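The tracking step above can be sketched as follows: the template is first moved by the global body motion g(t), then the local motion l(t) is found by correlating the template in a small window around that prediction, and the best score is checked against a threshold. This is a minimal sketch; normalized cross-correlation, the search radius, the `min_score` value of 0.7, and all function names are assumptions made for the example.

```python
from math import sqrt


def ncc(a, b):
    """Normalized cross-correlation of two equally sized flat intensity lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [v - ma for v in a], [v - mb for v in b]
    denom = sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def patch(image, x, y, w, h):
    """Flatten the w-by-h window whose top-left corner is (x, y)."""
    return [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]


def track_part(image, template, tw, th, prev_xy, g, search=2, min_score=0.7):
    """Return ((x, y), score, ok) for one head or hand template.

    template: flat list of tw * th intensities; g: global body motion (dx, dy).
    ok is False when the best correlation falls below min_score, which is
    the cue to re-initialize the part from the static cardboard model.
    """
    px, py = prev_xy[0] + g[0], prev_xy[1] + g[1]  # prediction from g(t)
    best_xy, best = (px, py), -1.0
    for ly in range(-search, search + 1):          # local motion l(t)
        for lx in range(-search, search + 1):
            x, y = px + lx, py + ly
            if 0 <= x <= len(image[0]) - tw and 0 <= y <= len(image) - th:
                score = ncc(template, patch(image, x, y, tw, th))
                if score > best:
                    best_xy, best = (x, y), score
    return best_xy, best, best >= min_score
```

Keeping a short history of the returned scores would allow a falling trend to be detected before the threshold is crossed, in the spirit of the occlusion prediction described above.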